TY - JOUR
T1 - Sequential feature selection and inference using multi-variate random forests
AU - Mayer, Joshua
AU - Rahman, Raziur
AU - Ghosh, Souparno
AU - Pal, Ranadip
N1 - Funding Information:
Research reported in this article was supported by the National Institute of General Medical Sciences of the National Institutes of Health under award number R01GM122084. The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institutes of Health.
Publisher Copyright:
© 2017 The Author. Published by Oxford University Press. All rights reserved. For Permissions, please e-mail: journals.permissions@oup.com.
PY - 2018/4/15
Y1 - 2018/4/15
N2 - Motivation: Random forest (RF) has become a widely popular prediction-generating mechanism. Its strength lies in its flexibility, interpretability and ability to handle a large number of features, typically larger than the sample size. However, this methodology is of limited use if one wishes to identify statistically significant features. Several ranking schemes are available that provide information on the relative importance of the features, but there is a paucity of general inferential mechanisms, particularly in a multi-variate setup. We use the conditional inference tree framework to generate an RF where features are deleted sequentially based on explicit hypothesis testing. The resulting sequential algorithm offers an inferentially justifiable, but model-free, variable selection procedure. Significant features are then used to generate a predictive RF. An added advantage of our methodology is that both variable selection and prediction are based on the conditional inference framework and hence are coherent. Results: We illustrate the performance of our Sequential Multi-Response Feature Selection approach through simulation studies and finally apply this methodology to the Genomics of Drug Sensitivity for Cancer dataset to identify genetic characteristics that significantly impact drug sensitivities. The significant set of predictors obtained from our method is further validated from a biological perspective.
AB - Motivation: Random forest (RF) has become a widely popular prediction-generating mechanism. Its strength lies in its flexibility, interpretability and ability to handle a large number of features, typically larger than the sample size. However, this methodology is of limited use if one wishes to identify statistically significant features. Several ranking schemes are available that provide information on the relative importance of the features, but there is a paucity of general inferential mechanisms, particularly in a multi-variate setup. We use the conditional inference tree framework to generate an RF where features are deleted sequentially based on explicit hypothesis testing. The resulting sequential algorithm offers an inferentially justifiable, but model-free, variable selection procedure. Significant features are then used to generate a predictive RF. An added advantage of our methodology is that both variable selection and prediction are based on the conditional inference framework and hence are coherent. Results: We illustrate the performance of our Sequential Multi-Response Feature Selection approach through simulation studies and finally apply this methodology to the Genomics of Drug Sensitivity for Cancer dataset to identify genetic characteristics that significantly impact drug sensitivities. The significant set of predictors obtained from our method is further validated from a biological perspective.
UR - http://www.scopus.com/inward/record.url?scp=85046705253&partnerID=8YFLogxK
U2 - 10.1093/bioinformatics/btx784
DO - 10.1093/bioinformatics/btx784
M3 - Article
C2 - 29267851
AN - SCOPUS:85046705253
VL - 34
SP - 1336
EP - 1344
JO - Bioinformatics
JF - Bioinformatics
SN - 1367-4803
IS - 8
ER -