Variable Selection and Feature Screening in High-dimensional Data

Author: Yan Xiaodong

Supervisor: Tang Niansheng


Degree Year: 2017





Variable selection and feature screening are widely studied in regression modeling. Variable selection is usually carried out when the covariates are fixed-dimensional or diverging-dimensional (p may grow polynomially with the sample size n), with the aim of identifying the inactive predictors and estimating the nonzero parameters. For ultrahigh-dimensional covariates (p may grow exponentially with n), the first task is instead to screen for the informative predictors. In practice, a model can be fitted to an ultrahigh-dimensional dataset by variable selection once feature screening has been carried out. Building on these two topics, this thesis investigates two variable selection procedures in the first two chapters and proposes two novel screening methods in the last two chapters.

Chapter 1 notes that growing-dimensional data for which the likelihood is unavailable are often encountered in various fields. It presents a penalized exponentially tilted likelihood (PETL) for variable selection and parameter estimation in growing-dimensional unconditional moment models in the presence of correlation among variables and model misspecification. We investigate the consistency and oracle properties of the PETL estimators, and show that the constrained PETL ratio statistic for testing a contrast hypothesis asymptotically follows a central chi-squared distribution. Theoretical results reveal that the PETL approach is robust to model misspecification. We also study high-order asymptotic properties of the proposed PETL estimators. Simulation studies are conducted to investigate the finite-sample performance of the proposed methodologies, and an example from the Boston Housing Study is used for illustration.

Chapter 2 argues that high-dimensional sparse modeling of censored survival data with the likelihood unavailable is of great practical significance. We use growing-dimensional general estimating equations and propose a penalized generalized empirical likelihood, combined with folded-concave penalties, for simultaneous variable selection and parameter estimation for censored survival data in the presence of correlation among variables. We first discuss the construction of the general estimating equations for censored survival data and their asymptotic properties with growing dimension. We then establish the consistency and oracle properties of the penalized generalized empirical likelihood estimators, and show that the penalized generalized empirical likelihood ratio statistic for testing a contrast hypothesis asymptotically follows the standard central chi-squared distribution. Conditions for local and restricted global optimality of the weighted penalized generalized empirical likelihood estimators are discussed. We present an iterative one-step algorithm for efficient implementation based on local linear approximation and rigorously investigate its convergence property. Simulation studies and a real data example illustrate the effectiveness of the proposed methods and their practical use.

Chapter 3 proposes a novel model-free screening procedure for ultrahigh-dimensional data analysis. By using a slicing technique, which has been successfully applied to continuous variables, we construct a new index called the fused mean-variance for feature screening. This method has the following merits: (i) it is model-free, i.e., it does not require specifying a regression form relating the predictors and the response variable; (ii) it can be used to analyze various types of variables, including discrete, categorical and continuous variables; (iii) it still works well when the covariates or random errors are heavy-tailed or the predictors are strongly dependent. Under some regularity conditions, we establish the sure screening and rank consistency properties. Simulation studies are conducted to assess the performance of the proposed approach, and a real dataset is used to illustrate the proposed method.

In the last chapter, we propose a Spearman rank correlation screening procedure for ultrahigh-dimensional data; two adjusted versions are considered, for non-censored and censored responses, respectively. The proposed method, based on the robust rank correlation coefficient between the response and predictor variables rather than the Pearson correlation, has the following distinguishing merits: (i) it is robust and model-free, without specifying any regression form relating the predictors and the response variable; (ii) the sure screening and rank consistency properties hold under some mild regularity conditions; (iii) it still works well when the covariates or the error distribution are heavy-tailed or when the predictors are strongly dependent on each other; (iv) the use of indicator functions in rank correlation screening greatly simplifies the theoretical derivation, compared with previous studies on variable screening, owing to the boundedness and monotonic invariance of the resulting statistics. Numerical comparisons indicate that the proposed approach performs much better than most existing methods in various models, especially for censored responses with a high censoring ratio. We also illustrate our method using a mantle cell lymphoma microarray dataset with censored response.
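
To make the idea of rank correlation screening concrete, the following is a minimal sketch of marginal Spearman screening for a non-censored response. It is an illustration under stated assumptions, not the thesis's adjusted procedure: the function name spearman_screen, the default cutoff d = floor(n / log n) (a convention common in the screening literature, not taken from this abstract), and the simulated heavy-tailed data are all hypothetical, and the censored-response version is not covered.

```python
import numpy as np


def spearman_screen(X, y, d=None):
    """Rank predictors by |Spearman correlation| with y and keep the top d.

    Hypothetical helper for illustration only. Ranks are computed with a
    double argsort, which breaks ties arbitrarily; this is adequate for
    continuous data but not for heavily tied data.
    """
    n, p = X.shape
    if d is None:
        # Assumed default cutoff, not specified in the abstract.
        d = int(n / np.log(n))

    # Column-wise ranks of the predictors and ranks of the response.
    rx = np.argsort(np.argsort(X, axis=0), axis=0).astype(float)
    ry = np.argsort(np.argsort(y)).astype(float)

    # Spearman's rho is the Pearson correlation of the ranks.
    rx -= rx.mean(axis=0)
    ry -= ry.mean()
    rho = (rx * ry[:, None]).sum(axis=0) / np.sqrt(
        (rx ** 2).sum(axis=0) * (ry ** 2).sum()
    )

    utility = np.abs(rho)
    keep = np.argsort(-utility)[:d]  # indices of the d largest utilities
    return keep, utility


# Simulated example: p >> n, heavy-tailed (t with 3 df) covariates and errors,
# and a response that is a monotone transformation of a linear signal.
rng = np.random.default_rng(0)
n, p = 200, 1000
X = rng.standard_t(df=3, size=(n, p))
eps = rng.standard_t(df=3, size=n)
y = (X[:, 0] + X[:, 1] + 0.5 * eps) ** 3  # only predictors 0 and 1 are active

keep, utility = spearman_screen(X, y)
print("retained:", len(keep), "active predictors kept:", {0, 1} <= set(keep.tolist()))
```

Because the statistic depends on the data only through ranks, replacing y by any strictly increasing transformation of y leaves the screening utilities unchanged, which is the monotonic invariance mentioned in merit (iv).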