In multiple linear regression, using more features is not necessarily better: selecting a small number of well-chosen features both guards against overfitting and makes the model easier to interpret. This post introduces three ways to select features: best subset selection, forward/backward stepwise selection, and cross-validation.
Best Subset Selection
The idea behind this method is simple: try every possible combination of features, fit a model for each, and pick the best one. The basic procedure is:
For p features, for each k from 1 to p:
choose every possible set of k of the p features, fit the resulting C(p, k) models, and keep the best one (smallest RSS or largest R²);
then, from the p best models so obtained, choose a single winner using cross-validation error, Cp, BIC, adjusted R², or a similar criterion.
The advantage is obvious: every possible model is tried, so the selected one is guaranteed to be the best. The drawback is equally obvious: the number of models grows as 2^p, so the computation explodes as p increases. The method is therefore only practical when p is small.
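The exhaustive search described above can be sketched in a few lines of base R. This is an illustrative toy, not the Hitters analysis below: it uses the built-in mtcars data and an arbitrarily chosen five-predictor pool.

```r
# Exhaustive best-subset search: for each size k, fit all choose(p, k)
# models and keep the one with the smallest RSS. Feasible here only
# because p is small -- the total number of subsets grows as 2^p.
predictors <- c("cyl", "disp", "hp", "wt", "qsec")  # p = 5 -> 2^5 = 32 subsets
p <- length(predictors)

best_by_size <- vector("list", p)
for (k in 1:p) {
  combos <- combn(predictors, k, simplify = FALSE)
  rss <- sapply(combos, function(vars) {
    f <- reformulate(vars, response = "mpg")
    sum(resid(lm(f, data = mtcars))^2)
  })
  best_by_size[[k]] <- combos[[which.min(rss)]]  # best k-predictor model
}
best_by_size[[1]]  # single best predictor of mpg by RSS
```

In a real application the p per-size winners would then be compared with Cp, BIC, adjusted R², or cross-validation, as the text describes; `regsubsets` from the leaps package does all of this for you.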
Below, the Hitters dataset from the R package ISLR is used as an example to build a multiple linear model for baseball players' salaries.
> library(ISLR)
> Hitters <- na.omit(Hitters)
> dim(Hitters)  # apart from Salary, the response, 19 features remain
[1] 263  20
> library(leaps)
> regfit.full = regsubsets(Salary ~ ., Hitters, nvmax = 19)  # best subset selection with up to 19 features
> reg.summary = summary(regfit.full)  # shows which features are chosen at each size
> plot(reg.summary$rss, xlab = "Number of Variables", ylab = "RSS", type = "l")  # RSS keeps falling as features are added
> plot(reg.summary$adjr2, xlab = "Number of Variables", ylab = "Adjusted RSq", type = "l")
> points(which.max(reg.summary$adjr2), reg.summary$adjr2[11], col = "red", cex = 2, pch = 20)  # Adjusted R2 peaks at 11 features
> plot(reg.summary$cp, xlab = "Number of Variables", ylab = "Cp", type = "l")
> points(which.min(reg.summary$cp), reg.summary$cp[10], col = "red", cex = 2, pch = 20)  # Cp is smallest at 10 features
> plot(reg.summary$bic, xlab = "Number of Variables", ylab = "BIC", type = "l")
> points(which.min(reg.summary$bic), reg.summary$bic[6], col = "red", cex = 2, pch = 20)  # BIC is smallest at 6 features
> plot(regfit.full, scale = "r2")  # R2 always rises with more features -- no surprise
> plot(regfit.full, scale = "adjr2")
> plot(regfit.full, scale = "Cp")
> plot(regfit.full, scale = "bic")
Adjusted R², Cp, and BIC are three statistics for evaluating a model (their definitions and formulas are omitted here). The closer adjusted R² is to 1, the better the fit; for the other two, smaller is better.
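For reference, the standard definitions of these criteria (in the form used by ISLR, for a model with d predictors, n observations, total sum of squares TSS, and noise-variance estimate σ̂²) are:

```latex
\text{Adjusted } R^2 = 1 - \frac{\mathrm{RSS}/(n-d-1)}{\mathrm{TSS}/(n-1)}, \qquad
C_p = \frac{1}{n}\bigl(\mathrm{RSS} + 2d\hat{\sigma}^2\bigr), \qquad
\mathrm{BIC} = \frac{1}{n}\bigl(\mathrm{RSS} + \log(n)\,d\,\hat{\sigma}^2\bigr)
```

Because log(n) > 2 whenever n > 7, BIC penalizes model size more heavily than Cp, which is why it selects only 6 features here while Cp selects 10.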
Note that the three criteria lead to different selections. Taking adjusted R² as the criterion here, 11 features are chosen. As the plot shows, at the maximum adjusted R² (which is only a bit above 0.5, so nothing impressive), the 11 selected features are: AtBat, Hits, Walks, CAtBat, CRuns, CRBI, CWalks, LeagueN, DivisionW, PutOuts, and Assists.
The model's coefficients can be inspected directly:
> coef(regfit.full, 11)
 (Intercept)        AtBat         Hits        Walks       CAtBat
 135.7512195   -2.1277482    6.9236994    5.6202755   -0.1389914
       CRuns         CRBI       CWalks      LeagueN    DivisionW
   1.4553310    0.7852528   -0.8228559   43.1116152 -111.1460252
     PutOuts      Assists
   0.2894087    0.2688277
These 11 features match the plot. With the features selected and their coefficients estimated, the model is fully specified.
Stepwise Selection
The idea here can be summed up as "commit and never look back": each iteration can only extend the choice made by the previous one, with no backtracking and no second-guessing. Taking forward stepwise selection as an example, the basic procedure is:
For p features, starting from the null model with no features:
for each k from 0 to p−1, consider the p−k models formed by adding one more feature to the k already chosen, and keep the best one (smallest RSS or largest R²);
repeat until all p features have entered the model;
finally, choose the best of the resulting models (by cross-validation error, Cp, BIC, adjusted R², etc.).
Backward stepwise selection is analogous, except that it starts from the model with all p features and at each iteration drops the one feature whose removal improves the model most.
The difference from best subset selection is that best subset may pick any k+1 features at stage k+1, whereas stepwise selection must build on the k features already chosen. Stepwise selection therefore cannot guarantee the optimum: an unimportant feature picked early must be carried through every later iteration, which can rule out the best combination. The advantage is a greatly reduced computational cost, on the order of p(p+1)/2 models instead of 2^p, which makes the method far more practical.
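The greedy procedure is simple enough to write out by hand. A minimal base-R sketch (again on the built-in mtcars data rather than Hitters, purely for illustration):

```r
# Forward stepwise selection: at each step, add the single remaining
# predictor that most reduces the RSS of the fitted linear model.
# Total models fit: p + (p-1) + ... + 1 = p(p+1)/2, matching the text.
response   <- "mpg"
candidates <- setdiff(names(mtcars), response)

rss_of <- function(vars) {
  f <- if (length(vars)) reformulate(vars, response = response) else mpg ~ 1
  sum(resid(lm(f, data = mtcars))^2)
}

selected <- character(0)
path <- vector("list", length(candidates))
for (k in seq_along(candidates)) {
  remaining <- setdiff(candidates, selected)
  rss <- sapply(remaining, function(v) rss_of(c(selected, v)))
  selected <- c(selected, remaining[which.min(rss)])
  path[[k]] <- selected  # the k-predictor model M_k
}
path[[1]]  # first variable entered
```

Note that once a variable enters `selected` it is never removed, which is exactly the "no backtracking" property described above; `regsubsets(..., method = "forward")` implements the same idea.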
> regfit.fwd = regsubsets(Salary ~ ., data = Hitters, nvmax = 19, method = "forward")
> summary(regfit.fwd)  # show the forward selection path
Subset selection object
Call: regsubsets.formula(Salary ~ ., data = Hitters, nvmax = 19, method = "forward")
Selection Algorithm: forward
          AtBat Hits HmRun Runs RBI Walks Years CAtBat CHits
1  ( 1 )  " "   " "  " "   " "  " " " "   " "   " "    " "
2  ( 1 )  " "   "*"  " "   " "  " " " "   " "   " "    " "
3  ( 1 )  " "   "*"  " "   " "  " " " "   " "   " "    " "
4  ( 1 )  " "   "*"  " "   " "  " " " "   " "   " "    " "
5  ( 1 )  "*"   "*"  " "   " "  " " " "   " "   " "    " "
6  ( 1 )  "*"   "*"  " "   " "  " " "*"   " "   " "    " "
7  ( 1 )  "*"   "*"  " "   " "  " " "*"   " "   " "    " "
8  ( 1 )  "*"   "*"  " "   " "  " " "*"   " "   " "    " "
9  ( 1 )  "*"   "*"  " "   " "  " " "*"   " "   "*"    " "
10 ( 1 )  "*"   "*"  " "   " "  " " "*"   " "   "*"    " "
11 ( 1 )  "*"   "*"  " "   " "  " " "*"   " "   "*"    " "
12 ( 1 )  "*"   "*"  " "   "*"  " " "*"   " "   "*"    " "
13 ( 1 )  "*"   "*"  " "   "*"  " " "*"   " "   "*"    " "
14 ( 1 )  "*"   "*"  "*"   "*"  " " "*"   " "   "*"    " "
15 ( 1 )  "*"   "*"  "*"   "*"  " " "*"   " "   "*"    "*"
16 ( 1 )  "*"   "*"  "*"   "*"  "*" "*"   " "   "*"    "*"
17 ( 1 )  "*"   "*"  "*"   "*"  "*" "*"   " "   "*"    "*"
18 ( 1 )  "*"   "*"  "*"   "*"  "*" "*"   "*"   "*"    "*"
19 ( 1 )  "*"   "*"  "*"   "*"  "*" "*"   "*"   "*"    "*"
          CHmRun CRuns CRBI CWalks LeagueN DivisionW PutOuts
1  ( 1 )  " "    " "   "*"  " "    " "     " "       " "
2  ( 1 )  " "    " "   "*"  " "    " "     " "       " "
3  ( 1 )  " "    " "   "*"  " "    " "     " "       "*"
4  ( 1 )  " "    " "   "*"  " "    " "     "*"       "*"
5  ( 1 )  " "    " "   "*"  " "    " "     "*"       "*"
6  ( 1 )  " "    " "   "*"  " "    " "     "*"       "*"
7  ( 1 )  " "    " "   "*"  "*"    " "     "*"       "*"
8  ( 1 )  " "    "*"   "*"  "*"    " "     "*"       "*"
9  ( 1 )  " "    "*"   "*"  "*"    " "     "*"       "*"
10 ( 1 )  " "    "*"   "*"  "*"    " "     "*"       "*"
11 ( 1 )  " "    "*"   "*"  "*"    "*"     "*"       "*"
12 ( 1 )  " "    "*"   "*"  "*"    "*"     "*"       "*"
13 ( 1 )  " "    "*"   "*"  "*"    "*"     "*"       "*"
14 ( 1 )  " "    "*"   "*"  "*"    "*"     "*"       "*"
15 ( 1 )  " "    "*"   "*"  "*"    "*"     "*"       "*"
16 ( 1 )  " "    "*"   "*"  "*"    "*"     "*"       "*"
17 ( 1 )  " "    "*"   "*"  "*"    "*"     "*"       "*"
18 ( 1 )  " "    "*"   "*"  "*"    "*"     "*"       "*"
19 ( 1 )  "*"    "*"   "*"  "*"    "*"     "*"       "*"
          Assists Errors NewLeagueN
1  ( 1 )  " "     " "    " "
2  ( 1 )  " "     " "    " "
3  ( 1 )  " "     " "    " "
4  ( 1 )  " "     " "    " "
5  ( 1 )  " "     " "    " "
6  ( 1 )  " "     " "    " "
7  ( 1 )  " "     " "    " "
8  ( 1 )  " "     " "    " "
9  ( 1 )  " "     " "    " "
10 ( 1 )  "*"     " "    " "
11 ( 1 )  "*"     " "    " "
12 ( 1 )  "*"     " "    " "
13 ( 1 )  "*"     "*"    " "
14 ( 1 )  "*"     "*"    " "
15 ( 1 )  "*"     "*"    " "
16 ( 1 )  "*"     "*"    " "
17 ( 1 )  "*"     "*"    "*"
18 ( 1 )  "*"     "*"    "*"
19 ( 1 )  "*"     "*"    "*"
> regfit.bwd = regsubsets(Salary ~ ., data = Hitters, nvmax = 19, method = "backward")
> summary(regfit.bwd)  # show the backward selection path
Subset selection object
Call: regsubsets.formula(Salary ~ ., data = Hitters, nvmax = 19, method = "backward")
Selection Algorithm: backward
          AtBat Hits HmRun Runs RBI Walks Years CAtBat CHits
1  ( 1 )  " "   " "  " "   " "  " " " "   " "   " "    " "
2  ( 1 )  " "   "*"  " "   " "  " " " "   " "   " "    " "
3  ( 1 )  " "   "*"  " "   " "  " " " "   " "   " "    " "
4  ( 1 )  "*"   "*"  " "   " "  " " " "   " "   " "    " "
5  ( 1 )  "*"   "*"  " "   " "  " " "*"   " "   " "    " "
6  ( 1 )  "*"   "*"  " "   " "  " " "*"   " "   " "    " "
7  ( 1 )  "*"   "*"  " "   " "  " " "*"   " "   " "    " "
8  ( 1 )  "*"   "*"  " "   " "  " " "*"   " "   " "    " "
9  ( 1 )  "*"   "*"  " "   " "  " " "*"   " "   "*"    " "
10 ( 1 )  "*"   "*"  " "   " "  " " "*"   " "   "*"    " "
11 ( 1 )  "*"   "*"  " "   " "  " " "*"   " "   "*"    " "
12 ( 1 )  "*"   "*"  " "   "*"  " " "*"   " "   "*"    " "
13 ( 1 )  "*"   "*"  " "   "*"  " " "*"   " "   "*"    " "
14 ( 1 )  "*"   "*"  "*"   "*"  " " "*"   " "   "*"    " "
15 ( 1 )  "*"   "*"  "*"   "*"  " " "*"   " "   "*"    "*"
16 ( 1 )  "*"   "*"  "*"   "*"  "*" "*"   " "   "*"    "*"
17 ( 1 )  "*"   "*"  "*"   "*"  "*" "*"   " "   "*"    "*"
18 ( 1 )  "*"   "*"  "*"   "*"  "*" "*"   "*"   "*"    "*"
19 ( 1 )  "*"   "*"  "*"   "*"  "*" "*"   "*"   "*"    "*"
          CHmRun CRuns CRBI CWalks LeagueN DivisionW PutOuts
1  ( 1 )  " "    "*"   " "  " "    " "     " "       " "
2  ( 1 )  " "    "*"   " "  " "    " "     " "       " "
3  ( 1 )  " "    "*"   " "  " "    " "     " "       "*"
4  ( 1 )  " "    "*"   " "  " "    " "     " "       "*"
5  ( 1 )  " "    "*"   " "  " "    " "     " "       "*"
6  ( 1 )  " "    "*"   " "  " "    " "     "*"       "*"
7  ( 1 )  " "    "*"   " "  "*"    " "     "*"       "*"
8  ( 1 )  " "    "*"   "*"  "*"    " "     "*"       "*"
9  ( 1 )  " "    "*"   "*"  "*"    " "     "*"       "*"
10 ( 1 )  " "    "*"   "*"  "*"    " "     "*"       "*"
11 ( 1 )  " "    "*"   "*"  "*"    "*"     "*"       "*"
12 ( 1 )  " "    "*"   "*"  "*"    "*"     "*"       "*"
13 ( 1 )  " "    "*"   "*"  "*"    "*"     "*"       "*"
14 ( 1 )  " "    "*"   "*"  "*"    "*"     "*"       "*"
15 ( 1 )  " "    "*"   "*"  "*"    "*"     "*"       "*"
16 ( 1 )  " "    "*"   "*"  "*"    "*"     "*"       "*"
17 ( 1 )  " "    "*"   "*"  "*"    "*"     "*"       "*"
18 ( 1 )  " "    "*"   "*"  "*"    "*"     "*"       "*"
19 ( 1 )  "*"    "*"   "*"  "*"    "*"     "*"       "*"
          Assists Errors NewLeagueN
1  ( 1 )  " "     " "    " "
2  ( 1 )  " "     " "    " "
3  ( 1 )  " "     " "    " "
4  ( 1 )  " "     " "    " "
5  ( 1 )  " "     " "    " "
6  ( 1 )  " "     " "    " "
7  ( 1 )  " "     " "    " "
8  ( 1 )  " "     " "    " "
9  ( 1 )  " "     " "    " "
10 ( 1 )  "*"     " "    " "
11 ( 1 )  "*"     " "    " "
12 ( 1 )  "*"     " "    " "
13 ( 1 )  "*"     "*"    " "
14 ( 1 )  "*"     "*"    " "
15 ( 1 )  "*"     "*"    " "
16 ( 1 )  "*"     "*"    " "
17 ( 1 )  "*"     "*"    "*"
18 ( 1 )  "*"     "*"    "*"
19 ( 1 )  "*"     "*"    "*"
Note that best subset selection, forward stepwise, and backward stepwise can select different features:
> coef(regfit.full, 7)
 (Intercept)         Hits        Walks       CAtBat        CHits
  79.4509472    1.2833513    3.2274264   -0.3752350    1.4957073
      CHmRun    DivisionW      PutOuts
   1.4420538 -129.9866432    0.2366813
> coef(regfit.fwd, 7)
 (Intercept)        AtBat         Hits        Walks         CRBI
 109.7873062   -1.9588851    7.4498772    4.9131401    0.8537622
      CWalks    DivisionW      PutOuts
  -0.3053070 -127.1223928    0.2533404
> coef(regfit.bwd, 7)
 (Intercept)        AtBat         Hits        Walks        CRuns
 105.6487488   -1.9762838    6.7574914    6.0558691    1.1293095
      CWalks    DivisionW      PutOuts
  -0.7163346 -116.1692169    0.3028847
Cross-Validation
Cross-validation is a general-purpose technique in machine learning for assessing a model's quality and variance; it is not tied to any particular model. Here we introduce k-fold cross-validation, a compromise between the simple validation-set approach and leave-one-out CV. The procedure:
Randomly divide the samples into k folds of roughly equal size (k is usually taken to be 10);
for each fold, fit the model on the remaining k−1 folds and compute the prediction error on the held-out fold;
average the k error estimates to obtain the cross-validation error.
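The fold-splitting step above can be sketched directly in base R. A minimal 10-fold CV estimate of test MSE for a fixed model (built-in mtcars data and an arbitrary two-predictor formula, purely for illustration):

```r
# k-fold cross-validation: randomly assign each row to one of k folds,
# then for each fold train on the other k-1 folds and measure the
# prediction error on the held-out fold.
set.seed(1)
k <- 10
n <- nrow(mtcars)
folds <- sample(rep(1:k, length.out = n))  # roughly equal-sized folds

cv_mse <- sapply(1:k, function(i) {
  fit <- lm(mpg ~ wt + hp, data = mtcars[folds != i, ])
  held_out <- mtcars[folds == i, ]
  mean((held_out$mpg - predict(fit, held_out))^2)
})
mean(cv_mse)  # cross-validated estimate of test MSE
```

For feature selection, the same loop would be run once per candidate model size, and the size with the smallest cross-validated error chosen.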
http://www.cnblogs.com/lafengdatascientist/p/7168507.html