Optimization: A Method to Speed Up Grid Search by ~10x
Environment

- Python 2.7
- PyCharm 5.0.3
- scikit-learn 0.18.1
Preface

A quick disclaimer before anyone comes after me: there is no theoretical proof behind this, only empirical evidence that it works in my model tests, **and the waiting time drops to roughly one tenth of the original**!! Exciting, right? Hahaha.
How It Works

Basic idea: **first locate the promising region coarsely, then refine, and refine again**, with a flexible, shrinking step. Below I refer to this method as **FCV** (Flexible CV).

If you are not familiar with CV, see @MrLevo520 – Summary: Bias, Error, Variance and CV (cross-validation).

Pseudocode

The idea is easy to understand, so here is the pseudocode. I was too lazy to type it out, so see the handwritten sketches in Appendix 1.
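For readers who cannot view the sketches, here is a minimal Python sketch of the same coarse-to-fine idea for a single parameter. It is an illustration under assumptions, not the exact code used later in this post: cv_score is a hypothetical stand-in for whatever cross-validated scoring you run on a candidate value.

def flexible_search(cv_score, minedge, maxedge, step):
    """Coarse-to-fine search for one integer parameter.

    cv_score(value) -> float is assumed to return a cross-validated score.
    """
    best = None
    while step > 0:
        candidates = range(minedge, maxedge, step)
        best = max(candidates, key=cv_score)   # coarse pass over the current window
        minedge = best - step - step // 2      # shrink the window around the optimum,
        maxedge = best + step + step // 2      # padded by an elastic margin of 0.5 * step
        step = step // 2                       # halve the step for the next, finer pass
    return best

Each pass fits far fewer candidate values than an exhaustive fine-grained grid, which is where the speed-up comes from.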
FCV Timing Test

Taking GBDT as an example, I tested with n_estimators from 190 to 300, max_depth from 2 to 9, and CV=3.

Plain GridSearchCV fitted 110 * 7 * 3 = 2310 times in total and took 1842 min (about 30.7 hours), giving the optimal parameters n_estimators=289, max_depth=3.

FCV fitted only 345 times in total and took 166 min (about 2.8 hours), giving the optimal parameters n_estimators=256, max_depth=3.

Time-wise that is roughly an 11x difference (1842 / 166 ≈ 11). So what about accuracy? See the CV scores below.
FCV Accuracy Test

I compared GBDT, RF, XGBoost, and SVM with cross-validation; for a given algorithm, all other parameters were kept identical between the two runs.

GBDT results
# Compare the optimum found by plain grid search (clf1) with the FCV optimum (clf2)
from sklearn import model_selection
from sklearn.ensemble import GradientBoostingClassifier

clf1 = GradientBoostingClassifier(max_depth=3, n_estimators=289)  # plain grid search optimum
score1 = model_selection.cross_val_score(clf1, train_data, train_label, cv=5)
print score1

clf2 = GradientBoostingClassifier(max_depth=3, n_estimators=256)  # FCV optimum
score2 = model_selection.cross_val_score(clf2, train_data, train_label, cv=5)
print score2

# Cross-validation scores of the two methods
# Plain grid search, CV=5: [ 0.79807692 0.82038835 0.80684597 0.76108374 0.78163772]
# FCV,               CV=5: [ 0.79567308 0.82038835 0.799511 0.76847291 0.78411911]
# Plain grid search, CV=10: [ 0.83333333 0.78571429 0.84615385 0.7961165 0.81067961 0.80097087 0.77227723 0.78109453 0.785 0.74111675]
# FCV,               CV=10: [ 0.85238095 0.78095238 0.85096154 0.7961165 0.81553398 0.7961165 0.76732673 0.79104478 0.795 0.75126904]

XGBoost results
from xgboost import XGBClassifier

clf1 = XGBClassifier(max_depth=6, n_estimators=200)  # plain grid search optimum
score1 = model_selection.cross_val_score(clf1, train_data, train_label, cv=5)
print score1

clf2 = XGBClassifier(max_depth=4, n_estimators=292)  # FCV optimum
score2 = model_selection.cross_val_score(clf2, train_data, train_label, cv=5)
print score2

# Plain grid search, CV=5: [ 0.79086538 0.83737864 0.80929095 0.79310345 0.7866005 ]
# FCV,               CV=5: [ 0.80288462 0.84466019 0.8190709 0.79064039 0.78163772]
RF (Random Forest) results

Note: because of the randomness in RF (both the samples and the features are selected randomly), the result is not stable even under cross-validation. Running the same program as a multi-process job on a server and again on my laptop gave best values of 247 and 253 respectively, so the two runs do not have to agree. (As if GBDT and XGBoost were not tree-based as well 0.o)
from sklearn.ensemble import RandomForestClassifier

clf1 = RandomForestClassifier(n_estimators=205)  # plain grid search optimum
score1 = model_selection.cross_val_score(clf1, train_data, train_label, cv=5)
print score1

clf2 = RandomForestClassifier(n_estimators=253)  # FCV optimum
score2 = model_selection.cross_val_score(clf2, train_data, train_label, cv=5)
print score2

# Plain grid search, CV=5: [ 0.77884615 0.82038835 0.78239609 0.79310345 0.74937965]
# FCV,               CV=5: [ 0.75721154 0.83495146 0.79217604 0.79310345 0.75186104]
SVM results
from sklearn.svm import SVC

clf1 = SVC(kernel='rbf', C=16, gamma=0.18)     # optimum from the default search
score1 = model_selection.cross_val_score(clf1, train_data, train_label, cv=5)
print score1

clf2 = SVC(kernel='rbf', C=94.75, gamma=0.17)  # FCV optimum
score2 = model_selection.cross_val_score(clf2, train_data, train_label, cv=5)
print score2

# Default grid search over (2^-8, 2^8): [ 0.88468992 0.88942774 0.88487805 0.8856305 0.8745098 ]
# FCV search:                           [ 0.90116279 0.9185257 0.90146341 0.90713587 0.89509804]
FCV Advantages

- It splits the search range, which greatly improves speed: compared with plain grid search the speed-up is roughly a factor of Step, and the wider the search range, the more obvious the gain.
- The elastic coefficient can be tuned to your own data set (default 0.5 * Step); choosing it well reduces the chance of missing values that fall between windows (see the sketch below).
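As a concrete illustration of the window update (the numbers here are made up for the example, not taken from the tests above): with Step = 10 and a current optimum of 250, the next window is not just 250 ± 10 but is padded by the elastic margin of 0.5 * Step.

best, step = 250, 10               # hypothetical values, for illustration only
margin = step // 2                 # default elastic coefficient: 0.5 * Step
next_min = best - step - margin    # 250 - 10 - 5 = 235
next_max = best + step + margin    # 250 + 10 + 5 = 265
next_step = step // 2              # the step is halved to 5 for the next pass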
FCV Disadvantages

- There is some chance of getting stuck in a local optimum.
- It takes some experience to choose a sensible search range and step size Step.
FCV Code

Again taking GBDT as the example (my RF version was rewritten with multiprocessing), this searches for two optimal parameters at once. The idea is exactly the same as above, so if you followed that, this code should pose no problem.
# -*- coding: utf-8 -*-
from sklearn import metrics, model_selection
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import GradientBoostingClassifier
import LoadSolitData  # my own data-loading helper


def gradient_boosting_classifier(train_x, train_y, Nminedge, Nmaxedge, Nstep, Dminedge, Dmaxedge, Dstep):
    """One grid-search pass over the current window of n_estimators / max_depth."""
    model = GradientBoostingClassifier()
    param_grid = {'n_estimators': range(Nminedge, Nmaxedge, Nstep),
                  'max_depth': range(Dminedge, Dmaxedge, Dstep)}
    grid_search = GridSearchCV(model, param_grid, n_jobs=1, verbose=1)
    grid_search.fit(train_x, train_y)
    best_parameters = grid_search.best_estimator_.get_params()
    for para, val in best_parameters.items():
        print para, val
    return best_parameters["n_estimators"], best_parameters["max_depth"]


def FlexSearch(train_x, train_y, Nminedge, Nmaxedge, Nstep, Dminedge, Dmaxedge, Dstep):
    """Recursively shrink the window around the current optimum and halve the step."""
    NmedEdge, DmedEdge = gradient_boosting_classifier(train_x, train_y, Nminedge, Nmaxedge, Nstep,
                                                      Dminedge, Dmaxedge, Dstep)
    # New window: current optimum +/- (Step plus an elastic margin of 0.5 * Step)
    Nminedge = NmedEdge - Nstep - Nstep / 2
    Nmaxedge = NmedEdge + Nstep + Nstep / 2
    Dminedge = DmedEdge - Dstep
    Dmaxedge = DmedEdge + Dstep
    Nstep = Nstep / 2  # integer division under Python 2
    print "Current bestPara:N-%.2f,D-%.2f" % (NmedEdge, DmedEdge)
    print "Next Range:N-%d-%d,D-%d-%d" % (Nminedge, Nmaxedge, Dminedge, Dmaxedge)
    print "Next step:N-%.2f,D-%.2f" % (Nstep, Dstep)
    if Nstep > 0:
        return FlexSearch(train_x, train_y, Nminedge, Nmaxedge, Nstep, Dminedge, Dmaxedge, Dstep)
    else:
        print "The bestPara:N-%.2f,D-%.2f" % (NmedEdge, DmedEdge)
        return NmedEdge, DmedEdge


if __name__ == '__main__':
    # Load your own data here; mine comes from a helper module.
    train_data, train_label, test_data, test_label = LoadSolitData.TrainSize(1, 0.2, [i for i in range(1, 14)], "outclass")
    NmedEdge, DmedEdge = FlexSearch(train_data, train_label, 190, 300, 10, 2, 9, 1)
    Classifier = GradientBoostingClassifier(n_estimators=NmedEdge, max_depth=DmedEdge)
    Classifier.fit(train_data, train_label)
    predict_label = Classifier.predict(test_data)
    report = metrics.classification_report(test_label, predict_label)
    print report

What’s More
Combining multiprocessing with FCV speeds things up by roughly another order of magnitude. I tentatively call this the MFCV method.

If you are not familiar with multiprocessing and multithreading, see:

@MrLevo520 – A beginner's introduction to multiprocessing in Python
@MrLevo520 – A beginner's introduction to multithreading in Python
How MFCV Works

Building on FCV: compute gap = (maxedge - minedge) / k, where k is the number of processes. Cut the large range into k small sub-ranges, run FCV on each sub-range, collect the k local optima, and finally run one more selection over those k candidates to pick the overall best.
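A minimal sketch of how the sub-ranges could be computed (the helper name split_range is illustrative and not part of the code below):

def split_range(minedge, maxedge, k):
    """Split [minedge, maxedge) into k contiguous sub-ranges, one per process."""
    gap = (maxedge - minedge) // k
    return [(minedge + (i - 1) * gap, minedge + i * gap) for i in range(1, k + 1)]

# e.g. split_range(180, 330, 3) -> [(180, 230), (230, 280), (280, 330)]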
MFCV Advantages

- Even faster, thanks to concurrent processes.
- Because FCV runs on each slice separately and the slice winners are then compared in a final CV round, it partly avoids FCV's tendency to get stuck in a local optimum.

MFCV Disadvantages

- Unlike FCV, it currently only supports multi-process search over a single parameter. I have not yet worked out how to do the two-parameter case with multiprocessing; I will revisit it if I figure it out.

MFCV Timing Test

Since only a single parameter is supported, I used RF as the example: n_estimators from 190 to 330, CV=10. Plain GridSearchCV took 97.2 min, FCV took 18 min, and MFCV took 14 min. The total number of fits here is small, so the gain is not dramatic, but you can imagine the effect when a single parameter spans a very wide range...
MFCV Accuracy Test
from sklearn.ensemble import RandomForestClassifier

clf1 = RandomForestClassifier(n_estimators=205)  # plain grid search optimum
score1 = model_selection.cross_val_score(clf1, train_data, train_label, cv=10)
print score1

clf2 = RandomForestClassifier(n_estimators=239)  # MFCV optimum
score2 = model_selection.cross_val_score(clf2, train_data, train_label, cv=10)
print score2

# Plain grid search, CV=10: [ 0.84169884 0.85328185 0.85907336 0.84912959 0.82101167 0.8540856 0.84015595 0.87968442 0.84980237 0.83003953]
# MFCV,              CV=10: [ 0.84169884 0.86100386 0.85521236 0.86266925 0.82101167 0.85019455 0.83625731 0.87771203 0.85177866 0.83201581]
Compared with the traditional method the scores are essentially on par, while the speed improves by about an order of magnitude, so I consider MFCV worth adopting.

MFCV Code
# -*- coding: utf-8 -*-
#! /usr/bin/python
from multiprocessing import Process
from sklearn.ensemble import RandomForestClassifier
from sklearn import metrics
from sklearn.model_selection import GridSearchCV
import Confusion_Show as cs  # my own confusion-matrix plotting helper
import LoadSolitData         # my own data-loading helper


def write2txt(data, storename):
    # Each worker process writes its local optimum to a text file.
    f = open(storename, "w")
    f.write(str(data))
    f.close()


def V_candidate(candidatelist, train_x, train_y):
    # Final round: pick the best value among the k per-process optima.
    model = RandomForestClassifier()
    param_grid = {'n_estimators': candidatelist}
    grid_search = GridSearchCV(model, param_grid, n_jobs=1, verbose=1, cv=10)
    grid_search.fit(train_x, train_y)
    best_parameters = grid_search.best_estimator_.get_params()
    for para, val in best_parameters.items():
        print para, val
    return best_parameters["n_estimators"]


def random_forest_classifier(train_x, train_y, minedge, maxedge, step):
    # One grid-search pass over the current n_estimators window.
    model = RandomForestClassifier()
    param_grid = {'n_estimators': range(minedge, maxedge, step)}
    grid_search = GridSearchCV(model, param_grid, n_jobs=1, verbose=1, cv=10)
    grid_search.fit(train_x, train_y)
    best_parameters = grid_search.best_estimator_.get_params()
    for para, val in best_parameters.items():
        print para, val
    return best_parameters["n_estimators"]


def FlexSearch(train_x, train_y, minedge, maxedge, step, storename):
    # FCV on one sub-range: shrink the window around the optimum and halve the step.
    mededge = random_forest_classifier(train_x, train_y, minedge, maxedge, step)
    minedge = mededge - step - step / 2
    maxedge = mededge + step + step / 2
    step = step / 2  # integer division under Python 2
    print "Current bestPara:", mededge
    print "Next Range:%d-%d" % (minedge, maxedge)
    print "Next step:", step
    if step > 0:
        return FlexSearch(train_x, train_y, minedge, maxedge, step, storename)
    elif step == 0:
        print "The bestPara:", mededge
        write2txt(mededge, "RF_%s.txt" % storename)


def Calc(classifier):
    # Train the final model and report OA, AA, kappa and a per-class report.
    classifier.fit(train_data, train_label)
    predict_label = classifier.predict(test_data)
    total_cor_num = 0.0
    dictTotalLabel = {}
    dictCorrectLabel = {}
    for label_i in range(len(predict_label)):
        if predict_label[label_i] == test_label[label_i]:
            total_cor_num += 1
            if predict_label[label_i] not in dictCorrectLabel: dictCorrectLabel[predict_label[label_i]] = 0
            dictCorrectLabel[predict_label[label_i]] += 1.0
        if test_label[label_i] not in dictTotalLabel: dictTotalLabel[test_label[label_i]] = 0
        dictTotalLabel[test_label[label_i]] += 1.0
    accuracy = metrics.accuracy_score(test_label, predict_label) * 100
    kappa_score = metrics.cohen_kappa_score(test_label, predict_label)
    average_accuracy = 0.0
    label_num = 0
    for key_i in dictTotalLabel:
        try:
            average_accuracy += (dictCorrectLabel[key_i] / dictTotalLabel[key_i]) * 100
            label_num += 1
        except KeyError:
            pass  # no correct predictions for this class
    average_accuracy = average_accuracy / label_num
    result = "OA:%.4f;AA:%.4f;KAPPA:%.4f" % (accuracy, average_accuracy, kappa_score)
    print result
    report = metrics.classification_report(test_label, predict_label)
    print report
    cm = metrics.confusion_matrix(test_label, predict_label)
    cs.ConfusionMatrixPng(cm, ['1', '2', '3', '4', '5', '6', '7', '8', '9', '10', '11', '12', '13'])


if __name__ == '__main__':
    # Load your own data set here.
    train_data, train_label, test_data, test_label = LoadSolitData.TrainSize(1, 0.5, [i for i in range(1, 14)], "outclass")
    minedge = 180
    maxedge = 330
    step = 10
    gap = (maxedge - minedge) / 3  # split the range into 3 equal sub-ranges
    subprocess = []
    for i in range(1, 4):  # spawn 3 worker processes, one per sub-range
        p = Process(target=FlexSearch, args=(train_data, train_label, minedge + (i - 1) * gap, minedge + i * gap, step, i))
        subprocess.append(p)
    for j in subprocess:
        j.start()
    for k in subprocess:
        k.join()
    candidatelist = []
    for i in range(1, 4):
        with open("RF_%s.txt" % i) as f:
            candidatelist.append(int(f.readlines()[0].strip()))
    print "candidatelist:", candidatelist
    best_n_estimators = V_candidate(candidatelist, train_data, train_label)
    print "best_n_estimators", best_n_estimators
    Classifier = RandomForestClassifier(n_estimators=best_n_estimators)
    Calc(Classifier)

Summary

Use MFCV for single-parameter tuning and FCV for multi-parameter tuning. The CV scores show that the results stay consistent with an ordinary exhaustive grid search, while the speed-up is substantial, and the wider the search range, the bigger the gain. ISO9001-certified, please enjoy with confidence...... The algorithm is extremely simple, but who knows, it might catch on someday, so I reserve all rights; please credit this post when reposting. (Need I point out how Bootstrap, an equally simple idea, came to dominate statistics? BTW, the idea behind GridSearch is pretty simple too.)
Appendix 1 – FCV Handwritten Sketches

Sketch 1

Sketch 2
Appendix 2

If you also want to try xgboost, an excellent classifier, I strongly recommend tuning it with FCV. Thanks to OpenMP, xgboost automatically uses all the cores of a single machine's CPU for parallel computation, so the moment it starts running the CPU goes into roast-chicken mode. Keeping the number of fits during tuning as small as possible is therefore especially important!
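If you also want to keep some cores free while tuning, you can usually cap the number of threads xgboost uses. The exact parameter name depends on your xgboost version (nthread in older releases, n_jobs in newer scikit-learn-style wrappers), so treat this as a sketch to adapt:

from xgboost import XGBClassifier

# Cap xgboost at 2 OpenMP threads during tuning; the parameter name may vary by version.
clf = XGBClassifier(max_depth=4, n_estimators=292, nthread=2)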
For how to install xgboost, see @MrLevo520 – Solving the problems encountered when installing xgboost for Python on win10_x64.
Acknowledgements

@abcjennifer – Parallel parameter tuning in Python with scikit-learn grid_search
@MrLevo520 – Solving the problems encountered when installing xgboost for Python on win10_x64
@wzmsltw – XGBoost Python complete parameter-tuning guide: parameter explanations
@u010414589 – xgboost tuning experience
@MrLevo520 – Summary: Bias, Error, Variance and CV (cross-validation)
@MrLevo520 – A beginner's introduction to multiprocessing in Python
@MrLevo520 – A beginner's introduction to multithreading in Python