天池-半导体质量预测
最近跟着做天池的比赛,将比赛过程中遇到的问题记录如下:
1.特征的选择?
特征选择的方法: 1) 嵌入式 2) 过滤式 3) 封装式
1)数据清洗:
- 1.筛选掉重复的列
- 2.对于类别类型特征,利用sklearn编码(one-hot, Label Encoder等)
- 3.使用平均值填充完后再去除冗余列(方差为0,列重复)
清洗过后,特征从原来的8000多维降到了3400多维.
- 4.特征中存在全为NaN值的,也去掉这些列
总结:数据清洗过后,总的特征维数维3342维;随机森林MSE为:0.03612
2)特征选择:
- 嵌入式: 根据模型来分析特征的重要性,最常见的方式为正则化来做特征选择.
- 过滤式: 评估单个特征和结果之间的相关程度,排序留下Top相关的特征部分.(缺点:没有考虑到特征之间的关联作用,可能把有用的关联特征误踢掉)
- 封装式: 把特征选择看作一个特征子集搜索问题,筛选各种特征子集,用模型评估效果.
处理方法:
- 过滤式:使用单个随机森林得到的featureimportances排序后保留了44个特征,为
[‘310X207’, ‘210X158’, ‘311X7’, ‘330X1132’, ‘220X13’, ‘310X149’, ‘750X883’, ‘210X207’, ‘312X144’, ‘210X192’, ‘312X61’, ‘312X66’, ‘440AX77’, ‘220X197’, ‘310X153’, ‘330X1190’, ‘344X252’, ‘310X33’, ‘210X174’, ‘440AX95’, ‘312X777’, ‘330X102’, ‘440AX187’, ‘340X161’, ‘312X55’, ‘330X590’, ‘210X89’, ‘330X1129’, ‘210X164’, ‘210X188’, ‘330X1146’, ‘310X119’, ‘360X1049’, ‘440AX182’, ‘750X640’, ‘440AX65’, ‘312X789’, ‘311X154’, ‘310X43’, ‘312X782’, ‘312X555’, ‘420X4’, ‘312X785’, ‘210X229’]
- 包裹式:利用随机森林的性能作为评价指标筛选出200个特征
set([‘440AX98’, ‘310X207’, ‘210X158’, ‘261X641’, ‘261X269’, ‘312X61’, ‘312X66’, ‘220X197’, ‘520X317’, ‘400X151’, ‘400X150’, ‘400X153’, ‘330X594’, ‘330X590’, ‘210X164’, ‘210X8’, ‘420X4’, ‘330X1223’, ‘310X117’, ‘261X524’, ‘310X119’, ‘261X607’, ‘750X640’, ‘210X126’, ‘311X154’, ‘312X555’, ‘261X689’, ‘520X245’, ‘261X477’, ‘750X883’, ‘330X589’, ‘261X590’, ‘300X8’, ‘261X591’, ‘261X468’, ‘440AX77’, ‘300X3’, ‘220X179’, ‘330X1190’, ‘220X177’, ‘220X176’, ‘220X175’, ‘220X174’, ‘220X173’, ‘220X172’, ‘220X171’, ‘220X170’, ‘261X464’, ‘TOOL (#2)’, ‘330X351’, ‘330X102’, ‘330X355’, ‘330X354’, ‘330X1049’, ‘330X1042’, ‘312X57’, ‘312X55’, ‘210X89’, ‘330X1040’, ‘330X1043’, ‘330X1129’, ‘261X460’, ‘330X1044’, ‘261X462’, ‘330X1046’, ‘210X188’, ‘330X353’, ‘360X1049’, ‘440AX66’, ‘440AX67’, ‘440AX64’, ‘440AX65’, ‘330X135’, ‘330X134’, ‘312X144’, ‘330X133’, ‘330X132’, ‘330X139’, ‘312X782’, ‘210X174’, ‘312X785’, ‘312X789’, ‘261X608’, ‘261X609’, ‘520X312’, ‘520X313’, ‘520X314’, ‘330X1228’, ‘420X33’, ‘330X1132’, ‘261X600’, ‘261X601’, ‘330X641’, ‘330X1226’, ‘330X1221’, ‘330X1220’, ‘210X206’, ‘210X207’, ‘261X598’, ‘261X599’, ‘340X105’, ‘340X107’, ‘210X190’, ‘210X191’, ‘210X192’, ‘261X593’, ‘261X594’, ‘261X596’, ‘261X597’, ‘220X557’, ‘220X551’, ‘310X37’, ‘310X36’, ‘310X34’, ‘310X33’, ‘261X268’, ‘310X31’, ‘310X30’, ‘440AX95’, ‘210X3’, ‘210X4’, ‘210X5’, ‘210X6’, ‘210X7’, ‘312X777’, ‘210X9’, ‘261X260’, ‘261X261’, ‘261X266’, ‘261X267’, ‘261X264’, ‘261X265’, ‘520X246’, ‘520X247’, ‘261X736’, ‘261X737’, ‘520X242’, ‘520X243’, ‘310X153’, ‘344X252’, ‘440AX90’, ‘261X262’, ‘330X1146’, ‘440AX182’, ‘440AX187’, ‘261X687’, ‘261X688’, ‘310X43’, ‘330X157’, ‘330X404’, ‘261X512’, ‘261X513’, ‘330X401’, ‘520X55’, ‘330X403’, ‘261X517’, ‘261X518’, ‘261X519’, ‘311X7’, ‘330X409’, ‘330X159’, ‘330X158’, ‘330X461’, ‘520X333’, ‘220X13’, ‘310X149’, ‘520X244’, ‘261X338’, ‘330X1249’, ‘330X1248’, ‘300X7’, ‘261X330’, ‘261X331’, ‘340X161’, ‘261X333’, ‘330X1247’, ‘344X121’, ‘261X336’, ‘330X1244’, ‘520X240’, ‘330X1230’, ‘520X241’, ‘330X1241’, ‘261X335’, ‘220X535’, ‘210X129’, ‘210X128’, ‘220X531’, ‘220X530’, ‘210X125’, ‘210X124’, ‘210X127’, ‘261X526’, ‘210X121’, ‘210X120’, ‘210X123’, ‘261X230’, ‘261X592’, ‘440AX123’, ‘261X742’, ‘440AX99’, ‘311X83’, ‘220X178’, ‘330X535’, ‘210X229’]
1) 提取特征后,xgboost的mse为0.0325341683406
2) 单个随机森林的5折交叉验证的平均mse为0.0288353227614
(max_depth=None,n_estimators=160,min_samples_leaf=2,max_features=n_features)
使用模型的featuresimportances选择的特征和rfe做交集得到的特征为:
[‘210X158’, ‘330X1228’, ‘330X1132’, ‘220X13’, ‘310X149’, ‘750X883’, ‘330X589’, ‘210X207’, ‘440AX77’, ‘312X66’, ‘210X192’, ‘330X1190’, ‘310X33’, ‘312X555’, ‘310X31’, ‘310X30’, ‘440AX95’, ‘210X6’, ‘210X8’, ‘330X102’, ‘340X161’, ‘312X57’, ‘310X153’, ‘330X590’, ‘210X89’, ‘330X1129’, ‘210X164’, ‘312X777’, ‘210X188’, ‘330X1146’, ‘310X119’, ‘750X640’, ‘311X7’, ‘312X144’, ‘310X43’, ‘312X782’, ‘210X174’, ‘420X4’, ‘210X229’, ‘312X785’, ‘312X789’]
2.缺失值的处理?
- 使用任意数值填充
- 使用平均值填充
3.维数降维?
4.模型的选择?
- Random Forest
- GBDT(Gradient Boosting Decision Tree)
这里记录下GBDT的发展过程: Regression Decision Tree -> Boosting Decision Tree -> Gradient Boosting Decision Tree,GBDT利用加法模型和前向分步法实现学习的优化过程.GBDT是一个基于迭代累加的决策树算法,它通过构造一组弱的学习器(树),并把多颗决策树的结果累加起来作为最终的预测输出。 缺点:1) 计算复杂度高 2) 不适合高维稀疏特征
3.Xgboost
xgboost是boosting Tree的一个很牛的实现:
- 显示地把树模型复杂度作为正则项加到优化目标中
- 公式推导中用到了二阶导数,用了二阶泰勒展开
- 实现了分裂点寻找近似算法
- 利用了特征的稀疏性
- 并行计算
xgboost的训练速度远远快于传统的GBDT,10倍量级.
总结:重新选用xgboost模型,参数如下,mse为0.0320532717482
123456789101112 params={'booster':'gbtree','objective': 'reg:linear','eval_metric': 'rmse','max_depth':4,'lambda':6,'subsample':0.75,'colsample_bytree':1,'min_child_weight':1,'eta': 0.04,'seed':0,'nthread':8,'silent':0}
5.实践过程
1.特征选择过程:去除全为Nan的列,去除Nan值个数大于200的列,去除object列,去除重复的列,选择Pearson相关系数>0.2的列,最后共得到5600多维特征.
这一步很粗糙,改进:
- 1) 加入object的列
- 2)特征维数继续筛减:可以试一下PCA降维
- 3)时间列属性的加入
2.模型的选择:单模型线性回归线下mse:0.0388左右,而线上为0.0446.之前用随机森林回归预测,线下0.0297,而线上0.0493.从这个现象结合线下数据只有500条是否可以得出线下和线上数据并不是分布相同,或者说差异较大,而且线下训练可能存在过拟合.2017.12.24将三种回归模型加权平均融合提交结果.