算法实践1.2模型构建之集成模型

任务说明

在上次实验中，使用了逻辑回归、SVM和决策树三个模型，本次实验的目的是使用更高级的模型，并添加更多评价指标。本次实验会构建随机森林、GBDT、XGBoost和LightGBM这4个模型，并对每一个模型进行评分。

实验过程

1.导入需要用到的包

import pandas as pd 
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
import lightgbm as lgb
from xgboost import XGBClassifier
import matplotlib.pyplot as plt
from sklearn import metrics

2.读入数据，划分训练集和测试集，跟上次一样3/7分。

data = pd.read_csv('./data_all.csv', engine='python')

y = data['status']
X = data.drop(['status'], axis=1)
print('The shape of X: ', X.shape)
print('proportion of label 1: ', len(y[y == 1])/len(y))

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=2018)
print('For train, proportion of label 1: ', len(y_train[y_train == 1])/len(y_train))
print('For test, proportion of label 1: ', len(y_test[y_test == 1])/len(y_train))

为了更好地看到数据的分布，这里直接计算类别1所占的比例。输出结果如下：

The shape of X:  (4754, 84)
proportion of label 1:  0.2509465713083719
For train, proportion of label 1:  0.25067628494138866
For test, proportion of label 1:  0.2515767344078486

3.构建四个模型并评估：随机森林、GBDT、XGBoost、LightGBM。

为了更详细地了解各个模型的参数，这里使用了官方文档的所有默认参数，后面会对其进行详细解释。（官方文档地址可以看文末的参考资料）

rf_model = RandomForestClassifier(n_estimators='warn', criterion='gini', max_depth=None, min_samples_split=2,
                                  min_samples_leaf=1, min_weight_fraction_leaf=0.0, max_features='auto',
                                  max_leaf_nodes=None, min_impurity_decrease=0.0, min_impurity_split=None,
                                  bootstrap=True, oob_score=False, n_jobs=None, random_state=None, verbose=0,
                                  warm_start=False, class_weight=None)

gbdt_model = GradientBoostingClassifier(loss='deviance', learning_rate=0.1, n_estimators=100, subsample=1.0,
                                        criterion='friedman_mse', min_samples_split=2, min_samples_leaf=1,
                                        min_weight_fraction_leaf=0.0, max_depth=3, min_impurity_decrease=0.0,
                                        min_impurity_split=None, init=None, random_state=None, max_features=None,
                                        verbose=0, max_leaf_nodes=None, warm_start=False, presort='auto',
                                        validation_fraction=0.1, n_iter_no_change=None, tol=0.0001)

xg_model = XGBClassifier(max_depth=3, learning_rate=0.1, n_estimators=100, silent=True, objective='binary:logistic',
                         booster='gbtree', n_jobs=1, nthread=None, gamma=0, min_child_weight=1, max_delta_step=0,
                         subsample=1, colsample_bytree=1, colsample_bylevel=1, reg_alpha=0, reg_lambda=1,
                         scale_pos_weight=1, base_score=0.5, random_state=0, seed=None, missing=None)

lgb_model = lgb.LGBMClassifier(boosting_type='gbdt', num_leaves=31, max_depth=-1, learning_rate=0.1, n_estimators=100,
                               subsample_for_bin=200000, objective=None, class_weight=None, min_split_gain=0.0,
                               min_child_weight=0.001, min_child_samples=20, subsample=1.0, subsample_freq=0,
                               colsample_bytree=1.0, reg_alpha=0.0, reg_lambda=0.0, random_state=None, n_jobs=-1,
                               silent=True, importance_type='split')

因为模型的拟合、评估步骤基本上都是一样的，所以可以用一个字典存储模型，用for循环来对每个模型评估。并且建立DataFrame存储模型结果，便于最终输出比较。代码和上一次实验的代码差不多，就不再详细解释了：

models = {'RF': rf_model,
          'GBDT': gbdt_model,
          'XGBoost': xg_model,
          'LightGBM': lgb_model}

df_result = pd.DataFrame(columns=('Model', 'Accuracy', 'Precision', 'Recall', 'F1 score', 'AUC'))
row = 0
for name, clf in models.items():
    clf.fit(X_train, y_train)
    y_test_pred = clf.predict(X_test)

    acc = metrics.accuracy_score(y_test, y_test_pred)
    p = metrics.precision_score(y_test, y_test_pred)
    r = metrics.recall_score(y_test, y_test_pred)
    f1 = metrics.f1_score(y_test, y_test_pred)

    y_test_proba = clf.predict_proba(X_test)
    fpr, tpr, thresholds = metrics.roc_curve(y_test, y_test_proba[:, 1])
    auc = metrics.auc(fpr, tpr)

    df_result.loc[row] = [name, acc, p, r, f1, auc]
    print(df_result.loc[row])
    row += 1

为了更直观地看到模型效果，此次实验中加入了对模型的ROC曲线的刻画：

plt.figure()
lw = 2
# 模型的ROC曲线
plt.plot(fpr, tpr, color='darkorange', lw=lw, label='ROC curve (area = %0.2f)' % auc)
# 画对角线
plt.plot([0, 1], [0, 1], color='navy', lw=lw, linestyle='--')
# 固定横轴和纵轴的范围
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver operating characteristic of '+name)
plt.legend(loc="lower right")
plt.show()

4.模型结果展示。

随机森林：

Model              RF
Accuracy     0.775753
Precision    0.646617
Recall       0.239554
F1 score     0.349593
AUC          0.710814

GBDT：

Model            GBDT
Accuracy     0.779257
Precision    0.605769
Recall       0.350975
F1 score     0.444444
AUC          0.762629

XGBoost：

Model         XGBoost
Accuracy     0.785564
Precision    0.630542
Recall       0.356546
F1 score     0.455516
AUC          0.771363

LightGBM：

Model        LightGBM
Accuracy     0.770147
Precision    0.570136
Recall       0.350975
F1 score     0.434483
AUC          0.757402

各个模型的效果比较：

      Model  Accuracy  Precision    Recall  F1 score       AUC
0        RF  0.775753   0.646617  0.239554  0.349593  0.710814
1      GBDT  0.779257   0.605769  0.350975  0.444444  0.762629
2   XGBoost  0.785564   0.630542  0.356546  0.455516  0.771363
3  LightGBM  0.770147   0.570136  0.350975  0.434483  0.757402

这几个模型都是基于树建立的，后三个模型又比随机森林更加复杂一些。可以发现，不管看哪个指标，XGBoost的效果几乎都是最好的。XGBoost出现之后，在各种比赛中被广泛运用还是有它的道理的。不过目前LightGBM似乎比XGBoost受到人们更多的追捧，当然这也不能代表两个模型哪个一定更好，都是有其适用场景的。还有调参也是个“技术活儿”啊ε=(´ο｀*)))

最后放一个各个模型ROC曲线的直观对比图：

5. 模型参数解释说明

首先是随机森林：

RandomForestClassifier(n_estimators='warn', criterion='gini', max_depth=None, min_samples_split=2,
                                  min_samples_leaf=1, min_weight_fraction_leaf=0.0, max_features='auto',
                                  max_leaf_nodes=None, min_impurity_decrease=0.0, min_impurity_split=None,
                                  bootstrap=True, oob_score=False, n_jobs=None, random_state=None, verbose=0,
                                  warm_start=False, class_weight=None)

n_estimators:integer, optional (default=10) 森林中树的个数。 criterion:string, optional (default=”gini”) 评估划分方式的函数，支持基尼指数“gini”和信息增益“entropy”。 max_depth:integer or None, optional (default=None) 树的最大深度。如果是None，节点会扩展直到所有叶子节点都是纯的或者所有叶子节点包含的样本少于min_samples_split个为止。 min_samples_split:int, float, optional (default=2) 内部节点再划分所需最小样本数 min_samples_leaf:int, float, optional (default=1) 叶子节点含有的最少样本。 min_weight_fraction_leaf:float, optional (default=0.) 叶子节点最小的样本权重和。 max_features:int, float, string or None, optional (default=”auto”) 寻找最好划分时需要考虑的最大的特征数量或特征数量的比例。 max_leaf_nodes:int or None, optional (default=None) 最大叶子节点数 min_impurity_decrease:float, optional (default=0.) 节点划分最小的不纯度下降值 min_impurity_split:float, (default=1e-7) 节点划分最小不纯度 bootstrap:boolean, optional (default=True) 在建立树时是否用bootstrap采样方法。 oob_score:bool (default=False) 在评估泛化准确率时是否使用包外样本。 n_jobs:int or None, optional (default=None) 并行使用的job个数。 random_state:int, RandomState instance or None, optional (default=None) 随机器对象 verbose:int, optional (default=0) 是否显示任务进程 warm_start:bool, optional (default=False) 是否使用前一步的结果继续拟合。 class_weight:dict, list of dicts, “balanced”, “balanced_subsample” or None, optional (default=None) 给每个类指定权重，形式是：{class_label: weight}.

然后是GBDT的参数，有一些重复的参数就不再解释了：

gbdt_model = GradientBoostingClassifier(loss='deviance', learning_rate=0.1, n_estimators=100, subsample=1.0,
                                        criterion='friedman_mse', min_samples_split=2, min_samples_leaf=1,
                                        min_weight_fraction_leaf=0.0, max_depth=3, min_impurity_decrease=0.0,
                                        min_impurity_split=None, init=None, random_state=None, max_features=None,
                                        verbose=0, max_leaf_nodes=None, warm_start=False, presort='auto',
                                        validation_fraction=0.1, n_iter_no_change=None, tol=0.0001)

loss:{‘deviance’, ‘exponential’}, optional (default=’deviance’) 优化时使用的损失函数。对于输出概率值的分类来说，‘deviance’等价于logistic regression，对于损失‘exponential’梯度提升使用了AdaBoost算法。 learning_rate:float, optional (default=0.1) 学习速率 n_estimators:int (default=100) 学习器的最大迭代次数 subsample:float, optional (default=1.0) 子采样的比例 criterion:string, optional (default=”friedman_mse”) 评估划分质量的函数。选择有：“friedman_mse”，“mse”，“mae”。 init:estimator, optional 初始化的时候的弱学习器 presort:bool or ‘auto’, optional (default=’auto’) 是否对数据进行预排序，以加速划分 validation_fraction:float, optional, default 0.1 训练数据中抽出一部分作为早停的验证集，这个参数是抽出的比例。 n_iter_no_change:int, default None 当验证集分数在n_iter_no_change次迭代中没有提高时，停止训练。 tol:float, optional, default 1e-4 早停的容忍度。

XGBoost的参数说明：

xg_model = XGBClassifier(max_depth=3, learning_rate=0.1, n_estimators=100, silent=True, objective='binary:logistic',
                         booster='gbtree', n_jobs=1, nthread=None, gamma=0, min_child_weight=1, max_delta_step=0,
                         subsample=1, colsample_bytree=1, colsample_bylevel=1, reg_alpha=0, reg_lambda=1,
                         scale_pos_weight=1, base_score=0.5, random_state=0, seed=None, missing=None)

silent (boolean) – 是否输出运行的中间过程 objective (string or callable) – 目标函数 booster (string) – 使用哪个基分类器：gbtree, gblinear or dart. nthread (int) – 并行线程个数 gamma (float) – 惩罚项系数，指定节点分裂所需的最小损失函数下降值。 min_child_weight (int) – 孩子节点中最小的样本权重和。 max_delta_step (int) – Maximum delta step we allow each tree’s weight estimation to be. subsample (float) – 样本采样的比例 colsample_bytree (float) – 特征采样的比例 colsample_bylevel (float) – 对于每个划分在每个水平上的样本采样的比例 reg_alpha (float (xgb's alpha)) – L1正则化项前的系数 reg_lambda (float (xgb's lambda)) – L2正则化前的系数 scale_pos_weight (float) – 正样本的权重 base_score –所有实例的初始预测分数 seed (int) – 随机数种子（已丢弃这个参数） random_state (int) – 随机器的选择 missing (float, optional) – 数据中哪些值需要被视为缺失值 importance_type (string, default "gain") –特征重要性的评估类型 :“gain”, “weight”, “cover”, “total_gain” or “total_cover”.

LightGBM的参数解释：

lgb_model = lgb.LGBMClassifier(boosting_type='gbdt', num_leaves=31, max_depth=-1, learning_rate=0.1, n_estimators=100,
                               subsample_for_bin=200000, objective=None, class_weight=None, min_split_gain=0.0,
                               min_child_weight=0.001, min_child_samples=20, subsample=1.0, subsample_freq=0,
                               colsample_bytree=1.0, reg_alpha=0.0, reg_lambda=0.0, random_state=None, n_jobs=-1,
                               silent=True, importance_type='split')

boosting_type (string, optional (default='gbdt')) 包括‘gbdt’‘dart’‘goss’‘rf’ subsample_for_bin (int, optional (default=200000)) 创建bin的样本个数. objective (string, callable or None, optional (default=None)) 目标函数。选项包括‘regression’‘binary’‘multiclass’'lambdarank’ class_weight (dict, 'balanced' or None, optional (default=None)) 类别的权重 min_split_gain (float, optional (default=0.)) 最小分割增益 min_child_weight (float, optional (default=1e-3)) 叶子节点的最小权重和 min_child_samples (int, optional (default=20)) 叶子节点的最小样本数 subsample_freq (int, optional (default=0)) 采样的频率 importance_type (string, optional (default='split')) 评估特征重要性的类型。'split'代表特征在模型中被划分的次数，'gain'代表使用该特征划分的增益的总和。

参考资料：

1.随机森林： 3.2.4.3.1. sklearn.ensemble.RandomForestClassifier - scikit-learn 0.20.2 documentation

2.GBDT： 3.2.4.3.5. sklearn.ensemble.GradientBoostingClassifier - scikit-learn 0.20.2 documentation

3.XGBoost： Python API Reference

4.LightGBM： Python API - LightGBM documentation

5.ROC曲线：Receiver Operating Characteristic (ROC)

Previous算法实践1.1模型构建 Next算法实践2.1数据预处理

Last updated 3 years ago

Was this helpful?