算法实践2.1数据预处理

任务：数据预处理

对数据进行探索和分析：

数据类型的分析
无关特征删除
数据类型转换
缺失值处理
...以及你能想到和借鉴的数据分析处理

1.查看数据

首先读入数据，查看数据的维度，并且输出5行感受一下~

import pandas as pd
data = pd.read_csv('./data_all.csv', engine='python')
print(data.shape)
data.head()

输出：

(4754, 85)

low_volume_percent

middle_volume_percent

take_amount_in_later_12_month_highest

trans_amount_increase_rate_lately

trans_activity_month

trans_activity_day

transd_mcc

trans_days_interval_filter

trans_days_interval

regional_mobility

...

consfin_product_count

consfin_max_limit

consfin_avg_limit

latest_query_day

loans_latest_day

reg_preference_for_trad

latest_query_time_month

latest_query_time_weekday

loans_latest_time_month

loans_latest_time_weekday

0.01

0.99

0.90

0.55

0.313

17.0

27.0

26.0

3.0

...

2.0

1200.0

12.0

18.0

4.0

2.0

4.0

3.0

0.02

0.94

2000

1.28

1.00

0.458

19.0

30.0

14.0

4.0

...

6.0

22800.0

9360.0

4.0

2.0

5.0

3.0

5.0

0.04

0.96

1.00

0.114

13.0

68.0

22.0

1.0

...

1.0

4200.0

2.0

6.0

5.0

1.0

0.00

0.96

2000

0.13

0.57

0.777

22.0

14.0

6.0

3.0

...

5.0

30000.0

12180.0

2.0

4.0

5.0

3.0

0.01

0.99

0.46

1.00

0.175

13.0

66.0

42.0

1.0

...

2.0

8400.0

8250.0

22.0

120.0

4.0

6.0

1.0

6.0

5 rows × 85 columns

可见一共是4754行，85列。所有的列似乎都是数值型的，为了更确定，我们用info()函数看看数据类型：

data.info()

输出：

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4754 entries, 0 to 4753
Data columns (total 85 columns):
low_volume_percent                            4754 non-null float64
middle_volume_percent                         4754 non-null float64
take_amount_in_later_12_month_highest         4754 non-null int64
trans_amount_increase_rate_lately             4754 non-null float64
trans_activity_month                          4754 non-null float64
trans_activity_day                            4754 non-null float64
transd_mcc                                    4754 non-null float64
trans_days_interval_filter                    4754 non-null float64
trans_days_interval                           4754 non-null float64
regional_mobility                             4754 non-null float64
repayment_capability                          4754 non-null int64
is_high_user                                  4754 non-null int64
number_of_trans_from_2011                     4754 non-null float64
first_transaction_time                        4754 non-null float64
historical_trans_amount                       4754 non-null int64
historical_trans_day                          4754 non-null float64
rank_trad_1_month                             4754 non-null float64
trans_amount_3_month                          4754 non-null int64
avg_consume_less_12_valid_month               4754 non-null float64
abs                                           4754 non-null int64
top_trans_count_last_1_month                  4754 non-null float64
avg_price_last_12_month                       4754 non-null int64
avg_price_top_last_12_valid_month             4754 non-null float64
trans_top_time_last_1_month                   4754 non-null float64
trans_top_time_last_6_month                   4754 non-null float64
consume_top_time_last_1_month                 4754 non-null float64
consume_top_time_last_6_month                 4754 non-null float64
cross_consume_count_last_1_month              4754 non-null float64
trans_fail_top_count_enum_last_1_month        4754 non-null float64
trans_fail_top_count_enum_last_6_month        4754 non-null float64
trans_fail_top_count_enum_last_12_month       4754 non-null float64
consume_mini_time_last_1_month                4754 non-null float64
max_cumulative_consume_later_1_month          4754 non-null int64
max_consume_count_later_6_month               4754 non-null float64
railway_consume_count_last_12_month           4754 non-null float64
pawns_auctions_trusts_consume_last_1_month    4754 non-null int64
pawns_auctions_trusts_consume_last_6_month    4754 non-null int64
jewelry_consume_count_last_6_month            4754 non-null float64
status                                        4754 non-null int64
first_transaction_day                         4754 non-null float64
trans_day_last_12_month                       4754 non-null float64
apply_score                                   4754 non-null float64
apply_credibility                             4754 non-null float64
query_org_count                               4754 non-null float64
query_finance_count                           4754 non-null float64
query_cash_count                              4754 non-null float64
query_sum_count                               4754 non-null float64
latest_one_month_apply                        4754 non-null float64
latest_three_month_apply                      4754 non-null float64
latest_six_month_apply                        4754 non-null float64
loans_score                                   4754 non-null float64
loans_credibility_behavior                    4754 non-null float64
loans_count                                   4754 non-null float64
loans_settle_count                            4754 non-null float64
loans_overdue_count                           4754 non-null float64
loans_org_count_behavior                      4754 non-null float64
consfin_org_count_behavior                    4754 non-null float64
loans_cash_count                              4754 non-null float64
latest_one_month_loan                         4754 non-null float64
latest_three_month_loan                       4754 non-null float64
latest_six_month_loan                         4754 non-null float64
history_suc_fee                               4754 non-null float64
history_fail_fee                              4754 non-null float64
latest_one_month_suc                          4754 non-null float64
latest_one_month_fail                         4754 non-null float64
loans_long_time                               4754 non-null float64
loans_credit_limit                            4754 non-null float64
loans_credibility_limit                       4754 non-null float64
loans_org_count_current                       4754 non-null float64
loans_product_count                           4754 non-null float64
loans_max_limit                               4754 non-null float64
loans_avg_limit                               4754 non-null float64
consfin_credit_limit                          4754 non-null float64
consfin_credibility                           4754 non-null float64
consfin_org_count_current                     4754 non-null float64
consfin_product_count                         4754 non-null float64
consfin_max_limit                             4754 non-null float64
consfin_avg_limit                             4754 non-null float64
latest_query_day                              4754 non-null float64
loans_latest_day                              4754 non-null float64
reg_preference_for_trad                       4754 non-null int64
latest_query_time_month                       4754 non-null float64
latest_query_time_weekday                     4754 non-null float64
loans_latest_time_month                       4754 non-null float64
loans_latest_time_weekday                     4754 non-null float64
dtypes: float64(73), int64(12)
memory usage: 3.1 MB

85列属性中，有73列是float型，12列是int型，并没有字符串型的属性，所以无需做额外处理。

而且可以看到每个特征都没有缺失值。也可以用另一个函数查看数据中缺失值的个数：

data.isnull().sum(axis=0)

输出：

low_volume_percent                        0
middle_volume_percent                     0
take_amount_in_later_12_month_highest     0
trans_amount_increase_rate_lately         0
trans_activity_month                      0
trans_activity_day                        0
transd_mcc                                0
trans_days_interval_filter                0
trans_days_interval                       0
regional_mobility                         0
repayment_capability                      0
is_high_user                              0
number_of_trans_from_2011                 0
first_transaction_time                    0
historical_trans_amount                   0
historical_trans_day                      0
rank_trad_1_month                         0
trans_amount_3_month                      0
avg_consume_less_12_valid_month           0
abs                                       0
top_trans_count_last_1_month              0
avg_price_last_12_month                   0
avg_price_top_last_12_valid_month         0
trans_top_time_last_1_month               0
trans_top_time_last_6_month               0
consume_top_time_last_1_month             0
consume_top_time_last_6_month             0
cross_consume_count_last_1_month          0
trans_fail_top_count_enum_last_1_month    0
trans_fail_top_count_enum_last_6_month    0
                                         ..
loans_org_count_behavior                  0
consfin_org_count_behavior                0
loans_cash_count                          0
latest_one_month_loan                     0
latest_three_month_loan                   0
latest_six_month_loan                     0
history_suc_fee                           0
history_fail_fee                          0
latest_one_month_suc                      0
latest_one_month_fail                     0
loans_long_time                           0
loans_credit_limit                        0
loans_credibility_limit                   0
loans_org_count_current                   0
loans_product_count                       0
loans_max_limit                           0
loans_avg_limit                           0
consfin_credit_limit                      0
consfin_credibility                       0
consfin_org_count_current                 0
consfin_product_count                     0
consfin_max_limit                         0
consfin_avg_limit                         0
latest_query_day                          0
loans_latest_day                          0
reg_preference_for_trad                   0
latest_query_time_month                   0
latest_query_time_weekday                 0
loans_latest_time_month                   0
loans_latest_time_weekday                 0
Length: 85, dtype: int64

由于属性太多，这里没有显示全，可以将上述结果存起来，然后只输出有缺失值的列：

data_isnull = data.isnull().sum(axis=0)
data_isnull[data_isnull != 0]

输出：

Series([], dtype: int64)

series为空，说明所有列都是无缺失值的。

2.属性转换

虽然读取的数据中所有属性的数据类型都是数值型的，但实际上可能存在一些类别型属性，比如status是0和1两个类别，但是pandas会将其处理成数值的0和1。需要对这种情况进行额外处理。

对于这种类型的数据有两种处理方式，一种是one-hot encoding，把属性的每个取值对应成类似于00001这种形式。比如status有两个取值0和1，那么就可以编码成01和10。sklearn中有相应的函数，也可以用pandas的get_dummies(df)函数。另一种方式是label encoding，只要给每个属性的取值一个唯一的数字即可，比如某个属性是weekday，可以取周一到周日七个值，用label encoding就是0-7这七个数字即可。sklearn中也有相应的函数。

下面我们来看看这个数据中有没有需要转换的属性。

一般这种属性都是int型的，所以我们只看int型的那些属性都有多少个取值。在第一步中我们看到数据类型就是两种：float64和int64，可以用select_dtypes函数选出那些int64的列：

import numpy as np
column = data.select_dtypes(include=[np.int64]).columns
print(column)
for col in column:
    print(col)
    print(data[col].unique().shape)

输出：

Index(['take_amount_in_later_12_month_highest', 'repayment_capability',
       'is_high_user', 'historical_trans_amount', 'trans_amount_3_month',
       'abs', 'avg_price_last_12_month',
       'max_cumulative_consume_later_1_month',
       'pawns_auctions_trusts_consume_last_1_month',
       'pawns_auctions_trusts_consume_last_6_month', 'status',
       'reg_preference_for_trad'],
      dtype='object')
take_amount_in_later_12_month_highest
(166,)
repayment_capability
(2390,)
is_high_user
(2,)
historical_trans_amount
(4524,)
trans_amount_3_month
(3524,)
abs
(1697,)
avg_price_last_12_month
(330,)
max_cumulative_consume_later_1_month
(863,)
pawns_auctions_trusts_consume_last_1_month
(572,)
pawns_auctions_trusts_consume_last_6_month
(2730,)
status
(2,)
reg_preference_for_trad
(5,)

根据属性的取值个数和属性名称大致可以判定该属性是否需要转换。看起来is_high_user和reg_preference_for_trad可能需要转换。

is_high_user只有两个取值，用value_counts函数看看每个取值有多少个：

data['is_high_user'].value_counts()

输出：

0    4701
1      53
Name: is_high_user, dtype: int64

is_high_user取值是0和1，只有两个取值的属性其实也无需转换。

再来看看另一个属性reg_preference_for_trad：

0    3405
1    1064
2     150
3     131
4       4
Name: reg_preference_for_trad, dtype: int64

reg_preference_for_trad有5个取值，如果要用one-hot encoding方式就要转换，如果用label encoding方式就无需转换。为了展示one-hot encoding的用法，这里我们用one-hot方式。

代码如下：

regcols = ['reg_preference_for_trad_' + str(i) for i in range(5)]
print(regcols)

tmpdf = pd.get_dummies(data['reg_preference_for_trad'].replace('nan', np.nan))
tmpdf.columns = regcols
data[regcols] = tmpdf

这样就多出了5列，分别代表0-4这4种取值。

['reg_preference_for_trad_0', 'reg_preference_for_trad_1', 'reg_preference_for_trad_2', 'reg_preference_for_trad_3', 'reg_preference_for_trad_4']

3.剔除不相关特征

由于属性列太多，不方便一一解释说明（况且我也不知道每个字段的含义ε=(´ο｀*)))）。属性列这么多，是不是都有用呢？这就跟我们的目标有关了。从之前的实验可以知道，我们希望预测用户贷款是否会逾期（对应的字段就是status），所以可以看每个属性跟status是否有关。

在这之前需要将X和y提取出来：

y = data['status']
X = data.drop(['status'], axis=1)

sklearn提供了特征选择的模块feature_selection，里面包含了一些函数，这里使用的是特征方差提取方式，默认方差为0的特征会删除。

from sklearn.feature_selection import VarianceThreshold
selector = VarianceThreshold()
print(X.shape)
X = selector.fit_transform(X)
print(X.shape)

输出：

(4754, 89)
(4754, 89)

没有剔除任何特征。。说明这里面每个特征多少还是有点用处，只是用处大小的不同罢了。

可以输出每个特征的方差看看：

selector.variances_

array([1.72347223e-03, 2.09733428e-02, 1.53943134e+07, 4.81481182e+05,
       3.87692950e-02, 2.89599675e-02, 2.00186059e+01, 5.15710959e+02,
       2.71270743e+02, 7.92284964e-01, 2.72612789e+09, 1.10242173e-02,
       1.01111554e+02, 2.19082523e+08, 1.02694235e+11, 9.93174477e+03,
       6.96454281e-02, 1.03500977e+10, 1.93296547e+00, 7.29256913e+08,
       1.22878638e-01, 5.86439064e+05, 9.86150936e-03, 2.83157843e+01,
       1.67784749e+02, 2.97955539e+01, 1.72043674e+02, 5.03120265e+00,
       3.64002171e+00, 1.98057119e+01, 2.25829209e+01, 1.39411696e+11,
       1.16906146e+08, 3.22683811e+01, 2.28337441e-01, 4.37714017e+07,
       7.94572765e+08, 4.06031438e-02, 2.88337722e+05, 3.63453287e+02,
       2.51783949e+03, 1.63248201e+01, 4.78827044e+01, 1.36140674e+01,
       6.51328333e+00, 1.22335903e+02, 2.02883578e+01, 5.63612962e+01,
       8.29306545e+01, 3.62433789e+03, 4.68725564e+00, 5.84720597e+02,
       4.51090812e+02, 9.62979328e+00, 5.40022384e+01, 8.46934738e+00,
       2.86055044e+01, 2.15115553e+00, 1.16604483e+01, 1.15677503e+02,
       8.68487650e+02, 6.08384690e+02, 3.63337054e+00, 1.43106069e+01,
       1.23543567e+03, 4.71113041e+05, 1.10618434e+02, 2.86055044e+01,
       3.29804734e+01, 2.10664740e+06, 5.13126727e+05, 5.58729026e+07,
       1.98297162e+02, 8.46934738e+00, 1.11852954e+01, 1.92755652e+08,
       3.39896154e+07, 1.33199490e+03, 2.83489836e+03, 4.72394037e-01,
       1.77858966e+00, 3.73234062e+00, 8.92465960e+00, 3.59356702e+00,
       2.03240714e-01, 1.73719927e-01, 3.05568245e-02, 2.67964236e-02,
       8.40688770e-04])

有些特征的方差很大，有些特征的方差很小，其实看看特征的范围就大概能理解这一点了，方差大的特征大多是因为取值范围大，也不能完全靠这个说明特征与标签的相关性大小。可以设置阈值剔除一些不重要的特征，这里就不剔除了。

4.模型应用

根据之前的实验，选出一个模型看看效果即可。这里选择用虽然简单但效果好的逻辑回归模型。

def get_metric(clf, X, y_true):
    y_pred = clf.predict(X)
    y_proba = clf.predict_proba(X)

    acc = metrics.accuracy_score(y_true, y_pred)
    p = metrics.precision_score(y_true, y_pred)
    r = metrics.recall_score(y_true, y_pred)
    f1 = metrics.f1_score(y_true, y_pred)
    fpr, tpr, thresholds = metrics.roc_curve(y_true, y_proba[:, 1])
    auc = metrics.auc(fpr, tpr)
    return acc, p, r, f1, fpr, tpr, auc


from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegressionCV
import matplotlib.pyplot as plt
from sklearn import metrics
from sklearn.preprocessing import StandardScaler

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=2018)

scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.fit_transform(X_test)

clf = LogisticRegressionCV(class_weight='balanced', max_iter=1000)

clf.fit(X_train, y_train)
acc, p, r, f1, fpr_train, tpr_train, auc_train = get_metric(clf, X_train, y_train)
print("train accuracy:{:.2%}, precision:{:.2%}, recall:{:.2%}, F1:{:.2}".format(acc, p, r, f1))
acc, p, r, f1, fpr_test, tpr_test, auc_test = get_metric(clf, X_test, y_test)
print("test accuracy:{:.2%}, precision:{:.2%}, recall:{:.2%}, F1:{:.2}".format(acc, p, r, f1))

plt.figure()
lw = 2
plt.plot(fpr_train, tpr_train, color='darkorange', lw=lw, label='train (AUC:%0.2f)' % auc_train)
plt.plot(fpr_test, tpr_test, color='cornflowerblue', lw=lw, label='test (AUC:%0.2f)' % auc_test)
plt.plot([0, 1], [0, 1], color='navy', lw=lw, linestyle='--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver operating characteristic of LR')
plt.legend(loc="lower right")
plt.show()

输出：

train accuracy:76.28%, precision:52.01%, recall:69.78%, F1:0.6
test accuracy:72.32%, precision:46.51%, recall:66.85%, F1:0.55

Previous算法实践1.2模型构建之集成模型 Next李宏毅机器学习

Last updated 3 years ago

Was this helpful?