逻辑回归：信贷违规预测！

相关文章推荐
逼格高的抽屉 · 使用python处理百万条数据分享 - ...· 2 年前 ·
鼻子大的红金鱼 · html2canvas截图后引入的字体样式丢 ...· 2 年前 ·
爱旅游的手套 · python读取和解析含有多个json对象的 ...· 2 年前 ·
果断的小虾米 · 将字节转换成浮点数？· 2 年前 ·
大家好，我是Peter~
今天给大家带来一篇新的kaggle数据建模文章： 信贷违约预测
本文是基于一份 Lendingclub 上面的信贷数据，根据贷款人的历史数据和贷款信息，建立一个 逻辑回归 模型来预测借款人是否会偿还他们的贷款。
导图显示本文的主要工作，其中 缺失值处理 和 特征工程 是重点。
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import missingno as ms
import plotly.express as px
import plotly.graph_objs as go
import plotly.figure_factory as ff
from plotly.subplots import make_subplots
import plotly.offline as pyo
pyo.init_notebook_mode()
sns.set_style('darkgrid')
from sklearn.decomposition import PCA
from imblearn.over_sampling import SMOTE
from sklearn.model_selection import train_test_split,cross_val_score
from sklearn.ensemble import RandomForestClassifier,AdaBoostClassifier
from sklearn.svm import SVC
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import f1_score as f1
from sklearn.metrics import confusion_matrix
plt.rc('figure',figsize=(18,9))
import warnings
warnings.filterwarnings("ignore")
数据基本信息
数据来自 LendingClub，它是美国一家个人对个人贷款公司，总部位于加利福尼亚州旧金山，号称是全球最大的个人对个人(p2p)借贷平台。
查看数据量、字段类型等：
缺失值处理
缺失值处理是本次数据处理的一项重要工作。由于存在字符型和数值型的变量，而且这两种类型下的数据都存在缺失值，所以在分开处理。
缺失值可视化
In [6]:
# 4、缺失值情况
df.isnull().sum().sort_values(ascending=False)
Out[6]:
id                     105451
hardship_start_date    105451
member_id              105451
hardship_type          105451
hardship_reason        105451
open_acc_6m                 0
tot_cur_bal                 0
tot_coll_amt                0
acc_now_delinq              0
max_bal_bc                  0
Length: 137, dtype: int64
In [7]:
ms.bar(df,color="blue")
plt.show()
如果取值是1，说明是完整的字段，没有缺失值；可以看到还是存在很多字段有缺失值。
删除全部缺失字段
1、部分字段是全部缺失的，将它们进行删除：
In [8]:
# 方法1：使用dropna的how参数
# data.dropna(axis=1, how='all')
In [9]:
# 在这里我们使用下面的【方法2】
# 1、查看每个字段的缺失值比例
isnull_col_percent = df.isnull().sum(axis=0).sort_values(ascending=False) / float(len(df)) 
isnull_col_percent
Out[9]:
id                     1.0
hardship_start_date    1.0
member_id              1.0
hardship_type          1.0
hardship_reason        1.0
open_acc_6m            0.0
tot_cur_bal            0.0
tot_coll_amt           0.0
acc_now_delinq         0.0
max_bal_bc             0.0
Length: 137, dtype: float64
In [10]:
isnull_col_percent[:20]  # 前20个
Out[10]:
id                                            1.000000
hardship_start_date                           1.000000
member_id                                     1.000000
hardship_type                                 1.000000
hardship_reason                               1.000000
hardship_status                               1.000000
deferral_term                                 1.000000
hardship_amount                               1.000000
hardship_end_date                             1.000000
url                                           1.000000
payment_plan_start_date                       1.000000
hardship_length                               1.000000
hardship_dpd                                  1.000000
hardship_loan_status                          1.000000
orig_projected_additional_accrued_interest    1.000000
hardship_payoff_balance_amount                1.000000
desc                                          1.000000
hardship_last_payment_amount                  1.000000
sec_app_mths_since_last_major_derog           0.977866
sec_app_revol_util                            0.936188
dtype: float64
在上面的结果中，比例为1的字段就是全部缺失的（前面18个字段），对于建模无用，我们将其删除：
In [11]:
no_use_col = df.columns[df.isnull().all()].tolist()
no_use_col
Out[11]:
['id', 'member_id', 'url', 'desc', 'hardship_type', 'hardship_reason', 'hardship_status', 'deferral_term', 'hardship_amount', 'hardship_start_date', 'hardship_end_date', 'payment_plan_start_date', 'hardship_length', 'hardship_dpd', 'hardship_loan_status', 'orig_projected_additional_accrued_interest', 'hardship_payoff_balance_amount', 'hardship_last_payment_amount']
In [12]:
df.drop(no_use_col,axis=1,inplace=True)
In [13]:
df.shape  # 137-18 = 119，成功删除18个属性
Out[13]:
(105451, 119)
删除缺失值在30%以上的字段
2、有些字段是部分缺失值；缺失量过大也会对建模造成影响。
在这里我们设置删除缺失值在30%（可以设定其他值）以上的字段（按照列的方向来统计）
In [14]:
# 最大的缺失值比例高达97.78%
df.isnull().sum(axis=0).sort_values(ascending=False) / float(len(df))
Out[14]:
sec_app_mths_since_last_major_derog    0.977866
sec_app_revol_util                     0.936188
revol_bal_joint                        0.935458
sec_app_earliest_cr_line               0.935458
sec_app_inq_last_6mths                 0.935458
total_rec_int                          0.000000
total_rec_prncp                        0.000000
total_pymnt_inv                        0.000000
total_pymnt                            0.000000
hardship_flag                          0.000000
Length: 119, dtype: float64
In [15]:
thresh = len(df)*0.3 # 阀值（缺失值数量）
# 某列数据缺失的数量超过阀值就被删除
df.dropna(thresh=thresh, axis=1, inplace=True) 
In [16]:
df.shape
Out[16]:
(105451, 102)
In [17]:
df.isnull().sum(axis=0).sort_values(ascending=False) / float(len(df))
Out[17]:
mths_since_recent_revol_delinq    0.648405
mths_since_last_delinq            0.484765
il_util                           0.126884
mths_since_recent_inq             0.113313
emp_title                         0.064314
total_pymnt_inv                   0.000000
total_pymnt                       0.000000
out_prncp_inv                     0.000000
out_prncp                         0.000000
hardship_flag                     0.000000
Length: 102, dtype: float64
新数据（删除缺失值）
新数据缺失值可视化
In [18]:
ms.bar(df,color="blue")
plt.show()
和之前的对比，发现缺失值的字段少了很多
处理后的数据量：只剩下102个字段
In [19]:
df.shape
Out[19]:
(105451, 102)
字段类型统计
In [20]:
df.dtypes.value_counts()
Out[20]:
int64      54
float64    25
object     23
dtype: int64
唯一值字段
3、某些字段的取值是唯一的，比如全部是0，这样的字段对于建模也是无用的，我们要将这些字段进行删除。
首先找出这些字段：使用Pandas中的nunique函数来确定取值是否唯一。
In [21]:
nunique_data = df.apply(pd.Series.nunique) != 1   #  不唯一的字段
nunique_data
Out[21]:
loan_amnt                      True
funded_amnt                    True
funded_amnt_inv                True
term                           True
int_rate                       True
tot_hi_cred_lim                True
total_bal_ex_mort              True
total_bc_limit                 True
total_il_high_credit_limit     True
hardship_flag                 False
Length: 102, dtype: bool
In [22]:
# df1：删除唯一值字段后的数据
df1 = df.loc[:, nunique_data]
df1.shape
Out[22]:
(105451, 97)
字段从102变成了97，说明还是存在5个取值唯一的字段
In [23]:
# 再次查看字段类型
df1.dtypes.value_counts()
Out[23]:
int64      51
float64    25
object     21
dtype: int64
缺失值处理-分类型变量
针对分类型变量（取值为字符串类型）的缺失值处理，一般是采用类型编码、独热码、因子化等操作
In [24]:
object_df = df1.select_dtypes(include="object")
object_df.columns
Out[24]:
Index(['term', 'int_rate', 'grade', 'sub_grade', 'emp_title', 'emp_length',       'home_ownership', 'verification_status', 'issue_d', 'loan_status',       'purpose', 'title'




    
, 'zip_code', 'addr_state', 'earliest_cr_line',       'revol_util', 'initial_list_status', 'last_pymnt_d', 'next_pymnt_d',       'last_credit_pull_d', 'application_type'],
      dtype='object')
比如字段grade的取值是真实的Object类型：
In [25]:
# grade
df1["grade"].value_counts()
Out[25]:
C    36880
B    31301
A    17898
D    12015
E     4847
F     1725
G      785
Name: grade, dtype: int64
下面出现的字段虽然是object类型，但是实际上行它们的取值仍存在一定的大小关系
In [26]:
# int_rate
df1["int_rate"].value_counts()
Out[26]:
16.02%    4956
5.32%     4365
15.05%    4195
14.08%    4168
12.74%    4155
30.75%     119
30.49%     113
30.74%      98
21.49%      89
6.00%        4
Name: int_rate, Length: 65, dtype: int64
In [27]:
# revol_util
df1["revol_util"].value_counts()
Out[27]:
0%         468
60%        244
64%        230
50%        227
53%        220
116.20%      1
110.10%      1
106.10%      1
125%         1
113%         1
Name: revol_util, Length: 1076, dtype: int64
分类型变量转成数值
将百分比的数据转成浮点型：
In [28]:
# 将右侧的%去掉，同时强制转成float类型
df1["int_rate"] = df1["int_rate"].str.rstrip("%").astype("float")
df1["revol_util"] = df1["revol_util"].str.rstrip("%").astype("float")
关于年份的处理：转成数值，并且10+直接用10代替，<1的部分直接用0代替：
In [29]:
# emp_length
df1["emp_length"].value_counts()
Out[29]:
10+ years    35438
2 years       9914
< 1 year      9542
3 years       8495
1 year        7034
4 years       6453
5 years       6382
6 years       4526
7 years       3847
9 years       3828
8 years       3295
Name: emp_length, dtype: int64
In [30]:
df1["emp_length"].isnull().sum()  # 存在缺失值
Out[30]:
现将缺失值填充为<1 year：
In [31]:
df1["emp_length"] = df1["emp_length"].fillna("< 1 year")
In [32]:
# 1、先将year或years去掉
df1["emp_length"] = df1["emp_length"].apply(lambda x: x.split("y")[0].rstrip(" "))
In [33]:
df1["emp_length"].value_counts()
Out[33]:
10+    35438
< 1    16239
2       9914
3       8495
1       7034
4       6453
5       6382
6       4526
7       3847
9       3828
8       3295
Name: emp_length, dtype: int64
In [34]:
# 2、转成数值型
df1["emp_length"] = df1["emp_length"].apply(lambda x: "10" if x == "10+" else x)
df1["emp_length"] = df1["emp_length"].str.replace("< 1","0")
df1["emp_length"] = df1["emp_length"].astype("float")
In [35]:
df1["emp_length"].value_counts()
Out[35]:
10.0    35438
0.0     16239
2.0      9914
3.0      8495
1.0      7034
4.0      6453
5.0      6382
6.0      4526
7.0      3847
9.0      3828
8.0      3295
Name: emp_length, dtype: int64
分类型变量可视化
In [36]:
object_df = df1.select_dtypes(include="object")
object_df.columns
Out[36]:
Index(['term', 'grade', 'sub_grade', 'emp_title', 'home_ownership',       'verification_status', 'issue_d', 'loan_status', 'purpose', 'title',       'zip_code', 'addr_state', 'earliest_cr_line', 'initial_list_status',       'last_pymnt_d', 'next_pymnt_d', 'last_credit_pull_d',       'application_type'],
      dtype='object')
In [37]:
ms.matrix(object_df)
plt.show()
主要是两个字段缺失值：emp_title、next_pymnt_d
缺失值填充
In [38]:
object_df.isnull().sum()
Out[38]:
term                      0
grade                     0
sub_grade                 0
emp_title              6782
home_ownership            0
verification_status       0
issue_d                   0
loan_status               0
purpose                   0
title                     0
zip_code                  0
addr_state                0
earliest_cr_line          0
initial_list_status       0
last_pymnt_d            145
next_pymnt_d           3921
last_credit_pull_d        3
application_type          0
dtype: int64
在这里我们直接填充为“Unknown”：
In [39]:
object_df.fillna("Unknown",inplace=True)
再次查看发现已经没有缺失值：
In [40]:
缺失值处理——数值型变量
In [41]:
number_df = df1.select_dtypes(exclude="object")
number_df.columns
Out[41]:
Index(['loan_amnt', 'funded_amnt', 'funded_amnt_inv', 'int_rate',       'installment', 'emp_length', 'annual_inc', 'dti', 'delinq_2yrs',       'inq_last_6mths', 'mths_since_last_delinq', 'open_acc', 'pub_rec',       'revol_bal', 'revol_util', 'total_acc', 'out_prncp', 'out_prncp_inv',       'total_pymnt', 'total_pymnt_inv', 'total_rec_prncp', 'total_rec_int',       'total_rec_late_fee', 'last_pymnt_amnt', 'collections_12_mths_ex_med',       'acc_now_delinq', 'tot_coll_amt', 'tot_cur_bal', 'open_acc_6m',       'open_il_6m', 'open_il_12m', 'open_il_24m', 'mths_since_rcnt_il',       'total_bal_il', 'il_util', 'open_rv_12m', 'open_rv_24m', 'max_bal_bc',       'all_util', 'total_rev_hi_lim', 'inq_fi', 'total_cu_tl', 'inq_last_12m',       'acc_open_past_24mths', 'avg_cur_bal', 'bc_open_to_buy', 'bc_util',       'chargeoff_within_12_mths'




    
, 'delinq_amnt', 'mo_sin_old_il_acct',       'mo_sin_old_rev_tl_op', 'mo_sin_rcnt_rev_tl_op', 'mo_sin_rcnt_tl',       'mort_acc', 'mths_since_recent_bc', 'mths_since_recent_inq',       'mths_since_recent_revol_delinq', 'num_accts_ever_120_pd',       'num_actv_bc_tl', 'num_actv_rev_tl', 'num_bc_sats', 'num_bc_tl',       'num_il_tl', 'num_op_rev_tl', 'num_rev_accts', 'num_rev_tl_bal_gt_0',       'num_sats', 'num_tl_120dpd_2m', 'num_tl_30dpd', 'num_tl_90g_dpd_24m',       'num_tl_op_past_12m', 'pct_tl_nvr_dlq', 'percent_bc_gt_75',       'pub_rec_bankruptcies', 'tax_liens', 'tot_hi_cred_lim',       'total_bal_ex_mort', 'total_bc_limit', 'total_il_high_credit_limit'],
      dtype='object')
In [42]:
number_df.shape
Out[42]:
(105451, 79)
缺失值可视化
In [43]:
ms.matrix(number_df)
plt.show()
如何判断某行是否全部为空
查看缺失值情况
In [48]:
number_df.isnull().sum().sort_values(ascending=False)
Out[48]:
mths_since_recent_revol_delinq    68375
mths_since_last_delinq            51119
il_util                           13380
mths_since_recent_inq             11949
num_tl_120dpd_2m                   5278
tot_cur_bal                           0
tot_coll_amt                          0
acc_now_delinq                        0
collections_12_mths_ex_med            0
total_il_high_credit_limit            0
Length: 79, dtype: int64
找出存在缺失值的字段
In [49]:
# 存在缺失值的为True，否则为False
number_df.isnull().sum() > 0
Out[49]:
loan_amnt                     False
funded_amnt                   False
funded_amnt_inv               False
int_rate                      False
installment                   False
tax_liens                     False
tot_hi_cred_lim               False
total_bal_ex_mort             False
total_bc_limit                False
total_il_high_credit_limit    False
Length: 79, dtype: bool
In [50]:
number_df.columns[number_df.isnull().sum() > 0]
Out[50]:
Index(['dti', 'mths_since_last_delinq', 'revol_util', 'mths_since_rcnt_il',       'il_util', 'all_util', 'avg_cur_bal', 'bc_open_to_buy', 'bc_util',       'mo_sin_old_il_acct', 'mths_since_recent_bc', 'mths_since_recent_inq',       'mths_since_recent_revol_delinq', 'num_tl_120dpd_2m',       'percent_bc_gt_75'],
      dtype='object')
缺失值填充
对于数值型变量的缺失值，我们采用每列的均值来填充
In [51]:
# 方法1
for col in number_df.columns[number_df.isnull().sum() > 0]:
    mean_val = number_df[col].mean()  # 每列的均值
    number_df[col].fillna(mean_val, inplace=True)  # 均值填充
In [52]:
# 方法2
# 老版本报错
# from sklearn.preprocessing import Imputer
# 使用新版本
from sklearn.impute import SimpleImputer
imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
In [53]:
number_df.isnull().sum().sort_values(ascending=False)
Out[53]:
loan_amnt                         0
mo_sin_old_rev_tl_op              0
num_accts_ever_120_pd             0
mths_since_recent_revol_delinq    0
mths_since_recent_inq             0
acc_now_delinq                    0
collections_12_mths_ex_med        0
last_pymnt_amnt                   0
total_rec_late_fee                0
total_il_high_credit_limit        0
Length: 79, dtype: int64
新建模数据new_data
手动过滤字段
有些字段对建模是无用的，即存在冗余的字段信息，我们需要手动删除。
In [54]:
object_df.isnull().sum()
Out[54]:
term                   0
grade                  0
sub_grade              0
emp_title              0
home_ownership         0
verification_status    0
issue_d                0
loan_status            0
purpose                0
title                  0
zip_code               0
addr_state             0
earliest_cr_line       0
initial_list_status    0
last_pymnt_d           0
next_pymnt_d           0
last_credit_pull_d     0
application_type       0
dtype: int64
In [55]:
object_df["application_type"].value_counts()  
Out[55]:
INDIVIDUAL    98619
JOINT          6813
DIRECT_PAY       19
Name: application_type, dtype: int64
term：申请人的贷款偿还期数，通常是36或者60，对于建模没有指导意义，保留（1）
grade和sub_grade：存在重复信息，选择删除字段sub_gracde，保留grade(12)
emp_title：不能反映借款人的收入或者资产情况，可删除
home_ownership：房屋所在状态，保留(3)
verification_status：收入是否被LC验证、未验证或收入来源是否被验证，保留(4)
issue_d：贷款发行时间，对于是否违规预测也没有指导意义，可删除
loan_status：贷款的当前状态，保留（5）
purpose：贷款目的，保留（6）
title：和purpose重复，可删除
zip_code：地址邮编，对于建模无效
addr_state：申请人的地址所在州，建模无效
earliest_cr_line：借款人最早公布的信用额度开通的月份，对于是否违规模型搭建无意义，考虑删除
initial_list_status：贷款的初始上市状态，可谓W或者F，保留（7）
last_pymnt_d、next_pymnt_d、last_credit_pull_d：贷款后的信息，对于是否违规预测为意义，可删除
application_type：指示贷款是单独申请还是与两个共同借款人的联合申请，保留（8）
最终留下8个有用的字段信息：
In [56]:
object_df = object_df[["term","grade","home_ownership","verification_status","loan_status",
                       "purpose","initial_list_status","application_type"]]
object_df
将两个数据合并起来：
In [57]:
new_data = pd.concat([object_df,number_df],axis=1)
new_data.head()
In [58]:
new_data.dtypes
Out[58]:
term                          object
grade                         object
home_ownership                object
verification_status           object
loan_status                   object
tax_liens                      int64
tot_hi_cred_lim                int64
total_bal_ex_mort              int64
total_bc_limit                 int64
total_il_high_credit_limit     int64
Length: 87, dtype: object
In [59]:
new_data.shape
Out[59]:
(105451, 87)
描述统计信息
针对数值型字段的描述统计信息，查看最值，四分位数等：
In [60]:
new_data.describe()
In [61]:
installment / (annual_inc / 12)：表示每个月的还款金额占据月收入的比例，数值越大，还款压力越大
# installment：每月的还款金额
# annual_inc：申请人年度总收入
# 新特征
new_data["installment_percent"] = new_data[["installment","annual_inc"]].apply(lambda x: x.installment / (x.annual_inc / 12), axis=1).replace([np.inf,0])
new_data.head()
Out[61]:
贷款状态编码与可视化
In [62]:
new_data["loan_status"].value_counts()
Out[62]:
Current               99850
Fully Paid             3896
In Grace Period         932
Late (31-120 days)      436
Late (16-30 days)       312
Charged Off              25
Name: loan_status, dtype: int64
In [63]:
# 违约=1, 正常=0
new_data["loan_status"] = new_data["loan_status"].apply(lambda x: 0 if (x == "Fully Paid" or x == "Current") else 1)
In [64]:
new_data["loan_status"].value_counts()
Out[64]:
0    103746
1      1705
Name: loan_status, dtype: int64
In [65]:
# 贷款状态柱状图统计
fig, axs = plt.subplots(1,2,figsize=(12,6))
sns.countplot(x='loan_status',
              data=new_data,
              ax=axs[0])
axs[0].set_title("Frequency of each Loan Status(Bar)")
# 饼图：统计每个类型的占比
new_data['loan_status'].value_counts().plot(x=None,
                                            y=None, 
                                            kind='pie', 
                                            ax=axs[1],
                                            autopct='%1.2f%%')
axs[1].set_title("Percentage of each Loan status(Pie)")
plt.show()
可以看到违规和没有违规的比例差别是很大的，后面会通过采样的方法来解决。
有序变量编码
部分字段的取值是存在一定的顺序关系。比如服装的尺码大小，"XS"、"S"、"M"、"L"等，它们的取值本身是有大小关系的。
在这里我们实施硬编码：
In [66]:
new_data.select_dtypes("object")
无序特征编码
采用的独热码，通过get_dummies函数来实现
In [68]:
df1 = new_data.select_dtypes("object")  # 字符型
df2 = new_data.select_dtypes(exclude=["object"])   # 非字符型
df1.head()
Out[68]:
特征缩放：标准化
针对数值型特征(数据df2)的特征缩放，将数据规范到一定范围内，便于后期加快算法的收敛速度
In [70]:
# 目标变量load_status 
Y = df2[["loan_status"]]




    

Y.head()
Out[70]:
from sklearn.preprocessing import StandardScaler
ss = StandardScaler()
ss_data = ss.fit_transform(df3)
df4 = pd.DataFrame(ss_data, columns=df3.columns)
df4.head()
SMOTE采样（重点）
可以看到违规和没有违规的比例差别是很大的，也就是说样本及其不均衡，常用的处理方式有两种：
过采样（oversampling），增加正样本使得正、负样本数目接近，然后再进行学习
欠采样（undersampling），去除一些负样本使得正、负样本数目接近，然后再进行学习。
下面通过欠采样方法来处理：
In [75]:
positive_data = new_df[new_df["loan_status"] == 1]  # 违约
negetive_data = new_df[new_df["loan_status"] == 0]  # 未违约
In [76]:
print("length of positive_data: ",len(positive_data))
print("length of negetive_data: ",len(negetive_data))
length of positive_data:  1705
length of negetive_data:  103746
In [77]:
selected_data = negetive_data.sample(len(positive_data))
selected_data.shape
Out[77]:
(1705, 104)
下面生成的new_data就是最终用于建模的数据：
In [78]:
new_data = pd.concat([positive_data,selected_data],axis=0)
new_data  
特征和目标分离
In [79]:
y = new_data[["loan_status"]]
X = new_data.drop("loan_status", axis=1)
训练集和测试集切分
In [80]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=4)
逻辑回归分类器
基于逻辑回归的分类器
In [81]:
from sklearn.linear_model import LogisticRegression
clf = LogisticRegression() # 构建逻辑回归分类器
clf.fit(X_train, y_train)
Out[81]:
LogisticRegression()
In [82]:
# 进行预测
predict = clf.predict(X_test) 
predict[:5]
Out[82]:
array([1, 1, 0, 1, 1])
预测准确率
查看预测的准确率：
In [83]:
from sklearn.metrics import accuracy_score
accuracy_score(predict, y_test)
Out[83]:
0.7067448680351907
混淆矩阵的信息：
In [84]:
from sklearn.metrics import confusion_matrix
m = confusion_matrix(y_test, predict)
Out[84]:
array([[262,  87],
       [113, 220]])
In [85]:
# 混淆矩阵可视化
plt.figure(figsize=(12,8))
sns.heatmap(m) 
plt.show()
上面可视化的结果表示：颜色越深代表的人数越多，也就是真阳性的人数是最多的；而假阳性是最少的
roc曲线
In [86]:
from sklearn.metrics import roc_curve, auc
false_positive_rate, true_positive_rate,thresholds = roc_curve(y_test, predict)
roc_auc = auc(false_positive_rate,true_positive_rate)
roc_auc
Out[86]:
0.7056884965194421
In [87]:
import matplotlib.pyplot as plt
plt.figure(figsize=(10,10))
plt.title("ROC")  # Receiver Operating Characteristic
plt.plot(false_positive_rate,
         true_positive_rate,
         color="red",
         label="AUC = %0.2f"%roc_auc
plt.legend(loc="lower right")
plt.plot([0,1],[0,1],linestyle="--")
plt.axis("tight")
# 真阳性：预测类别为1的positive；预测正确True
plt.ylabel("True Positive Rate")
# 假阳性：预测类别为1的positive；预测错误False
plt.xlabel("False Positive Rate")
Out[87]:
Text(0.5, 0, 'False Positive Rate')
整体方案的ROC值达到了71%，还是有一定的提升空间。后续可以优化的点：
特征属性过多：可以考虑降维或者特征的多重线性检验，找出更具有价值的特征
离散型变量的编码：目前是统一的标准化处理，没有编码工作；后续可以考虑加入编码工作，比如：因子化、特征分箱等
建模优化：尝试使用不同的分类模型，以及模型融合方法
参数调优：可以考虑做一个参数的调优，比如网格搜索、随机搜索等等
        LeetCode
          数据结构与算法 #18 下跳棋，极富想象力的同向双指针模拟
 这道题是 LeetCode 上的 1040. 移动石子直到连续 II，难度是 Meduium，难度分是 2455。虽然是 Medium 题，但是打 Hard 标签一点也不为过。长期作为中等题的难度第一
  150
 
 
        Aurora_Trip
      
    Leetcode 24. 两两交换链表中的节点
 Leetcode 24. 两两交换链表中的节点 涉及的点: 三指针(前中后指针), 链表(纯粹的链表-单链表)
  529
 
 
        MobotStone
      
    AIGC：一把多功能的宝刀，如何使用取决于业务
 在过去的几个月里，你可能已经对“AI能降本增效”这样的标题感到麻木，甚至有些厌倦。但是，作为一名专注于教育行业的投资人，我对AI的发展和应用仍然充满好奇和热情。我走访了全球各地的教育工作者、AI研究员
  325
 
 
        houchenll
      
    知识图谱表示综述




    
 知识表示是指将知识以某种形式表示出来，以便计算机能够理解和处理。它是将现实世界中的信息、概念、实体、关系等转化为计算机可以处理的数据结构或符号形式的过程。
  301
 
 
        houchenll
      
    嵌入表示模型简介
 嵌入表示模型是一种将高维数据映射到低维连续向量空间的方法，用于捕捉数据的语义和特征。它通过学习数据之间的相似性和关系，将数据表示为紧凑且具有语义含义的向量。
  300
 
 
        亚马逊云开发者
      
    机器学习洞察 | 分布式训练让机器学习更加快速准确
 机器学习能够基于数据发现一般化规律的优势日益突显，我们看到有越来越多的开发者关注如何训练出更快速、更准确的机器学习模型，而分布式训练 (Distributed Training) 则能够大幅加速这一进
  210
 
 
        阿里云大数据AI技术
      
 14小时前
 
      
    【ACL 2023】面向轻量化文图检索的Dual-Encoder模型蒸馏算法ConaCLIP
 ConaCLIP针对轻量化的图文检索任务进行设计，是一种通过全连接的知识交互图学习方式将知识从dual-encoder大模型中蒸馏到dual-encoder小模型的算法。
  273
 
 
        LeetCode
      
    LeetCode 周赛 353（2023/07/09）看似没考 LIS 最长递增子序列，好像又考了
 往期回顾：LeetCode 单周赛第 352 场 · 一场关于子数组的专题周赛 T1. 找出最大的可达成数字（Easy） 标签：模拟 T2. 达到末尾下标所需的最大跳跃次数（Medium） 标签：动态
  284
 




    
 
        harykali
      
    编译原理：寄存器分配算法概述
 编译原理中寄存器分配算法的简单介绍，主要介绍了基于图着色的寄存器分配算法，基于弦图的寄存器分配算法以及基于线性扫描的寄存器分配算法
  800
 
 
        LeetCode
      
    LeetCode 双周赛 108（2023/07/08）渐入佳境
 往期回顾：LeetCode 单周赛第 351 场 · 一场关于子数组的专题周赛 双周赛 108 概览 T1. 最长交替子序列（Easy） 标签：模拟、同向双指针 T2. 重新放置石块（Medium） 
  273
 
 
        简绘Android
      
    技巧篇｜周末使用Clion愉快刷题 Leetcode(C++)吧
 如果使用clion搭建一套舒服的leetcode刷题环境，轻松调试和运行提交代码，一站式刷题，不需要leetcode会员
  1946
 
 
        阿里云大数据AI技术
      
    NLP领域再创佳绩！阿里云机器学习平台 PAI 多篇论文入选 ACL 2023
 阿里云机器学习平台PAI主导的多篇论文在 ACL 2023 Industry Track 上入选。
  462
 
 
        songscsc
        Python
      
    基于深度学习的高精度Caltech行人检测系统
 基于深度学习的Caltech行人检测系统可用于日常生活中检测与定位Caltech行人目标，利用深度学习算法可实现图片、视频、摄像头等方式的目标检测识别，另外支持结果可视化与图片或视频检测结果的导出
  453
 




    
 
      
    【APFN】从大佬论文中探索如何分析改进金字塔网络
 在计算机视觉领域，金字塔网络是一种经典而有效的图像处理方法，旨在实现**多尺度的特征提取和图像分析。通过构建不同尺度的图像金字塔，金字塔网络能够从粗糙到精细地分析图像内容，为许多视觉任务提供了强大的基
  395
 
 
      
    LLaMA模型指令微调 字节跳动多模态视频大模型 Valley 论文详解
 Valley: Video Assistant with Large Language model Enhanced abilitY 近期基于LLaMA微调的模型有很多，Alpaca，Vicuna都是
  451
 
 
      
    详解数仓中的数据分层：ODS、DWD、DWM、DWS、ADS
 大家好，我是Peter~ 今天带来的是一篇关于数据仓库和数据分层的文章。自己看了很多的文章，面试的时候也曾被问过很多次，所以整理出来分享给大家。
  2.4w
 
 
        数据可视化
      
    Plotly-express-17-图例legend和标题设置（二）
 🏆技术专题第三期|数据可视化的那些事......
  1.0w
 
 
        数据可视化
      
    不容忽视的30个数据可视化小技巧
 大家好，我是Peter~ 经常会通过可视化的方式来呈现数据。这里给大家总结了数据可视化制作的30个小技巧，通过列举一些容易被忽略的常见错误，最终能够快速提升和巩固你的可视化制作水平（来自DataHun
  9149
 
 
        pandas
      
    Pandas使用经典案例
 小而全的Pandas数据分析案例 写过很多关于Pandas的文章，本文开展了一个简单的综合使用，主要分为： 如何自行模拟数据 多种数据处理方式 数据统计与可视化 用户RFM模型 用户复购周期 构建数据
  6756
 
 
        pandas
      
    使用Pandas实现SQL的group_concat
 大家好，我是Peter~ 本文是对比SQL学习Pandas的第三篇文章，主要讲解的是如何利用pandas来实现SQL中的group_concat操作。 group_concat SQL或者MySQL中
  6579
 
 
        数据分析师
      
   私信