Reposted from:
https://mp.weixin.qq.com/s/9gEfkiZyZkoIgwRCYISQgQ
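LightGBM is a tree-model framework that supports fast parallel training. Like XGBoost, it is an efficient implementation of GBDT; it integrates several ensemble-learning ideas and improves on XGBoost's node-splitting implementation, which gives it a lower memory footprint and faster training.
LightGBM documentation: https://lightgbm.readthedocs.io/en/latest/
Parameter reference: https://lightgbm.readthedocs.io/en/latest/Parameters.html
The contents of this article are listed below. See the end of the article for how to obtain the original code.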
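- 1 Installation
- 2 Usage
- 2.1 Defining the dataset
- 2.2 Training the model
- 2.3 Saving and loading the model
- 2.4 Inspecting feature importance
- 2.5 Continuing training
- 2.6 Adjusting hyperparameters dynamically
- 2.7 Custom loss functions
- 2.8 Tuning methods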
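1 Installation
Installing LightGBM is straightforward, and on Linux it is easy to enable GPU training. Try installing from pip first; if that fails, build from source.
# Build from source
git clone --recursive https://github.com/microsoft/LightGBM
cd LightGBM
mkdir build ; cd build
cmake ..
# Enable MPI communication, which can speed up training
# cmake -DUSE_MPI=ON ..
# GPU build, which can speed up training
# cmake -DUSE_GPU=1 ..
make -j4
# Install from pip: default build
pip install lightgbm
# MPI build
# Note: recent pip releases reject --install-option; check the LightGBM
# installation guide for the flags that match your pip and LightGBM versions.
pip install lightgbm --install-option=--mpi
# GPU build
pip install lightgbm --install-option=--gpu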
2 Usage
In Python, LightGBM offers two ways to call it: the native API and the Scikit-learn API. Both can handle training and validation. The native API is more flexible; choose whichever suits your habits.
2.1 Defining the dataset
import numpy as np
import pandas as pd
import lightgbm as lgb
# load the example data: column 0 is the label, the rest are features
df_train = pd.read_csv('https://cdn.coggle.club/LightGBM/examples/binary_classification/binary.train', header=None, sep='\t')
df_test = pd.read_csv('https://cdn.coggle.club/LightGBM/examples/binary_classification/binary.test', header=None, sep='\t')
W_train = pd.read_csv('https://cdn.coggle.club/LightGBM/examples/binary_classification/binary.train.weight', header=None)[0]
W_test = pd.read_csv('https://cdn.coggle.club/LightGBM/examples/binary_classification/binary.test.weight', header=None)[0]
y_train = df_train[0]
y_test = df_test[0]
X_train = df_train.drop(0, axis=1)
X_test = df_test.drop(0, axis=1)
num_train, num_feature = X_train.shape
# create dataset for lightgbm
# if you want to re-use data, remember to set free_raw_data=False
lgb_train = lgb.Dataset(X_train, y_train,
                        weight=W_train, free_raw_data=False)
lgb_eval = lgb.Dataset(X_test, y_test, reference=lgb_train,
                       weight=W_test, free_raw_data=False)
2.2 Training the model
params = {
    'boosting_type': 'gbdt',
    'objective': 'binary',
    'metric': 'binary_logloss',
    'num_leaves': 31,
    'learning_rate': 0.05,
    'feature_fraction': 0.9,
    'bagging_fraction': 0.8,
    'bagging_freq': 5,
    'verbose': 0
}
# generate feature names
feature_name = ['feature_' + str(col) for col in range(num_feature)]
gbm = lgb.train(params,
                lgb_train,
                num_boost_round=10,
                valid_sets=lgb_train,  # eval training data
                feature_name=feature_name,
                categorical_feature=[21])
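2.3 Saving and loading the model
import json
# save model to file
gbm.save_model('model.txt')
print('Dumping model to JSON...')
# dump the model structure as a Python dict, then write it out as JSON
model_json = gbm.dump_model()
with open('model.json', 'w+') as f:
    json.dump(model_json, f, indent=4)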
2.4 Inspecting feature importance
# feature names
print('Feature names:', gbm.feature_name())
# feature importances
print('Feature importances:', list(gbm.feature_importance()))
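The two printed lists are easier to read when paired and sorted. A minimal sketch, assuming hypothetical names and importance values that stand in for the outputs of gbm.feature_name() and gbm.feature_importance():

```python
# Hypothetical values standing in for gbm.feature_name() and
# gbm.feature_importance(); they are not taken from a real model.
feature_names = ["feature_0", "feature_1", "feature_2", "feature_3"]
importances = [12, 0, 30, 7]


def rank_features(names, scores):
    """Return (name, score) pairs sorted from most to least important."""
    return sorted(zip(names, scores), key=lambda pair: pair[1], reverse=True)


ranked = rank_features(feature_names, importances)
for name, score in ranked:
    print(f"{name}: {score}")
```

Features with an importance of 0 were never used in a split, which makes them natural candidates to review first.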
2.5 Continuing training
# continue training
# init_model accepts:
# 1. model file name
# 2. Booster()
gbm = lgb.train(params,
                lgb_train,
                num_boost_round=10,
                init_model='model.txt',
                valid_sets=lgb_eval)
print('Finished 10 - 20 rounds with model file...')
2.6 Adjusting hyperparameters dynamically
# decay learning rates
# lgb.reset_parameter accepts, for each parameter:
# 1. list with length = num_boost_round
# 2. function(curr_iter)
# (older LightGBM releases also accepted a learning_rates argument in lgb.train)
gbm = lgb.train(params,
                lgb_train,
                num_boost_round=10,
                init_model=gbm,
                valid_sets=lgb_eval,
                callbacks=[lgb.reset_parameter(learning_rate=lambda iter: 0.05 * (0.99 ** iter))])
print('Finished 20 - 30 rounds with decay learning rates...')
# change other parameters during training
gbm = lgb.train(params,
lgb_train,
num_boost_round=10,
init_model=gbm,
valid_sets=lgb_eval,
callbacks=[lgb.reset_parameter(bagging_fraction=[0.7] * 5 + [0.6] * 5)])
print('Finished 30 - 40 rounds with changing bagging_fraction...')
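Both kinds of schedule can be checked before they are handed to training. The sketch below uses plain Python only, and assumes the round index passed to the function form starts at 0 within each training call; the helper name decayed_learning_rate is made up for illustration.

```python
NUM_BOOST_ROUND = 10


def decayed_learning_rate(current_round, base=0.05, decay=0.99):
    """Learning rate for a given round: base * decay ** current_round."""
    return base * (decay ** current_round)


# Function-style schedule, expanded into explicit per-round values.
lr_schedule = [decayed_learning_rate(i) for i in range(NUM_BOOST_ROUND)]

# List-style schedule: one value per boosting round.
bagging_schedule = [0.7] * 5 + [0.6] * 5
if len(bagging_schedule) != NUM_BOOST_ROUND:
    raise ValueError("a list schedule needs exactly one value per round")

for i, (lr, bagging) in enumerate(zip(lr_schedule, bagging_schedule)):
    print(f"round {i}: learning_rate={lr:.5f}, bagging_fraction={bagging}")
```

Printing the schedule first makes it obvious when a decay factor is too aggressive or a list is one element short.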
2.7 Custom loss functions
import numpy as np
# self-defined objective function
# f(preds: array, train_data: Dataset) -> grad: array, hess: array
# log likelihood loss
def loglikelihood(preds, train_data):
    labels = train_data.get_label()
    preds = 1. / (1. + np.exp(-preds))
    grad = preds - labels
    hess = preds * (1. - preds)
    return grad, hess
# self-defined eval metric
# f(preds: array, train_data: Dataset) -> name: string, eval_result: float, is_higher_better: bool
# binary error
# NOTE: when you use a customized loss function, the default prediction value is the margin
# This may make built-in evaluation metrics calculate wrong results
# For example, with log likelihood loss the prediction is the score before the logistic transformation
# Keep this in mind when you use the customization
def binary_error(preds, train_data):
    labels = train_data.get_label()
    preds = 1. / (1. + np.exp(-preds))
    return 'error', np.mean(labels != (preds > 0.5)), False
# NOTE: the fobj argument is accepted by LightGBM releases before 4.0;
# from 4.0 on, pass the callable as params['objective'] instead
gbm = lgb.train(params,
                lgb_train,
                num_boost_round=10,
                init_model=gbm,
                fobj=loglikelihood,
                feval=binary_error,
                valid_sets=lgb_eval)
print('Finished 40 - 50 rounds with self-defined objective function and eval metric...')
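A custom objective is only as good as its derivatives, so it helps to compare the analytic gradient and Hessian with finite differences before training. The following is a minimal sketch in plain Python for a single sample; the function names are illustrative and not part of LightGBM.

```python
import math


def sigmoid(score):
    return 1.0 / (1.0 + math.exp(-score))


def log_loss(score, label):
    """Negative log likelihood of one sample, given the raw (margin) score."""
    p = sigmoid(score)
    return -(label * math.log(p) + (1 - label) * math.log(1 - p))


def analytic_grad_hess(score, label):
    """Same formulas as loglikelihood above: grad = p - y, hess = p * (1 - p)."""
    p = sigmoid(score)
    return p - label, p * (1.0 - p)


def numeric_grad_hess(score, label, eps=1e-4):
    """Central finite differences of log_loss with respect to the raw score."""
    up = log_loss(score + eps, label)
    mid = log_loss(score, label)
    down = log_loss(score - eps, label)
    return (up - down) / (2 * eps), (up - 2 * mid + down) / (eps ** 2)


for score in (-2.0, 0.0, 1.5):
    for label in (0, 1):
        grad, hess = analytic_grad_hess(score, label)
        num_grad, num_hess = numeric_grad_hess(score, label)
        print(f"score={score}, label={label}: "
              f"grad {grad:.6f} vs {num_grad:.6f}, hess {hess:.6f} vs {num_hess:.6f}")
```

The same check carries over to any other loss: swap in its formula and its derivatives, and the two columns should agree to several decimal places.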
2.8 Tuning methods
For Faster Speed
- Use bagging by setting bagging_fraction and bagging_freq
- Use feature sub-sampling by setting feature_fraction
- Use small max_bin
- Use save_binary to speed up data loading in future learning
- Use parallel learning; refer to the Parallel Learning Guide
For Better Accuracy
- Use large max_bin (may be slower)
- Use small learning_rate with large num_iterations
- Use large num_leaves (may cause over-fitting)
- Use bigger training data
- Try dart
Deal with Over-fitting
- Use small max_bin
- Use small num_leaves
- Use min_data_in_leaf and min_sum_hessian_in_leaf
- Use bagging by setting bagging_fraction and bagging_freq
- Use feature sub-sampling by setting feature_fraction
- Use bigger training data
- Try lambda_l1, lambda_l2 and min_gain_to_split for regularization
- Try max_depth to avoid growing deep trees
- Try extra_trees
- Try increasing path_smooth
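As a rough illustration of the over-fitting list, the parameter names above can be collected into one dictionary and merged over a base configuration. Every value below is an arbitrary assumption chosen for illustration, not a recommendation; suitable values depend on the data.

```python
# Base configuration, mirroring the kind of params dict used for training.
base_params = {
    "objective": "binary",
    "metric": "binary_logloss",
    "num_leaves": 31,
    "learning_rate": 0.05,
}

# Illustrative over-fitting controls; all values are assumptions.
overfit_controls = {
    "max_bin": 63,                    # smaller histogram resolution
    "num_leaves": 15,                 # fewer leaves per tree
    "min_data_in_leaf": 50,           # more samples required per leaf
    "min_sum_hessian_in_leaf": 1e-2,
    "bagging_fraction": 0.8,          # row sub-sampling
    "bagging_freq": 5,
    "feature_fraction": 0.8,          # column sub-sampling
    "lambda_l1": 0.1,                 # L1 regularization
    "lambda_l2": 1.0,                 # L2 regularization
    "min_gain_to_split": 0.01,
    "max_depth": 6,                   # cap tree depth
    "extra_trees": True,
    "path_smooth": 1.0,
}

# Later keys win, so the controls override the base num_leaves.
params_regularized = {**base_params, **overfit_controls}
print(params_regularized)
```

In practice it is usually easier to change one or two of these at a time and watch the validation metric than to switch them all on at once.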
Grid search
from sklearn.model_selection import GridSearchCV
lg = lgb.LGBMClassifier()
param_dist = {"max_depth": [4, 5, 7],
              "learning_rate": [0.01, 0.05, 0.1],
              "num_leaves": [300, 900, 1200],
              "n_estimators": [50, 100, 150]
              }
grid_search = GridSearchCV(lg, n_jobs=-1, param_grid=param_dist, cv=5, scoring="roc_auc", verbose=5)
# fit on the training features and labels defined in section 2.1
grid_search.fit(X_train, y_train)
print(grid_search.best_estimator_, grid_search.best_score_)
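Grid search cost grows multiplicatively with each parameter, so it is worth counting the fits before launching the search. A small sketch using only the standard library, with the same grid and the same 5 folds as above:

```python
from itertools import product

param_dist = {
    "max_depth": [4, 5, 7],
    "learning_rate": [0.01, 0.05, 0.1],
    "num_leaves": [300, 900, 1200],
    "n_estimators": [50, 100, 150],
}
CV_FOLDS = 5

# Every combination of one value per parameter is one candidate model.
candidates = [dict(zip(param_dist, values)) for values in product(*param_dist.values())]
total_fits = len(candidates) * CV_FOLDS

print(f"{len(candidates)} candidates x {CV_FOLDS} folds = {total_fits} fits")
print("first candidate:", candidates[0])
```

With four parameters of three values each, the grid already needs several hundred model fits, which is the main reason to consider a smarter search such as Bayesian optimization.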
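Bayesian optimization
import warnings
import time
import numpy as np
warnings.filterwarnings("ignore")
from bayes_opt import BayesianOptimization
# objective for the optimizer: mean cross-validated AUC for one parameter set
# (named lgb_cv_score so that it does not shadow the lgb_eval dataset)
def lgb_cv_score(max_depth, learning_rate, num_leaves, n_estimators):
    params = {
        "objective": 'binary',
        "metric": 'auc'
    }
    # the optimizer proposes floats, so cast the integer parameters
    params['max_depth'] = int(max(max_depth, 1))
    params['learning_rate'] = np.clip(learning_rate, 0, 1)
    params['num_leaves'] = int(max(num_leaves, 1))
    params['n_estimators'] = int(max(n_estimators, 1))
    cv_result = lgb.cv(params, lgb_train, nfold=5, seed=0, stratified=False)
    # the result key is 'auc-mean' in some versions and carries a prefix in others
    auc_key = next(k for k in cv_result if k.endswith('auc-mean'))
    return 1.0 * np.array(cv_result[auc_key]).max()
lgbBO = BayesianOptimization(lgb_cv_score, {'max_depth': (4, 8),
                                            'learning_rate': (0.05, 0.2),
                                            'num_leaves': (20, 1500),
                                            'n_estimators': (5, 200)}, random_state=0)
# acq='ei' is accepted by older bayes_opt releases; newer ones may configure
# the acquisition function differently
lgbBO.maximize(init_points=5, n_iter=50, acq='ei')
print(lgbBO.max)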
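LightGBM models are usually persisted and loaded through the external joblib interface, but sometimes the built-in save_model interface is used instead. There is little detailed material online about how to read back a model saved this way. This article shows how to load the model after saving it.
LightGBM is a boosting ensemble model developed by Microsoft. Like XGBoost, it is an optimized, efficient implementation of GBDT. The two share some principles, but LightGBM outperforms XGBoost in many respects. The advantages listed officially for the library are:
- Faster training
- Lower memory usage
- Better accuracy
- Support for parallel learning
- Ability to handle large-scale data
- Direct support for category features
In one set of experimental data, LightGBM is nearly 10 times faster than XGBoost, uses about 1/6 of the memory, and also improves accuracy.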
# Model training (early stopping is passed as a callback)
gbm = lgb.train(params, lgb_train, num_boost_round=20, valid_sets=lgb_eval, callbacks=[lgb.early_stopping(stopping_rounds=5)])
# Model saving
gbm.save_model('model.txt')
# Model loading
gbm =...
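LightGBM's train function exposes a callback interface, which is what we need in order to record how the loss falls during training. Taking the recording of per-iteration loss as the starting point, this article gives a simple example of using callbacks in LightGBM.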
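A loss-recording callback can be sketched as a plain function that receives the training environment once per iteration. The version below assumes the environment object exposes iteration and evaluation_result_list, with each entry starting with (dataset name, metric name, value), as LightGBM's callback environment does. To stay runnable without a trained model, it is exercised with a simulated environment and made-up loss values; make_loss_recorder is a hypothetical helper name.

```python
from types import SimpleNamespace


def make_loss_recorder(history):
    """Build a callback that appends (iteration, value) per dataset and metric."""
    def _callback(env):
        for entry in env.evaluation_result_list:
            data_name, metric_name, value = entry[0], entry[1], entry[2]
            key = f"{data_name}-{metric_name}"
            history.setdefault(key, []).append((env.iteration, value))
    return _callback


history = {}
recorder = make_loss_recorder(history)

# Simulated training loop with made-up loss values, standing in for lgb.train.
for iteration, loss in enumerate([0.69, 0.66, 0.64]):
    fake_env = SimpleNamespace(
        iteration=iteration,
        evaluation_result_list=[("valid_0", "binary_logloss", loss, False)],
    )
    recorder(fake_env)

print(history)
```

In real training the function would be passed as callbacks=[recorder] to lgb.train together with valid_sets. LightGBM also ships a built-in lgb.record_evaluation callback that fills a dictionary in a similar way.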
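Parameter tuning for leaf-wise trees:
num_leaves: controls the number of leaf nodes. It is the main parameter governing the complexity of the tree model.
For level-wise growth this value would be 2^{depth}, where depth is the depth of the tree. With the same number of leaves, however, a leaf-wise tree grows much deeper than a level-wise tree and over-fits very easily, so num_leaves should be set smaller than 2^{depth}. A leaf-wise tree has no real notion of depth, because there is no sensible mapping from leaves to depth.
min_data_in_leaf: the minimum number of samples in each leaf node.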
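The rule that num_leaves should stay below 2^{depth} is easy to check in code. A minimal sketch with hypothetical helper names:

```python
def max_leaves_for_depth(depth):
    """A full binary tree of the given depth has 2 ** depth leaves."""
    return 2 ** depth


def num_leaves_is_reasonable(num_leaves, max_depth):
    """True when num_leaves stays below the level-wise bound 2 ** max_depth."""
    return num_leaves < max_leaves_for_depth(max_depth)


for num_leaves, max_depth in [(31, 7), (300, 7), (100, 6)]:
    bound = max_leaves_for_depth(max_depth)
    verdict = "ok" if num_leaves_is_reasonable(num_leaves, max_depth) else "too large"
    print(f"num_leaves={num_leaves}, max_depth={max_depth}, bound={bound}: {verdict}")
```

For example, num_leaves=300 combined with max_depth=7 exceeds the bound of 128, so the depth limit rather than num_leaves would end up constraining the tree.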
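LightGBM has relatively few built-in loss functions. For binary classification, for instance, there is only binary, at most combined with class_weight to penalize the classes differently. We can define our own loss function, however, as long as it is twice differentiable.
I. Using LightGBM in its native form (import lightgbm as lgb)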
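# Model training (early stopping is passed as a callback)
gbm = lgb.train(params, lgb_train, num_boost_round=20, valid_sets=lgb_eval, callbacks=[lgb.early_stopping(stopping_rounds=5)])
# Model saving
gbm.save_model('model.txt')
# Model loading
gbm = lgb.Booster(model_file='model.txt')
# Model prediction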
y_pred = g