Linear Classification Models with SGD - NLP Cookbook

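sklearn.linear_model.SGDClassifier provides linear classification models trained with stochastic gradient descent (SGD). By changing the loss and penalty parameters of SGDClassifier, you can train an SVM or a logistic regression model optimized with SGD.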

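For details on SGD, see the official scikit-learn documentation:

https://scikit-learn.org/stable/modules/sgd.html#sgd

The documentation also gives the mathematical form of each objective function and how it maps to the parameters.

This page collects recipes for using SGDClassifier.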

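SGDClassifier minimizes the following objective function:

\[E(w, b) = \frac{1}{n} \sum_{i=1}^{n} L(y_i, f(x_i)) + \alpha R(w)\]

Here \(L\) is the loss function, set with the loss parameter, and \(R\) is the regularization term, set with the penalty parameter; \(\alpha\) controls the regularization strength. Together, these choices define the objective function.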
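To make the roles of \(L\), \(R\), and \(\alpha\) concrete, here is a minimal pure-Python sketch that evaluates \(E(w, b)\) on a tiny hand-made dataset. It assumes a linear score \(f(x) = w \cdot x + b\), labels in {-1, +1}, the hinge loss as \(L\), and the squared L2 norm as \(R\). The function names and the toy data are hypothetical, and the library's internal scaling of the penalty may differ from this sketch.

```python
def hinge_loss(y, score):
    # Hinge loss for a label y in {-1, +1}: max(0, 1 - y * f(x)).
    return max(0.0, 1.0 - y * score)


def squared_l2(w):
    # Squared L2 norm of the weight vector, used here as R(w).
    return sum(wj * wj for wj in w)


def objective(X, y, w, b, alpha, loss=hinge_loss, penalty=squared_l2):
    # E(w, b) = (1/n) * sum_i L(y_i, f(x_i)) + alpha * R(w), with f(x) = w.x + b.
    scores = [sum(wj * xj for wj, xj in zip(w, x)) + b for x in X]
    data_term = sum(loss(yi, si) for yi, si in zip(y, scores)) / len(y)
    return data_term + alpha * penalty(w)


# Hypothetical toy data: three samples with two features each.
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
y = [1, -1, 1]
w = [1.0, -1.0]

value = objective(X, y, w, b=0.0, alpha=0.1)
print(value)
```

Swapping the `loss` or `penalty` argument for another function changes the objective in the same way that changing the loss and penalty parameters changes what SGDClassifier optimizes.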

Loading the data and modules

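import pandas as pd
from sklearn import model_selection
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import SGDClassifier
from sklearn.metrics import PrecisionRecallDisplay
from sklearn.pipeline import Pipeline

# Load the preprocessed dataset and hold out 10% of it for testing.
data = pd.read_csv("input/pn_same_judge_preprocessed.csv")
train, test = model_selection.train_test_split(data, test_size=0.1, random_state=0)
# TF-IDF features over whitespace-separated tokens, then a linear model trained with SGD.
# With its default loss ("hinge"), SGDClassifier acts as a linear SVM.
pipe_svm = Pipeline([
    ("vect", TfidfVectorizer(tokenizer=str.split)),
    ("clf", SGDClassifier(random_state=0)),
])
# Assumes the training columns have the same names as the test columns used below.
pipe_svm.fit(train["tokens"], train["label_num"])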
Pipeline(steps=[('vect',
                 TfidfVectorizer(tokenizer=<method 'split' of 'str' objects>)),
                ('clf', SGDClassifier(random_state=0))])
# Score the held-out test set and plot the precision-recall curve.
score_svm = pipe_svm.decision_function(test["tokens"])
PrecisionRecallDisplay.from_predictions(
    y_true=test["label_num"],
    y_pred=score_svm,
    name="Online SVM",
)
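# Same pipeline with explicit settings: hinge loss, L2 penalty, and exactly 5 epochs (tol=None turns off the tolerance-based stop).
pipe_svm = Pipeline([
    ("vect", TfidfVectorizer(tokenizer=str.split)),
    ("clf", SGDClassifier(loss="hinge", penalty="l2", alpha=1e-3, random_state=42, max_iter=5, tol=None)),
])
pipe_svm.fit(train["tokens"], train["label_num"])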
Pipeline(steps=[('vect',
                 TfidfVectorizer(tokenizer=<method 'split' of 'str' objects>)),
                ('clf',
                 SGDClassifier(alpha=0.001, max_iter=5, random_state=42,
                               tol=None))])
# Score the test set again with the retrained model and plot the precision-recall curve.
score_svm = pipe_svm.decision_function(test["tokens"])
PrecisionRecallDisplay.from_predictions(
    y_true=test["label_num"],
    y_pred=score_svm,
    name="Online SVM",
)
Setting loss to log_loss in SGDClassifier gives a logistic regression model that optimizes the following objective function.
\[\begin{align}
L(y_i, f(x_i)) &= \log(1 + \exp(- y_i f(x_i))) \\
R(w) &= ||w||_2^2
\end{align}\]
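As a quick numeric check of the logistic loss above, the following sketch evaluates \(L(y, f(x)) = \log(1 + \exp(-y f(x)))\) for a few scores. It assumes labels in {-1, +1}. The function name `log_loss` and the sample scores are hypothetical, and the split into two branches is only there to avoid overflow in `exp` for large negative margins.

```python
import math


def log_loss(y, score):
    # L(y, f(x)) = log(1 + exp(-y * f(x))) for a label y in {-1, +1}.
    margin = y * score
    if margin >= 0:
        # exp(-margin) <= 1 here, so this form cannot overflow.
        return math.log1p(math.exp(-margin))
    # For negative margins, use log(1 + exp(-m)) = -m + log(1 + exp(m)).
    return -margin + math.log1p(math.exp(margin))


# Hypothetical scores for a positive example (y = +1).
for score in (-2.0, 0.0, 2.0):
    print(score, log_loss(1, score))
```

The loss shrinks toward zero as the score agrees more strongly with the label, and grows roughly linearly when the score is confidently wrong.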
Let's train it.
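# Same TF-IDF features, but with the logistic loss instead of the hinge loss.
pipe_log = Pipeline([
    ("vect", TfidfVectorizer(tokenizer=str.split)),
    ("clf", SGDClassifier(loss="log_loss", random_state=0)),
])
pipe_log.fit(train["tokens"], train["label_num"])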
Pipeline(steps=[('vect',
                 TfidfVectorizer(tokenizer=<method 'split' of 'str' objects>)),
                ('clf', SGDClassifier(loss='log_loss', random_state=0))])
# Use the predicted probability of the positive class (column 1) as the score.
score_log = pipe_log.predict_proba(test["tokens"])[:, 1]
PrecisionRecallDisplay.from_predictions(
    y_true=test["label_num"],
    y_pred=score_log,
    name="Logistic regression",
)
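The logistic regression recipe scores examples with predict_proba, while the SVM recipes use decision_function. The sketch below illustrates how the two kinds of score relate in a logistic model. It assumes the usual logistic link, in which the positive-class probability is the sigmoid of the linear score \(f(x)\); that SGDClassifier follows exactly this mapping is an assumption here, and the scores are made-up values, not results from the dataset.

```python
import math


def sigmoid(score):
    # Logistic sigmoid, written in two branches to avoid overflow in exp().
    if score >= 0:
        return 1.0 / (1.0 + math.exp(-score))
    z = math.exp(score)
    return z / (1.0 + z)


# Hypothetical linear scores f(x) for four test examples.
scores = [-2.0, 0.5, 3.0, -0.1]
probs = [sigmoid(s) for s in scores]

# The sigmoid is strictly increasing, so both orderings of the examples agree.
rank_by_score = sorted(range(len(scores)), key=lambda i: scores[i])
rank_by_prob = sorted(range(len(probs)), key=lambda i: probs[i])
print(rank_by_score == rank_by_prob)
```

Because a precision-recall curve depends only on how the examples are ranked, a raw score and its sigmoid would trace the same curve under this assumption; the probability is simply easier to interpret.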