
This is part of a series on text classification, implementing text classifiers with methods ranging from simple to complex.
We use the Stanford Sentiment Treebank movie review dataset (Socher et al. 2013). The dataset can be downloaded here:
Link: dataset
Extraction code: yeqw
For the full code, see: text classification

The first and simplest model: the Word Average Model

Word Average Model

We denote a sentence by X = {x_1, x_2, x_3, ..., x_n}, where x_t is the t-th word of the sentence. We use emb to denote the word embedding function, i.e., emb(x) returns a d-dimensional word vector.
First we define a word-averaging sentence encoder and a classifier on top of it:

h_{avg} = \frac{1}{n} \sum_{t=1}^{n} emb(x_t)

pos = \sigma(w^T h_{avg})

\sigma is the logistic (sigmoid) function and w is a d-dimensional vector. If pos >= 0.5 the classifier returns positive sentiment, otherwise negative sentiment.

During training we use the binary log loss. The parameters of the whole model are the embedding function emb and the vector w. Note that the dimension d of the word vectors must match the dimension of w. Some words may appear in DEV and TEST but not in TRAIN; for those words we can use a randomly initialized vector (a special UNK word vector). When initializing the word vectors, do not use too large a range, otherwise the large norms of these unknown words can hurt the model (so here we initialize the word vectors with random numbers between -0.1 and 0.1).
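For reference, this is the binary log loss referred to above, written out for a single example (assuming y ∈ {0, 1} is the gold label and pos = \sigma(w^T h_{avg}) as defined earlier); the nn.BCEWithLogitsLoss used later computes exactly this quantity from the raw logit w^T h_{avg}:

loss(y, pos) = -\big[\, y \log(pos) + (1 - y) \log(1 - pos) \,\big]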

import random
from collections import Counter
import torch
import torch.nn as nn
import torch.optim as optim
import torch.nn.functional as F
USE_CUDA = torch.cuda.is_available()
device = torch.device('cuda' if USE_CUDA else 'cpu')
with open('senti.train.tsv','r') as rf:
    lines = rf.readlines()
print(lines[:10])
 

['hide new secretions from the parental units\t0\n', 'contains no wit , only labored gags\t0\n', 'that loves its characters and communicates something rather beautiful about human nature\t1\n', 'remains utterly satisfied to remain the same throughout\t0\n', 'on the worst revenge-of-the-nerds clichés the filmmakers could dredge up\t0\n', "that 's far too tragic to merit such superficial treatment\t0\n", 'demonstrates that the director of such Hollywood blockbusters as Patriot Games can still turn out a small , personal film with an emotional wallop .\t1\n', 'of saucy\t1\n', "a depressed fifteen-year-old 's suicidal poetry\t0\n", "are more deeply thought through than in most ` right-thinking ' films\t1\n"]

def read_corpus(path):
    sentences = []
    labels = []
    with open(path,'r', encoding='utf-8') as f:
        for line in f:
            sentence, label = line.split('\t')
            sentences.append(sentence.lower().split())
            labels.append(label[0])
    return sentences, labels
train_path,dev_path,test_path = 'senti.train.tsv','senti.dev.tsv','senti.test.tsv'
train_sentences, train_labels = read_corpus(train_path)
dev_sentences, dev_labels = read_corpus(dev_path)
test_sentences, test_labels = read_corpus(test_path)
print(len(train_sentences)), print(len(train_labels))
 

67349
67349

train_sentences[1], train_labels[1]
 

(['contains', 'no', 'wit', ',', 'only', 'labored', 'gags'], '0')

def build_vocab(sentences, word_size=20000):
    c = Counter()
    for sent in sentences:
        for word in sent:
            c[word] += 1
    print('Total number of unique words:', len(c))
    words_most_common = c.most_common(word_size)
    ## adding unk, pad
    idx2word = ['<pad>','<unk>'] + [item[0] for item in words_most_common]
    word2dix = {w:i for i, w in enumerate(idx2word)}
    return idx2word, word2dix
WORD_SIZE=20000
idx2word, word2dix = build_vocab(train_sentences, word_size=WORD_SIZE)
 

Total number of unique words: 14828

idx2word[:10]
 

['<pad>', '<unk>', 'the', ',', 'a', 'and', 'of', '.', 'to', "'s"]

Building batches

def numeralization(sentences, labels, word2idx):
    'Convert sentences from lists of words to lists of word indices'
    numeral_sent = [[word2idx.get(w, word2idx['<unk>']) for w in s] for s in sentences]
    numeral_label = [int(label) for label in labels]
    return list(zip(numeral_sent, numeral_label))
num_train_data = numeralization(train_sentences, train_labels, word2dix)
num_test_data = numeralization(test_sentences, test_labels, word2dix)
num_dev_data = numeralization(dev_sentences, dev_labels, word2dix)
def convert2tensor(batch_sentences):
    'Convert a batch of sentences to a tensor, zero-padding every sentence to the max length in the batch'
    lengths = [len(s) for s in batch_sentences]
    max_len = max(lengths)
    batch_size = len(batch_sentences)
    batch = torch.zeros(batch_size, max_len, dtype=torch.long)
    for i, l in enumerate(lengths):
        batch[i, :l] = torch.tensor(batch_sentences[i])
    return batch
def generate_batch(numeral_sentences_labels, batch_size=32):
    '''Shuffle the data and split the index lists into batches'''
    batches = []
    num_sample = len(numeral_sentences_labels)
    random.shuffle(numeral_sentences_labels)
    numeral_sent = [n[0] for n in numeral_sentences_labels]
    numeral_label = [n[1] for n in numeral_sentences_labels]
    for start in range(0, num_sample, batch_size):
        end = start + batch_size
        ## slicing past the end of the list is safe in Python, so the last (smaller) batch needs no special case
        batch_sentences = numeral_sent[start : end]
        batch_labels = numeral_label[start : end]
        batch_sent_tensor = convert2tensor(batch_sentences)
        batch_label_tensor = torch.tensor(batch_labels, dtype=torch.float)
        batches.append((batch_sent_tensor.to(device), batch_label_tensor.to(device)))
    return batches
train_data = generate_batch(num_train_data)
a = train_data[4]
text, label = a
text
 

tensor([[ 2, 1470, 0, …, 0, 0, 0],
[ 3789, 0, 0, …, 0, 0, 0],
[ 2056, 15, 283, …, 0, 0, 0],
…,
[11711, 3, 12789, …, 42, 2365, 7],
[ 1484, 524, 0, …, 0, 0, 0],
[ 308, 11, 10, …, 0, 0, 0]], device='cuda:0')

class AVGModel(nn.Module):
    def __init__(self, vocab_size, embed_dim, output_size, pad_idx):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim, padding_idx=pad_idx)
        initrange = 0.1
        self.embedding.weight.data.uniform_(-initrange, initrange)
        self.fc = nn.Linear(embed_dim, output_size)
    def forward(self, text):
        ## [batch_size, seq_len]->[batch_size, seq_len, embed_dim]
        embed = self.embedding(text)
        ## average over the sequence dimension (plain mean pooling, no attention)
        ## [batch_size, seq_len, embed_dim] -> [batch_size, embed_dim]
        pooled = F.avg_pool2d(embed, (embed.size(1), 1)).squeeze(1)
        ## [batch_size, embed_dim]->[batch_size, output_size]
        out = self.fc(pooled)
        return out
    def get_embed_weight(self):
        return self.embedding.weight.data
VOCAB_SIZE = len(word2dix)
EMBEDDING_DIM = 100
OUTPUT_SIZE = 1
PAD_IDX = word2dix['<pad>']
model = AVGModel(vocab_size=VOCAB_SIZE,
                 embed_dim=EMBEDDING_DIM,
                 output_size=OUTPUT_SIZE, 
                 pad_idx=PAD_IDX)
model.to(device)
 

AVGModel(
(embedding): Embedding(14830, 100, padding_idx=0)
(fc): Linear(in_features=100, out_features=1, bias=True)
)
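As a quick sanity check (a minimal sketch, not part of the original code; it assumes `text` is the padded batch taken from train_data above): the F.avg_pool2d call in forward is just a mean over the sequence dimension. Note that both versions divide by the padded length, so padding positions (whose embeddings are fixed to zero by padding_idx) still dilute the average of short sentences.

with torch.no_grad():
    embed = model.embedding(text)                                        # [batch_size, seq_len, embed_dim]
    pooled_pool2d = F.avg_pool2d(embed, (embed.size(1), 1)).squeeze(1)   # pooling as written in AVGModel.forward
    pooled_mean = embed.mean(dim=1)                                      # plain mean over the sequence dimension
    print(torch.allclose(pooled_pool2d, pooled_mean))                    # expected: True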

Defining the loss function and the optimizer

criterion = nn.BCEWithLogitsLoss()
criterion = criterion.to(device)
optimizer = optim.Adam(model.parameters(), lr=1e-4)
def get_accuracy(output, label):
    ## output: batch_size 
    y_hat = torch.round(torch.sigmoid(output)) ## round the sigmoid output to 0/1 predictions
    correct = (y_hat == label).float()
    acc = correct.sum()/len(correct)
    return acc
def evaluate(batch_data, model, criterion, get_accuracy):
    model.eval()
    num_epoch = epoch_loss = epoch_acc = 0
    with torch.no_grad():
        for text, label in batch_data:
            out = model(text).squeeze(1)
            loss = criterion(out, label)
            acc = get_accuracy(out, label)
            num_epoch +=1 
            epoch_loss += loss.item()
            epoch_acc += acc.item()
    return epoch_loss/num_epoch, epoch_acc/num_epoch          
def train(batch_data, model, criterion, optimizer, get_accuracy):
    model.train()
    num_epoch = epoch_loss = epoch_acc = 0
    for text, label in batch_data:
        model.zero_grad()
        out = model(text).squeeze(1)
        loss = criterion(out, label)
        acc = get_accuracy(out, label)
        loss.backward()
        optimizer.step()
        num_epoch +=1 
        epoch_loss += loss.item()
        epoch_acc += acc.item()
    return epoch_loss/num_epoch, epoch_acc/num_epoch
NUM_EPOCH = 30
best_valid_acc = -1
dev_data = generate_batch(num_dev_data)
for epoch in range(NUM_EPOCH):
    train_data = generate_batch(num_train_data)
    train_loss, train_acc = train(train_data, model, criterion, optimizer, get_accuracy)
    valid_loss, valid_acc = evaluate(dev_data, model, criterion, get_accuracy)
    if valid_acc > best_valid_acc:
        best_valid_acc = valid_acc
        torch.save(model.state_dict(),'avg-model.pt')
    print(f'Epoch: {epoch+1:02} :')
    print(f'\t Train Loss: {train_loss:.4f} | Train Acc: {train_acc*100:.2f}%')
    print(f'\t Valid Loss: {valid_loss:.4f} | Valid Acc: {valid_acc*100:.2f}%')
 

Epoch: 01 :
Train Loss: 0.1558 | Train Acc: 94.39%
Valid Loss: 0.6171 | Valid Acc: 82.25%
Epoch: 02 :
Train Loss: 0.1550 | Train Acc: 94.45%
Valid Loss: 0.6319 | Valid Acc: 81.47%
Epoch: 03 :
Train Loss: 0.1526 | Train Acc: 94.53%
Valid Loss: 0.6300 | Valid Acc: 82.59%
Epoch: 04 :
Train Loss: 0.1510 | Train Acc: 94.60%
Valid Loss: 0.6502 | Valid Acc: 81.25%
Epoch: 05 :
Train Loss: 0.1495 | Train Acc: 94.64%
Valid Loss: 0.6515 | Valid Acc: 82.37%

model.load_state_dict(torch.load('avg-model.pt'))
 

<All keys matched successfully>

test_data = generate_batch(num_test_data)
test_loss, test_acc = evaluate(test_data, model, criterion, get_accuracy)
print(f'Test Loss: {test_loss:.4f} |  Test Acc: {test_acc*100:.2f}%')
 

Test Loss: 0.5369 | Test Acc: 81.23%

Inspecting the learned word vectors

embed = model.get_embed_weight()
embed_norm = torch.norm(embed, p=None, dim=1)
sort_embed_norm, sort_embed_norm_idx = embed_norm.sort()
print('30 words with the smallest norm:')
for idx in sort_embed_norm_idx[:30].tolist():
    print(idx2word[idx], end=' / ')
 

30 words with the smallest norm:
par / holiday / pastiche / seedy / e-graveyard / quieter / home / captain / keeps / possibly / urge / aching / career / album / code / elegy / peculiar / squint / handheld / blown / quite / cops / miss / the / blush / judd / trip / appointed / make / themselves /

print('30 words with the largest norm:')
for idx in sort_embed_norm_idx[-30:].tolist():
    print(idx2word[idx], end=' / ')
 

30 words with the largest norm:
wonderfully / lousy / unlikable / choppy / badly / splendid / worst / dazzling / outstanding / inept / listless / lacking / playful / mesmerizing / unnecessary / amazing / stunning / irritating / unimaginative / refreshingly / heartwarming / devoid / riveting / suffers / tiresome / pointless / thought-provoking / poorly / mess / unfunny /

The 30 words with the largest norms are all strongly opinionated words used to evaluate movies.

The 30 words with the smallest norms are words that carry essentially no sentiment about the movie.
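To check the same point for individual tokens, here is a small sketch (my own addition; the example words are hand-picked and assumed to be in the vocabulary, which the lists above suggest for 'amazing' and 'worst') that prints the embedding norm of a few specific words:

## print the embedding norm of a few hand-picked words (assumption: they all appear in word2dix)
for w in ['amazing', 'worst', 'the', 'of']:
    idx = word2dix.get(w, word2dix['<unk>'])
    print(f'{w}: {embed_norm[idx].item():.4f}')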
