Implementing a Transformer in Python

  • Background
  • Model architecture

Background

The goal of reducing sequential computation also forms the foundation of the Extended Neural GPU, ByteNet, and ConvS2S, all of which use convolutional neural networks as their basic building block, computing hidden representations in parallel for all input and output positions. In these models, the number of operations required to relate signals from two arbitrary input or output positions grows with the distance between those positions: linearly for ConvS2S and logarithmically for ByteNet. This makes it more difficult to learn dependencies between distant positions. In the Transformer, this is reduced to a constant number of operations, albeit at the cost of reduced effective resolution due to averaging attention-weighted positions, an effect counteracted with multi-head attention.

Self-attention, sometimes called intra-attention, is an attention mechanism that relates different positions of a single sequence in order to compute a representation of that sequence. Self-attention has been used successfully in a variety of tasks, including reading comprehension, abstractive summarization, textual entailment, and learning task-independent sentence representations. End-to-end memory networks are based on a recurrent attention mechanism rather than sequence-aligned recurrence, and have been shown to perform well on simple-language question answering and language-modeling tasks.

To the best of our knowledge, however, the Transformer is the first transduction model that relies entirely on self-attention to compute representations of its input and output, without using sequence-aligned RNNs or convolution.

Model architecture

Most competitive neural sequence transduction models have an encoder-decoder structure. Here, the encoder maps an input sequence of symbol representations (x1, …, xn) to a sequence of continuous representations z = (z1, …, zn). Given z, the decoder then generates an output sequence (y1, …, ym) one element at a time. At each step the model is auto-regressive, consuming the previously generated symbols as additional input when generating the next one.
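
As a toy, purely illustrative sketch of this auto-regressive property (the bigram table below is made up and has nothing to do with the model built later), each new symbol is produced by conditioning on the symbols generated so far:

# Toy illustration of auto-regressive generation: each new symbol is chosen
# by conditioning on the previously generated output (via a made-up bigram table).
toy_next = {'<sos>': 'ein', 'ein': 'mann', 'mann': '<eos>'}
ys = ['<sos>']
while ys[-1] != '<eos>' and len(ys) < 10:
    ys.append(toy_next[ys[-1]])   # next symbol depends on what was generated before
print(ys)  # ['<sos>', 'ein', 'mann', '<eos>']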

Figure 1: Transformer model architecture


Figure 2: Overall structure of the model's Encoder and Decoder

Data preprocessing

## Code adapted from https://blog.csdn.net/weixin_40605573/article/details/111995240
## Field explanation is from https://blog.csdn.net/bqw18744018044/article/details/109150802
import torch
import torch.nn as nn
import torch.optim as optim
import torchtext
from torchtext.datasets import Multi30k
from torchtext.data import Field, BucketIterator
import matplotlib.pyplot as plt
import matplotlib.ticker as ticker
import spacy
import numpy as np
import random
import math
import time
SEED = 1234
random.seed(SEED)
np.random.seed(SEED)
torch.manual_seed(SEED)
torch.cuda.manual_seed(SEED)
torch.backends.cudnn.deterministic = True
spacy_de = spacy.load('de_core_news_sm')
spacy_en = spacy.load('en_core_web_sm')
# Create the tokenizer functions: they can be passed to torchtext, take a sentence as a string
# and return that sentence as a list of tokens.
# Earlier seq2seq work found reversing the input order beneficial ("introduces many short-term
# dependencies in the data that make the optimization problem much easier");
# the German (source) tokenizer below keeps the original order, as reversing is not used here.
def tokenize_de(text):
    """Tokenizes German text from a string into a list of strings."""
    return [tok.text for tok in spacy_de.tokenizer(text)]
def tokenize_en(text):
    """Tokenizes English text from a string into a list of strings."""
    return [tok.text for tok in spacy_en.tokenizer(text)]
# Each sentence in the TranslationDataset is tokenized with the tokenizer defined in its Field
# A Field specifies how each sentence is preprocessed
SRC = Field(tokenize = tokenize_de,init_token = '<sos>',eos_token = '<eos>',lower = True, batch_first = True)
TRG = Field(tokenize = tokenize_en,init_token = '<sos>',eos_token = '<eos>',lower = True, batch_first = True)
# Load the Multi30k German-English parallel corpus and split it into training, validation and test data
# exts specifies which languages to use as source and target (source first); fields specifies the Field objects for source and target
train_data, valid_data, test_data = Multi30k.splits(exts = ('.de', '.en'), 
                                                    fields = (SRC, TRG))
SRC.build_vocab(train_data, min_freq = 2)
TRG.build_vocab(train_data, min_freq = 2)
# print(f"Number of training examples: {len(train_data.examples)}")
# print(f"Number of validation examples: {len(valid_data.examples)}")
# print(f"Number of testing examples: {len(test_data.examples)}")
# print(vars(train_data.examples[1]))
# The build_vocab method creates the vocabulary associated with each language
device = torch.device('cuda:1' if torch.cuda.is_available() else 'cpu')
BATCH_SIZE = 128
# BucketIterator defines an iterator that groups examples of similar length into the same batch,
# minimizing the amount of padding needed while still producing fresh, shuffled batches each new epoch
train_iterator, valid_iterator, test_iterator = BucketIterator.splits((train_data, valid_data, test_data), batch_size = BATCH_SIZE,device = device)
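
As a quick sanity check (assuming the Multi30k download above succeeded), one batch from the iterator can be inspected; with batch_first = True the tensors are laid out as [batch size, sequence length]:

# Inspect one batch to confirm shapes and vocabulary sizes (illustrative check).
batch = next(iter(train_iterator))
print(batch.src.shape)                 # [batch size, src len], e.g. torch.Size([128, 27])
print(batch.trg.shape)                 # [batch size, trg len]
print(len(SRC.vocab), len(TRG.vocab))  # source / target vocabulary sizes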


Encoder

class Encoder(nn.Module):
    def __init__(self, input_dim, hid_dim, n_layers, n_heads, pf_dim, dropout, device, max_length=100):
        super().__init__()
        self.device = device
        self.tok_embedding = nn.Embedding(input_dim, hid_dim)
        self.pos_embedding = nn.Embedding(max_length, hid_dim)
        self.layers = nn.ModuleList([EncoderLayer(hid_dim,n_heads,pf_dim,dropout,device) for _ in range(n_layers)])
        self.dropout = nn.Dropout(dropout)
        self.scale = torch.sqrt(torch.FloatTensor([hid_dim])).to(device)
    def forward(self, src, src_mask):
        # src = [batch size, src len]
        # src_mask = [batch size, 1, 1, src len]
        batch_size = src.shape[0]
        src_len = src.shape[1]
        pos = torch.arange(0, src_len).unsqueeze(0).repeat(batch_size, 1).to(self.device)
        # pos = [batch size, src len]
        src = self.dropout((self.tok_embedding(src) * self.scale) + self.pos_embedding(pos))
        # src = [batch size, src len, hid dim]
        for layer in self.layers:
            src = layer(src, src_mask)
        # src = [batch size, src len, hid dim]
        return src

Encoder layer

class EncoderLayer(nn.Module):
    def __init__(self, hid_dim, n_heads, pf_dim, dropout, device):
        super().__init__()
        self.self_attn_layer_norm = nn.LayerNorm(hid_dim)
        self.ff_layer_norm = nn.LayerNorm(hid_dim)
        self.self_attention = MultiHeadAttentionLayer(hid_dim, n_heads, dropout, device)
        self.positionwise_feedforward = PositionwiseFeedforwardLayer(hid_dim, pf_dim, dropout)
        self.dropout = nn.Dropout(dropout)
    def forward(self, src, src_mask):
        # src = [batch size, src len, hid dim]
        # src_mask = [batch size, 1, 1, src len] 
        # self attention
        _src, _ = self.self_attention(src, src, src, src_mask)
        # dropout, residual connection and layer norm
        src = self.self_attn_layer_norm(src + self.dropout(_src))
        # src = [batch size, src len, hid dim]
        # positionwise feedforward
        _src = self.positionwise_feedforward(src)
        # dropout, residual and layer norm
        src = self.ff_layer_norm(src + self.dropout(_src))
        # src = [batch size, src len, hid dim]
        return src

self-attention

$$\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^{T}}{\sqrt{d_k}}\right)V \tag{1}$$

Multi-head attention is effectively an ensemble of h independent self-attention heads; here we take h = 8 as an example. Its output is computed in three steps: the input X is fed into each of the 8 self-attention heads, producing 8 weighted feature matrices Z_i, i ∈ {1, 2, …, 8}; the 8 matrices Z_i are concatenated column-wise (along the feature dimension) into one large feature matrix; and this matrix is passed through a fully connected layer to produce the output Z.
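
A tiny numeric check of equation (1), independent of the classes below; the sizes here (d_k = 4, three queries, five keys) are arbitrary:

# Scaled dot-product attention computed directly from equation (1).
import torch
torch.manual_seed(0)
Q = torch.randn(1, 3, 4)                       # [batch, query len, d_k]
K = torch.randn(1, 5, 4)                       # [batch, key len, d_k]
V = torch.randn(1, 5, 4)                       # [batch, key len, d_v]
scores = Q @ K.transpose(-2, -1) / (4 ** 0.5)  # QK^T / sqrt(d_k)
attention = torch.softmax(scores, dim=-1)      # each row sums to 1 over the key positions
output = attention @ V                         # weighted sum of the values
print(attention.sum(dim=-1))                   # all ones (up to floating point)
print(output.shape)                            # torch.Size([1, 3, 4])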

class MultiHeadAttentionLayer(nn.Module):
    def __init__(self, hid_dim, n_heads, dropout, device):
        super().__init__()
        assert hid_dim % n_heads == 0
        self.hid_dim = hid_dim
        self.n_heads = n_heads
        self.head_dim = hid_dim // n_heads
        self.fc_q = nn.Linear(hid_dim, hid_dim)
        self.fc_k = nn.Linear(hid_dim, hid_dim)
        self.fc_v = nn.Linear(hid_dim, hid_dim)
        self.fc_o = nn.Linear(hid_dim, hid_dim)
        self.dropout = nn.Dropout(dropout)
        self.scale = torch.sqrt(torch.FloatTensor([self.head_dim])).to(device)
    def forward(self, query, key, value, mask = None):
        batch_size = query.shape[0]
        #query = [batch size, query len, hid dim]
        #key = [batch size, key len, hid dim]
        #value = [batch size, value len, hid dim]
        Q = self.fc_q(query)
        K = self.fc_k(key)
        V = self.fc_v(value)
        #Q = [batch size, query len, hid dim]
        #K = [batch size, key len, hid dim]
        #V = [batch size, value len, hid dim]
        Q = Q.view(batch_size, -1, self.n_heads, self.head_dim).permute(0, 2, 1, 3)
        K = K.view(batch_size, -1, self.n_heads, self.head_dim).permute(0, 2, 1, 3)
        V = V.view(batch_size, -1, self.n_heads, self.head_dim).permute(0, 2, 1, 3)
        #Q = [batch size, n heads, query len, head dim]
        #K = [batch size, n heads, key len, head dim]
        #V = [batch size, n heads, value len, head dim]
        energy = torch.matmul(Q, K.permute(0, 1, 3, 2)) / self.scale
        #energy = [batch size, n heads, query len, key len]
        if mask is not None:
            energy = energy.masked_fill(mask == 0, -1e10)
        attention = torch.softmax(energy, dim = -1)
        #attention = [batch size, n heads, query len, key len]
        x = torch.matmul(self.dropout(attention), V)
        #x = [batch size, n heads, query len, head dim]
        x = x.permute(0, 2, 1, 3).contiguous()
        #x = [batch size, query len, n heads, head dim]
        x = x.view(batch_size, -1, self.hid_dim)
        #x = [batch size, query len, hid dim]
        x = self.fc_o(x)
        #x = [batch size, query len, hid dim]
        return x, attention
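
A small shape check for the layer above (toy sizes, CPU only; torch is already imported in the preprocessing section):

mha = MultiHeadAttentionLayer(hid_dim=32, n_heads=4, dropout=0.1, device='cpu')
q = torch.randn(2, 7, 32)            # [batch size, query len, hid dim]
out, attn = mha(q, q, q, mask=None)  # self-attention: query = key = value
print(out.shape)                     # torch.Size([2, 7, 32])
print(attn.shape)                    # [batch size, n heads, query len, key len] -> torch.Size([2, 4, 7, 7])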

positionwise-feedforward

Another main block inside the encoder layer is the position-wise feedforward layer, which is simpler than multi-head attention. The input is transformed from hid_dim up to pf_dim, where pf_dim is usually much larger than hid_dim; the original Transformer uses hid_dim = 512 and pf_dim = 2048. A ReLU activation and dropout are applied before the representation is transformed back down to hid_dim. Why this block is used is unfortunately never explained in the paper. BERT uses the GELU activation instead, and the choice of GELU is likewise not explained.

class PositionwiseFeedforwardLayer(nn.Module):
    def __init__(self, hid_dim, pf_dim, dropout):
        super().__init__()
        self.fc_1 = nn.Linear(hid_dim, pf_dim)
        self.fc_2 = nn.Linear(pf_dim, hid_dim)
        self.dropout = nn.Dropout(dropout)
    def forward(self, x):
        #x = [batch size, seq len, hid dim]
        x = self.dropout(torch.relu(self.fc_1(x)))
        #x = [batch size, seq len, pf dim]
        x = self.fc_2(x)
        #x = [batch size, seq len, hid dim]
        return x
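
With all encoder-side modules now defined (Encoder, EncoderLayer, MultiHeadAttentionLayer, PositionwiseFeedforwardLayer), a quick end-to-end shape check can be run with toy sizes:

# Toy encoder forward pass: token ids in, contextual representations out.
enc_test = Encoder(input_dim=100, hid_dim=32, n_layers=2, n_heads=4,
                   pf_dim=64, dropout=0.1, device='cpu')
src_toy = torch.randint(0, 100, (2, 9))       # [batch size, src len] of token ids
src_mask_toy = torch.ones(2, 1, 1, 9).bool()  # no padding in this toy batch
print(enc_test(src_toy, src_mask_toy).shape)  # torch.Size([2, 9, 32])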

Decoder

class Decoder(nn.Module):
    def __init__(self, output_dim, hid_dim, n_layers, n_heads, pf_dim, dropout, device, max_length=100):
        super().__init__()
        self.device = device
        self.tok_embedding = nn.Embedding(output_dim, hid_dim)
        self.pos_embedding = nn.Embedding(max_length, hid_dim)
        self.layers = nn.ModuleList([DecoderLayer(hid_dim,
                                                  n_heads,
                                                  pf_dim,
                                                  dropout,
                                                  device)
                                     for _ in range(n_layers)])
        self.fc_out = nn.Linear(hid_dim, output_dim)
        self.dropout = nn.Dropout(dropout)
        self.scale = torch.sqrt(torch.FloatTensor([hid_dim])).to(device)
    def forward(self, trg, enc_src, trg_mask, src_mask):
        # trg = [batch size, trg len]
        # enc_src = [batch size, src len, hid dim]
        # trg_mask = [batch size, 1, trg len, trg len]
        # src_mask = [batch size, 1, 1, src len]
        batch_size = trg.shape[0]
        trg_len = trg.shape[1]
        pos = torch.arange(0, trg_len).unsqueeze(0).repeat(batch_size, 1).to(self.device)
        # pos = [batch size, trg len]
        trg = self.dropout((self.tok_embedding(trg) * self.scale) + self.pos_embedding(pos))
        # trg = [batch size, trg len, hid dim]
        for layer in self.layers:
            trg, attention = layer(trg, enc_src, trg_mask, src_mask)
        # trg = [batch size, trg len, hid dim]
        # attention = [batch size, n heads, trg len, src len]
        output = self.fc_out(trg)
        # output = [batch size, trg len, output dim]
        return output, attention


Decoder layer

class DecoderLayer(nn.Module):
    def __init__(self, hid_dim, n_heads, pf_dim, dropout, device):
        super().__init__()
        self.self_attn_layer_norm = nn.LayerNorm(hid_dim)
        self.enc_attn_layer_norm = nn.LayerNorm(hid_dim)
        self.ff_layer_norm = nn.LayerNorm(hid_dim)
        self.self_attention = MultiHeadAttentionLayer(hid_dim, n_heads, dropout, device)
        self.encoder_attention = MultiHeadAttentionLayer(hid_dim, n_heads, dropout, device)
        self.positionwise_feedforward = PositionwiseFeedforwardLayer(hid_dim,
                                                                     pf_dim,
                                                                     dropout)
        self.dropout = nn.Dropout(dropout)
    def forward(self, trg, enc_src, trg_mask, src_mask):
        # trg = [batch size, trg len, hid dim]
        # enc_src = [batch size, src len, hid dim]
        # trg_mask = [batch size, 1, trg len, trg len]
        # src_mask = [batch size, 1, 1, src len]
        # self attention
        _trg, _ = self.self_attention(trg, trg, trg, trg_mask)
        # dropout, residual connection and layer norm
        trg = self.self_attn_layer_norm(trg + self.dropout(_trg))
        # trg = [batch size, trg len, hid dim]
        # encoder attention
        _trg, attention = self.encoder_attention(trg, enc_src, enc_src, src_mask)
        # dropout, residual connection and layer norm
        trg = self.enc_attn_layer_norm(trg + self.dropout(_trg))
        # trg = [batch size, trg len, hid dim]
        # positionwise feedforward
        _trg = self.positionwise_feedforward(trg)
        # dropout, residual and layer norm
        trg = self.ff_layer_norm(trg + self.dropout(_trg))
        # trg = [batch size, trg len, hid dim]
        # attention = [batch size, n heads, trg len, src len]
        return trg, attention
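
A matching decoder-side shape check with toy sizes; the random tensor standing in for the encoder output only needs to have the right shape:

# Toy decoder forward pass: target token ids plus (fake) encoder output.
dec_test = Decoder(output_dim=100, hid_dim=32, n_layers=2, n_heads=4,
                   pf_dim=64, dropout=0.1, device='cpu')
trg_toy = torch.randint(0, 100, (2, 6))             # [batch size, trg len]
trg_mask_toy = torch.tril(torch.ones(6, 6)).bool()  # causal mask, broadcast over batch and heads
enc_out_toy = torch.randn(2, 9, 32)                 # stand-in for encoder output [batch, src len, hid dim]
src_mask_toy = torch.ones(2, 1, 1, 9).bool()
out_toy, attn_toy = dec_test(trg_toy, enc_out_toy, trg_mask_toy, src_mask_toy)
print(out_toy.shape)   # torch.Size([2, 6, 100])
print(attn_toy.shape)  # torch.Size([2, 4, 6, 9])  -- [batch size, n heads, trg len, src len]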

Seq2Seq: wrapping the Transformer

class Seq2Seq(nn.Module):
    def __init__(self, encoder, decoder, src_pad_idx, trg_pad_idx, device):
        super().__init__()
        self.encoder = encoder
        self.decoder = decoder
        self.src_pad_idx = src_pad_idx
        self.trg_pad_idx = trg_pad_idx
        self.device = device
    def make_src_mask(self, src):
        # src = [batch size, src len]
        src_mask = (src != self.src_pad_idx).unsqueeze(1).unsqueeze(2)
        # src_mask = [batch size, 1, 1, src len]
        return src_mask
    def make_trg_mask(self, trg):
        # trg = [batch size, trg len]
        trg_pad_mask = (trg != self.trg_pad_idx).unsqueeze(1).unsqueeze(2)
        # trg_pad_mask = [batch size, 1, 1, trg len]
        trg_len = trg.shape[1]
        trg_sub_mask = torch.tril(torch.ones((trg_len, trg_len), device=self.device)).bool()
        # trg_sub_mask = [trg len, trg len]
        trg_mask = trg_pad_mask & trg_sub_mask
        # trg_mask = [batch size, 1, trg len, trg len]
        return trg_mask
    def forward(self, src, trg):
        # src = [batch size, src len]
        # trg = [batch size, trg len]
        src_mask = self.make_src_mask(src)
        trg_mask = self.make_trg_mask(trg)
        # src_mask = [batch size, 1, 1, src len]
        # trg_mask = [batch size, 1, trg len, trg len]
        enc_src = self.encoder(src, src_mask)
        # enc_src = [batch size, src len, hid dim]
        output, attention = self.decoder(trg, enc_src, trg_mask, src_mask)
        # output = [batch size, trg len, output dim]
        # attention = [batch size, n heads, trg len, src len]
        return output, attention
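
To see what make_trg_mask produces, its operations can be replayed on a tiny hand-made target batch (pad index 0 is chosen only for this illustration):

# Combined causal + padding target mask for one 4-token sentence ending in padding.
trg_demo = torch.tensor([[5, 7, 9, 0]])               # last position is padding (index 0)
pad_mask = (trg_demo != 0).unsqueeze(1).unsqueeze(2)  # [1, 1, 1, 4]
sub_mask = torch.tril(torch.ones(4, 4)).bool()        # lower-triangular causal mask
print((pad_mask & sub_mask)[0, 0].int())
# tensor([[1, 0, 0, 0],
#         [1, 1, 0, 0],
#         [1, 1, 1, 0],
#         [1, 1, 1, 0]], dtype=torch.int32)
# each position may only attend to earlier, non-padding positions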


Instantiating the model

INPUT_DIM = len(SRC.vocab)
OUTPUT_DIM = len(TRG.vocab)
HID_DIM = 256
ENC_LAYERS = DEC_LAYERS = 3
ENC_HEADS = DEC_HEADS = 8
ENC_PF_DIM = DEC_PF_DIM = 512
ENC_DROPOUT = DEC_DROPOUT = 0.1
enc = Encoder(INPUT_DIM, 
              HID_DIM, 
              ENC_LAYERS, 
              ENC_HEADS, 
              ENC_PF_DIM, 
              ENC_DROPOUT, 
              device)
dec = Decoder(OUTPUT_DIM, 
              HID_DIM, 
              DEC_LAYERS, 
              DEC_HEADS, 
              DEC_PF_DIM, 
              DEC_DROPOUT, 
              device)
SRC_PAD_IDX = SRC.vocab.stoi[SRC.pad_token]
TRG_PAD_IDX = TRG.vocab.stoi[TRG.pad_token]
model = Seq2Seq(enc, dec, SRC_PAD_IDX, TRG_PAD_IDX, device).to(device)


Initializing parameters

def count_parameters(model):
    return sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f'The model has {count_parameters(model):,} trainable parameters')
def initialize_weights(m):
    if hasattr(m, 'weight') and m.weight.dim() > 1:
        nn.init.xavier_uniform_(m.weight.data)
model.apply(initialize_weights)

Training and evaluating the model

LEARNING_RATE = 0.0005
optimizer = torch.optim.Adam(model.parameters(), lr = LEARNING_RATE)
criterion = nn.CrossEntropyLoss(ignore_index = TRG_PAD_IDX)
def train(model, iterator, optimizer, criterion, clip):
    model.train()
    epoch_loss = 0
    for i, batch in enumerate(iterator):
        src = batch.src
        trg = batch.trg
        optimizer.zero_grad()
        output, _ = model(src, trg[:,:-1])
        #output = [batch size, trg len - 1, output dim]
        #trg = [batch size, trg len]
        output_dim = output.shape[-1]
        output = output.contiguous().view(-1, output_dim)
        trg = trg[:,1:].contiguous().view(-1)
        #output = [batch size * trg len - 1, output dim]
        #trg = [batch size * trg len - 1]
        loss = criterion(output, trg)
        loss.backward()
        torch.nn.utils.clip_grad_norm_(model.parameters(), clip)
        optimizer.step()
        epoch_loss += loss.item()
    return epoch_loss / len(iterator)
def evaluate(model, iterator, criterion):
    model.eval()
    epoch_loss = 0
    with torch.no_grad():
        for i, batch in enumerate(iterator):
            src = batch.src
            trg = batch.trg
            output, _ = model(src, trg[:,:-1])
            #output = [batch size, trg len - 1, output dim]
            #trg = [batch size, trg len]
            output_dim = output.shape[-1]
            output = output.contiguous().view(-1, output_dim)
            trg = trg[:,1:].contiguous().view(-1)
            #output = [batch size * trg len - 1, output dim]
            #trg = [batch size * trg len - 1]
            loss = criterion(output, trg)
            epoch_loss += loss.item()
    return epoch_loss / len(iterator)


Start training

def epoch_time(start_time, end_time):
    elapsed_time = end_time - start_time
    elapsed_mins = int(elapsed_time / 60)
    elapsed_secs = int(elapsed_time - (elapsed_mins * 60))
    return elapsed_mins, elapsed_secs
N_EPOCHS = 10
CLIP = 1
best_valid_loss = float('inf')
for epoch in range(N_EPOCHS):
    start_time = time.time()
    train_loss = train(model, train_iterator, optimizer, criterion, CLIP)
    valid_loss = evaluate(model, valid_iterator, criterion)
    end_time = time.time()
    epoch_mins, epoch_secs = epoch_time(start_time, end_time)
    if valid_loss < best_valid_loss:
        best_valid_loss = valid_loss
        torch.save(model.state_dict(), 'tut6-model.pt')
    print(f'Epoch: {epoch+1:02} | Time: {epoch_mins}m {epoch_secs}s')
    print(f'\tTrain Loss: {train_loss:.3f} | Train PPL: {math.exp(train_loss):7.3f}')
    print(f'\t Val. Loss: {valid_loss:.3f} |  Val. PPL: {math.exp(valid_loss):7.3f}')


Training results

| Epoch: 001 | Time: 0m 53s | Train Loss: 5.947 | Train PPL: 382.509 | Val. Loss: 4.110 | Val. PPL:  60.939 |
| Epoch: 002 | Time: 0m 53s | Train Loss: 3.772 | Train PPL:  43.474 | Val. Loss: 3.196 | Val. PPL:  24.446 |
| Epoch: 003 | Time: 0m 53s | Train Loss: 3.127 | Train PPL:  22.811 | Val. Loss: 2.806 | Val. PPL:  16.538 |
| Epoch: 004 | Time: 0m 54s | Train Loss: 2.762 | Train PPL:  15.824 | Val. Loss: 2.570 | Val. PPL:  13.060 |
| Epoch: 005 | Time: 0m 53s | Train Loss: 2.507 | Train PPL:  12.263 | Val. Loss: 2.413 | Val. PPL:  11.162 |
| Epoch: 006 | Time: 0m 53s | Train Loss: 2.313 | Train PPL:  10.104 | Val. Loss: 2.323 | Val. PPL:  10.209 |
| Epoch: 007 | Time: 0m 54s | Train Loss: 2.186 | Train PPL:   8.901 | Val. Loss: 2.310 | Val. PPL:  10.072 |
| Epoch: 008 | Time: 0m 53s | Train Loss: 2.103 | Train PPL:   8.191 | Val. Loss: 2.283 | Val. PPL:   9.807 |
| Epoch: 009 | Time: 0m 53s | Train Loss: 2.057 | Train PPL:   7.820 | Val. Loss: 2.307 | Val. PPL:  10.043 |
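
Greedy decoding (inference)

After training, the model can translate a single sentence with a minimal greedy-decoding sketch. This assumes the trained model, the SRC/TRG fields, spacy_de and device defined above; at each step it simply picks the most probable next token (no beam search).

def translate_sentence(sentence, src_field, trg_field, model, device, max_len=50):
    model.eval()
    # tokenize and lowercase the German input, then add <sos>/<eos>
    tokens = [tok.text.lower() for tok in spacy_de.tokenizer(sentence)]
    tokens = [src_field.init_token] + tokens + [src_field.eos_token]
    src_indexes = [src_field.vocab.stoi[tok] for tok in tokens]
    src_tensor = torch.LongTensor(src_indexes).unsqueeze(0).to(device)
    src_mask = model.make_src_mask(src_tensor)
    with torch.no_grad():
        enc_src = model.encoder(src_tensor, src_mask)
    # start with <sos> and repeatedly feed the generated prefix back into the decoder
    trg_indexes = [trg_field.vocab.stoi[trg_field.init_token]]
    for _ in range(max_len):
        trg_tensor = torch.LongTensor(trg_indexes).unsqueeze(0).to(device)
        trg_mask = model.make_trg_mask(trg_tensor)
        with torch.no_grad():
            output, attention = model.decoder(trg_tensor, enc_src, trg_mask, src_mask)
        pred_token = output.argmax(2)[:, -1].item()   # most probable next token
        trg_indexes.append(pred_token)
        if pred_token == trg_field.vocab.stoi[trg_field.eos_token]:
            break
    trg_tokens = [trg_field.vocab.itos[i] for i in trg_indexes]
    return trg_tokens[1:], attention   # drop the leading <sos>

# Example usage on an arbitrary German sentence:
# print(translate_sentence("ein mann geht die straße entlang .", SRC, TRG, model, device)[0])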