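MATLAB: Predicting A Song of Ice and Fire with an LSTM Network
This post mainly shows how to do deep learning in MATLAB and then use the trained network model to generate text. I will not comment on the quality of the "predictions" (in short, they are not accurate).
1. Reading the data
First, prepare the text to train on: volumes 1-5 of A Song of Ice and Fire. I found the text online and deleted the Chinese content (Chinese characters, Chinese punctuation, and so on). Make sure the files are encoded as UTF-8 to avoid garbled characters. If they are not, you can re-save them with VS Code.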
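If you would rather stay in MATLAB than re-save the files in an editor, the conversion can be sketched roughly as follows. This is only a sketch under assumptions: the source encoding (`'GBK'` here) is a guess that you must replace with the real encoding of your files, and the temporary file names are hypothetical stand-ins for the book files.

```matlab
% Assumption: the source file is GBK-encoded; change srcEncoding to match your file.
srcEncoding = 'GBK';
srcFile = [tempname '.txt'];   % hypothetical stand-in for one of the book files
dstFile = [tempname '.txt'];   % hypothetical UTF-8 output file

% Create a small sample file in the source encoding so the sketch is self-contained.
sample = ['A Song of Ice and Fire ' char(20912)];   % char(20912) is U+51B0
fid = fopen(srcFile,'w','n',srcEncoding);
fwrite(fid,sample,'char');
fclose(fid);

% Read the characters using the source encoding ...
fid = fopen(srcFile,'r','n',srcEncoding);
txt = fread(fid,'*char')';
fclose(fid);

% ... and write the same characters back out as UTF-8.
fid = fopen(dstFile,'w','n','UTF-8');
fwrite(fid,txt,'char');
fclose(fid);
```

Reading the output file back with `'UTF-8'` and comparing it with the original text is a quick way to confirm that nothing was garbled.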
The code below is for reference only.
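function [iceAndFire,XTrain,YTrain] = readData()
    parNum = 0; % number of paragraphs kept
    newlineChar = compose("\x00B6");   % stands in for the newline character
    spaceChar = compose("\x00B7");     % stands in for the space character
    endofTextChar = compose("\x2403"); % end-of-text marker
    tabChar = char(compose("\x3000")); % U+3000 (ideographic space)
    for ii = 1:5
        fid = fopen(['冰与火之歌' num2str(ii) '.txt'],'rt','n','UTF-8');
        while(~feof(fid))
            tmp = fgets(fid);
            % Remove the tabChar (U+3000) characters
            tmp(tmp == tabChar) = [];
            % Strip leading spaces
            while length(tmp)>=20 && tmp(1) == ' '
                tmp(1) = [];
            end
            if length(tmp)>=20 % discard lines shorter than 20 characters
                % categorical cannot represent newlines and spaces,
                % so replace them with special characters first.
                % If the text contains other characters that cannot be
                % recognized, it is best to delete them by hand.
                tmp = replace(tmp,[newline " "],[newlineChar spaceChar]);
                parNum = parNum+1;
                iceAndFire{parNum} = tmp;
                % Target sequence: the input shifted by one character,
                % with the end-of-text marker appended
                charShifted = [cellstr(tmp(2:end)')' endofTextChar];
                XTrain{parNum} = double(tmp);
                YTrain{parNum} = categorical(charShifted);
                seqLength(parNum) = size(XTrain{parNum},2);
            end
        end
        fclose(fid);
    end
end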
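To make the input/target construction in readData easier to follow, here is a minimal sketch on a toy paragraph. The text 'Jon' is purely illustrative; the three statements that build X and Y mirror the ones in the function above.

```matlab
endOfTextChar = compose("\x2403");   % same end-of-text marker as in readData
tmp = 'Jon';                         % toy paragraph (illustrative only)

% Input: one character code per time step
X = double(tmp);

% Target: the next character at each time step, ending with the marker
charShifted = [cellstr(tmp(2:end)')' endOfTextChar];
Y = categorical(charShifted);

disp(X)
disp(Y)
```

At every position the network sees the code of the current character and is trained to predict the character that follows it, so X and Y always have the same length.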
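Note: make sure YTrain contains no unrecognized characters, that is, no <undefined> values. The following line displays the values that occur in YTrain (without a trailing semicolon, so that the result is printed).
unique([YTrain{:}])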
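A programmatic version of this check can be sketched with isundefined. The toy cell array below is a hypothetical stand-in for real training data; it also shows why the space and newline characters are replaced before calling categorical.

```matlab
% Toy characters (illustrative only), including a space and a newline
raw = {'a',' ','b',newline};

% Without replacement, whitespace entries end up as <undefined>
y = categorical(raw);
hasUndefined = any(isundefined(y));

% With the same replacement used in readData, every entry is a valid category
fixed = replace(raw,[newline " "],[compose("\x00B6") compose("\x00B7")]);
yFixed = categorical(fixed);
hasUndefinedFixed = any(isundefined(yFixed));

fprintf('undefined before: %d, after: %d\n',hasUndefined,hasUndefinedFixed);
```

On the real data, the equivalent check would be any(isundefined([YTrain{:}])), which should return false.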
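In the end, after removing paragraphs shorter than 20 characters (mostly headings), 37,945 paragraphs remain. Paragraph lengths range from 20 to more than 2,000 characters.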
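These figures can be reproduced from the training data with a few lines. The sketch below uses two toy paragraphs in place of the real XTrain, so the numbers it prints are not the ones quoted above.

```matlab
% Toy stand-in for XTrain (illustrative only): one row of character codes per paragraph
p1 = 'Winter is coming.';
p2 = 'A Lannister always pays his debts.';
XTrain = {double(p1), double(p2)};

% Length of every paragraph, then the summary figures
seqLength = cellfun(@(x) size(x,2), XTrain);
fprintf('%d paragraphs, %d to %d characters each\n', ...
    numel(seqLength), min(seqLength), max(seqLength));
```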
Let's take a look at the word cloud.
wordcloud(iceAndFire)
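2. Building and training the network
The network mainly consists of a word embedding layer, an LSTM layer with 400 hidden units, a dropout layer, and a fully connected layer, among others.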
function net = createAndTrainNet(XTrain,YTrain,numWords)
    inputSize = size(XTrain{1},1);
    numClasses = numel(categories([YTrain{:}]));
    layers = [sequenceInputLayer(inputSize)
        wordEmbeddingLayer(300,numWords)
        lstmLayer(400,'OutputMode','sequence')
        dropoutLayer(0.2)
        fullyConnectedLayer(numClasses)
        softmaxLayer
        classificationLayer];
    options = trainingOptions('adam', ...
        'MiniBatchSize',32,...
        'ExecutionEnvironment','cpu',...
        'InitialLearnRate',0.01, ...
        'GradientThreshold',1, ...
        'Shuffle','never', ...
        'Plots','training-progress', ...
        'Verbose',false);
    net = trainNetwork(XTrain,YTrain,layers,options);
end
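The role of the embedding layer, and the reason numWords is taken as the largest character code, can be illustrated with a plain matrix lookup. This is only a conceptual sketch and not how wordEmbeddingLayer is implemented internally; the matrix E, the toy dimension 4, and the toy text are all assumptions made for illustration.

```matlab
rng(0);                              % reproducible toy weights
embeddingDim = 4;                    % toy size; the network above uses 300
numWords = double(max('Jon'));       % largest character code in the toy text
E = randn(embeddingDim,numWords);    % one column per possible character code

X = double('Jon');                   % 1-by-3 sequence of character codes
V = E(:,X);                          % 4-by-3: one embedding vector per character

disp(size(V))
```

Each character code acts as a column index, so the table needs at least as many columns as the largest code that can appear in the text.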
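Training ran overnight, for about 10 hours. When I came back, the loss and accuracy had essentially stopped changing, so I stopped training manually.
3. Predicting with the trained model
The generation function predictNewPara takes the first character as an input; you can also pick one at random from the first characters of all the training paragraphs.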
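Picking the first character at random from the training paragraphs could look like the sketch below. The three toy paragraphs stand in for the real iceAndFire cell array and are assumptions for illustration.

```matlab
% Toy stand-in for iceAndFire (illustrative only)
iceAndFire = {'Jon felt the cold wind.', ...
              'Arya ran through the yard.', ...
              'Jaime drew his sword.'};

% First character of every paragraph, collected into one char row vector
firstChars = cellfun(@(s) s(1), iceAndFire);

% Draw one of them uniformly at random
firstChar = firstChars(randi(numel(firstChars)));
disp(firstChar)
```

Because every paragraph contributes one entry, characters that start many paragraphs are drawn proportionally more often.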
function [genText,net] = predictNewPara(net,firstChar)
    % Maximum number of characters in a paragraph
    maxLength = 2000;
    genText = firstChar;
    X = double(char(firstChar));
    % Vocabulary
    vocabulary = string(net.Layers(end).ClassNames);
    newlineChar = compose("\x00B6");
    spaceChar = compose("\x00B7");
    endOfTextChar = compose("\x2403");
    while strlength(genText) < maxLength
        [net,charScores] = predictAndUpdateState(net,X,'ExecutionEnvironment','cpu');
        % Draw the next character at random from the vocabulary,
        % using charScores as the sampling weights
        newChar = datasample(vocabulary,1,'Weights',charScores);
        % Stop generating once the end-of-text character appears
        if newChar == endOfTextChar
            break;
        end
        genText = genText + newChar;
        X = double(char(newChar));
    end
    % Convert the placeholders back to newlines and spaces
    genText = replace(genText,[newlineChar spaceChar],[newline " "]);
end
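The weighted draw in predictNewPara is what makes each generated paragraph different. The sketch below illustrates the idea in base MATLAB with a toy three-character vocabulary and made-up scores. It is not how datasample is implemented; it only shows what sampling with weights means compared with always taking the most likely character.

```matlab
rng(1);                              % reproducible draws
vocabulary = ["a" "b" "c"];          % toy vocabulary (illustrative only)
charScores = [0.7 0.2 0.1];          % made-up softmax scores that sum to 1
numDraws = 10000;

% Weighted sampling: split [0,1] into bins whose widths equal the scores
edges = [0 cumsum(charScores)];
edges(end) = 1;                      % guard against rounding in cumsum
idx = discretize(rand(1,numDraws),edges);
samples = vocabulary(idx);

% Empirical frequencies should be close to the scores
freq = arrayfun(@(v) mean(samples == v), vocabulary);
disp(freq)

% Greedy alternative: always pick the highest-scoring character
greedy = vocabulary(charScores == max(charScores));
```

Greedy selection would always return the same character for the same state, so sampling is the usual choice when varied text is wanted.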
Below is the text generated with "J" as the first character. (Yes, I wanted to read Jon's story.)
Jon felt slain and shield blade, a sleeve, sharp short hovels, the point of blood staggering over the low voy of that. Winterfell cammons wobbles could get for him, but that was nothing of that Archer with a lust, and others deepened bathing about wights on the deserter. Once the longhall lumber, he supposed that he had skined you sister in winters for the other we can oldtend was slow cuts among the wedding, gutsing it over back in children.
Here is another one, this time about Jaime.
Jaime and Tyrion woke after the green from the night from behind the ice of a man's bow. Arya Tully with bees and black green emptiers led them and a traders with fellow-tower as the pogot while Margaery began to rust.
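Note: this is not really a continuation of the story. At most, the model imitates Martin's writing style and produces a passage of text related to A Song of Ice and Fire.
Finally, here is the main script.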
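% Read the text files and convert them
[iceAndFire,XTrain,YTrain] = readData();
% Largest character code in the corpus, used as the embedding vocabulary size
numWords = max([iceAndFire{:}]);
% Create and train the network
net = createAndTrainNet(XTrain,YTrain,numWords);
% Generate new text
fid = fopen('new.txt','a+t','n','UTF-8');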
for ii = 1:10