MATLAB Text Analytics: 03: Extract Text Data from Files
This example shows how to extract text data from text, HTML, Microsoft® Word, PDF, CSV, and Microsoft Excel® files and import it into MATLAB® for analysis.
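Usually, the easiest way to import text data into MATLAB is to use the function extractFileText. This function extracts text data from text, PDF, HTML, and Microsoft Word files.
To import text from CSV and Microsoft Excel files, use readtable.
To extract text from HTML code, use extractHTMLText.
To read data from PDF forms, use readPDFFormData.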
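Which function to call depends mostly on the file type. As a rough sketch of that decision, the snippet below picks an import function from the file extension. The extension lists are assumptions made for illustration, not a complete statement of what each function supports, and the snippet assumes that sonnets.txt (the file used in the next section) is on the MATLAB path.

```matlab
% Sketch: choose an import function from the file extension.
filename = "sonnets.txt";            % example file used in the next section
[~,~,ext] = fileparts(filename);
ext = char(lower(ext));              % compare as a lowercase char vector

switch ext
    case {'.txt','.pdf','.html','.docx'}
        % Plain text, PDF, HTML, and Word files: returns the text as a string.
        data = extractFileText(filename);
    case {'.csv','.xlsx','.xls'}
        % Tabular files: returns a table whose text columns are strings.
        data = readtable(filename,'TextType','string');
    otherwise
        error("Unsupported file extension: %s",ext);
end
```

For the text branch, data is a single string; for the table branch, the text still has to be pulled out of the relevant table variable.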
Text Files
Use extractFileText to extract the text from sonnets.txt. The file sonnets.txt contains Shakespeare's sonnets in plain text.
filename = "sonnets.txt";
str = extractFileText(filename);
View the first sonnet by extracting the text between the two titles "I" and "II".
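How extractBetween treats its delimiters can be checked on a small made-up string before running it on the real file. The sketch below uses placeholder poems rather than the sonnets. Appending newline to the start delimiter makes it match a whole title line, so the extracted text begins at the first line of the poem.

```matlab
% Made-up text with Roman-numeral titles, standing in for sonnets.txt.
str = " I" + newline + "First poem." + newline + ...
      " II" + newline + "Second poem." + newline + ...
      " III" + newline + "Third poem.";

% The start delimiter ends with a newline, so it matches the title line
% " II" and the extracted text starts right after that line.
start = " II" + newline;
fin = " III";
poem2 = extractBetween(str,start,fin);

% Remove the whitespace (here a trailing newline) left before the end delimiter.
poem2 = strip(poem2)
```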
start = " I" + newline;
fin = " II";
sonnet1 = extractBetween(str,start,fin)
sonnet1 =
"From fairest creatures we desire increase,
That thereby beauty's rose might never die,
But as the riper should by time decease,
His tender heir might bear his memory:
But thou, contracted to thine own bright eyes,
Feed'st thy light's flame with self-substantial fuel,
Making a famine where abundance lies,
Thy self thy foe, to thy sweet self too cruel:
Thou that art now the world's fresh ornament,
And only herald to the gaudy spring,
Within thine own bud buriest thy content,
And tender churl mak'st waste in niggarding:
Pity the world, or else this glutton be,
To eat the world's due, by the grave and thee.
"
For text files that contain multiple documents separated by newline characters, use the readlines function.
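A minimal, self-contained sketch of this one-document-per-line layout follows. It writes a temporary file with made-up content (the file name and text are placeholders, not part of the example data) and reads it back with readlines.

```matlab
% Write a small temporary file with one made-up document per line.
tmpFile = [tempname '.txt'];
fid = fopen(tmpFile,'w');
fprintf(fid,'%s\n','first document','second document','third document');
fclose(fid);

% readlines returns a string array with one element per line of the file.
docs = readlines(tmpFile);

% A trailing newline in the file can produce an empty final element,
% so drop zero-length entries before further processing.
docs(strlength(docs) == 0) = [];
numDocs = numel(docs)

% Clean up the temporary file.
delete(tmpFile);
```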
filename = "multilineSonnets.txt";
str = readlines(filename)
str = 3×1 string
"From fairest creatures we desire increase, That thereby beauty's rose might never die, But as the riper should by time decease, His tender heir might bear his memory: But thou, contracted to thine own bright eyes, Feed'st thy light's flame with self-substantial fuel, Making a famine where abundance lies, Thy self thy foe, to thy sweet self too cruel: Thou that art now the world's fresh ornament, And only herald to the gaudy spring, Within thine own bud buriest thy content, And tender churl mak'st waste in niggarding: Pity the world, or else this glutton be, To eat the world's due, by the grave and thee."
"When forty winters shall besiege thy brow, And dig deep trenches in thy beauty's field, Thy youth's proud livery so gazed on now, Will be a tatter'd weed of small worth held: Then being asked, where all thy beauty lies, Where all the treasure of thy lusty days; To say, within thine own deep sunken eyes, Were an all-eating shame, and thriftless praise. How much more praise deserv'd thy beauty's use, If thou couldst answer 'This fair child of mine Shall sum my count, and make my old excuse,' Proving his beauty by succession thine! This were to be new made when thou art old, And see thy blood warm when thou feel'st it cold."
"Look in thy glass and tell the face thou viewest Now is the time that face should form another; Whose fresh repair if now thou not renewest, Thou dost beguile the world, unbless some mother. For where is she so fair whose unear'd womb Disdains the tillage of thy husbandry? Or who is he so fond will be the tomb, Of his self-love to stop posterity? Thou art thy mother's glass and she in thee Calls back the lovely April of her prime; So thou through windows of thine age shalt see, Despite of wrinkles this thy golden time. But if thou live, remember'd not to be, Die single and thine image dies with thee."
Microsoft Word Documents
Use the function extractFileText to extract the text from exampleSonnets.docx. The Microsoft Word document exampleSonnets.docx contains Shakespeare's sonnets.
filename = "exampleSonnets.docx";
str = extractFileText(filename);
View the second sonnet by extracting the text between the two titles "II" and "III".
start = " II" + newline;
fin = " III";
sonnet2 = extractBetween(str,start,fin)
sonnet2 =
"When forty winters shall besiege thy brow,
And dig deep trenches in thy beauty's field,
Thy youth's proud livery so gazed on now,
Will be a tatter'd weed of small worth held:
Then being asked, where all thy beauty lies,
Where all the treasure of thy lusty days;
To say, within thine own deep sunken eyes,
Were an all-eating shame, and thriftless praise.
How much more praise deserv'd thy beauty's use,
If thou couldst answer 'This fair child of mine
Shall sum my count, and make my old excuse,'
Proving his beauty by succession thine!
This were to be new made when thou art old,
And see thy blood warm when thou feel'st it cold.
"
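The example Microsoft Word document uses two newline characters between each line. To replace each pair with a single newline character, use the replace function.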
sonnet2 = replace(sonnet2,[newline newline],newline)
sonnet2 =
"When forty winters shall besiege thy brow,
And dig deep trenches in thy beauty's field,
Thy youth's proud livery so gazed on now,
Will be a tatter'd weed of small worth held:
Then being asked, where all thy beauty lies,
Where all the treasure of thy lusty days;
To say, within thine own deep sunken eyes,
Were an all-eating shame, and thriftless praise.
How much more praise deserv'd thy beauty's use,
If thou couldst answer 'This fair child of mine
Shall sum my count, and make my old excuse,'
Proving his beauty by succession thine!
This were to be new made when thou art old,
And see thy blood warm when thou feel'st it cold.
"
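PDF Files
Extract text from PDF documents and data from PDF forms.
PDF Documents
Use the function extractFileText to extract the text from exampleSonnets.pdf. The file exampleSonnets.pdf contains Shakespeare's sonnets in PDF format.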
filename = "exampleSonnets.pdf";
str = extractFileText(filename);
View the third sonnet by extracting the text between the two titles "III" and "IV". This PDF has a space before each newline character.
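One alternative to building the extra space into the delimiter is to normalize the text first, so that the same delimiters work as for the other file formats. The sketch below shows the idea on a made-up string that imitates the layout; it is an illustration, not the approach this example takes.

```matlab
% Made-up text imitating the PDF layout: a space precedes each newline.
raw = " III " + newline + "First line " + newline + ...
      "Second line " + newline + "IV";

% Remove the space that precedes each newline character.
clean = replace(raw," " + newline,newline);

% The start delimiter no longer needs the trailing space.
sonnet = extractBetween(clean," III" + newline,"IV")
```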
start = " III " + newline;
fin = "IV";
sonnet3 = extractBetween(str,start,fin)
sonnet3 =
"Look in thy glass and tell the face thou viewest
Now is the time that face should form another;
Whose fresh repair if now thou not renewest,
Thou dost beguile the world, unbless some mother.
For where is she so fair whose unear'd womb
Disdains the tillage of thy husbandry?
Or who is he so fond will be the tomb,
Of his self-love to stop posterity?
Thou art thy mother's glass and she in thee
Calls back the lovely April of her prime;
So thou through windows of thine age shalt see,
Despite of wrinkles this thy golden time.
But if thou live, remember'd not to be,
Die single and thine image dies with thee.
"
PDF Forms
To read text data from a PDF form, use the readPDFFormData function. The function returns a struct that contains the data from the PDF form fields.
filename = "weatherReportForm1.pdf";
data = readPDFFormData(filename)
data = struct with fields:
event_type: "Thunderstorm Wind"
event_narrative: "Large tree down between Plantersville and Nettleton."
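Once the form data is in a struct, each form field is available as a struct field. The sketch below builds a stand-in struct from the two values shown above, so that it runs without the PDF file, and then reads the fields by name.

```matlab
% Stand-in for the struct returned by readPDFFormData, built from the
% values shown in the output above.
data = struct('event_type',"Thunderstorm Wind", ...
    'event_narrative',"Large tree down between Plantersville and Nettleton.");

% Access a single form field by name.
narrative = data.event_narrative;

% Loop over all fields, for example to check which form fields were filled in.
names = fieldnames(data);
for k = 1:numel(names)
    fprintf("%s: %s\n",names{k},data.(names{k}));
end
```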
HTML
Extract text from HTML files, HTML code, and the web.
HTML Files
To extract text data from a saved HTML file, use extractFileText.
filename = "exampleSonnets.html";
str = extractFileText(filename);
View the fourth sonnet by extracting the text between the two titles "IV" and "V".
start = newline + "IV" + newline;
fin = newline + "V" + newline;
sonnet4 = extractBetween(str,start,fin)
sonnet4 =
"Unthrifty loveliness, why dost thou spend
Upon thy self thy beauty's legacy?
Nature's bequest gives nothing, but doth lend,
And being frank she lends to those are free:
Then, beauteous niggard, why dost thou abuse
The bounteous largess given thee to give?
Profitless usurer, why dost thou use
So great a sum of sums, yet canst not live?
For having traffic with thy self alone,
Thou of thy self thy sweet self dost deceive:
Then how when nature calls thee to be gone,
What acceptable audit canst thou leave?
Thy unused beauty must be tombed with thee,
Which, used, lives th' executor to be.
"
HTML Code
To extract text data from a string containing HTML code, use extractHTMLText.
code = "<html><body><h1>THE SONNETS</h1><p>by William Shakespeare</p></body></html>";
str = extractHTMLText(code)
str =
"THE SONNETS
by William Shakespeare"
From the Web
To extract text data from a web page, first read the HTML code using webread, and then apply extractHTMLText to the returned code.
url = "https://www.mathworks.com/help/textanalytics";
code = webread(url);
str = extractHTMLText(code)
str =
'Text Analytics Toolbox
Analyze and model text data
Release Notes
PDF Documentation
Release Notes
PDF Documentation
Text Analytics Toolbox™ provides algorithms and visualizations for preprocessing, analyzing, and modeling text data. Models created with the toolbox can be used in applications such as sentiment analysis, predictive maintenance, and topic modeling.
Text Analytics Toolbox includes tools for processing raw text from sources such as equipment logs, news feeds, surveys, operator reports, and social media. You can extract text from popular file formats, preprocess raw text, extract individual words, convert text into numerical representations, and build statistical models.
Using machine learning techniques such as LSA, LDA, and word embeddings, you can find clusters and create features from high-dimensional text datasets. Features created with Text Analytics Toolbox can be combined with features from other data sources to build machine learning models that take advantage of textual, numeric, and other types of data.
Get Started
Learn the basics of Text Analytics Toolbox
Text Data Preparation
Import text data into MATLAB® and preprocess it for analysis
Modeling and Prediction
Develop predictive models using topic models and word embeddings
Display and Presentation
Visualize text data and models using word clouds and text scatter plots
Language Support
Information on language support in Text Analytics Toolbox'
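Reading from the web can fail, for example when the network is unavailable or the server is slow. A cautious variant of the call above sets a timeout and catches errors. The 15-second timeout and the empty fallback value are arbitrary choices for this sketch.

```matlab
url = "https://www.mathworks.com/help/textanalytics";

% Give up if the server does not respond within 15 seconds (arbitrary choice).
options = weboptions('Timeout',15);

try
    code = webread(url,options);    % HTML source of the page
    str = extractHTMLText(code);    % visible text without the markup
catch err
    % Report the problem and fall back to empty text instead of stopping.
    warning("Could not read %s: %s",url,err.message);
    str = "";
end
```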
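Parse HTML Code
To find particular elements of HTML code, parse the code using htmlTree and findElement. Parse the HTML code and find all the hyperlinks. The hyperlinks are nodes with the element name "A".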
tree = htmlTree(code);
selector = "A";
subtrees = findElement(tree,selector);
View the first 10 subtrees and extract the text using extractHTMLText.
subtrees(1:10)
ans =
10×1 htmlTree:
<A class="skip_link sr-only" href="#content_container">Skip to content</A>
<A href="https://www.mathworks.com?s_tid=gn_logo" class="svg_link navbar-brand"><IMG src="/images/responsive/global/pic-header-mathworks-logo.svg" class="mw_logo" alt="MathWorks"/></A>
<A href="https://www.mathworks.com/products.html?s_tid=gn_ps">Products</A>
<A href="https://www.mathworks.com/solutions.html?s_tid=gn_sol">Solutions</A>
<A href="https://www.mathworks.com/academia.html?s_tid=gn_acad">Academia</A>
<A href="https://www.mathworks.com/support.html?s_tid=gn_supp">Support</A>
<A href="https://www.mathworks.com/matlabcentral/?s_tid=gn_mlc">Community</A>
<A href="https://www.mathworks.com/company/events.html?s_tid=gn_ev">Events</A>
<A href="https://www.mathworks.com/products/get-matlab.html?s_tid=gn_getml">Get MATLAB</A>
<A href="https://www.mathworks.com?s_tid=gn_logo" class="svg_link pull-left"><IMG src="/images/responsive/global/pic-header-mathworks-logo.svg" class="mw_logo" alt="MathWorks"/></A>
str = extractHTMLText(subtrees);
View the extracted text of the first 10 hyperlinks.
str(1:10)
ans = 10×1 string
"Skip to content"
""
"Products"
"Solutions"
"Academia"
"Support"
"Community"
"Events"
"Get MATLAB"
""
To get the link targets, use getAttribute and specify the attribute "href" (hyperlink reference). Get the link targets of the first 10 subtrees.
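As a compact, self-contained illustration of the whole parsing workflow, the sketch below runs the same functions on a short inline HTML string. The markup and the example.com URLs are made up for the illustration.

```matlab
% Made-up HTML with two hyperlinks.
code = "<html><body>" + ...
    "<a href='https://example.com/a'>First</a>" + ...
    "<a href='https://example.com/b'>Second</a>" + ...
    "</body></html>";

% Parse the code and select every hyperlink element.
tree = htmlTree(code);
links = findElement(tree,"A");

% Collect the visible text and the target of each link.
linkText = extractHTMLText(links);
target = getAttribute(links,"href");

% Show text and target side by side.
T = table(linkText,target)
```

Keeping the link text and the target in one table makes it easy to filter out links with empty text, such as image-only links.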
attr = "href";
str = getAttribute(subtrees(1:10),attr)
str = 10×1 string
"#content_container"
"https://www.mathworks.com?s_tid=gn_logo"
"https://www.mathworks.com/products.html?s_tid=gn_ps"
"https://www.mathworks.com/solutions.html?s_tid=gn_sol"
"https://www.mathworks.com/academia.html?s_tid=gn_acad"
"https://www.mathworks.com/support.html?s_tid=gn_supp"
"https://www.mathworks.com/matlabcentral/?s_tid=gn_mlc"
"https://www.mathworks.com/company/events.html?s_tid=gn_ev"
"https://www.mathworks.com/products/get-matlab.html?s_tid=gn_getml"
"https://www.mathworks.com?s_tid=gn_logo"
CSV and Microsoft Excel Files
To extract text data from CSV and Microsoft Excel files, use readtable, and then extract the text data from the table that it returns.
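The round trip can be tried on a tiny table first. The sketch below writes a two-row CSV file to a temporary location and reads it back. The temporary file name is a placeholder, and the two rows are copied from the factory reports data used in this section.

```matlab
% Build a small table with one text column and one numeric column.
Description = ["Mixer tripped the fuses."; "A fuse is blown in the mixer."];
Cost = [55; 441];
T = table(Description,Cost);

% Write it to a temporary CSV file.
tmpFile = [tempname '.csv'];
writetable(T,tmpFile);

% Read it back, importing text columns as strings rather than
% cell arrays of character vectors.
T2 = readtable(tmpFile,'TextType','string');

% The text data is the Description variable of the table.
str = T2.Description

% Clean up the temporary file.
delete(tmpFile);
```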
Use the readtable function to extract the table data from factoryReports.csv, and view the first few rows of the table.
T = readtable('factoryReports.csv','TextType','string');
head(T)
ans=8×5 table
    Description                                                              Category                Urgency     Resolution              Cost
    _____________________________________________________________________    ____________________    ________    ____________________    _____
    "Items are occasionally getting stuck in the scanner spools."            "Mechanical Failure"    "Medium"    "Readjust Machine"      45
    "Loud rattling and banging sounds are coming from assembler pistons."    "Mechanical Failure"    "Medium"    "Readjust Machine"      35
    "There are cuts to the power when starting the plant."                   "Electronic Failure"    "High"      "Full Replacement"      16200
    "Fried capacitors in the assembler."                                     "Electronic Failure"    "High"      "Replace Components"    352
    "Mixer tripped the fuses."                                               "Electronic Failure"    "Low"       "Add to Watch List"     55
    "Burst pipe in the constructing agent is spraying coolant."              "Leak"                  "High"      "Replace Components"    371
    "A fuse is blown in the mixer."                                          "Electronic Failure"    "Low"       "Replace Components"    441
    "Things continue to tumble off of the belt."                             "Mechanical Failure"    "Low"       "Readjust Machine"      38
Extract the text data from the Description column and view the first few strings.
str = T.Description;
str(1:10)
ans = 10×1 string
"Items are occasionally getting stuck in the scanner spools."
"Loud rattling and banging sounds are coming from assembler pistons."
"There are cuts to the power when starting the plant."
"Fried capacitors in the assembler."
"Mixer tripped the fuses."
"Burst pipe in the constructing agent is spraying coolant."
"A fuse is blown in the mixer."
"Things continue to tumble off of the belt."
"Falling items from the conveyor belt."
"The scanner reel is split, it will soon begin to curve."
Extract Text from Multiple Files
If your text data is contained in multiple files in a folder, then you can import the text data into MATLAB using a file datastore.
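Create a file datastore for the example sonnet text files. The example files are named "exampleSonnetN.txt", where N is the number of the sonnet. Specify the file name using the wildcard "*" to find all file names of this structure. To specify extractFileText as the read function, pass it to fileDatastore as a function handle.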
location = fullfile(matlabroot,"examples","textanalytics","data","exampleSonnet*.txt");
fds = fileDatastore(location,'ReadFcn',@extractFileText)
fds =
FileDatastore with properties:
Files: {
' ...\matlab\examples\textanalytics\data\exampleSonnet1.txt';
' ...\matlab\examples\textanalytics\data\exampleSonnet2.txt';
' ...\matlab\examples\textanalytics\data\exampleSonnet3.txt'
... and 2 more
}
Folders: {
' ...\matlab\examples\textanalytics\data'
}
UniformRead: 0
ReadMode: 'file'
BlockSize: Inf
PreviewFcn: @extractFileText
SupportedOutputFormats: ["txt" "csv" "xlsx" "xls" "parquet" "parq" "png" "jpg" "jpeg" "tif" "tiff" "wav" "flac" "ogg" "mp4" "m4a"]
ReadFcn: @extractFileText
AlternateFileSystemRoots: {}
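The same wildcard pattern can be tried on a throwaway folder. The sketch below creates two small text files in a temporary folder (the folder, the file names, and the contents are all made up), builds a file datastore over them, and previews the first file.

```matlab
% Create a temporary folder with two made-up text files.
folder = tempname;
mkdir(folder);
contents = ["first document","second document"];
for k = 1:numel(contents)
    fid = fopen(fullfile(folder,sprintf('note%d.txt',k)),'w');
    fprintf(fid,'%s',contents(k));
    fclose(fid);
end

% The wildcard selects every file whose name matches note*.txt.
fds = fileDatastore(fullfile(folder,'note*.txt'),'ReadFcn',@extractFileText);

% Check how many files were found and look at the first one.
numFiles = numel(fds.Files)
firstText = preview(fds)

% Clean up the temporary folder and its files.
rmdir(folder,'s');
```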
Loop over the files in the datastore and read each text file.
str = [];