MATLAB Text Analytics 03: Extracting Text Data from Files

吃小羊

To understand the world, start with data

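This example shows how to extract text data from text, HTML, Microsoft® Word, PDF, CSV, and Microsoft Excel® files and import it into MATLAB® for analysis.

Usually, the easiest way to import text data into MATLAB is to use the extractFileText function. This function extracts text data from text, PDF, HTML, and Microsoft Word files.

To import text from CSV and Microsoft Excel files, use readtable.

To extract text from HTML code, use extractHTMLText.

To read data from PDF forms, use readPDFFormData.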
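
The mapping above from file type to import function can be sketched as a small lookup. This is only an illustration: pickImporter is a hypothetical helper, not a MATLAB function, and the extension lists simply restate the formats named in this article. The helper returns a function handle and does not open any file.

```matlab
% Show which import function the overview suggests for a few file names.
files = ["sonnets.txt" "exampleSonnets.pdf" "factoryReports.csv"];
for f = files
    importer = pickImporter(f);
    fprintf("%s -> %s\n", f, func2str(importer));
end

% Hypothetical helper: choose an import function from the file extension.
function importer = pickImporter(filename)
    [~,~,ext] = fileparts(filename);
    ext = lower(string(ext));
    if ismember(ext,[".txt" ".pdf" ".html" ".docx"])
        importer = @extractFileText;   % text, PDF, HTML, Word
    elseif ismember(ext,[".csv" ".xlsx" ".xls"])
        importer = @readtable;         % CSV and Excel
    else
        error("pickImporter:unsupported", ...
            "No importer chosen for '%s' files.",ext);
    end
end
```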

Text Files

Use extractFileText to extract the text from sonnets.txt. The file sonnets.txt contains Shakespeare's sonnets in plain text.

filename = "sonnets.txt";
str = extractFileText(filename);

View the first sonnet by extracting the text between the two titles I and II.

start = " I" + newline;
fin = " II";
sonnet1 = extractBetween(str,start,fin)
sonnet1 = 
       From fairest creatures we desire increase,
       That thereby beauty's rose might never die,
       But as the riper should by time decease,
       His tender heir might bear his memory:
       But thou, contracted to thine own bright eyes,
       Feed'st thy light's flame with self-substantial fuel,
       Making a famine where abundance lies,
       Thy self thy foe, to thy sweet self too cruel:
       Thou that art now the world's fresh ornament,
       And only herald to the gaudy spring,
       Within thine own bud buriest thy content,
       And tender churl mak'st waste in niggarding:
         Pity the world, or else this glutton be,
         To eat the world's due, by the grave and thee.
      "

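Why the start marker ends with newline can be seen on a small made-up string. The text below is an invented stand-in for the sonnets file, not its real content. " I" on its own is also a prefix of " II" and " III", so appending newline ties the marker to the heading line itself.

```matlab
% Made-up miniature of the sonnets file: numbered headings, then text.
str = " I" + newline + "first poem" + newline + ...
      " II" + newline + "second poem" + newline + ...
      " III" + newline + "third poem";

% The start marker includes newline so that it matches only the heading
% line ending right after the numeral.
poem1 = strip(extractBetween(str," I" + newline," II"));
poem2 = strip(extractBetween(str," II" + newline," III"));

disp(poem1)
disp(poem2)
```
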
For text files that contain multiple documents separated by newline characters, use the readlines function.

filename = "multilineSonnets.txt";
str = readlines(filename)
str = 3×1 string
    "From fairest creatures we desire increase, That thereby beauty's rose might never die, But as the riper should by time decease, His tender heir might bear his memory: But thou, contracted to thine own bright eyes, Feed'st thy light's flame with self-substantial fuel, Making a famine where abundance lies, Thy self thy foe, to thy sweet self too cruel: Thou that art now the world's fresh ornament, And only herald to the gaudy spring, Within thine own bud buriest thy content, And tender churl mak'st waste in niggarding: Pity the world, or else this glutton be, To eat the world's due, by the grave and thee."
    "When forty winters shall besiege thy brow, And dig deep trenches in thy beauty's field, Thy youth's proud livery so gazed on now, Will be a tatter'd weed of small worth held: Then being asked, where all thy beauty lies, Where all the treasure of thy lusty days; To say, within thine own deep sunken eyes, Were an all-eating shame, and thriftless praise. How much more praise deserv'd thy beauty's use, If thou couldst answer 'This fair child of mine Shall sum my count, and make my old excuse,' Proving his beauty by succession thine! This were to be new made when thou art old, And see thy blood warm when thou feel'st it cold."
    "Look in thy glass and tell the face thou viewest Now is the time that face should form another; Whose fresh repair if now thou not renewest, Thou dost beguile the world, unbless some mother. For where is she so fair whose unear'd womb Disdains the tillage of thy husbandry? Or who is he so fond will be the tomb, Of his self-love to stop posterity? Thou art thy mother's glass and she in thee Calls back the lovely April of her prime; So thou through windows of thine age shalt see, Despite of wrinkles this thy golden time. But if thou live, remember'd not to be, Die single and thine image dies with thee."

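Files with one document per line often contain blank lines as well. A minimal sketch of dropping them after readlines follows; the file name demoDocs.txt and its contents are invented for illustration and written to the temporary folder first.

```matlab
% Write a small demo file with one document per line and a blank line.
demoFile = fullfile(tempdir,"demoDocs.txt");
fid = fopen(demoFile,"w");
fprintf(fid,"%s",join(["first document" "" "second document"],newline));
fclose(fid);

docs = readlines(demoFile);        % one string per line of the file
docs = strip(docs);                % trim leading and trailing whitespace
docs(strlength(docs) == 0) = [];   % remove empty lines
```
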
Microsoft Word Documents

Use the extractFileText function to extract the text from exampleSonnets.docx. The Microsoft Word document exampleSonnets.docx contains Shakespeare's sonnets.

filename = "exampleSonnets.docx";
str = extractFileText(filename);

View the second sonnet by extracting the text between the two titles II and III.

start = " II" + newline;
fin = " III";
sonnet2 = extractBetween(str,start,fin)
sonnet2 = 
       When forty winters shall besiege thy brow,
       And dig deep trenches in thy beauty's field,
       Thy youth's proud livery so gazed on now,
       Will be a tatter'd weed of small worth held:
       Then being asked, where all thy beauty lies,
       Where all the treasure of thy lusty days;
       To say, within thine own deep sunken eyes,
       Were an all-eating shame, and thriftless praise.
       How much more praise deserv'd thy beauty's use,
       If thou couldst answer 'This fair child of mine
       Shall sum my count, and make my old excuse,'
       Proving his beauty by succession thine!
         This were to be new made when thou art old,
         And see thy blood warm when thou feel'st it cold.
      "

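The example Microsoft Word document uses two newline characters between each line. To replace these with a single newline character, use the replace function.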

sonnet2 = replace(sonnet2,[newline newline],newline)
sonnet2 = 
       When forty winters shall besiege thy brow,
       And dig deep trenches in thy beauty's field,
       Thy youth's proud livery so gazed on now,
       Will be a tatter'd weed of small worth held:
       Then being asked, where all thy beauty lies,
       Where all the treasure of thy lusty days;
       To say, within thine own deep sunken eyes,
       Were an all-eating shame, and thriftless praise.
       How much more praise deserv'd thy beauty's use,
       If thou couldst answer 'This fair child of mine
       Shall sum my count, and make my old excuse,'
       Proving his beauty by succession thine!
         This were to be new made when thou art old,
         And see thy blood warm when thou feel'st it cold.
      "

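Once the doubled newlines are collapsed, the text can be split into a string array with one element per line. The sketch below uses an invented three-line string rather than the Word document, so it runs without the example file.

```matlab
% Made-up text with doubled newlines, standing in for the Word output.
txt = "line one," + newline + newline + ...
      "  line two," + newline + newline + "line three.";

txt = replace(txt,[newline newline],newline);  % collapse doubled newlines
lines = strip(splitlines(txt));                % one trimmed string per line
lines(lines == "") = [];                       % drop empty lines, if any
```
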
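PDF Files

Extract text from PDF documents and data from PDF forms.

PDF Documents

Use the extractFileText function to extract the text from exampleSonnets.pdf. The file exampleSonnets.pdf contains Shakespeare's sonnets in PDF format.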

filename = "exampleSonnets.pdf";
str = extractFileText(filename);

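View the third sonnet by extracting the text between the two titles III and IV. This PDF has a space before each newline character, so the start marker includes a trailing space.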

start = " III " + newline;
fin = "IV";
sonnet3 = extractBetween(str,start,fin)
sonnet3 = 
       Look in thy glass and tell the face thou viewest 
       Now is the time that face should form another; 
       Whose fresh repair if now thou not renewest, 
       Thou dost beguile the world, unbless some mother. 
       For where is she so fair whose unear'd womb 
       Disdains the tillage of thy husbandry? 
       Or who is he so fond will be the tomb, 
       Of his self-love to stop posterity? 
       Thou art thy mother's glass and she in thee 
       Calls back the lovely April of her prime; 
       So thou through windows of thine age shalt see, 
       Despite of wrinkles this thy golden time. 
         But if thou live, remember'd not to be, 
         Die single and thine image dies with thee. 
       "

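If the space before each newline is unwanted, one option is to remove it with replace. This is a sketch on two invented lines; it assumes exactly one trailing space per line, as described for this PDF.

```matlab
% Two made-up lines that each end with a space before the newline.
txt = "Look in thy glass " + newline + "Now is the time " + newline;

% Replace the pair "space + newline" with a bare newline.
clean = replace(txt," " + newline,newline);
```
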
PDF Forms

To read text data from PDF forms, use the readPDFFormData function. The function returns a struct containing the data from the PDF form fields.

filename = "weatherReportForm1.pdf";
data = readPDFFormData(filename)
data = struct with fields:
         event_type: "Thunderstorm Wind"
    event_narrative: "Large tree down between Plantersville and Nettleton."
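
The form fields can be gathered into a table for further processing. The sketch below builds a stand-in struct with the same two fields as the output above instead of reading the PDF, and it assumes every field holds text. The names formTable, Field, and Value are chosen for illustration.

```matlab
% Stand-in for the struct returned by readPDFFormData above.
data = struct('event_type',"Thunderstorm Wind", ...
    'event_narrative',"Large tree down between Plantersville and Nettleton.");

% Collect field names and values (assumes all values are text).
names = string(fieldnames(data));
values = strings(numel(names),1);
for k = 1:numel(names)
    values(k) = string(data.(char(names(k))));
end

formTable = table(names,values,'VariableNames',{'Field','Value'});
```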

HTML

Extract text from HTML files, HTML code, and the web.

HTML Files

To extract text data from a saved HTML file, use extractFileText.

filename = "exampleSonnets.html";
str = extractFileText(filename);

View the fourth sonnet by extracting the text between the two titles IV and V.

start = newline + "IV" + newline;
fin = newline + "V" + newline;
sonnet4 = extractBetween(str,start,fin)
sonnet4 = 
     Unthrifty loveliness, why dost thou spend
     Upon thy self thy beauty's legacy?
     Nature's bequest gives nothing, but doth lend,
     And being frank she lends to those are free:
     Then, beauteous niggard, why dost thou abuse
     The bounteous largess given thee to give?
     Profitless usurer, why dost thou use
     So great a sum of sums, yet canst not live?
     For having traffic with thy self alone,
     Thou of thy self thy sweet self dost deceive:
     Then how when nature calls thee to be gone,
     What acceptable audit canst thou leave?
     Thy unused beauty must be tombed with thee,
     Which, used, lives th' executor to be.
     "

HTML Code

To extract text data from a string containing HTML code, use extractHTMLText.

code = "<html><body><h1>THE SONNETS</h1><p>by William Shakespeare</p></body></html>";
str = extractHTMLText(code)
str = 
    "THE SONNETS
     by William Shakespeare"

From the Web

To extract text data from a web page, first read the HTML code using webread, and then apply extractHTMLText to the result.

url = "https://www.mathworks.com/help/textanalytics";
code = webread(url);
str = extractHTMLText(code)
str = 
    'Text Analytics Toolbox
     Analyze and model text data 
     Release Notes
     PDF Documentation
     Release Notes
     PDF Documentation
     Text Analytics Toolbox™ provides algorithms and visualizations for preprocessing, analyzing, and modeling text data. Models created with the toolbox can be used in applications such as sentiment analysis, predictive maintenance, and topic modeling.
     Text Analytics Toolbox includes tools for processing raw text from sources such as equipment logs, news feeds, surveys, operator reports, and social media. You can extract text from popular file formats, preprocess raw text, extract individual words, convert text into numerical representations, and build statistical models.
     Using machine learning techniques such as LSA, LDA, and word embeddings, you can find clusters and create features from high-dimensional text datasets. Features created with Text Analytics Toolbox can be combined with features from other data sources to build machine learning models that take advantage of textual, numeric, and other types of data.
     Get Started
     Learn the basics of Text Analytics Toolbox
     Text Data Preparation
     Import text data into MATLAB® and preprocess it for analysis
     Modeling and Prediction
     Develop predictive models using topic models and word embeddings
     Display and Presentation
     Visualize text data and models using word clouds and text scatter plots
     Language Support
     Information on language support in Text Analytics Toolbox'

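Parse HTML Code

To find particular elements of HTML code, parse the code using htmlTree and then use findElement. Parse the HTML code and find all the hyperlinks. The hyperlinks are nodes with the element name A.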

tree = htmlTree(code);
selector = "A";
subtrees = findElement(tree,selector);

View the first 10 subtrees and extract the text using extractHTMLText.

subtrees(1:10)
ans = 
  10×1 htmlTree:
    <A class="skip_link sr-only" href="#content_container">Skip to content</A>
    <A href="https://www.mathworks.com?s_tid=gn_logo" class="svg_link navbar-brand"><IMG src="/images/responsive/global/pic-header-mathworks-logo.svg" class="mw_logo" alt="MathWorks"/></A>
    <A href="https://www.mathworks.com/products.html?s_tid=gn_ps">Products</A>
    <A href="https://www.mathworks.com/solutions.html?s_tid=gn_sol">Solutions</A>
    <A href="https://www.mathworks.com/academia.html?s_tid=gn_acad">Academia</A>
    <A href="https://www.mathworks.com/support.html?s_tid=gn_supp">Support</A>
    <A href="https://www.mathworks.com/matlabcentral/?s_tid=gn_mlc">Community</A>
    <A href="https://www.mathworks.com/company/events.html?s_tid=gn_ev">Events</A>
    <A href="https://www.mathworks.com/products/get-matlab.html?s_tid=gn_getml">Get MATLAB</A>
    <A href="https://www.mathworks.com?s_tid=gn_logo" class="svg_link pull-left"><IMG src="/images/responsive/global/pic-header-mathworks-logo.svg" class="mw_logo" alt="MathWorks"/></A>
str = extractHTMLText(subtrees);

View the extracted text of the first 10 hyperlinks.

str(1:10)
ans = 10×1 string
    "Skip to content"
    ""
    "Products"
    "Solutions"
    "Academia"
    "Support"
    "Community"
    "Events"
    "Get MATLAB"
    ""

To get the link targets, use getAttribute and specify the attribute "href" (hyperlink reference). Get the link targets of the first 10 subtrees.

attr = "href";
str = getAttribute(subtrees(1:10),attr)
str = 10×1 string
    "#content_container"
    "https://www.mathworks.com?s_tid=gn_logo"
    "https://www.mathworks.com/products.html?s_tid=gn_ps"
    "https://www.mathworks.com/solutions.html?s_tid=gn_sol"
    "https://www.mathworks.com/academia.html?s_tid=gn_acad"
    "https://www.mathworks.com/support.html?s_tid=gn_supp"
    "https://www.mathworks.com/matlabcentral/?s_tid=gn_mlc"
    "https://www.mathworks.com/company/events.html?s_tid=gn_ev"
    "https://www.mathworks.com/products/get-matlab.html?s_tid=gn_getml"
    "https://www.mathworks.com?s_tid=gn_logo"

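The link text and link targets can be paired in one table and filtered. The sketch below uses a short invented HTML snippet instead of the MathWorks page, so it needs no network access. It does require Text Analytics Toolbox. The variable names linkText, linkTarget, and linkTable are illustrative.

```matlab
% Made-up HTML with one in-page anchor and one absolute link.
code = "<html><body>" + ...
    "<a href=""#top"">Top</a>" + ...
    "<a href=""https://www.mathworks.com/products.html"">Products</a>" + ...
    "</body></html>";

tree = htmlTree(code);
links = findElement(tree,"A");            % all hyperlink nodes
linkText = extractHTMLText(links);        % visible text of each link
linkTarget = getAttribute(links,"href");  % target of each link

linkTable = table(linkText,linkTarget);

% Keep only absolute links, that is, targets that start with "http".
absoluteLinks = linkTable(startsWith(linkTable.linkTarget,"http"),:);
```
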
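CSV and Microsoft Excel Files

To extract text data from CSV and Microsoft Excel files, use readtable and then extract the text data from the table that it returns.

Use the readtable function to read the table data from factoryReports.csv and view the first few rows of the table.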

T = readtable('factoryReports.csv','TextType','string');
head(T)
ans=8×5 table
                                 Description                                       Category          Urgency          Resolution         Cost 
    _____________________________________________________________________    ____________________    ________    ____________________    _____
    "Items are occasionally getting stuck in the scanner spools."            "Mechanical Failure"    "Medium"    "Readjust Machine"         45
    "Loud rattling and banging sounds are coming from assembler pistons."    "Mechanical Failure"    "Medium"    "Readjust Machine"         35
    "There are cuts to the power when starting the plant."                   "Electronic Failure"    "High"      "Full Replacement"      16200
    "Fried capacitors in the assembler."                                     "Electronic Failure"    "High"      "Replace Components"      352
    "Mixer tripped the fuses."                                               "Electronic Failure"    "Low"       "Add to Watch List"        55
    "Burst pipe in the constructing agent is spraying coolant."              "Leak"                  "High"      "Replace Components"      371
    "A fuse is blown in the mixer."                                          "Electronic Failure"    "Low"       "Replace Components"      441
    "Things continue to tumble off of the belt."                             "Mechanical Failure"    "Low"       "Readjust Machine"         38

Extract the text data from the Description column and view the first few strings.

str = T.Description;
str(1:10)
ans = 10×1 string
    "Items are occasionally getting stuck in the scanner spools."
    "Loud rattling and banging sounds are coming from assembler pistons."
    "There are cuts to the power when starting the plant."
    "Fried capacitors in the assembler."
    "Mixer tripped the fuses."
    "Burst pipe in the constructing agent is spraying coolant."
    "A fuse is blown in the mixer."
    "Things continue to tumble off of the belt."
    "Falling items from the conveyor belt."
    "The scanner reel is split, it will soon begin to curve."

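The same pattern can be tried without the example file by writing a tiny CSV first. In the sketch below the file name demoReports.csv is invented, and the two rows are copied from the output above. The 'TextType','string' option makes the text column come back as a string array.

```matlab
% Build a two-row table and save it as a CSV file in the temp folder.
T0 = table(["Mixer tripped the fuses.";"A fuse is blown in the mixer."], ...
    [55;441],'VariableNames',{'Description','Cost'});
csvFile = fullfile(tempdir,"demoReports.csv");
writetable(T0,csvFile);

% Read it back, importing text columns as strings.
T = readtable(csvFile,'TextType','string');
str = T.Description;

% Drop missing or empty entries before further analysis.
str = str(~ismissing(str) & strlength(str) > 0);
```
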
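Extract Text from Multiple Files

If your text data is contained in multiple files in a folder, then you can import the text data into MATLAB using a file datastore.

Create a file datastore for the example sonnet text files. The example files are named "exampleSonnetN.txt", where N is the number of the sonnet. Specify the file name using the wildcard "*" to find all file names of this structure. To specify extractFileText as the read function, pass it to fileDatastore as a function handle.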

location = fullfile(matlabroot,"examples","textanalytics","data","exampleSonnet*.txt");
fds = fileDatastore(location,'ReadFcn',@extractFileText)
fds = 
  FileDatastore with properties:
                       Files: {
                              ' ...\matlab\examples\textanalytics\data\exampleSonnet1.txt';
                              ' ...\matlab\examples\textanalytics\data\exampleSonnet2.txt';
                              ' ...\matlab\examples\textanalytics\data\exampleSonnet3.txt'
                               ... and 2 more
                     Folders: {
                              ' ...\matlab\examples\textanalytics\data'
                 UniformRead: 0
                    ReadMode: 'file'
                   BlockSize: Inf
                  PreviewFcn: @extractFileText
      SupportedOutputFormats: ["txt"    "csv"    "xlsx"    "xls"    "parquet"    "parq"    "png"    "jpg"    "jpeg"    "tif"    "tiff"    "wav"    "flac"    "ogg"    "mp4"    "m4a"]
                     ReadFcn: @extractFileText
    AlternateFileSystemRoots: {}
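
For a small collection, readall is one way to pull every file into memory at once. The sketch below creates three invented files named demoSonnetN.txt in a temporary folder so that it does not depend on the installed example data. It requires Text Analytics Toolbox for extractFileText.

```matlab
% Create three small demo text files in a fresh temporary folder.
folder = tempname;
mkdir(folder);
for k = 1:3
    fid = fopen(fullfile(folder,"demoSonnet" + k + ".txt"),"w");
    fprintf(fid,"Text of demo sonnet %d.",k);
    fclose(fid);
end

% Match the files with a wildcard and read each with extractFileText.
fds = fileDatastore(fullfile(folder,"demoSonnet*.txt"), ...
    'ReadFcn',@extractFileText);

docs = readall(fds);       % cell array with one entry per file
str = vertcat(docs{:});    % stack the entries into a string column
```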

Loop over the files in the datastore and read each text file.

str = [];