Extract Text Data from Files

This example shows how to extract the text data from text, HTML, Microsoft® Word, PDF, CSV, and Microsoft Excel® files and import it into MATLAB® for analysis.

Usually, the easiest way to import text data into MATLAB is to use the extractFileText function. This function extracts the text data from text, PDF, HTML, and Microsoft Word files. To import text from CSV and Microsoft Excel files, use readtable. To extract text from HTML code, use extractHTMLText. To read data from PDF forms, use readPDFFormData.
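
The mapping from file type to import function described above can be sketched as a small helper. This is only an illustration: importTextData and its textVariable argument are hypothetical names that are not part of MATLAB or Text Analytics Toolbox, and the list of extensions is an assumption based on the file types named above.

```matlab
function str = importTextData(filename,textVariable)
% importTextData  Hypothetical helper (not a toolbox function) that picks
% an import function based on the file extension.
[~,~,ext] = fileparts(char(filename));
switch lower(ext)
    case {'.txt','.pdf','.docx','.html'}
        % Plain text, PDF, Microsoft Word, and HTML files.
        str = extractFileText(filename);
    case {'.csv','.xlsx'}
        % Tabular files: read a table, then take one text column from it.
        % textVariable names that column and is required for these files.
        T = readtable(filename,'TextType','string');
        str = T.(textVariable);
    otherwise
        error("importTextData:unsupported", ...
            "Unsupported file type: %s",ext);
end
end
```

For example, importTextData("weatherReports.csv","event_narrative") would return the event_narrative column as a string array, assuming that file is on the path.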

Text File

Extract the text from sonnets.txt using extractFileText. The file sonnets.txt contains Shakespeare's sonnets in plain text.

filename = "sonnets.txt";
str = extractFileText(filename);

View the first sonnet by extracting the text between the two titles "I" and "II".

start = " I" + newline;
fin = " II";
sonnet1 = extractBetween(str,start,fin)
sonnet1 = 
    "
       From fairest creatures we desire increase,
       That thereby beauty's rose might never die,
       But as the riper should by time decease,
       His tender heir might bear his memory:
       But thou, contracted to thine own bright eyes,
       Feed'st thy light's flame with self-substantial fuel,
       Making a famine where abundance lies,
       Thy self thy foe, to thy sweet self too cruel:
       Thou that art now the world's fresh ornament,
       And only herald to the gaudy spring,
       Within thine own bud buriest thy content,
       And tender churl mak'st waste in niggarding:
         Pity the world, or else this glutton be,
         To eat the world's due, by the grave and thee.
     
      "
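
To see how the boundary arguments behave, here is a small sketch on a made-up string (the titles and placeholder lines are illustrative and are not the contents of sonnets.txt). extractBetween returns the text between the two boundaries and excludes the boundaries themselves. Including newline in the start boundary keeps " I" from matching the beginning of " II" or " III".

```matlab
% Made-up text with the same title layout as the example: a space, the
% roman numeral, then a newline.
str = " I" + newline + "first poem" + newline + ...
      " II" + newline + "second poem" + newline + ...
      " III" + newline + "third poem";

% Text between the titles "I" and "II". The boundaries are not included.
poem1 = extractBetween(str," I" + newline," II");

% Text between the titles "II" and "III".
poem2 = extractBetween(str," II" + newline," III");

disp(poem1)
disp(poem2)
```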

Microsoft Word Document

Extract the text from exampleSonnets.docx using extractFileText. The file exampleSonnets.docx contains Shakespeare's sonnets in a Microsoft Word document.

filename = "exampleSonnets.docx";
str = extractFileText(filename);

View the second sonnet by extracting the text between the two titles "II" and "III".

start = " II" + newline;
fin = " III";
sonnet2 = extractBetween(str,start,fin)
sonnet2 = 
    "
       When forty winters shall besiege thy brow,
     
       And dig deep trenches in thy beauty's field,
     
       Thy youth's proud livery so gazed on now,
     
       Will be a tatter'd weed of small worth held:
     
       Then being asked, where all thy beauty lies,
     
       Where all the treasure of thy lusty days;
     
       To say, within thine own deep sunken eyes,
     
       Were an all-eating shame, and thriftless praise.
     
       How much more praise deserv'd thy beauty's use,
     
       If thou couldst answer 'This fair child of mine
     
       Shall sum my count, and make my old excuse,'
     
       Proving his beauty by succession thine!
     
         This were to be new made when thou art old,
     
         And see thy blood warm when thou feel'st it cold.
     
      "

The example Microsoft Word document separates consecutive lines with two newline characters. To replace each pair with a single newline character, use the replace function.

sonnet2 = replace(sonnet2,[newline newline],newline)
sonnet2 = 
    "
       When forty winters shall besiege thy brow,
       And dig deep trenches in thy beauty's field,
       Thy youth's proud livery so gazed on now,
       Will be a tatter'd weed of small worth held:
       Then being asked, where all thy beauty lies,
       Where all the treasure of thy lusty days;
       To say, within thine own deep sunken eyes,
       Were an all-eating shame, and thriftless praise.
       How much more praise deserv'd thy beauty's use,
       If thou couldst answer 'This fair child of mine
       Shall sum my count, and make my old excuse,'
       Proving his beauty by succession thine!
         This were to be new made when thou art old,
         And see thy blood warm when thou feel'st it cold.
      "
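
A single call to replace handles pairs of newline characters. If a document contains longer runs of blank lines, one possible alternative is a regular expression. The sketch below uses a made-up string, and the regexprep approach is a suggested variation rather than part of the original example.

```matlab
% Made-up text with a double and a triple newline between lines.
s = "line one" + newline + newline + ...
    "line two" + newline + newline + newline + ...
    "line three";

% Collapse every run of two or more newline characters into one.
% Note that newline is a character, so the two-character sequence is
% written [newline newline]; newline + newline would add the character
% codes and return the number 20.
collapsed = regexprep(s,"\n{2,}",newline);

disp(collapsed)
```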

PDF Files

Extract text from PDF documents and data from PDF forms.

PDF Document

Extract the text from exampleSonnets.pdf using extractFileText. The file exampleSonnets.pdf contains Shakespeare's sonnets in a PDF.

filename = "exampleSonnets.pdf";
str = extractFileText(filename);

View the third sonnet by extracting the text between the two titles "III" and "IV". This PDF has a space before each newline character.

start = " III " + newline;
fin = "IV";
sonnet3 = extractBetween(str,start,fin)
sonnet3 = 
    " 
       Look in thy glass and tell the face thou viewest 
       Now is the time that face should form another; 
       Whose fresh repair if now thou not renewest, 
       Thou dost beguile the world, unbless some mother. 
       For where is she so fair whose unear'd womb 
       Disdains the tillage of thy husbandry? 
       Or who is he so fond will be the tomb, 
       Of his self-love to stop posterity? 
       Thou art thy mother's glass and she in thee 
       Calls back the lovely April of her prime; 
       So thou through windows of thine age shalt see, 
       Despite of wrinkles this thy golden time. 
         But if thou live, remember'd not to be, 
         Die single and thine image dies with thee. 
     
      
       "
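
If the trailing spaces are not wanted, one way to remove them is to split the text into lines, strip the right side of each line, and join the lines again. This is a sketch on a made-up two-line string, not a step from the original example.

```matlab
% Made-up text that mimics the PDF output: indented lines, each ending
% with a space before the newline character.
s = "  Look in thy glass " + newline + "  Now is the time " + newline;

% Split into lines, remove trailing whitespace only (the indentation on
% the left is kept), and join the lines back together.
lines = splitlines(s);
lines = strip(lines,"right");
cleaned = join(lines,newline);

disp(cleaned)
```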

PDF Form

To read text data from PDF forms, use readPDFFormData. The function returns a struct containing the data from the PDF form fields.

filename = "weatherReportForm1.pdf";
data = readPDFFormData(filename)
data = struct with fields:
         event_type: "Thunderstorm Wind"
    event_narrative: "Large tree down between Plantersville and Nettleton."

HTML

Extract text from HTML files, HTML code, and the web.

HTML File

To extract text data from a saved HTML file, use extractFileText.

filename = "exampleSonnets.html";
str = extractFileText(filename);

View the fourth sonnet by extracting the text between the two titles "IV" and "V".

start = newline + "IV" + newline;
fin = newline + "V" + newline;
sonnet4 = extractBetween(str,start,fin)
sonnet4 = 
    "
     Unthrifty loveliness, why dost thou spend
      Upon thy self thy beauty's legacy?
      Nature's bequest gives nothing, but doth lend,
      And being frank she lends to those are free:
      Then, beauteous niggard, why dost thou abuse
      The bounteous largess given thee to give?
      Profitless usurer, why dost thou use
      So great a sum of sums, yet canst not live?
      For having traffic with thy self alone,
      Thou of thy self thy sweet self dost deceive:
      Then how when nature calls thee to be gone,
      What acceptable audit canst thou leave?
      Thy unused beauty must be tombed with thee,
      Which, used, lives th' executor to be.
     "

HTML Code

To extract text data from a string containing HTML code, use extractHTMLText.

code = "<html><body><h1>THE SONNETS</h1><p>by William Shakespeare</p></body></html>";
str = extractHTMLText(code)
str = 
    "THE SONNETS
     
     by William Shakespeare"

From the Web

To extract text data from a web page, first read the HTML code using webread, and then use extractHTMLText.

url = "https://www.mathworks.com/help/textanalytics";
code = webread(url);
str = extractHTMLText(code)
str = 
    'Text Analytics Toolbox™ provides algorithms and visualizations for preprocessing, analyzing, and modeling text data. Models created with the toolbox can be used in applications such as sentiment analysis, predictive maintenance, and topic modeling.
     
     Text Analytics Toolbox includes tools for processing raw text from sources such as equipment logs, news feeds, surveys, operator reports, and social media. You can extract text from popular file formats, preprocess raw text, extract individual words, convert text into numerical representations, and build statistical models.
     
     Using machine learning techniques such as LSA, LDA, and word embeddings, you can find clusters and create features from high-dimensional text datasets. Features created with Text Analytics Toolbox can be combined with features from other data sources to build machine learning models that take advantage of textual, numeric, and other types of data.'

Parse HTML Code

To find particular elements of HTML code, parse the code using htmlTree and use findElement. Parse the HTML code and find all the hyperlinks. The hyperlinks are nodes with element name "A".

tree = htmlTree(code);
selector = "A";
subtrees = findElement(tree,selector);

View the first 10 subtrees and extract the text using extractHTMLText.

subtrees(1:10)
ans = 
  10×1 htmlTree:

    <A class="svg_link navbar-brand" href="https://www.mathworks.com?s_tid=gn_logo"><IMG alt="MathWorks" class="mw_logo" src="/images/responsive/global/pic-header-mathworks-logo.svg"/></A>
    <A class="mwa-nav_login" href="https://www.mathworks.com/login?uri=http://www.mathworks.com/help/textanalytics/index.html">Sign In</A>
    <A href="https://www.mathworks.com/products.html?s_tid=gn_ps">Products</A>
    <A href="https://www.mathworks.com/solutions.html?s_tid=gn_sol">Solutions</A>
    <A href="https://www.mathworks.com/academia.html?s_tid=gn_acad">Academia</A>
    <A href="https://www.mathworks.com/support.html?s_tid=gn_supp">Support</A>
    <A href="https://www.mathworks.com/matlabcentral/?s_tid=gn_mlc">Community</A>
    <A href="https://www.mathworks.com/company/events.html?s_tid=gn_ev">Events</A>
    <A href="https://www.mathworks.com/company/aboutus/contact_us.html?s_tid=gn_cntus">Contact Us</A>
    <A href="https://www.mathworks.com/store?s_cid=store_top_nav&amp;s_tid=gn_store">How to Buy</A>

str = extractHTMLText(subtrees);

View the extracted text of the first 10 hyperlinks.

str(1:10)
ans = 10×1 string array
    ""
    "Sign In"
    "Products"
    "Solutions"
    "Academia"
    "Support"
    "Community"
    "Events"
    "Contact Us"
    "How to Buy"

To get the link targets, use getAttribute and specify the attribute "href" (hyperlink reference). Get the link targets of the first 10 subtrees.

attr = "href";
str = getAttribute(subtrees(1:10),attr)
str = 10×1 string array
    "https://www.mathworks.com?s_tid=gn_logo"
    "https://www.mathworks.com/login?uri=http://www.mathworks.com/help/textanalytics/index.html"
    "https://www.mathworks.com/products.html?s_tid=gn_ps"
    "https://www.mathworks.com/solutions.html?s_tid=gn_sol"
    "https://www.mathworks.com/academia.html?s_tid=gn_acad"
    "https://www.mathworks.com/support.html?s_tid=gn_supp"
    "https://www.mathworks.com/matlabcentral/?s_tid=gn_mlc"
    "https://www.mathworks.com/company/events.html?s_tid=gn_ev"
    "https://www.mathworks.com/company/aboutus/contact_us.html?s_tid=gn_cntus"
    "https://www.mathworks.com/store?s_cid=store_top_nav&s_tid=gn_store"
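
The same steps can be tried without a network connection on a short HTML string. The snippet and its example.com link targets below are made up for illustration. The functions are the ones used above.

```matlab
% Made-up HTML code containing two hyperlinks.
code = "<html><body><p>" + ...
    "<a href=""https://example.com/one"">First</a> and " + ...
    "<a href=""https://example.com/two"">Second</a>" + ...
    "</p></body></html>";

% Parse the code and find the hyperlink elements.
tree = htmlTree(code);
links = findElement(tree,"A");

% Extract the link text and the link targets.
linkText = extractHTMLText(links);
linkTargets = getAttribute(links,"href");

disp([linkText linkTargets])
```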

CSV and Microsoft Excel Files

To extract text data from CSV and Microsoft Excel files, use readtable and extract the text data from the table that it returns.

Extract the table data using the readtable function and view the first few rows of the table.

T = readtable('weatherReports.csv','TextType','string');
head(T)
ans=8×16 table
            Time             event_id          state              event_type         damage_property    damage_crops    begin_lat    begin_lon    end_lat    end_lon                                                                                             event_narrative                                                                                             storm_duration    begin_day    end_day    year       end_timestamp    
    ____________________    __________    ________________    ___________________    _______________    ____________    _________    _________    _______    _______    _________________________________________________________________________________________________________________________________________________________________________________________________    ______________    _________    _______    ____    ____________________

    22-Jul-2016 16:10:00    6.4433e+05    "MISSISSIPPI"       "Thunderstorm Wind"       ""                "0.00K"         34.14        -88.63     34.122     -88.626    "Large tree down between Plantersville and Nettleton."                                                                                                                                                  00:05:00          22          22       2016    22-Jul-0016 16:15:00
    15-Jul-2016 17:15:00    6.5182e+05    "SOUTH CAROLINA"    "Heavy Rain"              "2.00K"           "0.00K"         34.94        -81.03      34.94      -81.03    "One to two feet of deep standing water developed on a street on the Winthrop University campus after more than an inch of rain fell in less than an hour. One vehicle was stalled in the water."       00:00:00          15          15       2016    15-Jul-0016 17:15:00
    15-Jul-2016 17:25:00    6.5183e+05    "SOUTH CAROLINA"    "Thunderstorm Wind"       "0.00K"           "0.00K"         35.01        -80.93      35.01      -80.93    "NWS Columbia relayed a report of trees blown down along Tom Hall St."                                                                                                                                  00:00:00          15          15       2016    15-Jul-0016 17:25:00
    16-Jul-2016 12:46:00    6.5183e+05    "NORTH CAROLINA"    "Thunderstorm Wind"       "0.00K"           "0.00K"         35.64        -82.14      35.64      -82.14    "Media reported two trees blown down along I-40 in the Old Fort area."                                                                                                                                  00:00:00          16          16       2016    16-Jul-0016 12:46:00
    15-Jul-2016 14:28:00    6.4332e+05    "MISSOURI"          "Hail"                    ""                ""              36.45        -89.97      36.45      -89.97    ""                                                                                                                                                                                                      00:07:00          15          15       2016    15-Jul-0016 14:35:00
    15-Jul-2016 16:31:00    6.4332e+05    "ARKANSAS"          "Thunderstorm Wind"       ""                "0.00K"         35.85         -90.1     35.838     -90.087    "A few tree limbs greater than 6 inches down on HWY 18 in Roseland."                                                                                                                                    00:09:00          15          15       2016    15-Jul-0016 16:40:00
    15-Jul-2016 16:03:00    6.4343e+05    "TENNESSEE"         "Thunderstorm Wind"       "20.00K"          "0.00K"        35.056       -89.937      35.05     -89.904    "Awning blown off a building on Lamar Avenue. Multiple trees down near the intersection of Winchester and Perkins."                                                                                     00:07:00          15          15       2016    15-Jul-0016 16:10:00
    15-Jul-2016 17:27:00    6.4344e+05    "TENNESSEE"         "Hail"                    ""                ""             35.385        -89.78     35.385      -89.78    "Quarter size hail near Rosemark."                                                                                                                                                                      00:05:00          15          15       2016    15-Jul-0016 17:32:00

Extract the text data from the event_narrative column and view the first few strings.

str = T.event_narrative;
str(1:10)
ans = 10×1 string array
    "Large tree down between Plantersville and Nettleton."
    "One to two feet of deep standing water developed on a street on the Winthrop University campus after more than an inch of rain fell in less than an hour. One vehicle was stalled in the water."
    "NWS Columbia relayed a report of trees blown down along Tom Hall St."
    "Media reported two trees blown down along I-40 in the Old Fort area."
    ""
    "A few tree limbs greater than 6 inches down on HWY 18 in Roseland."
    "Awning blown off a building on Lamar Avenue. Multiple trees down near the intersection of Winchester and Perkins."
    "Quarter size hail near Rosemark."
    "Tin roof ripped off house on Old Memphis Road near Billings Drive. Several large trees down in the area."
    "Powerlines down at Walnut Grove and Cherry Lane roads."
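
Some reports have an empty narrative, as the fifth string above shows. A possible follow-up step, not part of the original example, is to drop the empty strings before analysis. The sketch uses a small made-up table in place of weatherReports.csv.

```matlab
% Made-up table with the same two text columns as the example data.
T = table( ...
    ["Hail";"Hail";"Heavy Rain"], ...
    ["Quarter size hail.";"";"Street flooded."], ...
    'VariableNames',{'event_type','event_narrative'});

% Extract the text column and keep only the non-empty strings.
str = T.event_narrative;
str = str(strlength(str) > 0);

disp(str)
```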

Extract Text from Multiple Files

If your text data is contained in multiple files in a folder, then you can import the text data into MATLAB using a file datastore.

Create a file datastore for the example sonnet text files. The example files are named "exampleSonnetN.txt", where N is the number of the sonnet. Specify the file name using the wildcard "*" to find all file names of this structure. To use extractFileText as the read function, pass it to fileDatastore as a function handle.

fds = fileDatastore('exampleSonnet*.txt','ReadFcn',@extractFileText)
fds = 
  FileDatastore with properties:

                       Files: {
                              ' ...\Documents\MATLAB\examples\textanalytics-ex15735454\exampleSonnet1.txt';
                              ' ...\Documents\MATLAB\examples\textanalytics-ex15735454\exampleSonnet2.txt';
                              ' ...\Documents\MATLAB\examples\textanalytics-ex15735454\exampleSonnet3.txt'
                               ... and 1 more
                              }
                 UniformRead: 0
                     ReadFcn: @extractFileText
    AlternateFileSystemRoots: {}

Loop over the files in the datastore and read each text file.

str = [];
while hasdata(fds)
    textData = read(fds);
    str = [str; textData];
end
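
As an alternative to the loop, readall reads every file in the datastore in one call and returns a cell array, which can then be concatenated. The sketch below is self-contained: it writes three small placeholder files to a temporary folder (the folder and file contents are made up), so it does not depend on the example sonnet files.

```matlab
% Create a temporary folder with three placeholder text files.
folder = tempname;
mkdir(folder);
for k = 1:3
    fid = fopen(fullfile(folder,"exampleSonnet" + k + ".txt"),"w");
    fprintf(fid,"Sonnet %d text",k);
    fclose(fid);
end

% Create the datastore as above, pointing at the temporary folder.
fds = fileDatastore(fullfile(folder,"exampleSonnet*.txt"), ...
    "ReadFcn",@extractFileText);

% Read all files at once. Each cell holds the text of one file.
data = readall(fds);

% Concatenate the cells into a string column vector.
str = vertcat(data{:});

% Remove the temporary folder and its contents.
rmdir(folder,"s");
```

This avoids growing str inside a loop, which can matter when the folder contains many files.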

View the extracted text.

str
str = 4×1 string array
    "  From fairest creatures we desire increase,↵  That thereby beauty's rose might never die,↵  But as the riper should by time decease,↵  His tender heir might bear his memory:↵  But thou, contracted to thine own bright eyes,↵  Feed'st thy light's flame with self-substantial fuel,↵  Making a famine where abundance lies,↵  Thy self thy foe, to thy sweet self too cruel:↵  Thou that art now the world's fresh ornament,↵  And only herald to the gaudy spring,↵  Within thine own bud buriest thy content,↵  And tender churl mak'st waste in niggarding:↵    Pity the world, or else this glutton be,↵    To eat the world's due, by the grave and thee."
    "  When forty winters shall besiege thy brow,↵  And dig deep trenches in thy beauty's field,↵  Thy youth's proud livery so gazed on now,↵  Will be a tatter'd weed of small worth held:↵  Then being asked, where all thy beauty lies,↵  Where all the treasure of thy lusty days;↵  To say, within thine own deep sunken eyes,↵  Were an all-eating shame, and thriftless praise.↵  How much more praise deserv'd thy beauty's use,↵  If thou couldst answer 'This fair child of mine↵  Shall sum my count, and make my old excuse,'↵  Proving his beauty by succession thine!↵    This were to be new made when thou art old,↵    And see thy blood warm when thou feel'st it cold."
    "  Look in thy glass and tell the face thou viewest↵  Now is the time that face should form another;↵  Whose fresh repair if now thou not renewest,↵  Thou dost beguile the world, unbless some mother.↵  For where is she so fair whose unear'd womb↵  Disdains the tillage of thy husbandry?↵  Or who is he so fond will be the tomb,↵  Of his self-love to stop posterity?↵  Thou art thy mother's glass and she in thee↵  Calls back the lovely April of her prime;↵  So thou through windows of thine age shalt see,↵  Despite of wrinkles this thy golden time.↵    But if thou live, remember'd not to be,↵    Die single and thine image dies with thee."
    "  Unthrifty loveliness, why dost thou spend↵  Upon thy self thy beauty's legacy?↵  Nature's bequest gives nothing, but doth lend,↵  And being frank she lends to those are free:↵  Then, beauteous niggard, why dost thou abuse↵  The bounteous largess given thee to give?↵  Profitless usurer, why dost thou use↵  So great a sum of sums, yet canst not live?↵  For having traffic with thy self alone,↵  Thou of thy self thy sweet self dost deceive:↵  Then how when nature calls thee to be gone,↵  What acceptable audit canst thou leave?↵    Thy unused beauty must be tombed with thee,↵    Which, used, lives th' executor to be."
