Read text from PDF, Microsoft Word, HTML, and plain text files
str = extractFileText(filename,Name,Value) specifies additional options using one or more name-value pair arguments.
Extract Text Data from Text File
Extract the text from sonnets.txt using extractFileText. The file sonnets.txt contains Shakespeare's sonnets in plain text.
str = extractFileText("sonnets.txt");
View the first sonnet.
i = strfind(str,"I");
ii = strfind(str,"II");
start = i(1);
fin = ii(1);
extractBetween(str,start,fin-1)
ans = "I From fairest creatures we desire increase, That thereby beauty's rose might never die, But as the riper should by time decease, His tender heir might bear his memory: But thou, contracted to thine own bright eyes, Feed'st thy light's flame with self-substantial fuel, Making a famine where abundance lies, Thy self thy foe, to thy sweet self too cruel: Thou that art now the world's fresh ornament, And only herald to the gaudy spring, Within thine own bud buriest thy content, And tender churl mak'st waste in niggarding: Pity the world, or else this glutton be, To eat the world's due, by the grave and thee. "
Extract Text Data from PDF
Extract the text from exampleSonnets.pdf using extractFileText. The file exampleSonnets.pdf contains Shakespeare's sonnets in a PDF file.
str = extractFileText("exampleSonnets.pdf");
View the second sonnet.
ii = strfind(str,"II");
iii = strfind(str,"III");
start = ii(1);
fin = iii(1);
extractBetween(str,start,fin-1)
ans = "II When forty winters shall besiege thy brow, And dig deep trenches in thy beauty's field, Thy youth's proud livery so gazed on now, Will be a tatter'd weed of small worth held: Then being asked, where all thy beauty lies, Where all the treasure of thy lusty days; To say, within thine own deep sunken eyes, Were an all-eating shame, and thriftless praise. How much more praise deserv'd thy beauty's use, If thou couldst answer 'This fair child of mine Shall sum my count, and make my old excuse,' Proving his beauty by succession thine! This were to be new made when thou art old, And see thy blood warm when thou feel'st it cold. "
Extract the text from pages 3, 5, and 7 of the PDF file.
pages = [3 5 7];
str = extractFileText("exampleSonnets.pdf", ...
    'Pages',pages);
View the 10th sonnet.
x = strfind(str,"X");
xi = strfind(str,"XI");
start = x(1);
fin = xi(1);
extractBetween(str,start,fin-1)
ans = "X Is it for fear to wet a widow's eye, That thou consum'st thy self in single life? Ah! if thou issueless shalt hap to die, The world will wail thee like a makeless wife; The world will be thy widow and still weep That thou no form of thee hast left behind, When every private widow well may keep By children's eyes, her husband's shape in mind: Look! what an unthrift in the world doth spend Shifts but his place, for still the world enjoys it; But beauty's waste hath in the world an end, And kept unused the user so destroys it. No love toward others in that bosom sits That on himself such murd'rous shame commits. X For shame! deny that thou bear'st love to any, Who for thy self art so unprovident. Grant, if thou wilt, thou art belov'd of many, But that thou none lov'st is most evident: For thou art so possess'd with murderous hate, That 'gainst thy self thou stick'st not to conspire, Seeking that beauteous roof to ruinate Which to repair should be thy chief desire. "
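The heading searches above use strfind, which also matches a numeral inside a longer one (for example, the "X" in "IX"). That may be why the output above contains text from more than one sonnet. As a minimal sketch, assuming each heading sits alone on its own line with no surrounding whitespace, a line-anchored regular expression matches only the standalone numeral. The string below is made-up sample data, not the contents of the sonnets files.

```matlab
% Made-up sample text with two headings; not taken from the example files.
str = "IX" + newline + "First poem." + newline + newline + ...
    "X" + newline + "Second poem.";

% strfind matches "X" anywhere, including inside "IX".
allMatches = strfind(str,"X");

% With 'lineanchors', ^ and $ match at the start and end of each line,
% so only a line consisting of exactly "X" is matched.
[startIdx,endIdx] = regexp(str,"^X$",'start','end','lineanchors');

% Take the text after the matched heading and trim surrounding whitespace.
poem = strtrim(extractAfter(str,endIdx(1)));
```

If the headings in a real file are indented or followed by trailing spaces, the pattern would need to allow for that, for example with `^\s*X\s*$`.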
Import Text from Multiple Files Using a File Datastore
If your text data is contained in multiple files in a folder, then you can import the text data into MATLAB using a file datastore.
Create a file datastore for the example sonnet text files. The example sonnets have file names "exampleSonnetN.txt", where N is the number of the sonnet. Specify the read function to be extractFileText.
readFcn = @extractFileText;
fds = fileDatastore('exampleSonnet*.txt','ReadFcn',readFcn);
Create an empty bag-of-words model.
bag = bagOfWords
bag =
  bagOfWords with properties:
          Counts: []
      Vocabulary: [1x0 string]
        NumWords: 0
    NumDocuments: 0
Loop over the files in the datastore and read each file. Tokenize the text in each file and add the document to bag.
while hasdata(fds)
    str = read(fds);
    document = tokenizedDocument(str);
    bag = addDocument(bag,document);
end
View the updated bag-of-words model.
bag =
  bagOfWords with properties:
          Counts: [4x276 double]
      Vocabulary: ["From" "fairest" "creatures" "we" "desire" "increase" "," "That" "thereby" "beauty's" "rose" "might" "never" "die" "But" "as" "the" "riper" "should" "by" "time" ... ]
        NumWords: 276
    NumDocuments: 4
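When all the files fit in memory, the loop can be replaced by a single read. The following is a minimal sketch under that assumption. It writes two small made-up text files to a temporary folder (the file names and contents are hypothetical), reads them all with readall, and builds the bag-of-words model in one call.

```matlab
% Create a temporary folder with two small text files (made-up content).
folder = tempname;
mkdir(folder);
texts = ["the quick brown fox" "the lazy dog"];
for k = 1:numel(texts)
    fid = fopen(fullfile(folder,"note" + k + ".txt"),'w');
    fprintf(fid,'%s',texts(k));
    fclose(fid);
end

% Create the datastore with extractFileText as the read function.
fds = fileDatastore(fullfile(folder,"*.txt"),'ReadFcn',@extractFileText);

% readall returns a cell array with one string per file.
data = readall(fds);

% Concatenate into a string array, tokenize, and build the model.
documents = tokenizedDocument([data{:}]);
bag = bagOfWords(documents);
```

The loop with hasdata and read remains the better choice when the collection is too large to hold in memory at once.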
Extract Text from HTML
To extract text data directly from HTML code, use
extractHTMLText and specify the HTML code as a string.
code = "<html><body><h1>THE SONNETS</h1><p>by William Shakespeare</p></body></html>";
str = extractHTMLText(code)
str = "THE SONNETS by William Shakespeare"
filename — Name of file
string scalar | character vector | 1-by-1 cell array containing a character vector
Name of the file, specified as a string scalar, character vector, or a 1-by-1 cell array containing a character vector.
Specify optional pairs of arguments as Name1=Value1,...,NameN=ValueN, where Name is
the argument name and Value is the corresponding value.
Name-value arguments must appear after other arguments, but the order of the
pairs does not matter.
Before R2021a, use commas to separate each name and value, and enclose
Name in quotes.
Example: 'Pages',[1 3 5] specifies to read pages 1, 3, and 5
from a PDF file.
Encoding — Character encoding
'auto' (default) |
'windows-1252' | ...
Character encoding to use, specified as the comma-separated pair consisting of
'Encoding' and a character vector or a
string scalar. The character vector or string scalar must contain a
standard character encoding scheme name, such as 'windows-1252'.
If you do not specify an encoding scheme, then the function performs heuristic auto-detection for the encoding to use. The heuristics depend on your locale. If these heuristics fail, then you must specify an encoding scheme explicitly.
This option only applies when the input is a plain text file.
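As a minimal sketch of specifying the encoding explicitly, the following writes a small UTF-8 text file to a temporary location and reads it back with the 'Encoding' option. The file name and contents are made up for illustration.

```matlab
% Write a small UTF-8 encoded text file (made-up content).
filename = [tempname '.txt'];
fid = fopen(filename,'w','n','UTF-8');
fprintf(fid,'%s','Café au lait');
fclose(fid);

% Read the file, naming the encoding instead of relying on auto-detection.
str = extractFileText(filename,'Encoding','UTF-8');
```

Naming the encoding this way avoids depending on the locale-specific heuristics described above.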
ExtractionMethod — Extraction method
'tree' (default) | 'article' | 'all-text'
Extraction method, specified as the comma-separated pair consisting of
'ExtractionMethod' and one of the following.
|Option|Description|
|'tree'|Analyze the DOM tree and text contents, then extract a block of paragraphs.|
|'article'|Detect article text and extract a block of paragraphs.|
|'all-text'|Extract all text in the HTML body, except for scripts and CSS styles.|
This option supports HTML file input only.
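The following is a minimal sketch of choosing an extraction method for an HTML file. It writes a small made-up HTML file to a temporary location (the file name and markup are hypothetical) and extracts all body text using the 'all-text' method.

```matlab
% Write a small HTML file (made-up content).
filename = [tempname '.html'];
html = ['<html><head><style>p {color: red;}</style></head>' ...
    '<body><h1>THE SONNETS</h1><p>by William Shakespeare</p></body></html>'];
fid = fopen(filename,'w');
fprintf(fid,'%s',html);
fclose(fid);

% Extract all text in the HTML body.
str = extractFileText(filename,'ExtractionMethod','all-text');
```

For very short pages such as this one, the paragraph-oriented methods may have little to work with, so 'all-text' is used here as the simplest choice.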
Password — Password to open PDF file
character vector | string scalar
Password to open the PDF file, specified as the comma-separated pair consisting of
'Password' and a character vector or a
string scalar. This option only applies if the input file is a PDF file.
Pages — Pages to read from PDF file
vector of positive integers
Pages to read from the PDF file, specified as the comma-separated pair consisting of
'Pages' and a vector of positive
integers. This option only applies if the input file is a PDF file. The
function, by default, reads all pages from the PDF file.
Example: 'Pages',[1 3 5]
To read text directly from HTML code, use extractHTMLText.
Version History
Introduced in R2017b
extractFileText no longer supports extracting text from Microsoft Word 97–2003 binary DOC files
Support for extracting text from Microsoft® Word 97–2003 binary DOC files using the
extractFileText function has been removed. Microsoft Word DOCX files will continue to be supported.
To extract text data from Microsoft Word 97–2003 binary DOC files, first save the file as a PDF, Microsoft Word DOCX, HTML, or plain text file, then use the extractFileText function.