Reading mathematical formulas in a PDF with MATLAB is inconsistent; how can I generalize this?
I'm trying to extract certain pieces of text (the 4.50% and the 22.50% in picture 1) from a PDF file with MATLAB. To do so, I use the pdfRead function. To make the text as generic as possible, I remove line breaks, double spaces, tabs, and indents, and convert all text to uppercase. When reading the file, I run into the following problem:
- Some text in the file seems to be in math mode (see picture 1, paying special attention to the two occurrences of "Notional Amount").
- This math-mode text is not read consistently by pdfRead (see picture 2, again paying special attention to the two occurrences of "Notional Amount"). For readability, I show the output before removing line breaks, double spaces, etc.; the problem is the same afterwards.
- The spaces within "Notional Amount" fall in a different place in every PDF file, so I cannot use one piece of MATLAB code for multiple PDF files, which is what I need.
- In addition, when I copy and paste this part into my Command Window, it looks different from how it appears in the text (see picture 3).
My question has several parts:
- Why doesn't the text appear as text, and how can I make it appear as text?
- How can I make this part generic so that I can read multiple PDF files with the same code?
Solutions I tried:
- Removing all spaces
- Saving it as a .txt file and trying to change the font (the formula part didn't change)
- Using Python to try to adjust the file
Thanks in advance!
Pranav Verma on 20 May 2021
Edited: Pranav Verma on 21 May 2021
The pdfRead function you mention does not appear to be part of the official MATLAB software. However, I see a similar function in one of the MATLAB File Exchange submissions: "Read text from a PDF document".
"Read text from a PDF document" is one of several submissions on the MATLAB File Exchange on MATLAB Central, a forum where our product users interact and exchange information and knowledge without MathWorks' involvement. Feel free to contact the author of this submission directly with specific questions about the implementation.
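Independently of which function reads the PDF, one possible way to make the matching insensitive to where the stray spaces fall is to allow optional whitespace between every letter of the label. The following is only a minimal sketch under assumptions: the sample string, the "of the" wording, and the variable names are hypothetical stand-ins for your real pdfRead output, and it assumes the extracted characters are ordinary letters separated by whitespace. The last line is a diagnostic for the case where that assumption does not hold: it lists any character codes outside plain ASCII, which could explain text that looks normal in the PDF but appears different in the Command Window.

```matlab
% Hypothetical sample standing in for the text returned by the PDF reader.
% The stray spaces inside "Notional Amount" mimic the inconsistent spacing.
raw = sprintf('4.50%% of the N o tional Am ount\n22.50%% of the Notio nal Amoun t');

% Build a pattern that tolerates whitespace after every letter of the label:
% 'NOTIONALAMOUNT' becomes 'N\s*O\s*T\s*...T\s*'.
label = 'NOTIONALAMOUNT';
labelPattern = sprintf('%c\\s*', label);

% Capture each number that is followed by "% of the <label>" (assumed wording),
% ignoring case and any whitespace or line breaks in between.
expr = ['(\d+(?:\.\d+)?)\s*%\s*OF\s*THE\s*' labelPattern];
tokens = regexp(raw, expr, 'tokens', 'ignorecase');

% Convert the captured strings to numeric values.
pcts = cellfun(@(t) str2double(t{1}), tokens);

% Diagnostic: character codes above 127 indicate non-ASCII glyphs
% (for example special math or ligature characters) in the extracted text.
nonAscii = unique(double(raw(raw > 127)));
```

If nonAscii is not empty for your real files, the label pattern above will not match as written, and those characters would first need to be mapped to ordinary letters; which mapping is appropriate depends on the codes you actually see.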