
Maintain formatting while reading PDF document

b on 15 Jan 2024
Commented: b on 19 Jan 2024
Hello, while reading a PDF document, I want to keep the formatting as it is, so that bold text stays bold and italic text stays italic. I tried this with extractFileText, but without success. How can this be done? Thanks.
b on 19 Jan 2024
Assume that all text inside double asterisks is bold. Suppose the Word document (read via option 2 of the 5 options provided by Shah, the easiest and most preferred one, after conversion by pdf2word) contains:
first word - this line contains **first word** which appears in between a sentence.
Now I want to find the bold instances that match whatever occurs before the hyphen (the string 'first word' in this case) and replace each with three consecutive underscores (___), so that the output is:
first word - this line contains ___ which appears in between a sentence.
This would be easy to do manually, but there are thousands of such replacements to make in the document.
b on 19 Jan 2024
Forgot to mention that I necessarily have to use MATLAB for this. Otherwise it would be ridiculously easy with Word's Find and Replace.
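A minimal MATLAB sketch of the replacement described above. It assumes the document text has already been read into a cell array of lines, that bold spans are marked with double asterisks, and that the key is everything before the first ' - ' on each line. The variable names and the two sample lines are illustrative only.

```matlab
% Sample lines (illustrative); bold spans are marked with double asterisks.
lines = { ...
    'first word - this line contains **first word** which appears in between a sentence.', ...
    'second term - only **other text** is bold here.'};

for k = 1:numel(lines)
    idx = strfind(lines{k}, ' - ');
    if isempty(idx)
        continue   % no key on this line; leave it unchanged
    end
    % The key is the text before the first ' - ' separator.
    key = strtrim(lines{k}(1:idx(1)-1));
    % Replace only bold spans whose text equals the key.
    lines{k} = strrep(lines{k}, ['**' key '**'], '___');
end

disp(lines{1})
disp(lines{2})
```

Because strrep does a literal match, characters in the key that are special in regular expressions need no escaping. The second line is left untouched because its bold text does not match its key.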


Answers (2)

akshatsood on 15 Jan 2024
Hi @b,
I understand that you want to maintain formatting while reading a PDF document. To extract text from a PDF document while preserving formatting such as bold and italic, you would typically need a more advanced PDF processing tool or library that supports rich text extraction. MATLAB's built-in "extractFileText" function does not preserve text formatting, as it is designed to extract plain text.
I hope this helps.

Hassaan on 15 Jan 2024
MATLAB itself does not provide built-in functions to extract formatted text directly from PDFs, because this requires interpreting the PDF content stream, which can be quite complex given how PDF stores formatting. If you want to preserve the formatting, here are some options:
  1. External Tools: Use an external tool designed for PDF text extraction that preserves formatting. There are several tools available that can extract text with formatting from PDFs, such as Adobe Acrobat's SDK or other third-party libraries. You can call these tools from MATLAB using the system function or other interfacing methods depending on the tool.
  2. PDF to Word: Convert the PDF to a Word document (which preserves formatting) using an external tool or online service, and then use MATLAB to read the Word document using functions from the Text Analytics Toolbox.
  3. Manual Inspection: If you only have a few documents and you're looking for specific formatted text, you might manually inspect the PDF file for the markup of bold and italic text. However, this is not practical for large-scale or automated extraction.
  4. Custom Scripting with Other Programming Languages: Use a scripting language that has libraries for PDF manipulation (like Python with PyPDF2 or PDFMiner) to extract the text while preserving formatting, and then pass the extracted content to MATLAB if needed.
  5. Optical Character Recognition (OCR): Use OCR tools that can recognize and preserve text formatting. MATLAB has an OCR function that can recognize text in images, but it won't retain text formatting. You would need to use a more advanced OCR tool for formatted text extraction.
[status, cmdout] = system('command-to-extract-formatted-text-from-pdf');
Remember to replace 'command-to-extract-formatted-text-from-pdf' with the actual command that invokes your PDF text extraction tool.
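As a hedged illustration of the call pattern, the sketch below checks the exit status before using the output. The echo command is only a stand-in so the example runs anywhere; it is not a PDF extraction tool, and the variable names are illustrative.

```matlab
% Placeholder command: "echo" stands in for a real extraction tool.
cmd = 'echo extracted text';
[status, cmdout] = system(cmd);

% A nonzero exit status conventionally signals that the command failed.
if status ~= 0
    error('Extraction command failed (status %d): %s', status, cmdout);
end

% Remove the trailing newline that the shell appends to the output.
extracted = strtrim(cmdout);
disp(extracted)
```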
For advanced document processing needs that go beyond what MATLAB directly supports, it's usually more effective to use a combination of tools, possibly involving other programming environments that have more specialized libraries for handling PDFs.
b on 15 Jan 2024
Out of these, option 2 is feasible. If you can tell how to retain the bold and italic formatting when reading in a Word document.
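One possible way to approach this without leaving MATLAB relies on the fact that a .docx file is a ZIP archive whose main text is stored in word/document.xml, where a bold run carries a <w:b/> element in its run properties. The sketch below works on an inline sample of that XML so that it is self-contained. The sample string, the file name in the comments, the variable names, and the simple regular expressions are all assumptions for illustration. Real documents can also contain <w:b w:val="0"/> (bold switched off), bold applied through styles, and XML entities, none of which this sketch handles.

```matlab
% For a real file (hypothetical name), the XML could be obtained with:
%   unzip('converted.docx', 'docx_tmp');
%   xml = fileread(fullfile('docx_tmp', 'word', 'document.xml'));
% Here an inline sample stands in for word/document.xml.
xml = ['<w:p>' ...
       '<w:r><w:t xml:space="preserve">first word - this line contains </w:t></w:r>' ...
       '<w:r><w:rPr><w:b/></w:rPr><w:t>first word</w:t></w:r>' ...
       '<w:r><w:t xml:space="preserve"> which appears in between a sentence.</w:t></w:r>' ...
       '</w:p>'];

% Each <w:r>...</w:r> element is one run of uniformly formatted text.
runs = regexp(xml, '<w:r(?:\s[^>]*)?>(.*?)</w:r>', 'tokens');

marked = '';
for k = 1:numel(runs)
    runXml = runs{k}{1};
    txt = regexp(runXml, '<w:t(?:\s[^>]*)?>(.*?)</w:t>', 'tokens', 'once');
    if isempty(txt)
        continue   % run without text, for example a line break
    end
    if ~isempty(strfind(runXml, '<w:b/>'))
        marked = [marked '**' txt{1} '**'];   % wrap bold text in double asterisks
    else
        marked = [marked txt{1}];
    end
end

disp(marked)
```

The result is plain text with bold spans wrapped in double asterisks, which ordinary string functions can then process. Italic runs could be handled the same way by testing for <w:i/>.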
Christopher Creutzig on 19 Jan 2024
Just for clarification, extractFileText already handles more than 90% of the complexity of parsing the PDF stream you mentioned. It does not return information about font names, bold/italic/roman, position on the page, etc. because it is designed to read text that is then used in text analytics workflows.
Most of that information is, after all, already used internally to arrange the text found correctly before returning a string.
