Maintain formatting while reading PDF document

Question

0 votes

Hello, while reading a PDF document, I want to keep the formatting as it is, so that bold text stays bold and italic text stays italic. I tried this with extractFileText, but without success. How can this be done? Thanks.

3 Comments

b on 19 Jan 2024

Assume that all text inside double-stars is bold. Suppose the Word document (produced from the PDF with pdf2word, following option 2 of the 5 options provided by Shah, which is the easiest and my preferred one) contains:

first word - this line contains first word which appears in between a sentence.

Now I want to select the bold instances that match whatever occurs before the hyphen (the string 'first word' in this case) and replace them with three consecutive underscores (___), so that the output is:

first word - this line contains ___ which appears in between a sentence.

This would be easy to do manually; however, there are thousands of such replacements to make in the document.

Thanks.
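
A minimal MATLAB sketch of the replacement described above, assuming the bold text really is available as double-star markup (for example `**first word**`) once the document has been read into MATLAB as plain text. The sample lines and the variable names (`lines`, `key`, `sep`, `rest`) are made up for illustration; only `regexp`, `regexprep` and `regexptranslate` are standard MATLAB functions.

```matlab
% Sample lines; bold text is ASSUMED to be wrapped in double stars.
lines = {
    '**first word** - this line contains **first word** which appears in between a sentence.'
    'plain text - a line without a bold key is left unchanged.'
    };

out = lines;
for k = 1:numel(lines)
    % Split into: bold key, separator around the hyphen, rest of the line.
    tok = regexp(lines{k}, '^\*\*(.+?)\*\*(\s*-\s*)(.*)$', 'tokens', 'once');
    if isempty(tok)
        continue   % no bold key before a hyphen, keep the line as it is
    end
    key  = tok{1};
    sep  = tok{2};
    rest = tok{3};

    % Escape the key so characters such as '.' or '(' are matched literally.
    pattern = ['\*\*' regexptranslate('escape', key) '\*\*'];

    % Replace only the bold repeats after the hyphen; the key itself is kept.
    out{k} = ['**' key '**' sep regexprep(rest, pattern, '___')];
end

disp(out{1})
```

Only bold occurrences are replaced; the same words in regular type after the hyphen are left alone, because the pattern requires the surrounding double stars.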

b on 19 Jan 2024

Forgot to mention that I necessarily have to use MATLAB for this. Otherwise it would be ridiculously easy using Word's Find and Replace.


Answer 1

akshatsood on 15 Jan 2024

1 vote

Hi @b,

I understand that you want to maintain formatting while reading a PDF document. To extract text from a PDF document while preserving formatting such as bold and italic, you would typically need a more advanced PDF processing tool or library that supports rich text extraction. MATLAB's built-in "extractFileText" function does not preserve text formatting, as it is designed to extract plain text.

I hope this helps.

0 Comments

Answer 2

Hassaan on 15 Jan 2024

1 vote

MATLAB itself does not provide built-in functions to extract formatted text directly from PDFs. Doing so requires interpreting the PDF content stream, which can be quite complex because of the way PDFs encode formatting. If you want to preserve the formatting, consider the following options:

1. External Tools: Use an external tool designed for PDF text extraction that preserves formatting. There are several tools available that can extract text with formatting from PDFs, such as Adobe Acrobat's SDK or other third-party libraries. You can call these tools from MATLAB using the system function or other interfacing methods depending on the tool.
2. PDF to Word: Convert the PDF to a Word document (which preserves formatting) using an external tool or online service, and then use MATLAB to read the Word document using functions from the Text Analytics Toolbox.
3. Manual Inspection: If you only have a few documents and you're looking for specific formatted text, you might manually inspect the PDF file for the markup of bold and italic text. However, this is not practical for large-scale or automated extraction.
4. Custom Scripting with Other Programming Languages: Use a scripting language that has libraries for PDF manipulation (like Python with PyPDF2 or PDFMiner) to extract the text while preserving formatting, and then pass the extracted content to MATLAB if needed.
5. Optical Character Recognition (OCR): Use OCR tools that can recognize and preserve text formatting. MATLAB has an OCR function that can recognize text in images, but it won't retain text formatting. You would need to use a more advanced OCR tool for formatted text extraction.

[status, cmdout] = system('command-to-extract-formatted-text-from-pdf');

Remember to replace 'command-to-extract-formatted-text-from-pdf' with the actual command that invokes your PDF text extraction tool.
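
As one possible illustration of the system call above, here is a sketch that assumes the Poppler command-line tool pdftohtml is installed and on the system path, and that it marks bold and italic text with `<b>` and `<i>` tags. The file name `document.pdf`, the choice of tool, and the mapping of bold to double stars and italic to single stars are all assumptions made for this example, not part of the original answer. If the command fails, the script falls back to a hand-written sample string so that the parsing step can still be tried.

```matlab
pdfFile = 'document.pdf';   % hypothetical input file

% -i: ignore images, -noframes: single HTML document, -stdout: print the result
cmd = sprintf('pdftohtml -i -noframes -stdout "%s"', pdfFile);
[status, html] = system(cmd);

if status ~= 0
    % Tool not installed or conversion failed: use a tiny sample instead.
    fprintf('pdftohtml failed (status %d); using a sample string instead.\n', status);
    html = '<b>first word</b> - this line contains <b>first word</b> which appears in between a sentence.<br/>';
end

% Turn the HTML tags into simple plain-text markers, then drop all other tags.
marked = regexprep(html, '<b>(.*?)</b>', '**$1**');   % bold   -> **text**
marked = regexprep(marked, '<i>(.*?)</i>', '*$1*');   % italic -> *text*
marked = regexprep(marked, '<[^>]+>', '');            % strip remaining tags

disp(marked)
```

Real converter output contains more than the text itself (a header, styles, HTML entities), so a production version would need more careful handling than these three regular expressions.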

For advanced document processing needs that go beyond what MATLAB directly supports, it's usually more effective to use a combination of tools, possibly involving other programming environments that have more specialized libraries for handling PDFs.
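
For the PDF-to-Word route, one way to get at the bold information without leaving MATLAB is to look at the XML inside the .docx file, since a .docx file is a zip archive whose main text is stored in word/document.xml. The sketch below works on a small hand-written XML sample and assumes that bold runs carry a plain `<w:b/>` element; the file and folder names in the comments are hypothetical, and real documents can express bold in other ways (styles, `w:val` attributes) that this sketch does not cover.

```matlab
% For a real file (hypothetical names), the XML could be obtained with:
%   unzip('converted.docx', 'converted_docx');
%   xml = fileread(fullfile('converted_docx', 'word', 'document.xml'));
% Here a small hand-written sample is used instead.
xml = ['<w:p>' ...
       '<w:r><w:rPr><w:b/></w:rPr><w:t>first word</w:t></w:r>' ...
       '<w:r><w:t xml:space="preserve"> - this line contains </w:t></w:r>' ...
       '<w:r><w:rPr><w:b/></w:rPr><w:t>first word</w:t></w:r>' ...
       '<w:r><w:t xml:space="preserve"> which appears in between a sentence.</w:t></w:r>' ...
       '</w:p>'];

% Each <w:r> element is a run of text with uniform formatting.
runs = regexp(xml, '<w:r(?:\s[^>]*)?>(.*?)</w:r>', 'tokens');

marked = '';
for k = 1:numel(runs)
    runXml = runs{k}{1};

    % Collect the text pieces of this run.
    txt = regexp(runXml, '<w:t(?:\s[^>]*)?>(.*?)</w:t>', 'tokens');
    if isempty(txt)
        continue   % run without text (for example a tab or a break)
    end
    pieces  = [txt{:}];      % cell array of char vectors
    runText = [pieces{:}];   % concatenated text of the run

    % ASSUMPTION: a run is bold if its properties contain <w:b/>.
    isBold = ~isempty(regexp(runXml, '<w:b\s*/>', 'once'));
    if isBold
        marked = [marked '**' runText '**']; %#ok<AGROW>
    else
        marked = [marked runText];           %#ok<AGROW>
    end
end

disp(marked)
```

The result is plain text with double-star markers around bold runs, which can then be processed with regexprep. XML entities such as `&amp;` are not decoded here.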

3 Comments

Christopher Creutzig on 19 Jan 2024

Just for clarification, extractFileText already handles more than 90% of the complexity of parsing the PDF stream that you mentioned. It does not return information about font names, bold/italic/roman, position on the page, etc., because it is designed to read text that is then used in text analytics workflows.

Most of that information is, after all, already used internally to arrange the text found correctly before returning a string.

Steven on 7 Jun 2024

The conditions are very clearly stated. Thank you.
