Is there a way to use the extractFileText() function (from Text Analytics Toolbox) for correctly extracting two-column text from PDF scientific articles? Or another function?

Question

Patrick on 25 Mar 2025


https://nl.mathworks.com/matlabcentral/answers/2175622-is-there-a-way-to-use-the-extractfiletext-function-from-text-analytics-toolbox-for-correctly-ext

Answered: Aryan on 19 Aug 2025

I want to use MATLAB tools to analyze a large collection of scientific papers in PDF format. I start by calling

str=extractFileText(name_dirfile)

on a test PDF file named name_dirfile in my code, which returns all the text in a single object (so far, so good). However, the text is laid out in two columns, and a striking problem is that extractFileText() does not read it column by column but line by line across the page: the first line of column 1, then the first line of column 2, then the second line of column 1, and so on. With the text extracted this way, I see no simple way to search for N-grams, or even for single-word occurrences (since words split by an end-of-line hyphen are now difficult to recombine). Has anyone encountered this problem and found a solution?

12 Comments

Stephen23 on 25 Mar 2025

Edited: Stephen23 on 25 Mar 2025

"but the two-column aspect of the PDF remains undetected, or at least not considered"

This task is much more complex than it appears (a small PDF joke for y'all). PDF is not a simple text container, but is actually a page-layout language that precisely defines how content appears on a page. The implications of that are important.

If you're working with multi-column PDFs, simply using basic text extraction tools will likely result in:

Text being pulled in a jumbled, non-linear order
Columns mixing together randomly
Loss of intended reading sequence
Significant post-processing required to reconstruct the original layout

This is because the text is not stored in blocks/lines of contiguous text (like it might be in e.g. a markup-based file format). Those hyphens you mention might just be graphics elements with some x-y positions: how is the tool supposed to know that some graphic element at one location has anything to do with some text elements in some distant location? The point is, there is nothing that links all of the text into some order. There is nothing that requires PDF text to be easy to extract. PDFs are NOT a file format designed for data exchange!

You could try some specialized PDF Extraction Tools (and understand that these are complex tools which may still return data in ways you do not expect/want and require significant post-processing):

use libraries like PyMuPDF, Apache PDFBox, or Tika
these tools can preserve spatial awareness during text extraction
they understand the coordinate-based nature of PDF layout
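To illustrate what "spatial awareness" buys you, here is a minimal MATLAB sketch, assuming you already have each word together with its bounding box (from one of the tools above, or from OCR). The words, boxes, and page width below are made-up illustration data, and the midline test is only an assumption that holds for a simple two-column page.

```matlab
% Hypothetical word boxes [x y width height], origin at the top-left of the page,
% listed in the interleaved order a naive extractor might return them.
words = ["Deep" "results" "learning" "show"];
boxes = [ 50 100 40 10;    % column 1, line 1
         350 100 60 10;    % column 2, line 1
          50 120 70 10;    % column 1, line 2
         350 120 40 10];   % column 2, line 2
pageWidth = 600;           % assumed page width, same units as the boxes

% Assign each word to a column by comparing its horizontal centre to the page midline.
xCenter = boxes(:,1) + boxes(:,3)/2;
column  = 1 + (xCenter > pageWidth/2);

% Sort by column, then top-to-bottom, then left-to-right.
[~, order] = sortrows([column, boxes(:,2), boxes(:,1)]);
orderedText = join(words(order), " ");
```

With the sample data, orderedText reads column 1 in full before column 2. Real pages with full-width titles, figures, or footnotes would need more than a single midline test.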

Or use computer vision techniques to understand page layout and content.

PDFs are complex. Treat them as intricate layout documents, not simple text files. Patience and the right tools will make your extraction process much smoother. Expecting one function to return a perfect output is unlikely to get you very far.

dpb on 27 Mar 2025

You could possibly combine the two approaches: use the OCR output to find the line-ending sequences of each column, then use those to locate the column breaks in the continuous text.

dpb on 29 Mar 2025

I just saw a promo for Swifdoo PDF Pro: a $30 lifetime license, if you're on Windows by any chance. I haven't used it, but it's supposed to be able to extract formatted text to Word documents via OCR. Perhaps a third-party tool might help; it looks as though there is a free trial that you could test with before committing.


Answer 1

Aryan on 19 Aug 2025


https://nl.mathworks.com/matlabcentral/answers/2175622-is-there-a-way-to-use-the-extractfiletext-function-from-text-analytics-toolbox-for-correctly-ext#answer_1569499

Hi Patrick,

I understand that you are trying to process two-column scientific PDFs with “extractFileText”, but are facing the issue that text is extracted in an interleaved way (line from column 1, then line from column 2, etc.). Unfortunately, this happens because PDF files do not store the “reading order” of text — they only store text objects with coordinates.

The most reliable option is to use OCR (Optical Character Recognition) with the Computer Vision Toolbox. The idea is to convert the page into an image, split it into its two columns, and run OCR separately on each column.

Steps for OCR-based extraction in MATLAB

1. Convert PDF to images

OCR in MATLAB (“ocr” function) works on images, not directly on PDFs.

2. Preprocess the page image (grayscale, binarization, noise removal)

Improves OCR accuracy by enhancing text contrast and removing background noise.

3. Detect the vertical “gutter” between columns

Two-column layouts need separation; otherwise OCR reads across columns. Detecting and splitting at the gutter ensures each OCR pass processes text in the correct column order.

4. Crop the image into left and right column regions

Running OCR separately on each cropped column guarantees the reading order is preserved (top-to-bottom in column 1, then top-to-bottom in column 2).

5. Apply OCR on each column image

6. Post-process the OCR text (fix hyphenation and line breaks)

Cleaning these artifacts makes the output ready for search, n-gram analysis, or NLP tasks.
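As a rough sketch of the post-processing in step 6, the hyphenation and line-break cleanup can be done with regexprep. The sample string is made up to stand in for the OCR text of one column, and the patterns are one possible choice rather than a complete solution.

```matlab
% Hypothetical OCR output for one column, with a hyphen-split word and hard line breaks.
raw = "Text extrac-" + newline + "tion from PDF files" + newline + "is hard.";

% Rejoin words split by an end-of-line hyphen: "extrac-<newline>tion" -> "extraction".
% Caveat: this also merges genuine compounds such as "two-<newline>column" into "twocolumn".
clean = regexprep(raw, "(\w)-\s*\n\s*(\w)", "$1$2");

% Replace the remaining line breaks with single spaces.
clean = regexprep(clean, "\s*\n\s*", " ");
```

If genuine hyphenated compounds matter for your n-gram counts, you could check each rejoined word against a dictionary before dropping the hyphen.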

NOTE: Accuracy depends on image quality — scanned PDFs or low-resolution pages may need extra preprocessing (deskewing, denoising, upscaling).

This workflow can be extended to 3-column layouts, pages with mixed full-width + column sections, etc., just by adjusting the cropping logic.
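The gutter detection and cropping (steps 3 and 4) can be sketched with a vertical projection profile. The page below is a synthetic binary image built only for illustration, and the sketch assumes a single clean gutter somewhere in the central half of the page.

```matlab
% Synthetic binarized page for illustration: true = ink, false = background.
% Two "text" blocks separated by an empty vertical gutter.
page = false(100, 200);
page(10:90,  15:90)  = true;   % left column
page(10:90, 111:185) = true;   % right column

% Vertical projection profile: amount of ink in each pixel column.
inkPerColumn = sum(page, 1);

% Search only the central half of the page so the outer margins are ignored.
nCols       = size(page, 2);
searchRange = round(nCols/4):round(3*nCols/4);
isEmpty     = inkPerColumn(searchRange) == 0;

% Take the centre of the empty pixel columns in that range as the split position.
% (Assumes one clean gutter; noisy scans would need a tolerance instead of == 0.)
emptyCols = searchRange(isEmpty);
assert(~isempty(emptyCols), 'No empty gutter found in the central region.');
gutterX = round(mean(emptyCols));

% Crop into left and right column images.
leftImg  = page(:, 1:gutterX);
rightImg = page(:, gutterX+1:end);
```

For a real page you would first build the binary image from the rendered page, for example page = ~imbinarize(im2gray(img)) so that dark text becomes true (both functions are in the Image Processing Toolbox). You would then crop the original grayscale image at the same gutterX and pass each half to ocr, reading the recognized text from the Text property of the result. A 3-column page would need two split positions instead of one.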
