Is there a way to use the extractFileText() function (from Text Analytics Toolbox) for correctly extracting two-column text from PDF scientific articles? Or another function?
I want to use MATLAB tools to analyze a large collection of PDF scientific papers. I start by calling
str=extractFileText(name_dirfile)
on a test PDF file named name_dirfile in my code, which returns the text all in one object (so far, so good). Yet the text is two-column, and a striking problem is that extractFileText() does not read the text column by column, but rather line by line: first line of column 1, then first line of column 2, then second line of column 1, etc. With the text extracted this way, I see no simple way of searching for N-grams, or even for single-word occurrences (as words split by an end-of-line hyphen are now difficult to recombine). Has anyone encountered this problem and found a solution?
12 Comments
dpb
on 27 Mar 2025
You could possibly combine the two approaches: use the OCR data to find the line-ending sequences of each column, then use those sequences to locate the column breaks in the continuous text.
dpb
on 29 Mar 2025
I just saw a promo for Swifdoo PDF Pro, a $30 lifetime license, if you're on Windows by any chance. I haven't used it, but it's supposed to be able to extract formatted text to Word documents via OCR...perhaps a 3rd party tool might help; it looks as though there is a free trial that you could test with before committing...
Answers (1)
Aryan
on 19 Aug 2025
Hi Patrick,
I understand that you are trying to process two-column scientific PDFs with “extractFileText”, but are facing the issue that the text is extracted in an interleaved way (a line from column 1, then a line from column 2, etc.). Unfortunately, this happens because PDF files generally do not store the “reading order” of text; they store text objects with page coordinates.
The most reliable option is to use OCR (Optical Character Recognition) with the Computer Vision Toolbox. The idea is to convert the page into an image, split it into its two columns, and run OCR separately on each column.
Steps for OCR-based extraction in MATLAB
1. Convert PDF to images
- OCR in MATLAB (“ocr” function) works on images, not directly on PDFs.
2. Preprocess the page image (grayscale, binarization, noise removal)
- Improves OCR accuracy by enhancing text contrast and removing background noise.
3. Detect the vertical “gutter” between columns
- Two-column layouts need separation; otherwise OCR reads across columns. Detecting and splitting at the gutter ensures each OCR pass processes text in the correct column order.
4. Crop the image into left and right column regions
- Running OCR separately on each cropped column guarantees the reading order is preserved (top-to-bottom in column 1, then top-to-bottom in column 2).
5. Apply OCR on each column image
6. Post-process the OCR text (fix hyphenation and line breaks)
- Cleaning these artifacts makes the output ready for search, n-gram analysis, or NLP tasks.
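Steps 2 to 6 above could be sketched roughly as follows. This is only a minimal sketch, not a tested pipeline: it assumes the page has already been rendered to a grayscale or RGB image by some external tool, that there is exactly one ink-free gutter somewhere in the middle half of the page, and that the Image Processing Toolbox (“imbinarize”) and Computer Vision Toolbox (“ocr”) are installed. The function names ocrTwoColumnPage, findGutter and joinHyphenated are made up for illustration.

```matlab
function txt = ocrTwoColumnPage(page)
% page: grayscale or RGB page image (numeric array), already rendered from the PDF.
    if size(page, 3) == 3
        page = rgb2gray(page);               % step 2: grayscale
    end
    ink = ~imbinarize(page);                 % step 2: true where there is dark text
    gutter = findGutter(ink);                % step 3: column index inside the gutter
    left  = ocr(page(:, 1:gutter));          % steps 4-5: crop and OCR the left column
    right = ocr(page(:, gutter+1:end));      % steps 4-5: crop and OCR the right column
    % Column 1 first, then column 2, then step 6 clean-up.
    txt = joinHyphenated([left.Text, newline, right.Text]);
end

function gutter = findGutter(ink)
% Centre of the widest ink-free vertical band in the middle half of the page
% (the outer quarters are skipped so that page margins are not picked up).
    w  = size(ink, 2);
    lo = floor(w/4) + 1;
    hi = ceil(3*w/4);
    blank  = sum(ink(:, lo:hi), 1) == 0;     % pixel columns containing no ink
    d      = diff([0, blank, 0]);            % +1 where a blank run starts, -1 just after it ends
    starts = find(d == 1);
    stops  = find(d == -1) - 1;
    if isempty(starts)
        error('No blank gutter found in the middle half of the page.');
    end
    [~, k] = max(stops - starts);            % widest blank run
    gutter = lo - 1 + round((starts(k) + stops(k)) / 2);
end

function txt = joinHyphenated(txt)
% Step 6: rejoin words split by an end-of-line hyphen ("extrac-" / "tion").
% Caveat: a genuine hyphenated compound broken at a line end is merged too.
    txt = regexprep(txt, '(\w)-[ \t]*\r?\n\s*(\w)', '$1$2');
end
```

A call might look like txt = ocrTwoColumnPage(imread('page_001.png')), where the file name is hypothetical. Instead of cropping, “ocr” also accepts a region of interest as a second argument, which may be more convenient if you want the word bounding boxes in page coordinates.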
NOTE: Accuracy depends on image quality — scanned PDFs or low-resolution pages may need extra preprocessing (deskewing, denoising, upscaling).
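For low-quality pages, the extra preprocessing mentioned in the note might look something like the sketch below. It covers only upscaling, denoising and adaptive binarization (deskewing is not attempted here), assumes the Image Processing Toolbox, and the function name preprocessForOcr, the scale factor and the 3-by-3 filter size are illustrative choices rather than recommended settings.

```matlab
function bw = preprocessForOcr(page, scale)
% page : grayscale or RGB page image (numeric array)
% scale: upscaling factor chosen by the caller, e.g. 2 for a low-resolution scan
    if size(page, 3) == 3
        page = rgb2gray(page);
    end
    page = imresize(page, scale, 'bicubic');     % enlarge small text
    page = medfilt2(page, [3 3]);                % suppress salt-and-pepper noise
    % Adaptive thresholding copes better with uneven scan brightness than a
    % single global threshold; text is assumed dark on a light background.
    bw = imbinarize(page, 'adaptive', ...
        'ForegroundPolarity', 'dark', 'Sensitivity', 0.5);
end
```

The logical image returned here can be passed to “ocr” directly. Whether each of these operations helps or hurts depends on the scans, so it is worth comparing OCR output with and without them on a few pages.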
This workflow can be extended to 3-column layouts, pages with mixed full-width + column sections, etc., just by adjusting the cropping logic.
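One possible way to generalize the cropping logic to any number of columns is to treat every sufficiently wide ink-free vertical band as a gutter. The sketch below is an assumption-laden illustration: findColumnEdges is a made-up name, minGap (the minimum gutter width in pixels) is a tuning parameter you would have to choose per document, and the input is a logical image that is true where there is ink.

```matlab
function edges = findColumnEdges(ink, minGap)
% Returns an N-by-2 matrix with one row [firstCol lastCol] per text column.
    hasInk = any(ink, 1);                    % pixel columns that contain any ink
    d      = diff([0, hasInk, 0]);
    starts = find(d == 1);                   % first pixel column of each inked run
    stops  = find(d == -1) - 1;              % last pixel column of each inked run
    if isempty(starts)
        edges = zeros(0, 2);                 % blank page
        return
    end
    % Blank bands narrower than minGap are treated as ordinary spacing and
    % merged into the surrounding text column; wider ones count as gutters.
    gaps     = starts(2:end) - stops(1:end-1) - 1;
    isGutter = gaps >= minGap;
    edges = [starts([true, isGutter]).', stops([isGutter, true]).'];
end
```

Each row of edges can then be cropped and passed to “ocr” from left to right. For pages that mix full-width sections (title, abstract, wide figures) with columns, the full-width part fills the gutter, so the page would first need to be cut into horizontal bands and this function applied to each band separately.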
Please refer to the following MATLAB R2025a documentation:
Hope it helps!
0 Comments