Efficient way to read variable column number data from a mixed-format text file?

I'm trying to read specific data from .cif files, which have an unfortunate text format. A relevant section is below: the section starts with a list of the property identifiers for the columns, one per line, followed by many rows containing that number of properties before the section is closed. A file can contain multiple sections; I already have a way to find each one and handle it separately, since the rest of the file's contents are irrelevant and formatted differently.
loop_
_atom_site.group_PDB
_atom_site.id
_atom_site.type_symbol
_atom_site.label_atom_id
_atom_site.label_alt_id
_atom_site.label_comp_id
_atom_site.label_asym_id
_atom_site.label_entity_id
_atom_site.label_seq_id
_atom_site.Cartn_x
_atom_site.Cartn_y
_atom_site.Cartn_z
_atom_site.auth_asym_id
_atom_site.auth_seq_id
_atom_site.pdbx_PDB_ins_code
_atom_site.occupancy
_atom_site.B_iso_or_equiv
_atom_site.pdbx_PDB_model_num
ATOM 1 N N . ASP A 1 1 624.249 268.361 303.253 A 2 ? 0.00 0.00 1
ATOM 2 C CA . ASP A 1 1 625.516 268.284 302.473 A 2 ? 0.00 0.00 1
ATOM 3 C C . ASP A 1 1 626.767 268.479 303.343 A 2 ? 0.00 0.00 1
ATOM 4 O O . ASP A 1 1 627.026 269.597 303.785 A 2 ? 0.00 0.00 1
ATOM 5 C CB . ASP A 1 1 625.533 269.354 301.363 A 2 ? 0.00 0.00 1
The problem is that the number of properties can vary, and the properties are not fixed-width, so I cannot parse the data block directly. Text-reading functions like textscan aren't working because the leading data is formatted entirely differently, and as far as I can see they won't operate on extracted strings of cleaned data.
Is there some sneaky way to make a table with a list of headers transposed like that? I'm trying to avoid a very slow loop that parses each line individually, especially since I only need select columns of the data.
  1 Comment
dpb
dpb on 6 Sep 2022
"...trying to avoid a very slow loop to parse each line individually"
Actually, that generally is NOT that slow, as long as the output variable(s) have been preallocated so you're not dynamically reallocating on every pass through the loop.
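For illustration, a minimal preallocated-loop sketch (dataLines, a string array of the ATOM rows, is an assumed variable; fields 10:12 are Cartn_x/y/z in the example above):
nLines = numel(dataLines);           % dataLines: string array of ATOM rows (assumed)
xyz = zeros(nLines,3);               % preallocated once, never grown inside the loop
for k = 1:nLines
    f = split(dataLines(k));         % whitespace-delimited fields of one row
    xyz(k,:) = str2double(f(10:12)); % Cartn_x, Cartn_y, Cartn_z
end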
"...way to make a table with a list of headers transposed like that?"
What list, transposed like what? You've lost me here, sorry; I don't see what would correspond to that statement in the data given.
You can certainly use textscan on in-memory data; I don't know whether it has been extended to string arrays yet, however, which may be where you ran into the issue, depending on how you read the file.
BUT, my recommendation is to attach a pertinent data file so folks can access it, and then describe explicitly what is wanted/needed from the file; undoubtedly the file(s) can then be read.


Accepted Answer

chrisw23
chrisw23 on 8 Sep 2022
Try string operations for performance (I saved your example as a text file):
rawLines = readlines("example.txt");            % one string per line
headerId = rawLines.startsWith("_");            % header lines carry the column names
varNames = rawLines(headerId);
dataId = rawLines.startsWith("ATOM");           % data rows
pat = whitespacePattern(1,inf);
dataRows = rawLines(dataId).strip.replace(pat," ").split();  % collapse whitespace, split into columns
resTbl = array2table(dataRows);
resTbl.Properties.VariableNames = varNames;     % use the header lines as variable names
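From there, selected columns can be pulled out by header name; for example (using the names from the data above, and noting that str2double operates elementwise on string arrays):
coordNames = ["_atom_site.Cartn_x" "_atom_site.Cartn_y" "_atom_site.Cartn_z"];
xyz = str2double(resTbl{:,coordNames});   % N-by-3 double array of coordinates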
  1 Comment
Carson Purnell
Carson Purnell on 21 Oct 2022
This general strategy ended up working. Get the header block into a table, and then it becomes possible to get the target information a priori, without needing to regex-sort the headers themselves or anything like that.


More Answers (1)

Walter Roberson
Walter Roberson on 6 Sep 2022
Edited: Walter Roberson on 6 Sep 2022
I suggest using fileread() and text processing. For example,
S = fileread('YourFile.cif');
S = regexprep(S, {'^.*?(?=^ATOM)', '^(?!ATOM).*$'}, {'', ''}, 'lineanchors');
If I got the patterns right, this should first delete everything in the file before the first line that starts with ATOM, and then delete every remaining line that does not start with ATOM.
What remains would be a character vector that you could pass as the first argument to textscan().
  3 Comments
Carson Purnell
Carson Purnell on 6 Sep 2022
Regexprep is much too slow; I need to parse millions of lines of strings. str2double has the same problem, because it won't operate on arrays in a useful way, and a looped str2double is (for this) far slower than a single vectorized str2num call.
There can also be one or more data blocks per file, so clearing the lines before and after the first block is not useful. I can already extract the relevant lines for each block; I just can't figure out how to parse them without an incredibly slow set of loops, because the columns are variable.
Walter Roberson
Walter Roberson on 6 Sep 2022
When you currently extract the relevant lines for each block, what form are you extracting them into? Are you extracting them all first and post-processing?
If you have a file identifier fid positioned to loop_ then
textscan(fid, '_%*[^\n]', 'headerlines', 1)
should read through all of the _ lines, leaving you positioned at the first ATOM line. At that point you can ftell() and record the position. Then fgetl() one line and analyze it to count the fields and figure out which columns are character and which are numeric. With that information in hand, you can generate a format to read such lines: fseek() back to the beginning of the line and textscan() with that format.
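A sketch of that field-sniffing step, under the assumption that any field str2double can parse should be read as %f and everything else as %s:
pos = ftell(fid);                     % remember where the data block starts
f = split(string(fgetl(fid)));        % whitespace-delimited fields of the first data line
spec = repmat("%s", 1, numel(f));     % default: read every column as text
spec(~isnan(str2double(f))) = "%f";   % numeric-looking fields become %f
fmt = char(strjoin(spec, " "));
fseek(fid, pos, 'bof');               % rewind to the start of the block
C = textscan(fid, fmt);               % one cell per column; stops when a line no longer matches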
However, this approach would be weak if the columns are fixed width and there are some empty columns -- for example if you had a situation where alt_id was empty if the alt_id was the same as the atom_id .

