Efficient way to read variable column number data from a mixed-format text file?

I'm trying to read specific data from .cif files, which have an unfortunate text format. A relevant section is below: the section starts with a list of the property identifiers for the columns, one per line, followed by many rows containing that number of properties before the section is closed. A file can contain multiple sections; I already have a way to find each one and handle it separately, since the rest of the file's contents are irrelevant and formatted differently.
loop_
_atom_site.group_PDB
_atom_site.id
_atom_site.type_symbol
_atom_site.label_atom_id
_atom_site.label_alt_id
_atom_site.label_comp_id
_atom_site.label_asym_id
_atom_site.label_entity_id
_atom_site.label_seq_id
_atom_site.Cartn_x
_atom_site.Cartn_y
_atom_site.Cartn_z
_atom_site.auth_asym_id
_atom_site.auth_seq_id
_atom_site.pdbx_PDB_ins_code
_atom_site.occupancy
_atom_site.B_iso_or_equiv
_atom_site.pdbx_PDB_model_num
ATOM 1 N N . ASP A 1 1 624.249 268.361 303.253 A 2 ? 0.00 0.00 1
ATOM 2 C CA . ASP A 1 1 625.516 268.284 302.473 A 2 ? 0.00 0.00 1
ATOM 3 C C . ASP A 1 1 626.767 268.479 303.343 A 2 ? 0.00 0.00 1
ATOM 4 O O . ASP A 1 1 627.026 269.597 303.785 A 2 ? 0.00 0.00 1
ATOM 5 C CB . ASP A 1 1 625.533 269.354 301.363 A 2 ? 0.00 0.00 1
The problem is that the number of properties can vary, and the properties are not fixed-width, so I cannot parse the data block directly. Text-reading functions like textscan aren't working because the leading data is formatted entirely differently, and as far as I can see they won't operate on extracted strings of cleaned data.
Is there some sneaky way to make a table with a list of headers transposed like that? I'm trying to avoid a very slow loop that parses each line individually, especially since I only need select columns of the data.
  1 Comment
dpb
dpb on 6 Sep 2022
"...trying to avoid a very slow loop to parse each line individually"
Actually, that generally is NOT that slow, as long as the output variable(s) have been preallocated so you're not dynamically reallocating on every pass through the loop.
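For illustration, a minimal preallocated-loop sketch (dataLines, a string array of the ATOM rows, is an assumed variable; fields 10:12 are Cartn_x/y/z in the example above):
nLines = numel(dataLines);           % dataLines: string array of ATOM rows (assumed)
xyz = zeros(nLines,3);               % preallocated once, never grown inside the loop
for k = 1:nLines
    f = split(dataLines(k));         % whitespace-delimited fields of one row
    xyz(k,:) = str2double(f(10:12)); % Cartn_x, Cartn_y, Cartn_z
end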
"...way to make a table with a list of headers transposed like that?"
What list, transposed like what? You've lost me here, sorry; I don't see what would correspond to that statement in the data given.
You can certainly use textscan on in-memory data; I don't know whether it has been extended to string arrays yet, however, which may be where you ran into the issue, depending on how you read the file.
BUT, my recommendation is to attach a pertinent data file so folks can access it, and then describe explicitly what is wanted/needed from the file; undoubtedly the file(s) can then be read.


Accepted Answer

chrisw23
chrisw23 on 8 Sep 2022
Try string operations for performance (I saved your example as a text file):
rawLines = readlines("example.txt");            % one string per line
headerId = rawLines.startsWith("_");            % header lines carry the column names
varNames = rawLines(headerId);
dataId = rawLines.startsWith("ATOM");           % data rows
pat = whitespacePattern(1,inf);
dataRows = rawLines(dataId).strip.replace(pat," ").split();  % collapse whitespace, split into columns
resTbl = array2table(dataRows);
resTbl.Properties.VariableNames = varNames;     % use the header lines as variable names
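From there, selected columns can be pulled out by header name; for example (using the names from the data above, and noting that str2double operates elementwise on string arrays):
coordNames = ["_atom_site.Cartn_x" "_atom_site.Cartn_y" "_atom_site.Cartn_z"];
xyz = str2double(resTbl{:,coordNames});   % N-by-3 double array of coordinates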
  1 Comment
Carson Purnell
Carson Purnell on 21 Oct 2022
This general strategy ended up working. Get the header block into a table, and then it becomes possible to get the target information a priori, without needing to regex-sort the headers themselves or anything like that.


More Answers (1)

Walter Roberson
Walter Roberson on 6 Sep 2022
Edited: Walter Roberson on 6 Sep 2022
I suggest using fileread() and text processing. For example,
S = fileread('YourFile.cif');
S = regexprep(S, {'^.*?(?=^ATOM)', '^(?!ATOM).*$'}, {'', ''}, 'lineanchors');
If I got the patterns right, this should first delete everything in the file before the first line that starts with ATOM, and then delete every remaining line that does not start with ATOM.
What remains would be a character vector that you could pass as the first argument to textscan().
  3 Comments
Carson Purnell
Carson Purnell on 6 Sep 2022
Regexprep is much too slow; I need to parse millions of lines of strings. str2double has the same problem, because it won't operate on arrays in a useful way, and a looped str2double is (for this) far slower than a single vectorized str2num call.
There can also be one or more data blocks per file, so clearing the lines before and after the first block is not useful. I can already extract the relevant lines for each block; I just can't figure out how to parse them without an incredibly slow set of loops, because the columns are variable.
Walter Roberson
Walter Roberson on 6 Sep 2022
When you currently extract the relevant lines for each block, what form are you extracting them into? Are you extracting them all first and post-processing?
If you have a file identifier fid positioned to loop_ then
textscan(fid, '_%*[^\n]', 'headerlines', 1)
should read through all of the _ lines, leaving you positioned at the first ATOM line. At that point you can ftell() and record the position. Then fgetl() one line and analyze it to count the fields and figure out which columns are character and which are numeric. With that information in hand, you can generate a format to read such lines: fseek() back to the beginning of the line and textscan() with that format.
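A sketch of that field-sniffing step, under the assumption that any field str2double can parse should be read as %f and everything else as %s:
pos = ftell(fid);                     % remember where the data block starts
f = split(string(fgetl(fid)));        % whitespace-delimited fields of the first data line
spec = repmat("%s", 1, numel(f));     % default: read every column as text
spec(~isnan(str2double(f))) = "%f";   % numeric-looking fields become %f
fmt = char(strjoin(spec, " "));
fseek(fid, pos, 'bof');               % rewind to the start of the block
C = textscan(fid, fmt);               % one cell per column; stops when a line no longer matches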
However, this approach would be weak if the columns are fixed width and there are some empty columns -- for example if you had a situation where alt_id was empty if the alt_id was the same as the atom_id .

