Read specific data columns from a text file based on header name requested by user

Question

Yildirim Kocoglu on 29 May 2019

0
Link

Direct link to this question

https://nl.mathworks.com/matlabcentral/answers/464694-read-specific-data-columns-from-a-text-file-based-on-header-name-requested-by-user

Answered: Yildirim Kocoglu on 5 Jun 2019

Accepted Answer: Yildirim Kocoglu

ORIGINAL_1.txt

Open in MATLAB Online

Hello,

I have the matlab version 2018a.

I'm trying to extract specific columns of a text file based on the header name of the column. I have tried couple of different methods such as readtable, textscanf, etc. but, none of them exactly worked as I expected.

I have attached the text file itself. I'm trying to make sure the code I'm writing is not slow because there are 1000's of these files that I need to look into in a for-loop possibly.

The structure never changes but, the header columns can be in different positions and that's the reason why I want the code to find the header name no matter which position the column is in.

Here is a sample from the text file:

As it can be seen, the same dates are repeated below with different headers (information) and it is repeated 3-4 times in the actual text file. If I know how to pick up "WOPR - PROD1", "WOPR-PROD2", and "FOPT" columns and put them into a matrix in this order [WOPR-PROD1; WOPR-PROD2; FOPT] I can figure out the rest I believe. I prefer not to modify the text file itself if possible.

"--------""-----------""-----------""-----------""-----------""-----------""-----------""-----------""-----------""-----------"
"SUMMARY OF RUN Original_1                                                                                                        
"--------""-----------""-----------""-----------""-----------""-----------""-----------""-----------""-----------""-----------"
"DATE       ""YEARS      ""FOPR       ""FWPR       ""FGPR       ""FOPT       ""FGPT       ""FWPT       ""FWCT       ""FWIR       "
"           ""YEARS      ""STB/DAY    ""STB/DAY    ""MSCF/DAY   ""STB        ""MSCF       ""STB        ""           ""STB/DAY    "
"           ""           ""           ""           ""           ""           ""           ""           ""           ""           "
"           ""           ""           ""           ""           ""           ""           ""           ""           ""           "
"--------""-----------""-----------""-----------""-----------""-----------""-----------""-----------""-----------""-----------"
" 1JAN2009"         0            0            0            0            0            0            0            0            0     
" 1FEB2009"  0.084873            0            0            0            0            0            0            0            0     
" 1MAR2009"  0.161533     2000.000     65.16867     1360.000     56000.00     38080.00     1824.723     0.031556            0     
" 1APR2009"  0.246407     2000.000     67.93040     1360.000     118000.0     80240.00     3906.001     0.032849            0     
" 1MAY2009"  0.328542     2449.850     53.91752     1665.898     191495.5     130216.9     5523.527     0.021535            0 
"--------""-----------""-----------""-----------""-----------""-----------""-----------""-----------""-----------""-----------"
"SUMMARY OF RUN Original_1                                                                                                                                                                                                                                                                                                                                                                                                                                                              "
"--------""-----------""-----------""-----------""-----------""-----------""-----------""-----------""-----------""-----------"
"DATE       ""FWIT       ""FGOR       ""FOIP       ""FWIP       ""FGIP       ""FPR        ""WOPR       ""WOPR       ""WOPR       "
"           ""STB        ""MSCF/STB   ""STB        ""STB        ""MSCF       ""PSIA       ""STB/DAY    ""STB/DAY    ""STB/DAY    "
"           ""           ""           ""*10**3     ""*10**3     ""*10**3     ""           ""           ""           ""           "
"           ""           ""           ""           ""           ""           ""           ""PROD1      ""PROD2      ""PROD3      "
"           ""           ""           ""           ""           ""           ""           ""           ""           ""           "
"--------""-----------""-----------""-----------""-----------""-----------""-----------""-----------""-----------""-----------"
" 1JAN2009"         0            0     31190.54     645456.1     21209.57     6553.930            0            0            0     
" 1FEB2009"         0            0     31190.54     645456.1     21209.57     6553.922            0            0            0     
" 1MAR2009"         0     0.680000     31134.54     645454.2     21171.49     6473.267            0            0            0     
" 1APR2009"         0     0.680000     31072.54     645452.2     21129.33     6394.598            0            0            0     
" 1MAY2009"         0     0.680000     30999.18     645450.7     21079.44     6296.722            0     1675.190            0         

Any help is appreciated. Thank you.

7 Comments
Show 5 older commentsHide 5 older comments

Stephen23 on 1 Jun 2019

Edited: Stephen23 on 1 Jun 2019

I would use textscan, something like this:

read the very first line using fgetl
identify the number of columns (regexp, count delimiters, or whatever).
generate textscan format strings, for header and data.
in a while loop import the file data using textscan.
post-process the headers to generate unique valid fieldnames / table variable names.
convert to structure or table.
now the order of the columns is irrelevant: simply access the structure fields / table variables.

You should read this (EDIT: unfortunately TMW seem to have removed the very useful example from this page showing how to use while to read blocks of data, but note how the repeated textscan calls could be within a while loop):

https://www.mathworks.com/help/matlab/import_export/import-block-of-mixed-data-from-text-file.html

https://www.mathworks.com/matlabcentral/answers/323009-large-data-file-with-mixed-character-strings-and-numerical-formats

Yildirim Kocoglu on 1 Jun 2019

Edited: Yildirim Kocoglu on 1 Jun 2019

Open in MATLAB Online

newfile.txt

Thank you for your suggestions Stephen. I always appreciate your input.

I'll look into the links you have provided. I'm still working on the code and I will post the answer once I have it ready. I was able to create a new file without all the [", "--...", "SUMMARY..."] lines. I might be doing some things the long way but, it's a learning curve and I'm not sure if this is the path I should follow yet.

Here is the code I have used with the previous text file that generated this newfile.txt that I have attached. I don't know how to combine all the expressions into one line if they are doing pretty much the same thing like the one regexprep that uses dotexceptnewline or if it's possible to write a one line code that does all of this. That is not important though for now.

I'm thinking if I use a " , " delimiter replacing spaces and add a space horizontally between the pages of new info (where dates start) and read them seperately, that might do the trick (maybe).

content = fileread('Original_1.txt');
filebyline = regexprep(content, '"SUMMARY\s.*$', '', 'lineanchors', 'dotexceptnewline');
filebyline = regexprep(filebyline, '"-.*$', '', 'lineanchors', 'dotexceptnewline');
newfile = regexprep(filebyline, '["]', ' ');
fid = fopen('newfile.txt', 'w');
fprintf(fid, '%s', newfile);
fclose(fid);

Sign in to comment.

Sign in to answer this question.

Answer 1

Yildirim Kocoglu on 5 Jun 2019

0
Link

Direct link to this answer

https://nl.mathworks.com/matlabcentral/answers/464694-read-specific-data-columns-from-a-text-file-based-on-header-name-requested-by-user#answer_378039

Open in MATLAB Online

Before I continue, I want to Thank @Bob Nbob and @Stephen Cobeldick for their work, suggestions and help.

I really appreciate everything you guys are doing for this community.

Whoever is interested in this post and was waiting for an answer.

It took me a while but, I have finally got it to work correctly. The code is a little long and I did not exactly choose good variable names or probably wrote comments detailed enough.

If I can write a more efficient code and anyone has suggestions I'll consider them.

I don't know if 0.027222 seconds (from beginning to end) is efficient enough for this kind of task: read 68 columns of info from a text file full of "pages" of columns with no delimiters between columns.

The output is a 1X68 column cell array called "storage" with 99X1 or 100X1 cell arrays inside each cell of 68 cells. The output can be changed to numeric value later after removing some of the "." at the very end of some numbers (look at the number "8713996." FOPT in first cell towards the last rows - there is no "0" after the decimal).

The output is also a cell array and can be converted to a different type of array by using cell2mat,etc. functions (which I have not used in this code).

%% Read txt file
% Reset all the variables
clear;
clc;
% Read the content of the text file into memory
content = fileread('Original_1.txt');
% Declare desired string occurences (columns) to create the storage cell
% (initially with an unknown size) and store specified columns
desired_string = ["FOPT", "FGPT", "FWPT", "WOPR", "WWPR", "WOPT", "WWPT", "WGPT", "WGPR", "WBHP", "WGOR" ];
% Create cell array to store the columns desired
count_string = count(content,desired_string );
storage = cell(1,count_string);
% Delete unnecessary strings and special characters for readibility (should
% be left with 130 characters "per line"-content variable in workspace is a 
% character vector of size 1*154485). Warning: New line and carriage return
% characters also need to be deleted to get 130 characters "per line".
new_content = regexprep(content, '"SUMMARY\s.*$|"-.*$|SUMMARY\s.*$|^\s+|\n|\r', '', 'lineanchors', 'dotexceptnewline');
% If first 13 characters containts the string 'DATE' insert '!' to beginning
% as a delimiter to separate into pages
search_string = {'"','DATE'};
first_13 = new_content(1:13);
if contains(first_13,search_string(1,1))
   new_str = insertBefore(new_content, '"DATE', '!');
   final_str = strsplit(new_str, '!');
elseif contains(first_13, search_string(1,2))
   new_str = insertBefore(new_content, 'DATE', '!');
   final_str = strsplit(new_str, '!');
end
% Search for a specific string and read all the columns
pages = length(final_str);
add = 13;
count = 0;
next_column = 0;
% For loop for reading each page of the content
for ii = 2:pages
    
    % Convert cell into character vector
    new_char_vec = char(final_str(1,ii));
    % Length of each character vector in each cell
    long = length(char(final_str(1,ii)));
    
    % Column number of each page!
    row_no = long/130;
    
    % Read first 130 characters, 13 characters at a time 10 times and 
    % match the desired string
    for i = 1:10
        
        row = 1 + count;
        column = 13 + count;
        testing = new_char_vec(1, (row:column));
        count = count + add;
        
        % if contains 'FOPT' %read all the columns related
        if contains(testing,desired_string)
            
            % Move to next column of storage cell when string is found
            next_column = next_column + 1;
            
            % Create cell array inside each column of storage cell array
            % row_no varies
            storage{1,next_column} = cell(row_no,1);
            % read_rows and read_columns are reset to row and column
            % everytime the desired_string is found
            read_rows = row;
            read_columns = column;
            
            % Store the the desired string column's first row
            storage{1,next_column}{1,1} = new_char_vec(1, (read_rows:read_columns));
            % Store each row of the desired string column (starting from 2nd row) in a for loop
                
                for jj = 2:(row_no)
                    read_rows = read_rows + 130;
                    read_columns = read_columns + 130;
                    storage{1,next_column}{jj,1} = new_char_vec(1, (read_rows:read_columns));
                end
            
        end
         
    end
    
    
% Reset count
count = 0;
end
% Clear all the unnecessary variables
clear add column content count count_string desired_string final_str first_13 i ii jj long new_char_vec new_content new_str next_column pages read_rows read_columns row row_no search_string testing

0 Comments
Show -2 older commentsHide -2 older comments

Sign in to comment.

Answer 2

Bob Thompson on 31 May 2019

1
Link

Direct link to this answer

https://nl.mathworks.com/matlabcentral/answers/464694-read-specific-data-columns-from-a-text-file-based-on-header-name-requested-by-user#answer_377351

I started to work on this, and realized that with my knowledge the code was going to be pretty ugly. If anybody knows a way to convert the text into an array that would help a lot, but I do not know how to do that.

With that in mind, I abandoned regexp, as it got too complex and ugly with the different layers of cells needed to break everything. Instead I just used fgetl and parsing to look through the file for the key words you want.

I am going to assume that the matrix layouts are going to remain the same for all of your text files. For example, in the posted sample you have WOPR in the first row, with PROD1, PROD2, PROD3 in the fourth row. The sample code I wrote is looking for these specific locations. They can be in any of the matrices, and don't all need to be in the same matrix, but WOPR is assumes to be in the first row, and PROD# in the fourth. You will need to adjust things if these references are not consistent.

flist = dir('ORIGINAL_*.txt');
for n = 3:length(flist) % dir usually picks up a '.' and '..' listing as the first two elements. If this is not the case just start i = 1
% Read in file
A = fopen(flist(n).name);
% Initialize everything
line = fgetl(A);
c = 1;
i = 0;
j = 0;
ifpt = [];
iwpr = {};
FPT = {};
WPR = cell(2,1);
% Loop through lines of file
while ~isnumeric(line)
    tmp = strsplit(line); % Split line at white space, to find proper column
    strt = strfind(tmp,'SUMMARY'); % Look for beginning of matrix
    strt = strt(~cellfun('isempty',strt)); % Remove negative results from check
    
    % Record data from previous matrix
    if ~isempty(strt)
        if i > 0 % First matrix check
            FOPT(:,1,i) = cellfun(@(x) str2num(x),FPT(5:9)); % Record FOPT data in array
            FPT = {}; % Reset FPT var
        end
        if j > 0
            WOPR(:,1,i) = cellfun(@(x) str2num(x),WPR{1}(3:7)); % Record WOPR-PROD1 in first column
            WOPR(:,2,i) = cellfun(@(x) str2num(x),WPR{2}(3:7)); % Record WOPR-PROD2 in second column
            WPR = {}; % Reset
        end
        ifpt = []; % Reset
        iwpr = {}; % Reset
    end
    
    % Check for desired columns
    fpt = strfind(tmp,'FOPT'); % Look for FOPT
    fpt = fpt(~cellfun('isempty',fpt)); % Remove negatives
    if ~isempty(fpt)
        ifpt = find(contains(tmp,'FOPT')); % Get index of positive result
        i = i + 1; % Advance FOPT results array index
    end
    wpr = strfind(tmp,'WOPR'); % Look for WOPR
    wpr = wpr(~cellfun('isempty',wpr));
    if ~isempty(wpr)
        for k = 1:6 % Skip lines to find PROD
            line = fgetl(A); c = c + 3;
        end
        % NOTE: The above will ruin the indexing of FOPT array if both
        % occur in the same matrix
        tmp = strsplit(line);
        iwpr{1} = find(contains(tmp,'PROD1')); % Check which WOPR is PROD1
        iwpr{2} = find(contains(tmp,'PROD2')); % Check which WOPR is PROD2
        j = j + 1; % Advance WOPR results array index
    end
    % Capture data
    if ~isempty(ifpt)&size(tmp,2)>1 % FOPT exists in matrix, and isn't blank line
        FPT = vertcat(FPT, tmp(ifpt));
    end
    
    if ~isempty(iwpr)&size(tmp,2)>1 % WOPR exists in matrix, and isn't blank line
        WPR{1} = vertcat(WPR{1}, tmp(iwpr{1}+1));
        WPR{2} = vertcat(WPR{2}, tmp(iwpr{2}+1));
    end
    
    % Advance through file
    line = fgetl(A); 
    c = c + 1;
end
% Record results from last matrix
if ~isempty(FPT) % FOPT existed in last matrix
    if i > 0
        FOPT(:,1,i) = cellfun(@(x) str2num(x),FPT(5:9));
        FPT = {};
    end
end
if ~isempty(WPR{1}) % WOPR existed in last matrix
    if j > 0
        WOPR(:,1,i) = cellfun(@(x) str2num(x),WPR{1}(3:7));
        WOPR(:,2,i) = cellfun(@(x) str2num(x),WPR{2}(3:7));
        WPR = {};
    end
end
    
end

2 Comments
Show NoneHide None

Yildirim Kocoglu on 31 May 2019

Edited: Yildirim Kocoglu on 1 Jun 2019

Thank you @Bob Nbob. This code must have taken some time to write... I really appreciate everything you have done.

I have not tested your code yet. I'll let get back to you after testing the code however, I also want you and anyone else reading this post be aware of some new things I have discovered which might be very useful in finding an easier solution for this problem.

I realize that I should have updated you earlier about the pattern discoveries I made yesterday night in the text file and I apologize for not letting you know sooner.

I noticed these patterns in text file:

Every single column of info is made up of 13 characters which includes [", _, space] characters

Example: Read from "1JAN2009"-"DATE" to the beginning of the next value-"YEARS" in the same row and you will see that it is 13 characters not including the next value -"FOPR" and when you start reading from the next value "FOPR" to the beginning of the next value-"FWPR" you will again see that it is 13 characters not including the next value (FWPR) and this pattern is consistent until the end of the line.

I have tested this theory with fileread() function: There are 10 columns = 13*10=130 characters written in each line and a new line starts at 131 st character. These 130 characters does not necessarily include 10 columns of info, it could be 9 columns of info and the rest of the characters are space characters but, each line is always made up of 130 characters regardless. If you start from the beginning of the line and read to the right-downwards at 132nd character you will get the letter "S" of "SUMMARY" after ' " ' character as 131st character.

The string "SUMMARY..." in the text file marks the beginning of new information = Dates,etc.
The last line of the page does not have the string "SUMMARY...".
One more important realization: The line with string "SUMMARY..." is an exception to the 130 characters per line rule. That seems to be the only line that goes way beyond 130 characters.
Last realization (hopefully): lines that start with "_ _ _..." also are an exception to the 130 per line rule.

I have a plan of my own for solving this issue (approximate solution):

Replace [",-] characters with space character so that the structure of the text file=13 characters per column stays in tact.
If a line is made of all space characters, remove/skip that line so that I can only work with the info I need: read from the string "SUMMARY..." to the next string "SUMMARY..." and work with that info but, do not include the string "SUMMARY...".
Store everything after the string "SUMMARY..." that is not a date or a number into one cell as Header no matter how many rows it takes to read since the number of rows can be inconsistent for each header (4 above and 5 below including PROD1,etc.).
After seeing a date or a number, divide everything into 10 x 13 character columns until the next string "SUMMARY..." or end of page since the last info will not have the string "SUMMARY..." at the end of it.
Make sure each header is correctly related to the column info and call the column data based on the header and get rid of the rest of the information.
There is also a possibility that the last columns might not equal 10 columns of info: 10th column could be made up of all space characters so, somehow skip these.

With my current knowledge it will not be that easy for me to put this into code however, I think this will solve the problem. There might also be a better solution to this that I don't know about.

Thank you again @Bob Nbob.

Yildirim Kocoglu on 31 May 2019

Edited: Yildirim Kocoglu on 1 Jun 2019

I have tested the code without any modifications and it did not produce the results I wanted.

The line with "SUMMARY..." string has the character ' " ' at the very end (need to scroll all the way) and it has the same character in every line of "SUMMARY..." string.

The results it produced were:

The structure has name, path, date of text file, bytes, isdir=0, and datenum=7.3757e+05.

I might have to look into it more but, regardless thank you for the time and effort you have put into this. At least your code will give me more ideas and I will learn/remember faster. It's been a while since the last time I used matlab.

Sign in to comment.

Read specific data columns from a text file based on header name requested by user

7 Comments
Show 5 older commentsHide 5 older comments

Accepted Answer

0 Comments
Show -2 older commentsHide -2 older comments

More Answers (1)

2 Comments
Show NoneHide None

See Also

Categories

Tags

Products

Release

Community Treasure Hunt

Read specific data columns from a text file based on header name requested by user

7 Comments Show 5 older commentsHide 5 older comments

Accepted Answer

0 Comments Show -2 older commentsHide -2 older comments

More Answers (1)

2 Comments Show NoneHide None

See Also

Categories

Tags

Products

Release

Community Treasure Hunt

7 Comments
Show 5 older commentsHide 5 older comments

0 Comments
Show -2 older commentsHide -2 older comments

2 Comments
Show NoneHide None