Read specific data columns from a text file based on header name requested by user

30 views (last 30 days)
Hello,
I have the matlab version 2018a.
I'm trying to extract specific columns of a text file based on the header name of the column. I have tried couple of different methods such as readtable, textscanf, etc. but, none of them exactly worked as I expected.
I have attached the text file itself. I'm trying to make sure the code I'm writing is not slow because there are 1000's of these files that I need to look into in a for-loop possibly.
The structure never changes but, the header columns can be in different positions and that's the reason why I want the code to find the header name no matter which position the column is in.
Here is a sample from the text file:
As it can be seen, the same dates are repeated below with different headers (information) and it is repeated 3-4 times in the actual text file. If I know how to pick up "WOPR - PROD1", "WOPR-PROD2", and "FOPT" columns and put them into a matrix in this order [WOPR-PROD1; WOPR-PROD2; FOPT] I can figure out the rest I believe. I prefer not to modify the text file itself if possible.
"--------""-----------""-----------""-----------""-----------""-----------""-----------""-----------""-----------""-----------"
"SUMMARY OF RUN Original_1
"--------""-----------""-----------""-----------""-----------""-----------""-----------""-----------""-----------""-----------"
"DATE ""YEARS ""FOPR ""FWPR ""FGPR ""FOPT ""FGPT ""FWPT ""FWCT ""FWIR "
" ""YEARS ""STB/DAY ""STB/DAY ""MSCF/DAY ""STB ""MSCF ""STB "" ""STB/DAY "
" "" "" "" "" "" "" "" "" "" "
" "" "" "" "" "" "" "" "" "" "
"--------""-----------""-----------""-----------""-----------""-----------""-----------""-----------""-----------""-----------"
" 1JAN2009" 0 0 0 0 0 0 0 0 0
" 1FEB2009" 0.084873 0 0 0 0 0 0 0 0
" 1MAR2009" 0.161533 2000.000 65.16867 1360.000 56000.00 38080.00 1824.723 0.031556 0
" 1APR2009" 0.246407 2000.000 67.93040 1360.000 118000.0 80240.00 3906.001 0.032849 0
" 1MAY2009" 0.328542 2449.850 53.91752 1665.898 191495.5 130216.9 5523.527 0.021535 0
"--------""-----------""-----------""-----------""-----------""-----------""-----------""-----------""-----------""-----------"
"SUMMARY OF RUN Original_1 "
"--------""-----------""-----------""-----------""-----------""-----------""-----------""-----------""-----------""-----------"
"DATE ""FWIT ""FGOR ""FOIP ""FWIP ""FGIP ""FPR ""WOPR ""WOPR ""WOPR "
" ""STB ""MSCF/STB ""STB ""STB ""MSCF ""PSIA ""STB/DAY ""STB/DAY ""STB/DAY "
" "" "" ""*10**3 ""*10**3 ""*10**3 "" "" "" "" "
" "" "" "" "" "" "" ""PROD1 ""PROD2 ""PROD3 "
" "" "" "" "" "" "" "" "" "" "
"--------""-----------""-----------""-----------""-----------""-----------""-----------""-----------""-----------""-----------"
" 1JAN2009" 0 0 31190.54 645456.1 21209.57 6553.930 0 0 0
" 1FEB2009" 0 0 31190.54 645456.1 21209.57 6553.922 0 0 0
" 1MAR2009" 0 0.680000 31134.54 645454.2 21171.49 6473.267 0 0 0
" 1APR2009" 0 0.680000 31072.54 645452.2 21129.33 6394.598 0 0 0
" 1MAY2009" 0 0.680000 30999.18 645450.7 21079.44 6296.722 0 1675.190 0
Any help is appreciated. Thank you.
  7 Comments
Yildirim Kocoglu
Yildirim Kocoglu on 1 Jun 2019
Edited: Yildirim Kocoglu on 1 Jun 2019
Thank you for your suggestions Stephen. I always appreciate your input.
I'll look into the links you have provided. I'm still working on the code and I will post the answer once I have it ready. I was able to create a new file without all the [", "--...", "SUMMARY..."] lines. I might be doing some things the long way but, it's a learning curve and I'm not sure if this is the path I should follow yet.
Here is the code I have used with the previous text file that generated this newfile.txt that I have attached. I don't know how to combine all the expressions into one line if they are doing pretty much the same thing like the one regexprep that uses dotexceptnewline or if it's possible to write a one line code that does all of this. That is not important though for now.
I'm thinking if I use a " , " delimiter replacing spaces and add a space horizontally between the pages of new info (where dates start) and read them seperately, that might do the trick (maybe).
content = fileread('Original_1.txt');
filebyline = regexprep(content, '"SUMMARY\s.*$', '', 'lineanchors', 'dotexceptnewline');
filebyline = regexprep(filebyline, '"-.*$', '', 'lineanchors', 'dotexceptnewline');
newfile = regexprep(filebyline, '["]', ' ');
fid = fopen('newfile.txt', 'w');
fprintf(fid, '%s', newfile);
fclose(fid);

Sign in to comment.

Accepted Answer

Yildirim Kocoglu
Yildirim Kocoglu on 5 Jun 2019
Before I continue, I want to Thank @Bob Nbob and @Stephen Cobeldick for their work, suggestions and help.
I really appreciate everything you guys are doing for this community.
Whoever is interested in this post and was waiting for an answer.
It took me a while but, I have finally got it to work correctly. The code is a little long and I did not exactly choose good variable names or probably wrote comments detailed enough.
If I can write a more efficient code and anyone has suggestions I'll consider them.
I don't know if 0.027222 seconds (from beginning to end) is efficient enough for this kind of task: read 68 columns of info from a text file full of "pages" of columns with no delimiters between columns.
The output is a 1X68 column cell array called "storage" with 99X1 or 100X1 cell arrays inside each cell of 68 cells. The output can be changed to numeric value later after removing some of the "." at the very end of some numbers (look at the number "8713996." FOPT in first cell towards the last rows - there is no "0" after the decimal).
The output is also a cell array and can be converted to a different type of array by using cell2mat,etc. functions (which I have not used in this code).
%% Read txt file
% Reset all the variables
clear;
clc;
% Read the content of the text file into memory
content = fileread('Original_1.txt');
% Declare desired string occurences (columns) to create the storage cell
% (initially with an unknown size) and store specified columns
desired_string = ["FOPT", "FGPT", "FWPT", "WOPR", "WWPR", "WOPT", "WWPT", "WGPT", "WGPR", "WBHP", "WGOR" ];
% Create cell array to store the columns desired
count_string = count(content,desired_string );
storage = cell(1,count_string);
% Delete unnecessary strings and special characters for readibility (should
% be left with 130 characters "per line"-content variable in workspace is a
% character vector of size 1*154485). Warning: New line and carriage return
% characters also need to be deleted to get 130 characters "per line".
new_content = regexprep(content, '"SUMMARY\s.*$|"-.*$|SUMMARY\s.*$|^\s+|\n|\r', '', 'lineanchors', 'dotexceptnewline');
% If first 13 characters containts the string 'DATE' insert '!' to beginning
% as a delimiter to separate into pages
search_string = {'"','DATE'};
first_13 = new_content(1:13);
if contains(first_13,search_string(1,1))
new_str = insertBefore(new_content, '"DATE', '!');
final_str = strsplit(new_str, '!');
elseif contains(first_13, search_string(1,2))
new_str = insertBefore(new_content, 'DATE', '!');
final_str = strsplit(new_str, '!');
end
% Search for a specific string and read all the columns
pages = length(final_str);
add = 13;
count = 0;
next_column = 0;
% For loop for reading each page of the content
for ii = 2:pages
% Convert cell into character vector
new_char_vec = char(final_str(1,ii));
% Length of each character vector in each cell
long = length(char(final_str(1,ii)));
% Column number of each page!
row_no = long/130;
% Read first 130 characters, 13 characters at a time 10 times and
% match the desired string
for i = 1:10
row = 1 + count;
column = 13 + count;
testing = new_char_vec(1, (row:column));
count = count + add;
% if contains 'FOPT' %read all the columns related
if contains(testing,desired_string)
% Move to next column of storage cell when string is found
next_column = next_column + 1;
% Create cell array inside each column of storage cell array
% row_no varies
storage{1,next_column} = cell(row_no,1);
% read_rows and read_columns are reset to row and column
% everytime the desired_string is found
read_rows = row;
read_columns = column;
% Store the the desired string column's first row
storage{1,next_column}{1,1} = new_char_vec(1, (read_rows:read_columns));
% Store each row of the desired string column (starting from 2nd row) in a for loop
for jj = 2:(row_no)
read_rows = read_rows + 130;
read_columns = read_columns + 130;
storage{1,next_column}{jj,1} = new_char_vec(1, (read_rows:read_columns));
end
end
end
% Reset count
count = 0;
end
% Clear all the unnecessary variables
clear add column content count count_string desired_string final_str first_13 i ii jj long new_char_vec new_content new_str next_column pages read_rows read_columns row row_no search_string testing

More Answers (1)

Bob Thompson
Bob Thompson on 31 May 2019
I started to work on this, and realized that with my knowledge the code was going to be pretty ugly. If anybody knows a way to convert the text into an array that would help a lot, but I do not know how to do that.
With that in mind, I abandoned regexp, as it got too complex and ugly with the different layers of cells needed to break everything. Instead I just used fgetl and parsing to look through the file for the key words you want.
I am going to assume that the matrix layouts are going to remain the same for all of your text files. For example, in the posted sample you have WOPR in the first row, with PROD1, PROD2, PROD3 in the fourth row. The sample code I wrote is looking for these specific locations. They can be in any of the matrices, and don't all need to be in the same matrix, but WOPR is assumes to be in the first row, and PROD# in the fourth. You will need to adjust things if these references are not consistent.
flist = dir('ORIGINAL_*.txt');
for n = 3:length(flist) % dir usually picks up a '.' and '..' listing as the first two elements. If this is not the case just start i = 1
% Read in file
A = fopen(flist(n).name);
% Initialize everything
line = fgetl(A);
c = 1;
i = 0;
j = 0;
ifpt = [];
iwpr = {};
FPT = {};
WPR = cell(2,1);
% Loop through lines of file
while ~isnumeric(line)
tmp = strsplit(line); % Split line at white space, to find proper column
strt = strfind(tmp,'SUMMARY'); % Look for beginning of matrix
strt = strt(~cellfun('isempty',strt)); % Remove negative results from check
% Record data from previous matrix
if ~isempty(strt)
if i > 0 % First matrix check
FOPT(:,1,i) = cellfun(@(x) str2num(x),FPT(5:9)); % Record FOPT data in array
FPT = {}; % Reset FPT var
end
if j > 0
WOPR(:,1,i) = cellfun(@(x) str2num(x),WPR{1}(3:7)); % Record WOPR-PROD1 in first column
WOPR(:,2,i) = cellfun(@(x) str2num(x),WPR{2}(3:7)); % Record WOPR-PROD2 in second column
WPR = {}; % Reset
end
ifpt = []; % Reset
iwpr = {}; % Reset
end
% Check for desired columns
fpt = strfind(tmp,'FOPT'); % Look for FOPT
fpt = fpt(~cellfun('isempty',fpt)); % Remove negatives
if ~isempty(fpt)
ifpt = find(contains(tmp,'FOPT')); % Get index of positive result
i = i + 1; % Advance FOPT results array index
end
wpr = strfind(tmp,'WOPR'); % Look for WOPR
wpr = wpr(~cellfun('isempty',wpr));
if ~isempty(wpr)
for k = 1:6 % Skip lines to find PROD
line = fgetl(A); c = c + 3;
end
% NOTE: The above will ruin the indexing of FOPT array if both
% occur in the same matrix
tmp = strsplit(line);
iwpr{1} = find(contains(tmp,'PROD1')); % Check which WOPR is PROD1
iwpr{2} = find(contains(tmp,'PROD2')); % Check which WOPR is PROD2
j = j + 1; % Advance WOPR results array index
end
% Capture data
if ~isempty(ifpt)&size(tmp,2)>1 % FOPT exists in matrix, and isn't blank line
FPT = vertcat(FPT, tmp(ifpt));
end
if ~isempty(iwpr)&size(tmp,2)>1 % WOPR exists in matrix, and isn't blank line
WPR{1} = vertcat(WPR{1}, tmp(iwpr{1}+1));
WPR{2} = vertcat(WPR{2}, tmp(iwpr{2}+1));
end
% Advance through file
line = fgetl(A);
c = c + 1;
end
% Record results from last matrix
if ~isempty(FPT) % FOPT existed in last matrix
if i > 0
FOPT(:,1,i) = cellfun(@(x) str2num(x),FPT(5:9));
FPT = {};
end
end
if ~isempty(WPR{1}) % WOPR existed in last matrix
if j > 0
WOPR(:,1,i) = cellfun(@(x) str2num(x),WPR{1}(3:7));
WOPR(:,2,i) = cellfun(@(x) str2num(x),WPR{2}(3:7));
WPR = {};
end
end
end
  2 Comments
Yildirim Kocoglu
Yildirim Kocoglu on 31 May 2019
Edited: Yildirim Kocoglu on 1 Jun 2019
I have tested the code without any modifications and it did not produce the results I wanted.
The line with "SUMMARY..." string has the character ' " ' at the very end (need to scroll all the way) and it has the same character in every line of "SUMMARY..." string.
The results it produced were:
Capture.PNG
The structure has name, path, date of text file, bytes, isdir=0, and datenum=7.3757e+05.
I might have to look into it more but, regardless thank you for the time and effort you have put into this. At least your code will give me more ideas and I will learn/remember faster. It's been a while since the last time I used matlab.

Sign in to comment.

Products


Release

R2018a

Community Treasure Hunt

Find the treasures in MATLAB Central and discover how the community can help you!

Start Hunting!