How to import Text File with 2 different Delimiters (how to organize header data and numeric data)

I want to import a text file. This contains a header (with space as delimiter) and data (tab delimited).
The txt-file looks like this:
FORMAT TAB_DELIMITED
NUM_HEADER_BLOCKS 162
NUM_PARAMS 646
PT_COUNT.CND_1 3895
FRAMES.CND_1 16
FILE_TYPE TIME_HISTORY
OPERATION RSP_TO_TAB
DATA_TYPE ASCII_FLOATING_POINT
DATE Fri Jun 23 11:20:24 2017
DELTA_T 9.765625e-02
TOTAL_T 3.803711e+02
PTS_PER_FRAME 256
PTS_PER_GROUP 256
CHANNELS 120
.
.
NUM_ZEROS 5 %end of header with line index 646
RfLongPositionFbk RfLatPositionFbk ...... %start of tab delimited area with the data (120 channels)
mm mm
-12.6182 -4.071238
-12.6192 -4.070237
-12.6182 -4.069237
  1. I want to find the line that contains "NUM_PARAMS" and read its numeric value, which tells me the size of the header section.
  2. After that I want to read the file up to line 646 into two columns (1st column: parameter name, 2nd column: value).
  3. Then I want to read the data (tab delimited, 120 channels). Ideally the channels would be named after the names shown in the line above the units of measurement.
I started by reading the full text file with the following code, to import the header and search for NUM_PARAMS:
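fid = fopen(fileName, 'r');                       % open the file for reading
s = textscan(fid, '%s%s', 'delimiter', ' ');      % column 1: names, column 2: values
fclose(fid);
idx_NUM_PARAMS = find(strcmp(s{1}, 'NUM_PARAMS'), 1, 'first');
NUM_PARAMSdbl = str2double(s{1,2}{idx_NUM_PARAMS,1});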
But this also imports the data section as strings, which is not usable because that section uses a different delimiter.
So I read the data in a second step:
dataTable = readtable(fileName, 'Delimiter', '\t', 'headerLines',NUM_PARAMSdbl+4,'ReadVariableNames',true);
But I cannot name the table columns after the channel names, only after the line right above the data (the units of measurement).
Thank you for any hints on how I can solve this problem.

Answers (1)

You may not need to use header information for parsing your file. Look at this example (applied to data.txt attached):
content = fileread( 'data.txt' ) ;
% - Split header/data.
pos = strfind( content, 'RfLongPositionFbk' ) ;
header = strtrim( content(1:pos-1) ) ;
data = content(pos:end) ;
% - Header -> struct with numeric values when possible.
header = regexp( header, '^(\S+)\s+([^\r\n]+)', 'tokens', 'lineanchors' ) ;
header = vertcat( header{:} ) ;
fNames = regexprep( header(:,1), '\W', '_' ) ;
values = strtrim( header(:,2) ) ;
buffer = str2double( values ) ;
isNum = ~isnan( buffer ) ;
values(isNum) = num2cell( buffer(isNum) ) ;
header = cell2struct( values,fNames ) ;
% - Data -> num array.
data = cell2mat( textscan( data, '%f %f', 'headerlines', 2 )) ;
Running this, you get:
>> header
header =
struct with fields:
FORMAT: 'TAB_DELIMITED'
NUM_HEADER_BLOCKS: 162
NUM_PARAMS: 646
PT_COUNT_CND_1: 3895
FRAMES_CND_1: 16
FILE_TYPE: 'TIME_HISTORY'
OPERATION: 'RSP_TO_TAB'
DATA_TYPE: 'ASCII_FLOATING_POINT'
DATE: 'Fri Jun 23 11:20:24 2017'
DELTA_T: 0.0977
TOTAL_T: 380.3711
PTS_PER_FRAME: 256
PTS_PER_GROUP: 256
CHANNELS: 120
NUM_ZEROS: 5
>> data
data =
-12.6182 -4.0712
-12.6192 -4.0702
-12.6182 -4.0692
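Once the header is a struct, its numeric fields can be used directly. As a minimal sketch, assuming DELTA_T is the sampling interval (the sample header is at least consistent with that, since PT_COUNT.CND_1 times DELTA_T is approximately TOTAL_T), a time axis for the data could be built as below. The header and data values are stand-ins copied from the example, and nSamples and t are illustrative names.

```matlab
% Stand-ins for the 'header' struct and 'data' array from the example above.
header = struct( 'DELTA_T', 9.765625e-02 ) ;
data   = [-12.6182 -4.071238; -12.6192 -4.070237; -12.6182 -4.069237] ;

% Assumption: DELTA_T is the time step between consecutive samples.
nSamples = size( data, 1 ) ;
t = (0:nSamples-1).' * header.DELTA_T ;   % Column vector of sample times.
```

With that, plot( t, data(:,1) ) would show the first channel against time.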

7 Comments

Comment from Ulrich moved here [Cedric]:
Thank you for your answer. I tried to run your code, but I got the following error because the parameter "VEH_D_TEST_NAME" appears twice in the header.
Error using cell2struct
Duplicate field name "VEH_D_TEST_NAME"
Then it is better to keep a cell array. Replace the block that processes the header by:
% - Header -> cell array with numeric values when possible.
header = regexp( header, '^(\S+)\s+([^\r\n]+)', 'tokens', 'lineanchors' ) ;
header = vertcat( header{:} ) ;
values = str2double( strtrim( header(:,2) )) ;
isNum = ~isnan( values ) ;
header(isNum,2) = num2cell( values(isNum) ) ;
With that, header is a cell array with names in the first column and values in the second.
We can talk if you absolutely need a struct and want to e.g. group values with the same field name in a sub-cell-array.
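For reference, grouping repeated names into a sub-cell-array could look roughly like the sketch below. It is only one possible approach: the header cell array is a small stand-in (the 'run A' and 'run B' values are made up for illustration), headerStruct is an illustrative name, and the sketch assumes that every name becomes a valid field name once non-word characters are replaced by underscores.

```matlab
% Stand-in for the two-column 'header' cell array (values are illustrative).
header = { 'NUM_PARAMS', 646 ; 'VEH_D_TEST_NAME', 'run A' ; ...
           'CHANNELS', 120 ; 'VEH_D_TEST_NAME', 'run B' } ;

fNames = regexprep( header(:,1), '\W', '_' ) ;    % Replace non-word characters.
[uNames, ~, grp] = unique( fNames, 'stable' ) ;   % Group identical names.

headerStruct = struct() ;
for k = 1 : numel( uNames )
    vals = header(grp == k, 2) ;
    if numel( vals ) == 1
        headerStruct.(uNames{k}) = vals{1} ;      % Single value: store as is.
    else
        headerStruct.(uNames{k}) = vals.' ;       % Duplicates: 1xN sub-cell-array.
    end
end
```

Here headerStruct.NUM_PARAMS is a plain number, while headerStruct.VEH_D_TEST_NAME is a cell array holding both values in file order.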
Answer from Ulrich moved here [Cedric]:
With the extraction of the header into a 2 column cell I am fine.
Because the first channel is not always RfLongPositionFbk and the number of channels is unknown, is there a more flexible solution, like:
  1. Locate the line with the channel names and read them into a cell array, so that I can count them and determine the index of a channel name (to locate its data in another array, see next).
  2. Read the data with dlmread or readtable.
I know how long the header is (the value of the parameter NUM_PARAMS). How can I read only the line at a specific index (without the following lines)?
We can do almost anything, it's just a question of defining well what we can exploit and how we make the approach flexible. For example, we can easily get NUM_PARAMS and CHANNELS if it is all you need from the header:
content = fileread( 'data.txt' ) ;
numParams = str2double( regexp( content, '(?<=NUM_PARAMS\s+)\S+', 'match', 'once' )) ;
numChannels = str2double( regexp( content, '(?<=CHANNELS\s+)\S+', 'match', 'once' )) ;
With that:
>> numParams
numParams =
646
>> numChannels
numChannels =
120
Do you have one numeric array per channel then?
If your file is not too confidential, the easiest way to get to a solution is to attach it, or you can send it to me by email at matlab@elitemail.org.
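On the question of reading only the line at a given index: since the whole file is already in memory after fileread, one simple option is to split the content into lines and index into the result. The sketch below uses a small stand-in string instead of the real file, and lineIdx, lineText and names are illustrative variable names.

```matlab
% Stand-in content; in practice: content = fileread( fileName ) ;
content = sprintf( 'A 1\r\nB 2\r\nname1\tname2\r\nmm\tmm\r\n1.0\t2.0\r\n' ) ;

lineIdx  = 3 ;                                   % 1-based index of the wanted line.
lines    = regexp( content, '\r?\n', 'split' ) ; % One cell per line, no line breaks.
lineText = lines{lineIdx} ;

% Split the tab-delimited line into a cell array of names.
names = strsplit( strtrim( lineText ), sprintf( '\t' )) ;
```

numel( names ) then gives the number of entries on that line, and find( strcmp( names, 'name2' )) gives the column index of a given name.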
I still don't understand if you really need the information in the header or not (if it was just for getting the number of lines in the header and the number of channels). Assuming that you just want the data and the channel names and units, the following works:
content = fileread( '012f1ri(Forum).txt' ) ;
% - Extract # parameters and channels.
nParams = str2double( regexp( content, '(?<=NUM_PARAMS\s+)\S+', 'match', 'once' )) ;
nChannels = str2double( regexp( content, '(?<=CHANNELS\s+)\S+', 'match', 'once' )) ;
% - Read channel names and units (skipping nParams+2 and nParams+3 lines, respectively).
fmtSpecNames = repmat( '%s', 1, nChannels ) ;
channelNames = textscan( content, fmtSpecNames, 1, 'HeaderLines', nParams+2 ) ;
channelNames = horzcat( channelNames{:} ) ;
channelUnits = textscan( content, fmtSpecNames, 1, 'HeaderLines', nParams+3 ) ;
channelUnits = horzcat( channelUnits{:} ) ;
% - Read channel data (skipping the first nParams+4 lines).
fmtSpecData = repmat( '%f', 1, nChannels ) ;
channelData = textscan( content, fmtSpecData, 'HeaderLines', nParams+4 ) ;
channelData = cell2mat( channelData ) ;
After running this, variables channelNames, channelUnits, and channelData contain names, units and data respectively.
Then we can convert to struct array, table, or whatever is best for you, and extract data from the header as well if needed.
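As an illustration of the table option, here is a minimal sketch. The three input variables are small stand-ins for the results of the code above, and varNames, T and latPos are illustrative names. It relies on matlab.lang.makeValidName and matlab.lang.makeUniqueStrings, which as far as I know were introduced in R2014a, to turn channel names into valid, unique table variable names.

```matlab
% Stand-ins for channelNames, channelUnits and channelData from the code above.
channelNames = { 'RfLongPositionFbk', 'RfLatPositionFbk' } ;
channelUnits = { 'mm', 'mm' } ;
channelData  = [-12.6182 -4.071238; -12.6192 -4.070237; -12.6182 -4.069237] ;

% Table variable names must be valid and unique MATLAB identifiers.
varNames = matlab.lang.makeValidName( channelNames ) ;
varNames = matlab.lang.makeUniqueStrings( varNames ) ;

T = array2table( channelData, 'VariableNames', varNames ) ;
T.Properties.VariableUnits = channelUnits ;      % Keep the units with the table.

latPos = T.RfLatPositionFbk ;                    % One channel, accessed by name.
```

The units stay attached to the table and are shown by summary( T ).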
Ulrich Bretz's "Answer" moved here:
This is my current status:
content = fileread(fileName);
lineStarts = [0, strfind( content, sprintf('\n') )] + 1 ;
numParams_header = str2double( regexp( content, '(?<=NUM_PARAMS\s+)\S+', 'match', 'once' ));
header = content(lineStarts(1):(lineStarts(numParams_header+1)-1));
channels = content(lineStarts(numParams_header +3):(lineStarts(numParams_header +4)-1));
units = content(lineStarts(numParams_header +4):(lineStarts(numParams_header +5)-1));
data = content(lineStarts(numParams_header +6):end);
How can I convert the channels and units from a sequence of characters to a char array?
I use MATLAB R2014a.
The answer in my comment above does this already. But if you want to follow your current approach, you can use STRSPLIT to get cell arrays of channel names and units (and possibly STRTRIM before, to get rid of \r if STRSPLIT outputs a 121st empty cell).
For the data, I would do it this way:
data = sscanf( data, '%f' ) ; % Long vector of all data.
data = reshape( data, numel(channels), [] ).' ; % Reshape into array.
where channels is a cell array of channel names (output of STRSPLIT).
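Put together, the STRTRIM, STRSPLIT, SSCANF and RESHAPE steps could look like the sketch below. The three character blocks are short stand-ins for the channels, units and data pieces cut out of the file content, and tab, idx and latPos are illustrative names. The sketch assumes that every data row has exactly one value per channel.

```matlab
tab = sprintf( '\t' ) ;

% Stand-ins for the 'channels', 'units' and 'data' character blocks.
channels = sprintf( 'RfLongPositionFbk\tRfLatPositionFbk\r\n' ) ;
units    = sprintf( 'mm\tmm\r\n' ) ;
data     = sprintf( '-12.6182\t-4.071238\r\n-12.6192\t-4.070237\r\n' ) ;

% STRTRIM removes the trailing line break before splitting on tabs.
channels = strsplit( strtrim( channels ), tab ) ;   % 1xN cell array of names.
units    = strsplit( strtrim( units ), tab ) ;      % 1xN cell array of units.

data = sscanf( data, '%f' ) ;                       % Long vector of all values.
data = reshape( data, numel( channels ), [] ).' ;   % One column per channel.

% Locate a channel by name and extract its column.
idx    = find( strcmp( channels, 'RfLatPositionFbk' ), 1 ) ;
latPos = data(:, idx) ;
```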


Asked: 1 Nov 2017
Edited: 3 Nov 2017
