Select rows from txt file

Hi,
Could someone help on how can I choose specific rows from the below sample (the original txt file is actually huge):
Stock Date Time Price Volume Stock Category > ETE 04/01/2010 10145959 18.34 500 Big Cap > ETE 04/01/2010 10150000 18.34 70 Big Cap > ETE 04/01/2010 10170000 18.34 430 Big Cap > ABC 04/01/2010 10190000 18.34 200 Big Cap > YYY 04/01/2010 10200000 18.34 100 Big Cap > ETE 04/01/2010 10250000 18.34 40 Big Cap > ETE 04/01/2010 10295959 18.34 215 Big Cap > ETE 04/01/2010 10300000 18.34 500 Big Cap > ETE 04/01/2010 10320000 18.34 500 Big Cap
For instance can I keep only rows for stock 'ABC' (column 1)?
Thanks in advance,
Panos

1 Comment

Is each string between >'s supposed to be a row? If so, please format it so people know what you're talking about.

Sign in to comment.

 Accepted Answer

How are you going to store the data once you get it? Do you just want the text? Or do you actually want to read in data? If the latter, you'll need to store the data in some kind of flexible container, like a cell array. Unfortunately, this will increase your memory requirements (which could be a problem, given that you said the file is huge). Here are some options you could try. Try them on the small example to see the various formats for the result. In particular, keep an eye on the bytes used, if memory is an issue.
Option 1:
fid = fopen('stocks.txt');
data = textscan(fid,'%s%s%f%f%f%[^\n]','delimiter',' ','headerlines',1);
idx = strcmp('ETE',data{1});
f = @(x) x(idx);
ETEdata = cellfun(f,data,'uniformoutput',false)
whos ETEdata
fclose(fid);
Option 2:
fid = fopen('stocks.txt');
hdr = textscan(fid,'%s%s%s%s%s%[^\n]',1,'delimiter',' ');
ETEdata = {};
while ~feof(fid)
thisdata = textscan(fid,'%s%s%f%f%f%[^\n]',1,'delimiter',' ');
if strcmp(thisdata{1},'ETE')
ETEdata = [ETEdata;{thisdata{1}{1},thisdata{2}{1},...
thisdata{3:5},thisdata{6}{1}}];
end
end
fclose(fid);
ETEdata
whos ETEdata
Option 3 (requires Statistics Toolbox):
fid = fopen('stocks.txt');
hdr = textscan(fid,'%s%s%s%s%s%[^\n]',1,'delimiter',' ');
fclose(fid);
hdr = [hdr{:}];
hdr = regexprep(hdr,'\W','');
data = dataset('file','stocks.txt','format','%s%s%f%f%f%[^\n]',...
'delimiter',' ','headerlines',1,'readvarnames',false);
data.Properties.VarNames = hdr
data.Stock = nominal(data.Stock);
data.StockCategory = nominal(data.StockCategory);
ETEdata = data(data.Stock == 'ETE',:)
whos ETEdata

7 Comments

Pap
Pap on 22 Mar 2011
Thanks for your reply Matt,
I need to elect only each stock's elements. Then I need to create a new variable ( lets say column 7)as logical function of previous columns and following that I need to run a statistical analysis (maximum likelihood).
In any case, the maximum likelihood should be run at the elements of each stock ( along to the new variable).
- My original file has almost 11,000,000 rows. So which method do you think is the more appropriate one, in terms of memory?
- You apply the above codes for selecting 'ETE', or exclude it?
Many many thanks again,
Panos
Oh yeah, I forgot to mention that I was selecting ETE instead of ABC, just because there was only one ABC, so it wasn't so obvious how the result would be packaged.
Option 1 is the least memory intensive, and probably the way you want to go, given that you're going to be applying some specific formula to the various columns. In fact, in that case, I'd probably go for something even simpler/more brute force:
fid = fopen('stocks.txt');
data = textscan(fid,'%s%s%f%f%f%[^\n]','delimiter',' ','headerlines',1);
idx = strcmp('ETE',data{1});
dt = data{2}(idx);
tm = data{3}(idx);
price = data{4}(idx);
[etc]
Then just work with the individual variables. Also, when using textscan, you can skip any columns you don't actually need. For example, if you don't care about the time, do this:
fid = fopen('stocks.txt');
data = textscan(fid,'%s%s%*f%f%f%[^\n]','delimiter',' ','headerlines',1);
idx = strcmp('ETE',data{1});
dt = data{2}(idx);
price = data{3}(idx);
[etc]
Note the asterisk in the textscan format specifier, and the fact that price is now the third column of data.
Pap
Pap on 22 Mar 2011
Thanks again Matt,
% when I apply the later textscan I receive the below error message:
data=textscan(fid,'%s%s%f%f%f%[^\n]',delimiter','',headerlines',1);
??? Undefined function or variable 'delimiter'.
% As I think that might be due to the fact that my ASCII consists of more than one delimiters ('tab', 'space', etc), I apply the below, but there is still something wrong with 'headerlines'
data=textscan(fid,'%s%s%f%f%f%[^\n]',headerlines',1);
??? Undefined function or variable 'headerlines'.
% Any idea on what might caused this error?
Panos
Pap
Pap on 22 Mar 2011
Sorry Matt,
actually I get the below:
data=textscan(fid,'%s%s%f%f%f%[^\n]','','headerlines',1);
??? Error using ==> textscan
Param/value pairs must come in pairs.
Any idea?
Panos
You took out the word 'delimiter' but you left in the '' value for the parameter.
I would have to look more closely at the code to see why the empty delimiter was being suggested.
There's a space between those single quotes, to tell textscan that the delimiter is whitespace.
The errors in your first attempt, as I assume you figured out, were due to missing quotes around the property names (delimiter and headerlines). Walter is correct about the error in the second attempt. You took out the delimiter property name, but left the empty quotes.
Pap
Pap on 22 Mar 2011
Many many thanks guys

Sign in to comment.

More Answers (1)

Walter Roberson
Walter Roberson on 21 Mar 2011

0 votes

In such a case I would probably use perl. The same functionality can be written in Matlab directly but the I/O would be slower than for perl.
Anyhow, fopen() your text file, fgetl() on it to read the header, then start a loop. fgetl() on the file, compare the first N+1 characters to your target stock name followed by a blank; only save the line if you got a match; continue loop.

Products

Asked:

Pap
on 21 Mar 2011

Community Treasure Hunt

Find the treasures in MATLAB Central and discover how the community can help you!

Start Hunting!