Comparing the same items in two massive files

4 views (last 30 days)
I need to read two excel files (first column only), skip a few headers and then select whats common to both.
There are 848,000 lines so before I embark on this, I was wondering if someone could give me some pointers on how to do the compare as the files are so large. Thanks jason
  2 Comments
Adam
Adam on 9 Apr 2015
If you can't fit ~1.7 million cell results in memory at one time or it is too unwieldy for working with in Matlab just read the column in in chunks of e.g. 100,000 lines at a time and do your comparison on those.
Jason
Jason on 9 Apr 2015
I was more after the principle. My first approach would be to get the first cell in file 1, and then compare to the 848,000 cells in file 2 to see if there is a match, if there is then keep it, if not discard. Then repeat this for all cells in file 1.
I was wondering if there was a vector way to do this? Thanks.

Sign in to comment.

Accepted Answer

Adam
Adam on 9 Apr 2015
Edited: Adam on 9 Apr 2015
I originally assumed you meant "common to both" on the same line in each file. Comparing every line with every other is a little more complex then!
Are we talking about data here or just random strings and other stuff?
If it is data you could just read the two columns of data and do an
intersect
on the results. Again, if the data is too large to load in all at once then you can load in the data in blocks again, taking all blocks of the 2nd file for each block of the first, do the same process and consolidate all the results at the end.
  2 Comments
Jason
Jason on 10 Apr 2015
Edited: Jason on 10 Apr 2015
Hi Adam.
The intersect appear to work.
My typical data is:
'exm596721-0_B_R_1921919902'
'exm59885-0_T_R_1921409534'
'exm599139-0_T_R_1918526162'
'exm599186-0_T_R_1918582240'
'exm599192-0_B_R_1918452634'
I've noticed that even when I try to read a small number of cells i.e.
[ndata2, text2]=xlsread(file,'A9:A19')
It takes about 3 minutes, Im guessing it needs to open the full sheet first. Is there any way to avoid this or speed it up if Im reading in chunks?|
Thanks
Adam
Adam on 10 Apr 2015
Yeah, intersect won't work on cell arrays.
ismember( col1, col2 );
should give you the indices of elements in col1 that are also in col2.
As for reading from excel I'm afraid it isn't something I have ever done so I don't know any tips or tricks for that.
You should be able to run a few test programs reading in different amounts and timing how long it takes though and using any arguments to the xlsread function that might help (e.g. I don't know what the 'basic' flag does or if it speeds it up).
When reading most files indexing in certain ways is more efficient than in other ways even if you read in far more data than you want and then chuck away what you don't want after reading it in. I don't know for Excel data though.

Sign in to comment.

More Answers (0)

Categories

Find more on Data Import from MATLAB in Help Center and File Exchange

Community Treasure Hunt

Find the treasures in MATLAB Central and discover how the community can help you!

Start Hunting!