Pre-processing tall array / datastore data

Michael
Michael on 25 Aug 2019
Commented: Michael on 27 Aug 2019
Please pardon me if this has been asked before.
What is the most efficient way to pre-process a large wide table (about 10,000,000 rows by 500 columns) currently divided in many small tables in separate mat-files. The data may be quite sparse in areas and is mostly numeric with a datetime column, some categorical, and some text fields.
For example, should I:
  • stack() the small tables but leave them in separate files
  • combine them into a giant wide table
  • combine them into a stacked very tall table
  • delete the many NaNs, significantly reducing the height of the stacked table
  • use sparse() on a wide table
Those are just some thoughts. Please let me know the best way.
Thank You,
Michael
  2 Comments
Guillaume
Guillaume on 25 Aug 2019
Edited: Guillaume on 25 Aug 2019
I'm not really clear on your question. Pre-process in order to achieve what?
Note that stack makes a table narrower (fewer variables) but a lot taller. I'm not sure that's what you mean by stacking. Perhaps you mean vertical concatenation, in which case the datastore takes care of that for you.
Also note that sparse is not a function (or a concept) that applies to tables.
I would think that if you use a datastore with tall tables, there's nothing to do. Just use the tables as is (as one big tall table backed by the datastore).
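For instance, a minimal sketch of that workflow (the file pattern 'smalltable_*.mat' and the saved variable name 'tt' are assumptions, and 'UniformRead' requires every file to yield a table with the same variables):
% Datastore over the existing mat-files; each read loads the timetable
% saved inside one file ('tt' is an assumed variable name).
fds = fileDatastore('smalltable_*.mat', ...
    'ReadFcn', @(f) getfield(load(f, 'tt'), 'tt'), ...
    'UniformRead', true);   % reads must be vertically concatenable
tt = tall(fds);             % one big tall table backed by the datastore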
Guillaume
Guillaume on 26 Aug 2019
Michael's comment mistakenly posted as an answer moved here:
Dear Guillaume,
Thank you for responding.
1) I haven't used sparse in a long time and didn't realize it doesn't apply to tables. I guess that's a very good reason to use a tall table. I was worried that stacked tables would be slow relative to wide tables, but perhaps the sparse nature of the stacked table would offset that. Would you know about the relative speed of a stacked vs. wide table for SQL-type lookups?
2) Regarding datastore/tall/mapreduce, I am just starting to read about them, and like most Matlab docs, they're a little light on examples. Do you know how I can write to a datastore to build a file from scratch rather than just pointing the datastore to existing files?
3) Also, I am unclear on the value of having one datastore table vs. many. I currently have tens of thousands of little mat-files with timetables in them. Is there a benefit to combining them and is there a preferred format that is faster than others, e.g. mat vs. CSV?
Thanks again for your help,
Michael
PS, yes, what I mean by "stacked" is converting from wide (unstacked):
Date  Var1  Var2  Var3
Jan   xxxx  xxxx  xxxx
Feb   xxxx  xxxx  xxxx
to stacked:
Date  Field  Value
Jan   Var1   xxxx
Jan   Var2   xxxx
Jan   Var3   xxxx
Feb   Var1   xxxx
Feb   Var2   xxxx
Feb   Var3   xxxx
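(For reference, that conversion is what MATLAB's stack and unstack do; a minimal sketch with made-up data:)
% Small wide table with made-up values
Date = categorical({'Jan'; 'Feb'});
Var1 = [1; 4]; Var2 = [2; 5]; Var3 = [3; 6];
wideT = table(Date, Var1, Var2, Var3);
% Wide -> stacked: one row per (Date, Field) pair
stackedT = stack(wideT, {'Var1','Var2','Var3'}, ...
    'IndexVariableName', 'Field', ...
    'NewDataVariableName', 'Value');
% Stacked -> wide again
wideAgain = unstack(stackedT, 'Value', 'Field');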


Answers (4)

Guillaume
Guillaume on 26 Aug 2019
I was worried that stacked tables would be slow relative to wide tables
Slow for what type of operation? I would think that some things are better suited to wide tables, others to stacked ones. If you are going to be using myfun(mytable.Var1, mytable.Var2), then stacking Var1 and Var2 may not be a good idea. In addition, in the context of tall arrays, fewer rows may be better.
Do you know how I can write to a datastore to build a file from scratch rather than just pointing the datastore to existing files?
datastores are only for reading. If you have a tall table, you can write it directly to a single text file with writetable. If you want to split it into several text files, simply write chunks of rows in a loop with writetable.
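A minimal sketch of that chunked approach, assuming an in-memory table T (chunk size and file-name pattern are illustrative):
% Write table T to several CSV files, one chunk of rows per file
chunkSize = 100000;                                  % illustrative chunk size
nChunks = ceil(height(T) / chunkSize);
for k = 1:nChunks
    rows = (k-1)*chunkSize + 1 : min(k*chunkSize, height(T));
    writetable(T(rows, :), sprintf('chunk_%04d.csv', k));
end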
Also, I am unclear on the value of having one datastore table vs. many
You have just one datastore that is backed by as many files as you want (all files must have the same format and variables of course). The datastore manages accessing the data from the file as required and you access the data using a single tall table (or array). There isn't an option to get several tables out of one datastore.

Michael
Michael on 26 Aug 2019
Dear Guillaume,
Certainly wide tables are simpler for my purpose (generating tables for machine learning inputs), but due to the sparseness and size of the data I'm selecting from, tall arrays may be more efficient. I'm not sure.
Thank you for pointing out that datastore is read-only. It looked that way, and I was quite frustrated because I could not confirm it. Likewise, writetable doesn't have an append feature, which is disappointing. I've been using writetable to make CSVs and then combining them by piping the output of DOS copy or type commands.
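A rough sketch of that combine step from within MATLAB, using the Windows type command (file names are illustrative; if each chunk carries a header row, the repeated headers would need to be stripped afterwards):
% Append every chunk file to one combined CSV via the Windows shell
files = dir('chunk_*.csv');
for k = 1:numel(files)
    f = fullfile(files(k).folder, files(k).name);
    system(sprintf('type "%s" >> combined.csv', f));
end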
I will perform some experiments and write back regarding:
  • The relative speed of a large CSV datastore vs. many small ones
  • Access speed of tall vs. wide storage
  • The speed of the 32-bit KDB+ (Q) solution via the Datafeed Toolbox
Thanks,
Michael
  2 Comments
Walter Roberson
Walter Roberson on 26 Aug 2019
dlmwrite() has append mode -- but it is only for numeric values.
Michael
Michael on 27 Aug 2019
Dear Mr. Roberson,
Thank you. That may be helpful in the future. In this case, unfortunately, I have mostly text and categorical data.
It would be great if The MathWorks added some basic I/O features like appending with writetable and writing to a datastore.
It's a bit of a mission to write a whole flexible routine to append a table with many data types using fprintf.
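For what it's worth, a rough sketch of that kind of append with fprintf, assuming a table T whose columns are datetime, text, categorical, and numeric (the column names below are made up, and the row-by-row loop is shown only to illustrate the idea; it is slow for large tables):
fid = fopen('combined.csv', 'a');      % open existing CSV in append mode
for r = 1:height(T)
    fprintf(fid, '%s,%s,%s,%.4f\n', ...
        char(T.Date(r)), ...           % datetime -> text
        char(T.Name(r)), ...           % text column (string assumed)
        char(T.Field(r)), ...          % categorical -> text
        T.Value(r));                   % numeric value
end
fclose(fid);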
Thanks Again,
Michael



Michael
Michael on 26 Aug 2019
Hello,
I'm trying to evaluate a datastore of about 20,000 CSV files containing about 36 GB of data that I saved from MATLAB using writetable. One column contains datetimes; an example of the files' contents is:
29-Jul-1983 00:00:00,BHP AT EQUITY,MOV_AVG_50D,0.8979
31-Aug-1983 00:00:00,BHP AT EQUITY,MOV_AVG_50D,0.9029
30-Sep-1983 00:00:00,BHP AT EQUITY,MOV_AVG_50D,0.9106
31-Oct-1983 00:00:00,BHP AT EQUITY,MOV_AVG_50D,0.9154
30-Nov-1983 00:00:00,BHP AT EQUITY,MOV_AVG_50D,0.9227
30-Dec-1983 00:00:00,BHP AT EQUITY,MOV_AVG_50D,0.9311
I tried the following code and received the subsequent error. Can someone enlighten me on how to get this to work?
Thank You,
Michael
PS, Code:
ds = datastore('tall*.csv');   % datastore over all matching CSV files
tds = tall(ds);                % tall table backed by the datastore
u = unique(tds.FIELD);         % deferred computation on the FIELD column
U = gather(u);                 % evaluate and bring the result into memory
PPS, Error:
Evaluating tall expression using the Parallel Pool 'local':
- Pass 1 of 1: 0% complete
Evaluation 0% complete
Error using matlab.io.datastore.TabularTextDatastore/readData (line 77)
Unable to read the DATETIME data using the locale setting for your system: 'en_US'
If the data contains month or day names in a language foreign to this locale, use the 'DateLocale' parameter to specify the correct locale.
Learn more about errors encountered during GATHER.
Error in matlab.io.datastore.TabularDatastore/read (line 120)
[t, info] = readData(ds);
Error in tall/gather (line 50)
[varargout{:}, readFailureSummary] = iGather(varargin{:});

Michael
Michael on 27 Aug 2019
Hello,
I moved the datetime problem to a separate thread.
As for the speed, I ran a little experiment comparing one 36GB file and the same data in 20,000 smaller files.
Here are my unscientific results, which show that the many small files are about 20% slower than the one big file in this example:
One Huge File
One large CSV
Evaluating tall expression using the Parallel Pool 'local':
- Pass 1 of 1: Completed in 10 min 31 sec
Evaluation completed in 10 min 32 sec
dt = 30
20,000 Small Files
20,000 small CSVs
Evaluating tall expression using the Parallel Pool 'local':
- Pass 1 of 1: Completed in 12 min 35 sec
Evaluation completed in 12 min 36 sec
dt = 28.9531
The Code
clear all
fprintf('One large CSV\n')
tcpu(1) = cputime;
ds = datastore('bigtall.csv','DatetimeType','text');
tds = tall(ds);
u = unique(tds.FIELD);
U = gather(u);
tcpu(2) = cputime;
dt = tcpu(2)-tcpu(1)
clear all
fprintf('\n20,000 small CSVs\n')
tcpu(1) = cputime;
ds = datastore('tall*.csv','DatetimeType','text');
tds = tall(ds);
u = unique(tds.FIELD);
U = gather(u);
tcpu(2) = cputime;
dt = tcpu(2)-tcpu(1)
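(A side note on the timing: cputime measures CPU time used by the MATLAB client process, which is why dt comes out near 30 seconds even though each pass took over 10 minutes of wall-clock time on the parallel pool. tic/toc measures elapsed time instead, e.g.:)
t0 = tic;
% ... the datastore / tall / gather work as above ...
dtWall = toc(t0)    % elapsed wall-clock seconds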
Thanks,
Michael
