Alternative to csv and parquet for arrays
I have very large CSV files. They are a pain both to transfer and to read.
I have been using Parquet, but it only stores tables, and my functions only work with double arrays. So whenever I load a file, I have to call table2array to create the proper variables, which takes some extra time. It is still much better than using CSV, but I am wondering whether there are any lightweight and efficient alternatives to CSV for arrays.
14 Comments
Jan
on 13 Mar 2022
The question is not clear. Why do you use a text file to store large data sets? Text files are useful if they are read and manipulated by humans. Converting floating point numbers to strings and back to numbers can cause rounding effects. A binary format is therefore recommended, and it is much more efficient.
How do you work with parquet in Matlab?
Walter Roberson
on 14 Mar 2022
How often is each file read compared to being written?
Pelajar UM
on 14 Mar 2022
Walter Roberson
on 14 Mar 2022
I suggest writing it as a binary file, with a header indicating the size. You might want to make it compatible with https://www.mathworks.com/help/matlab/ref/multibandread.html
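A minimal sketch of what such a file could look like, assuming a header of two uint64 values (rows, columns) followed by the data as little-endian doubles. The file name and the header layout are illustrative assumptions, not a layout prescribed anywhere in this thread:

```matlab
% Sketch: store a double matrix in a raw binary file with a small size header.
% File name and header layout ([nrow ncol] as uint64) are assumptions.
A = rand(1000, 7);                                  % example data
fname = fullfile(tempdir, 'array_with_header.bin');

% Write the header, then the data (fwrite emits elements in column-major order).
fid = fopen(fname, 'w', 'ieee-le');
fwrite(fid, size(A), 'uint64');
fwrite(fid, A, 'double');
fclose(fid);

% Read the header first, then the whole array in a single fread call.
fid = fopen(fname, 'r', 'ieee-le');
sz = fread(fid, [1 2], 'uint64=>double');
B = fread(fid, sz, 'double=>double');
fclose(fid);

assert(isequal(A, B));                              % exact round trip, no text conversion
```

Because the values never pass through a decimal text representation, the array read back is identical to the one written.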
Pelajar UM
on 14 Mar 2022
Walter Roberson
on 14 Mar 2022
That suggests to me that your data might only justify single precision, while you are storing it in double precision.
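To judge whether single precision would be enough, one could compare the memory footprint and the rounding error of the conversion. The sketch below uses random example data, so the numbers say nothing about the original poster's files:

```matlab
% Sketch: compare double and single storage for the same values.
A = rand(1000, 7);          % double precision, 8 bytes per element
S = single(A);              % single precision, 4 bytes per element

infoA = whos('A');
infoS = whos('S');
fprintf('double: %d bytes, single: %d bytes\n', infoA.bytes, infoS.bytes);

% Largest relative rounding error introduced by the conversion.
relErr = max(abs(double(S(:)) - A(:)) ./ abs(A(:)));
fprintf('max relative error: %g (eps(''single'') = %g)\n', relErr, eps('single'));
```

If a relative error on the order of eps('single') is acceptable for the application, storing single values halves both the file size and the amount of data to read.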
Sarah Gilmore
on 14 Mar 2022
Do you mind answering a few questions that may help narrow down the issue?
- Do you know if it's parquetread or table2array that's taking the most time? You can use the performance profiler to determine which lines are causing the issue.
- How wide is the table in the Parquet file? Is it just 7 columns?
- Which version of MATLAB are you running?
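One way to answer the first question is to time the two steps separately. The sketch below does this on a synthetic 7-column table; the file name and table size are placeholders, not the actual data:

```matlab
% Sketch: measure parquetread and table2array separately on synthetic data.
T = array2table(rand(1e5, 7));
fname = fullfile(tempdir, 'timing_example.parquet');
parquetwrite(fname, T);

% Coarse timing with tic/toc.
tic; T2 = parquetread(fname); tRead = toc;
tic; A = table2array(T2);     tConv = toc;
fprintf('parquetread: %.3f s, table2array: %.3f s\n', tRead, tConv);

% Line-by-line breakdown with the profiler.
profile on
T2 = parquetread(fname);
A = table2array(T2);
profile off
profile viewer
```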
Thanks,
Sarah
Pelajar UM
on 15 Mar 2022
Walter Roberson
on 15 Mar 2022
If the original data is 23 columns but you only need 7 of them, then you can improve reading speed by writing in binary one column at a time with a header indicating how many rows are present; then by knowing the size of the header and the number of rows, you can fseek() directly to the beginning of any particular column. Or in your case, since you are using the first 7 columns, just ask to fread [nrow 7] after you have positioned past the header, leaving the other 16 columns unread.
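A rough sketch of that layout, assuming a 16-byte header ([nrow ncol] as uint64) followed by the columns stored one after another as little-endian doubles. The file name, header layout, and column number are illustrative assumptions, and the writer is done in MATLAB here only to keep the example self-contained:

```matlab
% Writer (stand-in for whatever produces the file): header, then columns.
A = rand(5000, 23);
fname = fullfile(tempdir, 'columns.bin');
fid = fopen(fname, 'w', 'ieee-le');
fwrite(fid, size(A), 'uint64');     % 16-byte header: [nrow ncol]
fwrite(fid, A, 'double');           % fwrite stores the matrix column by column
fclose(fid);

% Reader: only the first 7 columns, the other 16 are never read.
fid = fopen(fname, 'r', 'ieee-le');
sz = fread(fid, [1 2], 'uint64=>double');
nrow = sz(1);
first7 = fread(fid, [nrow 7], 'double=>double');

% Reader: jump straight to a single column (here column 12) with fseek.
headerBytes = 2 * 8;                % two uint64 values
col = 12;
fseek(fid, headerBytes + (col - 1) * nrow * 8, 'bof');
c12 = fread(fid, [nrow 1], 'double=>double');
fclose(fid);

assert(isequal(first7, A(:, 1:7)));
assert(isequal(c12, A(:, 12)));
```

A writer in a row-major language would have to emit the columns one after another (for example by writing the transposed array) for this offset arithmetic to hold.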
Walter Roberson
on 15 Mar 2022
If those file sizes are a problem, then have the Python code write each column into a separate binary file and zip the set of files together. Transfer the zip. Unzip at the destination. Open and read only the files corresponding to the columns you want to read.
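The workflow could look roughly like this. The per-column files are written from MATLAB here purely as a stand-in for the Python writer, and the folder names and the col_XX.bin naming scheme are made up for illustration:

```matlab
A = rand(5000, 23);
workDir = fullfile(tempdir, 'col_files_example');
if ~exist(workDir, 'dir'), mkdir(workDir); end

% Stand-in for the Python writer: one raw double file per column.
for k = 1:size(A, 2)
    fid = fopen(fullfile(workDir, sprintf('col_%02d.bin', k)), 'w', 'ieee-le');
    fwrite(fid, A(:, k), 'double');
    fclose(fid);
end
zipName = fullfile(tempdir, 'col_files_example.zip');
zip(zipName, '*.bin', workDir);      % this archive is what gets transferred

% Destination: unzip, then open only the files for the wanted columns.
outDir = fullfile(tempdir, 'col_files_unzipped');
unzip(zipName, outDir);
wanted = 1:7;
cols = cell(1, numel(wanted));
for j = 1:numel(wanted)
    fid = fopen(fullfile(outDir, sprintf('col_%02d.bin', wanted(j))), 'r', 'ieee-le');
    cols{j} = fread(fid, Inf, 'double=>double');   % read to end of file
    fclose(fid);
end
B = [cols{:}];

assert(isequal(B, A(:, wanted)));
```

Random values such as the ones used here compress poorly, so the size reduction on real data has to be measured rather than assumed.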
Pelajar UM
on 15 Mar 2022
Walter Roberson
on 15 Mar 2022
OK, then write the file in binary and zip it to reduce the file size for transfer.
Pelajar UM
on 15 Mar 2022
Walter Roberson
on 15 Mar 2022
unzip()
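A short usage sketch for a single zipped binary file. The paths are placeholders, the file has no header, and the row count is assumed to be known on the receiving side:

```matlab
% Sender side: write a raw binary file and zip it.
A = rand(1000, 7);
srcDir = fullfile(tempdir, 'zip_example_src');
if ~exist(srcDir, 'dir'), mkdir(srcDir); end
fid = fopen(fullfile(srcDir, 'data.bin'), 'w', 'ieee-le');
fwrite(fid, A, 'double');
fclose(fid);
zipName = fullfile(tempdir, 'zip_example.zip');
zip(zipName, 'data.bin', srcDir);

% Receiver side: unzip returns the names of the extracted files.
dstDir = fullfile(tempdir, 'zip_example_dst');
extracted = unzip(zipName, dstDir);
nrow = 1000;                                        % assumed known in advance
fid = fopen(extracted{1}, 'r', 'ieee-le');
B = fread(fid, [nrow Inf], 'double=>double');
fclose(fid);

assert(isequal(A, B));
```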
Answers (0)