
Alternative to csv and parquet for arrays


Pelajar UM on 13 Mar 2022


https://nl.mathworks.com/matlabcentral/answers/1670629-alternative-to-csv-and-parquet-for-arrays


Commented: Walter Roberson on 15 Mar 2022

I have pretty massive csv files. It's a pain for both transfer and also in terms of read time.

I have been using parquet but this is only for tables and my functions only work with double arrays. So whenever I load the file, I have to use table2array to create the proper variables. This takes some extra time. Still much better than using csv, but I am wondering if there are any light and efficient alternatives to csv for arrays....

14 Comments

Jan on 13 Mar 2022


The question is not clear. Why do you use a text file to store large data sets? Text files are useful if they are read and edited by humans. Converting floating-point numbers to strings and back to numbers can introduce rounding errors. A binary format is therefore recommended and is much more efficient.

How do you work with Parquet in MATLAB?

Walter Roberson on 14 Mar 2022


How often is each file read compared to being written?

Pelajar UM on 14 Mar 2022


Thanks guys.

@Jan The data is prepared by another program (written in Python). It is mesh data with various values assigned to each element (all numerical). The MATLAB code is used to visualize this data. The file does not have to be human-readable (Parquet is not human-readable).

Example:

[filename, folder] = uigetfile({'*.parquet'});
if ~ischar(filename); return; end      % user cancelled
filename = fullfile(folder, filename);
data = table2array(parquetread(filename));   % parquetread returns a table; convert to a double array
app.UITable2.Data = data;
nodes = data(:,1:3);                   % columns 1-3: node data
elements = rmmissing(data(:,4:7));     % columns 4-7: connectivity; drop rows with missing values
TR = triangulation(elements, nodes);   % generate the mesh
[F,P] = freeBoundary(TR);              % extract the surface

@Walter Roberson It's written only once (in Python), and every time you open a new MATLAB session, you have to load the data. So I would say reading time is more important than writing time.

Walter Roberson on 14 Mar 2022


I suggest writing it as a binary file, with a header indicating the size. You might want to make it compatible with https://www.mathworks.com/help/matlab/ref/multibandread.html
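A minimal sketch of what such a file could look like, written in Python since that is where the data is produced. The header layout (two little-endian uint32 values: row count, then column count) and the function names are assumptions made for illustration; they are not a layout defined by multibandread or by anyone in this thread. The values are stored column by column as 8-byte doubles.

```python
import struct

# Assumed header: row count and column count as little-endian uint32.
HEADER = struct.Struct("<II")


def write_matrix(path, columns):
    """Write equally long columns as a header followed by column-major float64."""
    nrow = len(columns[0])
    if any(len(col) != nrow for col in columns):
        raise ValueError("all columns must have the same length")
    with open(path, "wb") as f:
        f.write(HEADER.pack(nrow, len(columns)))
        for col in columns:
            f.write(struct.pack(f"<{nrow}d", *col))


def read_matrix(path):
    """Read the file back as a list of columns (used here only to check the round trip)."""
    with open(path, "rb") as f:
        nrow, ncol = HEADER.unpack(f.read(HEADER.size))
        return [list(struct.unpack(f"<{nrow}d", f.read(8 * nrow))) for _ in range(ncol)]
```

With this layout, the MATLAB side should be able to read the two header values with fread(fid, 2, 'uint32') and the data with fread(fid, [nrow ncol], 'double'), because fread fills the output column by column. No table2array step is involved.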

Pelajar UM on 14 Mar 2022


@Walter Roberson Using fwrite / multibandwrite, the file is only slightly smaller:

CSV: 25 MB

fwrite/multibandwrite: 23 MB

Parquet: 12 MB

Walter Roberson on 14 Mar 2022


That hints to me that your data might perhaps only justify single precision but that you are using double precision.
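As a rough check on what single precision would buy, the raw sizes can be worked out from the dataset dimensions reported later in this thread (about 114,000 rows by 23 columns). This is a sketch only; whether 4-byte floats are accurate enough depends on the data, and the helper name below is made up for illustration.

```python
import struct

rows, cols = 114_000, 23  # dimensions reported later in this thread (approximate)

double_bytes = rows * cols * struct.calcsize("<d")  # 8 bytes per value
single_bytes = rows * cols * struct.calcsize("<f")  # 4 bytes per value


def roundtrip_single(x):
    """Return x after it has been stored as a 4-byte float and read back."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


print(double_bytes, single_bytes)
print(roundtrip_single(0.1))  # shows the rounding introduced by single precision
```

Raw doubles come to roughly 21 MB and raw singles to roughly 10.5 MB for this shape. Single precision keeps about 7 significant decimal digits, so it halves the file only if that precision is enough for the mesh values.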

Sarah Gilmore on 14 Mar 2022


Hi @Pelajar UM,

Do you mind answering a few questions that may help narrow down the issue?

Do you know whether it's parquetread or table2array that takes the most time? You can use the performance profiler to determine which lines are causing the issue.
How wide is the table in the Parquet file? Is it just 7 columns?
Which version of MATLAB are you running?

Thanks,

Sarah

Pelajar UM on 15 Mar 2022


Hi Sarah

parquetread takes 0.11 s and table2array 0.034 s. No issues per se and relatively speaking, not long at all. But it is just an extra step and I was wondering if it could be avoided.
This particular dataset is 23 columns wide with ~114,000 rows.
R2021a

Walter Roberson on 15 Mar 2022


If the original data is 23 columns but you only need 7 of them, then you can improve reading speed by writing in binary one column at a time with a header indicating how many rows are present; then by knowing the size of the header and the number of rows, you can fseek() directly to the beginning of any particular column. Or in your case, since you are using the first 7 columns, just ask to fread [nrow 7] after you have positioned past the header, leaving the other 16 columns unread.
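The offset arithmetic described here can be sketched as follows. Python is used for illustration; the same seek-then-read pattern applies to fseek and fread in MATLAB. The file layout is an assumption: an 8-byte header holding the row and column counts as little-endian uint32, followed by the columns stored one after another as 8-byte doubles. The function name is hypothetical.

```python
import struct

# Assumed header: row count and column count as little-endian uint32.
HEADER = struct.Struct("<II")
BYTES_PER_VALUE = 8  # float64


def read_column(path, k):
    """Read only column k (0-based) from a column-major float64 file."""
    with open(path, "rb") as f:
        nrow, ncol = HEADER.unpack(f.read(HEADER.size))
        if not 0 <= k < ncol:
            raise IndexError("column index out of range")
        # Skip the header plus the k full columns that come before column k.
        f.seek(HEADER.size + k * nrow * BYTES_PER_VALUE)
        return list(struct.unpack(f"<{nrow}d", f.read(nrow * BYTES_PER_VALUE)))
```

Because the columns are contiguous, reading the first 7 columns is a single read of 7 * nrow values straight after the header, with no seek needed.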

Walter Roberson on 15 Mar 2022


If those file sizes are a problem, then have the Python program write each column into a separate binary file and zip the set of files together. Transfer the zip. Unzip at the destination. Open and read only the files corresponding to the columns you want to read.
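On the Python side, this could look roughly like the sketch below, using only the standard library. The member naming scheme (col_000.bin, col_001.bin, ...) and the function names are assumptions for illustration; each member holds one column as little-endian 8-byte doubles with no header.

```python
import struct
import zipfile


def write_zipped_columns(zip_path, columns):
    """Store each column as its own float64 member in one compressed zip archive."""
    with zipfile.ZipFile(zip_path, "w", compression=zipfile.ZIP_DEFLATED) as zf:
        for i, col in enumerate(columns):
            zf.writestr(f"col_{i:03d}.bin", struct.pack(f"<{len(col)}d", *col))


def read_zipped_column(zip_path, i):
    """Read one column back without extracting the other members."""
    with zipfile.ZipFile(zip_path) as zf:
        raw = zf.read(f"col_{i:03d}.bin")
    return list(struct.unpack(f"<{len(raw) // 8}d", raw))
```

At the destination, MATLAB's unzip(zipfilename, outputfolder) extracts the members, after which each wanted column can be read with fopen and fread(fid, Inf, 'double').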

Pelajar UM on 15 Mar 2022


@Walter Roberson Thanks. I am actually using all 23 columns. That was just a small part of the code.

Walter Roberson on 15 Mar 2022


OK, then write the file in binary and zip it to reduce the file size for transfer.

Pelajar UM on 15 Mar 2022


Zipping does a wonderful job of reducing the size (down to 4 MB), but I don't think MATLAB can unzip the file or otherwise read the zipped file, right?

Walter Roberson on 15 Mar 2022


unzip()

Answers (0)
