Categorical bins incorrectly ordered after saving table as .csv
2 views (last 30 days)
Show older comments
I created the following table and then exported it as a csv:
gs_pks_threshold = groupsummary(regularizedData_calibrated_pks, 'vmRaw', [0 1000 2000 3000 4000 5000 6000 7000 8000 9000 10000 11000 12000 13000 14000 15000 16000 17000 18000 19000 20000 21000 22000 23000 24000 25000 26000 27000 28000 29000 30000 31000 32000], 'nnz', {'pks','pkswt', 'threshold', 'pks_final'})
writetable(gs_pks_threshold, 'gs_pks_threshold.csv')
Next, I import the data (as I am moving from a remote server to my local laptop) with the bins as a categorical variable, as this is how the variable is originally stored.
%% Set up the Import Options and import the data
opts = delimitedTextImportOptions("NumVariables", 6);
% Specify range and delimiter
opts.DataLines = [2, Inf];
opts.Delimiter = ",";
% Specify column names and types
opts.VariableNames = ["disc_vmRaw", "GroupCount", "nnz_pks", "nnz_pkswt", "nnz_threshold", "nnz_pks_final"];
opts.VariableTypes = ["categorical", "double", "double", "double", "double", "double"];
% Specify file level properties
opts.ExtraColumnsRule = "ignore";
opts.EmptyLineRule = "read";
% Specify variable properties
opts = setvaropts(opts, "disc_vmRaw", "EmptyFieldRule", "auto");
% Import the data
gspksP1 = readtable("/Users/shuff/Documents/GitHub/axivity/process_ax6_cwa_files/gs_pks_P041.csv", opts);
When I go to plot this table, if I use the same code that I use to plot the table within matlab, the bins become ordered such that in ascending order they are plotted as folows:
'[0, 1000)'
'[1000, 2000)'
'[10000, 11000)'
'[11000, 12000)'
'[12000, 13000)'
'[13000, 14000)'
'[14000, 15000)'
'[15000, 16000)'
'[16000, 17000)'
'[17000, 18000)'
'[18000, 19000)'
'[19000, 20000)'
'[2000, 3000)'
'[20000, 21000)'
'[3000, 4000)'
'[4000, 5000)'
'[5000, 6000)'
'[6000, 7000)'
'[7000, 8000)'
'[8000, 9000)'
'[9000, 10000)'
How can I import the data so that the bins are ordered correctly?
2 Comments
Eric Sofen
on 18 Mar 2022
If you're just saving to file to move the data between computers, can you save as a MAT file? That will preserve the table with the categorical bins rather than dealing with the "lossy" aspects of CSV.
Answers (1)
Siraj
on 12 Sep 2023
Hi! It is my understanding that you are using the “groupsummary” function to group the data, you have specified "vmRaw" as the grouping variable and applied the "GROUPBINS" to discretize the "vmRaw" data. As a result, a column named "disc_vmRaw" is created in the grouped table.
The grouped table is saved using the "writetable" function. However, when the data is read again using “readtable” and plotted with "disc_vmRaw" on the x-axis, the categories on the x-axis are sorted lexicographically, that’s why '[10000, 11000)' appear before '[2000, 3000)'.
I was able to replicate this issue using a simple example, and I'm unsure of the exact reason behind it. However, you can try the following workaround. After reading the table, plot the desired columns excluding the discrete column ("disc_vmRaw" in your case). Then, use the "xticklabels" function to relabel the x-ticks using the categories present in the discrete column. This should help address the sorting issue you encountered.
To gain a better understanding, you can execute the provided example below and observe the outcomes.
Profit = [20 13 11 9 19 25 31 4 26 30]';
ItemsSold = [14 13 8 5 10 16 8 6 7 11]';
TT = table(Profit,ItemsSold)
G = groupsummary(TT,"Profit",[1 6 11 16 21 26 31 36],{"mean"},"Profit")
%saving the grouped table.
writetable(G,'GroupedTable.csv');
opts = delimitedTextImportOptions("NumVariables", 3);
% Specify range and delimiter
opts.DataLines = [2, Inf];
opts.Delimiter = ",";
opts.VariableTypes = ["categorical", "double", "double"];
opts.VariableNames = ["disc_Profit","GroupCount","mean_Profit"];
%reading the saved table.
G_Read = readtable('GroupedTable.csv', opts)
%plotting the original grouped table.
figure;
plot(G.disc_Profit, G.GroupCount)
title("Grouped Table plot")
%plotting the grouped table that is read from "gs_pks_threshold.csv"
figure;
plot(G_Read.disc_Profit, G_Read.GroupCount)
title("Grouped Table plot read from the csv")
%Workaround.
figure,plot(G_Read.GroupCount)
title("Workaround Grouped Table plot read from the csv")
xticklabels(G_Read.disc_Profit)
Refer to the link below to learn more about “xticklabels”.
Hope this helps.
2 Comments
Stephen23
on 12 Sep 2023
Edited: Stephen23
on 12 Sep 2023
"...I'm unsure of the exact reason behind it"
The cause is simply that by default categorical arrays are created from text with their categories in Lexicographic order. Clearly the CSV file does not store the categories themselves anywhere (where would it?) which means that the original category order gets lost as soon as the file is written. This is easy to demonstrate by simply viewing the CSV file.
Exporting to file is a lossy operation because the categories (and hence their order) are not exported, much like when categorical data are converted to text e.g. string.
Profit = [20 13 11 9 19 25 31 4 26 30]';
ItemsSold = [14 13 8 5 10 16 8 6 7 11]';
TT = table(Profit,ItemsSold);
G = groupsummary(TT,"Profit",[1 6 11 16 21 26 31 36],{"mean"},"Profit")
categories(G.disc_Profit) % original category order
%saving the grouped table.
writetable(G,'GroupedTable.csv'); % the category order is actually lost here
% Note: the order of the rows does NOT determine the category order !!!!
opts = delimitedTextImportOptions("NumVariables", 3);
% Specify range and delimiter
opts.DataLines = [2, Inf];
opts.Delimiter = ",";
opts.VariableTypes = ["categorical", "double", "double"];
opts.VariableNames = ["disc_Profit","GroupCount","mean_Profit"];
%reading the saved table.
G_Read = readtable('GroupedTable.csv', opts)
categories(G_Read.disc_Profit) % Lexicographic sorted categories
Note that fiddling around with XTICKLABELS does not fix the category order, that categorical variable will be completely incorrect for all other operations using it. To fix the category order you will need to use e.g.
The NATSORT documentation already includes examples of how to do this:
See the HTML file chapter "Example: Categorical Categories".
Siraj
on 13 Sep 2023
Yes, this seems to work, thanks for help. The provided code below is an updated workaround that addresses the issue.
%% Workaround.
G_Read.disc_Profit = reordercats(G_Read.disc_Profit, natsort(categories(G_Read.disc_Profit)));
figure;
plot(G_Read.disc_Profit,G_Read.GroupCount)
title("Workaround Grouped Table plot read from the csv")
To use "natsort" function, download and unzip the required file from the following link.
See Also
Categories
Find more on Tables in Help Center and File Exchange
Community Treasure Hunt
Find the treasures in MATLAB Central and discover how the community can help you!
Start Hunting!