MATLAB Answers

How to use map and reduce efficiently?

Daniel Pinto on 8 Aug 2019
Edited: Daniel Pinto on 9 Aug 2019
I have the following table, which has over 40 million rows and 5 columns:
[Image: Screenshot 2019-08-08 at 15.58.48.png]
The first column is irrelevant. The second column is a YYYYMMDD date, and the frequency of the data is quarterly. The third column is a firmID - some firm IDs include letters as well as numbers. The fourth and fifth columns are values assigned to 2 different variables.
I wish to do 3 things:
1) for every rdate-cusip pair, sum shares across all different identifiers of mgrno that exist for that rdate-cusip combination. Call this value A.
2) for every rdate-cusip pair, obtain the mode value of shrout2 across the different identifiers of mgrno that exist for that rdate-cusip combination. Call this value B.
3) divide A by B.
This would normally be straightforward, but because the data are so large, I am struggling to do it. I have tried to use mapreduce, with separate map and reduce functions, so that the file is never fully loaded into the workspace, but I believe I am making some kind of mistake. I was getting error messages when trying to do the division inside the map phase, so I decided to skip the division and instead output a table in which the first column is the quarter-CUSIP identifier, the second column is A, and the third column is B.
ds = datastore('myFile.csv');
ds.TextscanFormats{3} = '%q';
ds.TextscanFormats{4} = '%q';
outds = mapreduce(ds, @gvkeyMapFun2, @gvkeyReduceFun2);
output = readall(outds);
where the functions are defined as
function gvkeyMapFun2(data, ~, intermKVStore)
    % gets quarter variable
    vQuarter = num2str(data.rdate); % char format
    % gets cusip in char format
    vNCUSIP = cell2mat(data.cusip);
    % creates quarter-ncusip identifier
    IDnum = strcat(vQuarter,vNCUSIP);
    IDnum = cellstr(IDnum);
    % finds unique NCUSIPS-quarter
    [intermKeys,~,idx] = unique(IDnum, 'stable'); % intermKeys is cell of characters (some cusips have letters), idx is double
    % gets variables of interest
    dataOwnership = cellfun(@(x) str2double(x),data.shares);
    dataTotalShares = data.shrout2;
    for ii = 1:numel(intermKeys)
        totalOwnership = sum(dataOwnership(idx==ii));
        totalShares = mode(dataTotalShares(idx==ii));
        totalOwnershipInfo(ii,1:3) = [repmat(intermKeys(ii),size(totalOwnership,1),1), totalOwnership,repmat(totalShares,size(totalOwnership,1),1) ];
        add(intermKVStore, intermKeys{ii}, totalOwnershipInfo);
    end
end
function gvkeyReduceFun2(intermKey, intermValIter, outKVStore)
    databasereducedFinal = array2table([]);
    while hasnext(intermValIter)
        databasereducedFinal = [databasereducedFinal; getnext(intermValIter)];
    end
    add(outKVStore, 'output', databasereducedFinal);
end
I then run
output = readall(outds);
c = vertcat(output{:, 2});
tableBig = vertcat(c{:});
to try and get the table because "output" looks like this:
[Image: Screenshot 2019-08-08 at 12.12.29.png]
I feel this is still quite inefficient. Is there any way to do this more efficiently? (Also, I believe there is another mistake somewhere, because the final table "tableBig" is larger than I would expect given the possible number of unique CUSIP-quarters.)
Thank you.
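Editor's note: one possible reason "tableBig" comes out larger than expected, reading the map function as posted, is that the loop adds the whole, still-growing totalOwnershipInfo array under every key instead of only the row for that key. The sketch below reproduces that pattern without mapreduce. The cell array "emitted" is a hypothetical stand-in for the intermediate key-value store, and the keys and values are made up.

```matlab
% Toy keys standing in for the unique quarter-CUSIP identifiers of one chunk
% (hypothetical values, not taken from the original data).
keys = {'20190331AAA'; '20190331BBB'; '20190331CCC'};

% Pattern as posted: the accumulated array is stored under every key.
emitted = cell(0, 2);   % stand-in for intermKVStore: {key, value} per row
info = cell(0, 3);
for ii = 1:numel(keys)
    info(ii, 1:3) = {keys{ii}, 10*ii, 100};   % row ii joins the growing array
    emitted(end+1, :) = {keys{ii}, info};     % the whole array is added
end
rowsEmitted = sum(cellfun(@(v) size(v, 1), emitted(:, 2)));

% Possible fix: store only the row that belongs to the current key.
emittedFixed = cell(0, 2);
for ii = 1:numel(keys)
    emittedFixed(end+1, :) = {keys{ii}, {keys{ii}, 10*ii, 100}};
end
rowsFixed = sum(cellfun(@(v) size(v, 1), emittedFixed(:, 2)));

fprintf('rows emitted as posted: %d, with one row per key: %d\n', ...
    rowsEmitted, rowsFixed);
```

With n unique keys in a chunk, the posted pattern emits 1 + 2 + ... + n rows rather than n. Separately, a map function only sees one chunk of the file at a time, so a quarter-CUSIP pair that spans two chunks would still produce more than one intermediate row, and the reduce function would need to combine those rows rather than just stack them.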
Daniel Pinto on 8 Aug 2019
I am now trying splitapply. Before calling splitapply, I am trying to define the groups with a quarter-cusip identifier, which I build by concatenating quarter and cusip as strings. The problem is that the strcat of those two has been running for hours now... My computer has 16 GB of RAM; shouldn't that be enough, even though I am dealing with 40 million rows?
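Editor's note: the concatenated string key may not be needed at all, because findgroups accepts several grouping variables and returns one group number per unique combination. A minimal in-memory sketch, assuming the column names rdate, cusip, shares and shrout2 used in the question and made-up toy values:

```matlab
% Toy data (hypothetical values); cusip is text because some IDs contain letters.
rdate   = [20190331; 20190331; 20190331; 20190630; 20190630];
cusip   = {'00123A10'; '00123A10'; '99999B20'; '00123A10'; '00123A10'};
shares  = [100; 200; 50; 300; 100];
shrout2 = [1000; 1000; 500; 2000; 2000];

% One group number per unique (rdate, cusip) pair, no string concatenation.
[G, gRdate, gCusip] = findgroups(rdate, cusip);

A = splitapply(@sum,  shares,  G);   % value A: total shares per pair
B = splitapply(@mode, shrout2, G);   % value B: most frequent shrout2 per pair

result = table(gRdate, gCusip, A, B, A ./ B, ...
    'VariableNames', {'rdate', 'cusip', 'A', 'B', 'ratio'});
disp(result)
```

On memory: a cell array of 40 million character vectors carries on the order of a hundred bytes of overhead per cell, so the text column and its concatenated copy can take several gigabytes on their own. Converting cusip to categorical before grouping is one option worth trying.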


Accepted Answer

Guillaume on 8 Aug 2019
Assuming your shares variable is numeric and assuming your grouping variables are {'rdate', 'cusip'},
t = tall(ds);
[group, rdate, cusip] = findgroups(t.rdate, t.cusip);
shareratio = splitapply(@(shares, shrout2) sum(shares) / mode(shrout2), t.shares, t.shrout2, group);
result = gather(table(rdate, cusip, shareratio));
"I was getting error messages"
What was the error message? I suspect that the a posteriori str2double conversion really slows things down, and it shouldn't be necessary.
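Editor's note: one way to avoid the str2double step is to have the datastore read shares as a number in the first place, rather than as text with '%q'. The sketch below writes a tiny stand-in CSV so that it is self-contained. The file contents, and the name mgrno for the first column, are assumptions made for illustration only.

```matlab
% Write a tiny CSV standing in for myFile.csv (hypothetical values and layout).
csvFile = [tempname '.csv'];
T = table([1; 2; 3; 4], [20190331; 20190331; 20190331; 20190630], ...
    {'00123A10'; '00123A10'; '99999B20'; '00123A10'}, ...
    [100; 200; 50; 300], [1000; 1000; 500; 2000], ...
    'VariableNames', {'mgrno', 'rdate', 'cusip', 'shares', 'shrout2'});
writetable(T, csvFile);

ds = datastore(csvFile);

% Look columns up by name rather than by position, then set their formats:
% text for cusip, floating point for shares.
iCusip  = find(strcmp(ds.VariableNames, 'cusip'));
iShares = find(strcmp(ds.VariableNames, 'shares'));
ds.TextscanFormats{iCusip}  = '%q';
ds.TextscanFormats{iShares} = '%f';

% Read only the columns the calculation needs.
ds.SelectedVariableNames = {'rdate', 'cusip', 'shares', 'shrout2'};

p = preview(ds);
disp(class(p.shares))   % numeric class, so no str2double is needed later

delete(csvFile);
```

A datastore configured this way can then be passed to tall, as in the answer above. If the real file has non-numeric placeholders in the shares column, the TreatAsMissing property of the datastore may be needed as well; that depends on the data and is not shown here.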
Daniel Pinto on 9 Aug 2019
Thank you Guillaume for taking the time. It turns out in this case the correlation between the 2 outputs is 0.999, but you do make a very good point and I changed the code accordingly. This was very helpful and instructive, thank you!
ps: the multiplication by 1000 is just to scale the variable properly.
