How to use map and reduce efficiently?

Question

Daniel Pinto on 8 Aug 2019

0
Link

Direct link to this question

https://nl.mathworks.com/matlabcentral/answers/475418-how-to-use-map-and-reduce-efficiently

Edited: Daniel Pinto on 9 Aug 2019

Accepted Answer: Guillaume

Open in MATLAB Online

I have the following table, which has over 40 million rows and 5 columns:

The first column is irrelevant. The second column is a YYYYMMDD date, and the frequency of the data is quarterly. The third column is a firmID - some firm IDs include letters as well as numbers. The fourth and fifth columns are values assigned to 2 different variables.

I wish to do 2 things:

1) for every rdate-cusip pair, sum shares across all different identifiers of mgrno that exist for that rdate-cusip combination. Call this value A.

2) for every rdate-cusip pair, obtain the mode value of shrout2 across the different identifiers of mgrno that exist for that rdate-cusip combination. Call this value B.

3) divide A by B.

This would normally be straightforward, but due to the big dimensions of the data, I am struggling to do it. I have tried to use the functions map and reduce, without really loading the file into the workspace, but I believe I am mkaing some kind of mistake. I was getting error messages trying to conduct the division inside the mapping phase, so I decided to skip the division and just have as output a table in which the first column is quarter-CUSIP identifier, second column is A, and third column is B.

ds = datastore('myFile.csv');
ds.TextscanFormats{3} = '%q';
ds.TextscanFormats{4} = '%q';
outds = mapreduce(ds, @gvkeyMapFun2, @gvkeyReduceFun2);
output = readall(outds);

where the functions are defined as

function gvkeyMapFun2(data, ~, intermKVStore)
 
    % gets quarter variables
    
    vQuarter = num2str(data.rdate); % char format
    
    % gets cusip in char format
 
    vNCUSIP = cell2mat(data.cusip);
       
    % creates quarter-ncusip identifer
 
     IDnum = strcat(vQuarter,vNCUSIP);
      
     IDnum = cellstr(IDnum);
            
    % finds unique NCUSIPS-quarter
        
    [intermKeys,~,idx] = unique(IDnum, 'stable'); % intermKeys is cell of characters (some cusips have letters), idx is double
        
    % gets variables of intersst
   
     dataOwnership = cellfun(@(x) str2double(x),data.shares);
 
     dataTotalShares =  data.shrout2;
 
     for ii = 1:numel(intermKeys)
        
         totalOwnership  = sum(dataOwnership(idx==ii));
               
         totalShares  = mode(dataTotalShares(idx==ii));
                
         totalOwnershipInfo(ii,1:3) = [repmat(intermKeys(ii),size(totalOwnership,1),1), totalOwnership,repmat(totalShares,size(totalOwnership,1),1)   ];
        
         add(intermKVStore, intermKeys{ii}, totalOwnershipInfo);
         
     end
 
end

and

function gvkeyReduceFun2(intermKey, intermValIter, outKVStore)
 
    databasereducedFinal = array2table([]);
 
    while hasnext(intermValIter)
        databasereducedFinal = [databasereducedFinal; getnext(intermValIter)];
    end
  
    add(outKVStore, 'output', databasereducedFinal);
 
end

I then run

output = readall(outds);
c = vertcat(output{:, 2});
tableBig = vertcat(c{:}); 

to try and get the table because "output" looks like this:

I feel this is still quite inefficient. Is there anyway do this more efficiently? (also, I believe there's some other mistake somewhere, because the final table "tableBig" is larger than I would expect given the possible number of unique CUSIP-quarters.

thank you.

5 Comments
Show 3 older commentsHide 3 older comments

Daniel Pinto on 8 Aug 2019

Open in MATLAB Online

Thank you. Im trying splitapply. I ll update here on my progress.

Yes, it's weird imposing shares be read as text. so with map and replace, if I do not set

ds.TextscanFormats{4} = '%q';

I was getting an error message in the mapping phase . so i was doing that to avoid that error message, and then get it back to numeric:

dataOwnership = cellfun(@(x) str2double(x),data.shares);

Daniel Pinto on 8 Aug 2019

I am trying splitapply. Before doing the splitapply, I am trying to define a group based on a quarter-cusip identifier. to do that, i try to concatenate quarter and cusip in string format. the problem is that the strcat of those two is taking for hours now... my computer has 16gb of ram, shouldn't that be enough despite the fact that I am dealing with 40 million rows?

Sign in to comment.

Sign in to answer this question.

Answer 1

Guillaume on 8 Aug 2019

2
Link

Direct link to this answer

https://nl.mathworks.com/matlabcentral/answers/475418-how-to-use-map-and-reduce-efficiently#answer_386829

Open in MATLAB Online

Assuming your shares variable is numeric and assuming your grouping variables are {'rdate', 'cusip'},

t = tall(ds);
[group, rdate, cusip] = findgroups(t.rdate, t.cusip);
shareratio  = splitapply(@(shares, shrout2) sum(shares) / mode(shrout2), t.shares, t.shrout2, group);
result = gather(table(rdate, cusip, shareratio));

I was getting an error message

What was the error message. I would suspect that the a posteriori str2double conversion would really slow things down and really it shouldn't be necessary.

5 Comments
Show 3 older commentsHide 3 older comments

Daniel Pinto on 9 Aug 2019

Edited: Daniel Pinto on 9 Aug 2019

Open in MATLAB Online

t = tall(ds2);
group = findgroups(t.rdate, t.cusip);
shareratio  = splitapply(@(shares, shrout2) sum(shares / (shrout2*1000)) , t.shares, t.shrout2, group);
rdate = splitapply(@(x) x(1), t.rdate, group);
cusip = splitapply(@(x) x(1), t.cusip, group);
result = gather(table(rdate, cusip, shareratio));

This for instance seems to work - no error message, but I did not let it finish because it was taking some time. The problem is the use of the isequal statement in the mode function as it does not support tall arrays. Therefore if I write a mode function myself without isequal, it should work.

Alternatively, with the map-reduce function, I found this to work perfectly and very fast:

function gvkeyMapFun3(data, ~, intermKVStore)
     IDnum = strcat(data.rdate,data.cusip);
      
     dataOwnership =  data.shares;
     dataTotalShares =  data.shrout2;
     [intermKeys,~,idx] = unique(IDnum, 'stable');
     
     intermVals1 = accumarray(idx,dataOwnership,size(intermKeys));
     intermVals2 = 1000 * accumarray(idx,dataTotalShares,size(intermKeys),@mode);
    
     intermVals = num2cell(intermVals1./intermVals2);
     addmulti(intermKVStore,intermKeys,intermVals);
  
end
function gvkeyReduceFun3(intermKey, intermValIter, outKVStore)
while hasnext(intermValIter)
    
    intermValue = getnext(intermValIter);
       
end
add(outKVStore,intermKey,intermValue);
end

Thanks a lot for the help Guillaume. I understand this stuff much better now!

Guillaume on 9 Aug 2019

Edited: Guillaume on 9 Aug 2019

Open in MATLAB Online

Your mapreduce algorithm is problematic as it assumes that you get all the values (sum and mode) associated with a key (pair of rdate, cusip) in just one map pass. If the data in your table is not sorted by the key, it certainly won't be the case and even if it is, it may be that a key straddles two calls to map.

E.g. let's assume your data is sorted and in one of the call to gvkeyMapFun3, you get:

   rdata      cusip       shares     shrout2
   ---------------------------------------------
   ...        ...         ...        ...
   ...        ...         ...        ...
   'J'        'KK'        123        88
   'X'        'YY'        111        1
   'X'        'YY'        123        1
   'X'        'YY'        245        2
   'X'        'YY'        785        2
   'X'        'YY'        951        1

On the next call to gvkeyMapFun3, you get:

   rdata      cusip       shares     shrout2
   ---------------------------------------------
   'X'        'YY'        632        2
   'X'        'YY'        421        3
   'X'        'YY'        248        2
   'X'        'YY'        892        3
   'X'        'YY'        230        3
   'Z'        'NN'        673        45
   ...        ...         ...        ...

As you can see the mode of your key pair {'X', 'YY'} is 2 which you never got with your current algorithm.

Coping with that significantly complicate the mode calculation, as you can't calculate the mode in the mapping function. Instead you have to keep around the histogram of all the values of shrout2 and calculate the mode yourself in the reduce function, once you've combined all the histograms for the key.

Here is something that should work properly:

function gvkeyMapFun3(data, ~, intermKVStore)
     keys = strcat(data.rdate, '-', data.cusip);       %may be a better way to build your keys. Will be easier to separate afterwards
     [intermKeys, ~, keyid] = unique(keys, 'stable');  %I don't see the point of 'stable'. Probably makes no difference
     sharesum = accumarray(keyid, data.shares);        %sum up the shares that match the same keyid      
     shrouthist = accumarray(keyid, data.shrout2, [], @tempmode);   %compute histogram of shrout2 for each keyid
     values = num2cell([num2cell(sharesum), shrouthist], 2);        %for each key store a 1x2 cell array, 1st element is the share sum, 2nd element is a 2 column matrix [values, counts]    
     addmulti(intermKVStore, intermKeys, values);  
end
%can be local in gvKeyMapFun3.m
function modehist = tempmode(shrout2)
    %compute the histogram of shrout2, receives values that match just one key
    %returns a scalar cell array containing a Nx2 matrix. First column is unique values, 2nd column is their histogram
    [shroutval, ~, id] = unique(shrout2);
    shroutcount = accumarray(id, 1);
    modehist = {[shroutval, shroutcount]};
end

function gvkeyReduceFun3(intermKey, intermValIter, outKVStore)
    sharesum = 0;
    shrouthist = zeros(0, 2);
    while hasnext(intermValIter)
        values = getnext(intermValIter);
        sharesum = sharesum + values{1};
        shrouthist = [shrouthist; values{2}];  %#ok<AGROW>
    end
    %now that we have the full histogram of shrout2, we can calculate the mode
    [allshrout, ~, id] = unique(shrouthist(:, 1));
    count = accumarray(id, shrouthist(:, 2));
    [~, moderow] = max(count);
    shroutmode = allshrout(moderow);
    result = sharesum / 1000 / shroutmode;  %Not sure where the 1000 comes from. Was never mentioned before
    add(outKVStore, intermKey, result);
end

Daniel Pinto on 9 Aug 2019

Edited: Daniel Pinto on 9 Aug 2019

Thank you Guillaume for taking the time. It turns out in this case the correlation between the 2 outputs is 0.999, but you do make a very good point and I changed the code accordingly. This was very helpful and instructive, thank you!

ps: the multiplication by 1000 is just to scale the variable properly.

Sign in to comment.

How to use map and reduce efficiently?

5 Comments
Show 3 older commentsHide 3 older comments

Accepted Answer

5 Comments
Show 3 older commentsHide 3 older comments

More Answers (0)

See Also

Categories

Tags

Community Treasure Hunt

How to use map and reduce efficiently?

5 Comments Show 3 older commentsHide 3 older comments

Accepted Answer

5 Comments Show 3 older commentsHide 3 older comments

More Answers (0)

See Also

Categories

Tags

Community Treasure Hunt

5 Comments
Show 3 older commentsHide 3 older comments

5 Comments
Show 3 older commentsHide 3 older comments