Find all cycles (or the initial node of each cycle) in a very large directed graph (12m nodes, 6.7m edges)
5 views (last 30 days)
Jared Violo on 9 Aug 2021
Hi if anyone could help me out with this problem I have been stuck for a while and would be really grateful for your help,
I have about 7 million transactions between accounts, where a transaction is a single row that describes who the buyer account and seller account are, and the object that was sold. So an example of one row would be: [B:0x3, S:0x1, O:Object1] or to better understand 0x1 sold Object1 to 0x3
I have concatenated the object to the buyer and seller aka 0x1-Object1 -> 0x3-Object1 (because an account can sell multiple different objects) and created a directed graph from this data to create a history of how all objects have transferred between accounts.
I want a list of all the accounts where an object has returned back to the account it originally sold from, so for example:
If I had the transactions such that: Account1-Object1 -> Account2-Object1 -> Account1-Object1 then the output should be: Account1-Object1
because the asset eventually cycled back to Account1.
Current Non-Feasible Solution:
this is interpreted as: 0xc sold Object1 to 0x5, 0x5 sold Object2 to 0xc, 0xc sold Object3 to 0x0, 0x5 sold Object5 to 0x0, etc
My code so far:
%concatenate Object to buyer and seller
sales.BuyerAndObject = strcat(string(sales.Buyer),"-",string(sales.Object));
buyer = transpose(sales.BuyerAndObject);
seller = transpose(sales.SellerAndObject);
%Directed Graph: Seller -> Buyer
G = digraph(seller,buyer);
%I have then been trying to use allcycles(G) to find all the cycles,
cycles = allcycles(G)
then return the first node of each cycle, but the data is too massive as I have about 12 Million nodes and 6.7 MIllion edges and the time to completion is beyond reasonable. I am not great at graph theory and at a loss for how to go about what I need to do, and any help is greatly apreciated.
Christine Tobler on 9 Aug 2021
Edited: Christine Tobler on 9 Aug 2021
Have you tried using 'MaxNumCycles' to only compute the first 100 / 1000 / 1e4 cycles? That could give some indication if it's computing each cycle that's slower than acceptable for your problem, or if there are just very many cycles.
It might also be worthwhile to take a look at some of these cycles, to make sure they are all legitimate results (that is, perhaps you're seeing several cycles that can be recombined in different ways, which could cause there to be many more cycles returned than you'd expect).