Array exceeds maximum array size when using the dbscan function

Hi,
I have a large data set from a radar sensor, roughly 2-3 million points (xy coordinates).
I am using a dbscan function that calls pdist2. Because the data set is so large, it fails with an error saying the array exceeds the maximum array size (see the attached file). Can you please help me work out how to proceed?
Please note: I do not want to cut or split the data.

3 Comments

You are making some other mistake.
2-3 million data points are not a problem and will not cost that much memory.
There is no mistake as such; I have already tried it.
You can also try running pdist2 on the attached data and give me feedback.
Can you please suggest how I should run the pairwise distance calculation for the attached data?
Yup, not working, sorry for that.
What is the distance metric?
'euclidean', 'squaredeuclidean', 'mahalanobis', ...
For Euclidean, simply apply
sqrt(sum(abs(ImageSpots.x-ImageSpots.y).^2))

Answers (1)

pdist2 can indeed exhaust the memory. If the inputs contain m and n points (rows) respectively, it returns an m-by-n matrix. If m and n are in the millions, it will fail.
SatyaPrakash, you tagged the question with R2018b, but MathWorks introduced dbscan in R2019a. Are you using the function from an external source? In that case, it is best to contact the developer and ask for a workaround. I checked the definition of dbscan in R2020a, and it does not include any call to pdist2. I think that MATLAB's built-in function will be able to handle such large arrays. You might try this on a more recent release.
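As a rough sketch of what calling the built-in function looks like (R2019a or later, Statistics and Machine Learning Toolbox): the synthetic data and the values of `epsilon` and `minpts` below are assumptions chosen for illustration, not values taken from the radar data, and whether the call scales to millions of points will depend on the release and the available memory.

```matlab
% Synthetic stand-in for the radar data: two well-separated blobs of xy points.
rng(0);                                   % reproducible random numbers
X = [0.1*randn(5000, 2);                  % cluster around (0, 0)
     0.1*randn(5000, 2) + 5];             % cluster around (5, 5)

epsilon = 0.5;   % neighborhood radius (hypothetical value, tune for real data)
minpts  = 10;    % minimum neighbors for a core point (hypothetical value)

% Built-in dbscan takes the raw n-by-2 coordinates directly;
% no precomputed distance matrix is passed in.
idx = dbscan(X, epsilon, minpts);         % idx(i) == -1 marks noise

numClusters = numel(unique(idx(idx > 0)));
fprintf('Found %d clusters, %d noise points\n', numClusters, sum(idx == -1));
```

The returned `idx` has one cluster label per row of `X`, so it can be used directly to color a scatter plot of the points.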

6 Comments

So when using pdist2, how can we avoid the failure?
Second: can I get the built-in dbscan function that you are referring to?
"So when using pdist2, how can we avoid the failure?"
You don't really have a choice. pdist2 computes the pairwise distance between all points in x and y. It must therefore generate a size(x,1)-by-size(y,1) matrix (one row per point in x, one column per point in y), and you need enough memory to store that matrix. If you don't, the only way is to reduce the size of the inputs.
"Can I get the built-in dbscan function"
Upgrade to R2019a (or better, R2020a). It's the only way.
You cannot avoid the failure with pdist2 for your current data set. You will need some other method to supply the distances to the dbscan algorithm. The original developer of the code will know whether it is possible to avoid the call to pdist2 altogether and use pdist instead. In fact, if both inputs to pdist2 are the same, it is better to call pdist, which stores each pair only once; however, it can make later processing more complicated. I guess the developer of the code chose the more straightforward approach at the cost of scalability.
The only way to get the built-in function is to upgrade your MATLAB release.
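To illustrate the pdist versus pdist2 point on a toy example (the six random points here are an assumption purely for demonstration): pdist stores each pair once, so it needs about half the memory of pdist2(X, X), but its output still grows quadratically with the number of points.

```matlab
rng(1);
X = rand(6, 2);                 % six xy points, small enough to inspect

Dfull = pdist2(X, X);           % 6-by-6 symmetric matrix, 36 entries
Dvec  = pdist(X);               % 1-by-15 row vector: n*(n-1)/2 unique pairs

% squareform expands the condensed vector back into the full symmetric matrix.
Dback = squareform(Dvec);
assert(max(max(abs(Dback - Dfull))) < 1e-10);

fprintf('pdist2 entries: %d, pdist entries: %d\n', numel(Dfull), numel(Dvec));
```

Code written against the full matrix would have to be adapted to index into the condensed vector, which is the extra complication mentioned above.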
@Guilaume:
Can you please help me with the data I provided in the earlier comment and generate the pairwise distance matrix for it?
Or is there another mechanism to do it? Can you give me some example code?
Whichever way you create it, a 408122 x 408122 single array requires about 620 GB of memory to store it. There's no way around that.
As it's very unlikely that you have anywhere near that amount of memory, the only option you have is to significantly reduce the size of your inputs.
For reference, a 32768 x 32768 single array would already use 4 GB of memory.
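The arithmetic behind these figures can be checked in a few lines. This is only a back-of-the-envelope sketch: it counts the raw element storage (4 bytes per single, 8 bytes per double) and reports binary gigabytes (2^30 bytes).

```matlab
n = 408122;                          % number of points quoted above

bytesSingle = n^2 * 4;               % single precision: 4 bytes per element
bytesDouble = n^2 * 8;               % double precision: 8 bytes per element

fprintf('single: about %.0f GB\n', bytesSingle / 2^30);
fprintf('double: about %.0f GB\n', bytesDouble / 2^30);

% Reference case: a 32768-by-32768 single array
refGB = 32768^2 * 4 / 2^30;
fprintf('32768 x 32768 single: %g GB\n', refGB);
```

Note that pdist2 returns double when its inputs are double, in which case the requirement doubles again.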
I agree with you, thank you for the feedback.
Is there any mechanism or algorithm to calculate pairwise distances for such a large data set?
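One possible direction, offered as a sketch rather than a tested solution for the radar data: DBSCAN only needs, for each point, the neighbors that lie within the radius epsilon, so a spatial index can return just those pairs instead of the full matrix. The example below uses KDTreeSearcher and rangesearch from the Statistics and Machine Learning Toolbox; the synthetic points and the value of `epsilon` are assumptions for illustration.

```matlab
rng(2);
X = rand(20000, 2);                       % synthetic xy points (stand-in data)
epsilon = 0.01;                           % hypothetical neighborhood radius

% Build a k-d tree once, then ask only for pairs closer than epsilon.
tree = KDTreeSearcher(X);                 % Euclidean distance by default
[nbrIdx, nbrDist] = rangesearch(tree, X, epsilon);

% nbrIdx{i} lists the points within epsilon of point i (including i itself);
% nbrDist{i} holds the matching distances in ascending order.
neighborCounts = cellfun(@numel, nbrIdx);
fprintf('Mean neighbors per point: %.1f\n', mean(neighborCounts));
```

Memory use here scales with the number of neighbor pairs actually found rather than with the square of the number of points, so it stays manageable only when epsilon is small relative to the spread of the data.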

Release

R2018b

Asked:

on 8 Apr 2020

Edited:

on 8 Apr 2020