How can I use the pdist2 function for big data?

I want to implement k-means in MATLAB. My data set is a 9,000,000-by-1 matrix. When I used the Euclidean distance to find the distances between the points and the centroids, I got the following error:
Error using pdist2mex
Out of memory. Type HELP MEMORY for your options.
Error in pdist2 (line 343)
D = pdist2mex(X',Y',dist,additionalArg,smallestLargestFlag,radius);
Error in k_means_new (line 38)
dist = pdist2(d,centroids,distance); % distance between all data points and centroids
I should mention that I am running MATLAB on a Windows 8 system with the following configuration:
RAM: 8 GB
CPU: Intel Core i5-3230M
Could you please help me?
Thanks in advance.
  2 Comments
Walter Roberson on 29 Apr 2016
What are size(d) and size(centroids)?
mina movahed on 30 Apr 2016
Edited: mina movahed on 30 Apr 2016
size(d) = 9000000 * 1
size(centroids) = 240


Answers (2)

Image Analyst on 30 Apr 2016
Chances are you don't need all of that in memory at the same time. What are you really trying to do? For example, are you trying to find the two points farthest from each other? If so, a simple double for loop that stores only the maximum distance (one value) instead of an 18 gigapixel array would work. Or you might be able to get what you need by taking a subsample of your original 9 million element array. So tell us the big picture: what are you really trying to accomplish? Then we can advise you on a better, less memory-intensive approach.
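The subsampling idea could be sketched as follows. This is only an illustration: the synthetic `d`, the name `dSample`, and the sample size `nSample` are assumptions, not part of the original question.

```matlab
% Stand-in data; replace with the real 9,000,000-by-1 vector.
rng(0);                               % reproducible example
d = rand(9e6, 1);                     % about 72 MB as doubles

nSample = 1e5;                        % assumed sample size, tune as needed
idx     = randperm(numel(d), nSample);   % nSample unique indices, no replacement
dSample = d(idx);                     % nSample-by-1 subsample of d

% For 1-D data the two points farthest apart are the minimum and the
% maximum, so that particular question needs no pairwise distances at all.
maxDist = max(d) - min(d);
```

Any distance computation that is too large for the full vector can then be tried on `dSample` first.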
  1 Comment
mina movahed on 2 May 2016
First of all, sorry, I did not see your comment. As Walter said, it is better to rewrite the algorithm so that it does not need as much memory. I want to implement some data mining algorithms in MATLAB and then analyze the data.



Walter Roberson on 30 Apr 2016
Why are you bothering with the Euclidean distance between one-dimensional objects? That is the same as abs() of the difference between them:
abs(bsxfun(@minus, d, centroids(:).'))
This is only going to be 9000000 * 240 entries, each of 8 bytes, which is only 17.28 gigabytes. An additional working storage of 9000000 * 8 bytes (72 megabytes) would also be required. Just make sure your swap space is set large enough to hold the array, and set your preferences to not prevent large arrays. It should probably only take 5 or 6 hours to compute.
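The memory figures above can be checked with a few lines of arithmetic. The variable names are only illustrative; the sizes come from the question and the comments.

```matlab
nPoints      = 9000000;   % size(d,1), from the question
nCentroids   = 240;       % number of centroids, from the comments
bytesPerElem = 8;         % one double-precision value

fullMatrixBytes = nPoints * nCentroids * bytesPerElem;  % full distance matrix
columnBytes     = nPoints * bytesPerElem;               % one 9,000,000-by-1 column

fprintf('Full matrix: %.2f GB\n', fullMatrixBytes / 1e9);  % 17.28 GB
fprintf('One column:  %.0f MB\n', columnBytes / 1e6);      % 72 MB
```

With 8 GB of RAM, the full matrix is more than twice the physical memory, which is why the computation would spill into swap space.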
  6 Comments
mina movahed on 2 May 2016
Thanks a lot. I will try this and let you know if it works. The task is to implement k-means, so I need to find the distances between all points and the centroids.
Walter Roberson on 2 May 2016
For k-means you do not need to retain those distances; you only need to figure out which centroid is closest to each point. That takes the long-term storage requirement down by a factor of length(centroids).
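A minimal sketch of that idea, assuming the data are processed in chunks so that only a chunk-by-k distance block exists at any time. The synthetic data, the names `labels`, `minDist`, `newCentroids`, and the value of `chunkSize` are assumptions for illustration, not the original poster's code.

```matlab
% Small synthetic stand-ins so the sketch runs quickly; in the real
% problem d is 9,000,000-by-1 and there are 240 centroids.
rng(0);
d         = rand(1e5, 1);
centroids = linspace(0, 1, 240).';    % assumed 240-by-1 column
chunkSize = 1e4;                      % assumed; block uses chunkSize*240*8 bytes

n       = numel(d);
c       = centroids(:).';             % 1-by-k row for bsxfun expansion
labels  = zeros(n, 1);                % index of the nearest centroid per point
minDist = zeros(n, 1);                % distance to that centroid

for first = 1:chunkSize:n
    last = min(first + chunkSize - 1, n);
    % chunk-by-k block of 1-D distances, discarded after each iteration
    D = abs(bsxfun(@minus, d(first:last), c));
    [minDist(first:last), labels(first:last)] = min(D, [], 2);
end

% k-means update step: mean of the points assigned to each centroid
% (NaN marks a centroid that received no points).
newCentroids = accumarray(labels, d, [numel(c) 1], @mean, NaN);
```

Only `labels` and `minDist` are kept, each the same size as `d`, so the 9,000,000-by-240 matrix is never formed.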
