probability distribution from a simple vector

23 views (last 30 days)
Assume a vector like... 1 3 2 3 1 4 2 3 1 3 4 2 1 1 2 3 4
How can I calculate the likelyhood that nr 2 follows after nr 3 or nr 1 follows after nr 2? Ideally I would like to display this relationship for all numbers in a probability distribution.

Accepted Answer

Rik
Rik on 17 Aug 2018
Edited: Rik on 20 Aug 2018
Use meshgrid to generate all combinations and loop through them to count all occurrences. To convert to probability, divide by the total number of elements.
data=[1 3 2 3 1 4 2 3 1 3 4 2 1 1 2 3 5];
[first,second]=meshgrid(unique(data));
out=zeros(size(first));
for k=1:numel(first)
out(k)=sum(...
data(1:(end-1))==first(k) &...
data(2:end)==second(k));
end
P=out/numel(data);
figure(1),clf(1)
x=1:size(out,1);
y=size(out,2):-1:1;%flip y-direction
y_label=cellfun(@(x) num2str(x),num2cell(y),'UniformOutput',0);
image(x,y,P,'CDataMapping','scaled')
colormap(gray)
set(gca,'XTick',x)
set(gca,'YTick',y(end:-1:1))
set(gca,'YTickLabel',y_label)
xlabel('First value')
ylabel('Second value')
  3 Comments
Florian
Florian on 20 Aug 2018
Edited: Rik on 20 Aug 2018
Thanks for your answer, Rik. What is still unclear to me is how I need to read the figure when reading off the vaules of probability for one number following another: is it from the left to right or from bottom to top?
Rik
Rik on 20 Aug 2018
To more easily add a scale, I've changed the previous code from imshow to image. I've also flipped the y-direction to have the (0,0) position in the lower left corner.

Sign in to comment.

More Answers (2)

John D'Errico
John D'Errico on 8 Jun 2022
Note that the use of meshgrid is wildly inefficient, if all you want to know is count the frequency of one number following another. For example, suppose the vector had a length of 1e6? Then you would be generating matrices with meshgrid of size 1e7 by 1e7. Do you really want that? Do you have enough memory?
M = 1e7*1e7;
disp("At a minimum, approximately " + M/1e9/8 + " gigabytes of RAM will be required to perform your computation.")
At a minimum, approximately 12500 gigabytes of RAM will be required to perform your computation.
That seems like a lot, so unless you have god's computer on your desktop, you might consider alternatives. :)
For example:
n = 1e7;
datavector = randi(7,[1,n]);
% counts(i,j) gives the number of events where i fell directly before j in the vector
ind = (1:n-1);
counts = accumarray([datavector(ind);datavector(ind+1)]',1)
counts = 7×7
204176 203419 204076 203722 204089 204579 204456 204407 204309 203879 203265 203700 204747 203963 203924 203798 203312 203958 203727 204337 204412 204126 203904 204150 204138 204430 204159 203838 204168 204210 204208 204192 204488 203748 203433 204174 204300 204399 204592 204150 204157 204233 203542 204330 203444 204878 203862 204279 204212
And for the actual frequency of those events in this sample, we have:
freq = counts/n
freq = 7×7
0.0204 0.0203 0.0204 0.0204 0.0204 0.0205 0.0204 0.0204 0.0204 0.0204 0.0203 0.0204 0.0205 0.0204 0.0204 0.0204 0.0203 0.0204 0.0204 0.0204 0.0204 0.0204 0.0204 0.0204 0.0204 0.0204 0.0204 0.0204 0.0204 0.0204 0.0204 0.0204 0.0204 0.0204 0.0203 0.0204 0.0204 0.0204 0.0205 0.0204 0.0204 0.0204 0.0204 0.0204 0.0203 0.0205 0.0204 0.0204 0.0204
So we see a remarkably uniform distribution, as would be expected in this specific case, since randi will indeed be a uniform random genertor of integers.
Our expectation for the true frequency would be (as the sample size approaches infinity) is of course:
format long
1/(7*7)
ans =
0.020408163265306

Steven Lord
Steven Lord on 8 Jun 2022
If you only have a small number of potential states (and they're all integer values) you could try histcounts2.
rng default
A = randi(6, 100, 1);
histcounts2(A(1:end-1), A(2:end), 'BinMethod', 'integers')
ans = 6×6
4 3 4 2 1 3 1 1 1 3 5 3 1 1 3 4 3 2 5 2 2 1 3 2 2 5 2 2 6 4 4 2 3 3 2 4
Let's validate the 5 that is in element (5, 2).
locationOfFirst5 = find(A(1:end-1) == 5 & A(2:end) == 2)
locationOfFirst5 = 5×1
37 44 50 61 73
There are five (5, 2) pairs.
A([locationOfFirst5, locationOfFirst5+1])
ans = 5×2
5 2 5 2 5 2 5 2 5 2
The other 5s are followed by other values.
other5s = find(A(1:end-1) == 5 & A(2:end) ~= 2);
A([other5s, other5s+1])
ans = 16×2
5 6 5 1 5 6 5 5 5 5 5 3 5 1 5 5 5 5 5 5
any(A(other5s+1) == 2) % false
ans = logical
0
If you want probabilities use a different Normalization.
histcounts2(A(1:end-1), A(2:end), 'BinMethod', 'integers', 'Normalization', 'probability')
ans = 6×6
0.0404 0.0303 0.0404 0.0202 0.0101 0.0303 0.0101 0.0101 0.0101 0.0303 0.0505 0.0303 0.0101 0.0101 0.0303 0.0404 0.0303 0.0202 0.0505 0.0202 0.0202 0.0101 0.0303 0.0202 0.0202 0.0505 0.0202 0.0202 0.0606 0.0404 0.0404 0.0202 0.0303 0.0303 0.0202 0.0404

Community Treasure Hunt

Find the treasures in MATLAB Central and discover how the community can help you!

Start Hunting!