Comparing lists of years for similarity

Question

James Ryan on 14 Dec 2016

https://nl.mathworks.com/matlabcentral/answers/316793-comparing-lists-of-years-for-similarity

Commented: Guillaume on 15 Dec 2016

My problem involves calibrating a numerical model which predicts some event which happens or not in each year. It could be economic events, coral bleaching, or many other things. I want to compare the similarity of results from different model versions, or with real-world historical data.

The models are expected to miss quite often, so looking for exact matches won't do. The size of the error matters, so the Wilcoxon rank-sum test won't do either. The lists will often differ in length, and they could be considerably longer than my examples below.

Examples of what is subjectively "good" and "bad".

A = [1968 1972 1991 1993 2001 2010]
B = [1968 1972 1993 2001 2010]
C = [1969 1973 1991 1995 2001 2011]
D = [1950 1960 1991 1993 2001 2050]
E = [1968 1972 1991 1993 2001 2010 2050]

Consider A to be "correct"

B is missing one year entirely, but this is not disastrous.
C has only two exact matches, but the others are close; I'd call this better than B.
D has three exact matches, but the others are way off. I'd consider this the worst.
E has six exact matches and one really bad extra point. Again, not disastrous.

Of course I don't expect an algorithm to match my subjective evaluation all the time. I just want it to take the things I have mentioned into account.

If I were to make up an algorithm off the cuff, I'd probably look for points with near neighbors in the other list and score their distances root-mean-square style, with some maximum value counted against any points left with no neighbor. This is really crude, and there must be a better way.
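To make the off-the-cuff idea concrete, here is a minimal sketch of it. The function name nnScore and the cap maxPenalty are made up for illustration, and the cap value is an assumption that would need tuning. It assumes both lists are non-empty.

```matlab
function score = nnScore(ref, test, maxPenalty)
% nnScore  Hypothetical helper: capped nearest-neighbour RMS distance
%   between two non-empty lists of years.
%   maxPenalty (in years) is an assumed cap, charged in full to any year
%   whose nearest neighbour in the other list is at least that far away.
gap   = abs(ref(:) - test(:).');       % pairwise differences (implicit expansion, R2016b+)
dRef  = min(gap, [], 2);               % nearest test year for each reference year
dTest = min(gap, [], 1).';             % nearest reference year for each test year
d     = min([dRef; dTest], maxPenalty);% cap so one wild year cannot dominate
score = sqrt(mean(d.^2));              % 0 = identical lists, larger = less similar
end
```

With maxPenalty = 10, nnScore(A, D, 10) comes out worst of the four examples, which fits the subjective ranking. B still scores slightly better than C, though, because 1991 finds 1993 as a neighbor only two years away. Nothing in this scheme forces a one-to-one pairing, which is one of the ways it is crude.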

Suggestions, please!


Answer 1

Guillaume on 14 Dec 2016

https://nl.mathworks.com/matlabcentral/answers/316793-comparing-lists-of-years-for-similarity#answer_247124

It sounds like you need some sort of edit distance calculation. A pure edit distance algorithm would rank B as better than C (1 deletion vs 4 substitutions) but you can weight the deletions more than the substitutions and give different weight to the substitutions by how far they are from the original value.

There is an edit distance function on the File Exchange. No idea of its quality.
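As a rough illustration of the weighting described above, the standard dynamic-programming edit distance can be adapted so that the substitution cost depends on the gap in years. This is only a sketch, not the File Exchange function: the name yearEditDistance and the parameters gapCost and maxSub are hypothetical, and the lists are assumed to be sorted.

```matlab
function d = yearEditDistance(a, b, gapCost, maxSub)
% yearEditDistance  Hypothetical sketch of a weighted edit distance
%   between two sorted lists of years.
%   gapCost: assumed cost of one insertion or deletion
%   maxSub : assumed cap on the cost of substituting one year for another
n = numel(a);
m = numel(b);
T = zeros(n + 1, m + 1);
T(:, 1) = (0:n).' * gapCost;   % delete every year of a
T(1, :) = (0:m) * gapCost;     % insert every year of b
for i = 1:n
    for j = 1:m
        sub = min(abs(a(i) - b(j)), maxSub);  % cost grows with the year gap, up to the cap
        T(i + 1, j + 1) = min([T(i, j) + sub, ...         % substitute a(i) with b(j)
                               T(i, j + 1) + gapCost, ... % delete a(i)
                               T(i + 1, j) + gapCost]);   % insert b(j)
    end
end
d = T(n + 1, m + 1);           % total cost of the cheapest alignment
end
```

With the assumed values gapCost = 6 and maxSub = 10, the lists from the question give 6 for B, 5 for C, 30 for D and 6 for E when compared with A. C ranks best and D worst, in line with the subjective ranking. Other weights will shift the ordering.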

2 Comments

James Ryan on 14 Dec 2016

Thanks. This definitely moves me closer to a solution. The only difference is that in that algorithm (designed for strings), replacing one letter with another has the same "cost" regardless of the letter. With dates, the size of the replacement matters. Maybe I can tweak it to work.

Guillaume on 15 Dec 2016

Yes, as I said you can modify the standard algorithm to give different weight to substitutions depending on how far they are from the original value.

The concept of what you are trying to do is definitely one of an edit distance, so I'm sure you can find an algorithm already developed somewhere.

Answer 2

KSSV on 14 Dec 2016

https://nl.mathworks.com/matlabcentral/answers/316793-comparing-lists-of-years-for-similarity#answer_247120

You can try ismembertol: https://in.mathworks.com/help/matlab/ref/ismembertol.html. Choose a tolerance and it reports which elements of one set lie within that tolerance of some element of the other set. You can cover your different scenarios by adjusting the tolerance.
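A small sketch of how that might look for two of the lists from the question. The two-year tolerance is an assumed value. Note that ismembertol scales its tolerance by the magnitude of the data by default, so 'DataScale' is set to 1 here to get a tolerance in plain years.

```matlab
A = [1968 1972 1991 1993 2001 2010];   % reference years
D = [1950 1960 1991 1993 2001 2050];   % model output to check
tolYears = 1;                          % assumed tolerance, in years

% 'DataScale', 1 makes the tolerance absolute instead of relative to the data magnitude
nearA = ismembertol(A, D, tolYears, 'DataScale', 1);  % years of A with a neighbour in D
nearD = ismembertol(D, A, tolYears, 'DataScale', 1);  % years of D with a neighbour in A

% Fraction of all listed years that have a close partner in the other list
fractionMatched = (sum(nearA) + sum(nearD)) / (numel(A) + numel(D));
```

Checking in both directions matters because the lists can differ in length: a year missing from D shows up in nearA, and a spurious year in D shows up in nearD.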

1 Comment

James Ryan on 14 Dec 2016

Another step in the right direction. Perhaps I could count exact matches, then near matches, and then count years which don't have a near match. Each count could be weighted differently to create a "nearness" score. Thanks.
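The counting idea in this comment could be sketched roughly as follows. The function name nearnessScore, the window tolYears and the weight vector w are all hypothetical, and sensible weights would have to come from experiment. Both lists are assumed to be non-empty.

```matlab
function s = nearnessScore(ref, test, tolYears, w)
% nearnessScore  Hypothetical weighted count of exact, near and unmatched years.
%   tolYears: assumed window (in years) for calling a match "near"
%   w       : assumed weights in the order [exact near unmatched]
gap     = abs(ref(:) - test(:).');                % pairwise differences (R2016b+)
nearest = [min(gap, [], 2); min(gap, [], 1).'];   % nearest-neighbour gap for every year in both lists
nExact  = sum(nearest == 0);                      % years with an exact partner
nNear   = sum(nearest > 0 & nearest <= tolYears); % years with a close partner
nMiss   = sum(nearest > tolYears);                % years with no partner in range
s = (w(1)*nExact + w(2)*nNear + w(3)*nMiss) / numel(nearest);  % normalise by total count
end
```

A call such as nearnessScore(A, C, 2, [1 0.5 -1]) rewards exact matches fully, near matches partly, and penalises unmatched years. Dividing by the total number of years keeps scores comparable between lists of different lengths.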

Answer 3

Image Analyst on 15 Dec 2016

https://nl.mathworks.com/matlabcentral/answers/316793-comparing-lists-of-years-for-similarity#answer_247224

What about ismember() and/or setdiff()? You don't need ismembertol() if all your numbers (years) are integers. setdiff(A, B) returns the values in A that are not in B, and ismember(A, B) tells you which elements of A also appear in B. Neither one cares about position, but I don't think that matters to you: you only care whether the number(s) is/are present in the array or not.
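For example, with two of the lists from the question (the variable names hits, missed and falseAlarms are just illustrative):

```matlab
A = [1968 1972 1991 1993 2001 2010];   % reference years
D = [1950 1960 1991 1993 2001 2050];   % model output

hits        = A(ismember(A, D));   % years present in both lists: 1991 1993 2001
missed      = setdiff(A, D);       % years in A that D lacks:     1968 1972 2010
falseAlarms = setdiff(D, A);       % years in D that A lacks:     1950 1960 2050
```

Keep in mind that this only sees exact matches. A prediction that is off by one year counts the same as one that is off by forty, so it does not capture the size of the error that the question asks about.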
