# Comparing lists of years for similarity

James Ryan on 14 Dec 2016
Commented: Guillaume on 15 Dec 2016
My problem involves calibrating a numerical model that predicts whether some event happens in each year. It could be economic events, coral bleaching, or many other things. I want to compare the similarity of results from different model versions, or with real-world historical data.
The models are expected to miss quite often, so looking for exact matches won't do. The size of the error matters, so the Wilcoxon rank-sum test won't do either. The lists will often differ in length, and they could be quite a bit longer than my examples below.
Here are examples of what I'd subjectively call "good" and "bad":
A = [1968 1972 1991 1993 2001 2010]
B = [1968 1972 1993 2001 2010]
C = [1969 1973 1991 1995 2001 2011]
D = [1950 1960 1991 1993 2001 2050]
E = [1968 1972 1991 1993 2001 2010 2050]
Consider A to be "correct"
B is missing one year entirely, but this is not disastrous.
C has only two matching values, but the others are close, I'd call this better than B.
D has three exact matches, but the others are way off. I'd consider this the worst.
E has five exact matches and one really bad point. Again, not disastrous.
Of course I don't expect an algorithm to match my subjective evaluation all the time. I just want it to take the things I have mentioned into account.
If I were to make up an algorithm off the cuff, I'd probably look for points with near neighbors in the other list and score their distances root-mean-square style, with some maximum value counted against any points left without a neighbor. This is really crude, and there must be a better way.
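That off-the-cuff idea can at least be written down to see how it ranks the examples. Here is a sketch in Python rather than MATLAB (for brevity; the translation is mechanical). The cap `max_penalty` is an arbitrary placeholder, not something from the question, and the matching is greedy per point rather than a true one-to-one assignment:

```python
import math

def nearness_score(a, b, max_penalty=10):
    """Score list b against reference a. Each year in b contributes the
    distance to its nearest neighbor in a, capped at max_penalty; any
    reference year that attracted no neighbor also costs max_penalty.
    Combined root-mean-square style; lower is better."""
    diffs = []
    matched = set()
    for y in b:
        d, j = min((abs(y - x), i) for i, x in enumerate(a))
        if d <= max_penalty:
            diffs.append(d)
            matched.add(j)
        else:
            diffs.append(max_penalty)
    # penalize reference years left with no near neighbor in b
    diffs.extend(max_penalty for i in range(len(a)) if i not in matched)
    return math.sqrt(sum(d * d for d in diffs) / len(diffs))
```

On the examples above this crude score already agrees with the subjective ranking: C scores better than B, and D scores worst.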

Guillaume on 14 Dec 2016
It sounds like you need some sort of edit distance calculation. A pure edit distance algorithm would rank B as better than C (1 deletion vs. 4 substitutions), but you can weight deletions more heavily than substitutions, and weight each substitution by how far the new value is from the original.
There is an edit distance function on the File Exchange. No idea of its quality.
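The weighted variant is a small change to the standard Levenshtein dynamic program. Here is a sketch in Python rather than MATLAB; `indel_cost` and `sub_scale` are made-up weights you would have to tune, not values from this thread:

```python
def year_edit_distance(a, b, indel_cost=5, sub_scale=1.0):
    """Levenshtein-style edit distance for sorted lists of years.
    Substituting one year for another costs sub_scale * |difference|;
    inserting or deleting a year costs a flat indel_cost."""
    m, n = len(a), len(b)
    dp = [[0.0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        dp[i][0] = i * indel_cost
    for j in range(1, n + 1):
        dp[0][j] = j * indel_cost
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            dp[i][j] = min(
                dp[i - 1][j] + indel_cost,  # delete a[i-1]
                dp[i][j - 1] + indel_cost,  # insert b[j-1]
                dp[i - 1][j - 1] + sub_scale * abs(a[i - 1] - b[j - 1]),
            )
    return dp[m][n]
```

With these particular weights, B (one deletion, cost 5) and C (four small substitutions, total cost 5) come out tied, while D is far worse; adjusting the weights shifts the balance between missing a year and being slightly off.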
##### 2 Comments
James Ryan on 14 Dec 2016
Thanks. This definitely moves me closer to a solution. The only difference is that in that algorithm (designed for strings), replacing one letter with another has the same "cost" regardless of the letter. With dates, the replacement matters. Maybe I can tweak it to work.
Guillaume on 15 Dec 2016
Yes, as I said you can modify the standard algorithm to give different weight to substitutions depending on how far they are from the original value.
The concept of what you are trying to do is definitely an edit distance, so I'm sure you can find an algorithm already developed somewhere.

KSSV on 14 Dec 2016
You can try ismembertol (https://in.mathworks.com/help/matlab/ref/ismembertol.html). You can set a tolerance limit and find out whether two sets of numbers have any common elements within it. You can handle your different scenarios by choosing your tolerance limits.
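For readers without MATLAB, the core of what ismembertol does for this problem can be sketched in a few lines of Python (the tolerance of 2 years below is purely illustrative):

```python
def ismember_tol(a, b, tol=2):
    """Rough analogue of MATLAB's ismembertol for integer years:
    for each element of a, True if it lies within tol of some
    element of b, else False."""
    return [any(abs(x - y) <= tol for y in b) for x in a]
```

Counting the True entries gives a quick "how many years found a near match" number for each candidate list.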
James Ryan on 14 Dec 2016
Another step in the right direction. Perhaps I could count exact matches, then near matches, and then count years which don't have a near match. Each count could be weighted differently to create a "nearness" score. Thanks.
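That counting-and-weighting idea might look like the following Python sketch. The tolerance and the three weights are all placeholders to be tuned, not values anyone in this thread proposed:

```python
def tiered_score(ref, pred, tol=3, w_exact=0, w_near=1, w_miss=10):
    """Classify each predicted year as an exact match, a near match
    (within tol of some reference year), or a miss, then combine the
    counts with weights. Lower score = more similar."""
    exact = near = miss = 0
    for y in pred:
        d = min(abs(y - r) for r in ref)
        if d == 0:
            exact += 1
        elif d <= tol:
            near += 1
        else:
            miss += 1
    return w_exact * exact + w_near * near + w_miss * miss
```

With these weights, B scores 0 (all its years match exactly), C scores 4 (four near misses), and D scores 30 (three years with no near match), so the ranking can be steered by how heavily misses are penalized relative to near matches.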

Image Analyst on 15 Dec 2016
What about ismember() and/or setdiff()? You don't need ismembertol() if all your numbers (years) are integers. setdiff() tells you which numbers differ between the two vectors, and ismember() tells you which numbers are the same in the two vectors. Neither one cares about position, but I don't think that matters to you - you only care whether the number(s) is/are present or not in the array.
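For non-MATLAB readers, set operations express the same thing; here is a hedged Python sketch where the intersection plays the role of ismember() and the differences play the role of setdiff() in each direction:

```python
def membership_summary(a, b):
    """Years present in both lists, years only in a, and years only
    in b, using set intersection and difference (the Python analogues
    of MATLAB's ismember and setdiff for this integer-year case)."""
    common = sorted(set(a) & set(b))   # in both, like ismember
    only_a = sorted(set(a) - set(b))   # like setdiff(a, b)
    only_b = sorted(set(b) - set(a))   # like setdiff(b, a)
    return common, only_a, only_b
```

Comparing A and B from the question, this reports five shared years, 1991 missing from B, and nothing extra in B.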