Wednesday, June 23, 2021

07.03

Week 7, Day 3

After testing, I've realized that cosine similarity is not the best metric for a mix of continuous and discrete variables. In particular, cosine similarity doesn't work for binary variables because it is undefined for zero vectors, and these are inevitable with conditionalization sets composed only of binary variables. For example, if you have three binary variables you want to compare, you'll eventually hit the combination [0, 0, 0], which has a magnitude of zero. Since the magnitude is in the denominator of the cosine similarity function, you'll get a division-by-zero error every time. Moreover, cosine similarity is based on a normalized dot product. This means a 0 paired with a 0 contributes nothing to the dot product, exactly like a 0 paired with a 1, when it should count as a match just as 1 and 1 does. And conceptually, this representation doesn't capture the behavior of categorical variables: two mutually exclusive categories shouldn't be placed on the same axis, they should be orthogonal. Instead of treating 0 and 1 as numbers, they should be treated as vectors on their own axes.
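
To make the zero-vector problem concrete, here is a minimal sketch in plain Python. The helper names cosine_similarity and one_hot are hypothetical and written only for this illustration, not taken from the project code. The sketch assumes each binary value is mapped to its own axis (0 becomes [1, 0] and 1 becomes [0, 1]), which is one way to read "vectors on their own axes".

```python
import math

def cosine_similarity(u, v):
    """Plain cosine similarity: dot product divided by the product of magnitudes."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    # Raises ZeroDivisionError when either vector has zero magnitude.
    return dot / (norm_u * norm_v)

def one_hot(bits):
    """Hypothetical encoding: give each binary value its own axis (0 -> [1, 0], 1 -> [0, 1])."""
    encoded = []
    for b in bits:
        encoded.extend([1, 0] if b == 0 else [0, 1])
    return encoded

# Raw binary values: the all-zero combination has magnitude zero.
try:
    cosine_similarity([0, 0, 0], [0, 0, 0])
    zero_vector_failed = False
except ZeroDivisionError:
    zero_vector_failed = True

# With each value on its own axis, no encoded vector has zero magnitude.
sim_zeros = cosine_similarity(one_hot([0, 0, 0]), one_hot([0, 0, 0]))
sim_ones = cosine_similarity(one_hot([1, 1, 1]), one_hot([1, 1, 1]))
sim_opposite = cosine_similarity(one_hot([0, 0, 0]), one_hot([1, 1, 1]))
```

Under this encoding, a full match on zeros scores the same as a full match on ones (both approximately 1, up to floating-point rounding), and fully mismatched vectors score 0 because their encodings are orthogonal.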

After some research, I found a solution: use matching coefficients for categorical RVs, in particular the Jaccard coefficient. This means we need a way of knowing which RVs are discrete, which I will bring up in tomorrow's team meeting.
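
As a rough sketch of what these coefficients compute on binary vectors, the functions below implement the Jaccard coefficient and, for comparison, the simple matching coefficient. All function names are hypothetical, and returning 1.0 from jaccard when both vectors are all zeros is an assumed convention, since the coefficient is otherwise undefined in that case. Note that Jaccard ignores positions where both values are 0, whereas simple matching counts them as agreement.

```python
def match_counts(a, b):
    """Count joint presences (1, 1), joint absences (0, 0), and mismatches."""
    if len(a) != len(b) or len(a) == 0:
        raise ValueError("vectors must be non-empty and of equal length")
    m11 = sum(1 for x, y in zip(a, b) if x == 1 and y == 1)
    m00 = sum(1 for x, y in zip(a, b) if x == 0 and y == 0)
    mismatches = len(a) - m11 - m00
    return m11, m00, mismatches

def jaccard(a, b):
    """Jaccard coefficient: joint presences over positions where at least one value is 1."""
    m11, _, mismatches = match_counts(a, b)
    denom = m11 + mismatches
    if denom == 0:
        # Assumed convention: two all-zero vectors are treated as identical.
        return 1.0
    return m11 / denom

def simple_matching(a, b):
    """Simple matching coefficient: all agreeing positions over all positions."""
    m11, m00, _ = match_counts(a, b)
    return (m11 + m00) / len(a)

a = [1, 0, 1, 0]
b = [1, 1, 0, 0]
j = jaccard(a, b)          # 1 joint presence out of 3 positions with at least one 1
s = simple_matching(a, b)  # 2 agreeing positions out of 4
```

Which of the two fits better depends on whether a shared 0 should count as evidence of similarity for the variables in question.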