Met with the team this morning and discussed progress. Roger pointed out that the Prob class has an isDiscrete method, which I am now using to split the dataset into continuous and discrete variables in the closest-worlds implementation. The code now proceeds as follows:
- Convert the dataset dictionary to a pandas DataFrame.
- Conditionalize on the counterfactual and filter columns on the conditional clause.
- Calculate k (how many worlds to find; hard-coded at 1% of dataset length).
- Split the dataset into two dataframes for continuous and discrete columns.
- Use the Jaccard coefficient to compare the discrete dataset against the discrete RVs of the given unit observation.
- Use Euclidean distance* to compare the continuous dataset against the continuous RVs of the given unit observation.
- Use the above values to produce a weighted-average similarity score.
- Sort the records by similarity, and restrict the dataset to the k closest.
- Return a call to Prob.predictDist, targeting the RV and using only the k closest records as data.
* See Week 7, Day 3 -- I realized those limitations also apply to the continuous variables, since cosine similarity breaks for collinear vectors: it scores [1 1] and [9 9] as identical. Thus, I changed the similarity metric to Euclidean distance; however, Euclidean distance doesn't produce scores in the range [0, 1], so it can't be averaged directly with the Jaccard scores; TBD tomorrow.
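A rough sketch of the steps above (everything before the `Prob.predictDist` call). The explicit column lists stand in for `isDiscrete`, the equal weights are arbitrary, and the `1/(1+d)` rescaling of Euclidean distance is purely a placeholder for the normalization question left TBD:

```python
import numpy as np
import pandas as pd

def closest_worlds(df, unit, disc_cols, cont_cols, w_disc=0.5, w_cont=0.5):
    """Return the k records of df most similar to `unit` (a dict-like row).

    Sketch only: disc_cols/cont_cols stand in for Prob.isDiscrete, and the
    1/(1+d) squashing of Euclidean distance is a placeholder normalization.
    """
    k = max(1, int(len(df) * 0.01))  # hard-coded at 1% of dataset length

    # Jaccard-style coefficient over the discrete columns:
    # fraction of discrete attributes that match the unit observation.
    unit_disc = pd.Series({c: unit[c] for c in disc_cols})
    jac = (df[disc_cols] == unit_disc).sum(axis=1) / len(disc_cols)

    # Euclidean distance over the continuous columns, squashed into (0, 1]
    # with 1/(1+d) -- a placeholder, since the real normalization is TBD.
    cont = df[cont_cols].to_numpy(dtype=float)
    u = np.array([unit[c] for c in cont_cols], dtype=float)
    d = np.linalg.norm(cont - u, axis=1)
    euc = 1.0 / (1.0 + d)

    # Weighted-average similarity; sort descending and keep the k closest.
    sim = w_disc * jac + w_cont * euc
    return df.loc[sim.sort_values(ascending=False).index[:k]]
```

In the real implementation the returned records would be passed as the data argument to `Prob.predictDist`, targeting the RV of interest.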