Thursday, June 24, 2021

07.04

Week 7, Day 4

Met with the team this morning and discussed progress. Roger pointed out that the Prob class has an isDiscrete method, which I am now using to split the dataset into continuous and discrete variables in the closest-worlds implementation. The code steps are now as follows:
  1. Convert the dataset dictionary to a pandas DataFrame.
  2. Conditionalize on the counterfactual and filter columns on the conditional clause.
  3. Calculate k (how many worlds to find; hard-coded at 1% of dataset length).
  4. Split the dataset into two dataframes for continuous and discrete columns.
  5. Use the Jaccard coefficient to compare the discrete dataframe against the discrete RVs of the given unit observation.
  6. Use Euclidean distance* to compare the continuous dataframe against the continuous RVs of the given unit observation.
  7. Combine the two values into a weighted-average similarity score.
  8. Sort the records by similarity, and restrict the dataset to the k closest.
  9. Return a call to Prob.predictDist, targeting the RV and using only the k closest records as data.
* See Week 7, Day 3 -- I realized those limitations also apply to the continuous variables, since cosine similarity breaks for colinear vectors: [1 1] and [9 9] score as identical. I therefore switched the similarity metric to Euclidean distance; however, Euclidean distance doesn't produce scores in the range [0, 1]. TBD tomorrow.
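The colinearity problem from the footnote is easy to demonstrate: cosine similarity only sees direction, so [1 1] and [9 9] are indistinguishable, while Euclidean distance does separate them (at the cost of being unbounded above). A quick sanity check:

```python
import numpy as np

def cosine_sim(a, b):
    # cosine similarity: dot product of the two vectors, normalized by their lengths
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

a = np.array([1.0, 1.0])
b = np.array([9.0, 9.0])

print(cosine_sim(a, b))       # ≈ 1.0 -- colinear vectors look identical
print(np.linalg.norm(a - b))  # ≈ 11.31 -- Euclidean distance tells them apart
```

This also makes the open issue concrete: the Jaccard scores live in [0, 1], but the Euclidean distances do not, so they can't be averaged together as-is.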
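For my own reference, steps 4-8 can be sketched roughly as below. This is an illustration, not the actual implementation: jaccard_sim, closest_worlds, and the max-distance rescaling of the Euclidean scores are all placeholders of mine (the rescaling is just one possible stopgap for the [0, 1] problem in the footnote, not the decided fix), and the real code would finish step 9 by calling Prob.predictDist on the k closest rows instead of returning them.

```python
import numpy as np
import pandas as pd

def jaccard_sim(row, unit, cols):
    # Jaccard coefficient over attribute-value pairs: |A ∩ B| / |A ∪ B|
    a = {(c, row[c]) for c in cols}
    b = {(c, unit[c]) for c in cols}
    return len(a & b) / len(a | b) if (a | b) else 1.0

def closest_worlds(df, unit, discrete_cols, continuous_cols, k, w=0.5):
    # step 5: Jaccard similarity of each record's discrete RVs vs the unit's
    disc_sim = df.apply(lambda r: jaccard_sim(r, unit, discrete_cols), axis=1)

    # step 6: Euclidean distance on the continuous columns
    diffs = df[continuous_cols].to_numpy() - np.array([unit[c] for c in continuous_cols])
    euclid = np.sqrt((diffs ** 2).sum(axis=1))

    # distance -> similarity: naively rescale by the max distance so it lands
    # in [0, 1]; this is a stopgap for the open issue, not the settled metric
    cont_sim = 1.0 - euclid / euclid.max() if euclid.max() > 0 else np.ones(len(df))

    # step 7: weighted-average similarity score
    score = w * disc_sim + (1 - w) * cont_sim

    # step 8: sort by similarity and keep the k closest records
    return df.loc[score.sort_values(ascending=False).index[:k]]

# toy example: the two rows matching on color and nearby in x should win
df = pd.DataFrame({"color": ["r", "r", "b", "b"], "x": [1.0, 2.0, 9.0, 1.1]})
unit = {"color": "r", "x": 1.0}
print(closest_worlds(df, unit, ["color"], ["x"], k=2))
```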