Friday, June 25, 2021


Week 7, Day 5

Greatly improved code organization, adding complete documentation to the closest worlds implementation. Wrote up Roger's test case for evaluating this implementation. Also found a Euclidean distance transformation which yields values in the range (0, 1] using exponentiation: 1/e^distance. Since I noticed that the similarity scores can get very small (< 0.0001), I added a power-scaling mechanism which brings the lowest nonzero similarity score up to 0.01.
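A minimal sketch of the transformation and the power-scaling step. The exponent here is solved so that the smallest nonzero score lands exactly at 0.01; that derivation is my assumption about how the mechanism works, and the function name is my own:

```python
import numpy as np

def scale_scores(distances):
    # Map Euclidean distances into (0, 1] via 1/e^distance
    scores = np.exp(-np.asarray(distances, dtype=float))
    # Power-scale so the smallest nonzero score becomes 0.01:
    # solve s_min**p == 0.01  =>  p = ln(0.01) / ln(s_min)
    nonzero = scores[scores > 0]
    s_min = nonzero.min()
    if s_min < 0.01:
        p = np.log(0.01) / np.log(s_min)
        scores = scores ** p
    return scores
```

Since a zero distance maps to exp(0) = 1 and power-scaling fixes 1 in place, a perfect match keeps a similarity of 1 after rescaling.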

I began looking for real-world datasets, and found the sources listed below. Testing the counterfactual module with real-world data will be tricky, due to the fundamental problem of causal inference (it's impossible to observe the causal effect on a single unit -- a given person either took the treatment or didn't).

Thursday, June 24, 2021


Week 7, Day 4

Met with the team this morning and discussed progress. Roger pointed out that the Prob class has an isDiscrete method, which I am now using to split the dataset into continuous and discrete vars in the closest worlds implementation. Now the code steps are as follows:
  1. Convert the dataset dictionary to a pandas DataFrame.
  2. Conditionalize on the counterfactual and filter columns on the conditional clause.
  3. Calculate k (how many worlds to find; hard-coded at 1% of dataset length).
  4. Split the dataset into two dataframes for continuous and discrete columns.
  5. Use Jaccard coefficient to compare discrete dataset against discrete RVs of the given unit observation.
  6. Use Euclidean distance* to compare continuous dataset against continuous RVs of the given unit observation.
  7. Use the above values to produce a weighted-average similarity score.
  8. Sort the records by similarity, and restrict the dataset to the k closest.
  9. Return a call to Prob.predictDist, targeting the RV and using only the k closest records as data.
* See Week 7, Day 3 -- I realized these limitations also apply to the continuous variables, as cosine similarity breaks for collinear vectors: [1 1] is scored as identical to [9 9]. Thus, I changed the similarity metric to Euclidean distance; however, this doesn't produce scores in the range [0, 1]; TBD tomorrow.
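The steps above (3 through 8) can be sketched roughly as follows. The function and variable names are my own, and weighting the average by column counts in step 7 is an assumption -- the actual weighting isn't specified here:

```python
import numpy as np
import pandas as pd

def closest_worlds(df, unit, discrete_cols, continuous_cols, k_frac=0.01):
    """Hypothetical sketch: score each record against `unit` and
    return the k most similar worlds."""
    k = max(1, int(len(df) * k_frac))                # step 3: k = 1% of dataset
    disc = df[discrete_cols].to_numpy(dtype=bool)    # step 4: split columns
    cont = df[continuous_cols].to_numpy(dtype=float)
    u_disc = np.asarray(unit[discrete_cols], dtype=bool)
    u_cont = np.asarray(unit[continuous_cols], dtype=float)

    # Step 5: Jaccard coefficient against the discrete RVs of the unit
    inter = (disc & u_disc).sum(axis=1)
    union = (disc | u_disc).sum(axis=1)
    jac = np.where(union > 0, inter / np.maximum(union, 1), 1.0)

    # Step 6: Euclidean distance against the continuous RVs, mapped to (0, 1]
    dist = np.linalg.norm(cont - u_cont, axis=1)
    euc = np.exp(-dist)

    # Step 7: weighted-average similarity (weights = column counts, my assumption)
    w_d, w_c = len(discrete_cols), len(continuous_cols)
    sim = (w_d * jac + w_c * euc) / (w_d + w_c)

    # Step 8: sort by similarity and restrict to the k closest records
    return df.loc[pd.Series(sim, index=df.index).nlargest(k).index]
```

Step 9 would then call Prob.predictDist on the returned subset.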

Wednesday, June 23, 2021


Week 7, Day 3

After testing, I've realized that cosine similarity is not the best metric for a mix of continuous and discrete variables. In particular, cosine similarity doesn't work for binary variables because it does not account for zero vectors, which are inevitable with conditionalization sets comprised only of binary variables. For example, if you have three binary variables you want to compare, you'll encounter the combination [0, 0, 0]
with a magnitude of zero, and since that magnitude appears in the denominator of the cosine similarity function, the result is undefined (division by zero). Moreover, cosine similarity is based on a normalized dot product; this means that the similarity between 0 and 0 is the same as the similarity between 0 and 1, when it should instead be the same as between 1 and 1. And conceptually, this representation doesn't capture the behavior of categorical variables: two mutually exclusive categories shouldn't be placed on the same axis; they should be orthogonal. Instead of treating 0 and 1 as numbers, they should be treated as vectors on their own axes.

After research, I found the solution of using matching coefficients for categorical RVs, particularly the Jaccard coefficient. This means we need a way of knowing which RVs are discrete, which I will bring up in tomorrow's team meeting.
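To illustrate the zero-vector problem and the matching-coefficient fix, here is a minimal sketch (standalone functions, not the actual Prob class API; treating two all-zero records as a perfect match is my assumption):

```python
import numpy as np

def cosine_sim(a, b):
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    # Undefined (nan) when either vector is all zeros, since denom == 0
    return np.dot(a, b) / denom

def jaccard(a, b):
    # Matching coefficient for binary RVs: |intersection| / |union|
    a, b = np.asarray(a, dtype=bool), np.asarray(b, dtype=bool)
    union = (a | b).sum()
    # Two all-zero records are identical, so score them as a perfect match
    return 1.0 if union == 0 else (a & b).sum() / union
```

For example, jaccard([0, 0, 0], [0, 0, 0]) scores identical all-zero records as 1.0, where cosine_sim returns nan on the same input.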

Tuesday, June 22, 2021


Week 7, Day 2

  • Finished closest worlds implementation, using cosine similarity to compare each record (row) in the dataset against the given unit observation. 
  • Decided to filter on the counterfactual early on, to ensure that the number of closest worlds found is not reduced at the end.
  • Updated the variable data types to be consistent with the existing codebase, and added PyDoc.

Monday, June 21, 2021


Week 7, Day 1

Read Pearl's paper about the "effect of treatment on the treated" (source); this may be a useful tool a bit later on, as it allows estimation of a counterfactual distribution under conditional intervention.

Started implementing the "closest worlds" approach, using DataFrames to easily represent and transform each observation in the dataset.

Met with Roger and discussed reordered to-do list:

  1. Implement Lewis' imaging approach (closest worlds).
  2. Implement test cases.
  3. Search for real-world datasets.
  4. Implement Pearl's deterministic/probabilistic approaches.
  5. Implement ETT.
  6. Implement mediation/additive intervention.
  7. Implement PNS.

Friday, June 18, 2021


Week 6, Day 5

Wrote pseudocode/documentation for deterministic and probabilistic counterfactual implementations. 

Did some more research into the "find closest world" mechanism, and found this very useful paper from Pearl, 1994. It explains the drawbacks of this approach as follows:

"To account for such uncertainties, (Lewis 1976) has generalized the notion of “closest world” using the device of “imaging”; namely, the closest worlds are assigned probability scores, and these scores are combined to compute the probability of the consequent. The drawback of the “closest world” approach is that it leaves the precise specification of the closeness measure almost unconstrained. More specifically, it does not tell us how to encode distances in a way that would (1) conform to our perception of causal influences and (2) lend itself to economical machine representation."

As such, it appears as though Pearl's definitions of deterministic and probabilistic counterfactuals are more modern approaches to Lewis' imaging technique. I'll have to look further into Lewis' literature to understand how the closest worlds are assigned probability scores. 

Thursday, June 17, 2021


Week 6, Day 4

Met with the Causality team to discuss implementation progress and blockers. Roger requested that I give a PPT presentation to the team next week to explain the theory grounding my implementation; after our meeting I began working on this. 

Also looked into the concept of a "controlled indirect effect", as there's no mention of a CIE throughout the referenced Pearl literature. Found that the total effect (TE) can be decomposed in various ways, including (1):

  1. controlled direct effect and eliminated effect
  2. natural direct effect and natural indirect effect
  3. 4-way decomposition: controlled direct effect, reference interaction, mediated interaction, and pure indirect effect

As such, the controlled indirect effect is "usually not estimated", as "its sum with the direct effect does not equal the total effect", and is "notably difficult to conceptualize" (2, 3). Source 1 proposes a means of estimating the CIE, defined as "the total effect of mediator on outcome"; this is likely beyond the scope of my implementation, but may be useful for further research. For now, I'll include a calculation for the so-called "eliminated effect": TE-CDE, as used by Source 3.
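The eliminated-effect calculation mentioned above is just the difference of the two estimates; a trivial sketch (function name and example values are hypothetical):

```python
def eliminated_effect(total_effect, controlled_direct_effect):
    # Per Source 3's decomposition, TE = CDE + eliminated effect,
    # so the eliminated effect is TE - CDE
    return total_effect - controlled_direct_effect
```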

Wednesday, June 16, 2021


Week 6, Day 3

I made great progress on my counterfactual implementation today! The counterfactual code skeleton is complete, with the purpose and inputs thoroughly documented for each of the algorithms.

I've extracted from the Pearl literature five main tasks for the counterfactual module. It should be able to:

  1. Compute deterministic and probabilistic unit-level queries given a structural equation model.
  2. Compute the effect of a treatment across a population given a cGraph.
  3. Perform additive intervention.
  4. Evaluate necessity and sufficiency probabilities.
  5. Generate a full mediation analysis report. This will allow the user to answer questions like "Can we identify the natural direct and indirect effects given this cGraph?" and "What fraction of the given effect is necessarily due to mediation?"
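A hypothetical skeleton of the module's public surface, mapping one method to each of the five tasks (all names here are my own sketch, not the actual codebase's):

```python
class Counterfactual:
    """Sketch of the five counterfactual tasks; method names are hypothetical."""

    def __init__(self, cgraph):
        self.cgraph = cgraph  # causal graph / structural equation model

    def unitQuery(self, unit, query, deterministic=True):
        """Task 1: deterministic or probabilistic unit-level query."""
        raise NotImplementedError

    def treatmentEffect(self, treatment, outcome):
        """Task 2: effect of a treatment across a population."""
        raise NotImplementedError

    def additiveIntervention(self, variable, delta):
        """Task 3: additive intervention."""
        raise NotImplementedError

    def necessitySufficiency(self, cause, effect):
        """Task 4: probabilities of necessity and sufficiency."""
        raise NotImplementedError

    def mediationReport(self, treatment, mediator, outcome):
        """Task 5: full mediation analysis report (NDE, NIE, etc.)."""
        raise NotImplementedError
```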

I've also identified a stretch goal of determining path-specific causal effects; this would be an implementation of another Pearl paper, linked here.

Tomorrow, I'll be working on replicating some of Pearl's synthetic real-world scenarios for testing the counterfactual module, and hopefully finishing the first task of evaluating deterministic and probabilistic queries.

Tuesday, June 15, 2021


Week 6, Day 2

I experimented some more today with restructuring the Causality platform to improve readability and usability. Ultimately, I've concluded that accomplishing a valuable redesign without sacrificing ease of use (particularly with regard to command-line run configurations) is both a hefty feat and one which should be put on the back burner until my counterfactual implementation is complete. However, I learned a lot today about best practices in Python project structuring (source 1, source 2) through designing a new platform hierarchy.

With this research and experimentation done, I then continued my counterfactual module preliminary structuring. Though Pearl's paper is clear in building the theoretical basis of counterfactuals, it has been rather difficult to extract from the literature the functionalities I want the module to support. This involves using the theory to answer the question "What counterfactual queries would a user want to ask?". I also completed my class design for the module and updated the intervention module class today, consistent with the compositional platform structure Roger and I agreed on.

Also, today I met with Roger Dev and Lorraine Chapman to discuss progress and next steps at the midway point of my internship. While it was an important task to update my timeline and goals for this project, it was also a really valuable learning opportunity -- Lorraine and Roger gave me some solid advice on navigating communication and work-style obstacles, which I'll carry with me into the rest of my career.

Monday, June 14, 2021


Week 6, Day 1

  • Met with Roger and took notes regarding redesign of compositional design structure and "find closest world" counterfactual mechanism.
  • Finalized compositional design updates.
  • Decided to use Pythonic naming conventions going forward, and to update the existing codebase if needed towards the end of internship.
  • Updated the distribution prediction mechanism to consider each variable in the given dataset, rather than only the single most dependent (TBD: parameter specifying k-most dependent?)
  • Updated intervention code to find all backdoor blocking sets.
  • Started working on counterfactual implementation.