Causality Research Internship @ LexisNexis
Welcome! This blog is dedicated to documenting Mara Hubelbank's software development and machine learning research internship with the HPCC Systems group at LexisNexis in Summer 2021.
Friday, June 25, 2021
07.05
Thursday, June 24, 2021
07.04
- Convert the dataset dictionary to a pandas DataFrame.
- Conditionalize on the counterfactual and filter columns on the conditional clause.
- Calculate k (how many worlds to find; hard-coded at 1% of dataset length).
- Split the dataset into two dataframes for continuous and discrete columns.
- Use the Jaccard coefficient to compare the discrete dataset against the discrete RVs of the given unit observation.
- Use Euclidean distance* to compare the continuous dataset against the continuous RVs of the given unit observation.
- Use the above values to produce a weighted-average similarity score.
- Sort the records by similarity, and restrict the dataset to the k closest.
- Return a call to Prob.predictDist, targeting the RV and using only the k closest records as data.
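The steps above can be sketched roughly as follows. This is a minimal illustration, not the internship code: the function name `k_closest_worlds`, its parameters, the row filter used for the conditional clause, the scaling of continuous columns by their standard deviation, the mapping of distance to similarity via 1 / (1 + d), and the 50/50 weighting are all assumptions. The final hand-off to Prob.predictDist is left out because its signature is not shown here.

```python
import numpy as np
import pandas as pd


def k_closest_worlds(data, unit, discrete_cols, continuous_cols,
                     condition=None, fraction=0.01, discrete_weight=0.5):
    """Return the k records most similar to `unit`, most similar first.

    Hypothetical helper; assumes both column lists are non-empty.
    """
    df = pd.DataFrame(data)

    # Keep only the records consistent with the counterfactual clause
    # (assumed here to be a simple column == value filter).
    for col, value in (condition or {}).items():
        df = df[df[col] == value]

    # k is a fixed share of the filtered dataset, with at least one record.
    k = max(1, int(len(df) * fraction))

    # Discrete part: Jaccard coefficient between the sets of (column, value)
    # pairs of each record and of the unit, i.e. |A & B| / |A | B|.
    m = len(discrete_cols)
    matches = sum((df[c] == unit[c]).astype(int) for c in discrete_cols)
    jaccard = matches / (2 * m - matches)

    # Continuous part: Euclidean distance on columns scaled by their standard
    # deviation, mapped into (0, 1] so that larger means more similar.
    cont = df[continuous_cols].astype(float)
    scale = cont.std(ddof=0).replace(0.0, 1.0)
    target = pd.Series({c: float(unit[c]) for c in continuous_cols})
    distance = np.sqrt((((cont - target) / scale) ** 2).sum(axis=1))
    cont_similarity = 1.0 / (1.0 + distance)

    # Weighted-average similarity, then keep the k closest records.
    score = discrete_weight * jaccard + (1.0 - discrete_weight) * cont_similarity
    return (df.assign(similarity=score)
              .sort_values("similarity", ascending=False)
              .head(k))


# Toy data: 200 records, so k = 2 without a condition.
data = {
    "sex": ["f", "f", "m", "m"] * 50,
    "age": list(range(200)),
    "treated": [0, 1] * 100,
}
unit = {"sex": "f", "age": 40}
closest = k_closest_worlds(data, unit, ["sex"], ["age"])
```

Filtering on the condition before computing k keeps the number of returned worlds from shrinking at the end, at the cost of k being derived from the smaller, filtered dataset.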
Wednesday, June 23, 2021
07.03
Tuesday, June 22, 2021
07.02
Week 7, Day 2
- Finished closest worlds implementation, using cosine similarity to compare each record (row) in the dataset against the given unit observation.
- Decided to filter on the counterfactual early on, to ensure that the number of closest worlds found is not reduced at the end.
- Updated the variable data types to be consistent with the existing codebase, and added PyDoc.
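As a rough illustration of the cosine-similarity comparison described above, the sketch below scores every row of a DataFrame against a unit observation. The function name `cosine_similarity_to_unit` and the assumption that all compared columns are numeric are hypothetical, not taken from the codebase.

```python
import numpy as np
import pandas as pd


def cosine_similarity_to_unit(df, unit):
    """Cosine similarity between each row of `df` and the `unit` dict.

    Hypothetical helper; only the columns named in `unit` are compared,
    and they are assumed to be numeric.
    """
    cols = list(unit)
    rows = df[cols].to_numpy(dtype=float)
    u = np.array([unit[c] for c in cols], dtype=float)

    # cos(row, u) = (row . u) / (|row| * |u|); rows with zero norm score 0.
    norms = np.linalg.norm(rows, axis=1) * np.linalg.norm(u)
    sims = np.divide(rows @ u, norms, out=np.zeros(len(rows)), where=norms > 0)
    return pd.Series(sims, index=df.index, name="cosine_similarity")


records = pd.DataFrame({"a": [1, 2, 0], "b": [0, 0, 3]})
similarities = cosine_similarity_to_unit(records, {"a": 1, "b": 0})
```

Note that cosine similarity compares direction only: in this toy data the rows (1, 0) and (2, 0) are both scored as identical to the unit, while (0, 3) is scored as entirely dissimilar.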
Monday, June 21, 2021
07.01
Week 7, Day 1
Read Pearl's paper about the "effect of treatment on the treated" (source); this may be a useful tool a bit later on, as it allows estimation of a counterfactual distribution under conditional intervention.
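For reference, the quantity can be written out as below. This is a sketch in generic counterfactual notation, assuming a binary treatment X and a covariate set Z that satisfies the backdoor criterion; the symbols are assumptions and may differ from those in the paper.

```latex
\begin{align*}
ETT &= E[Y_1 - Y_0 \mid X = 1] \\
E[Y_1 \mid X = 1] &= E[Y \mid X = 1] \quad \text{(consistency)} \\
E[Y_0 \mid X = 1] &= \sum_z E[Y \mid X = 0, Z = z] \, P(Z = z \mid X = 1)
\end{align*}
```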
Started implementing the "closest worlds" approach, using DataFrames to easily represent and transform each observation in the dataset.
Met with Roger and discussed the reordered to-do list:
- Implement Lewis' imaging approach (closest worlds).
- Implement test cases.
- Search for real-world datasets.
- Implement Pearl's deterministic/probabilistic approaches.
- Implement ETT.
- Implement mediation/additive intervention.
- Implement PNS.
Friday, June 18, 2021
06.05
Week 6, Day 5
Wrote pseudocode/documentation for deterministic and probabilistic counterfactual implementations.
Did some more research into the "find closest world" mechanism, and found this very useful paper from Pearl, 1994. It explains the drawbacks of this approach as follows:
"To account for such uncertainties, (Lewis 1976) has generalized the notion of “closest world” using the device of “imaging”; namely, the closest worlds are assigned probability scores, and these scores are combined to compute the probability of the consequent. The drawback of the “closest world” approach is that it leaves the precise specification of the closeness measure almost unconstrained. More specifically, it does not tell us how to encode distances in a way that would (1) conform to our perception of causal influences and (2) lend itself to economical machine representation."
As such, it appears as though Pearl's definitions of deterministic and probabilistic counterfactuals are more modern approaches to Lewis' imaging technique. I'll have to look further into Lewis' literature to understand how the closest worlds are assigned probability scores.
Thursday, June 17, 2021
06.04
Week 6, Day 4
Met with the Causality team to discuss implementation progress and blockers. Roger requested that I give a PowerPoint presentation to the team next week to explain the theory grounding my implementation; after our meeting I began working on this.
Also looked into the concept of a "controlled indirect effect", as there's no mention of a CIE anywhere in the referenced Pearl literature. Found that the total effect (TE) can be decomposed in several ways, including (1):
- controlled direct effect and eliminated effect
- natural direct effect and natural indirect effect
- 4-way decomposition: controlled direct effect, reference interaction, mediated interaction, and pure indirect effect
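These decompositions can be sketched on the difference scale as below. The notation is an assumption rather than taken from source (1): binary treatment with levels 0 and 1, mediator M, and Y_{x,m} denoting the outcome under treatment x with the mediator set to m. The components of the 4-way decomposition are only named here, not defined.

```latex
\begin{align*}
TE &= E[Y_{1}] - E[Y_{0}] \\
CDE(m) &= E[Y_{1,m}] - E[Y_{0,m}] \\
NDE &= E[Y_{1,M_0}] - E[Y_{0,M_0}] \\
NIE &= E[Y_{1,M_1}] - E[Y_{1,M_0}] \\
TE &= CDE(m) + \underbrace{\bigl(TE - CDE(m)\bigr)}_{\text{eliminated effect}} \\
TE &= NDE + NIE \\
TE &= CDE(m) + INT_{\mathrm{ref}}(m) + INT_{\mathrm{med}} + PIE
\end{align*}
```

The second-to-last line follows by telescoping, since E[Y_1] = E[Y_{1,M_1}] and E[Y_0] = E[Y_{0,M_0}].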
Wednesday, June 16, 2021
06.03
Week 6, Day 3
I made great progress on my counterfactual implementation today! The counterfactual code skeleton is complete, with thorough documentation of the purpose and inputs of each algorithm.
I've extracted from the Pearl literature five main tasks for the counterfactual module. It should be able to:
- Compute deterministic and probabilistic unit-level queries given a structural equation model.
- Compute the effect of a treatment across a population given a cGraph.
- Perform additive intervention.
- Evaluate necessity and sufficiency probabilities.
- Generate a full mediation analysis report. This will allow the user to answer questions like "Can we identify the natural direct and indirect effects given this cGraph?" and "What fraction of the given effect is necessarily due to mediation?"
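To make the first task concrete, here is a minimal sketch of a deterministic unit-level query, using the standard three steps of abduction, action, and prediction on a small linear structural equation model. The model, its coefficients, and the function name `deterministic_counterfactual` are illustrative assumptions, not part of the module.

```python
# Hypothetical linear SEM (coefficients are illustrative):
#   X = U_X
#   H = A * X + U_H
#   Y = B * X + C * H + U_Y
A, B, C = 0.5, 0.7, 0.4


def deterministic_counterfactual(observed, intervention):
    """Answer "what would the variables have been, had `intervention` held?"

    `observed` maps X, H, Y to the unit's observed values;
    `intervention` maps a subset of variables to their forced values.
    """
    # 1. Abduction: recover the unit's exogenous terms from the observation.
    u_x = observed["X"]
    u_h = observed["H"] - A * observed["X"]
    u_y = observed["Y"] - B * observed["X"] - C * observed["H"]

    # 2. Action: replace the equation of each intervened variable by a constant.
    # 3. Prediction: re-evaluate the remaining equations with the same U terms.
    x = intervention.get("X", u_x)
    h = intervention.get("H", A * x + u_h)
    y = intervention.get("Y", B * x + C * h + u_y)
    return {"X": x, "H": h, "Y": y}


# A unit observed at X = 0.5, H = 1, Y = 1.5; what would Y be had H been 2?
result = deterministic_counterfactual({"X": 0.5, "H": 1.0, "Y": 1.5}, {"H": 2.0})
```

A probabilistic query would follow the same steps, but with a posterior distribution over the exogenous terms in place of the single recovered values.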
I've also identified a stretch goal of determining path-specific causal effects; this would be an implementation of another Pearl paper, linked here.
Tomorrow, I'll be working on replicating some of Pearl's synthetic, real-world-style scenarios for testing the counterfactual module, and hopefully finishing the first task of evaluating deterministic and probabilistic queries.
Tuesday, June 15, 2021
06.02
Week 6, Day 2
I experimented some more today with restructuring the Causality platform to improve readability and usability. Ultimately, I've concluded that accomplishing a valuable redesign without sacrificing ease of use (particularly with regard to command-line run configurations) is both a hefty feat and one which should be put on the back burner until my counterfactual implementation is complete. However, I learned a lot today about best practices in Python project structuring (source 1, source 2) through designing a new platform hierarchy.
With this research and experimentation done, I continued the preliminary structuring of my counterfactual module. Though Pearl's paper is clear in building the theoretical basis of counterfactuals, it has been rather difficult to extract from the literature the functionalities I want the module to support. This involves using the theory to answer the question "What counterfactual queries would a user want to ask?". I also completed my class design for the module and updated the intervention module class today, consistent with the compositional platform structure Roger and I agreed on.
Also, today I met with Roger Dev and Lorraine Chapman to discuss progress and next steps at the midway point of my internship. Updating my timeline and goals for this project was an important task, and the meeting was also a really valuable learning opportunity: Lorraine and Roger gave me some solid advice on navigating communication and work-style obstacles, which I'll carry with me into the rest of my career.
Monday, June 14, 2021
06.01
Week 6, Day 1
- Met with Roger and took notes on the redesign of the compositional structure and the "find closest world" counterfactual mechanism.
- Finalized compositional design updates.
- Decided to use Pythonic naming conventions going forward, and to update the existing codebase if needed towards the end of internship.
- Updated the distribution prediction mechanism to consider each variable in the given dataset, rather than only the single most dependent (TBD: parameter specifying k-most dependent?)
- Updated intervention code to find all backdoor blocking sets.
- Started working on counterfactual implementation.