Friday, June 25, 2021

07.05

Week 7, Day 5

Greatly improved code organization, adding complete documentation to the closest worlds implementation. Wrote up Roger's test case for evaluating this implementation. Also, found a Euclidean distance transformation which yields values in the range (0, 1] using exponentiation: 1/e^distance. Since I noticed that the similarity scores get very small (< 0.0001), I added a power-scaling mechanism which brings the lowest nonzero similarity score up to 0.01.
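
A minimal sketch of those two transformations, assuming NumPy; the function names and the way the exponent is solved for are my own, not the toolkit's code:

    import numpy as np

    def expSimilarity(distance):
        """Map a Euclidean distance in [0, inf) to a similarity in (0, 1]."""
        return np.exp(-distance)  # equivalent to 1 / e^distance

    def powerScale(scores, floor=0.01):
        """Raise all scores to the power p that maps the smallest nonzero
        score to `floor`, spreading out very small similarities."""
        scores = np.asarray(scores, dtype=float)
        smallest = scores[scores > 0].min()
        p = np.log(floor) / np.log(smallest)  # solves smallest**p == floor
        return scores ** p

Since the scores live in (0, 1] and p ends up in (0, 1) whenever the smallest score is below 0.01, the power transform is monotonic and keeps everything in (0, 1]; it only stretches out the bottom of the range.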

I began looking for real-world datasets, and found the sources listed below. Testing the counterfactual module with real-world data will be tricky, due to the fundamental problem of causal inference (it's impossible to observe the causal effect on a single unit -- a given person either took the treatment or didn't).

Thursday, June 24, 2021

07.04

Week 7, Day 4

Met with the team this morning and discussed progress. Roger pointed out that the Prob class has an isDiscrete method, which I am now using to split the dataset into continuous and discrete variables in the closest worlds implementation. The code steps are now as follows (a rough sketch of the scoring steps follows the list):
  1. Convert the dataset dictionary to a pandas DataFrame.
  2. Conditionalize on the counterfactual and filter columns on the conditional clause.
  3. Calculate k (how many worlds to find; hard-coded at 1% of dataset length).
  4. Split the dataset into two dataframes for continuous and discrete columns.
  5. Use Jaccard coefficient to compare discrete dataset against discrete RVs of the given unit observation.
  6. Use Euclidean distance* to compare continuous dataset against continuous RVs of the given unit observation.
  7. Use the above values to produce a weighted-average similarity score.
  8. Sort the records by similarity, and restrict the dataset to the k closest.
  9. Return a call to Prob.predictDist, targeting the RV and using only the k closest records as data.
* See Week 7, Day 3 -- I realized these limitations also apply to the continuous variables, since cosine similarity breaks for collinear vectors: [1, 1] and [9, 9] score as identical. Thus, I changed the similarity metric to Euclidean distance; however, this doesn't produce scores in the range [0, 1] -- TBD tomorrow.
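
Here's that rough sketch of steps 5-7, assuming pandas/NumPy; the helper name, the count-based weights, and the use of a simple matching coefficient as a stand-in for the Jaccard metric are my own choices, not the toolkit's actual code:

    import numpy as np
    import pandas as pd

    def similarityScores(discDf, contDf, unitDisc, unitCont):
        """Score every record in the dataset against one unit observation.

        discDf/contDf: discrete and continuous columns of the dataset.
        unitDisc/unitCont: the same columns for the given unit (Series).
        """
        # Step 5: matching coefficient on discrete columns (fraction of
        # discrete attributes on which the record agrees with the unit).
        discSim = (discDf == unitDisc).mean(axis=1)

        # Step 6: Euclidean distance on continuous columns, mapped to (0, 1].
        dist = np.sqrt(((contDf - unitCont) ** 2).sum(axis=1))
        contSim = np.exp(-dist)

        # Step 7: weighted average, weighting each metric by the number
        # of variables it covers.
        nDisc, nCont = discDf.shape[1], contDf.shape[1]
        return (nDisc * discSim + nCont * contSim) / (nDisc + nCont)

Step 8 then reduces to something like scores.nlargest(k).index to pick the k closest records before handing them to Prob.predictDist.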

Wednesday, June 23, 2021

07.03

Week 7, Day 3

After testing, I've realized that cosine similarity is not the best metric for a mix of continuous and discrete variables. In particular, cosine similarity doesn't work for binary variables because it can't handle zero vectors, which are inevitable when the conditionalization set is comprised only of binary variables. For example, with three binary variables to compare, you'll eventually hit the combination [0, 0, 0], which has a magnitude of zero; since the magnitude sits in the denominator of the cosine similarity formula, you get a division-by-zero error at runtime every time. Moreover, cosine similarity is based on a normalized dot product, so the similarity between 0 and 0 comes out the same as the similarity between 0 and 1, when agreement on 0 should score like agreement on 1. And conceptually, this representation doesn't capture the behavior of categorical variables: two mutually exclusive categories shouldn't be placed on the same axis; they should be orthogonal. Instead of treating 0 and 1 as numbers, they should be treated as vectors on their own axes.

After some research, I found the solution of using matching coefficients for categorical RVs, particularly the Jaccard coefficient metric. This means we need a way of knowing which RVs are discrete, which I will bring up in tomorrow's team meeting.
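
A quick illustration of the failure and the fix, using hand-rolled NumPy versions of both metrics (the zero-vs-zero convention in the Jaccard helper is my own choice):

    import numpy as np

    def cosineSim(u, v):
        """Plain cosine similarity -- undefined when either vector is zero."""
        return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

    def jaccardSim(u, v):
        """Jaccard coefficient for binary vectors: |intersection| / |union|
        of the positions that are 'on' in either vector."""
        u, v = np.asarray(u, bool), np.asarray(v, bool)
        union = (u | v).sum()
        return (u & v).sum() / union if union else 1.0  # all-zero vs all-zero

    a, b = np.array([0, 0, 0]), np.array([0, 1, 0])
    print(cosineSim(a, b))   # nan (0/0) -- the zero-vector problem
    print(jaccardSim(a, b))  # 0.0 -- disagreement scored sensibly
    print(jaccardSim(b, b))  # 1.0 -- perfect agreement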

Tuesday, June 22, 2021

07.02

Week 7, Day 2

  • Finished closest worlds implementation, using cosine similarity to compare each record (row) in the dataset against the given unit observation. 
  • Decided to filter on the counterfactual early on, to ensure that the number of closest worlds found is not reduced at the end.
  • Updated the variable data types to be consistent with the existing codebase, and added PyDoc.

Monday, June 21, 2021

07.01

Week 7, Day 1

Read Pearl's paper about the "effect of treatment on the treated" (source); this may be a useful tool a bit later on, as it allows estimation of a counterfactual distribution under conditional intervention.

Started implementing the "closest worlds" approach, using DataFrames to easily represent and transform each observation in the dataset.

Met with Roger and discussed reordered to-do list:

  1. Implement Lewis' imaging approach (closest worlds).
  2. Implement test cases.
  3. Search for real-world datasets.
  4. Implement Pearl's deterministic/probabilistic approaches.
  5. Implement ETT.
  6. Implement mediation/additive intervention.
  7. Implement PNS.

Friday, June 18, 2021

06.05

Week 6, Day 5

Wrote pseudocode/documentation for deterministic and probabilistic counterfactual implementations. 

Did some more research into the "find closest world" mechanism, and found this very useful paper from Pearl, 1994. It explains the drawbacks of this approach as follows:

"To account for such uncertainties, (Lewis 1976) has generalized the notion of “closest world” using the device of “imaging”; namely, the closest worlds are assigned probability scores, and these scores are combined to compute the probability of the consequent. The drawback of the “closest world” approach is that it leaves the precise specification of the closeness measure almost unconstrained. More specifically, it does not tell us how to encode distances in a way that would (1) conform to our perception of causal influences and (2) lend itself to economical machine representation."

As such, it appears as though Pearl's definitions of deterministic and probabilistic counterfactuals are more modern approaches to Lewis' imaging technique. I'll have to look further into Lewis' literature to understand how the closest worlds are assigned probability scores. 

Thursday, June 17, 2021

06.04

Week 6, Day 4

Met with the Causality team to discuss implementation progress and blockers. Roger requested that I give a PPT presentation to the team next week to explain the theory grounding my implementation; after our meeting I began working on this. 

Also looked into the concept of a "controlled indirect effect" (CIE), as there's no mention of a CIE throughout the referenced Pearl literature. Found that the total effect (TE) can be decomposed in various ways, including (1):

  1. controlled direct effect and eliminated effect
  2. natural direct effect and natural indirect effect
  3. 4-way decomposition: controlled direct effect, reference interaction, mediated interaction, and pure indirect effect 

As such, the controlled indirect effect is "usually not estimated", as "its sum with the direct effect does not equal the total effect", and it is "notably difficult to conceptualize" (2, 3). Source 1 proposes a means of estimating the CIE, defined as "the total effect of mediator on outcome"; this is likely beyond the scope of my implementation, but may be useful for further research. For now, I'll include a calculation for the so-called "eliminated effect", TE - CDE, as used by Source 3 (spelled out just below).
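
For the record, that quantity is just decomposition 1 rearranged (my notation, with EE for the eliminated effect; not a formula quoted from the sources):

    \mathrm{TE} = \mathrm{CDE} + \mathrm{EE}
    \quad\Longleftrightarrow\quad
    \mathrm{EE} = \mathrm{TE} - \mathrm{CDE}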

Wednesday, June 16, 2021

06.03

Week 6, Day 3

I made great progress on my counterfactual implementation today! The counterfactual code skeleton is complete, with the purpose and inputs of each algorithm thoroughly documented.

I've extracted from the Pearl literature five main tasks for the counterfactual module (a skeleton sketch follows the list). It should be able to:

  1. Compute deterministic and probabilistic unit-level queries given a structural equation model.
  2. Compute the effect of a treatment across a population given a cGraph.
  3. Perform additive intervention.
  4. Evaluate necessity and sufficiency probabilities.
  5. Generate a full mediation analysis report. This will allow the user to answer questions like  "Can we identify the natural direct and indirect effects given this cGraph?" and "What fraction of the given effect is necessarily due to mediation?"
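
That skeleton, hedged: every class and method name below is a placeholder of mine for what the five tasks might look like as an interface, not the module's actual API:

    class Counterfactual:
        """Placeholder surface for the five counterfactual tasks."""

        def __init__(self, cGraph, dataset):
            self.g = cGraph      # causal graph (levels 1-3 live here)
            self.data = dataset  # observational records

        def unitQuery(self, sem, unit, query):
            """Task 1: deterministic/probabilistic unit-level query,
            given a structural equation model."""
            raise NotImplementedError

        def treatmentEffect(self, treatment, outcome):
            """Task 2: population-level effect of a treatment."""
            raise NotImplementedError

        def additiveIntervention(self, var, delta):
            """Task 3: shift var by delta instead of fixing its value."""
            raise NotImplementedError

        def necessitySufficiency(self, cause, effect):
            """Task 4: probabilities of necessity and sufficiency."""
            raise NotImplementedError

        def mediationReport(self, treatment, outcome, mediator):
            """Task 5: CDE/NDE/NIE plus the fraction of the effect
            due to mediation."""
            raise NotImplementedError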

I've also identified a stretch goal of determining path-specific causal effects; this would be an implementation of another Pearl paper, linked here.

Tomorrow, I'll be working on replicating some of the synthetic versions of real-world scenarios from Pearl's papers for testing the counterfactual module, and hopefully finishing the first task of evaluating deterministic and probabilistic queries.

Tuesday, June 15, 2021

06.02

Week 6, Day 2

I experimented some more today with restructuring the Causality platform to improve readability and usability. Ultimately, I've concluded that accomplishing a valuable redesign without sacrificing ease of use (particularly with regard to command-line run configurations) is both a hefty feat and one which should be put on the back burner until my counterfactual implementation is complete. However, I learned a lot today about best practices in Python project structuring (source 1, source 2) through designing a new platform hierarchy.

With this research and experimentation done, I then continued my counterfactual module preliminary structuring. Though Pearl's paper is clear in building the theoretical basis of counterfactuals, it has been rather difficult to extract from the literature the functionalities I want the module to support. This involves using the theory to answer the question "What counterfactual queries would a user want to ask?". I also completed my class design for the module and updated the intervention module class today, consistent with the compositional platform structure Roger and I agreed on.

Also, today I met with Roger Dev and Lorraine Chapman to discuss progress and next steps at the midway point of my internship. While it was an important task to update my timeline and goals for this project, it was also a really valuable learning opportunity -- Lorraine and Roger gave me some solid advice on navigating communication and work-style obstacles, which I'll carry with me into the rest of my career.

Monday, June 14, 2021

06.01

Week 6, Day 1

  • Met with Roger and took notes regarding redesign of compositional design structure and "find closest world" counterfactual mechanism.
  • Finalized compositional design updates.
  • Decided to use Pythonic naming conventions going forward, and to update the existing codebase if needed towards the end of internship.
  • Updated the distribution prediction mechanism to consider each variable in the given dataset, rather than only the single most dependent (TBD: parameter specifying k-most dependent?).
  • Updated intervention code to find all backdoor blocking sets (a brute-force sketch follows this list).
  • Started working on counterfactual implementation. 
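
As a record of what "all backdoor blocking sets" could mean operationally, here's a brute-force illustration using networkx's d-separation test; this is my sketch of the backdoor criterion, not the toolkit's intervention code:

    from itertools import combinations
    import networkx as nx

    def allBackdoorSets(g, treatment, outcome):
        """Enumerate every set Z of non-descendants of `treatment` that
        d-separates treatment from outcome in the graph with treatment's
        outgoing edges removed (i.e., Z blocks every backdoor path)."""
        gBackdoor = g.copy()
        gBackdoor.remove_edges_from(list(g.out_edges(treatment)))

        candidates = set(g.nodes) - {treatment, outcome}
        candidates -= nx.descendants(g, treatment)  # descendants are ruled out

        blockingSets = []
        for r in range(len(candidates) + 1):
            for z in combinations(sorted(candidates), r):
                # nx.d_separated is named nx.is_d_separator in newer releases
                if nx.d_separated(gBackdoor, {treatment}, {outcome}, set(z)):
                    blockingSets.append(set(z))
        return blockingSets

The subset enumeration is exponential in the number of candidate nodes, so this is only a reference implementation for small graphs.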

Friday, June 11, 2021

05.05

Week 5, Day 5

  • Finished migrating the modules to a composition-based design.
  • Wrote up a list of questions regarding code design, extensibility, naming conventions, etc. for meeting with Roger.
  • Started to experiment with module-setting in order to prevent the need for hard-coding file paths with the new directory system.
  • Finished reading through and running probability, cGraph, and interventional tests to understand existing system.
  • Continued the Brady Neal lecture series.

Thursday, June 10, 2021

05.04

Week 5, Day 4

  • Forked the Causality repo; was previously only working with a local clone.
  • Read through the NetworkX library documentation, now understanding the cGraph generation code.
  • Re-read Pearl's Chapter 4 in Causal Inference in Statistics; wrote up a design plan for the CDE/NDE/NIE code.
  • Continued watching Brady Neal's Causal Inference lecture series.

Wednesday, June 9, 2021

05.03

Week 5, Day 3

  • Started restructuring the codebase to include composition for each of the four Causality levels.
  • Abstracted the predictDist code for modeling counterfactuals at the unit level; the abstracted method returns an array of subspaces from which predictDist extracts its output.
  • Experimented with using subspaces as a way to represent "worlds" of data.
  • Started watching Brady Neal's Causal Inference lecture series as a refresher on the theory, and to start getting ideas for the poster presentation to be made in the second half.

Tuesday, June 8, 2021

05.02

Week 5, Day 2

I met with Roger Dev for a couple hours today, and dove into the code for the causality toolkit. It was really useful overall; I feel much more comfortable with the codebase, and I'm well-equipped to start my implementation. My first steps will be to restructure the causal graph src to include encapsulation for each of the causality levels:

  1. Probability Space
  2. Causal Graph
  3. Intervention
  4. Counterfactuals

Here are a few more next steps, summarized from the notes taken during my meeting with Roger:

  • cGraph intervention only finds one backdoor blocking set (change to a loop).
  • Check if the CDE code actually determines the CDE or if it's an NDE algorithm (CDE should involve do-calculus).
  • Look into generation of numpy.random distributions.
  • interventionTest effectively calculates the ACE -- restructure to remove duplication.
  • Come up with a synthetic example for testing (reference interventionTest.py).
  • Major question: How to model data for an individual (unit-level)? Essentially, we want to create "worlds" of synthetic data (for each observation). Then, the task is to find the closest "world" to this one, to get around the fundamental problem of causal inference.
  • Create another function like predictDist (called "findClosestWorld", etc.) -- instead of returning the distribution of one variable, return the whole subspace and average them together. This function should use a new abstract function (also called by predictDist). A sketch of this idea follows below.

Side note: I turned 20 today! 🎉
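
That sketch, for the record; getSubspaces, the record-as-dict representation, and the max-similarity selection are all my interpretation of the meeting notes, not existing code:

    def getSubspaces(data, conditions):
        """Proposed shared abstraction: filter the dataset down to the
        records ("worlds") consistent with the given (var, value) pairs.
        Both predictDist and a new findClosestWorld would call this."""
        return [rec for rec in data
                if all(rec[var] == val for var, val in conditions)]

    def predictDistSketch(data, conditions, target):
        """Existing predictDist behavior, re-expressed on top of the
        abstraction: the distribution of one variable over the subspace."""
        return [rec[target] for rec in getSubspaces(data, conditions)]

    def findClosestWorld(data, conditions, unit, similarity):
        """New idea: return the whole record closest to the given unit
        observation, rather than one variable's distribution."""
        subspace = getSubspaces(data, conditions)
        return max(subspace, key=lambda rec: similarity(rec, unit))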

Monday, June 7, 2021

05.01

Week 5, Day 1

Last week, I made a copy of the Causality GitHub repository and began reading through the documentation and source code. In order to successfully implement the counterfactual tool, I'll need a thorough understanding of the existing codebase -- especially the scripts for the data generation, causal graph, probability space, and interventional calculus components.

I'm currently struggling with getting started on my implementation phase, and I think it's because understanding the existing src has proven more difficult than expected. I've scheduled a meeting with Roger for tomorrow, and I'm hoping he can walk through the code with me to help foster my intuition on implementing causality algorithms.

First Post: Midterm!

I started with the HPCC Systems team at LexisNexis on May 10, 2021; as today is Monday, June 7, this post marks the beginning of my fifth week. Throughout this summer, I've been working with a team comprised of Roger Dev and two other LN interns on expanding and enhancing the HPCC Systems Causality toolkit, which provides implementations of core algorithms for modeling causal relationships within data.

I'm starting this blog a little late into my internship, but it happens to be great timing, as this week marks the start of my implementation phase. I dedicated the first month of my internship to preliminary research, as this area of statistics -- causality -- is a field which is both very new and rather complex. Though I'd initially allotted two weeks for this phase, understanding the grounding theory for this project ended up being a much more challenging task than I'd anticipated.

Starting today, I'll be uploading daily posts summarizing my progress, challenges, and breakthroughs as I complete the second half of my summer internship; I think that documenting my learning journey for this remaining stretch will be a very useful exercise. Next step: understanding the causality toolkit src!