Causality Research Internship @ LexisNexis

Friday, June 11, 2021

05.05

Week 5, Day 5

Finished migrating the modules to a composition-based design.
Wrote up a list of questions about code design, extensibility, naming conventions, etc. for my meeting with Roger.
Started experimenting with module-setting to avoid hard-coding file paths under the new directory system.
Finished reading through and running the probability, cGraph, and interventional tests to understand the existing system.
Continued the Brady Neal lecture series.

Thursday, June 10, 2021

05.04

Week 5, Day 4

Forked the Causality repo; was previously only working with a local clone.
Read through the NetworkX library documentation; I now understand the cGraph generation code.
Re-read Pearl's Chapter 4 in Causal Inference in Statistics; wrote up a design plan for the CDE/NDE/NIE code.
Continued watching Brady Neal's Causal Inference lecture series.
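
For reference, these are the three mediation quantities as I understand them from Pearl's Chapter 4, written in counterfactual notation. The symbols are my own choice for this note: T is a binary treatment, M is the mediator, Y is the outcome, and Y_{t, M_{t'}} is the value Y would take with T set to t and M set to whatever it would have been under T = t'.

```latex
\begin{aligned}
\mathrm{CDE}(m) &= E[\,Y \mid do(T=1, M=m)\,] - E[\,Y \mid do(T=0, M=m)\,] \\
\mathrm{NDE}    &= E[\,Y_{1, M_0} - Y_{0, M_0}\,] \\
\mathrm{NIE}    &= E[\,Y_{0, M_1} - Y_{0, M_0}\,]
\end{aligned}
```

The CDE holds the mediator fixed by intervention, so it is an interventional (do-calculus) quantity. The NDE and NIE let the mediator take its natural value under a reference treatment, which makes them counterfactual quantities.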

Wednesday, June 9, 2021

05.03

Week 5, Day 3

Started restructuring the codebase to include composition for each of the four Causality levels.
Abstracted the predictDist code for modeling counterfactuals at the unit level; the abstracted method returns an array of subspaces from which predictDist extracts its output.
Experimented with using subspaces as a way to represent "worlds" of data.
Started watching Brady Neal's Causal Inference lecture series as a refresher on the theory, and to start gathering ideas for the poster presentation I'll be making in the second half of the internship.

Tuesday, June 8, 2021

I met with Roger Dev for a couple of hours today, and we dove into the code for the causality toolkit. It was really useful overall; I feel much more comfortable with the codebase, and I'm well-equipped to start my implementation. My first step will be to restructure the causal graph src to include encapsulation for each of the causality levels:

Probability Space
Causal Graph
Intervention
Counterfactuals

Here are a few more next steps, summarized from the notes taken during my meeting with Roger:

cGraph intervention only finds one backdoor blocking set (change to a loop).
Check if CDE code actually determines the CDE or if it's an NDE algorithm (CDE should have do-calculus).
Look into generation of numpy.random distributions.
interventionTest effectively calculates the ACE -- restructure to remove duplication.
Come up with a synthetic example for testing (reference interventionTest.py).
Major question: How to model data for an individual (unit-level)? Essentially, we want to create "worlds" of synthetic data (one for each observation). Then, the task is to find the closest "world" to this one, to get around the fundamental problem of causal inference.
Create another function like predictDist (named something like "findClosestWorld"). Instead of returning the distribution of one variable, it should return the whole subspace, averaged together. This function should use a new abstract function that predictDist also calls.
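
Here is a minimal sketch of how that last item could be structured. It assumes the data is a dict of equal-length NumPy arrays and that a "world" is simply the set of rows lying within a tolerance of the given values. The names filter_subspace, predict_dist, and find_closest_world, along with the tolerance parameter, are hypothetical; this is not the toolkit's actual predictDist or its API.

```python
import numpy as np


def filter_subspace(data, conditions, tolerance=0.5):
    """Shared helper (hypothetical): keep the rows where every conditioned
    variable lies within `tolerance` of its requested value."""
    n_rows = len(next(iter(data.values())))
    mask = np.ones(n_rows, dtype=bool)
    for name, value in conditions.items():
        mask &= np.abs(data[name] - value) <= tolerance
    return {name: column[mask] for name, column in data.items()}


def predict_dist(data, target, conditions, tolerance=0.5):
    """Return the samples of one variable inside the filtered subspace."""
    return filter_subspace(data, conditions, tolerance)[target]


def find_closest_world(data, conditions, tolerance=0.5):
    """Return the whole filtered subspace, averaged into one value per variable."""
    subspace = filter_subspace(data, conditions, tolerance)
    if len(next(iter(subspace.values()))) == 0:
        raise ValueError("no observations fall within the tolerance")
    return {name: float(column.mean()) for name, column in subspace.items()}


if __name__ == "__main__":
    # Tiny synthetic data set, made up purely for illustration.
    data = {
        "X": np.array([0.0, 1.0, 1.2, 3.0]),
        "Y": np.array([0.0, 2.0, 2.4, 6.0]),
    }
    print(predict_dist(data, "Y", {"X": 1.0}))
    print(find_closest_world(data, {"X": 1.0}))
```

Both public functions go through filter_subspace, so any change to how a subspace is selected (for example, swapping the fixed tolerance for a nearest-neighbor rule) only has to be made in one place.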

Side note: I turned 20 today! 🎉

Monday, June 7, 2021

05.01

Week 5, Day 1

Last week, I made a copy of the Causality GitHub repository and began reading through the documentation and source code. In order to successfully implement the counterfactual tool, I'll need a thorough understanding of the existing codebase -- especially the scripts for the data generation, causal graph, probability space, and interventional calculus components.

I'm currently struggling with getting started on my implementation phase, and I think it's because understanding the existing src has proven more difficult than expected. I've scheduled a meeting with Roger for tomorrow, and I'm hoping he can walk through the code with me to help foster my intuition on implementing causality algorithms.

First Post: Midterm!

I started with the HPCC Systems team at LexisNexis on May 10, 2021; as today is Monday, June 7, this post marks the beginning of my fifth week. Throughout this summer, I've been working with a team made up of Roger Dev and two other LN interns on expanding and enhancing the HPCC Systems Causality toolkit, which provides implementations of core algorithms for modeling causal relationships within data.

I'm starting this blog a little late into my internship, but it happens to be great timing, as this week marks the start of my implementation phase. I dedicated the first month of my internship to preliminary research, as this area of statistics -- causality -- is a field that is both very new and rather complex. Though I'd initially allotted two weeks for this phase, understanding the grounding theory for this project ended up being a much more challenging task than I'd anticipated.

Starting today, I'll be uploading daily posts which summarize my progress, challenges, and breakthroughs as I complete the second half of my summer internship; I think that documenting my learning journey for this remaining block will be a very useful exercise. Next step: understanding the causality toolkit src!