Beyond Correlation: Causal Inference
Denamo Markos

Jul 02, 2022 · 4 min read

Tags: ML · Causal Inference · Random Forest · Python

Background

A common frustration in industry, especially when it comes to extracting business insights from tabular data, is that the most exciting questions are often not answerable with observational data alone. These are questions like:

  • "What will happen if I halve the price of my product?"
  • "Which clients will pay their debts only if I call them?"

Judea Pearl and his research group have spent the last few decades developing a solid theoretical framework for answering such questions. Still, the first steps toward merging it with mainstream machine learning are only just being taken.

Info: A causal graph is a central object in the framework mentioned above, but it is often unknown, subject to personal knowledge and bias, or loosely connected to the available data.
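To make the term concrete: a causal graph is simply a directed acyclic graph (DAG) whose edges point from causes to effects. A minimal sketch follows; the edges are invented for illustration, not learned from the data.

```python
import networkx as nx

# A tiny illustrative causal graph: radius is assumed to drive both
# the measured area and the diagnosis (invented edges, for illustration).
g = nx.DiGraph()
g.add_edges_from([
    ('radius_mean', 'area_mean'),
    ('radius_mean', 'diagnosis'),
    ('texture_mean', 'diagnosis'),
])
print(nx.is_directed_acyclic_graph(g))  # a causal graph must be a DAG: True
```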

Objective

The main objectives of this task are to:

  • Perform a causal inference task using Pearl's framework
  • Infer the causal graph from observational data and then validate the graph
  • Merge machine learning with causal inference

Data

The data comes from Kaggle and concerns breast cancer diagnosis. The features are computed from a digitized image of a fine needle aspirate (FNA) of a breast mass.

Dataset Features

Feature Type          | Description                   | Examples
--------------------- | ----------------------------- | ----------------------------
Basic Info            | Identification and diagnosis  | ID number, Diagnosis (M/B)
Cell Nucleus Features | Ten real-valued measurements  | Radius, Texture, Perimeter
Statistical Measures  | Three types for each feature  | Mean, Standard Error, Worst

Key Measurements

Ten real-valued features are computed for each cell nucleus:

  1. Radius (mean of distances from the center to points on the perimeter)
  2. Texture (standard deviation of gray-scale values)
  3. Perimeter
  4. Area
  5. Smoothness (local variation in radius lengths)
  6. Compactness (perimeter² / area − 1.0)
  7. Concavity (severity of concave portions of the contour)
  8. Concave points (number of concave portions of the contour)
  9. Symmetry
  10. Fractal dimension ("coastline approximation" − 1)

Info: The dataset contains 357 benign (not cancer) and 212 malignant (cancer) cases, with no missing values.
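As a quick sanity check, the class balance can be confirmed directly from the CSV. A minimal sketch, assuming the Kaggle download sits at the local path `data.csv`:

```python
import pandas as pd

# 'data.csv' is an assumed local path to the Kaggle download
data = pd.read_csv('data.csv')

# expect 357 benign (B) and 212 malignant (M) cases
print(data['diagnosis'].value_counts())
```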

Techniques

Meta-learning Causal Structures

In 2019, Yoshua Bengio's team proposed an approach that uses deep learning to recognize simple cause-and-effect relationships (a toy sketch follows the list below). It relied on:

  • Real-world causal relationship datasets
  • Synthetic causal relationship datasets
  • Probability-based mapping
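The core intuition is that a model factorized in the causal direction adapts to interventions faster, because its mechanisms stay invariant. The sketch below is my deliberately simplified, static version of that idea (parameter reuse instead of Bengio et al.'s gradient-based meta-learning); the variables and distributions are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

def sample(p_a, n):
    """A causes B: B copies A but is flipped with probability 0.1."""
    a = rng.random(n) < p_a
    b = np.where(rng.random(n) < 0.9, a, ~a)
    return a.astype(int), b.astype(int)

# large sample from the training distribution, P(A=1) = 0.5
a_tr, b_tr = sample(0.5, 10_000)
# small adaptation sample after an intervention shifting P(A=1) to 0.9
a_ad, b_ad = sample(0.9, 50)
# held-out data from the interventional distribution
a_te, b_te = sample(0.9, 10_000)

def marginal(x):
    """P(x=1) with Laplace smoothing."""
    return (x.sum() + 1) / (len(x) + 2)

def conditional(x, y):
    """P(y=1 | x=v) for v in {0, 1}, with Laplace smoothing."""
    p = np.zeros(2)
    for v in (0, 1):
        mask = x == v
        p[v] = (y[mask].sum() + 1) / (mask.sum() + 2)
    return p

# causal model A -> B: P(B|A) is invariant under an intervention on A,
# so it is kept from the large training sample; only P(A) is refit
p_a = marginal(a_ad)
p_b_given_a = conditional(a_tr, b_tr)

# anti-causal model B -> A: both P(B) and P(A|B) shift under the
# intervention, so both must be refit from the small adaptation sample
p_b = marginal(b_ad)
p_a_given_b = conditional(b_ad, a_ad)

def avg_loglik(first_marg, cond_tab, first, second):
    """Mean log-likelihood of a two-variable factorization."""
    p1 = np.where(first == 1, first_marg, 1 - first_marg)
    p2 = np.where(second == 1, cond_tab[first], 1 - cond_tab[first])
    return np.log(p1).mean() + np.log(p2).mean()

# the causal factorization typically scores higher on held-out
# interventional data, because it reuses the invariant mechanism
print("causal A->B :", avg_loglik(p_a, p_b_given_a, a_te, b_te))
print("anti-causal :", avg_loglik(p_b, p_a_given_b, b_te, a_te))
```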

Structural Equation Modeling (SEM)
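SEM expresses hypothesized cause-and-effect relations as a system of regression-style equations and fits them jointly against the data. The snippet below is a minimal sketch with semopy; the chosen relationships are illustrative rather than a claim about the true graph, and the diagnosis column is assumed to be numerically encoded.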

```python
# Example of an SEM implementation with semopy
from semopy import Model

# hypothesized relationships; column names follow the dataset's *_mean
# convention, and diagnosis is assumed to be encoded as 0/1
desc = """
radius_mean ~ texture_mean + area_mean
diagnosis ~ radius_mean + texture_mean
"""

sem = Model(desc)
sem.fit(data)
print(sem.inspect())  # parameter estimates for each relationship
```

Causal Bayesian Network

This method (see the sketch after this list):

  • Estimates relationships between all variables
  • Discovers multiple causal relationships simultaneously
  • Creates visual maps of variable influences
  • Enables simulation of multiple interventions
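The post does not pin down a library for this step. One plausible sketch uses CausalNex's NOTEARS-based structure learner on the numeric columns; the `data` DataFrame from the Data section is assumed, and the column list matches the features selected later:

```python
from causalnex.structure.notears import from_pandas

# structure learning needs an all-numeric frame; encode the M/B label
cols = ['radius_mean', 'texture_mean', 'area_mean', 'area_se', 'area_worst']
df = data[cols + ['diagnosis']].copy()
df['diagnosis'] = (df['diagnosis'] == 'M').astype(int)

# learn a weighted adjacency structure with NOTEARS, then prune weak edges
sm = from_pandas(df)
sm.remove_edges_below_threshold(0.8)

# the StructureModel is a networkx DiGraph: inspect the surviving edges
print(list(sm.edges(data='weight')))
```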

Insights

Initial dataset analysis revealed:

  1. Data Structure
  • 33 columns (32 numerical, 1 categorical)
  • Diagnosis column as the label vector
  • A unique ID column
  • 30 feature columns in 3 blocks (mean, standard error, worst)
  2. Selected Features

```python
features = [
    'radius_mean',
    'texture_mean',
    'area_mean',
    'area_se',
    'area_worst',
]
```

Random Forest Classifier

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# X: the causally selected feature columns, y: the M/B diagnosis labels
X = data[features]
y = data['diagnosis']

# Initialize classifier
rf = RandomForestClassifier(n_estimators=100, random_state=42)

# Train model
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
rf.fit(X_train, y_train)

# Evaluate
accuracy = rf.score(X_test, y_test)
print(f"Accuracy: {accuracy:.4f}")  # Output: Accuracy: 0.9532
```
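As a hypothetical follow-up (not in the original post), the fitted forest exposes per-feature importances, which can be cross-checked against the edges of the learned causal graph:

```python
# inspect which of the causally selected features drive the prediction
for name, importance in zip(features, rf.feature_importances_):
    print(f"{name}: {importance:.3f}")
```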

Info: The Random Forest Classifier achieved an accuracy above 95% after applying causal inference techniques.
