

Denamo Markos
Beyond Correlation: Causal Inference
Background
A common frustration in industry, especially when extracting business insights from tabular data, is that the most interesting questions often cannot be answered with observational data alone. Typical examples include:
- "What will happen if I halve the price of my product?"
- "Which clients will pay their debts only if I call them?"
Over the last few decades, Judea Pearl and his research group have developed a solid theoretical framework for answering such questions. Still, efforts to merge it with mainstream machine learning are only just beginning.
Info: A causal graph is a central object in the framework mentioned above, but it is often unknown, subject to personal knowledge and bias, or loosely connected to the available data.
Objective
The main objectives of this task are to:
- Perform a causal inference task using Pearl's framework
- Infer the causal graph from observational data and then validate the graph
- Merge machine learning with causal inference
Data
The data is from Kaggle, focusing on breast cancer diagnosis. Features are computed from a digitized image of a fine needle aspirate (FNA) of a breast mass.
Dataset Features
| Feature Type | Description | Examples |
|---|---|---|
| Basic Info | Identification and diagnosis | ID number, Diagnosis (M/B) |
| Cell Nucleus Features | Ten real-valued measurements | Radius, Texture, Perimeter |
| Statistical Measures | Three statistics per feature | Mean, Standard Error, Worst |
Key Measurements
Ten real-valued features computed for each cell nucleus:
- Radius (mean of distances from center to points on the perimeter)
- Texture (standard deviation of gray-scale values)
- Perimeter
- Area
- Smoothness (local variation in radius lengths)
- Compactness (perimeter² / area - 1.0)
- Concavity (severity of concave portions of the contour)
- Concave points (number of concave portions of the contour)
- Symmetry
- Fractal dimension ("coastline approximation" - 1)
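The derived quantities above are simple functions of the raw contour measurements. A minimal sketch of the compactness formula as stated (synthetic values, not the Kaggle data):

```python
import math

def compactness(perimeter: float, area: float) -> float:
    """Compactness as defined above: perimeter^2 / area - 1.0."""
    return perimeter ** 2 / area - 1.0

# Sanity check: a circle of radius r has perimeter 2*pi*r and area pi*r^2,
# so its compactness is 4*pi - 1 (about 11.566) regardless of r.
for r in (1.0, 2.5):
    print(round(compactness(2 * math.pi * r, math.pi * r ** 2), 3))
```

Irregular (e.g. malignant) contours pack more perimeter around the same area, so they score higher than this circular minimum.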
Info: The dataset contains 357 benign (not cancer) and 212 malignant (cancer) cases, with no missing values.
Techniques
Meta-learning Causal Structures
In 2019, Yoshua Bengio's team proposed a meta-learning approach for recognizing simple cause-and-effect relationships with deep learning. They utilized:
- Real-world causal relationship datasets
- Synthetic causal relationship datasets
- Probability-based mapping
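The key observation behind that work is that the factorization aligned with the true causal direction transfers better under interventions: if X causes Y, an intervention that changes P(X) leaves the mechanism P(Y|X) intact, while P(X|Y) shifts. A toy sketch of this asymmetry on synthetic binary data (all distributions here are invented for illustration; this shows the intuition only, not the meta-learning objective itself):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

def sample(p_x1):
    """Ground truth X -> Y with a fixed mechanism P(Y|X)."""
    x = (rng.random(n) < p_x1).astype(int)
    p_y1 = np.where(x == 1, 0.9, 0.1)  # mechanism: P(Y=1|X)
    y = (rng.random(n) < p_y1).astype(int)
    return x, y

def cond(a, b):
    """Empirical [P(a=1|b=1), P(a=1|b=0)]."""
    return np.array([a[b == 1].mean(), a[b == 0].mean()])

# Pre-intervention data, then an intervention that only changes P(X)
x0, y0 = sample(p_x1=0.2)
x1, y1 = sample(p_x1=0.8)

shift_causal = np.abs(cond(y0, x0) - cond(y1, x1)).max()      # P(Y|X) stays put
shift_anticausal = np.abs(cond(x0, y0) - cond(x1, y1)).max()  # P(X|Y) moves

print(f"P(Y|X) shift: {shift_causal:.3f}")
print(f"P(X|Y) shift: {shift_anticausal:.3f}")
```

In the meta-learning setup, this asymmetry becomes the training signal: the factorization whose parameters need the smallest update after an intervention is preferred as the causal one.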
Structural Equation Modeling (SEM)
```python
# Example of an SEM specification with semopy
from semopy import Model

desc = """
# Hypothesized structural relationships
radius ~ texture + area
diagnosis ~ radius + texture
"""

sem = Model(desc)
sem.fit(data)  # `data` is a pandas DataFrame containing the listed columns
```
Causal Bayesian Network
This method:
- Estimates relationships between all variables
- Discovers multiple causal relationships simultaneously
- Creates visual maps of variable influences
- Enables simulation of multiple interventions
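Structure-learning algorithms behind causal Bayesian networks typically rely on conditional-independence tests to decide which edges to keep. A minimal, self-contained sketch of such a test via partial correlation on a synthetic chain A → B → C (the chain and its coefficients are invented for illustration; a real pipeline would use a dedicated library):

```python
import numpy as np

rng = np.random.default_rng(42)
n = 50_000

# Synthetic linear chain: A -> B -> C
a = rng.normal(size=n)
b = a + rng.normal(size=n)
c = b + rng.normal(size=n)

def residual(v, cond_var):
    """Residual of v after simple linear regression on cond_var."""
    slope = np.cov(v, cond_var)[0, 1] / np.var(cond_var)
    return v - slope * cond_var

corr_ac = np.corrcoef(a, c)[0, 1]                            # marginal dependence
partial = np.corrcoef(residual(a, b), residual(c, b))[0, 1]  # A vs C given B

print(f"corr(A, C)     = {corr_ac:.3f}")  # noticeably nonzero
print(f"corr(A, C | B) = {partial:.3f}")  # close to zero
```

A and C are correlated, but conditioning on B removes the dependence; constraint-based algorithms such as PC use exactly this kind of test to delete the A–C edge and keep the chain.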
Insights
Initial dataset analysis revealed:
- Data Structure
- 33 columns (32 numerical, 1 categorical)
- Diagnosis column as label vector
- Unique ID column
- 30 feature columns in 3 blocks
- Selected Features
```python
features = ['radius_mean', 'texture_mean', 'area_mean', 'area_se', 'area_worst']
```
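The same Wisconsin diagnostic data also ships with scikit-learn, which makes the shapes and class counts above easy to verify (scikit-learn names features with spaces, e.g. 'mean radius', rather than the Kaggle-style suffixes used here):

```python
from sklearn.datasets import load_breast_cancer

# Load the bundled Wisconsin diagnostic breast cancer dataset
X, y = load_breast_cancer(return_X_y=True)

print(X.shape)              # (569, 30): 569 samples, 30 numerical features
print(int((y == 1).sum()))  # 357 benign (target 1)
print(int((y == 0).sum()))  # 212 malignant (target 0)
```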
Random Forest Classifier
```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Initialize classifier
rf = RandomForestClassifier(n_estimators=100, random_state=42)

# Train model on a held-out split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
rf.fit(X_train, y_train)

# Evaluate
accuracy = rf.score(X_test, y_test)
print(f"Accuracy: {accuracy:.4f}")  # Output: Accuracy: 0.9532
```
Info: The Random Forest Classifier achieved an accuracy above 95% after applying causal inference techniques.