

Denamo Markos
Beyond Correlation: Causal Inference
Background
A common frustration in industry, especially when extracting business insights from tabular data, is that the most interesting questions often cannot be answered with observational data alone. Typical examples include:
- "What will happen if I halve the price of my product?"
- "Which clients will pay their debts only if I call them?"
Over the last few decades, Judea Pearl and his research group have developed a solid theoretical framework for answering such questions. Still, efforts to merge it with mainstream machine learning are only just beginning.
Info: A causal graph is a central object in the framework mentioned above, but it is often unknown, subject to personal knowledge and bias, or loosely connected to the available data.
Objective
The main objectives of this task are to:
- Perform a causal inference task using Pearl's framework
- Infer the causal graph from observational data and then validate the graph
- Merge machine learning with causal inference
Data
The data is from Kaggle, focusing on breast cancer diagnosis. Features are computed from a digitized image of a fine needle aspirate (FNA) of a breast mass.
Dataset Features
| Feature Type | Description | Examples |
|---|---|---|
| Basic Info | Identification and diagnosis | ID number, Diagnosis (M/B) |
| Cell Nucleus Features | Ten real-valued measurements | Radius, Texture, Perimeter |
| Statistical Measures | Three statistics per feature | Mean, Standard Error, Worst |
Key Measurements
Ten real-valued features computed for each cell nucleus:
- Radius (mean of distances from center to points on the perimeter)
- Texture (standard deviation of gray-scale values)
- Perimeter
- Area
- Smoothness (local variation in radius lengths)
- Compactness (perimeter² / area - 1.0)
- Concavity (severity of concave portions of the contour)
- Concave points (number of concave portions of the contour)
- Symmetry
- Fractal dimension ("coastline approximation" - 1)
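The derived quantities above are simple functions of the raw contour measurements. A minimal sketch of the compactness formula as stated (synthetic values, not the Kaggle data):

```python
import math

def compactness(perimeter: float, area: float) -> float:
    """Compactness as defined above: perimeter^2 / area - 1.0."""
    return perimeter ** 2 / area - 1.0

# Sanity check: a circle of radius r has perimeter 2*pi*r and area pi*r^2,
# so its compactness is 4*pi - 1 (about 11.566) regardless of r.
for r in (1.0, 2.5):
    print(round(compactness(2 * math.pi * r, math.pi * r ** 2), 3))
```

Irregular (e.g. malignant) contours pack more perimeter around the same area, so they score higher than this circular minimum.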
Info: The dataset contains 357 benign (not cancer) and 212 malignant (cancer) cases, with no missing values.
Techniques
Meta-learning Causal Structures
In 2019, Yoshua Bengio's team proposed a meta-learning approach for recognizing simple cause-and-effect relationships with deep learning. They utilized:
- Real-world causal relationship datasets
- Synthetic causal relationship datasets
- Probability-based mapping
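The key observation behind that work is that the factorization aligned with the true causal direction transfers better under interventions: if X causes Y, an intervention that changes P(X) leaves the mechanism P(Y|X) intact, while P(X|Y) shifts. A toy sketch of this asymmetry on synthetic binary data (all distributions here are invented for illustration; this shows the intuition only, not the meta-learning objective itself):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

def sample(p_x1):
    """Ground truth X -> Y with a fixed mechanism P(Y|X)."""
    x = (rng.random(n) < p_x1).astype(int)
    p_y1 = np.where(x == 1, 0.9, 0.1)  # mechanism: P(Y=1|X)
    y = (rng.random(n) < p_y1).astype(int)
    return x, y

def cond(a, b):
    """Empirical [P(a=1|b=1), P(a=1|b=0)]."""
    return np.array([a[b == 1].mean(), a[b == 0].mean()])

# Pre-intervention data, then an intervention that only changes P(X)
x0, y0 = sample(p_x1=0.2)
x1, y1 = sample(p_x1=0.8)

shift_causal = np.abs(cond(y0, x0) - cond(y1, x1)).max()      # P(Y|X) stays put
shift_anticausal = np.abs(cond(x0, y0) - cond(x1, y1)).max()  # P(X|Y) moves

print(f"P(Y|X) shift: {shift_causal:.3f}")
print(f"P(X|Y) shift: {shift_anticausal:.3f}")
```

In the meta-learning setup, this asymmetry becomes the training signal: the factorization whose parameters need the smallest update after an intervention is preferred as the causal one.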
Structural Equation Modeling (SEM)
```python
# Example of an SEM specification with semopy
from semopy import Model

desc = """
# Hypothesized structural relationships
radius ~ texture + area
diagnosis ~ radius + texture
"""

sem = Model(desc)
sem.fit(data)  # `data` is a pandas DataFrame containing the listed columns
```
Causal Bayesian Network
This method:
- Estimates relationships between all variables
- Discovers multiple causal relationships simultaneously
- Creates visual maps of variable influences
- Enables simulation of multiple interventions
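Structure-learning algorithms behind causal Bayesian networks typically rely on conditional-independence tests to decide which edges to keep. A minimal, self-contained sketch of such a test via partial correlation on a synthetic chain A → B → C (the chain and its coefficients are invented for illustration; a real pipeline would use a dedicated library):

```python
import numpy as np

rng = np.random.default_rng(42)
n = 50_000

# Synthetic linear chain: A -> B -> C
a = rng.normal(size=n)
b = a + rng.normal(size=n)
c = b + rng.normal(size=n)

def residual(v, cond_var):
    """Residual of v after simple linear regression on cond_var."""
    slope = np.cov(v, cond_var)[0, 1] / np.var(cond_var)
    return v - slope * cond_var

corr_ac = np.corrcoef(a, c)[0, 1]                            # marginal dependence
partial = np.corrcoef(residual(a, b), residual(c, b))[0, 1]  # A vs C given B

print(f"corr(A, C)     = {corr_ac:.3f}")  # noticeably nonzero
print(f"corr(A, C | B) = {partial:.3f}")  # close to zero
```

A and C are correlated, but conditioning on B removes the dependence; constraint-based algorithms such as PC use exactly this kind of test to delete the A–C edge and keep the chain.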
Insights
Initial dataset analysis revealed:
- Data Structure
- 33 columns (32 numerical, 1 categorical)
- Diagnosis column as label vector
- Unique ID column
- 30 feature columns in 3 blocks
- Selected Features
```python
features = ['radius_mean', 'texture_mean', 'area_mean', 'area_se', 'area_worst']
```
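The same Wisconsin diagnostic data also ships with scikit-learn, which makes the shapes and class counts above easy to verify (scikit-learn names features with spaces, e.g. 'mean radius', rather than the Kaggle-style suffixes used here):

```python
from sklearn.datasets import load_breast_cancer

# Load the bundled Wisconsin diagnostic breast cancer dataset
X, y = load_breast_cancer(return_X_y=True)

print(X.shape)              # (569, 30): 569 samples, 30 numerical features
print(int((y == 1).sum()))  # 357 benign (target 1)
print(int((y == 0).sum()))  # 212 malignant (target 0)
```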
Random Forest Classifier
```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Initialize classifier
rf = RandomForestClassifier(n_estimators=100, random_state=42)

# Train model on a held-out split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
rf.fit(X_train, y_train)

# Evaluate
accuracy = rf.score(X_test, y_test)
print(f"Accuracy: {accuracy:.4f}")  # Output: Accuracy: 0.9532
```
Info: The Random Forest Classifier achieved an accuracy above 95% after applying causal inference techniques.