
Rethinking PSM design in Production

· 20 min read
Sho SEKINE
Head of Applied Science at mercari, Principal Data Scientist at Fast Retailing, Co-founder AI Allye

Welcome to the world of Causal Inference.

If you work with data in a production environment, you've likely heard of Propensity Score Matching (PSM). You may have even implemented it using libraries like causalinference or DoWhy.

Writing the code isn't difficult. With the modern Python ecosystem, you can calculate propensity scores, perform matching, and estimate effects (ATE/ATT) with just a few lines of code.

But when someone asks, "Can we really trust these results?", can you confidently say "Yes"? If not, your credibility suffers. And could you answer immediately, without breaking into a cold sweat, if a Staff Data Scientist fired these sharp questions at you?

  • "If you change the random seed, does the result flip from positive to negative?"
  • "Are these matches actually similar? Are you forcing pairs?"
  • "How does the conclusion change if you tighten the caliper slightly?"
  • "What are the characteristics of the data that was excluded (trimmed)?"

In this blog, we will thoroughly rethink the "classical" method of PSM from the perspective of modern production data science. We will also dig deep into the philosophy and specifications of why Allye's PSM Widget is designed not just as a calculation tool, but as a "cockpit for protecting analysis quality."

This should serve as a "map of the field" for beginners and an "implementation answer key" for experts.

Does a Vocational Training Program Really Work?

· 10 min read
Sho SEKINE
Head of Applied Science at mercari, Principal Data Scientist at Fast Retailing, Co-founder AI Allye

"Does vocational training truly boost participants' future earnings?"

For policymakers and business leaders, measuring the real impact of such programs is a critical challenge. A simple comparison between participants and non-participants is often misleading—for instance, highly motivated individuals might be more likely to sign up, skewing the results.

To solve this, we need Causal Inference.

In this post, we revisit a classic case study based on LaLonde's National Supported Work Demonstration (NSW) data. We will move beyond the textbook theory and demonstrate how to strip away bias to uncover the true program effect.

Today, let's analyze this data using Allye Pro.

CATE Prediction

1. Data Generation

The data is available in the causaldata package. We will use it to create a mixed dataset (nsw_cps_mixed_data) that combines the experimental treatment group with the observational control group.

You can generate the data with the code below, or download the CSV file from here.

from causaldata import nsw_mixtape, cps_mixtape
import pandas as pd

# NSW randomized experiment
df_nsw = nsw_mixtape.load_pandas().data.copy()
# CPS observational data
df_cps = cps_mixtape.load_pandas().data.copy()

# Columns shared by both sources
common_cols = [
    "age", "educ", "black", "hisp", "marr",
    "nodegree", "re74", "re75", "re78"
]

# Every CPS record is an untreated (control) observation
df_cps_use = df_cps[common_cols].copy()
df_cps_use["treat"] = 0
df_cps_use["source"] = "CPS"

# Select only the treated group from the experimental data
df_nsw_use = df_nsw[df_nsw["treat"] == 1][common_cols + ["treat"]].copy()
df_nsw_use["source"] = "NSW"

# Combine them to form a biased dataset
df_mixed = pd.concat(
    [df_nsw_use, df_cps_use],
    axis=0,
    ignore_index=True
)
df_mixed["treat"] = df_mixed["treat"].astype("category")
df_mixed.head()

Here is a breakdown of the variables in the dataset:

| Variable | Definition | Role | Details |
|---|---|---|---|
| treat | Treatment Indicator | Treatment | 1 = Received Job Training, 0 = Did not receive. This is the key variable for our analysis. |
| age | Age | Covariate | Age of the participant. |
| educ | Education | Covariate | Years of education completed (e.g., 12 = High School graduate). |
| black | Black (Dummy) | Covariate | 1 = Black, 0 = Otherwise. |
| hisp | Hispanic (Dummy) | Covariate | 1 = Hispanic, 0 = Otherwise. |
| marr | Married (Dummy) | Covariate | 1 = Married, 0 = Single/Other. |
| nodegree | No Degree (Dummy) | Covariate | 1 = No High School Degree, 0 = Has Degree. Used to identify dropouts. |
| re74 | Real Earnings 1974 | Covariate | Pre-treatment Income 1. Indicates economic status before the program. Participants often have low values here. |
| re75 | Real Earnings 1975 | Covariate | Pre-treatment Income 2. Immediate pre-program income. Often zero for participants in this dataset. |
| re78 | Real Earnings 1978 | Outcome | Post-treatment Income. The target variable. We want to see if treat=1 leads to an increase here. |
| source | Data Source | Metadata | Origin of the record ('NSW' for experimental treated, 'CPS' for observational control). |

2. A/A Test and Checking Bias in Treatment Effects

The NSW dataset consists of individuals who sought and received vocational training. The cps_mixtape data, however, represents a general population sample.

There are likely many underlying factors that motivate someone to seek vocational training. First, let's perform a quick A/A Test to check if the two groups are homogeneous.

A/A Test Results

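| Variable | Group | Sample Size | Average | 95% CI | Effect Δ | Lift (%) | p-value | Significant |
|---|---|---|---|---|---|---|---|---|
| age | Control | 15992 | 33.23 | [33.05, 33.40] | - | - | - | No |
| age | Treated | 185 | 25.82 | [24.78, 26.85] | -7.41 | -22.3% | 0.000 | Yes |
| educ | Control | 15992 | 12.03 | [11.98, 12.07] | - | - | - | No |
| educ | Treated | 185 | 10.35 | [10.05, 10.64] | -1.68 | -14.0% | 0.000 | Yes |
| black | Control | 15992 | 0.07 | [0.07, 0.08] | - | - | - | No |
| black | Treated | 185 | 0.84 | [0.79, 0.90] | +0.77 | +1046.7% | 0.000 | Yes |
| marr | Control | 15992 | 0.71 | [0.70, 0.72] | - | - | - | No |
| marr | Treated | 185 | 0.19 | [0.13, 0.25] | -0.52 | -73.4% | 0.000 | Yes |
| nodegree | Control | 15992 | 0.30 | [0.29, 0.30] | - | - | - | No |
| nodegree | Treated | 185 | 0.71 | [0.64, 0.77] | +0.41 | +139.4% | 0.000 | Yes |
| re74 | Control | 15992 | 14016.80 | [13868.47, 14165.13] | - | - | - | No |
| re74 | Treated | 185 | 2095.57 | [1386.75, 2804.39] | -11921.23 | -85.0% | 0.000 | Yes |
| re75 | Control | 15992 | 13650.80 | [13507.11, 13794.49] | - | - | - | No |
| re75 | Treated | 185 | 1532.06 | [1065.09, 1999.02] | -12118.75 | -88.8% | 0.000 | Yes |
| re78 | Control | 15992 | 14846.66 | [14697.13, 14996.19] | - | - | - | No |
| re78 | Treated | 185 | 6349.14 | [5207.95, 7490.34] | -8497.52 | -57.2% | 0.000 | Yes |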

Those who received vocational training are generally younger, have lower education levels, and significantly lower pre-training earnings (re74, re75).

The fact that re78 (earnings in 1978) is higher for the non-treated group does not mean the training was pointless. It may simply mean that any positive effect of the training was not enough to close the large pre-existing gap between the two groups. The naive comparison reports a difference of -$8497.52, but we cannot interpret this as the causal effect of the intervention because of the severe selection bias.
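To make the Effect Δ and Lift (%) columns concrete, here is a minimal sketch of the arithmetic, assuming Effect Δ is the treated mean minus the control mean and Lift is that difference relative to the control mean. The helper name effect_and_lift is hypothetical; the group means are copied from the A/A table above.

```python
def effect_and_lift(control_mean: float, treated_mean: float) -> tuple[float, float]:
    """Return (absolute difference, lift in percent relative to the control mean)."""
    delta = treated_mean - control_mean
    lift_pct = 100.0 * delta / control_mean
    return delta, lift_pct


# Group means copied from the A/A table: (control, treated)
rows = {
    "age": (33.23, 25.82),
    "re78": (14846.66, 6349.14),
}

for name, (control, treated) in rows.items():
    delta, lift = effect_and_lift(control, treated)
    print(f"{name}: delta={delta:+.2f}, lift={lift:+.1f}%")
# age: delta=-7.41, lift=-22.3%
# re78: delta=-8497.52, lift=-57.2%
```

Both rows reproduce the reported values, which suggests the table uses the control group as the baseline for lift.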

3. Propensity Score Matching

To address this bias, we apply Propensity Score Matching (PSM), a standard technique in causal inference.

We select covariates for balancing (e.g., demographics, prior earnings) and choose the outcome variable.
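For readers who want to see the mechanics behind the widget, the following is a rough sketch of textbook PSM, not Allye's implementation. It assumes a logistic propensity model fitted by Newton's method and greedy 1:1 nearest-neighbor matching without replacement within a caliper. The function names, the caliper of 0.05, and the synthetic data (true effect set to 1.0) are all assumptions made for illustration.

```python
import numpy as np


def fit_propensity(X, t, n_iter=25, ridge=1e-6):
    """Logistic regression via Newton's method; returns P(treat=1 | X)."""
    Xb = np.column_stack([np.ones(len(X)), X])
    beta = np.zeros(Xb.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-Xb @ beta))
        w = p * (1.0 - p)
        hessian = Xb.T @ (Xb * w[:, None]) + ridge * np.eye(Xb.shape[1])
        beta += np.linalg.solve(hessian, Xb.T @ (t - p))
    return 1.0 / (1.0 + np.exp(-Xb @ beta))


def match_nearest(ps, t, caliper=0.05):
    """Greedy 1:1 nearest-neighbor matching without replacement on the score."""
    available = list(np.flatnonzero(t == 0))
    pairs = []
    for i in np.flatnonzero(t == 1):
        if not available:
            break
        dist = np.abs(ps[available] - ps[i])
        j = int(np.argmin(dist))
        if dist[j] <= caliper:  # drop treated units with no close control
            pairs.append((i, available.pop(j)))
    return pairs


# Synthetic data (an assumption, not the NSW/CPS data): selection on x0.
rng = np.random.default_rng(0)
n = 3000
x = rng.normal(size=(n, 2))
t = rng.binomial(1, 1.0 / (1.0 + np.exp(-(-2.0 + 1.5 * x[:, 0]))))
y = 1.0 * t + 2.0 * x[:, 0] + rng.normal(size=n)  # true effect = 1.0

ps = fit_propensity(x, t)
pairs = match_nearest(ps, t)
treated_idx, control_idx = map(np.array, zip(*pairs))

naive = y[t == 1].mean() - y[t == 0].mean()
att_matched = (y[treated_idx] - y[control_idx]).mean()
print(f"pairs={len(pairs)}, naive={naive:.2f}, matched={att_matched:.2f}")
```

In this toy setup the naive difference is inflated by selection on x0, while the matched difference should land much closer to the simulated effect. Greedy matching depends on the order in which treated units are processed, which is one reason results can shift when that order changes.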

PSM Report

Looking at the Love Plot and the balance table, we can see that the discrepancies identified in the A/A test have been successfully mitigated. The matching process has created a control group that is statistically very similar to the treated group.
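A Love Plot typically shows the standardized mean difference (SMD) of each covariate before and after matching. As a minimal sketch, assuming the common definition with the pooled standard deviation of the two groups, it can be computed as below. The function name and the synthetic samples are assumptions; the numbers are not taken from the NSW data.

```python
import numpy as np


def standardized_mean_difference(x_treated, x_control):
    """SMD = (mean_t - mean_c) / sqrt((var_t + var_c) / 2)."""
    x_treated = np.asarray(x_treated, dtype=float)
    x_control = np.asarray(x_control, dtype=float)
    pooled_sd = np.sqrt((x_treated.var(ddof=1) + x_control.var(ddof=1)) / 2.0)
    return (x_treated.mean() - x_control.mean()) / pooled_sd


# Synthetic covariate: a treated group, a raw control pool, a matched control set.
rng = np.random.default_rng(0)
treated = rng.normal(loc=0.0, scale=1.0, size=200)
raw_control = rng.normal(loc=1.0, scale=1.5, size=5000)
matched_control = rng.normal(loc=0.05, scale=1.0, size=200)

smd_before = standardized_mean_difference(treated, raw_control)
smd_after = standardized_mean_difference(treated, matched_control)
print(f"SMD before matching: {smd_before:+.2f}")
print(f"SMD after matching:  {smd_after:+.2f}")
```

A widely used rule of thumb treats an absolute SMD below 0.1 as acceptable balance, though the threshold is a convention rather than a guarantee.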

Now, let's run an A/B Test on this matched dataset:

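| Variable | Group | Sample Size | Average | 95% CI | Effect Δ | Lift (%) | p-value | Significant |
|---|---|---|---|---|---|---|---|---|
| re78 | Control (0) | 164 | 4564.52 | [3736.96, 5392.07] | - | - | - | No |
| re78 | Treated (1) | 164 | 6429.95 | [5227.35, 7632.55] | +1865.43 | +40.9% | 0.012 | Yes |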

Matched A/B Test

We now estimate a positive effect of $1865.43. This difference is statistically significant.

4. Validation: Checking the Answer Key

Since the original NSW dataset is from a Randomized Controlled Trial (RCT), we can calculate the true experimental effect by comparing the treated group with the experimental control group (not the CPS data). (While there is some slight bias in nodegree, the groups are largely balanced.)

True RCT Effect

Analysis Settings

  • Treatment Variable: treat
  • Control Group: 0
  • Test Type: Auto (based on variable type)
  • Confidence Level: 95%
  • Multiple Comparison Correction: None
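
| Outcome | Group | Sample | Average | Abs CI | Effect Δ | Lift (%) | Effect CI (Δ) | p-value | Significant |
|---|---|---|---|---|---|---|---|---|---|
| age | Control | 260 | 25.05 | [24.19, 25.92] | - | - | - | - | No |
| age | Treatment | 185 | 25.82 | [24.78, 26.85] | +0.76 | +3.0% | - | 0.266 | No |
| educ | Control | 260 | 10.09 | [9.89, 10.29] | - | - | - | - | No |
| educ | Treatment | 185 | 10.35 | [10.05, 10.64] | +0.26 | +2.6% | - | 0.150 | No |
| black | Control | 260 | 0.83 | [0.78, 0.87] | - | - | - | - | No |
| black | Treatment | 185 | 0.84 | [0.79, 0.90] | +0.02 | +2.0% | - | 0.647 | No |
| hisp | Control | 260 | 0.11 | [0.07, 0.15] | - | - | - | - | No |
| hisp | Treatment | 185 | 0.06 | [0.03, 0.09] | -0.05 | -44.8% | - | 0.064 | No |
| marr | Control | 260 | 0.15 | [0.11, 0.20] | - | - | - | - | No |
| marr | Treatment | 185 | 0.19 | [0.13, 0.25] | +0.04 | +23.0% | - | 0.334 | No |
| nodegree | Control | 260 | 0.83 | [0.79, 0.88] | - | - | - | - | No |
| nodegree | Treatment | 185 | 0.71 | [0.64, 0.77] | -0.13 | -15.2% | - | 0.002 | Yes |
| re74 | Control | 260 | 2107.03 | [1412.41, 2801.65] | - | - | - | - | No |
| re74 | Treatment | 185 | 2095.57 | [1386.75, 2804.39] | -11.45 | -0.5% | - | 0.982 | No |
| re75 | Control | 260 | 1266.91 | [887.97, 1645.85] | - | - | - | - | No |
| re75 | Treatment | 185 | 1532.06 | [1065.09, 1999.02] | +265.15 | +20.9% | - | 0.385 | No |
| re78 | Control | 260 | 4554.80 | [3885.10, 5224.50] | - | - | - | - | No |
| re78 | Treatment | 185 | 6349.14 | [5207.95, 7490.34] | +1794.34 | +39.4% | - | 0.008 | Yes |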

The true effect is +$1794.34. Our PSM estimate of $1865.43 differs by less than 4%, demonstrating that PSM was able to recover the causal effect with high accuracy from the observational data.
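The "less than 4%" figure is a relative deviation from the experimental benchmark. A small sketch of that arithmetic follows, with the helper name relative_deviation chosen here for illustration and all three dollar values taken from the tables in this post.

```python
def relative_deviation(estimate: float, benchmark: float) -> float:
    """Signed deviation of an estimate from a benchmark, in percent."""
    return 100.0 * (estimate - benchmark) / benchmark


rct_benchmark = 1794.34   # experimental treated vs. experimental control
psm_estimate = 1865.43    # matched A/B test
naive_estimate = -8497.52  # unmatched NSW treated vs. CPS control

psm_dev = relative_deviation(psm_estimate, rct_benchmark)
naive_dev = relative_deviation(naive_estimate, rct_benchmark)
print(f"PSM vs. RCT:   {psm_dev:+.2f}%")
print(f"Naive vs. RCT: {naive_dev:+.1f}%")
```

The PSM deviation comes out just under 4%, while the naive comparison is off by several multiples of the benchmark and has the wrong sign.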

5. Advanced Topics: Heterogeneous Treatment Effects

CATE Estimation

Using machine learning, we can go a step further and estimate the Conditional Average Treatment Effect (CATE) for individuals. Given the small sample size and high variance, we'll use LinearDML, which provides robust CATE estimation.

LinearDML

By averaging the predicted CATE for the treated individuals (treat = 1), we can compare this result with our previous average treatment effects.
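To illustrate the idea behind this step, here is a hand-rolled sketch of the partialling-out approach with a linear CATE model. It is not the econml API and not Allye's implementation. It assumes plain least squares for both nuisance models, two-fold cross-fitting, and synthetic data whose true CATE is 2 - x. All function names are hypothetical.

```python
import numpy as np


def ols_fit(X, y):
    """Least-squares fit with an intercept; returns the coefficient vector."""
    Xb = np.column_stack([np.ones(len(X)), X])
    coef, *_ = np.linalg.lstsq(Xb, y, rcond=None)
    return coef


def ols_predict(coef, X):
    return np.column_stack([np.ones(len(X)), X]) @ coef


def linear_dml(X, t, y, n_folds=2, seed=0):
    """Cross-fitted partialling-out with CATE model theta(x) = b0 + x @ b."""
    n = len(y)
    folds = np.random.default_rng(seed).permutation(n) % n_folds
    y_res, t_res = np.empty(n), np.empty(n)
    for k in range(n_folds):
        train, test = folds != k, folds == k
        # Residualize outcome and treatment using models fitted on the other fold.
        y_res[test] = y[test] - ols_predict(ols_fit(X[train], y[train]), X[test])
        t_res[test] = t[test] - ols_predict(ols_fit(X[train], t[train]), X[test])
    # Final stage: regress outcome residuals on treatment residuals times [1, X].
    design = t_res[:, None] * np.column_stack([np.ones(n), X])
    beta, *_ = np.linalg.lstsq(design, y_res, rcond=None)
    return beta


def predict_cate(beta, X):
    return np.column_stack([np.ones(len(X)), X]) @ beta


# Synthetic data (an assumption): true CATE is 2 - x, propensity is 0.5 + 0.2 * x.
rng = np.random.default_rng(1)
n = 4000
X = rng.uniform(-1.0, 1.0, size=(n, 1))
t = rng.binomial(1, 0.5 + 0.2 * X[:, 0]).astype(float)
y = 1.0 + X[:, 0] + (2.0 - X[:, 0]) * t + rng.normal(size=n)

beta = linear_dml(X, t, y)
cate = predict_cate(beta, X)
mean_cate_treated = cate[t == 1].mean()
print("CATE coefficients (intercept, slope):", np.round(beta, 2))
print("Mean CATE among treated:", round(float(mean_cate_treated), 2))
```

Averaging the predicted CATE over the treated rows, as in the last lines, gives a quantity comparable to an average treatment effect on the treated, which is why it can be set side by side with the PSM and RCT numbers.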

Mean CATE

The calculated result is $1495. While there is a ~16.7% deviation from the true $1794, it is a massive improvement over the naive observational comparison (-$8497) and provides a directional estimate good enough for decision-making.

One more tip for accurate interpretation

In the LinearDML report, the factors contributing to CATE showed that both re74 and re75 had negative coefficients, with re74 showing a particularly strong negative correlation.

Effect Model Coefficients

It makes intuitive sense that people with higher prior earnings might benefit less from basic vocational training. However, the fact that re74 (income 4 years prior) had a much stronger correlation than re75 (income 3 years prior) seemed odd.

Before jumping to conclusions, we should check for multicollinearity, as LinearDML (being a linear model) is sensitive to it.

Checking the scatter plot and correlation between re74 and re75, we find a high correlation coefficient (r=0.87). The plot also suggests a ceiling effect.

re74 vs re75

This collinearity might be distorting the coefficients. To fix this, we can filter out the ceiling values as outliers and apply Principal Component Analysis (PCA) to re74 and re75 to create orthogonal components.

  • PC1: Positively correlated with both re74 and re75 (represents overall income level).
  • PC2: Represents the difference/variance between the years.
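The collinearity check and the PCA step can be sketched as follows, assuming standardized inputs and an eigendecomposition of the 2x2 correlation matrix. The two synthetic columns are stand-ins for re74 and re75, not the real data, and the ceiling-value filtering mentioned above is not reproduced here.

```python
import numpy as np

# Synthetic stand-ins for two highly correlated earnings columns.
rng = np.random.default_rng(0)
n = 1000
base = rng.gamma(shape=2.0, scale=3000.0, size=n)   # shared income level
earn_a = base + rng.normal(scale=1500.0, size=n)    # stand-in for re74
earn_b = base + rng.normal(scale=1500.0, size=n)    # stand-in for re75
X = np.column_stack([earn_a, earn_b])

# Step 1: check the pairwise correlation.
r = np.corrcoef(earn_a, earn_b)[0, 1]

# Step 2: standardize, then eigendecompose the correlation matrix.
Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
eigvals, eigvecs = np.linalg.eigh(np.corrcoef(Z, rowvar=False))
order = np.argsort(eigvals)[::-1]          # largest variance first
eigvals, eigvecs = eigvals[order], eigvecs[:, order]
if eigvecs[0, 0] < 0:                      # fix the sign so PC1 loads positively
    eigvecs[:, 0] *= -1

# Step 3: project onto the components; the scores are uncorrelated.
scores = Z @ eigvecs
pc_corr = np.corrcoef(scores[:, 0], scores[:, 1])[0, 1]

print(f"correlation between the two columns: {r:.2f}")
print("PC1 loadings:", np.round(eigvecs[:, 0], 2))
print("PC2 loadings:", np.round(eigvecs[:, 1], 2))
print(f"correlation between PC1 and PC2 scores: {pc_corr:.1e}")
```

With two standardized variables, PC1 weights both columns equally (overall level) and PC2 is their contrast (the year-to-year difference), which matches the interpretation in the list above.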

Re-running LinearDML with PC1 and PC2 instead of the raw variables yields the following:

PCA LinearDML

Both components still show a negative correlation with CATE, but PC1 (overall income level) has the strongest negative correlation. This confirms our hypothesis: Vocational training is less effective for those who already have high earning potential. It wasn't about re74 specifically, but the general income level.

Additionally, age shows a positive correlation, suggesting that older participants (within this demographic) benefited more from the training than younger ones.

6. Conclusion and Summary

Our analysis of the NSW vocational training program revealed several key insights:

  1. Bias Correction: Simple comparison of observational data led to a misleading negative effect (-$8500). Propensity Score Matching successfully corrected this bias, estimating a positive effect (+$1865) very close to the true experimental benchmark (+$1794).
  2. Targeting Efficiency: Vocational training budgets and manpower are limited. To maximize effectiveness, our CATE analysis suggests a clear policy direction:
    • Focus on those with lower prior earnings. The training has diminishing returns for those with higher baseline income.
    • Prioritize older applicants. Within this group, older individuals showed higher treatment effects.

Simply looking at post-training income (re78) might tempt administrators to select candidates who are likely to earn more anyway (high prior earners). However, our causal analysis suggests this would be a mistake: those individuals benefit the least from the program. The value of the training is maximized by targeting those who need it most.

Data Science Is Fun! Getting It Right Is What Makes It Valuable.

Achieve deeper understanding and higher-quality outputs in data science—beyond your peers. If you want to explore the data yourself, grab the dataset and try reproducing these results in Allye!

You can try Allye Base for free.

Personalize Email Campaign-MineThatData Challenge

· 10 min read
Sho SEKINE
Head of Applied Science at mercari, Principal Data Scientist at Fast Retailing, Co-founder AI Allye

Originally published in 2008 by MineThatData, the MineThatData E-Mail Analytics And Data Mining Challenge dataset invites us to solve a timeless marketing problem: how do we personalize campaigns?

In this post, we'll dive into this dataset using Causal Inference to uncover not just which campaign worked best, but why and how to improve.

Hello! and Welcome

· 2 min read
Sho SEKINE
Head of Applied Science at mercari, Principal Data Scientist at Fast Retailing, Co-founder AI Allye
Nao SEKINE
Ex-Recruit Holdings Senior AI Engineer, Co-founder AI Allye

Hello! We are Sho and Nao, the founders of Allye. We are incredibly excited to announce the release of Allye, the ideal product we've always envisioned, built with the power of Generative AI.

Why we built Allye

With over 15 years of experience in Data Science, we’ve seen the landscape evolve. Data Science knowledge is now essential not just for specialists, but also for Product Managers, Marketers, and Engineers.

Meanwhile, business and research landscapes are becoming increasingly complex and personalized. As tasks multiply and the demand for efficiency grows, the time available for deep analysis shrinks. Additionally, with the widespread use of AI, the risk of data leakage is rising, making secure data handling more critical than ever.

Yet, the pressure to make correct, data-driven decisions remains high. We constantly need to "understand users," "measure effects," and "find strategies," but are often buried in operational tasks.

What makes Allye unique

We built Allye to solve this dilemma. It is distinct from other no-code tools:

  1. Obsession with Speed: Allye is optimized to process practical data sizes instantly. Speed is our UX promise.
  2. Deep Integration with Python: No-code is fast, but code is flexible. Allye bridges the gap—the AI writes the Python code for you.
  3. Comprehensive Causal Inference: Understanding "why" is crucial for business decisions. That is why causal inference is our core analytics engine.
  4. Local & Secure: Allye runs locally and offline. Your data stays on your device, so you never have to worry about data leakage to AI.

Allye Base is free. Check the Quick Start and start your journey now.


Learn More: Hands-on Tutorial
