The fundamental problem of causal inference is that we never observe both potential outcomes of a single unit along the same timeline. In other words, to disentangle the causal effect of a treatment on a given unit, we would need to compare the unit's outcome in the presence and in the absence of the treatment (e.g., in a medical trial, a patient cannot be in both the treatment and control groups at the same time, so we cannot observe their outcome in both worlds).
However, we can approximate the counterfactual (the outcome of the treated unit had it not been subjected to treatment) using the Synthetic Control Method of Abadie, Diamond, and Hainmueller (2010). The idea is to construct a synthetic version of the treated unit as a convex combination of control-group units such that the error between the treated unit and the synthetic unit is minimized over the pre-treatment period.
The aim of this final project is to replicate this approach by casting it as a convex optimization problem and solving it with CVXPY.
The original paper's aim was to disentangle the causal effect of the 1990 German reunification (the treatment year) on West Germany's gross domestic product. First, we gather data on the OECD countries from 1960 to 2003, then we split the data into two periods (pre-treatment and post-treatment). We then construct a covariate matrix for the treated unit (West Germany), $X_{1}$, and for the control units (the rest of the OECD), $X_{0}$. The benchmark is to obtain the best fit between West Germany's GDP trend and that of its constructed synthetic version during the pre-treatment period, so that the comparison in the post-treatment period is credible. First, I address a simplified version of the problem in which all covariates have the same importance (weights); then I propose an approach for specifying weights for the covariates.
Note: the outcome variable (GDP) is excluded from the covariate matrices $X_{1}$ and $X_{0}$.
Objective: $$\min_{W}\sum_{k=1}^{K}\left(X_{1,k}-\left(WX_{0}\right)_{k}\right)^{2}$$
Subject to: $$\sum_{j=1}^{J}w_{j}=1,\quad w_{j}\geq 0$$
where $K$ is the number of covariates, $J$ is the number of control units, and $W=\left(w_{1},\ldots,w_{J}\right)$ is the vector of weights assigned to the control units.
The aim is to minimize the error distance between the outcome of the treated unit $Y_{1}$ and the product of the optimal weights $W^{\ast}$ with the outcomes of the control units over the pre-treatment period $T_{0}$, such that:
$$W^{\ast}\times Y_0\approx Y_1$$
where $Y_{0}$ collects the control units' outcomes (GDP) over the pre-treatment period.
Once we establish a good fit between the synthetic version and the treated unit, we can extrapolate the comparison to the post-treatment period $T_{1}$, where the treatment effect $\hat{\alpha}$ is expressed as: $$\hat{\alpha}=Y_{1T_{1}}-\sum_{j=1}^{J}w_{j}^{\ast}Y_{0,jT_{1}}$$
As mentioned in the methodology, the simplified version attributes the same importance to every covariate, which can be problematic given their different units and their different contributions to the error we are trying to minimize. To illustrate the idea, imagine one covariate measured in dollars (on the order of thousands) and another measured as a percentage: the squared error on the dollar-scale covariate dwarfs the other, so the optimizer effectively ignores the percentage one, as the sketch below shows.
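A minimal numerical sketch (hypothetical values, not taken from the dataset) of this scale effect:
import numpy as np
# Hypothetical covariates: one dollar-scale, one percentage-scale
x1 = np.array([25000.0, 55.0])    # treated unit
x0 = np.array([24000.0, 30.0])    # candidate synthetic unit
errors = (x1 - x0) ** 2
print(errors)                     # [1000000.  625.]
print(errors / errors.sum())      # ~[0.999, 0.001]: the dollar term dominates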
We pick a diagonal matrix $V$ of covariate importances and use it to find the optimal weights $W$: $$\min\left(X_1-X_0W\right)^TV\left(X_1-X_0W\right)$$
We then use the matrix $V$ to test how well the resulting synthetic unit fits the outcome of the treated unit: $$V^{\ast}=\arg\min_{V}\left[\left(Y_1-Y_0W^{\ast}\left(V\right)\right)^T\left(Y_1-Y_0W^{\ast}\left(V\right)\right)\right]$$
Finding $V$ can be seen as an optimization problem in its own right, since the values of $W$ change with our choice of the importance matrix $V$. As a result, the joint problem is non-convex: the following objective cannot contain both $W$ and $V$ as unknown variables subject to optimization, because the product of two decision variables violates the DCP rules:
Objective: $$\min_{V,W}\left(X_1-X_0W\right)^TV\left(X_1-X_0W\right)$$
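A quick check with placeholder data (shapes chosen to match this project: 7 covariates, 16 control units) confirms that CVXPY flags the joint formulation as non-DCP:
import numpy as np
import cvxpy as cvx
X1_demo = np.ones((7, 1))
X0_demo = np.ones((7, 16))
w_demo = cvx.Variable((16, 1), nonneg=True)
v_demo = cvx.Variable(7, nonneg=True)
# Product of two decision variables: rejected by the DCP rules
joint = cvx.Minimize(cvx.sum(cvx.multiply(v_demo, cvx.square(cvx.vec(X1_demo - X0_demo @ w_demo)))))
print(joint.is_dcp())             # False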
A strategy to preserve the DCP rules is to fix one variable, optimize the other, and alternate (sketched below); however, the attempt was fruitless, as the process did not converge to a reasonable distribution of weights (all the weight was attributed to a single unit).
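For reference, a minimal sketch of how such an alternating (block-coordinate) scheme could look; this is an assumed reconstruction, not the exact code that was run:
import numpy as np
import cvxpy as cvx

def solve_w_given_v(V_diag, X1_T, X0_T):
    # Fix the importances V and solve the convex subproblem in w
    w = cvx.Variable((X0_T.shape[1], 1), nonneg=True)
    obj = cvx.Minimize(cvx.sum(cvx.multiply(V_diag.reshape(-1, 1),
                                            cvx.square(X1_T - X0_T @ w))))
    cvx.Problem(obj, [cvx.sum(w) == 1]).solve()
    return w.value

def solve_v_given_w(w_val, X1_T, X0_T):
    # Fix w and solve for V: a linear program over the simplex, so the
    # optimum sits at a vertex, one symptom of the collapse reported above
    v = cvx.Variable(X1_T.shape[0], nonneg=True)
    res = ((X1_T - X0_T @ w_val) ** 2).flatten()
    cvx.Problem(cvx.Minimize(v @ res), [cvx.sum(v) == 1]).solve()
    return v.value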
# Import packages
import matplotlib.pyplot as plt
import scipy.optimize
import pandas as pd
import cvxpy as cvx
import numpy as np
# Import and process data
data = pd.read_csv('german_reunification.csv')
# Treated unit: West Germany's GDP over the pre-treatment years (1960-1989)
tr_unit = data[data.country == 'West Germany']
tr_unit = tr_unit[tr_unit.year < 1990]
y_tr = np.array(tr_unit.gdp).reshape(1, 30)
# Control units: the 16 other OECD countries over the same years
# (assumes rows are sorted by country, then by year)
ctr_units = data[data.country != 'West Germany']
ctr_units = ctr_units[ctr_units.year < 1990]
y_ctr = np.array(ctr_units.gdp).reshape(16, 30)
# Full sample (1960-2003) for the post-treatment comparison
tr_unit_all = data[data.country == 'West Germany']
y_tr_all = np.array(tr_unit_all.gdp).reshape(1, 44)
ctr_units_all = data[data.country != 'West Germany']
y_ctr_all = np.array(ctr_units_all.gdp).reshape(16, 44)
# Create matrices for the problem formulation
# X1: covariate means for West Germany over the pre-treatment period
X1 = data[data.country == 'West Germany']
X1 = X1[X1.year < 1990]
X1 = X1.drop(['code', 'country', 'gdp', 'year'], axis=1)
X1 = X1.groupby(np.arange(len(X1)) // 30).mean()   # one row of 30-year means
X1 = X1.values
# X0: covariate means for each of the 16 control countries
X0 = data[data.country != 'West Germany']
X0 = X0[X0.year < 1990]
X0 = X0.drop(['code', 'country', 'gdp', 'year'], axis=1)
X0 = X0.groupby(np.arange(len(X0)) // 30).mean()   # one row per country
X0 = X0.values
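As a quick sanity check (assuming seven covariate columns remain after the drop), the shapes should be:
print(X1.shape, X0.shape)   # expected: (1, 7) and (16, 7)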
# Construct the problem
w = cvx.Variable((1, 16), nonneg=True)
objective = cvx.Minimize(cvx.sum_squares(X1 - w @ X0))
constraints = [cvx.sum(w) == 1]
prob = cvx.Problem(objective, constraints)
# The optimal objective value is returned by prob.solve()
result = prob.solve(verbose=True)
print('The optimal objective value: ',result,'\n\nWeights: ',w.value)
# The weights attributed to each country
result = pd.DataFrame({'Country':data[data.country!='West Germany'].country.unique(),
'Weight': np.round(w.value[0], decimals=3)})
result
# Comparing the error of the optimizer fit against the Abadie et al. (2010) fit
w_opt = np.array([0.22, 0, 0.42, 0, 0, 0, 0, 0.09, 0, 0.11, 0.16, 0, 0, 0, 0, 0]).reshape(1, 16)
data_compare = pd.DataFrame({'Optimizer':(w @ X0).value[0],
'Optimizer vs Treated': (w @ X0).value[0] - X1[0],
'Paper':(w_opt @ X0)[0],
'Paper vs Treated': (w_opt @ X0)[0] - X1[0],
'Treated':X1[0]}, index= data.columns[4:])
data_compare
print('Error using CVXPY: {} \nError in the original paper: {}'\
.format(sum((w.value @ X0 - X1)[0]), sum((w_opt @ X0 - X1)[0])))
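Summing signed deviations lets positive and negative errors cancel, so as a supplementary check (not part of the original comparison), an RMSE over the covariates may be more informative:
# Supplementary check: RMSE over the covariates avoids sign cancellation
rmse_cvx = np.sqrt(np.mean((w.value @ X0 - X1) ** 2))
rmse_paper = np.sqrt(np.mean((w_opt @ X0 - X1) ** 2))
print('RMSE (CVXPY): {:.2f}\nRMSE (paper): {:.2f}'.format(rmse_cvx, rmse_paper))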
import plotly.graph_objs as go
fig = go.Figure()
fig.add_trace(go.Scatter(x=list(range(1960, 1990)), y=w.value[0] @ y_ctr,
mode='lines', name='Optimizer'))
fig.add_trace(go.Scatter(x=list(range(1960, 1990)), y=(w_opt @ y_ctr)[0],
mode='lines', name='Original Paper'))
fig.update_layout(title='Synthetic West Germany<br>Optimizer vs. Paper',
xaxis_title='Time', yaxis_title='GDP per Capita (USD)')
fig.show()
fig = go.Figure()
fig.add_trace(go.Scatter(x=list(data.year.unique()), y=w.value[0] @ y_ctr_all,
mode='lines', name='Optimizer'))
fig.add_trace(go.Scatter(x=list(data.year.unique()), y=y_tr_all[0],
mode='lines', name='Treated unit'))
fig.add_shape(dict(type="line", x0=1990, y0=0, x1=1990, y1=1,
                   xref='x', yref='paper', line=dict(color="Black", width=1)))
fig.add_trace(go.Scatter(x=[1990], y=[30000], mode="text",
name="Reunification", text=["German<br>Reunification"]))
fig.update_layout(title='Synthetic West Germany<br>Optimizer vs. Treated unit',
xaxis_title='Time', yaxis_title='GDP per Capita (USD)')
fig.show()
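With the pre-treatment fit in hand, the treatment effect $\hat{\alpha}$ from the methodology section can be read off as the post-1990 gap. A minimal sketch using the arrays built above:
# Estimated yearly treatment effect: gap between West Germany and
# its synthetic counterpart after reunification
years = np.array(data.year.unique())
gap = y_tr_all[0] - w.value[0] @ y_ctr_all
effect = pd.DataFrame({'Year': years, 'Gap': np.round(gap, 1)})
effect[effect.Year >= 1990].head()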
# Transposed matrices for the weighted formulation
X0_T = X0.T                  # shape (7, 16): covariates x control units
X1_T = X1.T                  # shape (7, 1)
y_ctr_T = y_ctr.T            # shape (30, 16); .T (not reshape) keeps each column a single country's GDP series
y_tr_T = y_tr.T              # shape (30, 1)
def total_loss(V_diag, details=False):
    '''
    Finds W that minimizes the covariate loss given V,
    returns the pre-treatment outcome loss and the weights
    '''
    V = np.zeros(shape=(7, 7))
    np.fill_diagonal(V, V_diag)
    # Construct the inner (convex) problem: fit W given the importances V
    w = cvx.Variable((16, 1), nonneg=True)
    objective = cvx.Minimize(cvx.sum(V @ cvx.square(X1_T - X0_T @ w)))
    constraints = [cvx.sum(w) == 1]
    problem = cvx.Problem(objective, constraints)
    result = problem.solve(verbose=details)
    # Outer loss: squared error on the pre-treatment outcome (GDP)
    return float((y_tr_T - y_ctr_T @ w.value).T @ (y_tr_T - y_ctr_T @ w.value)), w
# Naive attempt to sidestep the DCP rules via random search over V
record = []
for i in range(1000):
    if (i + 1) % 100 == 0: print('{}%'.format((i + 1) / 10))
    # Draw a random importance vector from the simplex
    new_V = np.random.dirichlet(np.ones(7), size=1)
    record.append([total_loss(new_V)[0], new_V])
results = pd.DataFrame(record, columns=['Error', 'Importance'])\
          .sort_values(by='Error', ascending=True)
results = results.reset_index()
results.head(10)
# Refit with the best importance vector found; look the column up by name,
# since reset_index() shifts the positional indices
W = total_loss(results.iloc[0]['Importance'], details=True)[1].value
pd.DataFrame({'Country':data[data.country!='West Germany'].country.unique(),
'Weight': np.round(W.T[0], decimals=3)})
pd.DataFrame({'Covariate': data.columns[4:],
'Weight': np.round(results['Importance'][0][0], 3)})
fig = go.Figure()
fig.add_trace(go.Scatter(x=list(range(1960, 1990)), y=(W.T @ y_ctr)[0],
mode='lines', name='Optimizer'))
fig.add_trace(go.Scatter(x=list(range(1960, 1990)), y=y_tr[0],
mode='lines', name='Treated Unit'))
fig.update_layout(title='Synthetic West Germany<br>Optimizer vs. Treated Unit',
xaxis_title='Time', yaxis_title='GDP per Capita (USD)')
fig.show()
fig = go.Figure()
fig.add_trace(go.Scatter(x=list(data.year.unique()), y=(W.T @ y_ctr_all)[0],
mode='lines', name='Optimizer'))
fig.add_trace(go.Scatter(x=list(data.year.unique()), y=y_tr_all[0],
mode='lines', name='Treated unit'))
fig.add_shape(dict(type="line", x0=1990, y0=0, x1=1990, y1=1,
                   xref='x', yref='paper', line=dict(color="Black", width=1)))
fig.add_trace(go.Scatter(x=[1990], y=[30000], mode="text",
name="Reunification", text=["German<br>Reunification"]))
fig.update_layout(title='Synthetic West Germany<br>Optimizer vs. Treated unit',
xaxis_title='Time', yaxis_title='GDP per Capita (USD)')
fig.show()
# Comparing the error of the optimizer fit against the Abadie et al. (2010) fit
data_compare = pd.DataFrame({'Optimizer':(W.T @ X0)[0],
'Optimizer vs Treated': (W.T @ X0 - X1)[0],
'Paper':(w_opt @ X0)[0],
'Paper vs Treated': (w_opt @ X0 - X1)[0],
'Treated':X1[0]}, index= data.columns[4:])
data_compare
print('Error using CVXPY: {} \nError in the original paper: {}'\
.format(sum((W.T @ X0 - X1)[0]), sum((w_opt @ X0 - X1)[0])))
def total_loss_func(V_diag):
    '''
    Finds W that minimizes the covariate loss given V,
    returns the pre-treatment outcome loss only (a scalar for the optimizer)
    '''
    V = np.zeros(shape=(7, 7))
    np.fill_diagonal(V, V_diag)
    # Construct the problem
    w = cvx.Variable((16, 1), nonneg=True)
    objective = cvx.Minimize(cvx.sum(V @ cvx.square(X1_T - X0_T @ w)))
    constraints = [cvx.sum(w) == 1]   # the weights must sum to one
    problem = cvx.Problem(objective, constraints)
    result = problem.solve(verbose=False)
    return float((y_tr_T - y_ctr_T @ w.value).T @ (y_tr_T - y_ctr_T @ w.value))
from scipy.optimize import differential_evolution
bounds = [(0, 1), (0, 1), (0, 1), (0, 1), (0, 1), (0, 1), (0, 1)]
result = differential_evolution(total_loss_func, bounds)
'{:.2e}'.format(result.fun)
gen_V = result.x / np.sum(result.x)
gen_V
pd.DataFrame({'Covariate': data.columns[4:], 'Weight': np.round(gen_V, 3)})
'{:.2e}'.format(total_loss(gen_V)[0])
gen_W = total_loss(gen_V)[1].value
pd.DataFrame({'Country':data[data.country!='West Germany'].country.unique(),
'Weight': np.round(gen_W.T[0], decimals=3)})
fig = go.Figure()
fig.add_trace(go.Scatter(x=list(range(1960, 1990)), y=(gen_W.T @ y_ctr)[0],
mode='lines', name='Optimizer'))
fig.add_trace(go.Scatter(x=list(range(1960, 1990)), y=y_tr[0],
mode='lines', name='Treated Unit'))
fig.update_layout(title='Synthetic West Germany<br>Optimizer vs. Treated Unit',
xaxis_title='Time', yaxis_title='GDP per Capita (USD)')
fig.show()
fig = go.Figure()
fig.add_trace(go.Scatter(x=list(data.year.unique()), y=(gen_W.T @ y_ctr_all)[0],
mode='lines', name='Optimizer'))
fig.add_trace(go.Scatter(x=list(data.year.unique()), y=y_tr_all[0],
mode='lines', name='Treated unit'))
fig.add_shape(dict(type="line", x0=1990, y0=0, x1=1990, y1=1,
                   xref='x', yref='paper', line=dict(color="Black", width=1)))
fig.add_trace(go.Scatter(x=[1990], y=[30000], mode="text",
name="Reunification", text=["German<br>Reunification"]))
fig.update_layout(title='Synthetic West Germany<br>Optimizer vs. Treated unit',
xaxis_title='Time', yaxis_title='GDP per Capita (USD)')
fig.show()
print('Error using CVXPY: {} \nError in the original paper: {}'\
.format(sum((gen_W.T @ X0 - X1)[0]), sum((w_opt @ X0 - X1)[0])))