sklearn
import numpy as np
import pandas as pd
from formulaic import Formula
import fastreg as fr
import fastreg.general as frg
from fastreg import I, R, C
from sklearn.linear_model import LinearRegression, PoissonRegressor, Lasso
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.cluster import KMeans
import xgboost as xgb
import matplotlib.pyplot as plt
# configure matplotlib output
import matplotlib as mpl
mpl.style.use('config/clean.mplstyle') # this loads my personal plotting settings
col = mpl.rcParams['axes.prop_cycle'].by_key()['color']
%matplotlib inline
# if you have an HD display
%config InlineBackend.figure_format = 'retina'
# some warnings can get annoying
import warnings
warnings.filterwarnings('ignore')
from tools import visual
sklearn Basics
Most sklearn models have a standardized interface. An estimator will have a fit method that takes X and y in the form of arrays. Once you've estimated the model, you can use the predict method, which takes X values and returns predicted y values. Certain estimators, especially classifiers, also have a combined fit_predict method that does both in a single step.
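Here's a minimal sketch of that interface on some made-up arrays (the names X_demo and y_demo are just for illustration):
# minimal sketch of the fit/predict interface on made-up arrays
X_demo = np.random.normal(size=(100, 2))                      # features: (n_samples, n_features)
y_demo = X_demo @ np.array([1.0, -2.0]) + np.random.normal(size=100)
model = LinearRegression().fit(X_demo, y_demo)                # estimate the model
y_hat = model.predict(X_demo)                                 # predicted y values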
This library won't do a lot of specification work for you and expects numeric arrays (sometimes called "design matrices"), though it does have a lot of preprocessing capabilities for numerical data. To get from something like a pandas DataFrame to numeric arrays, you'll need an intermediate library. Some options include patsy, formulaic, and my very own fastreg. In fact, formulaic is in some sense the evolution of patsy, so you might want to stick with that.
First let's generate some fake data to play with.
N = 10
data = pd.DataFrame({
'x1': np.arange(N),
'x2': np.random.normal(size=N)
})
data['y'] = 1 + 2*data['x1'] + 3*data['x2'] + np.random.normal(size=N)
data.head(5)
We can generate design matrices by creating a Formula object with an R (or "Wilkinson") style formula specification that references the columns of data. We then apply that to our DataFrame, yielding a design matrix for the endogenous variable (y) and the exogenous variables (X).
form = Formula('y ~ x1 + x2')
y, X = form.get_model_matrix(data)
X
Next let's try a setting with categorical variables. We'll generate a fairly large dataset here just to make sure things scale well.
Nb = 100_000
datab = pd.DataFrame({
'x': np.random.uniform(size=Nb),
'id1': np.random.randint(10, size=Nb),
'id2': np.random.randint(1000, size=Nb),
})
datab['y'] = 1 + 2*datab['x'] + datab['id1'] + 0.01*datab['id2'] + np.random.randn(Nb)
The syntax for specifying categoricals is to wrap the column name in C(), like a function call. The design matrix for a categorical variable isn't just a single column with the value, since the value very well may not be a number. Instead it's a "one-hot" encoded matrix, where each row is a unit vector with a one in the position corresponding to that row's label. So if the categorical was US states and our label was "NY", it would be a length-50 vector with 49 zeros and a one in the position corresponding to the "NY" label. Even in this case, the matrix would be mostly zeros, and for larger categoricals the problem would be even worse. For this reason, we ask it to return a sparse matrix.
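To make that concrete, here's a tiny sketch of one-hot encoding using pandas (conceptually similar to what formulaic does, except formulaic also drops a reference level and can return a sparse matrix):
# toy example of one-hot encoding a small categorical column
toy = pd.DataFrame({'id1': [0, 1, 2, 1]})
pd.get_dummies(toy['id1'], prefix='id1')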
forms = Formula('y ~ x + C(id1) + C(id2)')
yb, Xb = forms.get_model_matrix(datab, output='sparse')
Xb
Make sure you understand why the dimensions of X are what they are. Here's some info on the values for the first row.
nonz = Xb[0].toarray().squeeze().nonzero()[0]
print(nonz)
[Xb.model_spec.feature_names[i] for i in nonz]
Note that the way we encode categoricals is important. Since we have included an explicit constant (Intercept) term, if we had an indicator for each possible value of a categorical (say id1=0 through id1=9), our X matrix would be rendered singular, as the sum of the id1 columns would be the same as the constant column (all ones). So we need to drop one value for each categorical variable. Typically this is the first value encountered, though most libraries allow you to choose which value to drop, or to drop the most common value encountered. And below is the original data for the first row.
datab.iloc[0]
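To see the collinearity problem concretely, here's a small sketch with a toy dummy matrix (not the data above): keeping every dummy column alongside a constant column leaves the matrix rank deficient.
# toy check: a full set of dummies plus a constant column is rank deficient
ones = np.ones((4, 1))
dums = pd.get_dummies(pd.Series([0, 1, 2, 1])).to_numpy()  # no level dropped
X_bad = np.hstack([ones, dums])
np.linalg.matrix_rank(X_bad), X_bad.shape[1]  # rank 3 < 4 columns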
Let's look at the speed dimension.
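As a rough check (just a sketch, timings will vary by machine), we can time the sparse design-matrix construction from above with IPython's %time magic.
%time _ = forms.get_model_matrix(datab, output='sparse')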
If you want to use fastreg, the syntax is pretty similar.
yn, yf, Xn, Xf = fr.design_matrices(y=R.y, x=I + R.x + C.id1 + C.id2, data=datab)
# yn, yf, Xn, Xf = fr.design_matrices(formula='y ~ 1 + x + C(id1) + C(id2)', data=datab) # equivalent R-style formula
First let's do our simple linear regression. Not so bad for N=10. Be aware of the intercept: the default is fit_intercept=True, but we already included an intercept in our design matrix, so we turn it off here.
reg = LinearRegression(fit_intercept=False).fit(X, y['y'])
reg.coef_
Next up are the sparse results. Note that in both cases y is returned as a matrix, so we need to turn it into just a column.
regb = LinearRegression(fit_intercept=False).fit(Xb, yb.toarray().squeeze())
regb.coef_[:11]
Our coefficients are all returned in one big vector, but we can extract the estimated values for the second categorical id2. Here we see that we get a pretty good fit, though the estimation has some trouble differentiating between the intercept and the base value of the categorical.
plt.plot(np.vstack([
(regb.coef_[0]-1) + regb.coef_[11:],
0.01*np.arange(1, 1000)
]).T);
Got integer or count data? Don't sacrifice your ideals! Run a Poisson regression. This is but one option in the broader class of GLMs (generalized linear models). These are models in which the outcome variable (y) follows a certain distribution (say a Poisson distribution), and the parameters of this distribution (say the Poisson rate) are linear functions (or non-linear functions of linear forms) of the exogenous (X) variables.
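Concretely, in the Poisson case with the canonical log link (which is what PoissonRegressor below uses), the model is
$$ y_i \sim \mathrm{Poisson}(\lambda_i), \qquad \log \lambda_i = x_i' \beta $$
so the coefficients act multiplicatively on the rate.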
Let's use a handy routine from fastreg to generate some random Poisson data.
params = {'I': 0.1, 'x1': 0.3, 'x2': 0.6, 'id1': 1.0, 'id2': 1.0}
datap = fr.dataset(N=10_000, K1=10, K2=100, models='poisson', params=params)
datap.head(5)
Next we'll use fastreg again to generate the design matrices. Note that we could also have used formulaic here.
ypn, yp, Xpn, Xp = fr.design_matrices(y=R.p, x=I + R.x1 + R.x2 + C.id1 + C.id2, data=datap)
Finally, let's run the Poisson estimation. Note that GLM estimation isn't quite as friendly and stable as your typical OLS, as it's running some kind of maximum likelihood or moment routine in the background. For instance, the routine below won't actually succeed unless you add in alpha=1e-12; the default of alpha=0 leads to erroneous results.
regp = PoissonRegressor(fit_intercept=False, alpha=1e-12).fit(Xp, yp)
regp.coef_[:3]
Here we can again check the fit on the larger categorical and see that it's pretty good, though noisier than the OLS results.
plt.plot(np.vstack([
(regp.coef_[0]-params['I']) + regp.coef_[12:],
(params['id2']/100)*np.arange(1, 100)
]).T);
We can also run a full Poisson regression with fastreg alone, though it uses a slower maximum likelihood routine.
retp = fr.general.poisson(y=R.p, x=I+R.x1+R.x2+C.id1+C.id2, data=datap)
coefI = retp.loc['I', 'coeff']
coef2 = retp.filter(regex=r'id2=\d+', axis=0)['coeff']
Here again the coefficients line up pretty well.
fig, ax = plt.subplots()
(coef2+(coefI-params['I'])).plot(ax=ax);
ax.plot((params['id2']/100)*np.arange(1, 100));
Let's do a brief interlude on clustering before we get into the heavier inference models. With K-means clustering, for a given integer value of $K$, we wish to find the $K$ centroids in a given space that minimize the mean squared distance over all points to their nearest centroid. This is by far the most common spatial clustering technique. What we get out at the end is a list of centroids $c_k$ and an assignment of each point $x_i$ to its nearest centroid. The objective function for K-means is then
$$ \min_{c_1,\ldots,c_K} \ \frac{1}{N} \sum_{i=1}^N \min_k \, \lVert x_i - c_k \rVert^2 $$
The downside to K-means is that you have to pick $K$ in advance. You can fiddle around with different values, but there's no way to choose it automatically. Some other clustering algorithms go a bit in this direction, but there's no silver bullet.
Now let's generate some random clusters!
N, K = 1_000, 10
cent = np.random.normal(size=(K, 2))
grps = np.random.randint(K, size=N)
vals = cent[grps] + 0.2*np.random.normal(size=(N, 2))
fig, ax = plt.subplots()
ax.scatter(*vals.T, s=2, alpha=0.75)
ax.scatter(*cent.T, s=15)
ax.axis('square');
Running the estimation is pretty straightforward. There are a number of runtime parameters that are listed in the sklearn documentation. Finally, we extract the estimated centroid locations and labels (which centroid each point is assigned to, as integers from 0 to K-1).
retk = KMeans(n_clusters=K).fit(vals)
centk = retk.cluster_centers_
labsk = retk.labels_
Plotting the estimated labels, we can see that we did a pretty good job considering the level of overlap!
fig, ax = plt.subplots()
for k in range(K):
ax.scatter(*vals[labsk==k].T, s=2, alpha=0.75)
ax.scatter(*centk.T, s=15, c='k')
ax.axis('square');
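Since $K$ has to be chosen up front, one rough heuristic (just a sketch, not part of the estimation above) is to rerun K-means over a range of values and look for an "elbow" in the within-cluster sum of squares (the inertia_ attribute):
# sketch: inertia (within-cluster sum of squares) across candidate values of K
ks = range(2, 16)
inertias = [KMeans(n_clusters=k).fit(vals).inertia_ for k in ks]
fig, ax = plt.subplots()
ax.plot(ks, inertias, marker='o')
ax.set_xlabel('K')
ax.set_ylabel('inertia');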
Next we'll move on to some of the more advanced regression options. These start to tackle some of the issues of model selection. In the regular OLS world we just kind of choose our regressors and hope for the best. Handling things like non-linearities and interaction terms is up to our intuition. This more advanced set of estimators mostly uses decision trees and bootstrapping or model averaging to implicitly do model selection and achieve much better prediction outcomes with minimal fine tuning. The cost is that they are much harder to interpret: you don't just get a regression coefficient that you can point to.
We need to make some data sufficiently complex to make the more advanced learning methods sweat a little bit. This would include things like non-linearities and interaction terms. Below, I'm putting in a cubic term in x1 and an interaction term whereby the effect of x2 depends on the value of the categorical id1.
Nrf = 5_000
datarf = pd.DataFrame({
'x1': np.random.normal(size=Nrf),
'x2': np.random.normal(size=Nrf),
'id1': np.random.randint(10, size=Nrf)
})
poly = lambda x: x**3 - x**2
datarf['y'] = 1 + poly(datarf['x1']) + datarf['id1']*datarf['x2'] + 0.1*np.random.normal(size=Nrf)
yrfn, yrf, Xrfn, Xrf = fr.design_matrices(y=R.y, x=I + R.x1 + R.x2 + C.id1, data=datarf)
Same basic routine for making design matrices. Note that we just put in the raw variables here; figuring out non-linearities and interactions is up to the estimator. Our main hyperparameter here is the max_depth option, which determines the maximum depth of each decision tree in the forest. Try playing around with different numbers to see how it does; there's a quick sketch of that comparison after the fit below.
regrf = RandomForestRegressor(max_depth=10).fit(Xrf, yrf)
visual.plot_predict(datarf['y'], regrf.predict(Xrf), clip=100, smooth=50)
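Here's a quick sketch of that kind of comparison (the depth grid and the train/test split are just illustrative), scoring each depth on a held-out set:
# sketch: compare out-of-sample fit across tree depths
from sklearn.model_selection import train_test_split
Xtr, Xte, ytr, yte = train_test_split(Xrf, yrf, random_state=0)
for d in [2, 5, 10, 20]:
    score = RandomForestRegressor(max_depth=d).fit(Xtr, ytr).score(Xte, yte)
    print(f'max_depth={d}: R2 = {score:.3f}')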
Now we can compare this to how a regular (and probably misspecified) linear regression would do.
retrf1 = fr.ols(y=R.y, x=I+R.x1+R.x2+C.id1, data=datarf, output='dict')
visual.plot_predict(datarf['y'], retrf1['y_hat'], clip=100, smooth=50)
Okay, what if we magically knew the true underlying model? Well, then we can do really well!
retrf2 = fr.ols(y=R.y, x=I+R('x1**2')+R('x1**3')+C.id1*R.x2, data=datarf, output='dict')
visual.plot_predict(datarf['y'], retrf2['y_hat'], clip=100, smooth=50)
Finally, we have gradient boosting. This achieves similar goals to a random forest, but often delivers better prediction accuracy. Because it operates using decision trees (like random forests), it's sort of dual-use as both a classifier and a regressor. We can use the same data as before and check the observed fit. It looks pretty similar to the random forest in this case.
regrf = GradientBoostingRegressor().fit(Xrf, yrf)
visual.plot_predict(datarf['y'], regrf.predict(Xrf), clip=100, smooth=50)
There's another well-known gradient boosting package called xgboost that is widely used in industry. Conveniently, it has the same interface as an sklearn estimator. It has a ton of different options you can use, but below you can see that with the default options it does extremely well and is quite fast.
retxg = xgb.XGBRegressor().fit(Xrf, yrf)
visual.plot_predict(datarf['y'], retxg.predict(Xrf), clip=100, smooth=50)
Almost forgot about Lasso. It actually does pretty badly in this case, but that's probably because we haven't given it any non-linear terms to work with.
retla = Lasso(alpha=0.1).fit(Xrf, yrf)
visual.plot_predict(datarf['y'], retla.predict(Xrf), clip=100, smooth=50)
Let's see what happens if we manually introduce all cubic interaction terms and let Lasso figure it out. We're doing some advanced fastreg here too. If we set $\alpha$ very low, we reduce to properly specified OLS, but in practice we would want to use a non-zero $\alpha$.
form = sum([(R.x1+R.x2+C.id1)**k for k in range(1, 4)], I)
Xrfn1, Xrf1 = fr.design_matrix(form, data=datarf)
retla1 = Lasso(alpha=0.1).fit(Xrf1, yrf)
visual.plot_predict(datarf['y'], retla1.predict(Xrf1), clip=100, smooth=50)
sla1 = pd.Series(retla1.coef_, index=Xrfn1)
sla1[sla1.abs()>1e-3]
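Rather than hand-picking $\alpha$, we could also let cross-validation choose it. Here's a quick sketch using sklearn's LassoCV (not used above, just one option):
# sketch: choose the regularization strength by cross-validation
from sklearn.linear_model import LassoCV
retla2 = LassoCV(cv=5).fit(Xrf1, yrf)
retla2.alpha_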