For a lot of these I'm going to go into way more detail than I'd expect in a regular answer. But you know, it's a teachable moment, and I'm a teacher.
# configure matplotlib output
import matplotlib.pyplot as plt
import matplotlib as mpl
# numpy and pandas get used throughout below
import numpy as np
import pandas as pd
mpl.style.use('config/clean.mplstyle') # this loads my personal plotting settings
col = mpl.rcParams['axes.prop_cycle'].by_key()['color']
%matplotlib inline
# if you have an HD display
%config InlineBackend.figure_format = 'retina'
Two strings are anagrams of one another if the histograms over their characters are the same. Since strings are just lists of characters, we can make quick work of this. I'm going to "cheat" a little bit and avail myself of some of the lesser known (but very useful) parts of the core Python libraries. In this case we'll use `Counter` from the `collections` module.
from collections import Counter
Since `Counter` is a class, by calling it we invoke the constructor, which creates a new instance. The `Counter` constructor takes a `list` or `list`-like thing such as a `str` and turns it into a dictionary of counts (effectively a histogram). Here's an example
Counter('hello world')
Since the `Counter` class overloads the equality operator (take a look at its `__eq__` function), we can directly check equality, which will return `True` when all the counts line up exactly. I'm also going to say we'll ignore case, since that's in line with the colloquial definition of anagrams. So it's a one-liner!
def check_anagram(a, b):
    return Counter(a.lower()) == Counter(b.lower())
Let's check that it works in a few cases, particularly edge cases
print(check_anagram('Doug Hanley', 'Doug Cranley'))
print(check_anagram('Doug Hanley', 'Gayle Hound'))
print(check_anagram('hello world!', ''))
print(check_anagram('', ''))
There are many reasonable ways to do this. The main way not to do it is to create an empty list (`[]`) and then have a for loop populate it with `append`. It's just inelegant, and slower too. Which usually won't matter, until it does. First we'll need some imports
from itertools import chain, repeat, product
Now let's go through some options. I can think of basically three distinct ones.
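But before those, for contrast, here's a rough sketch of the discouraged append-loop version (my own illustration, assuming the same input format of `(value, count)` pairs used below), just so we have a baseline:

# the discouraged approach: start with an empty list and append in a loop
def repeat_list_append(xns):
    out = []
    for x, n in xns:
        for _ in range(n):
            out.append(x)
    return out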
# the nested for loop approach, short and sweet
def repeat_list_0(xns):
    return [x for x, n in xns for _ in range(n)]

# the functional approach, elegant but baroque
def repeat_list_1(xns):
    return list(chain.from_iterable(repeat(x, n) for x, n in xns))

# here we used the sum function instead of chain
def repeat_list_2(xns):
    return sum(([x for _ in range(n)] for x, n in xns), [])
And now let's test it. I'm actually going to use the fact that this is a sort of pseudo-inverse of the `Counter` function to verify the results.
xns_list = [[(3, 5), (2, 3), (10, 1)], [(3, 5), (2, 0), (10, 1)], [(3, 0)], []]
func_list = [repeat_list_0, repeat_list_1, repeat_list_2]

for func, xns in product(func_list, xns_list):
    reps = func(xns)
    xns0 = [(x, n) for x, n in xns if n > 0]
    xns1 = list(Counter(reps).items())
    assert xns0 == xns1
## `dict` Merge

You know what time it is? It's `defaultdict` time! A `defaultdict` is like a normal `dict`, but when you try to access a key that doesn't exist, it creates a new entry for that key (using a default factory you provide) and returns it. Let's import it first
from collections import defaultdict
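To make that behavior concrete, here's a quick illustration with a `list` factory (just a toy example):

dd = defaultdict(list)  # list is the factory called for missing keys
dd['a'].append(1)       # 'a' doesn't exist yet, so it's created as [] first
dd['a'].append(2)
dd['b']                 # even a bare lookup creates an entry
print(dict(dd))         # {'a': [1, 2], 'b': []}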
Now let's implement this operation. Note that you have to give `defaultdict` a function that returns the default value, not the default value itself.
def dict_merge_0(ds):
    out = defaultdict(lambda: [])
    for d in ds:
        for k, v in d.items():
            out[k].append(v)
    return dict(out)
And it seems to work
ds_list = [
    [{1: 2, 2: 3}, {1: 3, 2: 5}],
    [{1: 2, 2: 3}, {1: 3, 2: 5, 10: 20}],
    []
]

for ds in ds_list:
    print(dict_merge_0(ds))
Note that this will produce lists of different sizes when keys are missing from some dicts. So we might want to fill those in with `None` instead, to preserve the order (and hence make this operation invertible). That will also lead us to a solution that is potentially more performant.
def dict_merge_1(ds):
    keys = set().union(*(set(d) for d in ds))
    return {k: [d.get(k, None) for d in ds] for k in keys}
And now we get the expected behavior.
for ds in ds_list:
    print(dict_merge_1(ds))
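To see why the `None` padding makes this invertible, here's a rough sketch of the reverse operation (the name `dict_unmerge` is my own, and it assumes `None` never appears as a real value):

def dict_unmerge(dm):
    # each value list has one slot per original dict
    n = len(next(iter(dm.values()), []))
    return [
        {k: v[i] for k, v in dm.items() if v[i] is not None}
        for i in range(n)
    ]

# round-trips on all of our test cases
for ds in ds_list:
    assert dict_unmerge(dict_merge_1(ds)) == ds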
## Ordered `set`

I actually happened to need such a thing the other day. Here's a basic implementation that creates a new subclass of `list` and implements a new method called `add` (which is the name `set` uses), which just checks that you don't introduce any duplicates while preserving the order.
class OrderedSet(list):
    def __init__(self, data=()):
        super().__init__(data)

    def add(self, key):
        if key not in self:
            super().append(key)
Now let's add some elements and see what happens in the case of a duplicate add.
os = OrderedSet()
os.add(5)
os.add(3)
os.add(10)
os.add(3)
print(os)
Now let's do some profiling to see how fast this is
%%time
os = OrderedSet()
for i in range(50_000):
    os.add(i)
One thing that will be slow about the above implementation is that checking for membership in `add` will be quite taxing for large data sizes. So let's change it up by keeping a `set` version of the data as well, which has much faster membership checks. So we're sacrificing memory usage for speed here.
class OrderedSetFast(list):
    def __init__(self, data=()):
        super().__init__(data)
        self._set = set(data)

    def add(self, key):
        if key not in self._set:
            self._set.add(key)
            super().append(key)
And it is indeed much (500x) faster now!
%%time
os = OrderedSetFast()
for i in range(50_000):
    os.add(i)
## `numpy`

This is a common database-style function where we want to find aggregations of the data at some group level. We know `pandas` already does this really well, but first let's implement it in pure `numpy`.
def group_sums(x, i):
    return {v: x[i==v].sum() for v in np.unique(i)}
x = np.linspace(0, 1, 20)
i = np.array([i for i in range(5) for _ in range(4)])
group_sums(x, i)
There may be a faster way though! Let's look into some other `numpy` functions here. First, in the process of doing `unique`, we're also doing a lot of the work involved in finding out where each of the group values is located, so we can use one of the special flags (`return_inverse`) to return that information, in addition to the unique values. Try running `np.unique` on your own to see how that works. Then we use `bincount`, which will aggregate over consecutive integer indices with optional weights.
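Here's a small illustration of those two pieces on some made-up numbers:

arr = np.array([5, 3, 5, 7, 3])
v, j = np.unique(arr, return_inverse=True)
print(v)  # the sorted unique values: [3 5 7]
print(j)  # the index of each element of arr within v: [1 0 1 2 0]
# bincount sums the weights landing on each integer index
print(np.bincount(j, weights=np.array([1.0, 2.0, 3.0, 4.0, 5.0])))  # [7. 4. 4.]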
def group_sums_fast(x, i):
    v, j = np.unique(i, return_inverse=True)
    s = np.bincount(j, weights=x)
    return dict(zip(v, s))
Now let's test it on a monster dataset and see how each version fares.
xl = np.linspace(0, 1, 100_000)
il = np.array([i for i in range(10_000) for _ in range(10)])
%time _ = group_sums(xl, il)
%time _ = group_sums_fast(xl, il)
Okay, so these gains are pretty niche in absolute terms, but still it's a 100x speedup! I wonder how `pandas` does here
%time _ = pd.Series(xl, index=il).groupby(level=0).sum()
Actually a bit faster than the fast `numpy` version! Though if you include an additional `to_dict()` at the end, it's a bit slower, so I'd say these are basically tied.
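For completeness, that variant is just the same line with the conversion tacked on:

%time _ = pd.Series(xl, index=il).groupby(level=0).sum().to_dict()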
First let's make a random walk generator
def random_walk(n):
    return np.random.normal(size=n).cumsum()
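Just to get a feel for what these look like, here's a quick throwaway plot of a few walks (my own aside, using the matplotlib setup from the top):

fig, ax = plt.subplots()
for _ in range(3):
    ax.plot(random_walk(500), linewidth=1)
ax.set_xlabel('step');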
I'm going to use `seaborn`, a well-established fancy plotting library with some R-like features.
import seaborn as sns
Now we can generate a bunch of random walks and see how they're correlated. I'm going to draw many independent pairs of walks, compute the correlation coefficient for each pair, and look at the distribution of those correlations.
n = 1_000
corrs = [np.corrcoef(random_walk(n), random_walk(n))[0, 1] for _ in range(1_000)]
ax = sns.kdeplot(corrs, fill=True); ax.set_xlim(-1, 1);
## `pandas`

First let's generate some fake cross-country style data. I'm going to randomly drop some years to make things tricky.
countries = ['USA', 'CHN', 'ZAF', 'ARG', 'NOR']
ymin, ymax = 1950, 2022
mean, stdv = 0.01, 0.05
def random_drift(n, mean, stdv):
    return (stdv*np.random.normal(size=n)+mean).cumsum()
df_long = pd.DataFrame(product(countries, range(ymin, ymax)), columns=['Country', 'Year'])
df_long['Population'] = np.exp(random_drift(len(df_long), mean, stdv))
df_long['GDP'] = np.exp(random_drift(len(df_long), mean, stdv))
df_long = df_long.sample(frac=0.75)
df_long = df_long.sort_values(by=['Country', 'Year'], ignore_index=True)
df_long.head(5)
Now let's handle the reshaping operation. There's one way to do this that will save you a lot of grief once you understand it: get all the rows and columns you want to deal with into the index, then `unstack` the things you want to be columns.
def long_to_wide(df, rows, cols, vals=None):
    # assemble variables
    rows = [rows] if type(rows) is not list else rows
    cols = [cols] if type(cols) is not list else cols

    # do reshape
    df = df.set_index([*rows, *cols])
    if vals is not None:
        df = df[vals]
    df = df.unstack(cols)

    # return wide
    return df
Now we can test it out on our generated data, and indeed it seems to work.
df_wide = long_to_wide(df_long, 'Year', 'Country', 'GDP')
df_wide
There is also a built-in function to do this in `pandas` called `pivot`. This has basically the same syntax as the function we just made. Still, in some cases it's easier to do it the old-fashioned way.
df_long.pivot(index='Year', columns='Country', values='GDP')
I'm going to change the question slightly and say we're looking for ±1 stdev breakouts. The function below will return a `DataFrame` of booleans saying whether we've exceeded these bounds. Note that in the case where the data is missing for that year, it will return `False`, which seems reasonable. Note the usage of `min_periods=1` here is so that our `NaN` values don't propagate too much.
def bollinger_escape(df, K):
    roll = df.rolling(K, min_periods=1)
    rmean, rstdv = roll.mean(), roll.std()
    return (df - rmean).abs() > rstdv
bollinger_escape(df_wide, 5).head(5)
We can visualize this situation by plotting the data series and their (1 stdv) Bollinger bands. Note that we're dealing with a multi-level column index here, which is always kind of annoying, but be sure to understand what `concat` and `swaplevel` are doing here.
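Roughly speaking (my own aside): `pd.concat` with a dict on `axis=1` adds the dict keys as a new top level of the column index, giving column tuples like `('value', 'ARG')`, and `swaplevel(0, 1, axis=1)` flips them to `('ARG', 'value')` so each country's value/low/high columns end up grouped together. A tiny check of that claim:

df_demo = pd.concat({'value': df_wide, 'high': df_wide}, axis=1)
print(df_demo.columns[:2].tolist())                          # two-level tuples with 'value'/'high' on top
print(df_demo.swaplevel(0, 1, axis=1).columns[:2].tolist())  # now the country code comes first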
roll = df_wide.rolling(5, min_periods=1)
rmean, rstdv = roll.mean(), roll.std()
rhi, rlo = rmean + rstdv, rmean - rstdv
df_roll = pd.concat({'value': df_wide, 'high': rhi, 'low': rlo}, axis=1)
df_roll = df_roll.swaplevel(0, 1, axis=1).sort_index(axis=1)
df_roll.head(5)
Let's do a little fancy plot here for Argentina.
ser = df_roll['ARG']
fig, ax = plt.subplots()
ax.plot(ser.index, ser['value'], c=col[1], marker='o', markersize=2.5)
ax.plot(ser.index, ser['low'], linewidth=1, c=col[0])
ax.plot(ser.index, ser['high'], linewidth=1, c=col[0])
ax.fill_between(ser.index, ser['low'], ser['high'], color=col[0], alpha=0.25);
## `statsmodels`

We'll use the `statsmodels` formula interface here.
import statsmodels.formula.api as smf
Let's simulate some random normal data. I'm going to induce some correlation between `x1` and `x2` so that our mis-specified model is wrong in an interesting way.
n = 10_000
df_sim = pd.DataFrame({
    'x1': 0.1*np.random.normal(size=n)
})
df_sim['x2'] = df_sim['x1'] + 0.1*np.random.normal(size=n)
df_sim['y'] = 1 + 2*df_sim['x1'] + 3*df_sim['x2'] + 0.1*np.random.normal(size=n)
df_sim.head(5)
Now we run the properly specified regression
smf.ols('y ~ 1 + x1 + x2', df_sim).fit().params
And then an improperly specified one. Here let's omit `x1` as a regressor.
smf.ols('y ~ 1 + x2', df_sim).fit().params
Which is expected due to the positive correlation between `x1` and `x2`. By the usual omitted variable bias formula, we would anticipate an estimated value of $\hat{\beta}_2 = \beta_2 + \frac{\operatorname{Cov}(x_1, x_2)}{\operatorname{Var}(x_2)} \beta_1 = 4$, which is nearly exactly what we find.
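Spelling out the arithmetic for the simulation above: $\operatorname{Var}(x_2) = \operatorname{Var}(x_1) + 0.1^2 = 0.02$ and $\operatorname{Cov}(x_1, x_2) = \operatorname{Var}(x_1) = 0.01$, so the bias term is $(0.01 / 0.02) \cdot \beta_1 = 0.5 \cdot 2 = 1$, giving $\hat{\beta}_2 \approx 3 + 1 = 4$.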
Let's add in some Poisson data to our existing `DataFrame`, and let's in fact use the same linear part of the model, which is to say $\mathbb{E}[p \mid y] = \exp(y)$.
df_sim['p'] = np.random.poisson(np.exp(df_sim['y']), size=n)
df_sim.head(5)
Now we again run a properly specified Poisson regression.
smf.poisson('p ~ 1 + x1 + x2', data=df_sim).fit().params
And finally we do the mis-specified version
smf.poisson('p ~ 1 + x2', data=df_sim).fit().params
Same basic results!