Problem Set 0: Python Exercises

For a lot of these I'm going to go into way more detail than I'd expect on a regular answer. But you know, it's a teachable moment, and I'm a teacher.

In [1]:
# global imports and matplotlib configuration
import numpy as np # numpy and pandas are used throughout
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib as mpl
mpl.style.use('config/clean.mplstyle') # this loads my personal plotting settings
col = mpl.rcParams['axes.prop_cycle'].by_key()['color']
%matplotlib inline
In [2]:
# if you have an HD display
%config InlineBackend.figure_format = 'retina'

Q1: Core Python

(a) Anagrams

Two strings are anagrams of one another if the histograms over their characters are the same. Since strings are just sequences of characters, we can make quick work of this. I'm going to "cheat" a little bit and avail myself of some of the lesser known (but very useful) parts of the core Python library. In this case we'll use Counter from the collections module.

In [3]:
from collections import Counter

Since Counter is a class, by calling it we invoke the constructor, which creates a new instance. The Counter constructor takes a list or list-like thing such as a str and turns it into a dictionary of counts (effectively a histogram). Here's an example:

In [4]:
Counter('hello world')
Out[4]:
Counter({'h': 1, 'e': 1, 'l': 3, 'o': 2, ' ': 1, 'w': 1, 'r': 1, 'd': 1})

Since the Counter class overloads the equality operator (take a look at its __eq__ method), we can directly check equality, which will return True when all the counts line up exactly. I'm also going to say we'll ignore case, since that's in line with the colloquial definition of an anagram. So it's a one-liner!

In [5]:
def check_anagram(a, b):
    return Counter(a.lower()) == Counter(b.lower())

Let's check that it works in a few cases, particularly edge cases

In [6]:
print(check_anagram('Doug Hanley', 'Doug Cranley'))
print(check_anagram('Doug Hanley', 'Gayle Hound'))
print(check_anagram('hello world!', ''))
print(check_anagram('', ''))
False
True
False
True

(b) Repeater

There are many reasonable ways to do this. The main way not to do it is to create an empty list ([]) and then have a for loop populate it with append (sketched below for comparison). It's inelegant and slower too, which usually won't matter, until it does. First we'll need some imports.

In [7]:
from itertools import chain, repeat, product

Now let's go through some options. I can think of basically three distinct ones.

In [8]:
# the nested for loop approach, short and sweet
def repeat_list_0(xns):
    return [x for x, n in xns for _ in range(n)]
# the functional approach, elegant but baroque
def repeat_list_1(xns):
    return list(chain.from_iterable(repeat(x, n) for x, n in xns))
# here we use the sum function instead of chain
def repeat_list_2(xns):
    return sum(([x for _ in range(n)] for x, n in xns), [])
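
For comparison, here's a sketch of the append-in-a-loop version mentioned above. It gives the same answer, it's just the style (and speed) we're trying to avoid, so I won't include it in the tests below.

# the naive approach: start with an empty list and populate it with append
def repeat_list_naive(xns):
    out = []
    for x, n in xns:
        for _ in range(n):
            out.append(x)
    return out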

And now let's test them. I'm going to use the fact that this operation is a sort of pseudo-inverse of Counter to verify the results.

In [9]:
xns_list = [[(3, 5), (2, 3), (10, 1)], [(3, 5), (2, 0), (10, 1)], [(3, 0)], []]
func_list = [repeat_list_0, repeat_list_1, repeat_list_2]
for func, xns in product(func_list, xns_list):
    reps = func(xns)
    xns0 = [(x, n) for x, n in xns if n > 0]
    xns1 = list(Counter(reps).items())
    assert(xns0 == xns1)

(c) dict Merge

You know what time it is? It's defaultdict time! A defaultdict is like a normal dict, but when you try to access a key that doesn't exist, it creates a new entry from a default factory you provide and returns that. Let's import it first.

In [10]:
from collections import defaultdict
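
As a quick illustration of the behavior (just a sketch, not part of the solution), accessing a missing key silently creates and returns the default, here an empty list. Passing list (the class itself) works as the factory, just like the lambda used below.

# each missing key gets a fresh empty list, so we can append immediately
dd = defaultdict(list)
dd['a'].append(1)
dd['a'].append(2)
dd['b'].append(3)
print(dict(dd)) # {'a': [1, 2], 'b': [3]}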

Now let's implement this operation. Note that you have to give a function that returns the default value to defaultdict, not the default value itself.

In [11]:
def dict_merge_0(ds):
    out = defaultdict(lambda: [])
    for d in ds:
        for k, v in d.items():
            out[k].append(v)
    return dict(out)

And it seems to work

In [12]:
ds_list = [
    [{1: 2, 2: 3}, {1: 3, 2: 5}],
    [{1: 2, 2: 3}, {1: 3, 2: 5, 10: 20}],
    []
]
for ds in ds_list:
    print(dict_merge_0(ds))
{1: [2, 3], 2: [3, 5]}
{1: [2, 3], 2: [3, 5], 10: [20]}
{}

Note that this will produce lists of different sizes when keys are missing from some of the dicts. So we might want to fill those in with None instead, to preserve the order (and hence make this operation invertible). That will also lead us to a solution that is potentially more performant.

In [13]:
def dict_merge_1(ds):
    keys = set().union(*(set(d) for d in ds))
    return {k: [d.get(k, None) for d in ds] for k in keys}

And now we get the expected behavior.

In [14]:
for ds in ds_list:
    print(dict_merge_1(ds))
{1: [2, 3], 2: [3, 5]}
{1: [2, 3], 2: [3, 5], 10: [None, 20]}
{}

(d) Ordered Set

I actually happened to need such a thing the other day. Here's a basic implementation that creates a new subclass of list and implements a new method called add (the name set uses), which refuses to introduce any duplicates while preserving insertion order.

In [15]:
class OrderedSet(list):
    def __init__(self, data=()):
        super().__init__(data)

    def add(self, key):
        if key not in self:
            super().append(key)

Now let's add some elements and see what happens in the case of a duplicate add.

In [16]:
os = OrderedSet()
os.add(5)
os.add(3)
os.add(10)
os.add(3)
print(os)
[5, 3, 10]

Now let's do some profiling to see how fast this is

In [17]:
%%time
os = OrderedSet()
for i in range(50_000):
    os.add(i)
CPU times: user 6.36 s, sys: 0 ns, total: 6.36 s
Wall time: 6.37 s

One thing that will be slow about the above implementation is that checking for membership in add will be quite taxing for large data sizes. So let's change it up by also keeping a set version of the data, which has much faster membership checks. We're sacrificing memory usage for speed here.

In [18]:
class OrderedSetFast(list):
    def __init__(self, data=()):
        super().__init__(data)
        self._set = set(data)

    def add(self, key):
        if key not in self._set:
            self._set.add(key)
            super().append(key)

And it is indeed much faster now (nearly 700x on this run)!

In [19]:
%%time
os = OrderedSetFast()
for i in range(50_000):
    os.add(i)
CPU times: user 8.38 ms, sys: 897 µs, total: 9.27 ms
Wall time: 9.26 ms

Q2: numpy

(a) Group Aggregation

This is a common database-style function where we want to find aggregations of the data at some group level. We know pandas already does this really well, but first let's implement it in pure numpy.

In [20]:
def group_sums(x, i):
    return {v: x[i==v].sum() for v in np.unique(i)}
In [21]:
x = np.linspace(0, 1, 20)
i = np.array([i for i in range(5) for _ in range(4)])
group_sums(x, i)
Out[21]:
{0: 0.3157894736842105,
 1: 1.1578947368421053,
 2: 2.0,
 3: 2.8421052631578947,
 4: 3.6842105263157894}

There may be a faster way though! Let's look into some other numpy functions here. First, in the process of doing unique, we're also doing a lot of the work involved in finding out where each of the group values is located, so we can use one of the special flags (return_inverse) to return that information, in addition to the unique values. Try running np.unique on your own to see how that works. Then we use bincount, which will aggregate over consecutive integer indices with optional weights.
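
Here's a tiny example (with made-up numbers) of what return_inverse and bincount are doing for us:

# the inverse array maps each element to the position of its value in the unique array
i_demo = np.array([3, 3, 5, 3, 5])
x_demo = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
v_demo, j_demo = np.unique(i_demo, return_inverse=True)
print(v_demo) # [3 5]
print(j_demo) # [0 0 1 0 1]
# bincount sums the weights landing in each integer bin
print(np.bincount(j_demo, weights=x_demo)) # [7. 8.]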

In [22]:
def group_sums_fast(x, i):
    v, j = np.unique(i, return_inverse=True)
    s = np.bincount(j, weights=x)
    return dict(zip(v, s))

Now let's test it on a monster dataset and see how each version fares.

In [23]:
xl = np.linspace(0, 1, 100_000)
il = np.array([i for i in range(10_000) for _ in range(10)])
In [24]:
%time _ = group_sums(xl, il)
CPU times: user 217 ms, sys: 43 µs, total: 217 ms
Wall time: 217 ms
In [25]:
%time _ = group_sums_fast(xl, il)
CPU times: user 2.48 ms, sys: 78 µs, total: 2.55 ms
Wall time: 2.43 ms

Okay, so these gains are pretty niche in absolute terms, but still it's a 100x speedup! I wonder how pandas does here

In [26]:
%time _ = pd.Series(xl, index=il).groupby(level=0).sum()
CPU times: user 1.81 ms, sys: 0 ns, total: 1.81 ms
Wall time: 1.7 ms

Actually a bit faster than the fast numpy version! Though if you include an additional to_dict() at the end, it's a bit slower, so I'd say these are basically tied.
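
For reference, a dict-returning pandas version (just a sketch, wrapped so its output is directly comparable to the numpy implementations) would look like this:

# same pandas approach, wrapped to return a plain dict like group_sums does
def group_sums_pandas(x, i):
    return pd.Series(x, index=i).groupby(level=0).sum().to_dict()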

(b) Correlations

First let's make a random walk generator

In [27]:
def random_walk(n):
    return np.random.normal(size=n).cumsum()

I'm going to use seaborn, a well-established (and fancy) plotting library with some R-like features.

In [28]:
import seaborn as sns

Now we can generate a bunch of random walks and see how they're correlated. I'm going to draw many pairs of independent walks, compute the correlation coefficient for each pair, and plot the distribution of those coefficients.

In [29]:
n = 1_000
corrs = [np.corrcoef(random_walk(n), random_walk(n))[0, 1] for _ in range(1_000)]
ax = sns.kdeplot(corrs, fill=True); ax.set_xlim(-1, 1);
[Figure: kernel density estimate of the correlation coefficients between pairs of independent random walks]

Q3: pandas

(a) Reshaping

First let's generate some fake cross-country style data. I'm going to randomly drop some years to make things tricky.

In [30]:
countries = ['USA', 'CHN', 'ZAF', 'ARG', 'NOR']
ymin, ymax = 1950, 2022
mean, stdv = 0.01, 0.05
In [31]:
def random_drift(n, mean, stdv):
    return (stdv*np.random.normal(size=n)+mean).cumsum()
In [32]:
df_long = pd.DataFrame(product(countries, range(ymin, ymax)), columns=['Country', 'Year'])
df_long['Population'] = np.exp(random_drift(len(df_long), mean, stdv))
df_long['GDP'] = np.exp(random_drift(len(df_long), mean, stdv))
df_long = df_long.sample(frac=0.75)
df_long = df_long.sort_values(by=['Country', 'Year'], ignore_index=True)
df_long.head(5)
Out[32]:
Country Year Population GDP
0 ARG 1952 9.570824 44.444957
1 ARG 1953 8.983750 44.170588
2 ARG 1954 9.206018 43.138836
3 ARG 1955 10.016157 42.722817
4 ARG 1956 10.490452 47.379580

Now let's handle the reshaping operation. There's one approach that will save you a lot of grief once you understand it: get all the rows and columns you want to deal with into the index, then unstack the levels you want to be columns.

In [33]:
def long_to_wide(df, rows, cols, vals=None):
    # assemble variables
    rows = [rows] if type(rows) is not list else rows
    cols = [cols] if type(cols) is not list else cols

    # do reshape
    df = df.set_index([*rows, *cols])
    if vals is not None:
        df = df[vals]
    df = df.unstack(cols)

    # return wide
    return df

Now we can test it out on our generated data, and indeed it seems to work.

In [34]:
df_wide = long_to_wide(df_long, 'Year', 'Country', 'GDP')
df_wide
Out[34]:
Country ARG CHN NOR USA ZAF
Year
1950 NaN NaN 87.142400 1.051588 9.857584
1951 NaN 2.549617 89.086559 1.021525 9.753643
1952 44.444957 2.614140 92.497893 1.118118 8.489252
1953 44.170588 2.464735 95.158492 NaN 8.461172
1954 43.138836 NaN 98.055941 1.125351 8.954090
... ... ... ... ... ...
2017 72.968384 11.195639 NaN 2.754084 36.586738
2018 NaN 10.378070 94.586793 NaN 36.591334
2019 NaN 10.870278 91.963998 2.678403 36.660267
2020 82.525899 9.462256 98.514846 NaN 37.985044
2021 87.487301 9.094091 NaN 2.580982 NaN

72 rows × 5 columns

There is also a built-in pandas function that does this, called pivot. It has basically the same syntax as the function we just made. Still, in some cases it's easier to do it the old-fashioned way.

In [35]:
df_long.pivot(index='Year', columns='Country', values='GDP')
Out[35]:
Country ARG CHN NOR USA ZAF
Year
1950 NaN NaN 87.142400 1.051588 9.857584
1951 NaN 2.549617 89.086559 1.021525 9.753643
1952 44.444957 2.614140 92.497893 1.118118 8.489252
1953 44.170588 2.464735 95.158492 NaN 8.461172
1954 43.138836 NaN 98.055941 1.125351 8.954090
... ... ... ... ... ...
2017 72.968384 11.195639 NaN 2.754084 36.586738
2018 NaN 10.378070 94.586793 NaN 36.591334
2019 NaN 10.870278 91.963998 2.678403 36.660267
2020 82.525899 9.462256 98.514846 NaN 37.985044
2021 87.487301 9.094091 NaN 2.580982 NaN

72 rows × 5 columns

(b) Bollinger Bands

I'm going to change the question slightly and say we're looking for ±1 stdev breakouts. The function below returns a DataFrame of booleans saying whether we've exceeded these bounds. Note that when the data is missing for a given year it will return False, which seems reasonable. The usage of min_periods=1 here is so that our NaN values don't propagate too much.

In [36]:
def bollinger_escape(df, K):
    roll = df.rolling(K, min_periods=1)
    rmean, rstdv = roll.mean(), roll.std()
    return (df - rmean).abs() > rstdv
In [37]:
bollinger_escape(df_wide, 5).head(5)
Out[37]:
Country ARG CHN NOR USA ZAF
Year
1950 False False False False False
1951 False False False False False
1952 False False True True True
1953 False True True False False
1954 True False True False False

We can visualize this situation by plotting the data series and their (1 stdv) Bollinger bands. Note that we're dealing with a multi-level column index here, which is always kind of annoying, but be sure to understand what concat and swaplevel are doing here.

In [38]:
roll = df_wide.rolling(5, min_periods=1)
rmean, rstdv = roll.mean(), roll.std()
rhi, rlo = rmean + rstdv, rmean - rstdv
df_roll = pd.concat({'value': df_wide, 'high': rhi, 'low': rlo}, axis=1)
df_roll = df_roll.swaplevel(0, 1, axis=1).sort_index(axis=1)
df_roll.head(5)
Out[38]:
Country ARG CHN NOR USA ZAF
high low value high low value high low value high low value high low value
Year
1950 NaN NaN NaN NaN NaN NaN NaN NaN 87.142400 NaN NaN 1.051588 NaN NaN 9.857584
1951 NaN NaN NaN NaN NaN 2.549617 89.489207 86.739752 89.086559 1.057814 1.015299 1.021525 9.879112 9.732115 9.753643
1952 NaN NaN 44.444957 2.627503 2.536254 2.614140 92.286652 86.864583 92.497893 1.113174 1.014313 1.118118 10.128603 8.605050 8.489252
1953 44.501781 44.113764 44.170588 2.617764 2.467897 2.464735 94.533909 87.408763 95.158492 1.113174 1.014313 NaN 9.909777 8.371048 8.461172
1954 44.606815 43.229440 43.138836 2.617764 2.467897 NaN 96.810617 87.965897 98.055941 1.129917 1.028374 1.125351 9.774627 8.431669 8.954090

Let's do a little fancy plot here for Argentina.

In [39]:
ser = df_roll['ARG']
fig, ax = plt.subplots()
ax.plot(ser.index, ser['value'], c=col[1], marker='o', markersize=2.5)
ax.plot(ser.index, ser['low'], linewidth=1, c=col[0])
ax.plot(ser.index, ser['high'], linewidth=1, c=col[0])
ax.fill_between(ser.index, ser['low'], ser['high'], color=col[0], alpha=0.25);
[Figure: Argentina GDP series with its ±1 stdev Bollinger band]

Q4: statsmodels

(a) OLS

We'll use the statsmodels formula interface here.

In [40]:
import statsmodels.formula.api as smf

Let's simulate some random normal data. I'm going to induce some correlation between x1 and x2 so that our mis-specified model is wrong in an interesting way.

In [41]:
n = 10_000
df_sim = pd.DataFrame({
    'x1': 0.1*np.random.normal(size=n)
})
df_sim['x2'] = df_sim['x1'] + 0.1*np.random.normal(size=n)
df_sim['y'] = 1 + 2*df_sim['x1'] + 3*df_sim['x2'] + 0.1*np.random.normal(size=n)
df_sim.head(5)
Out[41]:
x1 x2 y
0 -0.195548 -0.341791 -0.456166
1 0.036662 -0.013369 0.938582
2 0.086877 0.075009 1.265736
3 -0.027318 0.039419 1.153315
4 -0.089984 -0.177162 0.192880

Now we run the properly specified regression

In [42]:
smf.ols('y ~ 1 + x1 + x2', df_sim).fit().params
Out[42]:
Intercept    1.000903
x1           1.984739
x2           3.001422
dtype: float64

And then an improperly specified one. Here let's omit x1 as a regressor.

In [43]:
smf.ols('y ~ 1 + x2', df_sim).fit().params
Out[43]:
Intercept    0.999440
x2           3.994577
dtype: float64

This is expected given the positive correlation between x1 and x2. With x1 omitted, we would anticipate an estimated value of $\hat{\beta}_2 = \beta_2 + \frac{\mathrm{Cov}(x_1, x_2)}{\mathrm{Var}(x_2)} \, \beta_1 = 3 + 0.5 \cdot 2 = 4$, which is nearly exactly what we find.
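
We can check that algebra directly on the simulated data. This is just a quick sketch, and the exact numbers will vary with the random draw:

# regression coefficient of x1 on x2, i.e. Cov(x1, x2) / Var(x2), should be about 0.5
delta = df_sim['x1'].cov(df_sim['x2']) / df_sim['x2'].var()
# implied coefficient on x2 when x1 is omitted, should be close to 4
print(3 + 2*delta)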

(b) Poisson

Let's add some Poisson data to our existing DataFrame, and let's in fact use the same linear part of the model, which is to say $\mathbb{E}[p \mid y] = \exp(y)$.

In [44]:
df_sim['p'] = np.random.poisson(np.exp(df_sim['y']), size=n)
df_sim.head(5)
Out[44]:
x1 x2 y p
0 -0.195548 -0.341791 -0.456166 1
1 0.036662 -0.013369 0.938582 2
2 0.086877 0.075009 1.265736 2
3 -0.027318 0.039419 1.153315 2
4 -0.089984 -0.177162 0.192880 2

Now we again run a properly specified Poisson regression.

In [45]:
smf.poisson('y ~ 1 + x1 + x2', data=df_sim).fit().params
Optimization terminated successfully.
         Current function value: 0.953151
         Iterations 5
Out[45]:
Intercept   -0.172245
x1           1.937055
x2           3.044282
dtype: float64

And finally we do the mis-specified version

In [46]:
smf.poisson('y ~ 1 + x2', data=df_sim).fit().params
Optimization terminated successfully.
         Current function value: 0.962609
         Iterations 5
Out[46]:
Intercept   -0.164420
x2           4.019739
dtype: float64

Same basic results!
