For a lot of these I'm going to go into way more detail that I'd expect on a regular answer. But you know, it's a teachable moment, and I'm a teacher.

In [1]:

```
# configure matplotlib output
import matplotlib.pyplot as plt
import matplotlib as mpl
mpl.style.use('config/clean.mplstyle') # this loads my personal plotting settings
col = mpl.rcParams['axes.prop_cycle'].by_key()['color']
%matplotlib inline
```

In [2]:

```
# if you have an HD display
%config InlineBackend.figure_format = 'retina'
```

Two strings are anagrams of one another if the histograms over their characters are the same. Since strings are just lists of characters, we can make quick work of this. I'm going to "cheat" a little bit and avail myself of some of the lesser known (but very useful parts of the core Python libraries). In this case we'll use `Counter`

from the `collections`

module.

In [3]:

```
from collections import Counter
```

Since `Counter`

is a class, by calling it we invoke the constructor, which creates a new instance. The `Counter`

constructor takes a `list`

or `list`

-like thing such as a `str`

and turns it into a dictionary of counts (effectively a histogram). Here's an example

In [4]:

```
Counter('hello world')
```

Out[4]:

Since the `Counter`

class overload the equality operator (take a look at its `__eq__`

function), we can directly check equality, which will return `True`

when all the counts line up exactly. I'm also going to say we'll ignore case, since that's in line with the colloquial definition for anagrams. So it's a one-liner!

In [5]:

```
def check_anagram(a, b):
return Counter(a.lower()) == Counter(b.lower())
```

Let's check that it works in a few cases, particularly edge cases

In [6]:

```
print(check_anagram('Doug Hanley', 'Doug Cranley'))
print(check_anagram('Doug Hanley', 'Gayle Hound'))
print(check_anagram('hello world!', ''))
print(check_anagram('', ''))
```

There are many reasonable ways to do this. The main way *not* to do it is to create an empty list (`[]`

) then have a for loop populate it with `append`

. It's just inelegant and slower too. Which usually won't matter, until it does. First we'll need some imports

In [7]:

```
from itertools import chain, repeat, product
```

Now let's go through some options. I can think of basically three distinct ones.

In [8]:

```
# the nested for loop approach, short and sweet
def repeat_list_0(xns):
return [x for x, n in xns for _ in range(n)]
# the functional approach, elegant but baroque
def repeat_list_1(xns):
return list(chain.from_iterable(repeat(x, n) for x, n in xns))
# here we used the sum function instead of chain
def repeat_list_2(xns):
return sum(([x for _ in range(n)] for x, n in xns), [])
```

And now let's test it. I'm actually going to use the fact that this is a sort of pseudo-inverse of the `Counter`

function to verify the results.

In [9]:

```
xns_list = [[(3, 5), (2, 3), (10, 1)], [(3, 5), (2, 0), (10, 1)], [(3, 0)], []]
func_list = [repeat_list_0, repeat_list_1, repeat_list_2]
for func, xns in product(func_list, xns_list):
reps = func(xns)
xns0 = [(x, n) for x, n in xns if n > 0]
xns1 = list(Counter(reps).items())
assert(xns0 == xns1)
```

`dict`

Merge¶You know what time it is? It's `defaultdict`

time! A `defaultdict`

is like a normal dict, but when you try to access a key that doesn't exist it will create a user-provided entry in its place and return that. Let's import it first

In [10]:

```
from collections import defaultdict
```

Now let's implement this operation. Note that you have to give a *function* that returns the default value to `defaultdict`

, not the default value itself.

In [11]:

```
def dict_merge_0(ds):
out = defaultdict(lambda: [])
for d in ds:
for k, v in d.items():
out[k].append(v)
return dict(out)
```

And it seems to work

In [12]:

```
ds_list = [
[{1: 2, 2: 3}, {1: 3, 2: 5}],
[{1: 2, 2: 3}, {1: 3, 2: 5, 10: 20}],
[]
]
for ds in ds_list:
print(dict_merge_0(ds))
```

Note that this will produce lists of different size when keys are missing some dicts. So we might want to fill those in with `None`

instead, to preserve the order (and hence make this operation invertible). That will also lead us to a solution that is potentially more performant.

In [13]:

```
def dict_merge_1(ds):
keys = set().union(*(set(d) for d in ds))
return {k: [d.get(k, None) for d in ds] for k in keys}
```

And now we get the expected behavior.

In [14]:

```
for ds in ds_list:
print(dict_merge_1(ds))
```

`set`

¶I actually happened to need such a thing the other day. Here's a basic implementation that creates a new subclass of `list`

and implements a new method called `add`

(which is the name for `set`

) and just checks that you don't introduce any duplicates while preserving the order.

In [15]:

```
class OrderedSet(list):
def __init__(self, data=()):
super().__init__(data)
def add(self, key):
if key not in self:
super().append(key)
```

Now let's add some elements and see what happens in the case of a duplicate add.

In [16]:

```
os = OrderedSet()
os.add(5)
os.add(3)
os.add(10)
os.add(3)
print(os)
```

Now let's do some profiling to see how fast this is

In [17]:

```
%%time
os = OrderedSet()
for i in range(50_000):
os.add(i)
```

One thing that will be slow about the above implementation is that checking for membership in `add`

will be quite taxing for large data sizes. So let's change it up by keeping a `set`

version of the data as well, which has must faster membership checks. So we're sacrificing memory usage for speed here.

In [18]:

```
class OrderedSetFast(list):
def __init__(self, data=()):
super().__init__(data)
self._set = set(data)
def add(self, key):
if key not in self._set:
self._set.add(key)
super().append(key)
```

And it is indeed much (500x) faster now!

In [19]:

```
%%time
os = OrderedSetFast()
for i in range(50_000):
os.add(i)
```

`numpy`

¶This is a common database-style function where we want to find aggregations of the data at some group level. We know `pandas`

already does this really well, but first let's implement it in pure `numpy`

.

In [20]:

```
def group_sums(x, i):
return {v: x[i==v].sum() for v in np.unique(i)}
```

In [21]:

```
x = np.linspace(0, 1, 20)
i = np.array([i for i in range(5) for _ in range(4)])
group_sums(x, i)
```

Out[21]:

There may be a faster way though! Let's look into some other `numpy`

functions here. First, in the process of doing `unique`

, we're also doing a lot of the work involved in finding out where each of the group values is located, so we can use one of the special flags (`return_inverse`

) to return that information, in addition to the unique values. Try running `np.unique`

on your own to see how that works. Then we use `bincount`

, which will aggregate over consecutive integer indices with optional weights.

In [22]:

```
def group_sums_fast(x, i):
v, j = np.unique(i, return_inverse=True)
s = np.bincount(j, weights=x)
return dict(zip(v, s))
```

Now let's test it on a monster dataset and see how each version fares.

In [23]:

```
xl = np.linspace(0, 1, 100_000)
il = np.array([i for i in range(10_000) for _ in range(10)])
```

In [24]:

```
%time _ = group_sums(xl, il)
```

In [25]:

```
%time _ = group_sums_fast(xl, il)
```

Okay, so these gains are pretty niche in absolute terms, but still it's a 100x speedup! I wonder how `pandas`

does here

In [26]:

```
%time _ = pd.Series(xl, index=il).groupby(level=0).sum()
```

Actually a bit faster than the fast `numpy`

version! Though if you include an additional `to_dict()`

at the end, it's a bit slower, so I'd say these are basically tied.

First let's make a random walk generator

In [27]:

```
def random_walk(n):
return np.random.normal(size=n).cumsum()
```

I'm going to use `seaborn`

, a well established fancy plotting library, with some R-like features.

In [28]:

```
import seaborn as sns
```

Now we can generate a bunch of random walks and see how they're correlated. I'm going to

In [29]:

```
n = 1_000
corrs = [np.corrcoef(random_walk(n), random_walk(n))[0, 1] for _ in range(1_000)]
ax = sns.kdeplot(corrs, fill=True); ax.set_xlim(-1, 1);
```

`pandas`

¶First let's generate some fake cross-country style data. I'm going to randomly drop some years to make things tricky.

In [30]:

```
countries = ['USA', 'CHN', 'ZAF', 'ARG', 'NOR']
ymin, ymax = 1950, 2022
mean, stdv = 0.01, 0.05
```

In [31]:

```
def random_drift(n, mean, stdv):
return (stdv*np.random.normal(size=n)+mean).cumsum()
```

In [32]:

```
df_long = pd.DataFrame(product(countries, range(ymin, ymax)), columns=['Country', 'Year'])
df_long['Population'] = np.exp(random_drift(len(df_long), mean, stdv))
df_long['GDP'] = np.exp(random_drift(len(df_long), mean, stdv))
df_long = df_long.sample(frac=0.75)
df_long = df_long.sort_values(by=['Country', 'Year'], ignore_index=True)
df_long.head(5)
```

Out[32]:

Now let's handle the reshaping operation. There's just a way to do this that will save you a lot of grief once you understand it. Just get all the rows and columns you want to deal with in the index, then `unstack`

the things you want to be columns.

In [33]:

```
def long_to_wide(df, rows, cols, vals=None):
# assemble variables
rows = [rows] if type(rows) is not list else rows
cols = [cols] if type(cols) is not list else cols
# do reshape
df = df.set_index([*rows, *cols])
if vals is not None:
df = df[vals]
df = df.unstack(cols)
# return wide
return df
```

Now we can test it out on our generated data, and indeed it seems to work.

In [34]:

```
df_wide = long_to_wide(df_long, 'Year', 'Country', 'GDP')
df_wide
```

Out[34]:

There is also a built-in function to do this in `pandas`

called `pivot`

. This has basically the same syntax as the function we just made. Still, in some cases it's easier to do it the old fashioned way.

In [35]:

```
df_long.pivot(index='Year', columns='Country', values='GDP')
```

Out[35]:

I'm going to change the question slightly and say we're looking for ±1 stdev breakouts. The below function will return a `DataFrame`

of booleans saying whether we've exceeded these bounds. Note that in the case where the data is missing for that year, it will return `False`

, which seems reasonable. Note the usage of `min_periods=1`

here is so that our `NaN`

values don't propogate too much.

In [36]:

```
def bollinger_escape(df, K):
roll = df.rolling(K, min_periods=1)
rmean, rstdv = roll.mean(), roll.std()
return (df - rmean).abs() > rstdv
```

In [37]:

```
bollinger_escape(df_wide, 5).head(5)
```

Out[37]:

We can visualize this situation by plotting the data series and their (1 stdv) Bollinger bands. Note that we're dealing with a multi-level column index here, which is always kind of annoying, but be sure to understand what `concat`

and `swaplevel`

are doing here.

In [38]:

```
roll = df_wide.rolling(5, min_periods=1)
rmean, rstdv = roll.mean(), roll.std()
rhi, rlo = rmean + rstdv, rmean - rstdv
df_roll = pd.concat({'value': df_wide, 'high': rhi, 'low': rlo}, axis=1)
df_roll = df_roll.swaplevel(0, 1, axis=1).sort_index(axis=1)
df_roll.head(5)
```

Out[38]:

Let's do a little fancy plot here for Argentina.

In [39]:

```
ser = df_roll['ARG']
fig, ax = plt.subplots()
ax.plot(ser.index, ser['value'], c=col[1], marker='o', markersize=2.5)
ax.plot(ser.index, ser['low'], linewidth=1, c=col[0])
ax.plot(ser.index, ser['high'], linewidth=1, c=col[0])
ax.fill_between(ser.index, ser['low'], ser['high'], color=col[0], alpha=0.25);
```

`statsmodels`

¶We'll use the `statsmodels`

formula interface here.

In [40]:

```
import statsmodels.formula.api as smf
```

Let's simulate some random normal data. I'm going to induce some correlation between `x1`

and `x2`

so that our mis-specified model is wrong in an interesting way.

In [41]:

```
n = 10_000
df_sim = pd.DataFrame({
'x1': 0.1*np.random.normal(size=n)
})
df_sim['x2'] = df_sim['x1'] + 0.1*np.random.normal(size=n)
df_sim['y'] = 1 + 2*df_sim['x1'] + 3*df_sim['x2'] + 0.1*np.random.normal(size=n)
df_sim.head(5)
```

Out[41]:

Now we run the properly specified regression

In [42]:

```
smf.ols('y ~ 1 + x1 + x2', df_sim).fit().params
```

Out[42]:

And then an improperly specified one. Here let's omit `x1`

as a regressor.

In [43]:

```
smf.ols('y ~ 1 + x2', df_sim).fit().params
```

Out[43]:

Which is expected due to the positive correlation between `x1`

and `x2`

. We would anticipate an estimated value of $\hat{\beta}_2 = \beta_2 + \rho(x_1,x_2) \beta_1 = 4$, which is nearly exactly what we find.

Let's add in some Poisson data to our existing `DataFrame`

, and let's in fact use the same linear part of the model, which is to say $\mathbb{E}[p] = \exp(y)$.

In [44]:

```
df_sim['p'] = np.random.poisson(np.exp(df_sim['y']), size=n)
df_sim.head(5)
```

Out[44]:

Now we again run a properly specified Poisson regression.

In [45]:

```
smf.poisson('y ~ 1 + x1 + x2', data=df_sim).fit().params
```

Out[45]:

And finally we do the mis-specified version

In [46]:

```
smf.poisson('y ~ 1 + x2', data=df_sim).fit().params
```

Out[46]:

Same basic results!

In [ ]:

```
```