For a lot of these I'm going to go into way more detail than I'd expect in a regular answer. But you know, it's a teachable moment, and I'm a teacher.
# configure matplotlib output
import matplotlib.pyplot as plt
import matplotlib as mpl
# numpy and pandas get used throughout below
import numpy as np
import pandas as pd
mpl.style.use('config/clean.mplstyle') # this loads my personal plotting settings
col = mpl.rcParams['axes.prop_cycle'].by_key()['color']
%matplotlib inline
# if you have an HD display
%config InlineBackend.figure_format = 'retina'
Two strings are anagrams of one another if the histograms over their characters are the same. Since strings are just lists of characters, we can make quick work of this. I'm going to "cheat" a little bit and avail myself of some of the lesser known (but very useful) parts of the core Python libraries. In this case we'll use `Counter` from the `collections` module.
from collections import Counter
Since `Counter` is a class, by calling it we invoke the constructor, which creates a new instance. The `Counter` constructor takes a `list` or `list`-like thing such as a `str` and turns it into a dictionary of counts (effectively a histogram). Here's an example
Counter('hello world')
Since the `Counter` class overloads the equality operator (take a look at its `__eq__` function), we can directly check equality, which will return `True` when all the counts line up exactly. I'm also going to say we'll ignore case, since that's in line with the colloquial definition of anagrams. So it's a one-liner!
def check_anagram(a, b):
    return Counter(a.lower()) == Counter(b.lower())
Let's check that it works in a few cases, particularly edge cases
print(check_anagram('Doug Hanley', 'Doug Cranley'))
print(check_anagram('Doug Hanley', 'Gayle Hound'))
print(check_anagram('hello world!', ''))
print(check_anagram('', ''))
There are many reasonable ways to do this. The main way not to do it is to create an empty list (`[]`) and then have a for loop populate it with `append`. It's just inelegant, and slower too. Which usually won't matter, until it does. First we'll need some imports
from itertools import chain, repeat, product
Now let's go through some options. I can think of basically three distinct ones.
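But before those, for contrast, here's a rough sketch of the discouraged append-loop version (my own illustration, assuming the same input format of `(value, count)` pairs used below), just so we have a baseline:

# the discouraged approach: start with an empty list and append in a loop
def repeat_list_append(xns):
    out = []
    for x, n in xns:
        for _ in range(n):
            out.append(x)
    return out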
# the nested for loop approach, short and sweet
def repeat_list_0(xns):
    return [x for x, n in xns for _ in range(n)]

# the functional approach, elegant but baroque
def repeat_list_1(xns):
    return list(chain.from_iterable(repeat(x, n) for x, n in xns))

# here we used the sum function instead of chain
def repeat_list_2(xns):
    return sum(([x for _ in range(n)] for x, n in xns), [])
And now let's test it. I'm actually going to use the fact that this is a sort of pseudo-inverse of the `Counter` function to verify the results.
xns_list = [[(3, 5), (2, 3), (10, 1)], [(3, 5), (2, 0), (10, 1)], [(3, 0)], []]
func_list = [repeat_list_0, repeat_list_1, repeat_list_2]

for func, xns in product(func_list, xns_list):
    reps = func(xns)
    xns0 = [(x, n) for x, n in xns if n > 0]
    xns1 = list(Counter(reps).items())
    assert xns0 == xns1
## `dict` Merge

You know what time it is? It's `defaultdict` time! A `defaultdict` is like a normal `dict`, but when you try to access a key that doesn't exist, it creates a new entry for that key (using a default factory you provide) and returns it. Let's import it first
from collections import defaultdict
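To make that behavior concrete, here's a quick illustration with a `list` factory (just a toy example):

dd = defaultdict(list)  # list is the factory called for missing keys
dd['a'].append(1)       # 'a' doesn't exist yet, so it's created as [] first
dd['a'].append(2)
dd['b']                 # even a bare lookup creates an entry
print(dict(dd))         # {'a': [1, 2], 'b': []}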
Now let's implement this operation. Note that you have to give `defaultdict` a function that returns the default value, not the default value itself.
def dict_merge_0(ds):
    out = defaultdict(lambda: [])
    for d in ds:
        for k, v in d.items():
            out[k].append(v)
    return dict(out)
And it seems to work
ds_list = [
    [{1: 2, 2: 3}, {1: 3, 2: 5}],
    [{1: 2, 2: 3}, {1: 3, 2: 5, 10: 20}],
    []
]

for ds in ds_list:
    print(dict_merge_0(ds))
Note that this will produce lists of different sizes when keys are missing from some dicts. So we might want to fill those in with `None` instead, to preserve the order (and hence make this operation invertible). That will also lead us to a solution that is potentially more performant.
def dict_merge_1(ds):
    keys = set().union(*(set(d) for d in ds))
    return {k: [d.get(k, None) for d in ds] for k in keys}
And now we get the expected behavior.
for ds in ds_list:
    print(dict_merge_1(ds))
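To see why the `None` padding makes this invertible, here's a rough sketch of the reverse operation (the name `dict_unmerge` is my own, and it assumes `None` never appears as a real value):

def dict_unmerge(dm):
    # each value list has one slot per original dict
    n = len(next(iter(dm.values()), []))
    return [
        {k: v[i] for k, v in dm.items() if v[i] is not None}
        for i in range(n)
    ]

# round-trips on all of our test cases
for ds in ds_list:
    assert dict_unmerge(dict_merge_1(ds)) == ds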
## Ordered `set`

I actually happened to need such a thing the other day. Here's a basic implementation that creates a new subclass of `list` and implements a new method called `add` (which is the name `set` uses), which just checks that you don't introduce any duplicates while preserving the order.
class OrderedSet(list):
    def __init__(self, data=()):
        super().__init__(data)

    def add(self, key):
        if key not in self:
            super().append(key)
Now let's add some elements and see what happens in the case of a duplicate add.
os = OrderedSet()
os.add(5)
os.add(3)
os.add(10)
os.add(3)
print(os)
Now let's do some profiling to see how fast this is
%%time
os = OrderedSet()
for i in range(50_000):
    os.add(i)
One thing that will be slow about the above implementation is that checking for membership in `add` will be quite taxing for large data sizes. So let's change it up by keeping a `set` version of the data as well, which has much faster membership checks. So we're sacrificing memory usage for speed here.
class OrderedSetFast(list):
    def __init__(self, data=()):
        super().__init__(data)
        self._set = set(data)

    def add(self, key):
        if key not in self._set:
            self._set.add(key)
            super().append(key)
And it is indeed much (500x) faster now!
%%time
os = OrderedSetFast()
for i in range(50_000):
    os.add(i)
## `numpy`

This is a common database-style function where we want to find aggregations of the data at some group level. We know `pandas` already does this really well, but first let's implement it in pure `numpy`.
def group_sums(x, i):
    return {v: x[i==v].sum() for v in np.unique(i)}
x = np.linspace(0, 1, 20)
i = np.array([i for i in range(5) for _ in range(4)])
group_sums(x, i)
There may be a faster way though! Let's look into some other `numpy` functions here. First, in the process of doing `unique`, we're also doing a lot of the work involved in finding out where each of the group values is located, so we can use one of the special flags (`return_inverse`) to return that information, in addition to the unique values. Try running `np.unique` on your own to see how that works. Then we use `bincount`, which will aggregate over consecutive integer indices with optional weights.
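Here's a small illustration of those two pieces on some made-up numbers:

arr = np.array([5, 3, 5, 7, 3])
v, j = np.unique(arr, return_inverse=True)
print(v)  # the sorted unique values: [3 5 7]
print(j)  # the index of each element of arr within v: [1 0 1 2 0]
# bincount sums the weights landing on each integer index
print(np.bincount(j, weights=np.array([1.0, 2.0, 3.0, 4.0, 5.0])))  # [7. 4. 4.]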
def group_sums_fast(x, i):
    v, j = np.unique(i, return_inverse=True)
    s = np.bincount(j, weights=x)
    return dict(zip(v, s))
Now let's test it on a monster dataset and see how each version fares.
xl = np.linspace(0, 1, 100_000)
il = np.array([i for i in range(10_000) for _ in range(10)])
%time _ = group_sums(xl, il)
%time _ = group_sums_fast(xl, il)
Okay, so these gains are pretty niche in absolute terms, but still it's a 100x speedup! I wonder how `pandas` does here
%time _ = pd.Series(xl, index=il).groupby(level=0).sum()
Actually a bit faster than the fast `numpy` version! Though if you include an additional `to_dict()` at the end, it's a bit slower, so I'd say these are basically tied.
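For completeness, that variant is just the same line with the conversion tacked on:

%time _ = pd.Series(xl, index=il).groupby(level=0).sum().to_dict()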
First let's make a random walk generator
def random_walk(n):
    return np.random.normal(size=n).cumsum()
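Just to get a feel for what these look like, here's a quick throwaway plot of a few walks (my own aside, using the matplotlib setup from the top):

fig, ax = plt.subplots()
for _ in range(3):
    ax.plot(random_walk(500), linewidth=1)
ax.set_xlabel('step');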
I'm going to use `seaborn`, a well-established fancy plotting library with some R-like features.
import seaborn as sns
Now we can generate a bunch of random walks and see how they're correlated. I'm going to draw many independent pairs of walks, compute the correlation coefficient for each pair, and look at the distribution of those correlations.
n = 1_000
corrs = [np.corrcoef(random_walk(n), random_walk(n))[0, 1] for _ in range(1_000)]
ax = sns.kdeplot(corrs, fill=True); ax.set_xlim(-1, 1);
## `pandas`

First let's generate some fake cross-country style data. I'm going to randomly drop some years to make things tricky.
countries = ['USA', 'CHN', 'ZAF', 'ARG', 'NOR']
ymin, ymax = 1950, 2022
mean, stdv = 0.01, 0.05
def random_drift(n, mean, stdv):
    return (stdv*np.random.normal(size=n)+mean).cumsum()
df_long = pd.DataFrame(product(countries, range(ymin, ymax)), columns=['Country', 'Year'])
df_long['Population'] = np.exp(random_drift(len(df_long), mean, stdv))
df_long['GDP'] = np.exp(random_drift(len(df_long), mean, stdv))
df_long = df_long.sample(frac=0.75)
df_long = df_long.sort_values(by=['Country', 'Year'], ignore_index=True)
df_long.head(5)
Now let's handle the reshaping operation. There's one way to do this that will save you a lot of grief once you understand it: get all the rows and columns you want to deal with into the index, then `unstack` the things you want to be columns.
def long_to_wide(df, rows, cols, vals=None):
    # assemble variables
    rows = [rows] if type(rows) is not list else rows
    cols = [cols] if type(cols) is not list else cols

    # do reshape
    df = df.set_index([*rows, *cols])
    if vals is not None:
        df = df[vals]
    df = df.unstack(cols)

    # return wide
    return df
Now we can test it out on our generated data, and indeed it seems to work.
df_wide = long_to_wide(df_long, 'Year', 'Country', 'GDP')
df_wide
There is also a built-in function to do this in `pandas` called `pivot`. This has basically the same syntax as the function we just made. Still, in some cases it's easier to do it the old-fashioned way.
df_long.pivot(index='Year', columns='Country', values='GDP')
I'm going to change the question slightly and say we're looking for ±1 stdev breakouts. The function below will return a `DataFrame` of booleans saying whether we've exceeded these bounds. Note that in the case where the data is missing for that year, it will return `False`, which seems reasonable. Note the usage of `min_periods=1` here is so that our `NaN` values don't propagate too much.
def bollinger_escape(df, K):
    roll = df.rolling(K, min_periods=1)
    rmean, rstdv = roll.mean(), roll.std()
    return (df - rmean).abs() > rstdv
bollinger_escape(df_wide, 5).head(5)
We can visualize this situation by plotting the data series and their (1 stdv) Bollinger bands. Note that we're dealing with a multi-level column index here, which is always kind of annoying, but be sure to understand what `concat` and `swaplevel` are doing here.
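Roughly speaking (my own aside): `pd.concat` with a dict on `axis=1` adds the dict keys as a new top level of the column index, giving column tuples like `('value', 'ARG')`, and `swaplevel(0, 1, axis=1)` flips them to `('ARG', 'value')` so each country's value/low/high columns end up grouped together. A tiny check of that claim:

df_demo = pd.concat({'value': df_wide, 'high': df_wide}, axis=1)
print(df_demo.columns[:2].tolist())                          # two-level tuples with 'value'/'high' on top
print(df_demo.swaplevel(0, 1, axis=1).columns[:2].tolist())  # now the country code comes first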
roll = df_wide.rolling(5, min_periods=1)
rmean, rstdv = roll.mean(), roll.std()
rhi, rlo = rmean + rstdv, rmean - rstdv
df_roll = pd.concat({'value': df_wide, 'high': rhi, 'low': rlo}, axis=1)
df_roll = df_roll.swaplevel(0, 1, axis=1).sort_index(axis=1)
df_roll.head(5)
Let's do a little fancy plot here for Argentina.
ser = df_roll['ARG']
fig, ax = plt.subplots()
ax.plot(ser.index, ser['value'], c=col[1], marker='o', markersize=2.5)
ax.plot(ser.index, ser['low'], linewidth=1, c=col[0])
ax.plot(ser.index, ser['high'], linewidth=1, c=col[0])
ax.fill_between(ser.index, ser['low'], ser['high'], color=col[0], alpha=0.25);
## `statsmodels`

We'll use the `statsmodels` formula interface here.
import statsmodels.formula.api as smf
Let's simulate some random normal data. I'm going to induce some correlation between `x1` and `x2` so that our mis-specified model is wrong in an interesting way.
n = 10_000
df_sim = pd.DataFrame({
    'x1': 0.1*np.random.normal(size=n)
})
df_sim['x2'] = df_sim['x1'] + 0.1*np.random.normal(size=n)
df_sim['y'] = 1 + 2*df_sim['x1'] + 3*df_sim['x2'] + 0.1*np.random.normal(size=n)
df_sim.head(5)
Now we run the properly specified regression
smf.ols('y ~ 1 + x1 + x2', df_sim).fit().params
And then an improperly specified one. Here let's omit `x1` as a regressor.
smf.ols('y ~ 1 + x2', df_sim).fit().params
Which is expected due to the positive correlation between `x1` and `x2`. By the usual omitted variable bias formula, we would anticipate an estimated value of $\hat{\beta}_2 = \beta_2 + \frac{\operatorname{Cov}(x_1, x_2)}{\operatorname{Var}(x_2)} \beta_1 = 4$, which is nearly exactly what we find.
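Spelling out the arithmetic for the simulation above: $\operatorname{Var}(x_2) = \operatorname{Var}(x_1) + 0.1^2 = 0.02$ and $\operatorname{Cov}(x_1, x_2) = \operatorname{Var}(x_1) = 0.01$, so the bias term is $(0.01 / 0.02) \cdot \beta_1 = 0.5 \cdot 2 = 1$, giving $\hat{\beta}_2 \approx 3 + 1 = 4$.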
Let's add in some Poisson data to our existing `DataFrame`, and let's in fact use the same linear part of the model, which is to say $\mathbb{E}[p \mid y] = \exp(y)$.
df_sim['p'] = np.random.poisson(np.exp(df_sim['y']), size=n)
df_sim.head(5)
Now we again run a properly specified Poisson regression.
smf.poisson('p ~ 1 + x1 + x2', data=df_sim).fit().params
And finally we do the mis-specified version
smf.poisson('p ~ 1 + x2', data=df_sim).fit().params
Same basic results!