Yes, we're zero indexing!
Git is a version control system that runs on your computer. You can commit
changes to the repository and it will save a history of these commits, allowing you to reference them or go back to them at any time. GitHub is a hosting service that essentially backs up your Git repositories (when you push
to them or pull
from them) so that you can access them from anywhere and there is a reduced chance of data loss. You can also check out other people's GitHub repositories by clone
-ing them.
There are basically two ways to install Python. One is the "native" way, using whatever facilities exist in your operating system (for OSX this would be Homebrew) and then using pip
to install packages. The other way, which is generally more user-friendly on both OSX and Windows, is to use the Anaconda distribution and conda
to install packages.
The most bare-bones way to run Python is to just execute python
in the terminal. Depending on your setup, this might actually run Python 2, which is quite old at this point, so it's often safer to run python3
explicitly. Anyway, you'll almost never need to do this. At the very least you'll want to run in a more user friendly environment like IPython
by executing ipython3
, which is sort of a wrapper around python3
.
But even then, most of you will prefer to use Jupyterlab. This is a web-based graphical interface that is primarily centered on "notebooks", which is basically what you're looking at now. A notebook is just a series of cells (code or markdown) that, when run, produce some kind of output. Notebooks are a way to store the code you've written and the resulting output in one place.
The other place where Python code might "live" is in Python files, which are just text files with the extension .py
. If you have a file named model.py
, you can run its contents directly with python3 model.py
. If you're in an IPython terminal and you want to run its contents interactively, you can run run -i model.py
. Finally, you can use it as a module. In this case you can run the Python command import model
and there will be a new variable model
that contains any variables defined therein. So if you'd defined abc = 5
in model.py
then model.abc
will be 5
.
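As a self-contained sketch of that import workflow (the temporary directory and on-the-fly file writing here are just so the example runs in one cell; normally model.py would live next to your notebook):

```python
import pathlib
import sys
import tempfile

# write a minimal model.py to a temp folder (stand-in for a real project file)
tmpdir = tempfile.mkdtemp()
pathlib.Path(tmpdir, 'model.py').write_text('abc = 5\n')

# make the folder importable, then import the module like any other
sys.path.insert(0, tmpdir)
import model

print(model.abc)  # prints 5
```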
Nowadays, you can do almost everything in Jupyterlab. That can be useful, especially if you're working on a remote machine like the Pitt cluster. However, I must emphasize that you shouldn't be doing everything in notebooks. For sufficiently complex code, you'll want to put some portions of it into proper Python modules (.py
files, basically) and import them for usage in a notebook. Thus your notebook will contain mostly high level commands and their resulting outputs.
You can create a new notebook (.ipynb
file) by clicking on the blue "+" on the left and choosing a Python version (something like 3.9 or higher is recommended). You can also create other file types like Python (.py
) or markdown (.md
) or open a system terminal. Finally, you can edit any of these files by double clicking on them in the filesystem pane on the left.
You'll want to stick mostly to the keyboard. To run a cell, press Shift+Enter
. To run a bunch of cells in a row, just hold down Shift
and keep pressing Enter
. To enter edit mode on the selected cell, press Enter
. To exit edit mode on a cell, just press Esc
. To interrupt ongoing execution, press i
twice. To completely restart a notebook, press 0
twice. Create new cells above or below with a
and b
.
You can make a cell into a markdown cell by pressing m
. Press y
to turn it back into a code cell. In markdown mode, you can make headings with one or more #
s, amongst other markdown features such as pairs of **
for bold text. You can also do inline $\LaTeX$-style math with pairs of $
, as in $x^2$, or display style math with pairs of $$
, as in
$$ \int_{-\infty}^{\infty} \exp(-x^2) dx = \sqrt{\pi} $$
There are a small number of core data types that are quite powerful. First there's the tuple
which is basically a list of objects
a = (1, 2, 'abc')
a
When the grouping is not ambiguous, you can omit the parentheses
a = 1, 2, 'abc'
a
In the other direction, you can unpack tuples and assign their members to separate variables
b, c, d = a
c
You can select subsets of a tuple by slicing them
a[1:]
Tuples aren't super flexible. Once you've created them, you can't reassign their elements or append new ones. For more interactive use cases, you'll want to use a list
. They look and act a lot like tuples, but you can modify them
a = [1, 2, 'abc']
print(a)
a.append(5)
print(a)
a[1] = 10
print(a)
There are a couple of fancy operations you can do with lists that use overloaded algebraic operators
a = [1, 2, 3]
b = [4, *a, 10]
print(b)
c = a + b
print(c)
d = 3*a
print(d)
Here you can see that using *
in front of a list variable acts as if you had typed out the contents.
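The same * unpacking also works inside function calls, which is a common use. A quick sketch (the add3 function here is just for illustration):

```python
# * unpacks a list into separate positional arguments
def add3(x, y, z):
    return x + y + z

args = [1, 2, 3]
print(add3(*args))  # prints 6, same as add3(1, 2, 3)
```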
I would say that the dict
is the quintessential type in Python. They are extremely useful and many things use them. A dict
is just a mapping between different objects, from keys
to values
. The values can be of any type, while the keys are restricted to being "hashable", which includes things like numbers, strings, and tuples (but not lists).
d = {1: 2, 'abc': 10, 12: 'foo'}
d
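To see the hashability restriction in action, here's a small sketch: a tuple works fine as a key, while a list raises a TypeError.

```python
# tuples are hashable, so they can serve as dict keys
points = {(0, 0): 'origin', (1, 2): 'a'}
print(points[(1, 2)])

# lists are not hashable, so using one as a key fails
try:
    {[0, 0]: 'origin'}
except TypeError:
    print('lists cannot be dict keys')
```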
You can access the elements of dictionaries with square brackets
d[12]
You can combine dicts as we saw with lists but using **
instead.
e = {**d, 15: 1}
e
You can loop over iterables like tuples, lists, and other things using for loops.
for i in [1, 2, 3, 4]:
print(2*i)
There's also something known as a list comprehension that lets you do this in more compact form
a = range(5) # an iterable over 0 to 4, inclusive
b = [2*i for i in a]
print(b)
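Comprehensions can also filter with a trailing if clause, which is worth knowing:

```python
# keep only even i, then double them
evens = [2*i for i in range(10) if i % 2 == 0]
print(evens)  # prints [0, 4, 8, 12, 16]
```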
We can also do comprehensions on dictionaries
a = [1, 2, 3, 4]
{i: 2*i for i in a}
Functions are similar to those in other programming languages, but they can also be assigned and passed around like variables
def add(x, y):
return x + y
add(1, 5)
For smaller functions, you can also use the lambda function notation
add = lambda x, y: x + y
add(1, 5)
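Here's a quick sketch of what "passed around like variables" means in practice (apply_twice is just an illustrative helper):

```python
# functions are values: they can be passed as arguments to other functions
def apply_twice(f, x):
    return f(f(x))

double = lambda x: 2*x
print(apply_twice(double, 3))  # prints 12
```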
You can combine multiple iterables together using zip
. This turns out to be pretty useful
a = [1, 2, 3, 4, 5]
b = [10, 11, 12, 13, 14]
zip(a, b)
Ok, that seems less useful. It turns out zip
returns an iterator object instead of the real thing. There are good efficiency reasons for this, but to get it to give you the real values, you need to do
list(zip(a, b))
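In practice you rarely need the list() call, since the most common thing to do with a zip is iterate over it directly, unpacking each pair:

```python
a = [1, 2, 3]
b = ['x', 'y', 'z']

# iterating consumes the zip object one pair at a time
pairs = []
for num, letter in zip(a, b):
    pairs.append((num, letter))
print(pairs)  # prints [(1, 'x'), (2, 'y'), (3, 'z')]
```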
There are quite a few built in modules that have useful functions. There are also many third-party modules that we'll use extensively.
import re # regular expressions
re.sub(r'\d', 'x', 'My phone number is 123-4567')
Here's an example using itertools.chain
which is often useful for chaining iterators together. In addition to itertools
, other all-star built-in modules include operator
and functools
.
from itertools import chain
a = [range(i) for i in range(5)]
print(a)
b = chain(*a)
list(b)
numpy
# it's pronounced num-pie :)
import numpy as np
The central object in numpy
is an N-dimensional array type np.ndarray
. Lots of stuff here is going to be similar to matlab arrays.
np.ones(10)
You can create an array from a list with np.array
, but the inputs should be numerical
np.array([1, 2, 3, 4])
Note that when generating ranges, the left limit is inclusive while the right limit is non-inclusive
a = np.arange(10)
print(a)
There are a bunch of different ways to slice arrays, much like lists but more powerful.
a = np.arange(10)
print(a)
print(a[0]) # zero indexed!
print(a[3:]) # no 'end' needed
print(a[:-1]) # negatives index from end
print(a[3:5]) # second index is non-inclusive
print(a[[4,1,9,2]]) # index with a list
We can "broadcast" new dimensions at will. Here we make a column vector. Note that the row dimension is the first index (row-major)
a[:, None]
Here's the same thing but for a row vector
a[None, :]
We can construct complex matrices using indexing and broadcasting. Do this instead of repmat!
a[:, None] + a[None, :]
Multiplication is element-wise by default (like .* in matlab)
np.arange(10) * np.arange(10)
Broadcasting works for a variety of operators, not just addition.
np.arange(10, 20)[None, :]*np.arange(5,15)[:, None]
You can always get shape/size information about an array.
a = np.ones((3, 5))
print(a.shape, a.size)
a
Reshaping is a whole thing.
a = np.ones((4, 5))
print(a.reshape((10, 2)).shape)
print(a.T.shape)
print(a.flatten().shape)
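One handy trick when reshaping: passing -1 for a dimension asks numpy to infer it from the total size. A quick sketch:

```python
import numpy as np

# -1 means "infer this dimension from the array's total size"
a = np.ones((4, 5))  # 20 elements total
print(a.reshape((-1, 2)).shape)  # (10, 2)
print(a.reshape((2, -1)).shape)  # (2, 10)
```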
There is basic linear algebra in numpy
, but you'll want to see scipy
for more advanced operations and for statistical distributions.
m = np.random.rand(5, 5)
mi = np.linalg.inv(m)
print(mi)
print((np.dot(m, mi)-np.eye(5)).max())
But numpy
has many different routines for random number generation.
np.random.randint(5, size=10)
matplotlib
First I'm going to do some non-required stuff to configure graph appearance to my liking
import matplotlib as mpl
mpl.style.use('./config/clean.mplstyle') # this loads my personal plotting settings
%config InlineBackend.figure_format = 'retina' # if you have an HD display
For most use cases, this is the only import you need. Note that it is a little non-standard.
import matplotlib.pyplot as plt
First let's do a simple line plot example. You'll usually be passing numpy
arrays to matplotlib
, but it also accepts lists.
plt.plot(np.arange(10, 20), np.arange(10));
Another useful plot is the histogram for a given array.
plt.hist(np.random.randn(1000));
Passing a 2d array will treat each column as a separate series.
plt.plot(np.cumsum(np.random.randn(1000, 2), axis=0));
pandas
import pandas as pd
A Series
is a 1-D array with an attached index, which defaults to range(n)
.
s = pd.Series(np.random.rand(10), index=np.arange(10, 20))
s
Let's look at the underlying data
print(s.index)
print(s.values)
Or get a quick summary of a numeric series
s.describe()
A DataFrame
is like a dictionary of Series
with a common index
df = pd.DataFrame({'ser1': s, 'ser2': np.random.randn(10)})
df.head()
We can get summary stats for each column.
df.describe()
This makes plotting much more convenient and powerful.
df.plot(title='Random Stuff');
Accessing individual columns yields a Series
df['ser1']
We can perform vector operations on these
df['ser1'] > 0.5
We can select particular rows in this way.
df1 = df[(df['ser1']>0.5) & (df['ser2']<1.0)]
df1
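Besides boolean masks, pandas also provides label-based and position-based row selection via .loc and .iloc. A small sketch on a toy frame:

```python
import numpy as np
import pandas as pd

# .loc selects by index label, .iloc by integer position
df = pd.DataFrame({'a': np.arange(5)}, index=np.arange(10, 15))
print(df.loc[12, 'a'])  # label 12 holds the value 2
print(df.iloc[0, 0])    # first row by position holds the value 0
```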
statsmodels
This import is also non-standard. We're going to use the formula-based API.
import statsmodels.formula.api as smf
Generate some random data with a known causal structure.
N = 100
x = np.random.randn(N)
y = 3*np.random.randn(N)
z = 1 + 2*x + 3*y + 4*x*y + np.random.randn(N)
df0 = pd.DataFrame({'x': x, 'y': y, 'z': z})
Run an OLS regression with a properly specified model.
ret = smf.ols('z ~ 1 + x + y + x:y', data=df0).fit()
ret.summary()
You can access the parameters and standard errors directly as a series and dataframe
print(ret.params)
print(ret.cov_params())