Custom logging in Python
9/1/19
When working with data, there are a lot of things I need to keep track of in my head.
- How many observations do I have in total?
- I just filtered some out, how many did I throw out?
- Am I aggregating before or after I filter?
- Where should I add this chunk of code?
To keep track of these questions when working in a Jupyter notebook, I end up with tons of cells that look like this:
df.head()
or
df.shape
By using decorators and the .pipe method, I can develop an analysis path that gives me customized output and automates this tedious cycle of .head() and .shape. Let’s take a look.
import pandas as pd
import numpy as np
import functools
np.random.seed(5)
df = pd.DataFrame({
    'group': np.random.choice(['a', 'b', 'c'], 10),
    'x': np.random.randint(0, 10, 10),
    'y': np.random.normal(0, 10, 10)
})
df.head()
| | group | x | y |
|---|---|---|---|
| 0 | c | 0 | 9.118736 |
| 1 | b | 7 | -14.438416 |
| 2 | c | 1 | 18.244402 |
| 3 | c | 5 | 14.576251 |
| 4 | a | 7 | -9.102582 |
Now I’ll define some processing functions.
These functions all take the dataframe as an argument and pass the dataframe back. A few notes:
- The pDoc decorator is what allows me to print out the docstring and the shape of the df at each step of the process.
- The startPipe function may seem useless, but I’m just using it to get the size of the dataframe at the beginning of the analysis path.
def pDoc(func):
    """Print the name, docstring, and output shape of a function."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        rv = func(*args, **kwargs)
        print("{}(): \n\t{} -> {}".format(func.__name__, func.__doc__, rv.shape))
        return rv
    return wrapper

@pDoc
def startPipe(df):
    """Begin pipeline"""
    return df

@pDoc
def filterGroups(df):
    """Remove group b from the analysis."""
    return df.query('group != "b"')

@pDoc
def capVal(df):
    """Cap the value of y at 10."""
    dat = df.copy()  # work on a copy so the input dataframe is untouched
    dat['y'] = dat['y'].clip(upper=10)
    return dat

@pDoc
def getMean(df):
    """Add column as mean value of x by group."""
    dat = df.copy()
    # the string 'mean' avoids the warning newer pandas raises for np.mean
    dat['g_mean'] = dat.groupby('group')['x'].transform('mean')
    return dat
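A quick aside on the functools.wraps line in the decorator: it copies the original function's name and docstring onto the wrapper, so a decorated step still identifies itself correctly if anything inspects it later. Here is a minimal sketch of the difference. The names bare, wrapped, stepA, and stepB are hypothetical and exist only for this illustration.

```python
import functools

def bare(func):
    """Decorator that does NOT use functools.wraps."""
    def wrapper(*args, **kwargs):
        return func(*args, **kwargs)
    return wrapper

def wrapped(func):
    """Decorator that uses functools.wraps."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        return func(*args, **kwargs)
    return wrapper

@bare
def stepA(df):
    """Docstring for stepA."""
    return df

@wrapped
def stepB(df):
    """Docstring for stepB."""
    return df

# Without wraps, the metadata belongs to the inner wrapper function.
print(stepA.__name__, stepA.__doc__)
# With wraps, the metadata of the original function is preserved.
print(stepB.__name__, stepB.__doc__)
```

Inside pDoc itself, the print call reads func.__name__ directly, so the log line would be correct either way. The wraps call matters for anything that looks at the decorated function from the outside, such as help() or another decorator stacked on top.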
Now I’ll tie all these functions together using .pipe.
(df
.pipe(startPipe)
.pipe(filterGroups)
.pipe(getMean)
.pipe(capVal)).head()
startPipe():
    Begin pipeline -> (10, 3)
filterGroups():
    Remove group b from the analysis. -> (8, 3)
getMean():
    Add column as mean value of x by group. -> (8, 4)
capVal():
    Cap the value of y at 10. -> (8, 4)
| | group | x | y | g_mean |
|---|---|---|---|---|
| 0 | c | 0 | 9.118736 | 3.0 |
| 2 | c | 1 | 10.000000 | 3.0 |
| 3 | c | 5 | 10.000000 | 3.0 |
| 4 | a | 7 | -9.102582 | 3.5 |
| 6 | a | 1 | -8.175481 | 3.5 |
As you can see, I get a log output that shows each function’s name, docstring, and output shape. I like this solution because it automates the tedious process of asking myself “how many records did I just throw out?”. Because the logging lives in the decorator, every decorated function reports the shape of its output.
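The shape printed at each step still leaves me doing the subtraction to know how many rows a step removed. Below is a minimal sketch of one way to have the decorator do that arithmetic. It assumes the dataframe is always the first positional argument, which is how .pipe calls the function. The names logRows, dropB, and toy are hypothetical and not part of the pipeline in this post.

```python
import functools
import pandas as pd

def logRows(func):
    """Print the row count before and after a step (hypothetical variant of pDoc)."""
    @functools.wraps(func)
    def wrapper(df, *args, **kwargs):
        rows_in = len(df)                  # rows entering the step
        rv = func(df, *args, **kwargs)
        rows_out = len(rv)                 # rows leaving the step
        print("{}(): {} -> {} rows ({} dropped)".format(
            func.__name__, rows_in, rows_out, rows_in - rows_out))
        return rv
    return wrapper

@logRows
def dropB(df):
    """Remove group b from the analysis."""
    return df.query('group != "b"')

# Small made-up dataframe, only for illustration.
toy = pd.DataFrame({'group': ['a', 'b', 'c', 'b'], 'x': [1, 2, 3, 4]})
result = toy.pipe(dropB)
# prints: dropB(): 4 -> 2 rows (2 dropped)
```

A negative "dropped" count would mean the step added rows, which can be a useful warning sign after a join.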
Also, this solution is easy to extend or modify. Don’t like what my pDoc decorator is doing? It’s easy to change and customize. You’re really only limited by your imagination (and Python).
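As one example of that kind of customization, here is a rough sketch that swaps print for Python's standard logging module and lets each step choose its own log level. This is just one possible direction, not the only way to do it. The names logStep, keepPositive, toy, and the "pipeline" logger name are all hypothetical.

```python
import functools
import logging
import pandas as pd

# Basic configuration; in a notebook that already configured logging,
# this call may have no effect.
logging.basicConfig(level=logging.INFO, format="%(levelname)s %(message)s")
logger = logging.getLogger("pipeline")  # hypothetical logger name

def logStep(level=logging.INFO):
    """Decorator factory: log name, docstring, and output shape at `level`."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            rv = func(*args, **kwargs)
            logger.log(level, "%s(): %s -> %s",
                       func.__name__, func.__doc__, rv.shape)
            return rv
        return wrapper
    return decorator

@logStep()  # note the parentheses: logStep returns the actual decorator
def keepPositive(df):
    """Keep rows where x is positive."""
    return df[df["x"] > 0]

# Small made-up dataframe, only for illustration.
toy = pd.DataFrame({"x": [-1, 2, 3], "y": [0.1, 0.2, 0.3]})
out = toy.pipe(keepPositive)
```

Going through logging means the same pipeline can be silenced, redirected to a file, or made more verbose by changing the logger configuration instead of editing every function.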