Custom logging in Python

9/1/19

When working with data, there are a lot of things I need to keep track of in my head.

How many observations do I have in total?
I just filtered some out, how many did I throw out?
Am I aggregating before or after I filter?
Where should I add this chunk of code?

To keep track of these questions when working in a Jupyter notebook, I end up with tons of cells that look like this:

df.head()

df.shape

By using decorators and the pandas .pipe method, I can build an analysis path that prints customized output at each step and automates this tedious cycle of .head() and .shape. Let’s take a look.

import pandas as pd
import numpy as np
import functools

np.random.seed(5)
df = pd.DataFrame({
    'group': np.random.choice(['a', 'b', 'c'], 10),
    'x': np.random.randint(0, 10, 10),
    'y': np.random.normal(0, 10, 10)
})
df.head()

	group	x	y
0	c	0	9.118736
1	b	7	-14.438416
2	c	1	18.244402
3	c	5	14.576251
4	a	7	-9.102582

Now I’ll define some processing functions.

These functions all take the dataframe as an argument and pass the dataframe back. A few notes:

The pDoc decorator is what lets me print the function name, the docstring, and the shape of the dataframe after each step of the process.
The startPipe function may seem useless, but I’m just using it to get the size of the dataframe at the beginning of the analysis path.

def pDoc(func):
    """Print a function's name, docstring, and the shape of its return value."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        rv = func(*args, **kwargs)
        print("{}(): \n\t{} -> {}".format(func.__name__, func.__doc__, rv.shape))
        return rv
    return wrapper

@pDoc
def startPipe(df):
    """Begin pipeline"""
    return df

@pDoc
def filterGroups(df):
    """Remove group b from the analysis."""
    return df.query('group != "b"')

@pDoc
def capVal(df):
    """Cap the value of y at 10."""
    dat = df.copy()
    dat['y'] = dat['y'].clip(upper=10)  # vectorized cap; values above 10 become 10
    return dat

@pDoc
def getMean(df):
    """Add column as mean value of x by group."""
    dat = df.copy()
    dat['g_mean'] = dat.groupby('group')['x'].transform('mean')  # string alias avoids the FutureWarning newer pandas raises for np.mean
    return dat

Now I’ll tie all these functions together using .pipe.

(df
    .pipe(startPipe)
    .pipe(filterGroups)
    .pipe(getMean)
    .pipe(capVal)).head()

startPipe(): 
	Begin pipeline -> (10, 3)
filterGroups(): 
	Remove group b from the analysis. -> (8, 3)
getMean(): 
	Add column as mean value of x by group. -> (8, 4)
capVal(): 
	Cap the value of y at 10. -> (8, 4)

	group	x	y	g_mean
0	c	0	9.118736	3.0
2	c	1	10.000000	3.0
3	c	5	10.000000	3.0
4	a	7	-9.102582	3.5
6	a	1	-8.175481	3.5

As you can see, I get a nice log that shows each function’s name, docstring, and the shape of its output. I like this solution because it automates the tedious process of asking myself “how many records did I just throw out?” Because the logging lives in the decorator, every decorated function reports the shape of its output each time it runs.
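If that dropped-record count is the number that matters most, the wrapper can report it directly instead of leaving the subtraction to me. Here is a minimal sketch of that idea. The names p_rows and drop_group_b and the small demo dataframe are made up for illustration, and the sketch assumes the dataframe is always the first positional argument, which is how .pipe calls a function.

```python
import functools

import pandas as pd


def p_rows(func):
    """Hypothetical variant of pDoc: print rows in, rows out, and rows dropped."""
    @functools.wraps(func)
    def wrapper(df, *args, **kwargs):
        rows_in = len(df)                 # row count before the step runs
        rv = func(df, *args, **kwargs)
        rows_out = len(rv)                # row count after the step runs
        print("{}(): {} -> {} rows ({} dropped)".format(
            func.__name__, rows_in, rows_out, rows_in - rows_out))
        return rv
    return wrapper


@p_rows
def drop_group_b(df):
    """Remove group b from the analysis."""
    return df.query('group != "b"')


# Illustrative data only, not the dataframe used above.
demo = pd.DataFrame({'group': ['a', 'b', 'c', 'b'], 'x': [1, 2, 3, 4]})
result = demo.pipe(drop_group_b)
# prints: drop_group_b(): 4 -> 2 rows (2 dropped)
```

A step that adds rows (a join, for example) would show a negative "dropped" count, so this version is best suited to filtering steps.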

This solution is also easy to extend or modify. Don’t like what my pDoc decorator is doing? Change it. You’re only limited by your imagination (and Python).
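As one possible direction, here is a sketch of a configurable variant that sends its message through the standard logging module instead of print, and can optionally list the output columns. Everything named here (log_step, show_columns, the "pipeline" logger, add_double) is hypothetical and only meant to show the shape of the idea.

```python
import functools
import logging

import pandas as pd

logging.basicConfig(level=logging.INFO, format="%(message)s")
logger = logging.getLogger("pipeline")  # hypothetical logger name


def log_step(show_columns=False):
    """Hypothetical decorator factory: log a step's docstring and output shape."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            rv = func(*args, **kwargs)
            msg = "{}(): {} -> {}".format(func.__name__, func.__doc__, rv.shape)
            if show_columns:
                # Optionally append the column names of the returned dataframe.
                msg += " columns={}".format(list(rv.columns))
            logger.info(msg)
            return rv
        return wrapper
    return decorator


@log_step(show_columns=True)
def add_double(df):
    """Add column as twice the value of x."""
    return df.assign(x2=df['x'] * 2)


# Illustrative data only.
demo = pd.DataFrame({'x': [1, 2, 3]})
result = demo.pipe(add_double)
```

Going through logging rather than print means the same pipeline can be silenced, or redirected to a file, by changing the logger configuration rather than the decorator.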