Do you want your receipt? (Yes)

11/3/24

Recently I’ve been responding with a resounding “Yes” whenever a grocery store employee asks me if I want a receipt. Why? A vague idea for a fun project has been floating around in my mind. After a few months of “Yes”, I decided the pile of receipts in the drawer was sufficient, so it was time for project kickoff. I had a few burning questions:

  1. How frequently am I going to the grocery store?
  2. What does an average trip look like?
  3. Does the answer to question 1 or 2 vary over time?
  4. How much price variability is there within an item type (e.g. the price of a lemon)?
  5. How much price variability is there across different supermarkets?

Tooling-wise, I could start getting my questions answered by using Textract, AWS’s OCR service, to extract the key information. For each receipt, I wanted:

To get there, I:

  1. Took pictures of all the receipts and uploaded them to an S3 bucket in AWS
    • A lot of receipts were folded or crumpled up, so I set a big piece of clear plastic on top of each receipt so it would lie flat before I snapped its photo lol
  2. Called the AWS AnalyzeExpense API from Python using the boto3 package to create the OCR results
    • This produces very rich data, including bounding boxes for each word and a confidence score from 0 to 100
  3. Wrote some code to retrieve / reshape the JSON response from Textract into a pandas DataFrame
  4. Did some data cleanup (turned price values like $0.89 FT into numbers like 0.89)
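
Steps 2 to 4 can be sketched roughly as below. This is a minimal sketch, not the code actually used for this project: the function names and the receipt_id, price and item_confidence columns are made up for illustration, and it assumes the usual AnalyzeExpense response layout (ExpenseDocuments, then LineItemGroups, then LineItems, then LineItemExpenseFields tagged with types such as ITEM and PRICE).

```python
import re
from typing import Any, Optional


def fetch_expense(bucket: str, key: str) -> dict[str, Any]:
    """Run Textract AnalyzeExpense on one receipt image stored in S3."""
    import boto3  # imported lazily so the helpers below work without AWS set up

    client = boto3.client("textract")
    return client.analyze_expense(
        Document={"S3Object": {"Bucket": bucket, "Name": key}}
    )


def parse_price(raw: Optional[str]) -> Optional[float]:
    """Pull the first number out of a string like '$0.89 FT'.

    Signs are ignored, so discounts printed as negatives are not handled.
    """
    if not raw:
        return None
    match = re.search(r"\d*\.\d+|\d+", raw.replace(",", ""))
    return float(match.group()) if match else None


def line_items_to_rows(response: dict[str, Any], receipt_id: str) -> list[dict[str, Any]]:
    """Flatten one AnalyzeExpense response into one dict per line item."""
    rows = []
    for doc in response.get("ExpenseDocuments", []):
        for group in doc.get("LineItemGroups", []):
            for item in group.get("LineItems", []):
                # Index this line item's fields by type ("ITEM", "PRICE", ...).
                fields = {
                    f["Type"]["Text"]: f.get("ValueDetection", {})
                    for f in item.get("LineItemExpenseFields", [])
                }
                name = fields.get("ITEM", {})
                price = fields.get("PRICE", {})
                rows.append({
                    "receipt_id": receipt_id,  # hypothetical column name
                    "item_name": name.get("Text"),
                    "price": parse_price(price.get("Text")),
                    "item_confidence": name.get("Confidence"),
                })
    return rows
```

Passing the combined rows from every receipt to pandas.DataFrame(rows) gives one row per purchased item.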

I ended up with data that looks like this:

I was excited to dive in and start answering my questions, but then I took a closer look at the item_name column. My questions relied on being able to group items by their name, but the name that appears on the receipt isn’t a value that’s easily understood by a human. A few examples:

I’m in a bit of a pickle. I could spend my time writing some heuristics to try and categorize these, maybe like:

def classify_item_name(item_name: str) -> str:
    item_name = item_name.lower().strip()
    if 'soap' in item_name:
        return 'soap'
    elif 'chicken' in item_name:
        return 'chicken'
    ...

However, that gets ugly after like the third classification and would be a nightmare to maintain. I thought about using a prebuilt LLM to classify my items, but I wanted to gain experience interacting with a labeling interface.

After some googling, I came across Label Studio, an open source data labeling interface built with Python. After installing it as a Python package with pip, you can set up labeling jobs for classification and segmentation of text, audio, and video data. The best part? The results of all the labeling are stored in a local SQLite database, AND they have an API you can use to retrieve the labels.

Pickle resolved. To get Label Studio set up, I:

<!-- Sample XML config file for the labeling interface -->
<View>
  <Text name="text" value="$item_name"/>
  <Taxonomy name="taxonomy" toName="text">
    <Choice value="Meat">
      <Choice value="Chicken">
        <Choice value="Chicken - Thighs"/>
        <Choice value="Chicken - Breasts"/>
        <Choice value="Chicken - Rotisserie"/>
      </Choice>
      <Choice value="Beef"/>
      <Choice value="Pork"/>
      <Choice value="Turkey"/>
      <Choice value="Seafood">
        <Choice value="Fish"/>
        <Choice value="Shrimp"/>
        <Choice value="Crab"/>
      </Choice>
    </Choice>
  </Taxonomy>
</View>

Using that configuration file, the labeling interface looks like this:

Using this UI, I can quickly go through the list of my messy item names and put them into organized buckets I can use for further analysis. I haven’t finished classifying my list of more than 500 items yet, but once I was a decent way through, I pulled the label data out of Label Studio and merged it into my cleaned receipt data:

The label column holds the data that comes from Label Studio, and the label_n / label_broadest / label_specificist columns are my attempt to restructure those in a useful manner. Certainly wet paint at the moment but it feels like I’m on the right track.
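
A minimal sketch of that restructuring, under a few assumptions: the labels were exported from Label Studio as JSON, each task keeps its text under data.item_name (matching the $item_name in the config), and each Taxonomy result stores its selection under value.taxonomy as a list of paths such as ["Meat", "Chicken", "Chicken - Thighs"]. The flatten_taxonomy name is made up here, and label_n is left out because its exact meaning isn't spelled out above.

```python
from typing import Any


def flatten_taxonomy(tasks: list[dict[str, Any]]) -> list[dict[str, Any]]:
    """Turn exported labeling tasks into one row per selected taxonomy path."""
    rows = []
    for task in tasks:
        item_name = task.get("data", {}).get("item_name")
        for annotation in task.get("annotations", []):
            for result in annotation.get("result", []):
                # Each path runs from the broadest choice to the most specific one.
                for path in result.get("value", {}).get("taxonomy", []):
                    if not path:
                        continue
                    rows.append({
                        "item_name": item_name,
                        "label": list(path),
                        "label_broadest": path[0],
                        "label_specificist": path[-1],
                    })
    return rows
```

Because the rows carry item_name, they can be joined back onto the cleaned receipt data by that column.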

At this point I’ve got a lot more labeling to do, but I couldn’t resist taking a look at what I had so far, just for my own interest…

That shows the average price within each label, including some error bars I’m eager to explain away.
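
For reference, a summary like that can be computed with nothing but the standard library. This is only a sketch: it isn't stated above what the error bars represent, so the sample standard deviation here is an assumption, and price_summary plus the price and label_specificist row keys are illustrative names.

```python
from collections import defaultdict
from statistics import mean, stdev
from typing import Any


def price_summary(rows: list[dict[str, Any]], label_key: str = "label_specificist") -> dict[str, dict[str, Any]]:
    """Group prices by label and report count, mean and sample standard deviation."""
    prices = defaultdict(list)
    for row in rows:
        # Skip rows that are still unlabeled or have no parsed price.
        if row.get("price") is not None and row.get(label_key):
            prices[row[label_key]].append(row["price"])

    summary = {}
    for label, values in prices.items():
        summary[label] = {
            "n": len(values),
            "mean": mean(values),
            # A single observation has no spread to estimate.
            "std": stdev(values) if len(values) > 1 else None,
        }
    return summary
```

Labels with only one or two purchases will have wide or missing error bars, so the count is worth keeping next to each mean.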

Anyways, next steps for me: