Who's Bezos' Best Friend?

I just finished up making a dashboard in Dash that summarizes all the Amazon shareholder letters since 1998. I learned a TON during this project and still feel like there are a million things to do, but I’m calling it done for now! When I started this project I had two goals:

  1. Learn more about building dashboards with Dash
  2. Gain more experience with natural language processing (I used nltk quite a bit during my analysis). I gained experience with:
    • Stemming words
    • Lemmatizing words
    • Tokenizing strings
    • Computing sentiment
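
To give a feel for the first and third bullets, here is a minimal sketch of tokenizing and stemming with nltk. It is not the exact code from my analysis: the function name `tokenize_and_stem` is made up for this example, and I picked `wordpunct_tokenize` and `PorterStemmer` because they run without downloading any extra nltk data. Lemmatizing (for example with `WordNetLemmatizer`) and sentiment scoring (for example with nltk's VADER analyzer) need an `nltk.download(...)` step first, so they are left out here.

```python
from nltk.stem import PorterStemmer
from nltk.tokenize import wordpunct_tokenize


def tokenize_and_stem(text):
    """Return (token, stem) pairs for the alphabetic tokens in text.

    Hypothetical helper for illustration only.
    """
    stemmer = PorterStemmer()
    # wordpunct_tokenize is regex-based and needs no downloaded corpora.
    tokens = wordpunct_tokenize(text)
    # Drop punctuation and number fragments, keep plain words.
    return [(tok, stemmer.stem(tok)) for tok in tokens if tok.isalpha()]


pairs = tokenize_and_stem("Kindle has already been reviewed more than 2,000 times.")
for token, stem in pairs:
    print(token, "->", stem)
```

Stems are not always real words (that is the trade-off versus lemmatizing), but they are handy for grouping different forms of the same word when counting.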

I accomplished both of these and built my confidence with Dash. It was easy to use, and I’ll definitely keep that tool in my pocket for future projects.

Random Thoughts

My data cleaning process isn’t perfect, but for my original intended goals I’m happy with where I ended up. One thing I noticed was that some of the phrases marked as negative sentiment weren’t actually negative when put into the whole paragraph’s context. Take the following passage:

Many of you may already know something of Kindle—we’re fortunate (and grateful) that it has been broadly written and talked about. Briefly, Kindle is a purpose-built reading device with wireless access to more than 110,000 books, blogs, magazines, and newspapers. The wireless connectivity isn’t WiFi—instead it uses the same wireless network as advanced cell phones, which means it works when you’re at home in bed or out and moving around. You can buy a book directly from the device, and the whole book will be downloaded wirelessly, ready for reading, in less than 60 seconds. There is no “wireless plan,” no year-long contract you must commit to, and no monthly service fee. It has a paper-like electronic-ink display that’s easy to read even in bright daylight. Folks who see the display for the first time do a double-take. It’s thinner and lighter than a paperback, and can hold 200 books. Take a look at the Kindle detail page on Amazon.com to see what customers think—Kindle has already been reviewed more than 2,000 times.

The highlighted text (the sentence beginning “There is no”) was classified as having a negative sentiment, which does make sense when you take the sentence by itself. However, when put into context, it’s clear that the “negativity” of this phrase is actually making the paragraph more positive (the fact that the Kindle has none of this is a positive thing).

Definitely an area of the analysis I could improve. I do think the sentiment calculation I have is valuable for automatically identifying “interesting” parts of the document, or at least phrases that are opinionated.
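
To make the context problem concrete, here is a toy sketch of sentence-by-sentence scoring. It is not the sentiment model I used: `TOY_LEXICON`, its weights, and both function names are invented for this example, and a real tool uses a far larger lexicon plus extra rules. The point is only that a scorer that looks at one sentence at a time sees "no", "contract", and "fee" and calls the sentence negative, with no idea that the surrounding paragraph is a sales pitch.

```python
import re

# Hypothetical toy lexicon with made-up weights, for illustration only.
TOY_LEXICON = {
    "no": -1.0, "fee": -0.5, "contract": -0.5, "commit": -0.25,
    "easy": 1.0, "fortunate": 1.0, "grateful": 1.0,
}


def score_sentence(sentence):
    """Average the lexicon weights of the words found in one sentence."""
    words = re.findall(r"[a-z']+", sentence.lower())
    hits = [TOY_LEXICON[w] for w in words if w in TOY_LEXICON]
    return sum(hits) / len(hits) if hits else 0.0


def score_paragraph(paragraph):
    """Split on sentence-ending punctuation and score each sentence alone."""
    sentences = re.split(r"(?<=[.!?])\s+", paragraph.strip())
    return [(s, score_sentence(s)) for s in sentences if s]


passage = (
    'There is no "wireless plan," no year-long contract you must commit to, '
    "and no monthly service fee. "
    "It has a paper-like electronic-ink display that's easy to read "
    "even in bright daylight."
)

scores = score_paragraph(passage)
for sentence, score in scores:
    label = "negative" if score < 0 else "positive" if score > 0 else "neutral"
    print(f"{label:>8}  {score:+.2f}  {sentence[:50]}")
```

The first sentence comes out negative and the second positive, even though both are selling the Kindle. Fixing that would take something that looks beyond the single sentence, which is why I still treat the score as a flag for "interesting" phrases rather than a verdict.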


Two last things I think would be interesting as further work:

  1. Train a classifier to predict the author
    • author = f(phrases)
  2. Cluster phrases into a few different categories, instead of just positive / neutral / negative.
    • cluster = f(phrases, year, etc.)

On to the next one!