Wall Street Journal, the dataset

11/10/23

I get a digital copy of the Wall Street Journal in my email inbox every day, and since natural language processing has been on my mind lately, I had an idea: turn those issues into a dataset.

I used the Gmail API to download all 684 issues and put them in an S3 bucket, but it wasn’t as easy as I thought.
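
A minimal sketch of the link-extraction step, assuming each message was fetched with the Gmail API in `full` format, where a part's `body.data` field is base64url-encoded. The helper names (`LinkCollector`, `iter_html_bodies`, `extract_links`) are hypothetical and only illustrate the idea; the actual Gmail calls and the S3 upload are left out.

```python
import base64
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect the href of every <a> tag fed to the parser."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def iter_html_bodies(payload):
    """Yield decoded text/html bodies from a message payload, recursing into parts."""
    data = payload.get("body", {}).get("data")
    if payload.get("mimeType") == "text/html" and data:
        # Restore any stripped padding before decoding the URL-safe base64 data
        padded = data + "=" * (-len(data) % 4)
        yield base64.urlsafe_b64decode(padded).decode("utf-8", errors="replace")
    for part in payload.get("parts", []):
        yield from iter_html_bodies(part)


def extract_links(payload):
    """Return every link found in the HTML parts of one message payload."""
    collector = LinkCollector()
    for html in iter_html_bodies(payload):
        collector.feed(html)
    return collector.links
```

Running `extract_links(message["payload"])` over every message would give a flat list of candidate links, which would still need filtering down to the one that points at the paper.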

It looks like the link in each email is an intermediate link: when you click it, some magic happens that redirects you to the actual PDF of the paper. Unfortunately, this meant no sneaky use of the requests library. I ended up using Selenium, but had to add an experimental option so that Chrome downloads the PDF instead of displaying it:

from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.chrome.options import Options
from webdriver_manager.chrome import ChromeDriverManager

options = Options()
options.add_argument("--window-size=1920,1080")  # Chrome expects width,height
options.add_argument("--verbose")

# Thank you to this post:
# https://stackoverflow.com/questions/43149534/selenium-webdriver-how-to-download-a-pdf-file-with-python
options.add_experimental_option("prefs", {
    # Download PDFs instead of opening them in Chrome's built-in viewer
    "plugins.always_open_pdf_externally": True,
})

service = Service(ChromeDriverManager().install())
driver = webdriver.Chrome(service=service, options=options)

# LINKS is the list of intermediate URLs collected from the emails
for link in LINKS:
    # Each driver.get() triggers an immediate PDF download thanks to the
    # "plugins.always_open_pdf_externally" pref set above
    driver.get(link)
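
One thing the loop does not handle is knowing when the downloads have finished. A minimal sketch of one way to check, assuming Chrome's convention of naming in-progress downloads with a `.crdownload` suffix; `wait_for_downloads` is a hypothetical helper, and the download directory is whatever Chrome is configured to save into:

```python
import time
from pathlib import Path


def wait_for_downloads(download_dir, timeout=60.0, poll_interval=0.5):
    """Return True once no *.crdownload files remain, or False if the timeout expires."""
    directory = Path(download_dir)
    deadline = time.monotonic() + timeout
    while True:
        # Chrome renames a partial file once its download completes
        if not any(directory.glob("*.crdownload")):
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(poll_interval)
```

Calling this after the loop, and before `driver.quit()`, avoids closing the browser while files are still in flight. It cannot tell a finished download from one that has not started yet, so a short pause after the last `driver.get()` may still be needed.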

I sense some AWS Textract in my future…