Wall Street Journal, the dataset
11/10/23
I get a digital copy of the Wall Street Journal in my email inbox every day, and since natural language processing has been on my mind lately, I had an idea: turn them into a dataset.
I used the Gmail API to download all 684 of them to put in an S3 bucket, but it wasn’t as easy as I thought.
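For a rough idea of the link-gathering step, here is a minimal sketch of pulling URLs out of a message as the Gmail API returns it. It assumes the message was fetched in the "full" format, so the body sits under `payload` as URL-safe base64; the helper names `iter_bodies` and `extract_links` are made up for illustration and are not the code I actually ran.

```python
import base64
import re

# Matches absolute http(s) links inside href="..." attributes.
HREF_RE = re.compile(r'href="(https?://[^"]+)"')


def iter_bodies(payload):
    """Yield the decoded text of every text/* part in a Gmail message payload."""
    data = payload.get("body", {}).get("data")
    if data and payload.get("mimeType", "").startswith("text/"):
        # Gmail encodes body data as URL-safe base64; re-add padding defensively.
        padded = data + "=" * (-len(data) % 4)
        yield base64.urlsafe_b64decode(padded).decode("utf-8", errors="replace")
    # Multipart messages nest their parts, so recurse into them.
    for part in payload.get("parts", []):
        yield from iter_bodies(part)


def extract_links(message):
    """Return the unique links found in a message, in order of appearance."""
    links = []
    for body in iter_bodies(message.get("payload", {})):
        for url in HREF_RE.findall(body):
            if url not in links:
                links.append(url)
    return links
```

Running `extract_links` over each downloaded message gives a list of candidate URLs; which of them is the newspaper link still has to be picked out by hand or by pattern.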
It looks like the link in the email is some sort of intermediate link, and when you click it, some magic happens that then redirects you to the actual PDF of the news. Unfortunately, this meant no sneaky use of the requests library. I ended up using Selenium, but had to add some experimental options:
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.chrome.options import Options
from webdriver_manager.chrome import ChromeDriverManager
options = Options()
options.add_argument("--window-size=1920,1080")  # Chrome expects width,height
options.add_argument("--verbose")
# Thank you to this post:
# https://stackoverflow.com/questions/43149534/selenium-webdriver-how-to-download-a-pdf-file-with-python
options.add_experimental_option('prefs', {
    "plugins.always_open_pdf_externally": True  # Download PDFs instead of showing them in Chrome
})
service = Service(ChromeDriverManager().install())
driver = webdriver.Chrome(service=service, options=options)
for link in LINKS:
    # Each loop downloads the pdf immediately thanks to the
    # options.add_experimental_option line
    driver.get(link)
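One thing the loop above does not do is wait for a download to finish before moving on. A minimal sketch of a polling helper is below, relying on the fact that Chrome writes in-progress downloads with a `.crdownload` suffix. The name `wait_for_downloads` and its parameters are hypothetical, and the approach is an assumption on my part rather than something the script needed.

```python
import time
from pathlib import Path


def wait_for_downloads(download_dir, timeout=60.0, poll=0.5):
    """Return True once no *.crdownload files remain, or False on timeout.

    Hypothetical helper: Chrome keeps a .crdownload file next to each
    unfinished download and renames it when the download completes.
    """
    deadline = time.monotonic() + timeout
    while True:
        if not any(Path(download_dir).glob("*.crdownload")):
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(poll)
```

It would be called right after `driver.get(link)` with the folder Chrome saves to (the default Downloads folder, or whatever `download.default_directory` is set to in the prefs). Note that it returns immediately if the download has not started yet, so a short pause before calling it may be needed.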
I sense some AWS Textract in my future…