Web Scraping My Recycling Schedule (using Python)

The Problem

Today, if I want to know when my garbage or recycling will be picked up, I have to open a web browser, navigate to the City of Milwaukee's recycling website, type in my address, and hit submit. While garbage pickup happens on a fairly regular weekly schedule, recycling pickup follows a schedule I do not understand. Those 60 seconds of work nearly every week are not only annoying, they're also not a good use of my time. (My desire to automate the boring stuff in my life is another post.)

Ultimately, I want to be able to ask Alexa when the recycling (or garbage) will be picked up and get a response. The first step is to get the relevant information from the city's website. That's what I'll be demonstrating today.

The Setup

If you're not familiar with Python, you should start to dabble. It's a great data analysis and data science tool, but also helps you automate those tedious tasks you hate. Oh, and it's free! And it integrates with lots of awesome tools, such as Tableau and Alteryx.
  • Install Anaconda (this installs the Python language and lots of great packages)
  • Install the pandas, bs4, and selenium packages (some may already come with Anaconda; see the quick check after this list)
  • Download the Chrome web driver, ChromeDriver (Selenium needs it to control Chrome)
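
Once those are installed, a quick sanity check (just a sketch) is to import the three packages and print their versions; if this runs without an ImportError, you're ready to go.

# quick check that everything is installed and importable
import pandas
import bs4
import selenium

print(pandas.__version__, bs4.__version__, selenium.__version__)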

The Code

I'm still pretty new to Python, and I've learned that while everyone structures their code differently, there are certain best practices. One caveat: I'm 100% sure I do not follow all of those best practices. But I organize my code in a way that is helpful for me.

Import Packages

Selenium is used to replicate what a human would do: open a web browser, navigate to the page, enter an address, and hit the submit button. Automating those actions requires Selenium, and Selenium needs the Chrome web driver. BeautifulSoup parses the HTML that Selenium retrieves from the website. Finally, pandas is used to store the resulting information. It's possible pandas won't be needed in a future version.

from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
from bs4 import BeautifulSoup
import pandas as pd

Set Variables

The URL is the website I'm getting information from, and the driver points to wherever you saved the Chrome web driver. You'll also need to set your street address; the address fields correspond directly to the form on the website I'm using.

# what's the url and where is your Selenium Chrome driver saved?
url = "https://city.milwaukee.gov/sanitation/GarbageRecyclingSchedules"
driver = webdriver.Chrome(service=Service('C:/Users/bbeals/Selenium/chromedriver'))
driver.implicitly_wait(30)   # wait up to 30 seconds for elements to appear
driver.get(url)

# what's your street address? save it here!
address = '3536'
direction = 'W'
street = 'FOND DU LAC'
streettype = 'AV'

Identify Form Elements

You'll figure out pretty fast that web scraping relies heavily on a website's format and structure. If that structure changes, you'll have to update your code. Fun stuff. This code segment finds the elements of the form on the website so we can refer to them when we enter the address. A computer doesn't see a page the way humans do; instead, it sees the code that lives behind the scenes. You can view that code by hitting F12 in Chrome, or by right-clicking the element you're interested in and selecting 'Inspect'. In this case, I'm finding each form element by its name.

# this particular website has an embedded iframe so select that iframe first
iframe = driver.find_element(By.CSS_SELECTOR, 'iframe')
driver.switch_to.frame(iframe)

# now find the various elements where you need to enter your address
address_element = driver.find_element(By.NAME,'laddr')
direction_element = driver.find_element(By.NAME,'sdir')
street_element = driver.find_element(By.NAME,'sname')
streettype_element = driver.find_element(By.NAME,'stype')

Enter Address

At this point, we have set variables with the information to be entered into the form and identified the elements of the form, but we have yet to actually do anything. Here's where the magic happens. For each form element, I will select the element by simulating a click and then enter the appropriate information. Finally, I will submit the form to retrieve the results.

# street number first
address_element.click()
address_element.send_keys(address)

# then street direction info
direction_element.click()
direction_element.send_keys(direction)

# then what street you live on
street_element.click()
street_element.send_keys(street)

# then fill in the street type
streettype_element.click()
streettype_element.send_keys(streettype)

# finally, find and click the Submit button
submit = driver.find_element(By.NAME, 'Submit')
submit.click()

Scrape Results

Once the form has been submitted, the website displays the results, and this is where BeautifulSoup comes in. The page needs to be captured and parsed. Given the format of the website, I know the type of pickup (garbage or recycling) appears in a header (an h2 in HTML-speak). Similarly, the pickup dates are in bold (called strong in HTML), which is one of the only differentiators I've found that lets me identify that info. In the code segment below, I save a screenshot just because, capture all the HTML, and find all h2 and strong elements.

# save a screenshot (for funsies)
driver.save_screenshot('image.png')

# get the html to parse later
html = driver.page_source

# create soup object
soup = BeautifulSoup(html, 'html.parser')

# save header text to list
category = []
for text in soup.find_all('h2'):
    category.append(text.get_text())

# save bold text to list
date = []
for text in soup.find_all('strong'):
    date.append(text.get_text())

Format Results

Once the details I'm interested in have been downloaded, I want to format the results into a table.

# keep the 2nd and 4th items in the list (0-indexed items 1 and 3)
# so slice from index 1 up to 4, stepping by 2
date = date[1:4:2]
df = pd.DataFrame(list(zip(category, date)), 
                  columns = ['Category','Date'])

Quit

driver.quit()
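
One improvement I haven't made yet: wrapping the scraping in try/finally so the browser closes even if something above throws an error. A minimal sketch:

# sketch: quit() runs no matter what, so a stray Chrome window never gets left behind
try:
    # ... the navigation and scraping code from above goes here ...
    pass
finally:
    driver.quit()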

The Results

After all that code, I'm left with a small two-by-two table: two rows (one per pickup type) and two columns (Category and Date).
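
To take a quick look at it, printing the DataFrame is enough; the exact labels and dates depend on what the city's page returns for your address, but the structure is the two columns defined above.

# show the final table (contents will vary by address and week)
print(df.to_string(index=False))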


The Future

This code was a fun little project and does a decent job at retrieving the information I need. So, what's next?
  • Modify the code to handle complexities such as the garbage being picked up on the same day the code is run (I've found that when this happens, extra bolding is included)
  • Modify the code to handle leaf pickup during the autumn months (maybe I want to know about this)
  • Set up an Alexa skill to comprehend what I'm asking for (garbage or recycling)
  • Create an AWS Lambda function to run this Python script on demand (see the sketch after this list)
  • Leverage AWS services to read the results to me
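
For that Lambda step, here's a minimal sketch of what the handler might eventually look like. It assumes the scraping code above gets wrapped in a function I'm calling get_pickup_schedule (a name I'm making up for illustration), and it glosses over the real work of running headless Chrome inside Lambda, which typically means packaging a headless Chromium binary and driver as a layer.

import json

def get_pickup_schedule():
    # placeholder: the real version would run the Selenium/BeautifulSoup code above
    # and return its scraped results as a dict of pickup type -> date
    return {'garbage': '<date scraped from the site>', 'recycling': '<date scraped from the site>'}

def lambda_handler(event, context):
    # the Alexa skill would pass along which schedule was asked about
    category = event.get('category', 'recycling')
    schedule = get_pickup_schedule()
    return {
        'statusCode': 200,
        'body': json.dumps({'category': category, 'pickup': schedule.get(category)})
    }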