Web Scraper

A simple web scraper framework, hosted on Heroku.


Created December 2020
Python
Selenium
Discord
Heroku
Requests

Overview

The goal of this application is to periodically retrieve a specified HTML attribute from a given website and send its value in a notification. The application also logs the attribute's value over time for reference.

This application represents a simple framework for building more complex web crawlers: retrieve, parse, log, & send.
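The four stages can be sketched as a small pipeline. This is only an illustration and not the app's actual code: `run_once` and its four parameters are hypothetical names, and each stage is passed in as a plain function so it can be swapped for a more complex one.

```python
from datetime import datetime, timezone
from typing import Callable


def run_once(
    retrieve: Callable[[], str],
    parse: Callable[[str], str],
    log: Callable[[str, str], None],
    send: Callable[[str, str], None],
) -> str:
    """Run one retrieve -> parse -> log -> send cycle and return the parsed value."""
    html = retrieve()                                    # raw page source
    value = parse(html)                                  # the value of interest
    timestamp = datetime.now(timezone.utc).isoformat()   # when it was observed
    log(timestamp, value)                                # keep a history of values
    send(timestamp, value)                               # push a notification
    return value
```

A scheduler would then call `run_once` in a loop, sleeping between cycles to get the periodic behavior.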

Design

The application uses Selenium to go to the given website with a headless Chrome driver. This essentially opens a Chrome session without a browser interface and loads the site.

Once the site is loaded, the web scraper can parse the HTML. In this simple application, only the website's title is retrieved.

The retrieved value is then logged to a JSON file with a timestamp. This log file can be fetched by sending a command to a bot on the Discord server.

This value is sent to the Discord server via a POST request to a Discord server webhook.

Within the app’s config variables I added the following values:

Design

The diagram below maps out the flow of files and functions within the app’s design.

For your app to run on Heroku, you need to tell it where to start with a shell command. This command goes in the Procfile.

* The Procfile has no file extension

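The requirements file (requirements.txt, by Heroku's Python convention) specifies which libraries are required for the app to run.

The runtime file (runtime.txt, by Heroku's Python convention) specifies which version of Python is required for the app to run.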

This Python file stores app variables. These can also be stored as configuration variables within Heroku.
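Heroku exposes its config variables to the app as environment variables, so a config module can read them and fall back to local defaults. The sketch below shows one way to do that; the function name `load_config`, the variable names, and the default values are all hypothetical and not taken from the app.

```python
import os
from typing import Mapping


def load_config(env: Mapping[str, str] = os.environ) -> dict:
    """Read settings from environment variables, falling back to local defaults."""
    return {
        # All names and defaults here are placeholders for illustration.
        "url": env.get("TARGET_URL", "https://example.com"),
        "webhook_url": env.get("WEBHOOK_URL", ""),
        "bot_token": env.get("BOT_TOKEN", ""),
        "interval_seconds": int(env.get("INTERVAL_SECONDS", "3600")),
    }
```

Keeping secrets such as the bot token in config variables, instead of in the Python file, keeps them out of version control.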

When Heroku runs the Procfile, the shell command executes and calls __main__ within app.py. From there, two processes are created: one for web scraping and one for the Discord bot. The web scraper process calls main(), while the Discord bot process calls activate() within bot.py. The app’s main() function then calls a succession of helper functions.
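One way to create the two processes is Python's standard multiprocessing module. Whether the app uses this module is an assumption on my part, and `build_processes` is a hypothetical helper; only main() and activate() are names from the app.

```python
from multiprocessing import Process
from typing import Callable, List


def build_processes(
    scraper_main: Callable[[], None],
    bot_activate: Callable[[], None],
) -> List[Process]:
    """Create (but do not start) one process per long-running task."""
    return [
        Process(target=scraper_main, name="scraper"),  # would run main()
        Process(target=bot_activate, name="bot"),      # would run bot.activate()
    ]
```

Inside the `if __name__ == "__main__":` block, the caller would call `start()` on each process and then `join()` them so the app stays alive while both run.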

Various arguments are added to the Chrome options to optimize the app’s performance on the server. The created webdriver is returned by the function.
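A minimal sketch of such a function is shown here. The specific flags are my assumption of a typical set for running headless Chrome on a container host, not the app's actual list, and `HEADLESS_ARGS` and `create_driver` are hypothetical names.

```python
# Assumed flags; the app's real list is not shown in this write-up.
HEADLESS_ARGS = [
    "--headless",               # run without a browser window
    "--no-sandbox",             # commonly needed on container hosts
    "--disable-dev-shm-usage",  # avoid the small /dev/shm in containers
    "--disable-gpu",            # no GPU is available on the server
]


def create_driver(arguments=HEADLESS_ARGS):
    """Build a headless Chrome webdriver with the given command-line flags."""
    # Imported here so the flag list can be inspected without Selenium installed.
    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options

    options = Options()
    for argument in arguments:
        options.add_argument(argument)
    return webdriver.Chrome(options=options)
```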

The webdriver uses headless Chrome to open the website specified in the config file. The website’s HTML is copied, headless Chrome is closed, and the HTML is returned by the function.
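The open, copy, close sequence might look like the sketch below. `fetch_html` is a hypothetical name; `get`, `page_source`, and `quit` are the standard Selenium webdriver members. The `finally` block is my own addition so that Chrome is closed even when the page fails to load.

```python
def fetch_html(driver, url: str) -> str:
    """Load url in the given webdriver and return the page's HTML.

    The driver is always closed, even if loading the page raises an error.
    """
    try:
        driver.get(url)             # navigate headless Chrome to the site
        return driver.page_source   # copy the HTML before closing
    finally:
        driver.quit()               # shut down the Chrome session
```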

Beautiful Soup 4 is used to parse the HTML and find the page's title element.
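A short sketch of this step with Beautiful Soup 4 follows. The function name `extract_title` is hypothetical, and the choice of Python's built-in "html.parser" backend is an assumption; the app may use a different parser.

```python
from typing import Optional

from bs4 import BeautifulSoup


def extract_title(html: str) -> Optional[str]:
    """Return the text of the page's <title> element, or None if it has none."""
    soup = BeautifulSoup(html, "html.parser")  # built-in parser, no extra install
    if soup.title is None:
        return None
    return soup.title.get_text(strip=True)
```

Retrieving a different value only means changing the lookup, for example searching for another tag or attribute on the same `soup` object.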

The title is saved in a JSON file, using the timestamp as the dictionary key.
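The logging step could be sketched as below. `log_value` is a hypothetical name, and the ISO 8601 UTC timestamp format is an assumption; the app's actual key format is not shown here.

```python
import json
from datetime import datetime, timezone
from pathlib import Path
from typing import Optional


def log_value(path: str, value: str, timestamp: Optional[str] = None) -> dict:
    """Add value to the JSON log at path, keyed by timestamp, and return the log."""
    log_file = Path(path)
    # Start from the existing log so earlier entries are preserved.
    entries = json.loads(log_file.read_text()) if log_file.exists() else {}
    key = timestamp or datetime.now(timezone.utc).isoformat()
    entries[key] = value
    log_file.write_text(json.dumps(entries, indent=2))
    return entries
```

Because the timestamp is the key, two values logged under the same timestamp would overwrite each other, which is fine for a scraper that runs at most once per interval.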

An HTTP POST request is used to send the title and timestamp to the Discord server.
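With the Requests library, the webhook call might look like the sketch below. Discord webhooks accept a JSON body with a `content` field; the message layout and the names `build_payload` and `send_to_discord` are hypothetical.

```python
def build_payload(title: str, timestamp: str) -> dict:
    """Build the JSON body for a Discord webhook message."""
    return {"content": f"[{timestamp}] {title}"}


def send_to_discord(webhook_url: str, title: str, timestamp: str) -> int:
    """POST the message to the webhook and return the HTTP status code."""
    import requests  # third-party; imported here so build_payload has no dependencies

    response = requests.post(
        webhook_url,
        json=build_payload(title, timestamp),
        timeout=10,  # do not hang the scraper if Discord is slow
    )
    response.raise_for_status()  # surface 4xx/5xx errors instead of ignoring them
    return response.status_code
```

The webhook URL itself should be treated like a password, since anyone who has it can post to the channel.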

The function activate() opens a client session with the bot’s token. When the bot is online, the function on_ready() is triggered, logging ‘Bot is ready.’ within Heroku. The command > view ./data.json triggers the bot to send the JSON data file in the Discord server’s chat.

Results

Launching the app produces this output to the log:

The initial message is sent via webhook to the Discord server. The JSON data file is sent in the server’s chat when the >view command is invoked.