A simple web scraper framework, hosted on Heroku.
The goal of this application is to periodically retrieve a specified HTML attribute from a given website and send its value in a notification. The application also logs the attribute's value over time for reference.
This application represents a simple framework for building more complex web crawlers: retrieve, parse, log, & send.
The application uses Selenium to open the given website with a headless Chrome driver. This essentially starts a Chrome session without a browser interface and navigates to the site.
Once the page is loaded, the web scraper can parse the HTML. In this simple application, only the website's title is retrieved.
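As a minimal sketch of the parse step, the title can be pulled out of an HTML string with Python's standard-library `html.parser`. The names `TitleParser` and `extract_title` are hypothetical, and the application itself may use a different parsing library.

```python
from html.parser import HTMLParser


class TitleParser(HTMLParser):
    """Collects the text inside the first <title> element (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self._in_title = False
        self._done = False
        self._parts = []

    def handle_starttag(self, tag, attrs):
        # Only the first <title> is of interest.
        if tag == "title" and not self._done:
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title" and self._in_title:
            self._in_title = False
            self._done = True

    def handle_data(self, data):
        # Text may arrive in several chunks, so collect the pieces.
        if self._in_title:
            self._parts.append(data)

    @property
    def title(self):
        return "".join(self._parts).strip()


def extract_title(html):
    """Return the page title, or an empty string if there is none."""
    parser = TitleParser()
    parser.feed(html)
    parser.close()
    return parser.title
```

Retrieving a different attribute would mean changing which tag the handlers watch for; the rest of the pipeline stays the same.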
The retrieved value is then logged to a JSON file with a timestamp. This log file can be fetched by sending a command to a bot on the Discord server.
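The logging step could look roughly like the sketch below, which appends a timestamped entry to a JSON file. The function name `log_value`, the default file name `data.json`, and the entry layout are assumptions for illustration, not the application's actual format.

```python
import json
from datetime import datetime, timezone
from pathlib import Path


def log_value(value, log_path="data.json"):
    """Append a timestamped entry to a JSON log file and return the entry."""
    path = Path(log_path)
    # Load earlier entries if the file already exists; otherwise start fresh.
    entries = json.loads(path.read_text(encoding="utf-8")) if path.exists() else []
    entry = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "value": value,
    }
    entries.append(entry)
    path.write_text(json.dumps(entries, indent=2), encoding="utf-8")
    return entry
```

Keeping the whole log as one JSON list makes the file easy to send as a single attachment, at the cost of rewriting it on every update.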
The value is also sent to the Discord server via a POST request to a Discord webhook.
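A minimal sketch of the send step, using only the standard library: Discord webhooks accept a JSON body with a `content` field. The function names, the `User-Agent` string, and the timeout are assumptions; the application may use a different HTTP client.

```python
import json
import urllib.request


def build_webhook_request(webhook_url, message):
    """Build (but do not send) a POST request carrying a webhook message."""
    payload = json.dumps({"content": message}).encode("utf-8")
    return urllib.request.Request(
        webhook_url,
        data=payload,
        headers={
            "Content-Type": "application/json",
            # Assumption: an explicit User-Agent, in case the default is rejected.
            "User-Agent": "simple-web-scraper/0.1",
        },
        method="POST",
    )


def send_notification(webhook_url, message):
    """Send the message and return the HTTP status code."""
    request = build_webhook_request(webhook_url, message)
    with urllib.request.urlopen(request, timeout=10) as response:
        return response.status
```

Splitting request construction from sending keeps the payload easy to inspect without touching the network.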
Within the app’s config variables I added the following values:
For your app to run on Heroku, you need to tell it where to start with a shell command.
* The Procfile has no file extension.
This Python file stores app variables. These can also be stored as configuration variables within Heroku.
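Heroku exposes config variables to the app as environment variables, so one way to support both options is to read the environment first and fall back to values in the file. The setting names and defaults below are hypothetical placeholders.

```python
import os

# Hypothetical setting names; the real config file may use different ones.
DEFAULTS = {
    "WEBSITE_URL": "https://example.com",
    "WEBHOOK_URL": "",
    "INTERVAL_SECONDS": "3600",
}


def get_setting(name, environ=os.environ):
    """Return a setting, preferring an environment (Heroku config) variable."""
    return environ.get(name, DEFAULTS[name])
```

For example, `get_setting("WEBSITE_URL")` returns the Heroku config value when one is set and the file default otherwise.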
When Heroku runs the Procfile, the shell command executes, calling __main__ within app.py.
From here two processes are created, one for web scraping and the other for launching the Discord bot.
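The two-process startup could be sketched with the standard `multiprocessing` module as follows. The target functions `run_scraper` and `run_discord_bot` are hypothetical placeholders, not the application's actual entry points.

```python
import multiprocessing


def run_scraper():
    """Placeholder for the scraping loop (hypothetical)."""
    print("scraper process started")


def run_discord_bot():
    """Placeholder for the Discord bot loop (hypothetical)."""
    print("bot process started")


def build_processes():
    """Create, but do not start, one process per task."""
    return [
        multiprocessing.Process(target=run_scraper, name="scraper"),
        multiprocessing.Process(target=run_discord_bot, name="discord-bot"),
    ]


if __name__ == "__main__":
    processes = build_processes()
    for process in processes:
        process.start()
    # Wait for both processes so the main program does not exit early.
    for process in processes:
        process.join()
```

Running the two tasks in separate processes means a blocking bot event loop does not stall the scraper, and vice versa.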
Launching the web scraper calls
main(), while the Discord bot process calls
main() function calls a succession of helper functions.
Various arguments are added to the Chrome options to optimize the app’s performance on the server. The created webdriver is returned by the function.
The webdriver uses headless Chrome to open the website specified in the config file. The website’s HTML is copied, headless Chrome is closed, and the HTML is returned by the function.
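The two helper steps above, creating the webdriver and fetching the HTML, might look like the sketch below. The specific Chrome flags and the function names are assumptions; the flags shown are common choices for containerized servers and may differ from the ones the application sets.

```python
# Assumption: typical flags for running Chrome on a server without a display.
CHROME_ARGUMENTS = ["--headless", "--no-sandbox", "--disable-dev-shm-usage"]


def create_driver():
    """Return a headless Chrome webdriver (requires the selenium package)."""
    # Imported lazily so the module loads even where selenium is not installed.
    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options

    options = Options()
    for argument in CHROME_ARGUMENTS:
        options.add_argument(argument)
    return webdriver.Chrome(options=options)


def fetch_html(url):
    """Open the URL, copy the page HTML, and always close the browser."""
    driver = create_driver()
    try:
        driver.get(url)
        return driver.page_source
    finally:
        # Quit even if loading fails, so no Chrome process is left running.
        driver.quit()
```

Closing the driver in a `finally` block matters on a small server, where leftover Chrome processes can quickly exhaust memory.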
Launching the app produces this output to the log:
The initial message is sent via webhook to the Discord server. The JSON data file is sent in the server’s chat invoked by the