kurtcms.org

Thinking. Writing. Philosophising.


Web Monitoring: Monitor Web Page for Changes

Posted on March 30, 2021 — 12 Minutes Read

The internet is one of the modern wonders. It houses a wealth of content, information and knowledge, of times past and present. It is created and shared by the thousands of millions of users connected to it, and is available to all with access, the means of which for most people are right at their fingertips. For all of its glory, and for all of its content that evolves with time and with the advancement of science and technology, it is still a passive resource: changes go unannounced and there is no mechanism for version control. It stands to reason that there is an increasing need for a way to keep tabs on a web page of interest should its content be changed to reflect new findings and the new reality. What follows is a Python app that monitors any web page on the internet and sends an email notification when the page has changed.

The code is containerised with Docker Compose for a modular and cloud-native deployment that fits in any microservice architecture, and is shared on Github for reference and further development. For monitoring the increasingly popular dynamic web pages that render content on the client side with JavaScript through the Document Object Model (DOM), another Python script should be used instead.

Writing a Python Script

Now your server needs to be instructed to read a webpage, and to compare it to a previous read to see if there has been any change. There are many tools or scripting languages that one can use to automate tasks on a server and Python is one of them.

Code readability, with the use of indentation instead of brackets to signify code blocks, and object-oriented design are among the many strengths of Python. The most distinctive of these, however, is the massive number of readily available libraries i.e. packages of pre-written code that may be imported to perform tasks of all kinds. Like a Swiss army knife or a multi-purpose tool with batteries included, an initiated developer can perform wonders with Python right out of the box. For the task of monitoring a web page, the library to use is Requests, a neatly built library for making HTTP requests.

Before writing the script, the latest version of Python and the corresponding package manager need to be properly installed.

$ apt install python3 python3-pip

Once these are set, the relevant libraries may be installed with the package manager as well.

$ pip3 install requests python-dotenv

With everything in place, use the nano text editor to open a new file and start writing the script.

$ nano

Thereafter goes the script.

import requests
import smtplib, ssl
from sys import path, argv
from getopt import getopt, GetoptError
from textwrap import dedent
from hashlib import sha256
from os import mkdir, environ
from dotenv import load_dotenv, find_dotenv
from datetime import datetime

class monitor:
    ERR_USAGE = '''\
    Usage: web-monitor.py [-e] -u <url>
    Option:
      -h
        Display usage
      -e, --email
        Send email notification for changes
      -u, --url
        The URL of interest
    '''
    ERR_USAGE = dedent(ERR_USAGE)
    '''
    Usage example of this script
    '''

    ERR_INVALID_ENV = 'Problem locating the .env file'
    '''
    Error message to display when python-dotenv fails to read
    environment variables
    '''

    def __init__(self, argv):
        '''
        Read the argument(s) supplied and set variables depending
        on their values
        '''
        try:
            opts, args = getopt(argv, 'heu:',['email', 'url='])
        except GetoptError:
            '''
            Raise a system exit with the script usage
            on error reading the arguments
            '''
            raise SystemExit(self.ERR_USAGE)

        self.email_noti = False
        for opt, arg in opts:
            if opt == '-h':
                raise SystemExit(self.ERR_USAGE)
            elif opt in ('-e', '--email'):
                self.email_noti = True
            elif opt in ('-u', '--url'):
                self.url = arg

        if not hasattr(self, 'url'):
            '''
            Raise a system exit with the script usage in the
            absence of an URL supplied by the -u or --url argument
            '''
            raise SystemExit(self.ERR_USAGE)

        if self.email_noti == True:
            '''
            Construct the email message and read environment variables
            with python-dotenv if email notification is requested
            '''
            self.email_msg = f'''\
                Subject: {self.url} has been updated\n
                Updates are stored in separate files'''
            self.email_msg = dedent(self.email_msg)

            if load_dotenv(find_dotenv()) == False:
                '''
                Raise a system exit on error reading the
                environment variables
                '''
                raise SystemExit(self.ERR_INVALID_ENV)

            try:
                '''
                Read and set the environment variables needed for
                email notification
                '''
                self.email_sslp = environ['EMAIL_SSL_PORT']
                self.email_smtp = environ['EMAIL_SMTP_SERVER']
                self.email_sender = environ['EMAIL_SENDER']
                self.email_receiver = environ['EMAIL_RECEIVER']
                self.email_sender_pw = environ['EMAIL_SENDER_PASSWORD']
            except KeyError as e:
                '''
                Raise a system exit on error reading the environment variables
                '''
                raise SystemExit(e)

        '''
        Download a copy of the URL and raise a system exit
        on connection error
        '''
        try:
            page = requests.get(self.url)
        except requests.exceptions.RequestException as e:
            raise SystemExit(e)

        self.page_content = page.content

        '''
        Generate a SHA 256-bit checksum of the downloaded contents
        '''
        self.page_content_hash = sha256(self.page_content).hexdigest()

        '''
        Set the file name for the SHA 256-bit checksum file
        '''
        self.dir_name_url_domain = self.url.split('//')[-1].split('/')[0]
        self.file_name_page_content_hash = self.dir_name_url_domain \
                                            + '-sha256hash'

    def match(self):
        '''
        Call the __write method to write the SHA 256-bit checksum
        and a copy of the downloaded content if a previous checksum
        is not found. Otherwise read and match the previous checksum
        against the current one and on mismatch call the __write method
        to overwrite the checksum file and write a copy of the downloaded
        content. Call the __email method to send an email notification
        if it is requested
        '''

        try:
            mkdir(path[0] + '/' + self.dir_name_url_domain)
        except FileExistsError:
            pass
        finally:
            self.working_dir = path[0] + '/' + self.dir_name_url_domain + '/'

        try:
            with open(self.working_dir \
            + self.file_name_page_content_hash,'r') as f:
                if f.read() != self.page_content_hash:
                    self.__write()
                    if self.email_noti == True:
                        self.__email()
        except FileNotFoundError:
            self.__write()

    def __write(self):
        '''
        Write the SHA 256-bit checksum in a directory named by the
        sanitised URL and the downloaded content in a nested directory
        named by the full date and time now to ease access
        .
        └── url/
            ├── url-sha256hash
            └── YYYY-MM-DD-HH-MM-SS/
                └── index.html
        '''
        try:
            with open(self.working_dir \
            + self.file_name_page_content_hash, 'x') as f:
                f.write(self.page_content_hash)
        except FileExistsError:
            with open(self.working_dir \
            + self.file_name_page_content_hash, 'w') as f:
                f.write(self.page_content_hash)

        time_now = datetime.now().strftime('%Y-%m-%d-%H-%M-%S')
        try:
            working_dir_time_now = self.working_dir + time_now + '/'
            mkdir(working_dir_time_now)
        except FileExistsError:
            pass

        # Write the raw bytes so that a page not encoded in UTF-8
        # is saved without raising a decoding error
        with open(working_dir_time_now + 'index.html', 'xb') as f:
            f.write(self.page_content)

    def __email(self):
        '''
        Send an email notification
        '''
        ssl_context = ssl.create_default_context()
        with smtplib.SMTP_SSL(self.email_smtp, self.email_sslp,
        context=ssl_context) as server:
            server.login(self.email_sender, self.email_sender_pw)
            server.sendmail(self.email_sender, self.email_receiver,
                            self.email_msg)

if __name__ == '__main__':
    '''
    Create the object and subsequently call its corresponding
    method for matching
    '''
    web = monitor(argv[1:])
    web.match()

Press Control + O to save the script. Enter a name for the new file e.g. web-monitor.py, with a full path if it is to be saved in a location other than the current working directory e.g. /app/web-monitor.py. Be sure to add .py at the end of the file name to mark the file as a Python script. Press Enter to save.

Control + x to exit nano.

The key elements will be examined in the discussion that follows.

Import

The import statement that tells Python to import objects and methods from other standard or installed libraries goes before all else.

import requests
...
from sys import path, argv
...

Class

Following the object-oriented programming paradigm, this script is constructed around a main class aptly named monitor, within which variables and methods are housed.

class monitor:
    ...

Declaring Variables

The error messages are defined at the top of the monitor class. They are referenced throughout the class, so declaring them once and right at the top makes modification easier should they ever need to change.

class monitor:
    ERR_USAGE = '''\
    Usage: web-monitor.py [-e] -u <url>
    Option:
      -h
        Display usage
      -e, --email
        Send email notification for changes
      -u, --url
        The URL of interest
    '''
    ...

Downloading and Matching

The methods are defined in the rest of the monitor class. First among them is the special __init__ method. It is not compulsory for a Python class, but when defined it is called each time an instance of the class is created. For the monitor class this method is where the URL of interest, and whether or not email notification is needed, are read from the arguments supplied to the script. If email notification is requested, the script will read the corresponding environment variables that supply the port number for the Simple Mail Transfer Protocol over SSL (SMTPS), the address of the SMTP server, the sender and receiver email addresses, and the sender email password, all of which are needed for the email notification. If Gmail is used as the sender email address, note that by default, on security grounds, it does not allow access from scripts or applications that do not meet its security standards. For this script to sign into the sender Gmail account and send an email to the designated recipient, the less secure app access needs to be enabled. Follow Google's instructions to turn it on before proceeding.
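The argument handling described above can be sketched in isolation. This is a minimal illustration rather than the script itself: the function name parse_args and the example URL are hypothetical, and the -h option is left out for brevity.

```python
from getopt import getopt, GetoptError

def parse_args(argv):
    """Return (url, email_requested) parsed from a list of arguments."""
    try:
        # 'heu:' means -h and -e take no value while -u requires one;
        # the long forms are --email and --url=<value>
        opts, _ = getopt(argv, 'heu:', ['email', 'url='])
    except GetoptError as e:
        raise SystemExit(f'Invalid argument: {e}')

    url, email_requested = None, False
    for opt, arg in opts:
        if opt in ('-e', '--email'):
            email_requested = True
        elif opt in ('-u', '--url'):
            url = arg
    return url, email_requested

print(parse_args(['-e', '-u', 'https://example.com/']))
# ('https://example.com/', True)
```

Passing argv[1:] rather than argv, as the script does, keeps the script name itself out of the parsing.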

The method will then proceed to download a copy of the web page at the URL and generate a SHA-256 checksum of the downloaded content. Comments, in the form of a multi-line string starting and ending with triple quotes (''') or a line leading with a hash sign (#), are left along the way for easier navigation.

class monitor:
    ...

    def __init__(self, argv):
        ...
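To see why a checksum is enough to detect a change, consider the short sketch below. It hashes two made-up byte strings that differ by a single character; the helper name checksum and the sample content are assumptions for illustration only.

```python
from hashlib import sha256

def checksum(content: bytes) -> str:
    """Return the SHA-256 checksum of content as 64 hexadecimal characters."""
    return sha256(content).hexdigest()

old = checksum(b'<html><body>Price: 10</body></html>')
new = checksum(b'<html><body>Price: 11</body></html>')

print(len(old))    # 64
print(old == new)  # False
```

Because the checksum has a fixed length regardless of the page size, only a short string needs to be stored and compared between runs.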

Following the __init__ method is a method named match that does precisely what its name suggests i.e. matching the generated SHA-256 checksum against the one from a previous run. There are three possible outcomes: the checksums match, they do not match, or there is no previous checksum to match against. It is the job of this match method to determine which outcome the script finds itself in and to call the other two methods that handle the output and the notification as appropriate.

class monitor:
    ...

    def __init__(self, argv):
        ...

    def match(self):
        ...
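The three outcomes can be sketched with a small standalone function. This is a simplified illustration under assumptions, not the match method itself: the name classify, the outcome labels and the temporary directory are hypothetical, and the saving of the page content and the email notification are left out.

```python
from hashlib import sha256
from pathlib import Path
from tempfile import TemporaryDirectory

def classify(checksum_file: Path, content: bytes) -> str:
    """Return 'first-run', 'unchanged' or 'changed' and update the file."""
    current = sha256(content).hexdigest()
    try:
        previous = checksum_file.read_text()
    except FileNotFoundError:
        # No previous checksum: store the current one
        checksum_file.write_text(current)
        return 'first-run'
    if previous == current:
        return 'unchanged'
    # Mismatch: overwrite the stored checksum with the current one
    checksum_file.write_text(current)
    return 'changed'

with TemporaryDirectory() as tmp:
    f = Path(tmp) / 'example.com-sha256hash'
    results = [
        classify(f, b'version 1'),
        classify(f, b'version 1'),
        classify(f, b'version 2'),
    ]
print(results)  # ['first-run', 'unchanged', 'changed']
```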

The two remaining methods, namely __write and __email, manage the output and the notification. These are private methods that are not designed to be called by anything outside of the class itself, for reasons that should be obvious: without the match method determining which outcome the script finds itself in, it makes no sense to dump any output or trigger any email notification. To keep these two methods private to the class, they are named with two leading underscores, which triggers name mangling. Python renames them internally to _monitor__write and _monitor__email, so they cannot be called by their short names from outside the class.

class monitor:
    ...

    def __init__(self, argv):
        ...

    def match(self):
        ...

    def __write(self):
        ...

    def __email(self):
        ...
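The effect of the two leading underscores can be observed with a toy class. The class Demo and its methods are hypothetical and serve only to show how Python treats such names.

```python
class Demo:
    def run(self):
        # Inside the class the short name resolves as expected
        return self.__helper()

    def __helper(self):
        return 'called'

d = Demo()
print(d.run())                 # called
print(hasattr(d, '__helper'))  # False
print(d._Demo__helper())       # called
```

As the last line shows, the method remains reachable under its mangled name, so the convention signals intent rather than enforcing strict access control.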

Right after the monitor class is defined comes the instruction to create an instance of it with the arguments supplied when this script is called, and to call its match method, if and only if the script is called directly instead of being imported as a library. Otherwise it is up to the importing script to create the monitor object as appropriate.

if __name__ == '__main__':
    ...

Testing and Troubleshooting

With everything in place, the script can now be tested by simply calling Python to interpret and execute it with the URL of interest e.g. https://lookingglass.pccwglobal.com/.

$ python3 /app/web-monitor.py -e -u https://lookingglass.pccwglobal.com/

No output message will be printed if all goes well. In the same directory as the script, there will be a new directory named after the domain of the URL of interest, with the checksum stored in a file directly under it and the downloaded web page in a nested directory named by the full date and time of the download to ease access.

.
└── url/
    ├── url-sha256hash
    └── YYYY-MM-DD-HH-MM-SS/
        └── index.html
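The directory and checksum file names above are derived from the URL by string splitting. The sketch below reproduces that expression next to an alternative based on urlparse from the standard library. The alternative is an assumption offered for comparison, not part of the script, and both function names are hypothetical.

```python
from urllib.parse import urlparse

def domain_by_split(url: str) -> str:
    """Take what follows the last '//' and keep the part before the next '/'."""
    return url.split('//')[-1].split('/')[0]

def domain_by_urlparse(url: str) -> str:
    """Alternative for comparison: let the standard library parse the URL."""
    return urlparse(url).netloc

url = 'https://lookingglass.pccwglobal.com/'
print(domain_by_split(url))                  # lookingglass.pccwglobal.com
print(domain_by_split(url) + '-sha256hash')  # lookingglass.pccwglobal.com-sha256hash
print(domain_by_urlparse(url))               # lookingglass.pccwglobal.com
```

The two approaches agree on typical URLs. They differ when the path itself contains a double slash, e.g. https://example.com/a//b, for which the splitting returns b while urlparse still returns example.com.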

Simulating a change in the web page content is cumbersome, and next to impossible if the web site is not under your control. It is easier to edit the SHA-256 checksum file to trick the script into believing the web page content has changed the next time it is called.

Overwrite the content of the SHA-256 checksum file with an empty line.

echo > *full-path-to-sha-256-checksum*

$ echo > /app/lookingglass.pccwglobal.com/lookingglass.pccwglobal.com-sha256hash

Instructing Python to execute the script on the same URL again will print no output if all goes well. In the directory named after the domain there will be another timestamped copy of the downloaded web page, and an email will be sent to the designated recipient noting that the page has been updated and that the updates are stored in separate files.
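The shape of the notification message can be sketched as follows. An email message separates its headers from its body with a blank line, which is why the subject line is followed by an empty line. The function name build_message and the example URL are hypothetical.

```python
from textwrap import dedent

def build_message(url: str) -> str:
    """Return a minimal message: a Subject header, a blank line, then the body."""
    msg = f'''\
        Subject: {url} has been updated

        Updates are stored in separate files'''
    # dedent strips the common leading whitespace added by the indentation
    return dedent(msg)

message = build_message('https://example.com/')
headers, body = message.split('\n\n', 1)
print(headers)  # Subject: https://example.com/ has been updated
print(body)     # Updates are stored in separate files
```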

Below are some of the common errors.

python3: command not found

Be sure that the latest version of Python is properly installed. Please refer to the previous discussion.

ModuleNotFoundError: No module named 'requests'

Be sure that the Python library, requests, is properly installed. Please refer to the previous discussion.

Scheduling a Python Script

With the script working as expected, it may now be scheduled to run on a regular basis to monitor the web page of interest for change. Cron is the task scheduler for Linux. Tasks can be scheduled by adding them to the crontab, which is short for cron table.

To edit the crontab.

$ crontab -e

If asked which text editor should be used to edit the crontab, select nano or any one of preference.

Add a new schedule at the end of the crontab.

*/15 * * * * python3 *full-path-to-script* -e -u *url-of-interest* >> *full-path-to-script.log* 2>&1

The first part of the line i.e. */15 * * * * informs cron when and at what interval this task should be executed, which in this case is once every 15 minutes. The remaining part of the line tells cron which script to execute when the time comes. It also appends the standard output and standard error of the script to a log file of your choice.
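To make the step syntax concrete, the sketch below lists the minutes of the hour matched by a */15 minute field, assuming the standard cron interpretation of a step over the range 0 to 59. The helper step_minutes is hypothetical and is not part of cron or of the script.

```python
def step_minutes(step: int) -> list:
    """Minutes of the hour matched by a '*/step' minute field."""
    return [minute for minute in range(60) if minute % step == 0]

print(step_minutes(15))  # [0, 15, 30, 45]
```

The task therefore runs on the quarter hours of the clock, rather than 15 minutes after the crontab is saved.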

When done, control + x to exit, and y and enter to save.

Thoughts

This is just one example of using a simple Python script and a system scheduler to automate tasks. There are many more amazing things one can do with Python. Explore and be amazed.