kurtcms.org


Web Monitoring: Monitor a JavaScript Rendered Web Page for Updates

Posted on May 15, 2021 — 13 Minutes Read

Older, static web pages typically rely on the server to render the page before it is delivered to the client for display. Dynamic web pages, which are increasingly popular, rely instead on rendering at the client side, for which the de facto standard is JavaScript operating on the Document Object Model (DOM) to populate the page with the relevant content once it has been delivered. This allows part of the page to be repopulated and re-rendered on condition or on demand, a technique known as Asynchronous JavaScript and XML (AJAX), which improves the overall user experience. It does however make web monitoring more difficult, since changes or updates to a page may reside solely in the JavaScript code and will go undetected if only the HTML source is monitored. That said, enlarging the scope of a web monitoring script to cover the JavaScript files referenced on the page is merely a matter of a few lines of code.
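
To make the point concrete, the sketch below hashes an HTML source on its own and then together with the script it references. The page content and the checksum helper are made up for illustration; only hashlib.sha256 comes from the standard library.

```python
from hashlib import sha256

def checksum(*parts):
    """Return the SHA-256 hex digest of the given strings joined together."""
    return sha256(''.join(parts).encode()).hexdigest()

# Hypothetical page: the HTML stays the same between two visits ...
html = '<html><body><script src="app.js"></script></body></html>'
# ... while the script it loads is updated in between.
js_before = 'document.body.textContent = "Version 1";'
js_after = 'document.body.textContent = "Version 2";'

# Monitoring the HTML source alone sees no difference.
html_only_changed = checksum(html) != checksum(html)

# Monitoring the HTML together with its JavaScript does.
html_and_js_changed = checksum(html, js_before) != checksum(html, js_after)

print(html_only_changed)    # False
print(html_and_js_changed)  # True
```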

The rest of the code is containerised with Docker Compose for a modular and cloud-native deployment that fits in any microservice architecture, and is shared on GitHub for reference and further development.

Writing a Python Script

The task at hand is to write a script that instructs the server to read a web page and compare it to a previous read to see if there has been any update. There are many tools and scripting languages for this purpose, and Python is one of them.

Python is known for its object-oriented design and for its readability, using indentation rather than brackets of various kinds, as in other programming languages, to delimit code blocks. Most importantly, Python is renowned for its huge number of readily available libraries, i.e. blocks of pre-written code that may be called upon to perform tasks of all kinds. Like a Swiss army knife, or a multi-purpose tool with batteries included, Python lets an informed developer perform wonders right out of the box. For the task of monitoring a web page with JavaScript-rendered content, the library to use is Beautiful Soup, a neatly built library for parsing and scraping web pages, together with Requests for downloading them.

Before writing the script, Python 3 and its package manager, pip, need to be properly installed.

$ apt install python3 python3-pip

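Once these are set, the relevant libraries may be installed with the package manager as well. Besides Requests and Beautiful Soup, the script imports python-dotenv to read the email settings from a .env file.

$ pip3 install requests bs4 python-dotenv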

With everything in place, use the nano text editor to open a new file and start writing the script.

$ nano

Thereafter goes the script.

import requests
import smtplib, ssl
from sys import path, argv
from getopt import getopt, GetoptError
from textwrap import dedent
from hashlib import sha256
from os import mkdir, environ
from dotenv import load_dotenv, find_dotenv
from datetime import datetime
from bs4 import BeautifulSoup

class monitor:
    ERR_USAGE = '''\
    Usage: web-js-monitor.py [-e] -u <url>
    Option:
      -h
        Display usage
      -e, --email
        Send email notification for changes
      -u, --url
        The URL of interest
    '''
    ERR_USAGE = dedent(ERR_USAGE)
    '''
    Usage example of this script
    '''

    ERR_INVALID_ENV = 'Problem locating the .env file'
    '''
    Error message to display when python-dotenv fails to read
    environment variables
    '''

    def __init__(self, argv):
        '''
        Read the argument(s) supplied and set variables depending
        on their values
        '''
        try:
            opts, args = getopt(argv, 'heu:',['email', 'url='])
        except GetoptError:
            '''
            Raise a system exit with the script usage
            on error reading the arguments
            '''
            raise SystemExit(self.ERR_USAGE)

        self.email_noti = False
        for opt, arg in opts:
            if opt == '-h':
                raise SystemExit(self.ERR_USAGE)
            elif opt in ('-e', '--email'):
                self.email_noti = True
            elif opt in ('-u', '--url'):
                self.url = arg

        if not hasattr(self, 'url'):
            '''
            Raise a system exit with the script usage in the
            absence of an URL supplied by the -u or --url argument
            '''
            raise SystemExit(self.ERR_USAGE)

        if self.email_noti == True:
            '''
            Construct the email message and read environment variables
            with python-dotenv if email notification is requested
            '''
            self.email_msg = f'''\
                Subject: {self.url} has been updated\n
                Updates are stored in separate files'''
            self.email_msg = dedent(self.email_msg)

            if load_dotenv(find_dotenv()) == False:
                '''
                Raise a system exit on error reading the
                environment variables
                '''
                raise SystemExit(self.ERR_INVALID_ENV)

            try:
                '''
                Read and set the environment variables needed for
                email notification
                '''
                self.email_sslp = environ['EMAIL_SSL_PORT']
                self.email_smtp = environ['EMAIL_SMTP_SERVER']
                self.email_sender = environ['EMAIL_SENDER']
                self.email_receiver = environ['EMAIL_RECEIVER']
                self.email_sender_pw = environ['EMAIL_SENDER_PASSWORD']
            except KeyError as e:
                '''
                Raise a system exit on error reading the environment variables
                '''
                raise SystemExit(e)

        '''
        Download a copy of the URL and raise a system exit
        on connection error
        '''
        try:
            page = requests.get(self.url)
        except requests.exceptions.RequestException as e:
            raise SystemExit(e)

        self.url_soup = BeautifulSoup(page.content, 'html.parser')
        self.url_sources = self.url_soup.prettify()

        '''
        Download a copy of the external JavaScript files if any
        that are referred to in the page
        '''
        self.page_dict = {}
        self.page_dict['index.html'] = self.url_sources
        for js in self.url_soup.find_all('script'):
            js_src_url = js.get('src')
            if not js_src_url == None:
                '''
                Append the FQDN to the URL of the JavaScript file if
                it is missing and download a copy of it. Raise a system
                exit on connection error.
                '''
                try:
                    if '//' in js_src_url:
                        js_file = requests.get(js_src_url)
                    else:
                        js_file = requests.get(self.url + js_src_url
                            if self.url[-1] == '/'
                            else self.url + '/' + js_src_url)
                except requests.exceptions.RequestException as e:
                    raise SystemExit(e)

                js_key = js_src_url.split('/')[-1]
                js_soup = BeautifulSoup(js_file.content, 'html.parser')
                self.page_dict[js_key] = js_soup.prettify()

        '''
        Generate a SHA 256-bit checksum of the downloaded contents
        '''
        self.page_dict_value = ''
        for each in self.page_dict:
            self.page_dict_value += self.page_dict[each]

        self.page_dict_value_hash = sha256(self.page_dict_value.encode()
                                        ).hexdigest()

        '''
        Set the file name for the SHA 256-bit checksum file
        '''
        self.dir_name_url_domain = self.url.split('//')[-1].split('/')[0]
        self.file_name_url_sources_hash = self.dir_name_url_domain \
                                            + '-sha256hash'

    def match(self):
        '''
        Call the __write method to write the SHA 256-bit checksum
        and a copy of the downloaded contents if a previous checksum
        is not found. Otherwise read and match the previous checksum
        against the current one and on mismatch call the __write method
        to overwrite the checksum file and write a copy of the downloaded
        contents. Call the __email method to send an email notification
        if it is requested
        '''

        try:
            mkdir(path[0] + '/' + self.dir_name_url_domain)
        except FileExistsError:
            pass
        finally:
            self.working_dir = path[0] + '/' + self.dir_name_url_domain + '/'

        try:
            with open(self.working_dir \
            + self.file_name_url_sources_hash,'r') as f:
                if f.read() != self.page_dict_value_hash:
                    self.__write()
                    if self.email_noti == True:
                        self.__email()
        except FileNotFoundError:
            self.__write()

    def __write(self):
        '''
        Write the SHA 256-bit checksum in a directory named by the
        sanitised URL and the rest of the downloaded contents in a
        nested directory named by the full date and time now to
        ease access
        .
        └── url/
            ├── url-sha256hash
            └── YYYY-MM-DD-HH-MM-SS/
                ├── index.html
                ├── javaScript1.js
                ├── javaScript2.js
                └── javaScript3.js
        '''
        try:
            with open(self.working_dir \
            + self.file_name_url_sources_hash, 'x') as f:
                f.write(self.page_dict_value_hash)
        except FileExistsError:
            with open(self.working_dir \
            + self.file_name_url_sources_hash, 'w') as f:
                f.write(self.page_dict_value_hash)

        time_now = datetime.now().strftime('%Y-%m-%d-%H-%M-%S')
        try:
            working_dir_time_now = self.working_dir + time_now + '/'
            mkdir(working_dir_time_now)
        except FileExistsError:
            pass

        for each in self.page_dict:
            with open(working_dir_time_now + each, 'x') as f:
                f.write(self.page_dict[each])

    def __email(self):
        '''
        Send an email notification
        '''
        ssl_context = ssl.create_default_context()
        with smtplib.SMTP_SSL(self.email_smtp, self.email_sslp,
        context=ssl_context) as server:
            server.login(self.email_sender, self.email_sender_pw)
            server.sendmail(self.email_sender, self.email_receiver,
                            self.email_msg)

if __name__ == '__main__':
    '''
    Create the object and subsequently call its corresponding
    method for matching
    '''
    web = monitor(argv[1:])
    web.match()

Control + o to save the script. Enter a name for the new file e.g. web-js-monitor.py, with a full path if it is to be saved in a location other than the current working directory e.g. /app/web-js-monitor.py. Be sure to add .py at the end of the file name to mark the file as a Python script. Press enter to save.

Control + x to exit nano.

The key elements will be examined in the discussion that follows.

Import

Before anything else come the import statements, which instruct Python to import objects and functions from standard or installed libraries so that the script can build upon them.

import requests
...
from sys import path, argv
...

Class

Following the object-oriented paradigm, this script is built around a single class aptly named monitor, within which the variables and methods are housed.

class monitor:
    ...

Declaring Variables

The top part of the monitor class defines the usage and error messages. These messages are referenced throughout the class. Declaring them once, right at the top of the class, makes modification easier should they need to be changed in the future.

class monitor:
    ERR_USAGE = '''\
    Usage: web-js-monitor.py [-e] -u <url>
    Option:
      -h
        Display usage
      -e, --email
        Send email notification for changes
      -u, --url
        The URL of interest
    '''
    ...
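
The usage message is written indented inside the class and then passed through dedent. A minimal sketch of what that call does, with a made-up class and message standing in for the real ones:

```python
from textwrap import dedent

class Example:
    # Indented with the class body for readability; the backslash after the
    # opening quotes avoids a leading blank line.
    USAGE = '''\
    Usage: example.py -u <url>
      -u, --url
        The URL of interest
    '''
    # dedent removes the whitespace common to every line and keeps the
    # relative indentation of the option descriptions.
    USAGE = dedent(USAGE)

print(Example.USAGE)
```

Without dedent, every line of the message would be printed with the four spaces of class-level indentation in front of it.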

Downloading and Matching

The rest of the monitor class defines the available methods. First among them is the reserved __init__ method, the initialiser that Python calls automatically whenever an instance of the class is created. For the monitor class this method is where the URL of interest, and whether or not email notification is needed, are read from the arguments supplied to the script. If email notification is requested, the script uses python-dotenv to read from a .env file the environment variables needed for the notification: the Simple Mail Transfer Protocol over SSL (SMTPS) port number (EMAIL_SSL_PORT), the address of the sender's SMTP server (EMAIL_SMTP_SERVER), the sender and receiver email addresses (EMAIL_SENDER and EMAIL_RECEIVER), and the sender's email password (EMAIL_SENDER_PASSWORD).

If Gmail is used as the sender email address, note that by default, for security reasons, it does not allow access from scripts or applications that do not meet Google's security standards. For this script to sign in to the sender Gmail account and send an email to the designated recipient, less secure app access needs to be enabled in the account's security settings before proceeding. On accounts where Google no longer offers this setting, an app password created under 2-Step Verification is the usual alternative.
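
As a rough sketch of how these settings could be collected and validated in one place, the helper below reads the five variables named in the script. The function read_email_settings, the conversion of the port to an integer, and the example values are assumptions for illustration and not part of the original script.

```python
from os import environ

REQUIRED = ('EMAIL_SSL_PORT', 'EMAIL_SMTP_SERVER', 'EMAIL_SENDER',
            'EMAIL_RECEIVER', 'EMAIL_SENDER_PASSWORD')

def read_email_settings(env=environ):
    """Collect the email settings, exiting with a message if any is missing."""
    missing = [name for name in REQUIRED if name not in env]
    if missing:
        raise SystemExit('Missing environment variable(s): '
                         + ', '.join(missing))
    settings = {name: env[name] for name in REQUIRED}
    # Environment variables are always strings; store the port as an integer.
    settings['EMAIL_SSL_PORT'] = int(settings['EMAIL_SSL_PORT'])
    return settings

# A plain dictionary with placeholder values stands in for the environment.
example_env = {
    'EMAIL_SSL_PORT': '465',
    'EMAIL_SMTP_SERVER': 'smtp.example.com',
    'EMAIL_SENDER': 'sender@example.com',
    'EMAIL_RECEIVER': 'receiver@example.com',
    'EMAIL_SENDER_PASSWORD': 'not-a-real-password',
}
settings = read_email_settings(example_env)
print(settings['EMAIL_SSL_PORT'])  # 465
```

Reporting every missing variable at once, rather than the first one encountered, saves a few rounds of trial and error when the .env file is first written.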

The method then proceeds to download a copy of the URL and of any external JavaScript files referred to in the page, before generating a SHA-256 checksum of the downloaded contents. Comments, in the form of multi-line strings enclosed in triple quotes (''') or of lines leading with a hash sign (#), are left along the way for easier navigation.

class monitor:
    ...

    def __init__(self, argv):
        ...
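
The script locates the external JavaScript files with Beautiful Soup and builds their addresses by joining strings. The sketch below is an alternative under stated assumptions rather than the script's own method: it swaps in the standard library's html.parser so that it runs without third-party packages, and it resolves addresses with urllib.parse.urljoin, which also copes with root-relative and protocol-relative src values. The class ScriptSrcParser, the function script_urls and the example.com addresses are hypothetical.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

class ScriptSrcParser(HTMLParser):
    """Collect the src attribute of every <script> tag."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        if tag == 'script':
            src = dict(attrs).get('src')
            if src:  # inline scripts have no src and are skipped
                self.sources.append(src)

def script_urls(page_url, html):
    """Return the absolute URL of each external script in the page."""
    parser = ScriptSrcParser()
    parser.feed(html)
    return [urljoin(page_url, src) for src in parser.sources]

html = '''
<script src="https://cdn.example.com/lib.js"></script>
<script src="//cdn.example.com/other.js"></script>
<script src="/static/app.js"></script>
<script src="js/page.js"></script>
<script>console.log("inline");</script>
'''
for url in script_urls('https://example.com/tools/', html):
    print(url)
# https://cdn.example.com/lib.js
# https://cdn.example.com/other.js
# https://example.com/static/app.js
# https://example.com/tools/js/page.js
```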

Following the __init__ method is a method named match that does precisely what its name says, i.e. matching the generated SHA-256 checksum against the one from a previous run. There are three mutually exclusive and exhaustive outcomes: the checksums match, they do not match, or there is no previous checksum to match against. It is the job of the match method to determine which outcome applies and to call the other two methods, which handle the output and the notification, as appropriate.

class monitor:
    ...

    def __init__(self, argv):
        ...

    def match(self):
        ...
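
The three outcomes can be sketched in isolation as follows. This is a simplified stand-in for the match method, not a copy of it: the function compare, the labels it returns and the temporary file are assumptions made for the example.

```python
from hashlib import sha256
from pathlib import Path
from tempfile import TemporaryDirectory

def compare(checksum_file, content):
    """Return 'new', 'changed' or 'unchanged' for the given content."""
    current = sha256(content.encode()).hexdigest()
    try:
        previous = Path(checksum_file).read_text()
    except FileNotFoundError:
        outcome = 'new'  # no previous checksum to match against
    else:
        outcome = 'unchanged' if previous == current else 'changed'
    if outcome != 'unchanged':
        # Store the current checksum for the next run.
        Path(checksum_file).write_text(current)
    return outcome

with TemporaryDirectory() as tmp:
    checksum_file = Path(tmp) / 'example-sha256hash'
    outcomes = [
        compare(checksum_file, 'version 1'),  # first run
        compare(checksum_file, 'version 1'),  # same content
        compare(checksum_file, 'version 2'),  # content updated
    ]

print(outcomes)  # ['new', 'unchanged', 'changed']
```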

The two remaining methods, namely __write and __email, manage the output and the notification. These are private methods that are not meant to be called by anything outside the class itself, for good reason: without the match method first determining which of the outcomes applies, writing any output or triggering any email notification could be misleading. To keep these two methods private to the class, they are named with two leading underscores, which triggers name mangling: Python renames them internally with the class name as a prefix, so that calling them by their plain names from outside the class fails. This is a convention that guards against accidental use rather than a strict access control.

class monitor:
    ...

    def __init__(self, argv):
        ...

    def match(self):
        ...

    def __write(self):
        ...

    def __email(self):
        ...
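
A short demonstration of what the two leading underscores do, using a made-up class named Monitor with a single private method:

```python
class Monitor:
    def match(self):
        # Inside the class the private name resolves as written.
        return self.__write()

    def __write(self):
        return 'written'

web = Monitor()
print(web.match())  # written

try:
    web.__write()
except AttributeError:
    # Outside the class the plain name is not found.
    print('AttributeError')

# The method is stored under a mangled name, which remains reachable.
print(web._Monitor__write())  # written
```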

Right after the monitor class is defined comes the instruction to create an instance of it with the arguments supplied when the script is called, and to call its match method, if and only if the script is run directly instead of being imported as a library. Otherwise it is up to the importing script to create the monitor object as appropriate.

if __name__ == '__main__':
    ...
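
The sketch below isolates how the arguments handed over at this point are read with getopt, using the same option strings as the script. The function parse is a simplified, hypothetical stand-in for the argument handling in __init__; it leaves out the usage message and the -h branch.

```python
from getopt import getopt, GetoptError

def parse(argv):
    """Return (url, email_requested) from a list of command-line arguments."""
    try:
        # 'u:' and 'url=' take a value; 'e' and 'email' are plain switches.
        opts, _ = getopt(argv, 'heu:', ['email', 'url='])
    except GetoptError as e:
        raise SystemExit(e)
    url, email = None, False
    for opt, arg in opts:
        if opt in ('-e', '--email'):
            email = True
        elif opt in ('-u', '--url'):
            url = arg
    return url, email

if __name__ == '__main__':
    # In the script the list comes from argv[1:], since argv[0] is the
    # name of the script itself.
    print(parse(['-e', '-u', 'https://example.com/']))
    # ('https://example.com/', True)
```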

Testing and Troubleshooting

With everything in place, the script can now be tested by simply calling Python to interpret and execute it with the URL of interest e.g. https://lookingglass.pccwglobal.com/.

$ python3 /app/web-js-monitor.py -e -u https://lookingglass.pccwglobal.com/

No output will be printed if all goes well. In the same directory as the script, there will be a new directory named after the domain of the web page of interest, with the checksum stored in a file directly under it and the rest of the downloaded contents in a nested directory named by the full date and time of the download to ease access.

.
└── url/
    ├── url-sha256hash
    └── YYYY-MM-DD-HH-MM-SS/
        ├── index.html
        ├── javaScript1.js
        ├── javaScript2.js
        └── javaScript3.js

Simulating a change in the web page content is cumbersome, and next to impossible if the web site is not under your control. It is easier to edit the SHA-256 checksum file to simulate a change in the web page.

Overwrite the content of the SHA-256 checksum file with an empty line.

echo > *full-path-to-sha-256-checksum*

$ echo > /app/lookingglass.pccwglobal.com/lookingglass.pccwglobal.com-sha256hash

Executing the script again on the same URL will print no output if all goes well. In the directory named after the domain there will be another set of the downloaded contents, and an email will be sent to the designated recipient noting that the page has been updated and that the updates are stored in separate files.

Below are some of the common errors.

python3: command not found

Be sure that Python 3 is properly installed. Please refer to the previous discussion.

ModuleNotFoundError: No module named 'requests'

Be sure that the Python library, requests, is properly installed. Please refer to the previous discussion.

ModuleNotFoundError: No module named 'bs4'

Be sure that the Python library, Beautiful Soup, is properly installed. Please refer to the previous discussion.

Scheduling a Python Script

With the script working as expected, it may now be scheduled to run on a regular basis to monitor the web page of interest for updates. Cron is the standard task scheduler on Linux. Tasks are scheduled by adding them to the crontab, which is short for cron table.

To edit the crontab:

$ crontab -e

If asked which text editor should be used to edit the crontab, select nano or any one of preference.

Add a new schedule at the end of the crontab.

*/15 * * * * python3 *full-path-to-script* -e -u *url-of-interest* >> *full-path-to-script.log* 2>&1

*/15 * * * * python3 /app/web-js-monitor.py -e -u https://lookingglass.pccwglobal.com/ >> /app/web-js-monitor.py.log 2>&1

The first part of the line i.e. */15 * * * * informs cron when and at what interval this task should be executed which in this case is once every 15 minutes. The remaining part of the line tells cron which script to execute when the time comes. It also redirects the standard output and standard error of the script to a log file of your choice.
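
For a feel of what the */15 in the minute field amounts to, the few lines below list the minutes of each hour that it matches. The function matching_minutes is a made-up helper for illustration and not part of cron.

```python
def matching_minutes(step):
    """Minutes of the hour matched by a cron minute field of the form */step."""
    return [minute for minute in range(60) if minute % step == 0]

print(matching_minutes(15))  # [0, 15, 30, 45]
```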

When done, control + x to exit, and y and enter to save.

Thoughts

Python is far more powerful than this discussion could demonstrate. It should come as no surprise that, by one measure, Python has been one of the most popular programming languages for years on end. Experiment and be amazed.