src.verify_links module¶
Checks HTML links in the output files produced by a run of the MDTF package and verifies that all linked files exist.
This is called by default at the end of each run, to determine if any PODs have failed without raising errors.
Based on test_website by Dani Coleman, bundy@ucar.edu.
- class src.verify_links.Link(origin, target)¶
Bases: tuple
Class representing individual links, to simplify bookkeeping.
- Attributes:
origin (str) – URL of the document containing the link.
target (str) – URL referred to by the link.
- origin¶
- target¶
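Because Link derives from tuple and exposes origin and target as attributes, it can be used like a named tuple. The following is a minimal sketch, assuming a collections.namedtuple definition; the actual definition in the module may differ, and the example URLs are made up.

```python
import collections

# Assumed stand-in for src.verify_links.Link; the real definition may differ.
Link = collections.namedtuple("Link", ["origin", "target"])

# Hypothetical URLs, for illustration only.
link = Link(origin="file:///out/index.html", target="file:///out/pod/plot.png")

# Fields are available by name, by position, or by tuple unpacking.
origin, target = link
```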
- class src.verify_links.LinkParser(*, convert_charrefs=True)[source]¶
Bases: HTMLParser
Custom subclass of HTMLParser which constructs an iterable over each <a> tag. Adapted from https://stackoverflow.com/a/41663924.
- handle_starttag(tag, attrs)[source]¶
Custom code for this subclass that extracts contents of <a> tags.
- CDATA_CONTENT_ELEMENTS = ('script', 'style')¶
- __init__(*, convert_charrefs=True)¶
Initialize and reset this instance.
If convert_charrefs is True (the default), all character references are automatically converted to the corresponding Unicode characters.
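To illustrate the general technique, here is a minimal sketch of an HTMLParser subclass that records the href of every <a> tag. The class name AnchorParser and its links attribute are hypothetical; this is not the module's LinkParser, which may store its results differently.

```python
from html.parser import HTMLParser


class AnchorParser(HTMLParser):
    """Hypothetical parser that collects href values of <a> tags."""

    def __init__(self, *, convert_charrefs=True):
        super().__init__(convert_charrefs=convert_charrefs)
        self.links = []  # href values, in document order

    def handle_starttag(self, tag, attrs):
        # HTMLParser passes tag and attribute names in lower case.
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value is not None:
                self.links.append(value)


parser = AnchorParser()
parser.feed('<p><a href="a.html">A</a> <a name="x">no href</a> <img src="b.png"></p>')
found = parser.links
```

Anchors without an href attribute and non-anchor tags are skipped, so found holds only "a.html" here.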
- class src.verify_links.LinkVerifier(root, rel_path_root=None, verbose=False, log=None)[source]¶
Bases: object
- __init__(root, rel_path_root=None, verbose=False, log=None)[source]¶
Initialize search for broken links.
- Parameters:
root (str) – Either a URL or path on the local filesystem. Location of the top-level html file to begin the search from.
rel_path_root (str, optional) – Either a URL or path on the local filesystem. If given, paths to missing files are reported relative to this path. Defaults to root (if root is a directory) or the directory containing root (if root is a file).
verbose (bool, default False) – Set to True to print each file examined.
- static gen_links(f, parser)[source]¶
Generator which parses the contents of an HTML file f and yields targets of all the links it contains. Adapted from https://stackoverflow.com/a/41663924.
- Parameters:
f – urllib.response object of the form returned by urlopen(): either HTTPResponse for http or https, or addinfourl for files.
parser – instance of LinkParser.
- Yields:
Contents of the href attribute of each <a> tag of f, as extracted by parser.
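The generator pattern described here can be sketched as follows. This is an illustration under assumptions, not the module's code: gen_links_sketch and _AnchorParser are hypothetical names, an io.BytesIO object stands in for the response returned by urlopen(), and UTF-8 encoding is assumed.

```python
import io
from html.parser import HTMLParser


class _AnchorParser(HTMLParser):
    """Hypothetical stand-in for LinkParser: queues href values of <a> tags."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.links.extend(v for k, v in attrs if k == "href" and v)


def gen_links_sketch(f, parser, encoding="utf-8"):
    """Feed f to parser line by line, yielding link targets as they appear."""
    for line in f:  # f yields bytes, like an HTTP or file response
        parser.feed(line.decode(encoding))
        while parser.links:  # drain whatever the parser found so far
            yield parser.links.pop(0)


html = b'<a href="one.html">1</a>\n<a href="two/plot.png">2</a>\n'
targets = list(gen_links_sketch(io.BytesIO(html), _AnchorParser()))
```

Feeding the parser incrementally means links are yielded as soon as they are seen, without loading the whole document first.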
- breadth_first(root_url)[source]¶
Breadth-first search of all files linked from an initial root_url.
The search handles cycles correctly (e.g., A.html links to B.html and B.html links to A.html) and only examines files in subdirectories of root_url’s directory, so links to external sites are ignored rather than followed across the whole internet.
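The traversal strategy can be sketched on an in-memory link graph. This is a simplified illustration, not the method's implementation: breadth_first_sketch, the links_of dictionary and the in_scope predicate are hypothetical, and the sketch only returns the visit order rather than collecting missing files.

```python
from collections import deque


def breadth_first_sketch(root, links_of, in_scope):
    """Return pages reachable from root in breadth-first order.

    links_of: dict mapping a page to the list of targets it links to.
    in_scope: predicate deciding whether a target should be followed.
    """
    visited = {root}  # the visited set is what makes cycles harmless
    queue = deque([root])
    order = []
    while queue:
        page = queue.popleft()
        order.append(page)
        for target in links_of.get(page, []):
            if target in visited or not in_scope(target):
                continue
            visited.add(target)
            queue.append(target)
    return order


# Hypothetical site: A and B link to each other; index also links off-site.
graph = {
    "out/index.html": ["out/A.html", "https://example.com/"],
    "out/A.html": ["out/B.html"],
    "out/B.html": ["out/A.html"],
}
order = breadth_first_sketch("out/index.html", graph, lambda u: u.startswith("out/"))
```

Each page is visited once despite the A/B cycle, and the external link is never followed because it fails the scope check.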
- group_relative_links(missing)[source]¶
Format paths to missing linked files as relative paths, grouped by POD.
- Parameters:
missing (list) – List of Link objects found by breadth_first(), whose targets correspond to missing files.
- Returns:
Dict, with keys given by the short names of PODs with missing files and values given by a list of the files that POD is missing. Missing files are listed by their path relative to the POD’s output directory.
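The shape of the returned dict can be illustrated with a small sketch. It assumes that each POD's output directory is the first path component beneath the root, which may not match the framework's actual layout; group_by_pod_sketch, the namedtuple Link and the example paths are all hypothetical.

```python
import collections
import posixpath

# Assumed stand-in for src.verify_links.Link.
Link = collections.namedtuple("Link", ["origin", "target"])


def group_by_pod_sketch(missing, rel_path_root):
    """Group missing targets by POD, with paths relative to each POD directory."""
    grouped = collections.defaultdict(list)
    for link in missing:
        rel = posixpath.relpath(link.target, start=rel_path_root)
        # Assumption: the first path component under the root names the POD.
        pod, _, rest = rel.partition("/")
        grouped[pod].append(rest)
    return dict(grouped)


missing = [
    Link("/out/pod_a/pod_a.html", "/out/pod_a/model/fig1.png"),
    Link("/out/pod_a/pod_a.html", "/out/pod_a/obs/fig2.png"),
    Link("/out/pod_b/pod_b.html", "/out/pod_b/fig.png"),
]
grouped = group_by_pod_sketch(missing, "/out")
```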
- verify_pod_links(pod_name)[source]¶
Perform search for missing linked files that were supposed to have been output by pod_name.
- Parameters:
pod_name – Name of the POD to check for missing files.
- Returns:
A list of the files that POD is missing. Missing files are listed by their path relative to the POD’s output directory.
- verify_all_links()[source]¶
Perform search for any missing linked files from a run of the MDTF framework and collect them by POD.
- Returns:
Dict, with keys given by the short names of PODs with missing files and values given by a list of the files that POD is missing. Missing files are listed by their path relative to the POD’s output directory.