This page shows two scripts that validate local file links in HTML (i.e. check that the files the local links point to exist). One is written in perl, the other in elisp.
The two scripts' algorithms are not artificially made to be the same; each follows the natural style/idiom of its lang. They do the same job for my needs.
In the perl script, for each file, call “process_file”. That function then calls
get_links($file_full_path) to get a list of links, then prints each link that leads to a non-existent file.
The heart of the algorithm is the “get_links” function. It reads the whole file content as one big string, splits the string on the char “<”, then, for each segment of text, tries to find a link using regex.
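To make the split-then-regex idea concrete, here is a minimal sketch in Python (not the perl code from this page). The regex, the name “missing_local_links”, and the rule for deciding which links count as local are all assumptions made for illustration.

```python
import os
import re

# Hypothetical pattern: a segment that starts an <a> tag and carries a
# double-quoted href. The regex used by the perl script is not shown here.
HREF_IN_SEGMENT = re.compile(r'^a\s(?:[^>]*?\s)?href="([^"]*)"', re.IGNORECASE)


def get_links(html_text):
    """Split on "<", then look for an href in each non-nested segment."""
    links = []
    for segment in html_text.split("<"):
        match = HREF_IN_SEGMENT.match(segment)
        if match:
            links.append(match.group(1))
    return links


def missing_local_links(file_path):
    """Return the local links in file_path that point to non-existent files."""
    with open(file_path, encoding="utf-8") as f:
        text = f.read()
    base_dir = os.path.dirname(os.path.abspath(file_path))
    missing = []
    for link in get_links(text):
        # Assumption: anything with a URL scheme, a leading "//", or a bare
        # fragment is not a local file link and is skipped.
        if re.match(r"[a-zA-Z][a-zA-Z0-9+.-]*:", link) or link.startswith(("#", "//")):
            continue
        target = link.split("#", 1)[0]  # drop any fragment part
        if target and not os.path.exists(os.path.join(base_dir, target)):
            missing.append(link)
    return missing
```

Because each segment contains at most one “<”, the regex never has to deal with nesting. That is the whole point of splitting first.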
In the elisp script, for each file, call “my-process-file”. The file is put into a buffer. The code then uses regex search, cursor movement, etc, to make sure that it finds the links we want to check.
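The search-and-move-cursor style can be approximated outside emacs too. The following is a rough Python sketch of the idea, not a translation of the elisp code: an integer index plays the role of the cursor, and the function name and both regexes are made up for this example.

```python
import re

TAG_OPEN = re.compile(r"<a\s", re.IGNORECASE)
# One attribute: a name, optionally followed by "=" and a quoted or bare value.
ATTRIBUTE = re.compile(r"""\s*([^\s=<>"'/]+)(?:\s*=\s*("[^"]*"|'[^']*'|[^\s>]*))?""")


def find_links_with_cursor(text):
    """Collect href values by searching forward and stepping over attributes."""
    links = []
    cursor = 0
    while True:
        found = TAG_OPEN.search(text, cursor)  # search forward from the cursor
        if found is None:
            break
        cursor = found.end()
        # Step over one attribute at a time. Quoted values are consumed whole,
        # so a "<" or ">" inside a quoted value does not end the tag early.
        while True:
            attr = ATTRIBUTE.match(text, cursor)
            if attr is None:
                break
            name, value = attr.group(1).lower(), attr.group(2)
            if name == "href" and value is not None:
                if value[:1] in ('"', "'"):
                    value = value[1:-1]  # strip the surrounding quotes
                links.append(value)
            cursor = attr.end()
    return links
```

The sketch ignores comments, scripts, and unclosed quotes, so it is only meant to show the shape of the approach.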
Note that the HTML files are assumed to be W3C valid (i.e. no missing closing tags or missing “>”). However, my code is not general enough to cover arbitrary valid HTML. SGML-based HTML is very complex, and isn't just nested tags, but such HTML is basically never used. The perl and elisp code here works correctly (gets all links) for perhaps 99.9% of HTML files out there. (Symbolic links or other alias mechanisms on the file system are not considered.)
Here are some edge cases. These examples show that you cannot simply use regex to search for the pattern
<a href="…" …>. Here's a most basic example:
<p><a href="cat.html" title="x > y">cat</a></p>
Note that the above is actually valid HTML according to W3C's validator. 〔☛ html test page: greater/less sign as title value〕 Also, note that pages passing W3C validator are not necessarily valid by W3C's HTML spec. 〔☛ W3C HTML Validator Invalid〕
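A quick way to see the problem with the basic example is to run a naive pattern over it. This is a small Python sketch; both patterns are guesses at what a “simple regex” would look like, and the second sample, with the two attributes swapped, is a made-up variation rather than a page from my site.

```python
import re

# Two naive patterns that assume ">" can only appear at the end of a tag.
NAIVE_TAG = re.compile(r'<a\s[^>]*>')
NAIVE_HREF = re.compile(r'<a\s[^>]*href="([^"]*)"')

sample = '<p><a href="cat.html" title="x > y">cat</a></p>'
# Hypothetical variation: same tag with the attribute order swapped.
swapped = '<p><a title="x > y" href="cat.html">cat</a></p>'

# The "tag" stops at the ">" inside the title value.
tag_text = NAIVE_TAG.search(sample).group(0)

# With the title first, the naive href pattern cannot get past that ">".
swapped_links = NAIVE_HREF.findall(swapped)

print(repr(tag_text))
print(swapped_links)
```

In the first case the match is cut off in the middle of the title attribute. In the swapped case the link is not found at all.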
One cannot simply use regex to search for pattern
<a href="…" …>, and this is especially so because some HTML pages are programming tutorials containing perl/python/elisp code examples that use regex to parse HTML. So, such an HTML page is code inside code inside code. All the following examples are from actual HTML pages on my site. (The HTML pages on my site are all W3C valid.)
The following shows that patterns such as
src="…" are not necessarily URLs.
You can view these in plain text: elisp_vs_perl_validate_links_edge_cases.txt.
One interesting thing is to compare the approaches in perl and emacs lisp.
For our case, regex is not powerful enough to deal with the problem by itself, due to the nested nature of HTML. This is why, in my perl code, I split the file by “<” into segments first, then use regex on each segment, which is now non-nested. This will break if you have
<a title="x < href=z" href="math.html">math</a>. This cannot be worked around unless you really start to write a real parser.
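Here is a small Python sketch of that failure, next to what an actual parser does with the same input. The segment regex is a hypothetical stand-in for the one in the perl script, and the parser is Python's standard html.parser module, swapped in only as an example of “a real parser”. The “LinkCollector” class is made up for this example.

```python
import re
from html.parser import HTMLParser

sample = '<a title="x < href=z" href="math.html">math</a>'

# Splitting on "<" cuts the tag in two, in the middle of the title value:
# ['', 'a title="x ', ' href=z" href="math.html">math', '/a>']
segments = sample.split("<")

# Hypothetical per-segment pattern: segment starts an <a> tag with a quoted href.
HREF_IN_SEGMENT = re.compile(r'^a\s(?:[^>]*?\s)?href="([^"]*)"', re.IGNORECASE)
split_links = [m.group(1) for m in map(HREF_IN_SEGMENT.match, segments) if m]


class LinkCollector(HTMLParser):
    """Collect href values of <a> start tags as the parser reports them."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value is not None:
                    self.links.append(value)


collector = LinkCollector()
collector.feed(sample)
collector.close()
parser_links = collector.links

print(split_links)
print(parser_links)
```

The first segment has the “a” but no href, and the second has the href but does not start a tag, so the split-based version finds nothing. The parser treats the quoted title value as one unit and reports math.html.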
The elisp here is more powerful, not because of any lisp features, but because of emacs's buffer datatype. You can think of it as a glorified string datatype, in which you can move a cursor back and forth, use regex to search forward or backward, or save cursor positions (indexes) and grab parts of text for further analysis.
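To illustrate what “a glorified string datatype” means, here is a toy Python model of a buffer. It is only loosely modeled on emacs (whose real functions are re-search-forward and re-search-backward). The class and method names are made up, and the backward search is simplified compared to what emacs does.

```python
import re


class Buffer:
    """A string plus a movable cursor ("point"), loosely like an emacs buffer."""

    def __init__(self, text):
        self.text = text
        self.point = 0

    def search_forward(self, pattern):
        """Move point to the end of the next match; return True on success."""
        match = re.compile(pattern).search(self.text, self.point)
        if match is None:
            return False
        self.point = match.end()
        return True

    def search_backward(self, pattern):
        """Move point to the start of the last match found before point."""
        last = None
        for match in re.finditer(pattern, self.text[: self.point]):
            last = match
        if last is None:
            return False
        self.point = last.start()
        return True

    def substring(self, start, end):
        return self.text[start:end]


buf = Buffer('<p><a href="cat.html" title="x > y">cat</a></p>')
buf.search_forward(r'href="')
value_start = buf.point              # save a cursor position
buf.search_forward(r'"')             # move past the closing quote
link = buf.substring(value_start, buf.point - 1)
buf.search_backward(r"<a\s")         # jump back to the start of the tag
tag_start = buf.point
print(link, tag_start)
```

Being able to stop at a match, look around it in either direction, and then carry on from a saved position is what a single regex pass over a plain string does not give you.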