Elisp vs Perl: Validate File Links
This page shows 2 scripts to validate HTML local file links (i.e. check file existence of local links). One is written in Perl, the other in elisp.
The 2 scripts' algorithms are not artificially made to be the same, but follow the natural style/idiom of each language. They do the same job for my needs.
For each file, call “process_file”. That function then calls “get_links” to get a list of links, then prints each link that leads to a non-existent file.
The heart of the algorithm is the “get_links” function. It reads the whole file content as one big string, then splits the string by the char “<”. Then, for each segment of text, it proceeds to find a link using regex.
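The steps above can be sketched roughly as follows. This is only a minimal illustration, not the actual script: the helper name “get_links_from_string”, the regex, and the rule that links are resolved relative to the directory of the HTML file are all assumptions made for this sketch.

```perl
use strict;
use warnings;
use File::Basename qw(dirname);
use File::Spec;

# Hypothetical helper: pull href/src values out of an HTML string.
sub get_links_from_string {
    my ($html) = @_;
    my @links;
    # Split on "<" so that each segment holds the text of at most one tag.
    for my $segment ( split /</, $html ) {
        # Assumed pattern: only "a" and "img" tags, double-quoted values.
        if ( $segment =~ m{\A(?:a|img)\s[^>]*?\b(?:href|src)\s*=\s*"([^"]*)"}i ) {
            push @links, $1;
        }
    }
    return @links;
}

# Read the whole file as one big string, then extract the links.
sub get_links {
    my ($path) = @_;
    open( my $fh, '<', $path ) or die "cannot open $path: $!";
    local $/;    # slurp mode
    my $html = <$fh> // '';
    close $fh;
    return get_links_from_string($html);
}

# Print (and return) every local link that leads to a non-existent file.
sub process_file {
    my ($path) = @_;
    my $dir = dirname($path);
    my @broken;
    for my $link ( get_links($path) ) {
        # Assumption: skip URLs with a scheme, "//host" URLs, and "#fragment" links.
        next if $link =~ m{\A(?:[a-z][a-z0-9+.-]*:|//|#)}i;
        ( my $file = $link ) =~ s/[#?].*\z//s;    # drop fragment and query
        next if $file eq '';
        push @broken, $link unless -e File::Spec->catfile( $dir, $file );
    }
    print "$path: $_\n" for @broken;
    return @broken;
}

process_file($_) for @ARGV;
```

Saved under a hypothetical name such as validate_links.pl, it could be run as “perl validate_links.pl page1.html page2.html”.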
Full updated code is at Perl: Validate Local Links
For each file, call “my-process-file”. The file is put into a buffer. Then the code uses regex search, cursor movement, etc, to make sure that we find the links we want to check.
What is Valid HTML
Note that the HTML files are assumed to be W3C valid (i.e. no missing closing tags or missing “>”). However, my code is not general enough to cover arbitrary valid HTML. SGML-based HTML is very complex, and isn't just nested tags, but such HTML is basically never used. The Perl and elisp code here works correctly (gets all links) for perhaps 99.9% of the HTML files out there. (Symbolic links and other alias mechanisms of the file system are not considered.)
Edge Case Examples
Here are some edge cases. These examples show that you cannot simply use regex to search for the pattern <a href="…" …>. Here's a most basic example:
<a href="math.html" title="x > y">math</a>
Note that the above is actually valid HTML according to W3C's validator. Also, note that pages that pass the W3C validator are not necessarily valid by W3C's HTML spec. [see W3C HTML Validator Invalid]
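To make the problem concrete, here is a small sketch of what a naive pattern does with that example. The two regexes are assumptions chosen for illustration: they treat the first “>” as the end of the tag, which is wrong here because the “>” sits inside the title attribute.

```perl
use strict;
use warnings;

my $html = '<a href="math.html" title="x > y">math</a>';

# Naive pattern: assume the tag ends at the first ">".
my ($tag) = $html =~ m{(<a\s[^>]*>)};

# Same assumption, now used to grab the link text.
my ($text) = $html =~ m{<a\s[^>]*>(.*?)</a>};

print "tag:  $tag\n";     # the tag is cut off inside the title attribute
print "text: $text\n";    # the rest of the title leaks into the link text
```

The href value still happens to be captured correctly in this case, but the tag boundary is wrong, so anything that depends on where the tag ends is wrong too.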
One cannot simply use regex to search for the pattern <a href="…" …>. This is especially so because some HTML pages contain sample HTML code for teaching HTML, and some programming tutorials contain code examples that use regex to parse HTML. So, the HTML is sometimes HTML embedded in HTML, or HTML code in a regex in Python code on an HTML page.
The following shows that patterns such as src="…" are not necessarily HTML links.
Perl vs Emacs Lisp
One interesting thing is to compare the approaches in Perl and Emacs Lisp.
For our case, regex is not powerful enough to deal with the problem by itself, due to the nested nature of HTML. This is why, in my Perl code, I split the file by “<” into segments first, then use regex to deal with the now non-nested segments. This will break if you have <a title="x < href=z" href="math.html">math</a>. This cannot be worked around unless you really start to write a real parser.
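A short sketch of that failure, assuming a per-segment regex of the kind described above (the function name “links_by_split” and the pattern are hypothetical, and the pattern in the real script may differ):

```perl
use strict;
use warnings;

# Split on "<", then look for an href in each segment that starts an "a" tag.
sub links_by_split {
    my ($html) = @_;
    my @links;
    for my $segment ( split /</, $html ) {
        push @links, $1
            if $segment =~ m{\Aa\s[^>]*?\bhref\s*=\s*"([^"]*)"}i;
    }
    return @links;
}

# A ">" inside an attribute value is harmless for this approach.
my @ok = links_by_split('<a href="math.html" title="x > y">math</a>');

# A "<" inside an attribute value cuts the tag into two segments,
# and neither segment looks like an "a" tag with an href.
my @bad = links_by_split('<a title="x < href=z" href="math.html">math</a>');

print "first case:  @ok\n";
print "second case: ", scalar(@bad), " link(s) found\n";
```

In the second case the link to math.html is silently missed, which is the kind of breakage only a real parser avoids.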
The elisp here is more powerful, not because of any Lisp features, but because of Emacs's buffer datatype. You can think of it as a glorified string datatype, in which you can move a cursor back and forth, use regex to search forward or backward, or save cursor positions (indexes) and grab parts of the text for further analysis.
Also, you might check out my Perl tutorial: Learn Perl in 1 Hour