Elisp: HTML Processing: Split Annotation
This page shows an example of emacs lisp for processing HTML. The HTML files are classic novels with annotations. The annotation markup needs to change from one format to another, and there are hundreds of such pages to process.
For all HTML files in a directory, find any annotation markup containing the bullet “•” symbol:
<div class="annotate27223">A … • B … • C …</div>
Split the annotation into multiple markups, like this:
<div class="annotate27223">A … </div> <div class="annotate27223">B … </div> <div class="annotate27223">C … </div>
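Seen purely as a string operation, the split above can be sketched as follows. This is a minimal illustration, not the script used on this page: the function name my-split-annotation-text is hypothetical, and it assumes the annotation text has no <br> or nested tags that need special handling. Unlike the example above, it also trims the whitespace around each entry.

```elisp
;; Hypothetical helper for illustration only; not part of the original script.
(defun my-split-annotation-text (text tag-begin tag-end)
  "Split TEXT on the bullet and wrap each entry in TAG-BEGIN and TAG-END.
TEXT is the inner text of one annotation markup.
Return the new markup as a string, one entry per line."
  (mapconcat
   (lambda (entry) (concat tag-begin entry tag-end))
   ;; The third argument drops empty pieces (such as the one before a
   ;; leading bullet); the fourth trims whitespace around each entry.
   (split-string text "•" t "[ \t\n]+")
   "\n"))

;; Example call:
(my-split-annotation-text "A • B • C"
                          "<div class=\"annotate27223\">"
                          "</div>")
```

The example call returns three annotate27223 markups, one each for A, B and C, separated by newlines.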
If you are a contract web dev programmer, then you know that 99.99% of websites are a messy text soup. They are created by hundreds of tools and languages: word processors, HTML generators, tens of lightweight markup languages, and frameworks in different languages (PHP, Perl, Python), from different web eras and by different programmers over the years. Even emacs has several modes that generate HTML. The output is not in any consistent form, and often it has mismatched tags too, making it invalid HTML.
It is in these situations that emacs shines, because its powerful embedded language, lisp, and its interactive nature let you maximize automation: you work interactively while you are still feeling out the pattern, then use a keyboard macro or emacs lisp for the parts that can be automated.
For my website, i take the time to make sure that all my HTML is consistent. Still, the pages were written over a span of 15 years. Periodically i take the time to improve the markup, for example when new versions of CSS or HTML mature and become widely supported by web browsers (CSS1 to 2 to 3, HTML 3 to 4 to HTML5).
I have hundreds of pages of classic novels as HTML documents. These documents contain annotations in a special HTML markup. For example, here's a sample annotation from Titus Andronicus: Act 1:
SATURNINUS. 'Tis good, sir. You are very short with us;
But if we live we'll be as sharp with you.
• short ⇒ rudely brief. (AHD)
• sharp ⇒ Fierce, impetuous, harsh, severe… (AHD)
Here's the raw HTML:
<div class="annotate27223">• short ⇒ rudely brief. (AHD)<br>
• sharp ⇒ Fierce, impetuous, harsh, severe… (AHD)</div>
<pre class="text48074">
SATURNINUS. 'Tis good, sir. You are very <span class="xntt">short</span> with us;
But if we live we'll be as <span class="xntt">sharp</span> with you.
</pre>
Here's how the tags work. Each <span class="xntt"> marks up a word in the main text. When a word is marked by “span.xntt”, that means it has a sidebar annotation. The sidebar section is marked by <div class="annotate27223">. Inside the “div.annotate27223” there may be more than one entry. Each entry starts with the bullet symbol “•”. For example, in the above, the words “short” and “sharp” are both entries inside one “div.annotate27223” sidebar.
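To make the relation between the two tags concrete, here is a small sketch that lists the words marked by “span.xntt” in a piece of HTML. The function name my-list-annotated-words is hypothetical, and the sketch assumes each span holds plain text with no nested tags.

```elisp
;; Hypothetical helper for illustration only.
(defun my-list-annotated-words (html)
  "Return a list of the words marked by span.xntt in the string HTML.
Assumes each span contains plain text only (no nested tags)."
  (let ((words '())
        (start 0))
    (while (string-match "<span class=\"xntt\">\\([^<]*\\)</span>" html start)
      (push (match-string 1 html) words)
      ;; continue searching after the span just found
      (setq start (match-end 0)))
    (nreverse words)))

;; Example call on the sample text above:
(my-list-annotated-words
 "You are very <span class=\"xntt\">short</span> with us; But if we live we'll be as <span class=\"xntt\">sharp</span> with you.")
```

For the sample text, the words returned are “short” and “sharp”, which should line up with the entries in the sidebar.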
So, i want to write an elisp script to process all my files. If you simply read the spec for this job, splitting a markup by a particular character, you may think it's trivial and can be done in any language in 10 minutes. Why then the elaborate discussion about the text soup situation?
The important thing is that i DO NOT know what needs to be done to begin with. Only after using emacs, together with lisp scripts i wrote before, to look at and check my existing markup in hundreds of files do i know what state they are in and can decide what i want to do. Also, this change must be done with the ability to visually check that all changes are done correctly, because the input may not be in the format i expect. (It might be missing the bullet “•”.)
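One way to check for input that is not in the expected format is to list the annotations that lack a bullet. The following is a rough sketch under stated assumptions, not the checking script mentioned above: the function name my-annotations-missing-bullet is hypothetical, and it assumes no div is nested inside an annotation, so the first </div> after the opening tag closes it.

```elisp
;; Hypothetical helper for illustration only.
(defun my-annotations-missing-bullet (html)
  "Return the inner text of each annotate27223 div in HTML that has no bullet.
Assumes no div is nested inside an annotation."
  (let ((tag-begin "<div class=\"annotate27223\">")
        (found '()))
    (with-temp-buffer
      (insert html)
      (goto-char (point-min))
      (while (search-forward tag-begin nil t)
        (let ((p1 (point)))
          ;; the first closing tag after the opening tag ends the annotation
          (when (search-forward "</div>" nil t)
            (let ((text (buffer-substring-no-properties
                         p1 (match-beginning 0))))
              (unless (string-match-p "•" text)
                (push text found)))))))
    (nreverse found)))
```

Each string returned is an annotation to look at by hand before running any automated change.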
For those Scheme Lisp academic computer science folks, you might wonder why, when i started with these annotations, i didn't “design” the markup well to begin with. The reason is that when i write a blog article, or work on my literature annotation project, i really want to focus on the writing first, the content, and get it done, rather than get distracted by CSS/HTML markup design. (One thing i do make sure of is that whatever CSS/HTML i devise can be easily changed systematically later by simple parsing.) I devote significantly more time to design than most people, but many factors necessitate change. For example, CSS in practice is rather complex, and it takes years of experience to learn its quirks and tricks. Similarly, the best practices of HTML change with time. (For example, see: Are You Intelligent Enough to Understand HTML5?.) Browsers change, standards change (for example, HTML → XHTML → HTML5. See: HTML5 Doctype, Validation, X-UA-Compatible, and Why Do I Hate Hackers.), thoughts on best practices change, and my needs for the annotation have also changed throughout the years.
Here's the outline of steps:
- Open the file. Search for the tag we want.
- Check if the tag contains a bullet “•”.
- If so, replace each bullet char with an end tag followed by a new beginning tag: </div> then <div class="annotate27223">.
- Do this for all files in a dir. (or a given list of files)
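Before changing anything, the first two steps and the last one can be tried as a dry run that only counts. This is a sketch, not the script below: the function names are hypothetical, it swaps in the built-in directory-files-recursively (Emacs 25 or later) for walking the directory, and it assumes the files are UTF-8 and that no div is nested inside an annotation.

```elisp
;; Hypothetical helpers for illustration only.  No file is modified.
(defun my-count-bullet-annotations (file)
  "Return how many annotate27223 divs in FILE contain a bullet.
Assumes FILE is UTF-8 and has no div nested inside an annotation."
  (let ((count 0)
        (coding-system-for-read 'utf-8)) ; assumption: files are UTF-8
    (with-temp-buffer
      (insert-file-contents file)
      (goto-char (point-min))
      (while (search-forward "<div class=\"annotate27223\">" nil t)
        (let ((p1 (point)))
          (when (search-forward "</div>" nil t)
            (let ((p2 (match-beginning 0)))
              (save-excursion
                (goto-char p1)
                ;; look for a bullet only inside this annotation
                (when (search-forward "•" p2 t)
                  (setq count (1+ count)))))))))
    count))

(defun my-report-bullet-annotations (dir)
  "Return an alist of (FILE . COUNT) for HTML files under DIR.
Only files with at least one bullet annotation are included."
  (let ((result '()))
    (dolist (file (directory-files-recursively dir "\\.html\\'"))
      (let ((n (my-count-bullet-annotations file)))
        (when (> n 0)
          (push (cons file n) result))))
    (nreverse result)))
```

Reading each file into a temporary buffer keeps the dry run from opening hundreds of visible buffers; the real script below visits the files instead, so that changed buffers stay open for visual checking.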
Here's the code:
;; -*- coding: utf-8 -*-
;; 2011-08-13
;; process all files in a dir.
;; split any markup like this:
;; <div class="annotate27223">… • … • …</div>
;; by the bullet •
;; into several annotate27223 tags

(setq inputDir "~/web/xahlee_org/p/" )

;; add an ending slash if not there
(when (not (string= "/" (substring inputDir -1) ))
  (setq inputDir (concat inputDir "/") ) )

;; files to process
(setq fileList
      [
       "~/web/xahlee_org/p/arabian_nights/aladdin/aladdin4_1.html"
       "~/web/xahlee_org/p/arabian_nights/aladdin/aladdin3.html"
       ] )

(defun my-process-file-xnote (fPath)
  "Process the file at FPATH …"
  (let (myBuffer
        ($counter 0)
        p1 p2 $meat $meatNew
        (changedItems '())
        (tagBegin "<div class=\"annotate27223\">" )
        (tagEnd "</div>" ) )

    (require 'sgml-mode)

    (setq myBuffer (find-file fPath))
    (goto-char 1)
    (while (search-forward tagBegin nil t)

      ;; capture the annotate27223 tag text
      (setq p1 (point))
      (backward-char 1) ; back onto the “>” so the whole element gets skipped
      (sgml-skip-tag-forward 1)
      (backward-char 6) ; back over the 6 chars of </div>
      (setq p2 (point))
      (setq $meat (buffer-substring-no-properties p1 p2))

      ;; if it contains a bullet
      (when (string-match "•" $meat)
        (setq $counter (1+ $counter))

        ;; clean the text. Remove some newline and <br> that's no longer needed
        (setq $meat (replace-regexp-in-string "\n*• *" "•" $meat t t ) )
        (setq $meat (replace-regexp-in-string "\n\\'" "" $meat t t ) ) ; delete ending eol
        (setq $meat (replace-regexp-in-string "<br>•" "•" $meat t t ) )

        ;; add the new entries to a list, for later reporting
        (setq changedItems (append changedItems (split-string $meat "•" t)) )

        ;; break the bullet into new end/begin tags
        (setq $meatNew (replace-regexp-in-string "•" (concat tagEnd "\n" tagBegin) $meat t t ) )
        (goto-char p1)
        (delete-region p1 p2)
        (insert $meatNew)

        ;; remove the newline before end tag
        (when (eq (char-before) ?\n)
          (delete-char -1)) ) )

    ;; report if any annotation in this file was split
    (when (not (= $counter 0))
      (princ "-------------------------------------------\n")
      (princ (format "%d %s\n\n" $counter fPath))
      (mapc (lambda ($x) (princ (format "%s\n\n" $x)) ) changedItems) )

    ;; close buffer if there's no change. Else leave it open.
    (when (not (buffer-modified-p myBuffer))
      (kill-buffer myBuffer) ) ))

(require 'find-lisp)

(let (outputBuffer)
  (setq outputBuffer "*xah annotate27223 output*" )
  (with-output-to-temp-buffer outputBuffer
    ;; (mapc 'my-process-file-xnote fileList)
    (mapc 'my-process-file-xnote (find-lisp-find-files inputDir "\\.html$"))
    (princ "Done deal!") ) )
Here's a sample output: elisp_text_processing_split_annotation.txt.
I've put lots of comments in the code. It should be easy to understand. If there's any part you don't understand, ask me. If you are new to elisp, check out the first few sections of Emacs Lisp Tutorial.
The $ you see at the start of some variable names in my elisp code is just a naming convention. I use it for easy distinction from builtin symbols. You can just ignore it. [see Programing Style: Variable Naming: English Words Considered Harmful]
I ♥ emacs.