Elisp: Process HTML, span, code, Key, Title, Markups
This page is a tutorial, showing a real-world example of using emacs lisp to do many tag transformations.
I need to transform many HTML tags. Typically, they are of the form begin_delimiter…end_delimiter, where the delimiters may be curly quotes “…”, or a pair of HTML tags.
I need to apply the transformation to over 4 thousand HTML pages, and it needs to be accurate, mostly done case-by-case with a human watching.
Also, the delimiters may be nested, so a regex won't work. It either gets too much text (using the default greedy match) or not enough text (using a non-greedy match). With an elisp script, you can use “if” and other emacs functions to correctly find the matching ending tag, as well as automatically skip cases where the transform should not apply, drastically reducing the need for human watch.
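To see the problem concretely, here is a minimal sketch run on a made-up sample string. The variable names (my-sample, my-greedy, my-non-greedy) and the inner class "x" are only for illustration. The text we actually want between the first pair of tags is `a <span class="x">b</span> c`, and neither regex gets it.

```elisp
;; Made-up sample: one span nested inside another, then a second code span.
(setq my-sample
      "<span class=\"code\">a <span class=\"x\">b</span> c</span> and <span class=\"code\">d</span>")

;; Greedy: .* runs on to the LAST </span> on the line, so it grabs too much.
(string-match "<span class=\"code\">\\(.*\\)</span>" my-sample)
(setq my-greedy (match-string 1 my-sample))

;; Non-greedy: .*? stops at the FIRST </span>, which closes the inner span,
;; so it grabs too little.
(string-match "<span class=\"code\">\\(.*?\\)</span>" my-sample)
(setq my-non-greedy (match-string 1 my-sample))
```

The greedy result runs through to the “d” of the second span, while the non-greedy result stops right after “b”.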
In the past week, i spent about 2 days doing a lot of text processing with elisp on the 4 thousand files of my site. Here are the changes i've made:
- book title. For example: “Art Of Programing” ⇒ 《Art Of Programing》
- article title. For example: “How to Edit Elisp Code with Emacs” ⇒ 〈How to Edit Elisp Code with Emacs〉
- computer code. For example: “(setq x 1)” ⇒ 「(setq x 1)」
- file path. For example: “~/Documents/emacs/” ⇒ 〔~/Documents/emacs/〕
- keyboard shortcut notation. For example: “Ctrl+c” ⇒ 【Ctrl+c】
The purpose of the change is to make the syntactical markup more semantically precise. Before, they were all marked by double curly quotes. Now, if i want to find all books i cited on my site, i can do so easily by a simple search on the special bracket for book titles. These changes also make the text easier to read. In the future, if i want all book titles to be colored red for example, i can easily do that by changing the 《》 to an HTML markup.
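As a rough illustration of that future change, here is a minimal sketch that turns 《…》 into an HTML span. The function name my-book-title-to-span and the class name "title" are hypothetical; they are not markup the site is known to use. The 4th argument “t” matters: without it, emacs may change the case of the replacement text to follow the case of the matched title.

```elisp
(defun my-book-title-to-span (text)
  "Return TEXT with each 《…》 turned into a span.
The class name \"title\" is a made-up example."
  (replace-regexp-in-string
   "《\\([^《》]+\\)》"                   ; a book title in 《》, no nesting
   "<span class=\"title\">\\1</span>"   ; \\1 is the title text
   text
   t))                                  ; FIXEDCASE: leave the replacement's case alone

(setq my-title-demo
      (my-book-title-to-span "I read 《Art Of Programing》 last year."))
```

A CSS rule on that class could then color all book titles.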
All this is in the spirit of HTML Microformats, which is part of the semantic web concept. The basic idea is that the syntax encodes semantics. This advantage is a major reason XML became so useful. (The other reason is its regular syntax.)
For info on various brackets used, see: Intro to Chinese Punctuation and Matching Brackets in Unicode.
Also, much of the HTML markup on my site has been cleaned up. For example:
- <code>…</code> (Removed the redundant curly quotes. It was a struggle to decide on this. Note that the quotes can be added automatically with Cascading Style Sheets (CSS) if needed.)
- <kbd>…</kbd> (Changed to the standard tag; reduces char count.)
- … (Removed the tag. It was designed to mark emacs key notation, but didn't make much sense. Now, 【】 does it.)
There are several advantages in these changes. For example, <code> is much shorter than <span class="code">, and it has a standard meaning. It is also more distinctive than the “span” tag, which reduces parsing complexity when i need to process “span” tags.
[see Keyboard Notation Design Issues]
To do these tag transformations, simple cases such as “file path” ⇒ 〔file path〕, where the delimiters are single characters and there's no nesting, can be done with emacs's interactive find-and-replace commands over a directory.
[see Emacs: Find Replace Text in Directory]
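For comparison, the same simple case can be sketched non-interactively on a string. This is only an illustration, not the interactive workflow described above. The function name my-quote-to-path-bracket is made up, and the regex assumes a file path is any curly-quoted text that starts with “/” or “~/”.

```elisp
(defun my-quote-to-path-bracket (text)
  "Return TEXT with curly-quoted file paths changed to 〔…〕.
Assumes a path is quoted text that starts with / or ~/."
  (replace-regexp-in-string
   "“\\(~?/[^“”]*\\)”"   ; “ then a path with no quotes inside, then ”
   "〔\\1〕"
   text
   t))                   ; FIXEDCASE: keep the replacement's case as is

(setq my-path-demo
      (my-quote-to-path-bracket "see “~/Documents/emacs/” and “hello”"))
```

The quoted word “hello” does not look like a path, so it is left alone. A real run would still want a human check, since not every quoted string starting with a slash is a path.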
More complicated cases with nested HTML tags can be done with an elisp script. Here's the general plan.
- Open the file
- Search for the tag
- If found, move to the beginning of tag, mark positions of begin/end of the opening tag
- Use sgml-skip-tag-forward to move to the matching end tag
- Mark positions of begin/end of the ending tag
- Replace the begin/end tags with new tags
To open the file, we can use find-file.
To search for the tag, we do:
(while (search-forward "<span class=\"code\">" nil t) … )
We give “t” for the third argument. It means don't signal an error if not found; search-forward just returns nil, which ends the loop.
The next step is to get the begin/end positions of the opening tag. The end position is simply the current cursor position, because search-forward automatically places the cursor there. To get the beginning position, we just use search-backward on “<”.
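Here is a small sketch of this step in a temp buffer, using a made-up one-line sample. The variable names (my-open-begin, my-open-end, my-open-tag) are only for illustration.

```elisp
(with-temp-buffer
  (insert "x <span class=\"code\">y</span>")
  (goto-char (point-min))
  ;; After a successful search, point sits right after the opening tag.
  (search-forward "<span class=\"code\">" nil t)
  (setq my-open-end (point))
  ;; Move back to the < that starts the opening tag.
  (search-backward "<")
  (setq my-open-begin (point))
  (setq my-open-tag (buffer-substring my-open-begin my-open-end)))
```

Since search-forward also sets the match data, (match-beginning 0) right after the search would give the same beginning position without moving the cursor.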
Now, we need to get the begin/end positions of the matching end tag. This may be a problem because the tags are nested, so there may be many </span> before the one we want.
The good thing is that emacs has the sgml-skip-tag-forward function. It moves the cursor from a beginning tag to its matching end tag.
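A minimal sketch of that behavior on a made-up nested sample. The variable name my-rest and the inner class "x" are only for illustration. The function lives in sgml-mode, so the sketch loads that first.

```elisp
(require 'sgml-mode) ; defines sgml-skip-tag-forward

(with-temp-buffer
  (insert "<span class=\"code\">a <span class=\"x\">b</span> c</span> tail")
  (goto-char (point-min))
  ;; With the cursor on the outer opening tag, skip over the nested inner
  ;; pair and stop right after the outer </span>.
  (sgml-skip-tag-forward 1)
  (setq my-rest (buffer-substring (point) (point-max))))
```

What remains after the cursor is just the text following the outer closing tag, so the inner </span> was correctly passed over.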
Once we have the begin/end positions for the begin/end tags, we can easily do the replacement. Just use delete-region, then use insert to insert the new tag we want.
One important thing is that we should replace the ending tag first, because if we replace the beginning tag first, the positions of the ending tag will change.
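The order can be sketched on a tiny made-up buffer. The positions are hard-coded here only to keep the example short (in the real script they come from the searches); the names p3, p4, p8, p9 follow the script, and my-result is just for illustration.

```elisp
(with-temp-buffer
  (insert "<b>hi</b>")
  ;; Opening tag <b> spans positions 1 to 4, ending tag </b> spans 6 to 10.
  (let ((p3 1) (p4 4) (p8 6) (p9 10))
    ;; Ending tag first: text before p8 does not move, so p3 and p4 stay valid.
    (delete-region p8 p9)
    (goto-char p8)
    (insert "」")
    ;; Now the beginning tag.
    (delete-region p3 p4)
    (goto-char p3)
    (insert "「"))
  (setq my-result (buffer-string)))
```

If the beginning tag were replaced first, the text would shrink by 2 chars and p8 and p9 would point at the wrong places.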
;; -*- coding: utf-8 -*-
;; 2010-08-25
;; change
;; <span class="code">…</span>
;; to
;; 「…」

(setq inputDir "~/web/xahlee_org/" ) ; dir should end with a slash

(defun my-process-file (fPath)
  "process the file at fullpath fPath …"
  (let ( myBuff changedQ p3 p4 p8 p9)

    ;; open the file
    ;; search for the tag
    ;; if found, move to the beginning of tag, mark positions of begin/end of < and >
    ;; use sgml-skip-tag-forward to move to the end matching tag </span>
    ;; mark positions of begin/end of < and >
    ;; replace them with 「 and 」
    ;; repeat

    (setq myBuff (find-file fPath ) )
    (setq changedQ nil )
    (goto-char 1)
    (while (search-forward "<span class=\"code\">" nil t)

      ;; p4 is the position just after the > of the opening tag
      (backward-char 1)
      (if (looking-at ">")
          (setq p4 (1+ (point)) )
        (error "expecting >" ) )

      ;; go to beginning of "<span class="code">"
      (sgml-skip-tag-backward 1)
      (if (looking-at "<")
          (setq p3 (point) )
        (error "expecting <" ) )

      (forward-char 2)

      ;; go to end of </span>
      (sgml-skip-tag-forward 1)
      (backward-char 1)
      (if (looking-at ">")
          (setq p9 (1+ (point)) )
        (error "expecting >" ) )

      ;; go to beginning of </span>
      (backward-char 6)
      (if (looking-at "<")
          (setq p8 (point) )
        (error "expecting <" ) )

      (when (y-or-n-p "change? ")
        ;; replace the ending tag first, so that p3 and p4 stay valid
        (delete-region p8 p9 )
        (insert "」")
        (delete-region p3 p4 )
        (goto-char p3)
        (insert "「")
        (setq changedQ t ) ))

    ;; if not changed, close it. Else, leave buffer open
    (if changedQ
        (progn (make-backup)) ; leave it open
      (progn (kill-buffer myBuff)) ) ))

(require 'find-lisp)
(require 'sgml-mode) ; defines sgml-skip-tag-forward and sgml-skip-tag-backward

(let (outputBuffer)
  (setq outputBuffer "*span tag to code tag*" )
  (with-output-to-temp-buffer outputBuffer
    (mapc 'my-process-file (find-lisp-find-files inputDir "\\.html$"))
    (princ "Done deal!") ) )
In the code above, i also put in extra checks to make sure that the position of the beginning tag is really the < char. Same for the ending tag. (Probably redundant, but i tend to be extra careful.)
Also, i used the y-or-n-p function, so emacs will prompt me for each change, which i can visually check.
For those files that are changed, i leave them open. So, if i decide on a whim that i don't want all these changes on potentially hundreds of files, i can simply close all the buffers with 4 keystrokes in ibuffer. Same if i want to save them all.
[see Emacs: ibuffer tutorial]
For files where no change takes place, the buffer is simply closed.
In the above, i also called “make-backup”. I want to make a backup of each changed file, but without relying on emacs's automatic backup mechanism (i have it turned off). For the code, see: Emacs: Backup Current File.
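The real make-backup is on the linked page and is not reproduced here. As a rough idea of what such a function could look like, here is a hypothetical stand-in. The name my-make-backup and the backup file naming scheme are assumptions made for this sketch, not the author's code.

```elisp
(defun my-make-backup ()
  "Copy the current buffer's file to a timestamped sibling file.
Hypothetical stand-in for make-backup; the naming scheme is made up.
Return the backup file name, or nil if the buffer visits no file."
  (let ((file (buffer-file-name)))
    (when file
      (let ((backup (concat file "~"
                            (format-time-string "%Y%m%d_%H%M%S")
                            "~")))
        ;; Copies the file as it is on disk, that is, the content from
        ;; before any unsaved changes in the buffer. The 3rd argument t
        ;; means overwrite an existing backup of the same name.
        (copy-file file backup t)
        backup))))
```

Because it copies the file on disk and not the buffer text, calling it before saving keeps the original content even though the buffer has already been modified.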
Emacs is fantastic!