This page is a tutorial showing a real-world example of using Emacs Lisp to do many tag transformations.
I need to transform many HTML tags. Typically, they are of the form BeginDelimiter…EndDelimiter, where the delimiters may be curly quotes “…”, or an HTML tag such as
<span class="xyz">…</span>.
I need to apply the transformation to over 4 thousand HTML pages, and i need it to be accurate, so most changes are made case by case under human watch.
Also, the delimiters may be nested, so a regex won't work. It either gets too much text (with the default greedy match) or not enough text (with a non-greedy match). With an elisp script, you can use “if” and other emacs functions to correctly find the matching ending tag, as well as automatically skip cases where the transform should not apply, which drastically reduces the need for human watch.
In the past week, i spent about 2 days doing a lot of text processing with elisp on the 4 thousand files of my site. Here are the changes i've made:
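To make the regex problem concrete, here is a minimal sketch. The sample text and the names my-demo-html and my-demo-match are made up for illustration; only string-match and match-string are standard functions.

```elisp
;; Hypothetical sample: a "code" span that contains another span,
;; followed by a second, separate "code" span.
(defvar my-demo-html
  "<span class=\"code\">a <span class=\"x\">b</span> c</span> and <span class=\"code\">d</span>"
  "Made-up sample text with a nested span.")

(defun my-demo-match (regexp)
  "Return the text of the first match of REGEXP in `my-demo-html', or nil."
  (when (string-match regexp my-demo-html)
    (match-string 0 my-demo-html)))

;; Greedy: runs to the LAST </span>, so it swallows the second element too.
(my-demo-match "<span class=\"code\">.*</span>")

;; Non-greedy: stops at the FIRST </span>, which closes the inner span,
;; so the match ends before " c</span>".
(my-demo-match "<span class=\"code\">.*?</span>")
```

Neither pattern returns exactly the first “code” element, which is why the script below finds the matching end tag by other means.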
The purpose of the change is to make the syntactical markup more semantically precise. Before, they were all marked by double curly quotes. Now, if i want to find all the books i cited on my site, i can do so easily with a simple search for the special bracket for book titles. These changes also make the text easier to read. In the future, if i want all book titles to be colored red for example, i can easily do that by changing the 《》 to HTML markup (⁖ <span class="title">…</span>), or use JavaScript to do that on the fly. Same for emacs keybinding. For example, with this clear syntax, it's easier to write JavaScript so that when the mouse hovers over the keybinding notation, it shows a balloon with the command name for that key.
All this is part of the HTML Microformat idea, which is part of the semantic web concept. The basic idea is that the syntax encodes semantics. This advantage is a major reason XML became so useful. (The other reason is its regular syntax.)
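As a rough sketch of how cheap such a later change would be, the function below turns book-title brackets into the span markup mentioned above. The function name is made up, and it assumes 《》 pairs are never nested.

```elisp
(defun my-title-brackets-to-span (text)
  "Return TEXT with each 《…》 replaced by <span class=\"title\">…</span>.
Hypothetical helper; assumes book-title brackets are never nested."
  (replace-regexp-in-string "《\\([^》]*\\)》"
                            "<span class=\"title\">\\1</span>"
                            text))

(my-title-brackets-to-span "I read 《Emacs Lisp Intro》 twice.")
;; ⇒ "I read <span class=\"title\">Emacs Lisp Intro</span> twice."
```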
For info on various brackets used, see: Intro to Chinese Punctuation and Matching Brackets in Unicode.
Also, much of the HTML markup on my site has been cleaned up. For example:
<span class="code">…</span> ⇒ <code>…</code>
“<span class="code">…</span>” ⇒ <code>…</code> (Remove the redundant curly quotes. They can be added automatically with Cascading Style Sheets (CSS) if needed.)
<span class="key">…</span> ⇒ <kbd>…</kbd> (Change to standard tag; reduce char count.)
<span class="kbd">…</span> ⇒ … (Remove the tag. Was designed to mark emacs key notation, but doesn't make much sense. Now, 【】 does it.)
There are several advantages in these changes. For example, <code> is much shorter than <span class="code">, and it has a standard meaning. It is also more distinctive than the “span” tag, which reduces parsing complexity when i need to process “span” tags.
Here's a side note about key notations.
Also, i used to use a <span class="kbd">…</span> tag to mark up emacs key notation, but my use wasn't consistent. For example, 【Ctrl+x find-file】 might be marked as 【Ctrl+x】 find-file or 【Ctrl+x find-file】. The problem is actually quite thorny. It is about designing a consistent notation for keyboard shortcuts. Keep in mind that there are many types of key shortcuts. ⁖ a single key such as {F1, ❖ Win}, or a normal combination such as 【Ctrl+x】, or a sequence of combinations such as emacs's 【Ctrl+x Ctrl+f】 and 【Ctrl+x f】, or Windows's 【Alt+Space c】, 【Alt t i】 (accessing a menu by key, called “menu accelerator”). (A sequence of single keys is also common when you have sticky keys on, available in Windows, Mac, Linux.)
In general, it is not trivial to design a notation that is unambiguous and covers all these different types of common key shortcut practices. You want a notation that can contain a sequence of key-press elements, where each key-press element can be a single key or a key combination (such as 【Ctrl+x】). Also, note that 【Ctrl+x】 does not simply mean pressing the keys together, but actually pressing and holding Ctrl first, and releasing it last.
Also, in designing such a notation, there's a consideration of space char in the notation. For example, 【Ctrl+c Ctrl+c】 does not mean you have to press a space in this sequence, rather, space is used as a separator.
Another issue to consider is the plus sign. For example, 【Ctrl+x】 does not actually involve pressing the “+” key. Rather, “+” is used to indicate a combination. This can be a readability problem when you have 【Ctrl++】 (usually for zoom in). (In Microsoft's software, such a case is simply written as 【Ctrl+】, a break of regularity. Apple's notation does not use any conjunction sign; it just places 2 keys together to mean simultaneously pressed keys.)
Another issue is whether to consider a key as a key or as a character. For example, by convention, 【Ctrl+X】 means pressing the lower case x key, not capital X. This does introduce ambiguity. Most apps' menus use a notation that explicitly includes the ⇧ Shift key. For example, in Firefox, the “Show All History” shortcut is written as 【Ctrl+⇧ Shift+H】, but “Zoom In” is written as 【Ctrl++】, not 【Ctrl+⇧ Shift++】 nor 【Ctrl+⇧ Shift+-】. When you consider different keyboard layouts, for example the QWERTZ layout used in Germany, where the # key is not the shifted 3, this inconsistency about the Shift key creates more ambiguity. 〔☛ Idiocy of Keyboard Layouts: QWERTZ, AZERTY, Alt Graph〕
Also, what does it mean when you have a sequence of chars? For example, 【Ctrl+x】 does not mean pressing C, then t, then r, then l. However, in 【Meta+x dired】, it does mean pressing each of the characters in the word dired.
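To illustrate the separator rules described so far (space separates key presses, “+” joins keys within one press, and a trailing “+” can itself be a key), here is a small parsing sketch. The function names are made up for this example, and it assumes key names contain no spaces, so a name like “⇧ Shift” would need extra handling.

```elisp
(defun my-split-key-press (press)
  "Split one key press such as \"Ctrl+Shift+h\" into a list of key names.
A plus sign counts as a joiner only when text follows it, so the
final \"+\" in \"Ctrl++\" is kept as a key name.  Hypothetical helper."
  (if (string-match "\\`\\(.+?\\)\\+\\(.+\\)\\'" press)
      (cons (match-string 1 press)
            (my-split-key-press (match-string 2 press)))
    (list press)))

(defun my-parse-key-notation (notation)
  "Split NOTATION into a list of key presses, each a list of key names.
Spaces separate key presses.  Hypothetical helper."
  (mapcar #'my-split-key-press (split-string notation " " t)))

(my-parse-key-notation "Ctrl+x Ctrl+f") ; ⇒ (("Ctrl" "x") ("Ctrl" "f"))
(my-parse-key-notation "Ctrl++")        ; ⇒ (("Ctrl" "+"))
```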
See also: Short Survey of Keyboard Shortcut Notations.
The human-readable key notation i want, with this degree of precision, is close to creating a language for key macro applications. But if you look at those apps, their syntax is not human readable and is hugely inconsistent, and most of them are basically syntax soup with lots of special cases. See:
Simple cases of these tag transformations, where the delimiters are single characters and there's no nesting, can be done with emacs's dired-do-query-replace-regexp. For example:
“file path” ⇒ 〔file path〕
〔☛ Interactively Find/Replace String Patterns on Multiple Files〕
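More complicated cases, with nested HTML tags, can be done with an elisp script. Here's the general plan.
Open the file.
Search for the tag.
If found, move to the beginning of the tag, and mark the beginning and ending positions of its < and >.
Use sgml-skip-tag-forward to move to the end matching tag.
Mark the beginning and ending positions of its < and >.
Replace them with <code> and </code>.
Repeat.
To open the file, we can use find-file.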
To search for the tag, we do:
(while (search-forward "<span class=\"code\">" nil t) … )
We give “t” for the third argument. It means: don't signal an error if the text is not found, just return nil.
The next step is to get the beginning and ending positions of the opening tag. The end position is simply the current cursor position, because search-forward automatically places the cursor there. To get the beginning position, we just use search-backward on “<”.
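A small self-contained sketch of that loop pattern follows. The function name my-count-tag and the sample text are made up; it only counts the matches, using a temporary buffer instead of a real file.

```elisp
(defun my-count-tag (tag text)
  "Return how many times the literal string TAG occurs in TEXT.
Hypothetical helper for illustration."
  (with-temp-buffer
    (insert text)
    (goto-char (point-min))
    (let ((count 0))
      ;; With a non-nil third argument, `search-forward' returns nil when
      ;; nothing more is found instead of signaling an error, so the
      ;; loop simply ends.
      (while (search-forward tag nil t)
        (setq count (1+ count)))
      count)))

(my-count-tag "<span class=\"code\">"
              "<p><span class=\"code\">a</span> <span class=\"code\">b</span></p>")
;; ⇒ 2
```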
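The step just described can be sketched like this, again on a temporary buffer. The function name and sample text are made up for illustration. (Calling match-beginning with 0 right after the search-forward would give the same beginning position.)

```elisp
(defun my-opening-tag-bounds (text tag)
  "Return (BEGIN . END) buffer positions of the first TAG in TEXT, or nil.
Hypothetical helper for illustration."
  (with-temp-buffer
    (insert text)
    (goto-char (point-min))
    (when (search-forward tag nil t)
      (let ((tag-end (point)))   ; point is just after the ">"
        (search-backward "<")    ; point is now on the "<"
        (cons (point) tag-end)))))

(my-opening-tag-bounds "<p>see <span class=\"code\">ls</span></p>"
                       "<span class=\"code\">")
;; ⇒ (8 . 27)
```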
Now, we need to get the beginning and ending positions of the matching end tag.
This may be a problem because the tags are nested, so there may be many </span> before the one we want.
The good thing is that emacs's html-mode has the sgml-skip-tag-forward function. It moves the cursor from a beginning tag to its matching end tag.
Once we have the beginning and ending positions of the beginning and ending tags, we can easily do the replacement. Just use delete-region, then use insert to insert the new tag we want.
One important thing is that we should replace the ending tag first, because if we replace the beginning tag first, the positions of the ending tag will change.
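Here is a minimal sketch of that function on nested tags. The helper name and sample text are made up. The function itself is defined in sgml-mode.el, the library html-mode is built on. The intent is that the cursor lands after the outer </span>, not the inner one.

```elisp
(require 'sgml-mode)

(defun my-skip-to-matching-end (text)
  "Return the part of TEXT from its first tag through the matching end tag.
TEXT is assumed to start with an opening tag.  Hypothetical helper."
  (with-temp-buffer
    (insert text)
    (goto-char (point-min))
    ;; Move from the opening tag to just after its matching end tag,
    ;; skipping over any nested pairs of the same tag.
    (sgml-skip-tag-forward 1)
    (buffer-substring (point-min) (point))))

(my-skip-to-matching-end
 "<span class=\"code\">a <span class=\"x\">b</span> c</span> tail")
```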
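The ordering point can be seen in a small sketch. The function name is made up, and the four positions are passed in by hand here, where the real script computes them.

```elisp
(defun my-swap-tags (text open-begin open-end close-begin close-end)
  "Return TEXT with its opening tag replaced by <code> and its end tag by </code>.
The tags occupy OPEN-BEGIN..OPEN-END and CLOSE-BEGIN..CLOSE-END.
Hypothetical helper for illustration."
  (with-temp-buffer
    (insert text)
    ;; End tag first: nothing before CLOSE-BEGIN moves, so the
    ;; opening tag's positions are still valid afterward.
    (goto-char close-begin)
    (delete-region close-begin close-end)
    (insert "</code>")
    (goto-char open-begin)
    (delete-region open-begin open-end)
    (insert "<code>")
    (buffer-string)))

(my-swap-tags "<span class=\"code\">ls</span>" 1 20 22 29)
;; ⇒ "<code>ls</code>"
```

Had the opening tag been replaced first, the text would have shrunk by 13 characters and positions 22 and 29 would no longer point at the end tag.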
;; -*- coding: utf-8 -*-
;; 2010-08-25

;; change
;; <span class="code">…</span>
;; to
;; <code>…</code>

(require 'find-lisp)
(require 'sgml-mode) ; for sgml-skip-tag-forward and sgml-skip-tag-backward

(setq inputDir "~/web/xahlee_org/" ) ; dir should end with a slash

(defun my-process-file (fPath)
  "process the file at fullpath fPath …"
  (let (myBuff changedQ p3 p4 p8 p9)

    ;; open the file
    ;; search for the tag
    ;; if found, move to the beginning of tag, mark positions of beginning and ending of < and >
    ;; use sgml-skip-tag-forward to move to the end matching tag </span>
    ;; mark positions of beginning and ending of < and >
    ;; replace them with <code> and </code>
    ;; repeat

    (setq myBuff (find-file fPath))
    (setq changedQ nil)
    (goto-char 1)
    (while (search-forward "<span class=\"code\">" nil t)

      ;; p4 is the position just after the ">" of the opening tag
      (backward-char 1)
      (if (looking-at ">")
          (setq p4 (1+ (point)))
        (error "expecting >"))

      ;; go to beginning of "<span class="code">"
      (sgml-skip-tag-backward 1)
      (if (looking-at "<")
          (setq p3 (point))
        (error "expecting <"))

      ;; go to end of </span>
      (forward-char 2)
      (sgml-skip-tag-forward 1)
      (backward-char 1)
      (if (looking-at ">")
          (setq p9 (1+ (point)))
        (error "expecting >"))

      ;; go to beginning of </span>
      (backward-char 6)
      (if (looking-at "<")
          (setq p8 (point))
        (error "expecting <"))

      ;; replace the ending tag first, so that p3 and p4 stay valid
      (when (yes-or-no-p "change? ")
        (delete-region p8 p9)
        (insert "</code>")
        (delete-region p3 p4)
        (goto-char p3)
        (insert "<code>")
        (setq changedQ t)))

    ;; if not changed, close it. Else, leave buffer open
    (if changedQ
        (make-backup) ; leave it open
      (kill-buffer myBuff))))

(let (outputBuffer)
  (setq outputBuffer "*span tag to code tag*")
  (with-output-to-temp-buffer outputBuffer
    (mapc 'my-process-file (find-lisp-find-files inputDir "\\.html$"))
    (princ "Done deal!")))
In the code above, i also put in extra checks to make sure that the position of the beginning tag is really the < char. Same for the ending tag. (Probably redundant, but i tend to be extra careful.)
Also, i used the yes-or-no-p function, so emacs prompts me for each change, which i can visually check.
For those files that are changed, i leave them open. So, if i decide on a whim that i don't want these changes after all, on potentially hundreds of files, i can simply close all the buffers with 4 keystrokes in ibuffer. Same if i want to save them all.
For files where no change takes place, the buffer is simply closed.
In the above, i also called “make-backup”. I want to make a backup of each changed file, without relying on Emacs's automatic backup mechanism (i have it turned off). Here's the code.
(defun make-backup ()
  "Make a backup copy of current buffer's file.
Create a backup of current buffer's file.
The new file name is the old file name with “~” added at the end, in the same dir.
If such a file already exists, append more “~”.
If the current buffer is not associated with a file, it's an error."
  (interactive)
  (let (cfile bfilename)
    (setq cfile (buffer-file-name))
    ;; stop early when the buffer has no file, as the docstring promises
    (unless cfile
      (error "Current buffer is not associated with a file"))
    (setq bfilename (concat cfile "~"))
    ;; keep appending "~" until the name is unused
    (while (file-exists-p bfilename)
      (setq bfilename (concat bfilename "~")))
    (copy-file cfile bfilename t)
    ;; use a format string, in case the file name contains "%"
    (message "Backup saved as: %s" (file-name-nondirectory bfilename))))
Emacs is fantastic!