This page shows an Emacs Lisp script that searches files, similar to unix grep, but with the condition that the string must occur inside an HTML tag. If you don't know elisp, first take a look at Emacs Lisp Basics.
A few days ago we covered How to Write grep in Emacs Lisp. With a few lines changed, here's a version that finds a string only if it is inside an HTML tag. This is something grep and the unix tool bag cannot do, and it is difficult to do even with Perl or Python, unless you use an HTML parser.
Here's the code:
;; -*- coding: utf-8 -*-
;; 2011-02-25

;; print files that meet this condition:
;; contains <div class="x-note">…</div>
;; where the text content contains more than one bullet char •

(setq inputDir "~/web/xahlee_org/" ) ; dir should end with a slash

(require 'sgml-mode) ; need sgml-skip-tag-forward

(defun my-process-file (fPath)
  "Process the file at FPATH …"
  (let (myBuffer p3 p4 (bulletCnt 0) (totalCnt 0) )
    (when (and (not (string-match "/xx" fPath)) ) ; skip some dir
      (setq myBuffer (get-buffer-create " myTemp"))
      (set-buffer myBuffer)
      (insert-file-contents fPath nil nil nil t)
      (setq bulletCnt 0 )
      (goto-char 1)
      (while (search-forward "<div class=\"x-note\">" nil t)
        (setq p3 (point) ) ; beginning of text content, after <div class="x-note">
        (backward-char)
        (sgml-skip-tag-forward 1)
        (backward-char 6)
        (setq p4 (point) ) ; end of tag content, before the </div>
        (setq bulletCnt (count-matches "•" p3 p4) )
        (when (> bulletCnt 1) ; more than one bullet in this block
          (setq totalCnt (1+ totalCnt) )
          (princ (format "this many: %d %s\n" bulletCnt fPath))))
      (kill-buffer myBuffer))))

(require 'find-lisp)

(let (outputBuffer)
  (setq outputBuffer "*xah occur output*" )
  (with-output-to-temp-buffer outputBuffer
    (mapc 'my-process-file (find-lisp-find-files inputDir "\\.html$"))
    (princ "Done deal!")))

The code is almost exactly the same as our previous grep in Emacs Lisp code, with only about 5 lines changed. (If you are not familiar with the code, see: grep in Emacs Lisp.)
Here's the main part that's different:

      (while (search-forward "<div class=\"x-note\">" nil t)
        (setq p3 (point) ) ; beginning of text content, after <div class="x-note">
        (backward-char)
        (sgml-skip-tag-forward 1)
        (backward-char 6)
        (setq p4 (point) ) ; end of tag content, before the </div>
        (setq bulletCnt (count-matches "•" p3 p4) )
        (when (> bulletCnt 1) ; more than one bullet in this block
          (setq totalCnt (1+ totalCnt) )
          (princ (format "this many: %d %s\n" bulletCnt fPath))))
The interesting part here is the use of the sgml-skip-tag-forward function. It moves the cursor past the closing tag that matches the current opening tag, skipping any nested tags in between. Without this function, it may take days to code it in any language, unless you have experience in writing parsers.
(Note that this cannot be done with regex alone, because a regex cannot match arbitrarily nested tags.)
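To see the nested-tag skipping in isolation, here is a minimal sketch that works on a string instead of a file. The function name my-demo-tag-content is made up for this illustration, and the sketch assumes the tags are balanced and that the closing tag is exactly </div> (6 characters). It also jumps to the start of the opening tag with match-beginning instead of the backward-char used in the script.

```elisp
(require 'sgml-mode) ; provides sgml-skip-tag-forward

;; my-demo-tag-content is a made-up name for this illustration.
(defun my-demo-tag-content (html)
  "Return the text content of the first x-note div in the string HTML.
Return nil if there is no such div.  Assumes the tags are balanced."
  (with-temp-buffer
    (insert html)
    (goto-char (point-min))
    (when (search-forward "<div class=\"x-note\">" nil t)
      (let ((p3 (point)))                 ; start of content, after the open tag
        (goto-char (match-beginning 0))   ; back to the < of the open tag
        (sgml-skip-tag-forward 1)         ; jump past the matching </div>
        (backward-char 6)                 ; step back over "</div>"
        (buffer-substring-no-properties p3 (point))))))

(my-demo-tag-content
 "<div class=\"x-note\">• a <div>nested</div> b</div><div>other</div>")
;; returns "• a <div>nested</div> b"
```

The inner <div>nested</div> does not end the match early: the content runs up to the outer closing tag.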
The other thing i really love about this is that we are not using an XML parser. If you do this in Perl or Python, you almost certainly have to use an HTML/XML parser. With that, your code becomes really complex, dealing with parents, children, and tree structure, instead of simple text processing.
O Emacs! ♥
Find/Replace, or just Find, has extensive use in text processing. Here are some examples of variations, all of which i need on a weekly basis and have several elisp scripts for:
<h1>name doesn't match.
Here's a little side story on why i wrote this one. It's a bit long.
I have about 30 works of classic literature with annotations. For example: The Tale Of The Bull And The Ass. The annotation system is custom HTML tags with proper CSS (a so-called HTML Microformat). Extensive custom emacs commands are used to insert such tags. (You can see these commands at Xah Lee's Emacs Customization Files.)
Each annotation is in a <div class="x-note">…</div> tag. For example:
<div class="x-note">• provaunt ⇒ provide. Provant is a verb meaning: To supply with provender or provisions.</div>
However, some “x-note” blocks contain multiple annotations in one. For example, in the story The Fisherman And The Jinni, there's this one:
<div class="x-note">• stint ⇒ a fixed amount or share work. • might and main ⇒ with all effort and strength. • skein ⇒ A quantity of yarn, thread, or the like, put up together, after it is taken from the reel. • buffet ⇒ hit, beat, especially repeatedly. • fain ⇒ with joy; satisfied; contented. </div>
Each annotation is marked by a bullet “•” symbol, followed by a word. Each word corresponds to the same word in the main text, marked by <span class="xnt">…</span>. (I pack multiple annotations into one “x-note” block to save space in HTML rendering.)
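As a rough sketch of how the bullet convention can be used in text processing, the following splits the text content of one “x-note” block into its separate annotations. The name my-split-annotations is hypothetical, and the sketch assumes that every annotation starts with a bullet and that the bullet character never occurs inside an annotation.

```elisp
;; my-split-annotations is a made-up name for this illustration.
(defun my-split-annotations (noteText)
  "Split NOTETEXT, the text content of one x-note block, into a list.
Each element is one annotation, starting with a bullet •.
Assumes the bullet char never occurs inside an annotation."
  (mapcar (lambda (piece) (concat "• " piece))
          ;; split at each bullet, drop empty pieces,
          ;; and trim surrounding whitespace from each piece
          (split-string noteText "•" t "[ \t\n]+")))

(my-split-annotations
 "• stint ⇒ a fixed amount or share work. • fain ⇒ with joy; satisfied; contented. ")
;; returns ("• stint ⇒ a fixed amount or share work."
;;          "• fain ⇒ with joy; satisfied; contented.")
```

The length of the returned list is the number of annotations in the block.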
To make that possible from my existing system, i need to make sure of a few things:
It would be simpler if each “x-note” block contained just ONE annotation. This means i will first change all my files that contain multi-annotation blocks to make them one annotation per block. This is a text processing job. (Hello emacs lisp!) But it will probably take a full day's work over about a thousand files. The number of files doesn't matter. What would be time-consuming is doing it correctly. First, i need to make sure the text has correct syntax. For example: make sure that each “x-note” block does indeed contain at least one bullet symbol. Make sure that each <span class="xnt">…</span> has a corresponding <div class="x-note">…</div>. (i will be writing emacs lisp scripts to do these validation tasks.)
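The first of these checks (every “x-note” block has at least one bullet) could be sketched like this. It is only an illustration, not one of the actual validation scripts: the name my-note-bullet-counts is hypothetical, the input is a string instead of a file, and the tags are assumed to be balanced.

```elisp
(require 'sgml-mode) ; provides sgml-skip-tag-forward

;; my-note-bullet-counts is a made-up name for this illustration.
(defun my-note-bullet-counts (html)
  "Return a list of bullet counts, one per x-note block in the string HTML.
The counts are in document order.  Assumes the tags are balanced."
  (let (counts p3 p4)
    (with-temp-buffer
      (insert html)
      (goto-char (point-min))
      (while (search-forward "<div class=\"x-note\">" nil t)
        (setq p3 (point))                 ; start of content
        (goto-char (match-beginning 0))   ; back to the < of the open tag
        (sgml-skip-tag-forward 1)         ; jump past the matching </div>
        (setq p4 (- (point) 6))           ; end of content, before </div>
        (push (count-matches "•" p3 p4) counts)))
    (nreverse counts)))

(my-note-bullet-counts
 "<p>x</p><div class=\"x-note\">• a ⇒ b. • c ⇒ d.</div><div class=\"x-note\">no bullet</div>")
;; returns (2 0)
```

A count of 0 marks a block that fails the check, and a count above 1 marks a multi-annotation block.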
So, that's why i wrote this script. I wanted to get some idea of how many “x-note” blocks, and in which files, actually contain multiple annotations.