Elisp: Find String Inside HTML Tag
This page shows an Emacs Lisp script that searches files, similar to Unix grep, but with the condition that the string must occur inside a specific HTML tag. If you don't know elisp, first take a look at Emacs Lisp Basics.
We need to list all files that contain a given string, but only if it is inside a given HTML tag. That is, the opening tag must occur before the string we want and the matching closing tag after it.
This is something grep and Linux shell commands cannot do easily, and it is difficult even with Perl or Python unless you use an HTML parser (which gets complex).
[see Learn Perl in 1 Hour]
Here's the code:
;; -*- coding: utf-8 -*-
;; 2011-02-25
;; print files that meet this condition:
;; contains <div class="x-note">…</div>
;; where the text content contains more than one bullet char •

(setq inputDir "~/web/") ; dir should end with a slash

(require 'sgml-mode) ; need sgml-skip-tag-forward

(defun my-process-file (fPath)
  "Process the file at FPATH …"
  (let (myBuffer p3 p4 (bulletCnt 0) (totalCnt 0))
    (when (not (string-match "/xx" fPath)) ; skip some dir
      (setq myBuffer (get-buffer-create " myTemp"))
      (set-buffer myBuffer)
      (insert-file-contents fPath nil nil nil t)
      (setq bulletCnt 0)
      (goto-char 1)
      (while (search-forward "<div class=\"x-note\">" nil t)
        (setq p3 (point)) ; beginning of text content, after <div class="x-note">
        (backward-char) ; step back into the opening tag so the skip starts from it
        (sgml-skip-tag-forward 1)
        (backward-char 6)
        (setq p4 (point)) ; end of tag content, before the </div>
        (setq bulletCnt (count-matches "•" p3 p4))
        (when (> bulletCnt 1) ; more than one bullet in this block
          (setq totalCnt (1+ totalCnt))
          (princ (format "this many: %d %s\n" bulletCnt fPath))))
      (kill-buffer myBuffer))))

(require 'find-lisp)

(let (outputBuffer)
  (setq outputBuffer "*xah occur output*")
  (with-output-to-temp-buffer outputBuffer
    (mapc 'my-process-file (find-lisp-find-files inputDir "\\.html$"))
    (princ "Done deal!")))

The code is almost exactly the same as our previous grep in Emacs Lisp code, with only about 5 lines changed.
Here's the main part that's different:

(while (search-forward "<div class=\"x-note\">" nil t)
  (setq p3 (point)) ; beginning of text content, after <div class="x-note">
  (backward-char) ; step back into the opening tag so the skip starts from it
  (sgml-skip-tag-forward 1)
  (backward-char 6)
  (setq p4 (point)) ; end of tag content, before the </div>
  (setq bulletCnt (count-matches "•" p3 p4))
  (when (> bulletCnt 1) ; more than one bullet in this block
    (setq totalCnt (1+ totalCnt))
    (princ (format "this many: %d %s\n" bulletCnt fPath))))
The interesting part here is the use of the sgml-skip-tag-forward function. It moves the cursor to just after the matching closing tag, skipping all nested tags. Without this function, it may take days to code this in any language, unless you have experience in writing parsers.
(Note that this cannot be done with a regex alone, because a regex cannot match tags nested to arbitrary depth.)
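To see the nested-tag handling in isolation, here is a minimal sketch that runs the same cursor movements on a string instead of a file. The function name my-x-note-contents and the sample input are made up for illustration; only sgml-skip-tag-forward and the movement sequence come from the script above.

```elisp
(require 'sgml-mode) ; provides sgml-skip-tag-forward

(defun my-x-note-contents (html)
  "Return a list of the inner text of each x-note div in the string HTML.
Hypothetical helper, for illustration only."
  (with-temp-buffer
    (insert html)
    (goto-char (point-min))
    (let (result start)
      (while (search-forward "<div class=\"x-note\">" nil t)
        (setq start (point))            ; just after the opening tag
        (backward-char)                 ; back inside the opening tag
        ;; sgml-skip-tag-forward returns non-nil when it ends up
        ;; after a closing tag
        (when (sgml-skip-tag-forward 1)
          (backward-char 6)             ; step back over "</div>"
          (push (buffer-substring-no-properties start (point)) result)))
      (nreverse result))))

;; The inner <div>…</div> is skipped as a unit, so the block ends at
;; the outer </div>, not at the first </div> found.
(my-x-note-contents
 "<div class=\"x-note\">a <div>b</div> c</div> tail")
```

A search for the first "</div>" after the opening tag would stop inside the nested div, which is the case the skip function is there to handle.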
Find and Replace, or just Find, is used extensively in text processing. Here are some examples of variations, all of which i need on a weekly basis and have several elisp scripts for:
- Find by printing file names.
- Find by printing adjacent text.
- Find by condition of number of occurrences.
- Find only if the file contains a set of strings.
- Replace text based on file's name.
- Find (or Find and Replace) only if the <h1> name doesn't match.
- Find/Report/Replace only if the string is at a particular position in the file. (For example, near top, near bottom.)
- Find only if the string is inside a tag.
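As one illustration of these variations, the "file contains a set of strings" check might look something like the sketch below. It is not one of the scripts mentioned above; the function name my-file-contains-all-p is hypothetical, and the search is assumed to be literal and case-sensitive.

```elisp
(require 'seq) ; provides seq-every-p

(defun my-file-contains-all-p (file strings)
  "Return non-nil if FILE contains every string in the list STRINGS.
Hypothetical helper, for illustration only."
  (with-temp-buffer
    (insert-file-contents file)
    (let ((case-fold-search nil)) ; assumption: match case exactly
      (seq-every-p (lambda (s)
                     ;; restart from the top for each string, so the
                     ;; order of the strings in the file does not matter
                     (goto-char (point-min))
                     (search-forward s nil t))
                   strings))))

;; Possible usage, with the same file listing as the main script:
;; (seq-filter (lambda (f) (my-file-contains-all-p f '("x-note" "•")))
;;             (find-lisp-find-files "~/web/" "\\.html$"))
```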
Why I Wrote This Code
Here's a little side story on why i wrote this one. It's a bit long.
I have about 30 works of classic literature with annotations. For example: The Arabian Nights. The annotation system is custom HTML tags with proper CSS (a so-called HTML Microformat). Extensive custom emacs commands are used to insert such tags.
Each annotation is in a <div class="x-note">…</div> tag. For example:
<div class="x-note">• provaunt ⇒ provide. Provant is a verb meaning: To supply with provender or provisions.</div>
However, some “x-note” blocks hold multiple annotations in one. For example, in the story The Fisherman And The Jinni, there's this one:
<div class="x-note">• stint ⇒ a fixed amount or share work. • might and main ⇒ with all effort and strength. • skein ⇒ A quantity of yarn, thread, or the like, put up together, after it is taken from the reel. • buffet ⇒ hit, beat, especially repeatedly. • fain ⇒ with joy; satisfied; contented. </div>
Each of the annotations is marked by a bullet “•” symbol, followed by a word. Each word corresponds to the same word in the main text marked by <span class="xnt">…</span>.
(I pack multiple annotations into one “x-note” block to save space in HTML rendering.)
To make that possible from my existing system, i need to make sure of a few things:
- ① My custom markup must have precise semantics.
- ② The syntax should be as simple as possible. (otherwise it increases the work in js code)
- ③ My HTML file's annotation markup must have correct syntax. (else js will fail silently)
It would be simpler if each “x-note” block contained just ONE annotation. This means i will first change all my files that contain multi-annotation blocks to make them one annotation per block. This is a text processing job. (Hello emacs lisp!) But it will probably take a full day's work over about a thousand files. The number of files doesn't matter. What would be time consuming is doing it correctly. First, i need to make sure the text has correct syntax. For example: make sure that each “x-note” does indeed contain at least one bullet symbol. Make sure that each <span class="xnt">…</span> has a corresponding <div class="x-note">…</div>. (i will be writing emacs lisp scripts to do these validation tasks.)
So, that's why i wrote this script. I wanted to get some idea of how many “x-note” blocks in which files actually contain multi-annotations.
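The planned conversion to one annotation per block could start from something like the sketch below. It only handles the text of a single block, the function name my-split-x-note is made up, and it assumes the bullet “•” never appears inside an annotation's own text.

```elisp
(defun my-split-x-note (content)
  "Split the inner text CONTENT of one x-note block at each bullet.
Return a list of strings, one <div class=\"x-note\"> per annotation.
Hypothetical helper, for illustration only."
  (mapcar (lambda (annotation)
            (format "<div class=\"x-note\">• %s</div>" annotation))
          ;; split at each bullet, trim surrounding whitespace,
          ;; and drop the empty piece before the first bullet
          (split-string content "•" t "[ \t\n]+")))

(my-split-x-note
 "• stint ⇒ a fixed amount or share work. • fain ⇒ with joy; satisfied; contented. ")
```

Locating each block and replacing it in the buffer would still need the same search and sgml-skip-tag-forward steps as the main script.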