Elisp: Find String Inside HTML Tag

By Xah Lee.

This page shows an Emacs Lisp script that searches files, similar to Unix grep, but with the condition that the string must occur inside a specific HTML tag. If you don't know elisp, first take a look at Emacs Lisp Basics.

Problem

We need to list all files that contain a given string, but only where it occurs inside a given HTML tag. That is, some specific string must come before and after the string we want.

This is something grep and Linux shell commands cannot do easily, and it is difficult even with Perl or Python, unless you use an HTML parser (which gets complex).

[see Learn Perl in 1 Hour]

[see Python: Learn Python in 1 Hour]

Solution

Here's the code:

;; -*- coding: utf-8 -*-
;; 2011-02-25
;; print files that meet this condition:
;; contains <div class="x-note">…</div>
;; where the text content contains more than one bullet char •

(setq inputDir "~/web/" ) ; dir should end with a slash

(require 'sgml-mode) ; need sgml-skip-tag-forward

(defun my-process-file (fPath)
  "Process the file at FPATH …"
  (let (myBuffer
        p3 p4  (bulletCnt 0) (totalCnt 0))

    (when
        (not (string-match "/xx" fPath)) ; skip any path containing "/xx"

      (setq myBuffer (get-buffer-create " myTemp"))
      (set-buffer myBuffer)
      (insert-file-contents fPath nil nil nil t)

      (setq bulletCnt 0 )
      (goto-char 1)
      (while
          (search-forward "<div class=\"x-note\">"  nil t)

        (setq p3 (point)) ; beginning of text content, after <div class="x-note">
        (backward-char)
        (sgml-skip-tag-forward 1)
        (backward-char 6)
        (setq p4 (point)) ; end of tag content, before the </div>

        (setq bulletCnt (count-matches "•" p3 p4))

        (when (> bulletCnt 1) ; more than one bullet in this block
          (setq totalCnt (1+ totalCnt))
          (princ (format "this many: %d %s\n" bulletCnt fPath))))

      (kill-buffer myBuffer))))

(require 'find-lisp)

(let (outputBuffer)
  (setq outputBuffer "*xah occur output*" )
  (with-output-to-temp-buffer outputBuffer
    (mapc 'my-process-file (find-lisp-find-files inputDir "\\.html$"))
    (princ "Done deal!")))

The code is almost exactly the same as our previous grep in Emacs Lisp code, with only about 5 lines changed.

[see How to Write grep in Emacs Lisp]

Here's the main part that's different:

(while
    (search-forward "<div class=\"x-note\">" nil t)

  (setq p3 (point)) ; beginning of text content, after <div class="x-note">
  (backward-char)   ; step back into the opening tag
  (sgml-skip-tag-forward 1) ; move past the matching </div>
  (backward-char 6)
  (setq p4 (point)) ; end of tag content, before the </div>

  (setq bulletCnt (count-matches "•" p3 p4))

  (when (> bulletCnt 1) ; more than one bullet in this block
    (setq totalCnt (1+ totalCnt))
    (princ (format "this many: %d %s\n" bulletCnt fPath))))

The interesting part here is the use of the sgml-skip-tag-forward function. With the cursor in an opening tag, it moves the cursor to just past the matching closing tag, skipping nested tags of the same name. Without this function, it may take days to code this in any language, unless you have experience in writing parsers. (Note that a plain regex cannot do this in general, because it cannot match arbitrarily nested tags.)
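To see what sgml-skip-tag-forward buys us, here is a minimal sketch that pulls out the content of the first “x-note” block from a string. The function name my-tag-content-demo and the sample HTML are made up for illustration; only sgml-skip-tag-forward and the standard buffer functions come from Emacs.

```elisp
(require 'sgml-mode) ; provides sgml-skip-tag-forward

(defun my-tag-content-demo (html)
  "Return the text inside the first <div class=\"x-note\"> in the string HTML.
Return nil if there is no such tag.  Hypothetical helper, for illustration only."
  (with-temp-buffer
    (insert html)
    (goto-char (point-min))
    (when (search-forward "<div class=\"x-note\">" nil t)
      (let ((start (point)))
        (backward-char)           ; step back into the opening tag
        (sgml-skip-tag-forward 1) ; move past the matching </div>
        (buffer-substring-no-properties
         start (- (point) (length "</div>")))))))

(my-tag-content-demo
 "<p>a</p><div class=\"x-note\">• one <div>inner</div> • two</div><div>tail</div>")
;; ⇒ "• one <div>inner</div> • two"
```

A plain (search-forward "</div>") from the same starting point would stop at the inner closing tag and miss the second bullet.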

Emacs ♥

Applications

Find Replace, or just Find, has extensive use in text processing. I need variations of it on a weekly basis, and have several elisp scripts to do the job.

Why I Wrote This Code

Here's a little side story on why I wrote this one. It's a bit long.

I have about 30 works of classic literature with annotations. For example: The Arabian Nights. The annotation system is custom HTML tags with proper CSS. (so-called HTML Microformat) Extensive custom emacs commands are used to insert such tags.

Each annotation is in a <div class="x-note">…</div> tag. For example:

<div class="x-note">• provaunt ⇒ provide. Provant is a verb meaning: To supply with provender or provisions.</div>

However, some “x-note” blocks contain multiple annotations in one. For example, in the story The Fisherman And The Jinni, there's this one:

<div class="x-note">• stint ⇒ a fixed amount or share work.
• might and main ⇒ with all effort and strength.
• skein ⇒ A quantity of yarn, thread, or the like, put up together, after it is taken from the reel.
• buffet ⇒ hit, beat, especially repeatedly.
• fain ⇒ with joy; satisfied; contented.
</div>

Each annotation is marked by a bullet “•” symbol, followed by a word. Each word corresponds to the same word in the main text marked by <span class="xnt">…</span>. (I pack multiple annotations into one “x-note” block to save space in HTML rendering.)
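The format described above can be checked mechanically. Here is a rough sketch, not one of my actual scripts: the function name my-check-annotations is made up, and it assumes that there is exactly one bullet per annotated word.

```elisp
(require 'sgml-mode) ; provides sgml-skip-tag-forward

(defun my-check-annotations (html)
  "Return a list (NOTES-WITHOUT-BULLET BULLETS SPANS) for the string HTML.
Hypothetical helper; it assumes one bullet per annotated word."
  (with-temp-buffer
    (insert html)
    (let ((notes-without-bullet 0) (bullets 0) (spans 0))
      ;; count the annotated words in the main text
      (goto-char (point-min))
      (while (search-forward "<span class=\"xnt\">" nil t)
        (setq spans (1+ spans)))
      ;; count the bullets inside each x-note block
      (goto-char (point-min))
      (while (search-forward "<div class=\"x-note\">" nil t)
        (let ((start (point)) n)
          (backward-char)           ; step back into the opening tag
          (sgml-skip-tag-forward 1) ; move past the matching </div>
          (setq n (count-matches "•" start (point)))
          (setq bullets (+ bullets n))
          (when (= n 0)
            (setq notes-without-bullet (1+ notes-without-bullet)))))
      (list notes-without-bullet bullets spans))))

(my-check-annotations
 "<span class=\"xnt\">fain</span> <div class=\"x-note\">• fain ⇒ with joy.</div>")
;; ⇒ (0 1 1)
```

A file looks consistent when the first number is 0 and the last two are equal.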

This annotation system is not perfect. It is static HTML/CSS based. Recently I've been thinking of making it more dynamic based on JavaScript. With JavaScript, it's possible to have features such as hide/show annotation. Or, mouse over to have an annotation show. Or, mouse over any word to show its definition from an online dictionary.

To make that possible from my existing system, I need to make sure of a few things:

With my current system, an annotation block is contained in an “x-note” tag, and within that block, each annotation is marked by a bullet. This semantic is precise, but isn't simple enough. If I want JavaScript to automatically highlight the annotation text when the user mouses over an annotated word, the JS will have to do some parsing of text in the “x-note” block.

It would be simpler if each “x-note” block contained just ONE annotation. This means I will first change all my files that contain multi-annotation blocks to make them 1 annotation per block. This is a text processing job. (Hello emacs lisp!) But it will probably take a full day's work over about a thousand files. The number of files doesn't matter. What'd be time consuming is to do it correctly. First, I need to make sure the text has correct syntax. For example: make sure that each “x-note” does indeed contain at least one bullet symbol. Make sure that each <span class="xnt">…</span> has a corresponding <div class="x-note">…</div>. (I will be writing emacs lisp scripts to do these validation tasks.)
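The splitting step itself could look something like the following sketch. The function name my-split-note is made up, and this is not the script I will end up using. It takes the text content of one “x-note” block and assumes a bullet never appears inside an annotation's own text. The 4-argument form of split-string needs Emacs 24.4 or later.

```elisp
(defun my-split-note (content)
  "Return HTML with one x-note block per annotation found in CONTENT.
CONTENT is the text inside a single x-note block.  Hypothetical helper."
  (mapconcat
   (lambda (annotation)
     (format "<div class=\"x-note\">• %s</div>" annotation))
   ;; split at each bullet, trim surrounding whitespace, drop empty pieces
   (split-string content "•" t "[ \t\n]+")
   "\n"))

(my-split-note "• stint ⇒ a fixed amount.\n• fain ⇒ with joy.\n")
;; ⇒ "<div class=\"x-note\">• stint ⇒ a fixed amount.</div>
;; <div class=\"x-note\">• fain ⇒ with joy.</div>"
```

Getting the content of each block out of the file, and writing the result back, would use the same search-forward and sgml-skip-tag-forward loop as the script above.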

So, that's why i wrote this script. I wanted to get some idea of how many “x-note” blocks in which files actually contain multi-annotations.

See also: Emacs: xah-replace-pairs.el Multi-Pair Find Replace
