Elisp: Find String Inside HTML Tag
This page shows an Emacs Lisp script that searches files, similar to Unix grep, but with the condition that the string must occur inside a specific HTML tag.
I need to list all files that contain a given string, but only if the string is inside a given HTML tag. That is, the condition is that some string must occur before and after the string we want.
This is something grep and Linux shell commands cannot do easily, and it is difficult even with Perl or Python, unless you use an HTML parser (which gets complex).
Here's the code:
;; -*- coding: utf-8 -*-
;; 2011-02-25, 2019-01-13

;; print files that meet this condition:
;; contains <div class="xnote">…</div>
;; where the text content contains more than one bullet char •

(setq inputDir "/Users/xah/web/wordyenglish_com/arabian_nights/" ) ; dir should end with a slash

;; need sgml-skip-tag-forward
(require 'sgml-mode)

(defun my-process-file (fPath)
  "Process the file at FPATH …"
  (let (myBuffer p3 p4 (bulletCount 0))
    ;; (print fPath)
    (when (and (not (string-match "/xx" fPath))) ; skip some dir
      (setq myBuffer (get-buffer-create " myTemp"))
      (set-buffer myBuffer)
      (insert-file-contents fPath nil nil nil t)
      (setq bulletCount 0)
      (goto-char (point-min))
      (while (search-forward "<div class=\"xnote\">" nil t)
        (setq p3 (point)) ; beginning of innerText, after <div class="xnote">
        (backward-char)
        (sgml-skip-tag-forward 1)
        (backward-char 6) ; 6 is the length of </div>
        (setq p4 (point)) ; end of innerText, before </div>
        (setq bulletCount (count-matches "•" p3 p4))
        (when (> bulletCount 1)
          (princ (format "Found: %d %s\n" bulletCount fPath))))
      (kill-buffer myBuffer))))

(let (outputBuffer)
  (setq outputBuffer "*my output*" )
  (with-output-to-temp-buffer outputBuffer
    ;; the dot must be double-escaped inside a lisp string to match a literal "."
    (mapc 'my-process-file (directory-files-recursively inputDir "\\.html\\'"))
    (princ "Done")))
Find Replace Applications
Find, or Find Replace, has extensive use in text processing. Here are some examples of variations, all of which I need on a weekly basis and have several elisp scripts for:
- List files that contain a string.
- Show adjacent text around a string.
- List a file only if it contains more than 1 occurrence of a string. (Or more than n, less than n, exactly n.)
- List a file if it contains a given set of strings.
- Replace text based on the file's name.
- List a file only if its HTML title and heading don't match.
- Find/Report/Replace only if the string is at a particular position in the file. (For example, near top, near bottom.)
- List a file only if the string is inside a tag.
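As an illustration of one variation from the list above (list a file only if it contains more than n occurrences of a string), here is a minimal sketch. The function names `my-count-string-in-file` and `my-files-with-more-than` are made up for this example and are not from the scripts mentioned; the sketch assumes a non-empty search string and an exact, case-sensitive match.

```elisp
;; Sketch only. Function names are hypothetical, not from the original scripts.

(defun my-count-string-in-file (fpath str)
  "Return the number of literal occurrences of STR in the file at FPATH.
STR must be non-empty. The match is case-sensitive."
  (with-temp-buffer
    (insert-file-contents fpath)
    (goto-char (point-min))
    (let ((case-fold-search nil) ; do not ignore case
          (count 0))
      (while (search-forward str nil t)
        (setq count (1+ count)))
      count)))

(defun my-files-with-more-than (dir str n)
  "Return a list of (COUNT . FILE) for HTML files under DIR.
Only files with more than N occurrences of STR are included."
  (let (result)
    (dolist (f (directory-files-recursively dir "\\.html\\'"))
      (let ((c (my-count-string-in-file f str)))
        (when (> c n)
          (push (cons c f) result))))
    (nreverse result)))
```

Changing `(> c n)` to `(< c n)` or `(= c n)` gives the "less than n" and "exactly n" variations.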
Why I Wrote This Code
Here's a little story on why I wrote this one.
I have about 30 works of classic literature with annotations. For example: The Arabian Nights.
Each annotation is in a tag like this:
<div class="xnote">• provaunt ⇒ provide. Provant is a verb meaning: To supply with provender or provisions.</div>
However, some “xnote” blocks contain multiple annotations in one. For example:
<div class="xnote">• stint ⇒ a fixed amount or share work. • might and main ⇒ with all effort and strength. • skein ⇒ A quantity of yarn, thread, or the like, put up together, after it is taken from the reel. • buffet ⇒ hit, beat, especially repeatedly. • fain ⇒ with joy; satisfied; contented. </div>
Each of the annotations is marked by a bullet “•” symbol, followed by a word. Each word corresponds to the same word in the main text marked by <span class="xnt">…</span>.
To make that possible, I need to make sure of a few things:
- ① My custom markup must have precise semantics.
It would be simpler if each “xnote” block contained just ONE annotation. This means I will first change all my files that contain multi-annotation blocks so that there is one annotation per xnote block. This is a text processing job. (Hello emacs lisp!)
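The core of that transformation can be sketched as follows. This is not the actual script, only a rough illustration that works on a string. The function names `my-split-xnote-text` and `my-xnote-blocks` are made up for the example. It assumes the bullet char never occurs inside an annotation's own text, and it needs the TRIM argument of `split-string` (Emacs 24.4 or later). Any text before the first bullet would also be turned into an annotation, so the input should be checked first.

```elisp
;; Sketch only. Function names are hypothetical.

(defun my-split-xnote-text (inner)
  "Split INNER, the text content of one xnote block, into a list of annotations.
Each returned string starts with a bullet, with surrounding whitespace trimmed."
  (mapcar (lambda (s) (concat "• " s))
          ;; split on the bullet, trim whitespace, drop empty pieces
          (split-string inner "•" t "[ \t\n\r]+")))

(defun my-xnote-blocks (inner)
  "Return HTML text with one xnote div for each annotation found in INNER."
  (mapconcat (lambda (a) (format "<div class=\"xnote\">%s</div>" a))
             (my-split-xnote-text inner)
             "\n"))
```

For example, calling `my-xnote-blocks` on `"• stint ⇒ a fixed amount. • fain ⇒ with joy. "` gives two div blocks, one per line, each with a single annotation.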
Before doing text transformation on the xnote blocks, I first need to make sure the text has correct syntax. For example: make sure that each “xnote” does indeed contain at least one bullet symbol, and make sure that each
<span class="xnt">…</span> has a corresponding
So, that's why I wrote this script. I wanted to get some idea of how many “xnote” blocks, in which files, actually contain multiple annotations.
The elisp code to split an xnote block into multiple blocks is very similar to the elisp code here.
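Both kinds of check (a block with no bullet, a block with more than one) come down to counting bullets per block. Here is a simplified sketch that works on a string instead of a file. The function name `my-xnote-bullet-counts` is made up. Unlike the script above, it does not use `sgml-skip-tag-forward`. Instead it assumes that an xnote block never contains a nested div, so the first `</div>` after the opening tag closes the block.

```elisp
;; Sketch only. Function name is hypothetical.
;; Assumption: no nested <div> inside an xnote block.

(defun my-xnote-bullet-counts (html)
  "Return a list of bullet counts, one for each xnote block in the string HTML."
  (with-temp-buffer
    (insert html)
    (goto-char (point-min))
    (let ((case-fold-search nil)
          counts)
      (while (search-forward "<div class=\"xnote\">" nil t)
        (let ((start (point))) ; beginning of text content
          (when (search-forward "</div>" nil t)
            ;; count bullets between the open tag and the close tag
            (push (how-many "•" start (match-beginning 0)) counts))))
      (nreverse counts))))
```

A count of 0 means a block with no bullet (a syntax problem), and a count greater than 1 means a multi-annotation block. Bullets outside xnote blocks are not counted.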