Text Processing with Emacs Lisp: Transforming Page Tag


This page shows an example of writing an Emacs Lisp program to update the navigation bar of several HTML pages. If you don't know elisp, first take a look at Emacs Lisp Basics.

Problem Description


I want to write an elisp program that processes a list of given files. Each file is an HTML file. For each file, I want to remove the link to itself in its page navigation bar. More specifically, each file has a page navigation bar in this format:

<div class="pages">Goto Page: <a href="1.html">1</a>, <a href="2.html">2</a>, <a href="3.html">3</a>, …</div>

where the file names and link texts are arbitrary (not necessarily 1, 2, 3 as shown here). The link to itself needs to be removed. For example, for the file named “2.html”, the string should look like this:

<div class="pages">Goto Page: <a href="1.html">1</a>, 2, <a href="3.html">3</a>, …</div>


I have over 5 thousand HTML files, and many of the pages are parts of a series. For example, the page Algorithmic Mathematical Art is broken into 3 HTML pages. So, at the bottom of each page, I have a page navigation bar with code like this:

<div class="pages">Goto Page: <a href="mathPag1">1</a>, <a href="mathPag2">2</a>, <a href="mathPag3">3</a></div>

In a browser, with a proper Cascading Style Sheet, it looks like this:

[image: page tag 1]

Note that the link to the page itself really shouldn't be a link. So, what I really want is this:

<div class="pages">Goto Page: 1, <a href="mathPag2">2</a>, <a href="mathPag3">3</a></div>

That should show in a browser like this:

[image: page tag 2]

There are a total of 134 pages scattered about in various directories that have this page navigation bar. I need some easy way to process these files and remove the self-link.

We proceed to write elisp code to solve this problem.


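Here are the steps we need to do for each file: open the file, move the cursor to the page navigation string, move the cursor to the file name inside it, delete the link tag around the cursor, then save the file and close the buffer.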

We begin by writing test code to process a single file.

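(defun xx ()
  "temp. experimental code"
  (let (fPath fName myBuffer)
    (setq fPath "~/test1.html")
    (setq fName (file-name-nondirectory fPath))
    (setq myBuffer (find-file fPath))
    ;; move to the page navigation bar, then to the self-link inside it
    (search-forward "<div class=\"pages\">Goto Page:")
    (search-forward fName)
    (sgml-delete-tag 1) ; delete the <a> tag pair around the cursor
    (save-buffer) ; write the change to disk before closing
    (kill-buffer myBuffer)))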

First of all, create files {test1.html, test2.html, test3.html} in a temp directory for testing this code. Place the following content into each file:

<div class="pages">Goto Page: <a href="test1.html">XYZ Overview</a>, <a href="test2.html">Second Page</a>, <a href="test3.html">Summary Z</a></div>
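If you prefer to create the test files from elisp, here is a minimal sketch. The names my-nav-bar and my-test-dir are made up for this example, and the directory is whatever make-temp-file picks on your system.

```elisp
;; Sketch: create the three test files in a fresh temporary directory.
;; `my-nav-bar' and `my-test-dir' are hypothetical names for this example.
(defvar my-nav-bar
  (concat "<div class=\"pages\">Goto Page: "
          "<a href=\"test1.html\">XYZ Overview</a>, "
          "<a href=\"test2.html\">Second Page</a>, "
          "<a href=\"test3.html\">Summary Z</a></div>\n")
  "The page navigation bar placed into each test file.")

(defvar my-test-dir (make-temp-file "page-tag-test" t)
  "Temporary directory holding the test files.")

(dolist (name '("test1.html" "test2.html" "test3.html"))
  ;; When START is a string, `write-region' writes that string to the file.
  (write-region my-nav-bar nil (expand-file-name name my-test-dir)))
```

The path of a test file is then (expand-file-name "test1.html" my-test-dir), which you can use in place of a hard-coded path.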

The elisp code above is very basic. The file opening function find-file is found in the elisp doc section (info "(elisp) Files"). The cursor moving function search-forward is in (info "(elisp) Searching and Matching"). The buffer closing function kill-buffer is in (info "(elisp) Buffers"), and the saving function save-buffer is in (info "(elisp) Saving Buffers").

The interesting part is the call to the function sgml-delete-tag. It is a function loaded by html-mode (which is automatically loaded when an HTML file is opened). What sgml-delete-tag does is delete the tag that encloses the cursor (both the opening and closing tags will be deleted). The cursor can be anywhere inside the beginning tag or ending tag. This sgml-delete-tag function helps us tremendously.
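To see what sgml-delete-tag does without touching any file, here is a small sketch that runs it on a string in a temporary buffer. The helper name my-demo-delete-self-link is made up for this example. It assumes html-mode has to be turned on by hand, since a temp buffer does not visit an HTML file.

```elisp
(require 'sgml-mode) ; provides `html-mode' and `sgml-delete-tag'

(defun my-demo-delete-self-link (nav-html file-name)
  "Return NAV-HTML with the <a> tag that links to FILE-NAME removed.
Hypothetical helper for illustration. The link text itself is kept."
  (with-temp-buffer
    (insert nav-html)
    (html-mode)                 ; sets up the syntax table the command relies on
    (goto-char (point-min))
    (search-forward file-name)  ; cursor is now inside the opening <a ...> tag
    (sgml-delete-tag 1)         ; removes <a ...> and its matching </a>
    (buffer-string)))

(my-demo-delete-self-link
 (concat "<div class=\"pages\">Goto Page: <a href=\"1.html\">1</a>, "
         "<a href=\"2.html\">2</a>, <a href=\"3.html\">3</a></div>")
 "2.html")
```

The call at the end should return the navigation bar with the “2” left as plain text and the other two links untouched.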

Now, with the above code, our job is essentially done. All we need to do now is feed it a bunch of file paths. First we clean up the code by making it take a file path as an argument.

(defun my-modfile-page-tag (fPath)
  "Modify the HTML file at FPATH: remove the self-link in its page nav bar."
  (let (fName myBuffer)
    (setq fName (file-name-nondirectory fPath))
    (setq myBuffer (find-file fPath))
    (goto-char (point-min)) ;; in case buffer already open
    (search-forward "<div class=\"pages\">Goto Page:")
    (search-forward fName)
    (sgml-delete-tag 1)
    (save-buffer) ;; write the change to disk before closing
    (kill-buffer myBuffer)))

Then, we test this modified code by evaluating the following code:

(my-modfile-page-tag "~/test1.html")
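Note that search-forward signals an error when the string is not found, which would stop a batch run at the first file that lacks the navigation bar. Here is a sketch of a more forgiving variant. The name my-modfile-page-tag-safe is made up, and skipping non-matching files silently is an assumption about what you want, not part of the code above.

```elisp
(require 'sgml-mode) ; provides `sgml-delete-tag'

(defun my-modfile-page-tag-safe (fPath)
  "Like `my-modfile-page-tag', but skip FPATH when nothing matches.
Hypothetical variant. Return t if the file was changed, nil otherwise."
  (let ((fName (file-name-nondirectory fPath))
        (myBuffer (find-file-noselect fPath))
        changed)
    (with-current-buffer myBuffer
      (goto-char (point-min))
      ;; A non-nil third argument makes `search-forward' return nil
      ;; instead of signaling an error when the string is not found.
      (when (and (search-forward "<div class=\"pages\">Goto Page:" nil t)
                 (search-forward fName nil t))
        (sgml-delete-tag 1)
        (save-buffer)
        (setq changed t)))
    (kill-buffer myBuffer)
    changed))
```

It uses find-file-noselect so that the files are not displayed in a window while they are processed. The return value tells you which files were actually changed.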

To complete our task, all we have to do now is get the list of files that contain the page-nav tag and feed them to “my-modfile-page-tag”.

To get the list of files that contain the page-nav tag, we can simply use unix's “find” and “grep”, like this:

find . -name "*\.html" -exec grep -l '<div class="pages">' {} \;
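The same list can also be built inside emacs, which skips the step of turning shell output into lisp strings. Here is a rough sketch, assuming Emacs 25.1 or later for directory-files-recursively. The helper name my-files-with-page-nav is made up for this example.

```elisp
(defun my-files-with-page-nav (dir)
  "Return a list of .html files under DIR that contain the page-nav tag.
Hypothetical helper, a rough elisp counterpart of the find and grep line."
  (let (result)
    (dolist (file (directory-files-recursively dir "\\.html\\'"))
      (with-temp-buffer
        (insert-file-contents file)
        (goto-char (point-min))
        ;; A non-nil third argument: return nil instead of signaling an error.
        (when (search-forward "<div class=\"pages\">" nil t)
          (push file result))))
    (nreverse result)))
```

The returned list can be given directly to mapc together with “my-modfile-page-tag”, with the directory you pass in being your own web root.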

For each line in the output, we just wrap double quotes around it to make it a lisp string. Possibly also insert the full path by using string-rectangle, to construct the following code:

(mapc 'my-modfile-page-tag
      (list
;… 100+ lines
       ))

mapc is a lisp idiom for looping through a list. The first argument is a function, which is applied to every element of the list. The single quote in front of the function name is necessary: it prevents the symbol “my-modfile-page-tag” from being evaluated as a variable.
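Here is a tiny self-contained example of mapc and the quoting. The names my-seen and my-record are made up for this example.

```elisp
(defvar my-seen nil
  "Collects the arguments passed to `my-record', newest first.")

(defun my-record (x)
  "Push X onto `my-seen'."
  (push x my-seen))

;; `mapc' calls the function on each element, for side effects only.
(mapc 'my-record '("a.html" "b.html" "c.html"))

;; #'my-record names the same function. This form also lets the
;; byte compiler check that the function exists.
(mapc #'my-record '("d.html"))

;; Without any quote, (mapc my-record ...) would try to read the value of
;; a variable named my-record and signal a void-variable error.
```

After evaluating this, my-seen holds the four strings in reverse order of the calls.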

Emacs is beautiful!
