Emacs Lisp: Text Processing: Generate a Web Links Report


This page shows you how to use elisp to process HTML files in a directory: search for a text pattern and generate a report.

Problem Description

You have 7 thousand HTML files. You want to generate a report that lists all links to Wikipedia, and which file each link is in.

Normally, in unix you can use grep like this: grep -r 'wikipedia\.org/' dir. This gives you a nice overview, but grep's output also contains the surrounding text. If you want to know which links are duplicated, or if you want to organize the output by file name, then unix shell tools are not convenient. You can write a {Perl, Python, Ruby} script to do it, but let's see how to do it in elisp.

In this tutorial, you'll learn:

how to use a hash table in emacs lisp.
how to process a file quickly, without user interaction.
how to traverse a directory tree.
how to put these together to generate a report.

Solution

Here are the basic steps we need:

Get a list of all the HTML files in the directory and its subdirectories.
For each file, search for the Wikipedia links. Record each link in a hash table, together with the file it was found in.
Go thru the hash table and print the report.

Once we have the data in a hash-table, it is very flexible. We can then generate a report in plain text or HTML, or do other processing.

Hash Table

First, we'll need to know how to use a hash table in emacs lisp. Here's a simple example:

(let (myhash val)

  ;; create a hash table
  (setq myhash (make-hash-table :test 'equal))

  ;; add entries
  (puthash "Joe" "19" myhash)
  (puthash "Jane" "20" myhash)
  (puthash "Liz" "21" myhash)

  ;; modify an entry's value
  (puthash "Jane" "16" myhash)

  ;; remove an entry
  (remhash "Liz" myhash)

  ;; get an entry's value
  (setq val (gethash "Jane" myhash))

  (message "%s" val) ; print it
)
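
To go thru every entry, use maphash. It calls a function on each key/value pair. For example, still inside the let above, this prints every entry (a minimal sketch):

;; call a function on each key and value in the hash table
(maphash
 (lambda (kk vv)
   (message "%s is %s years old" kk vv))
 myhash)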

For a detailed lesson on hash table, see: Emacs Lisp Tutorial: Hash Table.

Process a Single File

In our situation, we want the hash table's keys to be the full URL string, and the value should be a list. Each element in the list is the full path of the file that contains the link.
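
For example, after processing, one entry might look like this (the data here is made up for illustration):

;; the key is the full URL, the value is a list of file paths
(gethash "http://en.wikipedia.org/wiki/Emacs_Lisp" myhash)
;; ⇒ ("~/web/emacs/elisp_intro.html" "~/web/emacs/elisp_hash.html")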

Here is the code that processes a single file. It opens the file and repeatedly searches for URLs. For each URL found, it checks whether the URL already exists in the hash table: if not, it adds a new entry, else it prepends the file to the existing entry's list.

(defun ff ()
  "test code to process a single file"
  (interactive)
  (let (myhash myBuff myfile url)
    (setq myfile "~/web/p/time_machine/tm.html") ; the file to process 

    (setq myhash (make-hash-table :test 'equal)) ; create hash table

    (setq myBuff (find-file myfile)) ; open file
    (goto-char (point-min)) ; make sure the search starts from the beginning

    ;; repeat search for URL till not found
    (while
        (re-search-forward
         "href=\"\\(http://..\\.wikipedia\\.org/[^\"]+\\)\">\\([^<]+\\)</a>"
         nil t)

      (when (match-string 0) ; if URL found
        (setq url (match-string 1)) ; set the url to matched string

        ;; if exist in hash, prepend to existing entry, else just add
        (if (gethash url myhash)
            (puthash url (cons myfile (gethash url myhash)) myhash)
          (puthash url (list myfile) myhash))))

    (kill-buffer myBuff) ; close file

    ; print number of elements to see if this worked
    (message (number-to-string (hash-table-count myhash)))
    ))

The sample file processed in the above code is this: Time Machine.

The above code prints 2, since there are 2 Wikipedia links in that file. This is all good. This means the code is working.

Note that in the above, we opened a file in a buffer, did some processing, then closed it. The form is this:

(setq myBuff (find-file myfile)) ; open file
…
(kill-buffer myBuff) ; close file

When emacs opens a file, it does several tasks transparently for the user. For example, it loads the proper major mode, colorizes the text, automatically picks the proper decoding to display the text, sets up undo, …. This is all good, but if you are going to process thousands of files, you usually don't need emacs to load the language mode, do syntax coloring, or record undo info.

Open a File for Read, Fast

Here's an idiom for processing a file read-only, without user interaction.

;; an elisp idiom for read-only processing of a file without user interaction
(defun my-process-file (fPath)
  "Process the file at path FPATH …"
  (with-temp-buffer
    (insert-file-contents fPath)
    ;; process it …
    ))
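
Putting this idiom together with the link search from before, the per-file step might look like the following sketch. This is not the article's actual code; the variable name my-link-hash is my own, and the real script linked at the end may be organized differently.

;; hash table to collect results into. the name is my own, for this sketch
(defvar my-link-hash (make-hash-table :test 'equal)
  "Maps each Wikipedia URL to a list of files that contain it.")

(defun my-process-file (fPath)
  "Add the Wikipedia links found in the file at FPATH to `my-link-hash'."
  (with-temp-buffer
    (insert-file-contents fPath)
    (goto-char (point-min))
    (while (re-search-forward
            "href=\"\\(http://..\\.wikipedia\\.org/[^\"]+\\)\">\\([^<]+\\)</a>"
            nil t)
      (let ((url (match-string 1)))
        ;; cons the file onto the URL's existing list (nil if the URL is new)
        (puthash url (cons fPath (gethash url my-link-hash)) my-link-hash)))))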

Traverse Directory

To get a list of a dir's content, use directory-files.
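
For example, to get the full paths of HTML files in one dir (no subdirs):

(directory-files "~/web/emacs/" t "\\.html$")
;; 2nd arg t means return full paths. 3rd arg is a regexp to filter names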

To list a dir and all subdirs, use find-lisp-find-files. Example:

(require 'find-lisp)
(find-lisp-find-files "~/web/emacs/" "\\.html$")
;; returns a list of all HTML files in dir (including files in any nested subdirs)

Now, you can map your file-processing function over the list of files, like this:

(mapc 'my-process-file (find-lisp-find-files inputDir "\\.html$"))
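
Here, inputDir is assumed to be bound to the directory you want to scan. For example (the path is hypothetical):

(let ((inputDir "~/web/"))
  (mapc 'my-process-file (find-lisp-find-files inputDir "\\.html$")))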

Pretty Print Helpers

Here are some functions to help us pretty-print our report in HTML format.

(require 'gnus-util) ; for gnus-url-unhex-string

(defun wikipedia-url-to-link (url)
  "Return the URL as HTML link string.
Example:
 http://en.wikipedia.org/wiki/Emacs%20Lisp
becomes
 <a href=\"http://en.wikipedia.org/wiki/Emacs%20Lisp\">Emacs Lisp</a>
"
  (let ((linkText url))
    (setq linkText (gnus-url-unhex-string linkText nil)) ; decode percent encoding. ⁖ %20
    (setq linkText (car (last (split-string linkText "/")))  ) ; get last part
    (setq linkText (replace-regexp-in-string "_" " " linkText )  ) ; low line → space
    (format "<a href=\"%s\">%s</a>" url linkText)
    ))

(defun get-html-file-title (fName)
  "Return FNAME <title> tag's text.
Assumes that the file contains the string
“<title>…</title>”."
  (with-temp-buffer
    (insert-file-contents fName nil nil nil t)
    (goto-char 1)
    (buffer-substring-no-properties
     (search-forward "<title>") (- (search-forward "</title>") 8))))
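
With these helpers, the report itself can be generated by walking the hash table with maphash. Here's a minimal sketch, assuming the my-link-hash table from the earlier sketch; the actual script's output format may differ.

;; sketch: write one HTML list item per link, naming the pages that contain it
(with-output-to-temp-buffer "*wikipedia links report*"
  (maphash
   (lambda (url files)
     (princ (format "<li>%s: " (wikipedia-url-to-link url)))
     (princ (mapconcat 'get-html-file-title files ", "))
     (princ "</li>\n"))
   my-link-hash))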

That's it. You can get the full code at https://github.com/xahlee/xahscripts/tree/master/site_link_report

Here's a sample of generated report: Links to Wikipedia from XahLee.org.
