Elisp: Generate Web Links Report

By Xah Lee.

Here's how to use Emacs Lisp to process the files in a directory, search each one for a text pattern, and generate a report.

Problem

You have 7 thousand HTML files. You want to generate a report that lists every link to Wikipedia, together with the paths of the files that contain it.

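In this tutorial, you'll learn how to use a hash table, search a file for a text pattern, walk a directory, and write the result to an HTML file.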

Solution

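Here are the basic steps we need:

  1. Walk the directory to get the path of every HTML file.
  2. Open each file and search for links to Wikipedia.
  3. Store each link in a hash table, with the URL as key and a list of file paths as value.
  4. Generate the report from the hash table.

Once the data is in a hash table, we have a lot of flexibility. We can generate a report in plain text or HTML, or do other processing.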

Hash Table

First, we need to know how to use a hash table in Emacs. Here's a simple example:

;; create a hash table
(setq myhash (make-hash-table :test 'equal))

;; add entries
(puthash "Joe" "19" myhash)
(puthash "Jane" "20" myhash)
(puthash "Liz" "21" myhash)

;; add/modify
(puthash "Jane" "16" myhash)

;; get or check existence
(setq val (gethash "Jane" myhash))

(message "%s" val) ; print it. The first argument of message is a format string

[see Elisp: Hash Table]
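
Two more operations are handy for a report like this one: looking up a key that may be missing, and visiting every entry. Here's a minimal sketch. The variable name my-ages is made up for this example.

```elisp
;; Sketch: a fresh table so the example stands alone.
;; `my-ages' is a hypothetical name used only for illustration.
(setq my-ages (make-hash-table :test 'equal))
(puthash "Joe" "19" my-ages)
(puthash "Jane" "16" my-ages)

;; gethash returns nil for a missing key,
;; or the optional third argument if one is given.
(gethash "Bob" my-ages)            ; nil
(gethash "Bob" my-ages "unknown")  ; "unknown"

;; number of entries
(hash-table-count my-ages) ; 2

;; visit every key/value pair. The order is not guaranteed.
(maphash
 (lambda (k v) (message "%s is %s" k v))
 my-ages)
```

Because a missing key gives nil, a plain (if (gethash key table) …) is enough to check existence, as long as nil is never stored as a value.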

Process a Single File

We want the hash table key to be the full URL string, and the value to be a list. Each element in the list is the full path of a file that contains the link.

Here is the code that processes a single file. It opens the file and searches for Wikipedia URLs. For each URL found, it checks whether the URL is already in the hash table. If not, it adds a new entry. Otherwise, it prepends the file path to the existing entry.

(setq myfile "/Users/xah/web/ergoemacs_org/emacs/GNU_Emacs_dev_inefficiency.html")

(setq myhash (make-hash-table :test 'equal))

(defun ff ()
  "test code to process a single file"
  (interactive)
  (let ( myBuff url)

    (setq myBuff (find-file myfile)) ; open file

    (goto-char (point-min))

    ;; search for URL till not found
    (while
        (re-search-forward
         "href=\"\\(https://..\\.wikipedia\\.org/[^\"]+\\)\">\\([^<]+\\)</a>"
         nil t)

      (when (match-string 0)        ; if URL found
        (setq url (match-string 1)) ; set the url to matched string

        (print url)

        ;; if exist in hash, prepend to existing entry, else just add
        (if (gethash url myhash)
            (puthash url (cons myfile (gethash url myhash)) myhash)
          (puthash url (list myfile) myhash))))

    (kill-buffer myBuff) ; close file
    ))

(ff)

(print myhash)
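
To see what the regex captures without opening a real file, here's a small sketch that runs the same search on a sample string in a temp buffer. The HTML snippet and the my- variable names are made up for illustration.

```elisp
;; Sketch: run the same regex on a sample string instead of a real file.
;; The HTML snippet and the names `my-sample-html' and `my-found-urls'
;; are hypothetical.
(setq my-sample-html
      "<p><a href=\"https://en.wikipedia.org/wiki/Emacs_Lisp\">Emacs Lisp</a> and
<a href=\"https://example.com/\">not Wikipedia</a></p>")

(setq my-found-urls nil)

(with-temp-buffer
  (insert my-sample-html)
  (goto-char (point-min))
  (while (re-search-forward
          "href=\"\\(https://..\\.wikipedia\\.org/[^\"]+\\)\">\\([^<]+\\)</a>"
          nil t)
    ;; group 1 is the URL, group 2 is the link text
    (push (list (match-string 1) (match-string 2)) my-found-urls)))

my-found-urls
;; (("https://en.wikipedia.org/wiki/Emacs_Lisp" "Emacs Lisp"))
```

Note that the pattern only matches https links with a two-letter language subdomain, so the example.com link is skipped.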

Walk a Directory

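To process every HTML file under a directory, set myfile to each path in turn and call ff:

(mapc
 (lambda (f)
   (setq myfile f) ; ff reads the file path from the variable myfile
   (ff))
 (directory-files-recursively "~/web/ergoemacs_org/" "\\.html\\'"))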

[see Elisp: Walk Directory]
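
Here's a small sketch for trying directory-files-recursively on a throwaway directory. The directory and file names are invented for the example. One detail worth noting: inside a Lisp string, a literal dot in a regex must be written with two backslashes.

```elisp
;; Sketch: build a small throwaway directory tree and list its HTML files.
;; `my-tmp-dir', `my-html-files' and the file names are hypothetical.
(setq my-tmp-dir (make-temp-file "walk-demo" t)) ; t means create a directory

(make-directory (expand-file-name "sub" my-tmp-dir))
(write-region "" nil (expand-file-name "a.html" my-tmp-dir))
(write-region "" nil (expand-file-name "sub/b.html" my-tmp-dir))
(write-region "" nil (expand-file-name "notes.txt" my-tmp-dir))

;; "\\.html\\'" matches file names that end in ".html".
;; The regex is matched against the file name, not the full path.
(setq my-html-files
      (directory-files-recursively my-tmp-dir "\\.html\\'"))

;; print each full path
(mapc (lambda (f) (message "%s" f)) my-html-files)

(delete-directory my-tmp-dir t) ; clean up, recursively
```

The result is a list of full paths, including files in subdirectories, so it can be passed straight to mapc.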

Pretty Print Helpers

The following function takes a Wikipedia URL and returns an HTML link string.

For example:

http://en.wikipedia.org/wiki/Emacs

becomes

<a href="http://en.wikipedia.org/wiki/Emacs">Emacs</a>

(require 'gnus-util) ; for gnus-url-unhex-string

(defun wikipedia-url-to-link (url)
  "Return the URL as HTML link string.
Example:
 http://en.wikipedia.org/wiki/Emacs%20Lisp
becomes
 <a href=\"http://en.wikipedia.org/wiki/Emacs%20Lisp\">Emacs Lisp</a>
"
  (let ((linkText url))
    (setq linkText (gnus-url-unhex-string linkText nil)) ; decode percent encoding, for example %20
    (setq linkText (car (last (split-string linkText "/")))) ; get last part
    (setq linkText (replace-regexp-in-string "_" " " linkText)) ; underscore → space
    (format "<a href=\"%s\">%s</a>" url linkText)))
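
Titles with non-ASCII characters show up in the URL as percent-encoded UTF-8 bytes. Here's a sketch of decoding them with url-unhex-string from url-util, followed by decode-coding-string. The helper name my-decode-title and the sample URL are made up for illustration.

```elisp
(require 'url-util) ; for url-unhex-string

;; `my-decode-title' is a hypothetical helper, for illustration only.
(defun my-decode-title (url)
  "Return the last path segment of URL, percent-decoded, with underscores as spaces."
  (let ((last-part (car (last (split-string url "/")))))
    (replace-regexp-in-string
     "_" " "
     ;; url-unhex-string gives raw bytes, so decode them as UTF-8
     (decode-coding-string (url-unhex-string last-part) 'utf-8))))

(my-decode-title "https://fr.wikipedia.org/wiki/Caf%C3%A9_au_lait")
;; "Café au lait"
```

Taking the last path segment before decoding means an encoded slash (%2F) in a title does not split the title.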

Given a file path, return a link string.

For example:

/Users/xah/web/ergoemacs_org/index.html

becomes

<a href="../index.html">ErgoEmacs</a>

The link text comes from the file's “title” tag. The following function extracts it:

(defun get-html-file-title (fName)
  "Return FNAME <title> tag's text.
Assumes that the file contains the string
“<title>…</title>”."
  (with-temp-buffer
    (insert-file-contents fName nil nil nil t)
    (goto-char (point-min))
    (buffer-substring-no-properties
     ;; 8 is the length of the string "</title>"
     (search-forward "<title>") (- (search-forward "</title>") 8))))
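
For the href part of the link, one possible approach is the built-in file-relative-name, which computes a path relative to a directory. Here's a minimal sketch, assuming the report and the linked file live under the same local directory tree. The helper name my-file-to-link and the report path are assumptions, not part of the original script.

```elisp
;; Sketch, assuming the report and the linked file are in the same local tree.
;; `my-file-to-link' is a hypothetical helper, not from the original script.
(defun my-file-to-link (file-path report-path title)
  "Return an HTML link to FILE-PATH, relative to REPORT-PATH, with TITLE as text."
  (format "<a href=\"%s\">%s</a>"
          (file-relative-name file-path (file-name-directory report-path))
          title))

(my-file-to-link
 "/Users/xah/web/ergoemacs_org/index.html"
 "/Users/xah/web/ergoemacs_org/emacs/report.html"
 "ErgoEmacs")
;; "<a href=\"../index.html\">ErgoEmacs</a>"
```

The title is passed in as an argument here. In practice it could come from a function that reads the file's “title” tag.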

Putting It All Together

;; -*- coding: utf-8; lexical-binding: t; -*-
;; emacs lisp.

;; started: 2008-01-03.
;; version: 2019-06-11
;; author: Xah Lee
;; url: http://ergoemacs.org/emacs/elisp_link_report.html
;; purpose: generate a report of wikipedia links.

;; Traverse a given dir, visit every HTML file, and collect the links to Wikipedia found in those files.
;; Then generate an HTML report of these links and the files they are from, and write it to a given file (overwriting it if it exists).

(setq InputDir "/Users/xah/web/ergoemacs_org/" )

;; Overwrites existing
(setq OutputPath "/Users/xah/web/xahlee_org/wikipedia_links.html")

;; add a ending slash to InputDir if not there
(when (not (string= "/" (substring InputDir -1) )) (setq InputDir (concat InputDir "/") ) )

(when (not (file-exists-p InputDir)) (error "input dir does not exist: %s" InputDir))

(setq XahHeaderText
"<!doctype html><html><head><meta charset=\"utf-8\" />
<title>Links to Wikipedia from Xah Sites</title>
</head>
<body>

<nav class=\"n1\"><a href=\"index.html\">XahLee.org</a></nav>
")

(setq XahFooterText
  "
</body></html>
"
)

;; hash table. key is string Wikipedia url, value is a list of file paths.
(setq LinksHash (make-hash-table :test 'equal :size 8000))

(defun xah-add-link-to-hash (filePath hashTableVar)
  "Get the Wikipedia links in the file at FILEPATH and add them to the hash table HASHTABLEVAR."
  (let ( $wUrl)
    (with-temp-buffer
      (insert-file-contents filePath nil nil nil t)
      (goto-char 1)
      (while
          (re-search-forward
           "href=\"\\(https://..\\.wikipedia\\.org/[^\"]+\\)\">\\([^<]+\\)</a>"
           nil t)
        (setq $wUrl (match-string 1))
        (when (and
               $wUrl ; if url found
               (not (string-match "=" $wUrl )) ; skip links that are not Wikipedia articles, e.g. user profile pages, edit history pages, search result pages
               (not (string-match "%..%.." $wUrl )) ; skip links that contain lots of percent-encoded unicode
               )

          ;; if exist in hash, prepend to existing entry, else just add
          (if (gethash $wUrl hashTableVar)
              (puthash $wUrl (cons filePath (gethash $wUrl hashTableVar)) hashTableVar) ; not using add-to-list because each Wikipedia URL likely appears only once per file
            (puthash $wUrl (list filePath) hashTableVar)) )) ) ) )

(defun xah-print-each (ele)
  "Print each item. ELE is of the form (url (list filepath1 filepath2 …)).
Print it like this:
‹link to url› : ‹link to file1›, ‹link to file2›, …"
  (let ($wplink $files)
    (setq $wplink (car ele))
    (setq $files (cadr ele))

    (insert "<li>")
    (insert (wikipedia-url-to-linktext $wplink))
    (insert "—")

    (dolist (xx $files nil)
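      ;; Note: xahsite-filepath-to-href-value and xah-html-get-html-file-title
      ;; are not defined in this script and must be loaded separately.
      ;; Possible stand-ins, as an assumption:
      ;; (file-relative-name xx (file-name-directory OutputPath))
      ;; and get-html-file-title from the Pretty Print Helpers section.
      (insert
       (format "<a href=\"%s\">%s</a>•"
               (xahsite-filepath-to-href-value xx OutputPath )
               (xah-html-get-html-file-title xx))))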
    (delete-char -1)
    (insert "</li>\n"))
  )

(defun wikipedia-url-to-linktext (@url)
  "Return the title of a Wikipedia link.
Example:
http://en.wikipedia.org/wiki/Emacs
becomes
Emacs"
  (require 'url-util)
  (decode-coding-string
   (url-unhex-string
    (replace-regexp-in-string
     "_" " "
     (replace-regexp-in-string
      "&amp;" "&"
      (car
       (last
        (split-string
         @url "/")))))) 'utf-8))

(defun wikipedia-url-to-link (@url)
  "Return the @url as html link string.\n
Example:
http://en.wikipedia.org/wiki/Emacs
becomes
<a href=\"http://en.wikipedia.org/wiki/Emacs\">Emacs</a>"
  (format "<a href=\"%s\">%s</a>" @url (wikipedia-url-to-linktext @url)))

(defun xah-hash-to-list (@hash-table)
  "Return a list that represents @HASH-TABLE.
Each element is a list of the form (key value).

http://ergoemacs.org/emacs/elisp_hash_table.html
Version 2019-06-11"
  (let ($result)
    (maphash
     (lambda (k v)
       (push (list k v) $result))
     @hash-table)
    $result))

;;;; main

;; fill LinksHash
(mapc
   (lambda ($x) (xah-add-link-to-hash $x LinksHash ))
   (directory-files-recursively InputDir "\\.html\\'"))

;; fill LinksList
(setq LinksList
      (sort (xah-hash-to-list LinksHash)
            ;; (hash-table-keys LinksHash)
            (lambda (a b) (string< (downcase (car a)) (downcase (car b))))))

;; write to file
(with-temp-file OutputPath
  (insert XahHeaderText)
  (goto-char (point-max))

  (insert
   "<h1>Links To Wikipedia from XahLee.org</h1>\n\n"
   "<p>This page contains all existing links from xah sites to Wikipedia, as of ")

  (insert (format-time-string "%Y-%m-%d"))

  (insert
   ". There are a total of " (number-to-string (length LinksList)) " links.</p>\n\n"
   "<p>This file is automatically generated by an <a href=\"http://ergoemacs.org/emacs/elisp_link_report.html\">Emacs Lisp script</a>.</p>

"
   )

  (insert "<ol>\n")

  (mapc 'xah-print-each LinksList)

  (insert "</ol>

")

  (insert XahFooterText)
  (goto-char (point-max)))

;; clear memory
;; (clrhash LinksHash)
;; (setq LinksList nil)

;; open the file
(find-file OutputPath )
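
Since the data lives in a hash table, a plain-text report is a small variation. Here's a minimal sketch. The table my-links stands in for LinksHash, its contents are made up, and my-links-to-text is a hypothetical helper that is not part of the script above.

```elisp
;; Sketch: plain-text variant of the report.
;; `my-links' stands in for LinksHash. Its contents are made up.
(setq my-links (make-hash-table :test 'equal))
(puthash "https://en.wikipedia.org/wiki/Lisp" '("/tmp/b.html") my-links)
(puthash "https://en.wikipedia.org/wiki/Emacs" '("/tmp/a.html" "/tmp/b.html") my-links)

;; `my-links-to-text' is a hypothetical helper.
(defun my-links-to-text (hash-table)
  "Return a plain-text report of HASH-TABLE, one block per URL, sorted by URL."
  (let (rows)
    ;; collect (url . file-list) pairs
    (maphash (lambda (k v) (push (cons k v) rows)) hash-table)
    (setq rows (sort rows (lambda (a b) (string< (car a) (car b)))))
    ;; each block is the URL followed by its files, indented by two spaces
    (mapconcat
     (lambda (row)
       (concat (car row) "\n"
               (mapconcat (lambda (f) (concat "  " f)) (cdr row) "\n")))
     rows
     "\n\n")))

(my-links-to-text my-links)
```

The returned string can be inserted into a buffer or written out with with-temp-file, the same way the HTML version is.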

Elisp Script Examples

  1. Write grep in Elisp
  2. Find String Inside HTML Tag
  3. Validate Matching Brackets
  4. Generate Links Report
  5. Generate Sitemap
  6. Archive Website For Reader Download
  7. Process File line-by-line
  8. Text-Soup Automation
  9. Split HTML Annotation
  10. Fixing Dead Links
  11. Elisp vs Perl: Validate Local File Links
  12. Transform Page Tag
  13. Transform HTML FAQ Tags
  14. Transform HTML Tags
  15. “figure” to “figcaption”
  16. “span.w” to “b”
