Elisp: Fix Dead Links

By Xah Lee.

This page shows you how to write an elisp script that checks thousands of HTML files and fixes dead links.

Problem

I have 2 thousand HTML files that contain about 70 dead local links. I need to write an elisp script to change these links to non-links. For example, this is a dead link:

<a href="../widget/index.html#Top">Introduction</a>

I need it to be:

<span class="εlink" title="../widget/index.html#Top">Introduction</span>

The script should run in batch, and it should generate a report of all changed links.

Detail

I have a copy of the emacs manuals on my website.

These manuals sometimes link to other info manuals that are not part of emacs. For example, the page “Changing Files - GNU Emacs Lisp Reference Manual” contains a link to GNU coreutils like this:

<a href="../coreutils/File-Permissions.html">File Permissions</a>

I need to change these links to non-links.

Solution

Here's an outline of the steps.

  1. Open each file.
  2. Search for “href=”.
  3. Get the link URL.
  4. Check if the link is a local file and exists.
  5. If not, change the entire link tag into a “span” tag.
  6. Repeat the above, until no link found.

First, we start like this:

(setq inputDir "~/web/xahlee_org/emacs_manual/" )

(defun my-process-file (fPath)
  "Process the file at FPATH …"
  …
)

;; traverse the directory on all HTML files
(require 'find-lisp)
(mapc 'my-process-file (find-lisp-find-files inputDir "\\.html$"))
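The traversal above uses find-lisp-find-files. As a possible alternative, here is a minimal sketch that lists the same files with the built-in directory-files-recursively, assuming Emacs 25.1 or later. The function name my-list-html-files is hypothetical and not part of the original script.

```elisp
;; Sketch only: list HTML files without find-lisp.
;; Assumes Emacs 25.1 or later, where `directory-files-recursively' exists.
;; `my-list-html-files' is a hypothetical helper name.
(defun my-list-html-files (dir)
  "Return the full paths of all files under DIR whose names end in .html."
  ;; \\' anchors the match at the end of the file name.
  (directory-files-recursively (expand-file-name dir) "\\.html\\'"))

;; Possible use with the script's own names:
;; (mapc 'my-process-file (my-list-html-files inputDir))
```

Either way the result is a list of full paths, so the rest of the script does not need to change.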

The important part is the “my-process-file” function. Here's the basic code:

(defun my-process-file (fPath)
  "Process the file at FPATH …"
  (let (…)

    ;; open file
    (setq myBuff (find-file fPath))

    (while
        ;; search local link
        (search-forward "href=\"../" nil t)

      ;; get the URL string
      (setq urlStr (thing-at-point 'filename) )

      ;; if the URL is a dead link
      (when (not (file-exists-p urlStr))
        (progn

          ;; set p1 and p2 to be the start/end of the link tag
          ;; and get the entire link string
          (sgml-skip-tag-backward 1)
          (setq p1 (point) ) ; start of link tag
          (sgml-skip-tag-forward 1)
          (setq p2 (point) ) ; end of link tag
          (setq wholeLinkStr (buffer-substring-no-properties p1 p2) )

          ;; get link text
          (search-backward "</a>")
          (setq p4 (point) ) ; end of link text
          (search-backward ">")
          (forward-char 1)
          (setq p3 (point) ) ; start of link text
          (setq linkText (buffer-substring-no-properties p3 p4) )

          ;; remove the link, replace it with a non-link span text.
          (delete-region p1 p2)
          (insert
           "<span class=\"εlink\" title=\""
           urlStr
           "\">"
           linkText
           "</span>"
           )
          )
        )
      )

    ;; close the file if no changes made
    (when (not (buffer-modified-p myBuff)) (kill-buffer myBuff) )

    ) )
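Step 4 of the outline, checking whether a local link exists, can also be isolated and tried on its own. The following is a minimal sketch under some assumptions: the link is a relative path, any “#…” fragment should be ignored, and the link is resolved against the directory of the file that contains it. The name my-dead-local-link-p and its parameters are hypothetical.

```elisp
;; Sketch only: decide whether a relative link points to a missing file.
;; `my-dead-local-link-p', HREF and BASE-DIR are hypothetical names.
(defun my-dead-local-link-p (href base-dir)
  "Return non-nil if HREF, a relative link, points to no existing file.
HREF is resolved against BASE-DIR. Any #fragment part is ignored."
  (let ((path (replace-regexp-in-string "#.*\\'" "" href)))
    (not (file-exists-p (expand-file-name path base-dir)))))

;; Example call (the result depends on what is on disk):
;; (my-dead-local-link-p "../widget/index.html#Top"
;;                       "~/web/xahlee_org/emacs_manual/elisp/")
```

Inside my-process-file the base directory is implicit, because file-exists-p resolves relative names against the buffer's default-directory, which find-file sets to the directory of the opened file.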

Complete Code

Here's the complete code.

;; -*- coding: utf-8 -*-
;; 2011-09-25
;; replace dead links in emacs manual on my website
;;
;; Example. This:
;; <a href="../widget/index.html#Top">Introduction</a>
;;
;; should become this
;;
;; <span class="εlink" title="../widget/index.html#Top">Introduction</span>
;;
;; do this for all files in a dir.

;; rough steps:
;; go thru each file
;; search for link
;; if the link is 「../xx/」 where the file doesn't exist, then replace the whole link tag.

(setq inputDir "~/web/xahlee_org/emacs_manual/" ) ; dir should end with a slash

(defun my-process-file (fPath)
  "Process the file at FPATH …"
  (let (
        myBuff
        urlStr
        linkText
        wholeLinkStr
        p1 p2
        p3 p4
        )
    (setq myBuff (find-file fPath))
    (widen) ; in case it's open and narrowed
    (goto-char (point-max)) ; work from bottom, so that changes in point are preserved. (actually, doesn't really matter for this script)

    (while
        (search-backward "href=\"../" nil t)
      (forward-char 7)
      (setq urlStr (replace-regexp-in-string "\\.html#.+" ".html" (thing-at-point 'filename) ) )

      (when (not (file-exists-p urlStr))
        (progn
          (sgml-skip-tag-backward 1)
          (setq p1 (point) )                      ; start of link tag
          (sgml-skip-tag-forward 1)
          (setq p2 (point) )                      ; end of link tag

          (setq wholeLinkStr (buffer-substring-no-properties p1 p2) )

          (search-backward "</a>")
          (setq p4 (point) )                      ; end of link text
          (search-backward ">")
          (forward-char 1)
          (setq p3 (point) )                      ; start of link text

          (setq linkText (buffer-substring-no-properties p3 p4) )

          (princ (buffer-file-name))
          (princ "\n")
          (princ wholeLinkStr)
          (princ "\n")
          (princ "----------------------------\n")

          (delete-region p1 p2)
          (insert
           "<span class=\"εlink\" title=\""
           urlStr
           "\">"
           linkText
           "</span>"
           )
          )
        )
      )

    (when (not (buffer-modified-p myBuff)) (kill-buffer myBuff) )

    ) )

(require 'find-lisp)

(font-lock-mode 0)

(with-output-to-temp-buffer "*xah elisp dead link replace output*"
    (mapc 'my-process-file (find-lisp-find-files inputDir "\\.html$"))
    (princ "Done deal!")
    )

(font-lock-mode 1)

Here are a few interesting parts.

Turn Syntax Coloring Off

We turn font lock off with (font-lock-mode 0). With font lock on, processing 2 thousand HTML files takes about 50 minutes. With syntax coloring off, it takes about 3 minutes.
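One thing to keep in mind: font-lock-mode is a buffer-local minor mode, and global-font-lock-mode is the command that switches it for all buffers. The following is a minimal sketch, assuming you want syntax coloring off in every buffer opened during the run and restored afterwards. The helper name my-call-without-font-lock is hypothetical and not part of the original script.

```elisp
;; Sketch only: run a function with syntax coloring turned off globally.
;; `my-call-without-font-lock' is a hypothetical helper name.
(defun my-call-without-font-lock (fn)
  "Call FN with `global-font-lock-mode' off, then restore its previous state.
Return the value of FN."
  (let ((was-on global-font-lock-mode))
    (global-font-lock-mode 0)
    (unwind-protect
        (funcall fn)
      ;; Runs even if FN signals an error.
      (when was-on (global-font-lock-mode 1)))))

;; Possible use with the script's own names:
;; (my-call-without-font-lock
;;  (lambda ()
;;    (mapc 'my-process-file (find-lisp-find-files inputDir "\\.html$"))))
```

The unwind-protect is there so that an error in the middle of the run does not leave syntax coloring off for the rest of the session.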

Leave Changed Files Open

If there are changes in the file, we leave it open. This way, we don't have to revert to backup files if there's a mistake. If we like the result, just call ibuffer and press 【* u】 to mark all unsaved buffers, then S to save them all. Then press D to close them all. If you do not want to save them, simply mark all unsaved buffers with 【* u】, then press D to close them all.

This is extremely useful while you are still working on the code and doing some test runs. This interactive nature of emacs is what beats {Perl, Python, …} for text processing.

If you do want to save the file in the script, simply call (save-buffer) or (write-file (buffer-file-name)).
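For a run where nothing should stay open, saving and closing can be combined in one step. This is a minimal sketch, assuming you trust the script enough to write changes straight to disk. The helper name my-finish-buffer is hypothetical.

```elisp
;; Sketch only: save a buffer if it changed, then close it.
;; `my-finish-buffer' is a hypothetical helper name.
(defun my-finish-buffer (buffer)
  "Save BUFFER if it is modified, then kill it.
Return t if the buffer was saved, nil otherwise."
  (let (saved)
    (with-current-buffer buffer
      (when (buffer-modified-p)
        (save-buffer)
        (setq saved t)))
    (kill-buffer buffer)
    saved))

;; Possible use at the end of my-process-file, in place of the
;; buffer-modified-p check:
;; (my-finish-buffer myBuff)
```

This trades away the safety net described above: once the buffers are saved and closed, undoing a mistake means going back to the backup files.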

When the file is not modified, we close it. Like this: (when (not (buffer-modified-p myBuff)) (kill-buffer myBuff) ).

Use sgml-skip-tag-forward

sgml-skip-tag-forward and sgml-skip-tag-backward come with html-mode (they are defined in sgml-mode.el). They move the cursor to the beginning or end of a tag. They are extremely useful: they save you a lot of time writing code to parse tags, especially when tags are nested. Here's how we used them.

Suppose there's this link in a file:

<a href="../widget/index.html#Top">Introduction</a>

After we did the search with

(while
    (search-backward "href=\"../" nil t)
  …
  )

the cursor is on the “h”. While the cursor is inside the tag, we call:

(sgml-skip-tag-backward 1)
(setq p1 (point) ) ; start of link tag
(sgml-skip-tag-forward 1)
(setq p2 (point) ) ; end of link tag

(setq wholeLinkStr (buffer-substring-no-properties p1 p2) )

This sets the value of wholeLinkStr to the entire anchor tag <a …>…</a>.
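To try this outside the script, the same calls can be run on a string in a temporary buffer. The following is a minimal sketch, assuming sgml-mode.el can be loaded and that putting the temporary buffer into html-mode is acceptable. The function name my-link-tag-at-href is hypothetical.

```elisp
;; Sketch only: extract the anchor element around the last local href.
;; `my-link-tag-at-href' is a hypothetical helper name.
(require 'sgml-mode) ; defines html-mode and the sgml-skip-tag-* commands

(defun my-link-tag-at-href (html)
  "Return the whole anchor element around the last local href in HTML.
HTML is a string. Return nil if no href=\"../ is found."
  (with-temp-buffer
    (insert html)
    (html-mode) ; so that < and > have the syntax the sgml commands expect
    (goto-char (point-max))
    (when (search-backward "href=\"../" nil t)
      (let (p1 p2)
        (sgml-skip-tag-backward 1)
        (setq p1 (point))               ; start of link tag
        (sgml-skip-tag-forward 1)
        (setq p2 (point))               ; end of link tag
        (buffer-substring-no-properties p1 p2)))))

;; Example call:
;; (my-link-tag-at-href
;;  "<p>See <a href=\"../widget/index.html#Top\">Introduction</a>.</p>")
```

For the example call, the function is expected to return only the anchor element, without the surrounding paragraph text.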

Print Output to Your Own Buffer

Printing output is done here using with-output-to-temp-buffer and princ. Like this:

(with-output-to-temp-buffer "*xah elisp dead link replace output*"
    (mapc 'my-process-file (find-lisp-find-files inputDir "\\.html$"))
    (princ "Done deal!")
    )

Inside the “my-process-file” function, we write:

(princ (buffer-file-name))
(princ "\n")
(princ wholeLinkStr)
(princ "\n")
(princ "----------------------------\n")
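The princ calls write to whatever standard-output is bound to, so the same report code can be pointed at a string instead of a temp buffer. This is a minimal sketch using with-output-to-string. The names my-report-change and my-report-to-string are hypothetical, and the list-of-pairs input format is an assumption made for illustration.

```elisp
;; Sketch only: the same report lines, captured into a string.
;; `my-report-change' and `my-report-to-string' are hypothetical names.
(defun my-report-change (file-name link-str)
  "Print FILE-NAME and LINK-STR to `standard-output', then a separator line."
  (princ file-name)
  (princ "\n")
  (princ link-str)
  (princ "\n")
  (princ "----------------------------\n"))

(defun my-report-to-string (changes)
  "Return the report text for CHANGES, a list of (FILE-NAME . LINK-STR) pairs."
  (with-output-to-string
    (dolist (change changes)
      (my-report-change (car change) (cdr change)))))

;; Example call:
;; (my-report-to-string
;;  '(("a.html" . "<a href=\"../x/y.html\">Y</a>")))
```

Capturing the report as a string makes it easy to write it to a file or compare it between test runs.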

Here's the output from the script: elisp_fix_dead_links_output.txt. It lets me easily see if there are any errors. There are a total of 68 changes.

For detail about printing in elisp, see: Elisp: Print, Output.

Elisp Script Examples

  1. Write grep in Elisp
  2. Find String Inside HTML Tag
  3. Validate Matching Brackets
  4. Generate Links Report
  5. Generate Sitemap
  6. Archive Website For Reader Download
  7. Process File line-by-line
  8. Text-Soup Automation
  9. Split HTML Annotation
  10. Fixing Dead Links
  11. Elisp vs Perl: Validate Local File Links
  12. Transform Page Tag
  13. Transform HTML FAQ Tags
  14. Transform HTML Tags
  15. “figure” to “figcaption”
  16. “span.w” to “b”
