
Emacs Lisp: Replace HTML Named Entities with Unicode Characters

Xah Lee

This page shows you how to write an elisp command that replaces HTML named entities such as &eacute; with the corresponding Unicode character é.

Problem Description

I want a command that automatically changes HTML named entities to Unicode characters. For example, &eacute; should become é.

Note: there are over 200 named HTML entities 〔☛ HTML/XML Entities List〕.

The command should work on the current paragraph, or text selection.

Solution

This is easy to write. One of the basic elisp idioms is find/replace on a region, like this:

(defun replace-html-chars-region (start end)
  "Replace some HTML entities in region …."
  (interactive "r")
  (save-restriction
    (narrow-to-region start end)

    ;; For each entity: go back to the top of the narrowed region,
    ;; then replace every occurrence with its Unicode character.
    (goto-char (point-min))
    (while (search-forward "&lsquo;" nil t) (replace-match "‘" nil t))

    (goto-char (point-min))
    (while (search-forward "&rsquo;" nil t) (replace-match "’" nil t))

    (goto-char (point-min))
    (while (search-forward "&ldquo;" nil t) (replace-match "“" nil t))

    (goto-char (point-min))
    (while (search-forward "&rdquo;" nil t) (replace-match "”" nil t))

    (goto-char (point-min))
    (while (search-forward "&eacute;" nil t) (replace-match "é" nil t))
    ;; more here
    )
  )

The (interactive "r") tells emacs that this is a command that can be called by execute-extended-command 【M-x】 and the "r" means emacs will feed the beginning and ending text selection positions to your function's parameters.

There are several problems with the above simple code.

① The code requires you to make a text selection first. It would be better if it worked on the text selection when there is one, and on the current paragraph otherwise.

For a solution to this, see: Emacs Lisp Idioms (for writing interactive commands).
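
A minimal sketch of that "selection, else current paragraph" behavior follows. It is not the code from the linked page: the names my-region-or-paragraph-bounds and my-replace-eacute-dwim are hypothetical, and it handles only a single entity to stay short. It assumes the standard thingatpt library for finding the paragraph boundaries.

```elisp
(require 'thingatpt) ; provides `bounds-of-thing-at-point'

(defun my-region-or-paragraph-bounds ()
  "Return (START . END) of the active region, else of the current paragraph.
Hypothetical helper, for illustration only."
  (if (use-region-p)
      (cons (region-beginning) (region-end))
    ;; Fall back to an empty range if no paragraph is found at point.
    (or (bounds-of-thing-at-point 'paragraph)
        (cons (point) (point)))))

(defun my-replace-eacute-dwim ()
  "Replace &eacute; with é in the region, or in the current paragraph."
  (interactive)
  (let ((bounds (my-region-or-paragraph-bounds))
        (case-fold-search nil)) ; entity names are case-sensitive
    (save-excursion
      (save-restriction
        (narrow-to-region (car bounds) (cdr bounds))
        (goto-char (point-min))
        (while (search-forward "&eacute;" nil t)
          (replace-match "é" t t))))))
```

Because the command no longer takes start and end parameters, plain (interactive) is enough; the bounds are computed inside the function body.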

② The elisp code above is too verbose. It would be much better if we could write it like this:

(defun replace-html-named-entities ()
 …
  (replace-pairs-in-string inputStr
    [
     ["&lsquo;" "‘"]
     ["&rsquo;" "’"]
     ["&ldquo;" "“"]
     ["&rdquo;" "”"]
     ["&eacute;" "é"]
     ]
  ))

For a solution to this, see: Emacs Lisp: Multi-Pair String Replacement Function.
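
To give a rough idea of what such a helper can look like, here is a minimal sketch. It is not the function from the linked article: my-replace-pairs-in-string is a hypothetical name, and it simply applies the pairs one after another, so it still has the sequential-replacement weakness described in ③.

```elisp
(defun my-replace-pairs-in-string (str pairs)
  "Return STR with each [FROM TO] pair in PAIRS replaced, one pair at a time.
PAIRS is a vector of two-element vectors of literal strings.
Hypothetical sketch, for illustration only."
  (let ((result str)
        (case-fold-search nil)) ; entity names are case-sensitive
    (mapc (lambda (pair)
            (setq result
                  (replace-regexp-in-string
                   (regexp-quote (aref pair 0)) ; match FROM literally
                   (aref pair 1)
                   result
                   t    ; FIXEDCASE: do not alter the case of TO
                   t))) ; LITERAL: do not interpret backslashes in TO
          pairs)
    result))

;; Example call:
;; (my-replace-pairs-in-string "&ldquo;caf&eacute;&rdquo;"
;;                             [["&ldquo;" "“"] ["&rdquo;" "”"] ["&eacute;" "é"]])
```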

③ Replacing multiple pairs of strings one by one may produce incorrect behavior.

Tricky Issue with Sequential Replacement of Multi-Pairs

Suppose you are working on an HTML tutorial that discusses HTML entities. Suppose the file contains this string:

use “&amp;copy;” for &copy;

The intended display is: use “&copy;” for ©.

However, if you replace each entity sequentially, the &amp; part becomes & first, which turns &amp;copy; into &copy;. The next pass then turns that &copy; into ©, so you get: use “©” for ©. WRONG!
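
The failure is easy to reproduce. The sketch below uses a hypothetical function name, my-decode-sequentially, and runs the two passes in the problematic order on a string.

```elisp
(defun my-decode-sequentially (text)
  "Decode &amp; and then &copy; in TEXT, one full pass per entity.
Hypothetical function that demonstrates the ordering problem."
  (let* ((case-fold-search nil)
         ;; Pass 1: &amp; -> &.  This turns \"&amp;copy;\" into \"&copy;\".
         (pass1 (replace-regexp-in-string "&amp;" "&" text t t))
         ;; Pass 2: &copy; -> ©.  This also hits the entity that
         ;; pass 1 just created, which was meant to stay as text.
         (pass2 (replace-regexp-in-string "&copy;" "©" pass1 t t)))
    pass2))

;; (my-decode-sequentially "use “&amp;copy;” for &copy;")
;; returns "use “©” for ©" instead of the intended "use “&copy;” for ©"
```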

When you have many replacement pairs, doing them one by one, each time starting from the top of the document, may introduce unexpected changes. One solution is to first replace them with a set of unique intermediate values, then replace those with the final values. See: Emacs Lisp: Multi-Pair String Replacement Function.
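
Here is a minimal sketch of the intermediate-value idea, not the implementation from the linked article. The name my-replace-pairs-two-phase is hypothetical, and the placeholder format (an index wrapped in the Private Use Area characters U+E000 and U+E001) is my own assumption; it only works if those characters do not already occur in the input.

```elisp
(defun my-replace-pairs-two-phase (str pairs)
  "Return STR with each [FROM TO] pair in PAIRS replaced via placeholders.
PAIRS is a vector of two-element vectors of literal strings.
Hypothetical sketch; assumes STR contains no U+E000 or U+E001."
  (let ((case-fold-search nil) ; entity names are case-sensitive
        (result str)
        (i 0))
    ;; Phase 1: replace every FROM with a unique placeholder, so that
    ;; no replacement output can be matched by a later pair.
    (mapc (lambda (pair)
            (setq result (replace-regexp-in-string
                          (regexp-quote (aref pair 0))
                          (format "\uE000%d\uE001" i)
                          result t t))
            (setq i (1+ i)))
          pairs)
    ;; Phase 2: replace each placeholder with its final TO value.
    (setq i 0)
    (mapc (lambda (pair)
            (setq result (replace-regexp-in-string
                          (regexp-quote (format "\uE000%d\uE001" i))
                          (aref pair 1)
                          result t t))
            (setq i (1+ i)))
          pairs)
    result))

;; (my-replace-pairs-two-phase "use “&amp;copy;” for &copy;"
;;                             [["&amp;" "&"] ["&copy;" "©"]])
;; returns "use “&copy;” for ©"
```

After phase 1 the text contains only placeholders where entities used to be, so the & produced from &amp; in phase 2 can no longer combine with the following copy; into a new entity.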
