Elisp: Replacing HTML Entities with Unicode Characters

By Xah Lee.

This page shows you how to use Emacs to replace HTML entities with their corresponding Unicode characters (for example: &#945; → α).

Problem

I have a file with content like this:

…
<tr><td>pound</td><td>&#163;</td></tr>
<tr><td>curren</td><td>&#164;</td></tr>
<tr><td>yen</td><td>&#165;</td></tr>
<tr><td>brvbar</td><td>&#166;</td></tr>
<tr><td>sect</td><td>&#167;</td></tr>
<tr><td>copy</td><td>&#169;</td></tr>
…

I need it to be like this:

…
<tr><td>pound</td><td>£</td></tr>
<tr><td>curren</td><td>¤</td></tr>
<tr><td>yen</td><td>¥</td></tr>
<tr><td>brvbar</td><td>¦</td></tr>
<tr><td>sect</td><td>§</td></tr>
<tr><td>copy</td><td>©</td></tr>
…

How do you do it using emacs's power?

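Note: in HTML, the syntax &#n; represents the Unicode character whose code point is the decimal integer n. Strictly speaking, this form is a numeric character reference, and named forms such as &copy; are entities, but both are commonly called HTML entities. [see Character Sets and Encoding in HTML] [see HTML XML Entities]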

Solution 1

Write an Emacs Lisp command. See: Elisp: Replace HTML Entities

Solution 2

Emacs lets you do find and replace with the replacement computed by an elisp function. Here's an outline of the solution.

  1. Open the file.
  2. Alt+x query-replace-regexp.
  3. Give the regex &#\([0-9]+\);. This matches a decimal HTML entity and captures its decimal code.
  4. In the replacement input, tell Emacs to use an elisp function, like this: \,(ff), where “ff” is my function name.
  5. Then, type y or n for each match, or type ! to replace all occurrences in the file.
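For comparison, here is a minimal sketch of roughly what the steps above do when you answer ! (replace all), written as a plain command instead of an interactive query. The name my-replace-decimal-entities is made up for illustration and is not from this page. The sketch assumes only decimal forms like &#163; need handling.

```elisp
;; Hypothetical helper (the name is an assumption, not from the article).
;; Replaces every decimal reference such as &#163; in the current buffer.
(defun my-replace-decimal-entities ()
  "Replace each decimal &#N; in the current buffer with its character."
  (interactive)
  (save-excursion
    (goto-char (point-min))
    ;; Same regex as in step 3, with backslashes doubled for a Lisp string.
    (while (re-search-forward "&#\\([0-9]+\\);" nil t)
      (replace-match
       (char-to-string (string-to-number (match-string 1)))
       t      ; FIXEDCASE: do not adjust letter case
       t))))  ; LITERAL: insert the text as-is
```

The interactive query-replace-regexp is still the better choice when you want to confirm each match with y or n.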

The key here is writing the replacement function ff.

Your function ff will read the captured text of the current match, then return the Unicode character whose code point is that number. For example, if the captured string is "945", then ff should return the string "α".

Here's the code:

(defun ff ()
  "Temp function. Return a string based on the current regex match.
This is for the regex: &#\\([0-9]+\\);"
  ;; Capture group 1 holds the decimal code point as a string of digits.
  (char-to-string (string-to-number (match-string 1))))

Let's go thru the code. The code (match-string 1) gives me the 1st captured string. Let's say the captured string is "945".

In Emacs, the character datatype is just an integer: a character is represented by its Unicode code point. For example, if you run this code: (insert 945), it'll insert “α”. (try it now)
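A few expressions you can evaluate (for example with Alt+x eval-last-sexp) to see that a character and its code point are the same thing. This is only an illustration; α is used because it is the running example.

```elisp
;; A character literal is read as its integer code point.
(eq ?α 945)            ; t
(characterp 945)       ; t
;; Converting between a character and a one-character string.
(char-to-string 945)   ; "α"
(string-to-char "α")   ; 945
```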

(info "(elisp) Character Type")

So, I change the captured string into a character (an integer) with (string-to-number (match-string 1)), then I change this character to a string with (char-to-string …).
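To watch each step separately, here is a small sketch that returns the intermediate values. The function name my-entity-steps is hypothetical. It matches against a string with string-match, while ff reads the match data left by a buffer search.

```elisp
;; Hypothetical helper for illustration only.
(defun my-entity-steps (text)
  "Return (DIGITS CODE CHAR-STRING) for the first &#N; found in TEXT."
  (when (string-match "&#\\([0-9]+\\);" text)
    (let* ((digits (match-string 1 text))     ; captured digits, a string
           (code (string-to-number digits))   ; the code point, an integer
           (char-str (char-to-string code)))  ; one-character string
      (list digits code char-str))))

(my-entity-steps "<td>&#945;</td>")   ; ("945" 945 "α")
```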

A Shortcut

Once you become familiar with using a lisp expression for regex replacement, you can simply use this code for the replacement:
\,(char-to-string (string-to-number \1))

No need to write a function ff. But writing out a function makes it clear what we are doing, and it is easier when the transformation you need is a bit complex.

Carlos at comp.lang.lisp, and Jon Snader (jcs) on his blog (irreal.org) gave the following nice solutions:

\,(char-to-string \#1)
\,(format "%c" \#1)

When using a lisp expression in query-replace-regexp, \1 is the 1st captured string, and \#1 is the 1st captured string converted to a number. Alt+x describe-function on query-replace-regexp for details.
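The \, and \#1 syntax belongs to the interactive replacement prompt. In lisp code, a rough equivalent is to pass a function to replace-regexp-in-string and call string-to-number on the capture yourself. A minimal sketch follows; the function name my-decode-decimal-entities is made up for illustration.

```elisp
;; Hypothetical helper: string-based version of the same replacement.
(defun my-decode-decimal-entities (text)
  "Return TEXT with each decimal &#N; replaced by its character."
  (replace-regexp-in-string
   "&#\\([0-9]+\\);"
   (lambda (m)
     ;; M is the matched text. (string-to-number (match-string 1 m))
     ;; plays the role of \#1 in the interactive prompt.
     (format "%c" (string-to-number (match-string 1 m))))
   text
   t     ; FIXEDCASE
   t))   ; LITERAL

(my-decode-decimal-entities "&#163; and &#165;")   ; "£ and ¥"
```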

Using an elisp function as the replacement has many uses. For several examples, see: Elisp: Using a Elisp Function for Replacement String.

For tips on using emacs regex, see: Matching Text Pattern in Emacs.

Function as Replacement String

  1. Call Function in Replacement String
  2. Add “alt” Attribute to Image Tags
  3. Replace String Based on File Name
  4. HTML Entities → Unicode Char
  5. Char → Unicode Name + Char
