This page shows you how to use emacs to replace HTML entities with their corresponding Unicode characters.
(⁖ &#945; ⇒ α)
I have a file with content like this:
… <tr><td>pound</td><td>&#163;</td></tr> <tr><td>curren</td><td>&#164;</td></tr> <tr><td>yen</td><td>&#165;</td></tr> <tr><td>brvbar</td><td>&#166;</td></tr> <tr><td>sect</td><td>&#167;</td></tr> <tr><td>copy</td><td>&#169;</td></tr> …
I need it to be like this:
… <tr><td>pound</td><td>£</td></tr> <tr><td>curren</td><td>¤</td></tr> <tr><td>yen</td><td>¥</td></tr> <tr><td>brvbar</td><td>¦</td></tr> <tr><td>sect</td><td>§</td></tr> <tr><td>copy</td><td>©</td></tr> …
How do you do it using emacs's power?
Note: the syntax &#‹n›; in HTML represents the Unicode character whose code point is the decimal integer ‹n›. Strictly speaking, this form is a numeric character reference; this page loosely calls it an HTML entity.
〔☛ Character Sets and Encoding in HTML〕
〔☛ HTML/XML Entities List〕
Remember that emacs lets you do replacement with an elisp function? This is the quickest solution I found. Here's an outline of the solution.
1. Call query-replace-regexp.
2. For the regex, use &#\([0-9]+\);. This will match all the decimal HTML entities and capture their decimal code.
3. For the replacement, use \,(ff), where “ff” is my function name.

The key here is writing the replacement function ff.
Your function ff takes no arguments. It reads the captured string from the current match, then returns a string containing the Unicode character whose code point is that number. For example, if the captured string is "945", then ff should return the string "α".
Here's the code:
(defun ff ()
  "Temp function. Return a string based on the current regex match.
This is for the regex: &#\\([0-9]+\\);"
  (char-to-string (string-to-number (match-string 1))))
Let's go thru the code. The code
(match-string 1)
gives me the 1st captured string. Let's say the captured string is "945".
In emacs, the character datatype is just an integer: a character is represented by its Unicode code point. For example, if you run this code: (insert 945), it'll insert “α”. (Try it now.)
(info "(elisp) Character Type")
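Here is a small sketch you can evaluate to see that characters and integers are the same thing in elisp. The function name my-char-demo is made up for this illustration; the functions it calls are standard.

```elisp
;; Hypothetical demo function, not part of the solution itself.
(defun my-char-demo ()
  "Return a list showing that characters are integers in Emacs Lisp."
  (list (= ?α 945)            ; ?α is read syntax for the integer 945 → t
        (char-to-string 945)  ; integer to one-character string → "α"
        (string-to-char "α")  ; first character of a string → 945
        (characterp 945)))    ; 945 is a valid character code → t

(my-char-demo) ; → (t "α" 945 t)
```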
So, I change the matched string into a character datatype (integer) by
(string-to-number (match-string 1)),
then I change this char to a string, by
(char-to-string …).
Once you become familiar with using a lisp expression for regex replacement, you can simply use this code for the replacement:
\,(char-to-string (string-to-number \1)).
No need to write a function ff. But writing out a function makes it clear what we are doing, and it is easier when the transformation you need is a bit complex.
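As one example of a slightly more complex transformation, here is a minimal sketch of a command that does the whole replacement without prompting, and skips any number that is not a valid character code. The name my-decode-decimal-entities and the skip-invalid behavior are my own assumptions for illustration, not something required by the solution above.

```elisp
;; Hypothetical helper: same regex as above, applied to the whole buffer.
(defun my-decode-decimal-entities ()
  "Replace each decimal reference such as &#945; in the buffer with its character.
References whose number is not a valid character code are left untouched."
  (interactive)
  (save-excursion
    (goto-char (point-min))
    (while (re-search-forward "&#\\([0-9]+\\);" nil t)
      (let ((code (string-to-number (match-string 1))))
        ;; `characterp' is nil for numbers outside the character range,
        ;; so an input like &#99999999; is skipped instead of signaling an error.
        (when (characterp code)
          ;; FIXEDCASE and LITERAL are both t: insert the character verbatim.
          (replace-match (char-to-string code) t t))))))
```

Call it with M-x my-decode-decimal-entities in the buffer that holds the HTML. Unlike query-replace-regexp, it does not ask for confirmation on each match, so try it on a copy first.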
Carlos at comp.lang.lisp, and Jon Snader (jcs) on his blog (irreal.org) gave the following nice solutions:
\,(char-to-string \#1)
\,(format "%c" \#1)
When using a lisp expression in query-replace-regexp, \1 is the 1st captured string, and \#1 is that same captured string converted to a number. Call describe-function on query-replace-regexp for details.
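To check that the two shorter forms give the same result, here is a small sketch. The helper name my-both-forms is hypothetical; it takes the digits as a plain string and converts them with string-to-number, standing in for what \#1 supplies during a replacement.

```elisp
;; Hypothetical helper for illustration only.
(defun my-both-forms (digits)
  "Return the results of both suggested expressions for the digit string DIGITS."
  (let ((n (string-to-number digits)))  ; stands in for the value of \#1
    (list (char-to-string n)
          (format "%c" n))))

(my-both-forms "945") ; → ("α" "α")
```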
Using an elisp function as replacement has many uses. For several examples, see: Emacs Lisp: Using a Elisp Function for Replacement String.
For tips on using emacs regex, see: Matching Text Pattern in Emacs.