Emacs Lisp: Syntax Color Source Code in HTML


This page shows you how to write an emacs lisp command to syntax color computer language source code in HTML.

Problem Description

Write a command “htmlize-pre-block”. When called, it will syntax color the computer language source code under the cursor.

For example, here's an elisp code snippet:

(if (< 3 2) (message "yes") )

Here's the raw HTML of the syntax colored version:

(<span class="keyword">if</span> (&lt; 3 2) (message <span class="string">"yes"</span>) )

Here's how it looks in a web browser:

(if (< 3 2) (message "yes") )

There is an emacs package that transforms any colored text in emacs to HTML form. This is extremely nice. The package is named htmlize.el and is written by Hrvoje Niksic, available at http://fly.srk.fer.hr/~hniksic/emacs/htmlize.el.cgi. This package primarily gives you 3 new commands:

  1. htmlize-region. Output to a new buffer.
  2. htmlize-buffer. Output to a new buffer.
  3. htmlize-file. Takes an input file name, outputs to a new file.

Here's what I need: a command “htmlize-pre-block”. When the cursor is inside a “pre” tag like this:

<pre class="‹lang_name›">
…
</pre>

then, after calling “htmlize-pre-block”, the source code inside the tag will be syntax colored, that is: wrapped with appropriate “span” tags on the language's keywords.

Solution

There are many ways to solve this problem. Here's one way.

  1. Grab the text inside the <pre class="‹lang_name›">…</pre> tag the cursor is in.
  2. Create a temp buffer. Insert the text in.
  3. Set the new buffer to a major mode corresponding to ‹lang_name›, and fontify it.
  4. Call htmlize-buffer.
  5. From the htmlize-buffer output, grab the (htmlized) text inside <pre> tag.
  6. Kill the htmlize output buffer and my temp buffer.
  7. Delete the original text, insert in the htmlized text.

To achieve the above, I decided on 2 steps:

  1. Write a function “htmlize-string” that takes a string and a mode name, and returns the htmlized string.
  2. Write a function “htmlize-pre-block” that grabs the text, calls “htmlize-string”, then replaces the original text with the new.

htmlize-string

Here's the code of “htmlize-string” function:

(defun htmlize-string (sourceCodeStr langModeName)
  "Take SOURCECODESTR and return a htmlized version using LANGMODENAME.
LANGMODENAME is a major mode name as a string, such as \"emacs-lisp-mode\".
This function requires the htmlize.el by Hrvoje Niksic."
  (require 'htmlize)
  (let (htmlizeOutputBuf p1 p2 resultStr)

    ;; put code in a temp buffer, set the mode, fontify
    (with-temp-buffer
      (insert sourceCodeStr)
      (funcall (intern langModeName))
      (font-lock-fontify-buffer)
      (setq htmlizeOutputBuf (htmlize-buffer)))

    ;; extract the fontified source code in htmlize output
    (with-current-buffer htmlizeOutputBuf
      (goto-char (point-min)) ; search from the start, wherever point was left
      (setq p1 (search-forward "<pre>"))
      (setq p2 (search-forward "</pre>"))
      ;; (+ p1 1) assumes a newline right after <pre>; (- p2 6) stops before </pre>
      (setq resultStr (buffer-substring-no-properties (+ p1 1) (- p2 6))))

    (kill-buffer htmlizeOutputBuf)
    resultStr))

The “htmlize-string” takes a string and a mode name, and returns a htmlized string.

First it creates a temp buffer by (with-temp-buffer …), inserts the string, sets the major mode for the language it should be colored with, fontifies the buffer, then calls htmlize-buffer to generate the htmlized text. The return value of htmlize-buffer is its output buffer, which we save in “htmlizeOutputBuf”.
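The mode-setting step works because a major mode is just a function: intern turns the mode's name into a symbol, and funcall calls it. Here is a minimal sketch of that step alone. The name “my-major-mode-for-name” is made up for illustration, and the fallback to text-mode when no such function exists is my own addition, not something “htmlize-string” does.

```elisp
(defun my-major-mode-for-name (modeNameStr)
  "Enable the mode named MODENAMESTR in a temp buffer and return `major-mode'.
MODENAMESTR is a string such as \"emacs-lisp-mode\".
If no function of that name is defined, `text-mode' is used instead."
  (with-temp-buffer
    (let ((modeSym (intern modeNameStr)))   ; string to symbol
      (if (fboundp modeSym)
          (funcall modeSym)                 ; same as calling (emacs-lisp-mode) etc.
        (text-mode))
      major-mode)))                         ; value is read inside the temp buffer

(my-major-mode-for-name "emacs-lisp-mode")
;; ⇒ emacs-lisp-mode
```

Checking with fboundp first avoids a “void-function” error when the pre tag names a language whose mode package is not installed.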

Now we have a buffer object “htmlizeOutputBuf”. It contains the htmlized text, as a complete HTML file like this: <html><head>…</head><body>…</body></html>. We want to grab only the part that is the htmlized source code, that is, excluding the usual HTML header and footer.

We call (with-current-buffer …) and extract text between <pre>…</pre> tags. The first argument to with-current-buffer is a buffer object or buffer name. Then, emacs will use that buffer as the current buffer.

Emacs's buffer related functions can often take an argument that is either a buffer name (of type “string”) or a buffer object itself (of type “buffer”).

(info "(elisp) Buffers")
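To see the name-or-object point in action, here's a small sketch. The function “my-buffer-text” and the buffer name "my-demo" are made up for this example.

```elisp
(defun my-buffer-text (bufferOrName)
  "Return the whole text of BUFFERORNAME, a buffer object or a buffer name."
  (with-current-buffer bufferOrName
    (buffer-substring-no-properties (point-min) (point-max))))

(let ((buf (generate-new-buffer "my-demo")))     ; returns a buffer object
  (with-current-buffer buf (insert "hello"))
  (prog1
      (list (my-buffer-text buf)                 ; pass the object
            (my-buffer-text (buffer-name buf)))  ; pass the name string
    (kill-buffer buf)))
;; ⇒ ("hello" "hello")
```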

We extract text by (buffer-substring-no-properties pos1 pos2). An emacs string can carry information called “text properties”, which hold info such as font and coloring. To grab a string from a buffer, you can use buffer-substring, which keeps the properties, or buffer-substring-no-properties, which drops them. Most emacs functions that take a string as argument accept strings with or without properties.

(info "(elisp) Text Properties")
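The difference between the two functions can be checked with a short sketch. Here the face is attached by hand with put-text-property to stand in for what font-lock does; the function name is made up for illustration.

```elisp
(defun my-face-with-and-without-properties ()
  "Return the face of the first char of a colored text, grabbed both ways."
  (with-temp-buffer
    (insert "if")
    ;; attach a face property by hand, the way font-lock would for a keyword
    (put-text-property (point-min) (point-max) 'face 'font-lock-keyword-face)
    (list
     (get-text-property 0 'face (buffer-substring (point-min) (point-max)))
     (get-text-property 0 'face (buffer-substring-no-properties (point-min) (point-max))))))

(my-face-with-and-without-properties)
;; ⇒ (font-lock-keyword-face nil)
```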

htmlize-pre-block

Here's the code of “htmlize-pre-block” function:

(defun htmlize-pre-block ()
  "Replace text enclosed by <pre> tag to htmlized code.
For example, if the cursor is somewhere between the pre tags:
 <pre class=\"lang-code\">…▮…</pre>

after calling, the text inside the pre tag will be htmlized.
That is, wrapped with many span tags.

The opening tag must be of the form <pre class=\"lang-code\">.
The “lang-code” determines what emacs mode is used to colorize the
text.

 “lang-code” can be any of {c, elisp, java, javascript, html, xml, css, …}.
 (See source code for a full list)

See also: `dehtmlize-pre-block'.

This function requires htmlize.el by Hrvoje Niksic."
  (interactive)
  (let (inputStr langCode p1 p2 modeName
    (langModeMap
     '(
       ("ahk" . "ahk-mode")
       ("bash" . "sh-mode")
       ("bbcode" . "xbbcode-mode")
       ("c" . "c-mode")
       ("cl" . "lisp-mode")
       ("clojure" . "clojure-mode")
       ("cmd" . "dos-mode")
       ("css" . "css-mode")
       ("elisp" . "emacs-lisp-mode")
       ("haskell" . "haskell-mode")
       ("html" . "html-mode")
       ("xml" . "sgml-mode")
       ("html6" . "html6-mode")
       ("java" . "java-mode")
       ("javascript" . "js-mode")
       ("js" . "js-mode")
       ("lsl" . "xlsl-mode")
       ("ocaml" . "tuareg-mode")
       ("org" . "org-mode")
       ("perl" . "cperl-mode")
       ("php" . "php-mode")
       ("povray" . "pov-mode")
       ("powershell" . "powershell-mode")
       ("python" . "python-mode")
       ("ruby" . "ruby-mode")
       ("scala" . "scala-mode")
       ("scheme" . "scheme-mode")
       ("vbs" . "visual-basic-mode")
       ("visualbasic" . "visual-basic-mode")
       ) ))

    (save-excursion
      (re-search-backward "<pre class=\"\\([-A-Za-z0-9]+\\)\"") ; tag begin position
      (setq langCode (match-string 1))
      (setq p1 (search-forward ">")) ; lang source code string begin
      (search-forward "</pre>")
      (setq p2 (search-backward "<")) ; lang source code string end
      (search-forward "</pre>") ; tag end position
      (setq inputStr (buffer-substring-no-properties p1 p2))

      (setq modeName
            (let ((tempVar (assoc langCode langModeMap) ))
              (if tempVar (cdr tempVar) "text-mode" ) ) )

      (delete-region p1 p2)
      (goto-char p1)
      (insert (htmlize-string inputStr modeName)) ) ) )

The function first sets up a map of langCode to major mode name, like this:

'(
  ("ahk" . "ahk-mode")
  ("bash" . "sh-mode")
  ("bbcode" . "xbbcode-mode")
  ("c" . "c-mode")
  ("cl" . "lisp-mode")
  ("clojure" . "clojure-mode")
  ("cmd" . "dos-mode")
  …
  )

This is called an association list (alist), sometimes also known as a keyed list or dictionary. To get an item, you can use assoc, which returns the whole (key . value) pair, or nil if the key is not found. See: (info "(elisp) Association Lists").
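Here's a small sketch of such a lookup with a fallback. The variable “my-lang-mode-map” is a shortened, made-up copy of the map, and “my-mode-name-for” is a hypothetical helper, not part of the code above.

```elisp
(defvar my-lang-mode-map
  '(("elisp" . "emacs-lisp-mode")
    ("js" . "js-mode")
    ("python" . "python-mode"))
  "Shortened example map from lang code to major mode name.")

(defun my-mode-name-for (langCode)
  "Return the mode name for LANGCODE, or \"text-mode\" if it is not in the map."
  (let ((pair (assoc langCode my-lang-mode-map)))  ; (KEY . VALUE) or nil
    (if pair (cdr pair) "text-mode")))

(my-mode-name-for "js")      ; ⇒ "js-mode"
(my-mode-name-for "cobol")   ; ⇒ "text-mode"
(my-mode-name-for "JS")      ; ⇒ "text-mode"
```

The last call falls back because assoc compares keys with equal, and string comparison is case sensitive. So the class attribute in the pre tag must be spelled exactly like the key in the map.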

Then, it grabs the text inside the <pre> block in the current buffer. Like this:

(re-search-backward "<pre class=\"\\([-A-Za-z0-9]+\\)\"") ; tag begin position
(setq langCode (match-string 1))
(setq p1 (search-forward ">")) ; lang source code string begin
(search-forward "</pre>")
(setq p2 (search-backward "<")) ; lang source code string end
(search-forward "</pre>") ; tag end position
(setq inputStr (buffer-substring-no-properties p1 p2))

In the above, the langCode is also set from the regex match in re-search-backward.
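To try the search steps in isolation, here's a sketch that runs them on a made-up pre block in a temp buffer. The function “my-pre-block-at-point” is hypothetical; it only collects the values and does not change the buffer.

```elisp
(defun my-pre-block-at-point ()
  "Return a list (LANGCODE CODE-TEXT) for the pre block around point.
Assumes point is inside a block of the form <pre class=\"…\">…</pre>."
  (save-excursion
    (let (langCode p1 p2)
      (re-search-backward "<pre class=\"\\([-A-Za-z0-9]+\\)\"")
      (setq langCode (match-string 1))   ; text matched by the first group
      (setq p1 (search-forward ">"))     ; position just after the opening tag
      (search-forward "</pre>")          ; point is now after the closing tag
      (setq p2 (search-backward "<"))    ; position of the < in </pre>
      (list langCode (buffer-substring-no-properties p1 p2)))))

(with-temp-buffer
  (insert "<p>x</p>\n<pre class=\"python\">print(1 &lt; 2)</pre>\n")
  (search-backward "print")              ; put point inside the block
  (my-pre-block-at-point))
;; ⇒ ("python" "print(1 &lt; 2)")
```

Note that search-backward for "<" finds the closing tag only because a literal < inside the block is written as &lt; in HTML.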

Then, we get the major mode name for that langCode, by:

(setq modeName
      (let ((tempVar (assoc langCode langModeMap) ))
        (if tempVar (cdr tempVar) "text-mode" ) ) )

Once we know which major mode to use, we call “htmlize-string” to get the htmlized text, then delete the original text and insert the new text in its place. Like this:

(delete-region p1 p2)
(goto-char p1)
(insert (htmlize-string inputStr modeName))
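The delete-then-insert pattern can be tried on its own. Below is a minimal sketch, with a made-up helper “my-replace-region” and a plain string standing in for the htmlized text. It assumes P1 is less than or equal to P2.

```elisp
(defun my-replace-region (p1 p2 newText)
  "Replace the text between positions P1 and P2 with NEWTEXT.
Works on the current buffer. Assumes P1 is not greater than P2."
  (save-excursion
    (delete-region p1 p2)
    (goto-char p1)        ; after the deletion, P1 is where the old text began
    (insert newText)))

(with-temp-buffer
  (insert "<pre>OLD</pre>")
  (my-replace-region 6 9 "NEW")    ; positions 6 to 9 cover "OLD"
  (buffer-string))
;; ⇒ "<pre>NEW</pre>"
```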


Setting Up htmlize.el and CSS

Note: quote from htmlize.el's header documentation:

htmlize supports three types of HTML output, selected by setting “htmlize-output-type”: “css”, “inline-css”, and “font”. … “css” mode is the default.

My functions “htmlize-pre-block” and “htmlize-string” assume you are using the “css” output type. This means you'll have to do a one-time manual step: take the CSS code generated in the htmlize output and put it in a style sheet that your HTML pages reference. You can use my CSS code for language here: elisp_htmlize_css_code.css.

If your HTML is in Unicode UTF-8 encoding, you might add the following to your emacs init file:

(setq htmlize-convert-nonascii-to-entities nil)
(setq htmlize-html-charset "utf-8")

They prevent htmlize from creating ugly HTML entities. For example, if you have a bullet char “•” (Unicode U+2022), you will see the character as is instead of a numeric entity such as &#x2022;.


Dehtmlize Text

The raw HTML of htmlized language code is usually unreadable. For example, here's 2 lines of OCaml language code:

let myComposition f g = (fun x -> f (g x) );;
myComposition (fun x -> x ^ "c") (fun x -> x ^ "b") "a";;

Here's its htmlized version:

<span class="tuareg-font-lock-governing">let</span> <span class="function-name">myComposition</span><span class="variable-name"> f g </span><span class="tuareg-font-lock-operator">=</span> <span class="tuareg-font-lock-operator">(</span><span class="keyword">fun</span> <span class="variable-name">x </span><span class="tuareg-font-lock-operator">-&gt;</span> f <span class="tuareg-font-lock-operator">(</span>g x<span class="tuareg-font-lock-operator">)</span> <span class="tuareg-font-lock-operator">);;</span>
myComposition <span class="tuareg-font-lock-operator">(</span><span class="keyword">fun</span> <span class="variable-name">x </span><span class="tuareg-font-lock-operator">-&gt;</span> x <span class="tuareg-font-lock-operator">^</span> <span class="string">"c"</span><span class="tuareg-font-lock-operator">)</span> <span class="tuareg-font-lock-operator">(</span><span class="keyword">fun</span> <span class="variable-name">x </span><span class="tuareg-font-lock-operator">-&gt;</span> x <span class="tuareg-font-lock-operator">^</span> <span class="string">"b"</span><span class="tuareg-font-lock-operator">)</span> <span class="string">"a"</span><span class="tuareg-font-lock-operator">;;</span>

Suppose you want to modify the OCaml code in your blog. Usually, you switch to the browser, copy the code, switch back to emacs, create a new buffer, and paste the code to edit it. When done, you copy it, close the temp buffer, delete the htmlized version on your blog, paste the new code in, then htmlize it again. This process is painful.

It would be nice if you could press a button to turn the htmlized source code in your HTML back into plain text, so you can modify it, then press a button again to have it htmlized again.

Here's the code of “dehtmlize-pre-block”:

(defun dehtmlize-pre-block (p1 p2)
  "Delete span tags between pre tags.
For example, if the cursor is somewhere between the tags:
<pre class=\"…\">…▮…</pre>

after calling, all span tags inside the block will be removed.
If there's a text selection, dehtmlize that region.

Note: only span tags of the form 「<span class=\"…\">…</span>」 are deleted.

This command does the reverse of `htmlize-pre-block'."
  (interactive
   (if (use-region-p)
       (list (region-beginning) (region-end))
     (let (p3 p4)
       (save-excursion
         (re-search-backward "<pre class=\"\\([-A-Za-z0-9]+\\)\"")
         (setq p3 (re-search-forward ">")) ; code begin position
         (re-search-forward "</pre>")
         (setq p4 (- (point) 6)) ; code end position
         (list p3 p4 )) ) ) )
  (dehtmlize-span-region p1 p2))

(defun dehtmlize-span-region (p1 p2)
  "Delete HTML “span” tags in region, and decode &amp; &lt; &gt;.
Note: only span tags of the form 「<span class=\"…\">…</span>」 are deleted."
  (interactive "r")
  ;; `replace-regexp-pairs-region' and `replace-pairs-region' are not built
  ;; into emacs; they must be loaded from a separate package.
  (save-excursion
    (save-restriction
      (narrow-to-region p1 p2)
      (replace-regexp-pairs-region (point-min) (point-max) '(["<span class=\"[^\"]+\">" ""]))
      (replace-pairs-region (point-min) (point-max) '( ["</span>" ""] ["&amp;" "&"] ["&lt;" "<"] ["&gt;" ">"] ) ) ) ) )
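If you don't have those two replace-pairs helper functions, the same idea can be sketched on a string with only built-in functions. This is a rough stand-in, not the code above: the name “my-dehtmlize-string” is made up, and it decodes only the three entities shown.

```elisp
(defun my-dehtmlize-string (htmlStr)
  "Return HTMLSTR with span tags removed and basic entities decoded.
Only tags of the form <span class=\"…\"> and </span> are removed.
Only the entities &lt; &gt; &amp; are decoded."
  (let ((s htmlStr))
    (setq s (replace-regexp-in-string "<span class=\"[^\"]+\">" "" s t t))
    (setq s (replace-regexp-in-string "</span>" "" s t t))
    (setq s (replace-regexp-in-string "&lt;" "<" s t t))
    (setq s (replace-regexp-in-string "&gt;" ">" s t t))
    ;; decode &amp; last, so that "&amp;lt;" becomes the text "&lt;", not "<"
    (replace-regexp-in-string "&amp;" "&" s t t)))

(my-dehtmlize-string
 "(<span class=\"keyword\">if</span> (&lt; 3 2) (message <span class=\"string\">\"yes\"</span>) )")
;; ⇒ "(if (< 3 2) (message \"yes\") )"
```

The two t arguments tell replace-regexp-in-string to keep letter case as is and to insert the replacement text literally.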

Set the Code to a File

If you have a pre block:

<pre class="python">
…
</pre>

Wouldn't it be nice if, by pressing a button, the plain source code were moved into a temp file 〔xx-temp-‹randomstr›.py〕 and opened in a split window?
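Here's a rough sketch of the file-writing part only, under my own assumptions: the function name “my-code-to-temp-file” is made up, the caller passes the file suffix, and the file goes in the default temp directory. Grabbing the text from the pre block is left out.

```elisp
(defun my-code-to-temp-file (codeStr fileSuffix)
  "Write CODESTR to a new temp file and return the file name.
The file name ends in FILESUFFIX, a string such as \".py\"."
  (let ((filePath (make-temp-file "xx-temp-" nil fileSuffix)))
    (with-temp-file filePath   ; write the buffer content to filePath at the end
      (insert codeStr))
    filePath))

;; Interactive use might then be, for example:
;; (find-file-other-window (my-code-to-temp-file "print(1)\n" ".py"))
```

Opening the file with find-file-other-window lets emacs pick the major mode from the file suffix.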

For the latest version of this code, see Emacs: Xah HTML Mode.

JavaScript Solution

Google has an open source technology that uses JavaScript to color code in HTML on the fly instead of using the bulky markup. For details, see: Syntax Coloring with Google-Code-Prettify.
