Elisp: Create Sitemap

By Xah Lee. Date: . Last updated: .

This page shows how to use emacs lisp to create a sitemap.

Problem

Write a elisp script to generate a sitemap. That is: create a file of sitemap format that lists all files in a directory.

Detail

A sitemap is a XML file that lists URLs of all files in a website for web crawlers to crawl.

A sitemap file looks like this:

<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">

   <url>
      <loc>http://www.example.com/</loc>
      <lastmod>2005-01-01</lastmod>
      <changefreq>monthly</changefreq>
      <priority>0.8</priority>
   </url>

   …

</urlset>
  1. The file can have many <url>…</url> item.
  2. Each <url> container represent a file and other info.
  3. The <loc> is a URL of the file.
  4. The <lastmod>, <changefreq>, <priority> are optional.
  5. A sitemap file can list a max of 50k URLs.

The purpose of sitemap file is for web crawlers to easily know all files that exist on your site.

Solution

The general plan is very simple. Here's one way to do it.

  1. Create a new file, insert XML header tags.
  2. Traverse the web root dir. For each file, determine whether it should be listed in the sitemap.
  3. If so, generate the proper URL tag and insert it into the new file.
  4. When done visiting files, insert the XML footer tags. Save the file.
;; -*- coding: utf-8; lexical-binding: t; -*-
;; 2018-09-04

(require 'seq)

(setq xah-web-root-path "/Users/xah/web/" )

(defvar xahsite-external-docs nil "A vector of dir paths.")
(setq  xahsite-external-docs
 [
  "ergoemacs_org/emacs_manual/"
  "xahlee_info/REC-SVG11-20110816/"
  "xahlee_info/clojure-doc-1.8/"
  "xahlee_info/css_2.1_spec/"
  "xahlee_info/css_transitions/"
  "xahlee_info/javascript_ecma-262_5.1_2011/"
  "xahlee_info/javascript_ecma-262_6_2015/"
  "xahlee_info/javascript_es2016/"
  "xahlee_info/javascript_es6/"
  "xahlee_info/jquery_doc/"
  "xahlee_info/node_api/"
  "xahlee_info/ocaml_doc/"
  "xahlee_info/python_doc_2.7.6/"
  "xahlee_info/python_doc_3.3.3/"
  ])

(defun xahsite-generate-sitemap (@domain-name)
  "Generate a sitemap.xml.gz file of xahsite at doc root.
@domain-name must match a existing one.
Version 2018-09-17"
  (interactive
   (list (ido-completing-read "choose:" '( "ergoemacs.org" "wordyenglish.com" "xaharts.org" "xahlee.info" "xahlee.org" "xahmusic.org" "xahsl.org" ))))
  (let (
        ($sitemapFileName "sitemap.xml" )
        ($websiteDocRootPath (concat xah-web-root-path (replace-regexp-in-string "\\." "_" @domain-name "FIXEDCASE" "LITERAL") "/")))
    ;; (print (concat "begin: " (format-time-string "%Y-%m-%dT%T")))
    (let (
          ($filePath (concat $websiteDocRootPath $sitemapFileName ))
          ($sitemapBuffer (generate-new-buffer "sitemapbuff")))
      (with-current-buffer $sitemapBuffer
        (set-buffer-file-coding-system 'unix)
        (insert "<?xml version=\"1.0\" encoding=\"UTF-8\"?>
<urlset xmlns=\"http://www.sitemaps.org/schemas/sitemap/0.9\">
"))
      (mapc
       (lambda ($f)
         (setq $pageMoved-p nil)
         (when (not (or
                     (string-match "/xx" $f) ; ; dir/file starting with xx are not public
                     (string-match "403error.html" $f)
                     (string-match "404error.html" $f)))
           (with-temp-buffer
             (insert-file-contents $f nil 0 100)
             (when (search-forward "page_moved_64598" nil t)
               (setq $pageMoved-p t)))
           (when (not $pageMoved-p)
             (with-current-buffer $sitemapBuffer
               (insert "<url><loc>"
                       "http://" @domain-name "/" (substring $f (length $websiteDocRootPath))
                       "</loc></url>\n"
                       )))))
       (seq-filter
        (lambda (path)
          (not (seq-some
                (lambda (x) (string-match x path))
                xahsite-external-docs
                )))
        (directory-files-recursively $websiteDocRootPath "\\.html$" )))
      (with-current-buffer $sitemapBuffer
        (insert "</urlset>")
        (write-region (point-min) (point-max) $filePath nil 3)
        (kill-buffer ))
      (find-file $filePath)
      )
    ;; (print (concat "done: " (format-time-string "%Y-%m-%dT%T")))
    ))

On a site with 3515 html files (10 times more if counting image files etc), the script takes 5 seconds to run. (e.g. timing based on running it a second time, thus not counting disk reading time. )

(defun xahsite-generate-sitemap-all ()
  "do all
2016-08-15"
  (interactive)
  (xahsite-generate-sitemap "ergoemacs.org" )
  (xahsite-generate-sitemap "wordyenglish.com" )
  (xahsite-generate-sitemap "xaharts.org" )
  (xahsite-generate-sitemap "xahlee.info" )
  (xahsite-generate-sitemap "xahlee.org" )
  (xahsite-generate-sitemap "xahmusic.org" )
  (xahsite-generate-sitemap "xahporn.org" )
  (xahsite-generate-sitemap "xahsl.org"  ))

Elisp Script Examples

  1. Write grep in Elisp
  2. Find String Inside HTML Tag
  3. Validate Matching Brackets
  4. Generate Links Report
  5. Creating a Sitemap
  6. Archive Website For Reader Download
  7. Process File line-by-line
  8. Text-Soup Automation
  9. Split HTML Annotation
  10. Fixing Dead Links
  11. Elisp vs Perl: Validate Local File Links
  12. Transform Page Tag
  13. Transform HTML FAQ Tags
  14. Transform HTML Tags
  15. “figure” to “figcaption”
  16. “span.w” to “b”

If you have a question, put $5 at patreon and message me.
Or Buy Xah Emacs Tutorial
Or buy a nice keyboard: Best Keyboards for Emacs

Emacs

Emacs Lisp

Misc