
Emacs Lisp: Writing a Command to Extract URL

By Xah Lee

This page shows you how to write an Emacs Lisp command to extract all URLs in an HTML file. If you don't know elisp, first take a look at Emacs Lisp Basics.

Problem Description

Write a command extract-url. When called, all URLs in a text selection will be listed in a separate pane, one per line.

For example, suppose you have this text:

<a href="../cats.html">cats</a>, <a href="http://en.wikipedia.org/wiki/Idiom">Idiom</a>, <a class="b" href="computing.html"></a>

After calling the command, you'll get a separate buffer showing this text:

../cats.html
http://en.wikipedia.org/wiki/Idiom
computing.html

If there's no text selection, the current paragraph should be used.

Solution

There are many ways to code this. Here's one:

(defun extract-url (&optional p1 p2)
  "Return a list of URLs in the region between P1 and P2.
The region's text should be HTML.

When called interactively, use the text selection as input, or else the current text block between empty lines. Output the URLs in a buffer named 「*extract URL output*」.

When called in a program, the returned list is in reverse order: the first URL found is the last list element.

WARNING: this function extracts all text of the form 「<a … href=\"…\" …>」 by a simple regex. It does not handle the single-quote form 「href='…'」, nor 「src=\"…\"」, nor other variations."
(interactive
 (if (region-active-p)
     ;; pass the region's two boundary positions, not its text
     (list (region-beginning) (region-end))
   (let ((bds (bounds-of-thing-at-point 'paragraph)))
     (list (car bds) (cdr bds)) ) ) )
  (let ((htmlText (buffer-substring-no-properties p1 p2)) (urlList (list)))
    (with-temp-buffer
      (insert htmlText)
      (goto-char (point-min))
      (while (re-search-forward "<a.+?href=\"\\([^\"]+?\\)\".+?>" nil t)
        (setq urlList (cons (match-string 1) urlList))
        ))

    (when (called-interactively-p 'any)
        (with-output-to-temp-buffer "*extract URL output*"
          (mapc (lambda (ξx) (princ ξx) (terpri) ) (reverse urlList))
          )
      )
    urlList
    ))

Here's how it works.

Using “interactive” to get Arguments

First note that the function takes 2 optional arguments: the beginning and end positions of a span of text in the buffer.

When called interactively, we want these two positions to be the region's boundaries if the region is active. Otherwise, we want the boundaries of the current paragraph.

Emacs's “interactive” form is how a command gets its arguments when it is called interactively. The argument to “interactive” can be a Lisp expression that evaluates to a list. Emacs will use the list elements as the command's arguments. In our case, it's done by:

(interactive
 (if (region-active-p)
     ;; pass the region's two boundary positions, not its text
     (list (region-beginning) (region-end))
   (let ((bds (bounds-of-thing-at-point 'paragraph)))
     (list (car bds) (cdr bds)) ) ) )
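
To see the same pattern in isolation, here is a minimal sketch. The names my-region-or-paragraph-bounds and my-count-chars are hypothetical, made up for illustration, and are not part of extract-url. The first function builds the argument list and the second hands it to “interactive”, so the command receives two buffer positions either way.

```elisp
;; Hypothetical helper (not from the original command): compute the
;; (BEGIN END) list that `interactive' will pass to the command.
(defun my-region-or-paragraph-bounds ()
  "Return a list (BEGIN END) for the active region, or else the current paragraph."
  (if (use-region-p)
      (list (region-beginning) (region-end))
    (let ((bds (bounds-of-thing-at-point 'paragraph)))
      (list (car bds) (cdr bds)))))

;; Hypothetical command: `interactive' evaluates the expression and
;; uses the resulting list elements as BEGIN and END.
(defun my-count-chars (begin end)
  "Report the number of characters between positions BEGIN and END."
  (interactive (my-region-or-paragraph-bounds))
  (message "%d characters" (- end begin)))
```

When called from Lisp code, the “interactive” form is ignored and the caller supplies the positions directly, for example (my-count-chars 1 6).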

Output to Separate Buffer

To output to a separate buffer, we use with-output-to-temp-buffer and princ, like this:

(with-output-to-temp-buffer "*extract URL output*"
…
(princ urlStr)
…
 )

Inside (with-output-to-temp-buffer ‹buffer› …), printing functions print to that buffer. The printing function we use is princ, which prints Lisp objects in a human-readable form. 〔☛ Emacs Lisp: print, princ, prin1, format, message〕
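
As a small self-contained illustration of this output technique, the sketch below prints a list of strings, one per line. The function name my-show-lines and the buffer name 「*my lines*」 are assumptions chosen for the example.

```elisp
;; Hypothetical example: print each string on its own line in a
;; separate buffer, using the same calls as extract-url.
(defun my-show-lines (lines)
  "Print each string in LINES on its own line in the buffer *my lines*."
  (with-output-to-temp-buffer "*my lines*"
    (dolist (line lines)
      (princ line)   ; print the string without surrounding quotes
      (terpri))))    ; end the line with a newline

(my-show-lines '("../cats.html" "computing.html"))
```

Replacing princ with prin1 in this sketch would print each string with its double quotes, which is usually not what you want for output meant to be read by a person.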

Extracting URL

To extract URL, there are many approaches. Here, we use a simple regex.

(while (re-search-forward "<a.+?href=\"\\([^\"]+?\\)\".+?>" nil t)
        (setq urlList (cons (match-string 1) urlList))
        )

The regex without elisp string escapes is this:

<a.+?href="\([^"]+?\)".+?>

〔☛ Emacs: Text Pattern Matching (regex) tutorial〕

In this line:

(setq urlList (cons (match-string 1) urlList))

The (match-string 1) gets the captured string. Then we prepend it to a list.
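
The search loop can also be tried on its own, outside the command. The following is a minimal sketch that uses the same regex on a string. The function name my-collect-hrefs is hypothetical, and unlike extract-url it reverses the list at the end so the result is in document order.

```elisp
;; Hypothetical helper: run the same search loop over a string.
(defun my-collect-hrefs (html)
  "Return the href values of <a> tags in string HTML, in document order."
  (let ((urls '()))
    (with-temp-buffer
      (insert html)
      (goto-char (point-min))
      (while (re-search-forward "<a.+?href=\"\\([^\"]+?\\)\".+?>" nil t)
        ;; `push' prepends, same as (setq urls (cons … urls))
        (push (match-string 1) urls)))
    ;; prepending builds the list backwards, so reverse it once at the end
    (nreverse urls)))

(my-collect-hrefs
 "<a href=\"../cats.html\">cats</a>, <a class=\"b\" href=\"computing.html\"></a>")
;; returns ("../cats.html" "computing.html")
```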

Simple Solution

Note that this solution is very simple and practically useful, but it isn't fully correct. For example, it does not get a URL that's enclosed in 'single straight quotes'. Nor does it get URLs inside “img” or “script” tags, which use the “src” attribute src="…".

You can easily modify it by adding more while blocks, changing the double quote to a single quote and “href” to “src”. However, with that approach the URLs won't be in the order they appear in the text.

Can you fix it?
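
Here is one possible direction, offered as a sketch rather than a complete answer. It uses a single regex with alternatives, so one pass over the text finds both “href” and “src” values in either quote style, and the results stay in document order. The function name my-extract-links is hypothetical. The sketch assumes the attribute name is preceded by whitespace and has no spaces around “=”, and it still does not handle unquoted attribute values.

```elisp
;; Hypothetical sketch: one regex, one pass, document order preserved.
(defun my-extract-links (html)
  "Return href and src values found in string HTML, in document order.
Handles both double-quoted and single-quoted attribute values."
  (let ((urls '()))
    (with-temp-buffer
      (insert html)
      (goto-char (point-min))
      (while (re-search-forward
              ;; group 1: value in double quotes; group 2: value in single quotes
              "[ \t\n]\\(?:href\\|src\\)=\\(?:\"\\([^\"]+\\)\"\\|'\\([^']+\\)'\\)"
              nil t)
        ;; only one of the two groups takes part in any given match
        (push (or (match-string 1) (match-string 2)) urls)))
    (nreverse urls)))

(my-extract-links
 "<a href='a.html'>a</a> <img src=\"b.png\"> <a class=\"c\" href=\"d.html\">d</a>")
;; returns ("a.html" "b.png" "d.html")
```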
