This page shows you how to write an Emacs Lisp command to extract all URLs in an HTML file. If you don't know elisp, first take a look at Emacs Lisp Basics.
Write a command extract-url. When called, all URLs in the text selection will be listed in a separate buffer, one per line.
For example, suppose you have this text:
<a href="../cats.html">cats</a>, <a href="http://en.wikipedia.org/wiki/Idiom">Idiom</a>, <a class="b" href="computing.html"></a>
After calling the command, you'll get a separate buffer showing this text:
../cats.html
http://en.wikipedia.org/wiki/Idiom
computing.html
If there's no text selection, the current paragraph should be used.
There are many ways to code this. Here's one:
(defun extract-url (&optional p1 p2)
  "Return a list of URLs in the region P1 P2.
The region's text should be HTML.
When called interactively, use the text selection as input, or the current text block between empty lines. Output URLs in a buffer named 「*extract URL output*」.
When called in a program, the first URL is the last list element.
WARNING: this function extracts all text of the form 「<a … href=\"…\" …>」 by a simple regex. It does not extract the single quote form 「href='…'」 nor 「src=\"…\"」, nor does it handle other considerations."
  (interactive
   ;; Both branches must return a list of two buffer positions.
   (if (region-active-p)
       (list (region-beginning) (region-end))
     (let ((bds (bounds-of-thing-at-point 'paragraph)))
       (list (car bds) (cdr bds)))))
  (let ((htmlText (buffer-substring-no-properties p1 p2))
        (urlList (list)))
    ;; Search a copy of the text so point in the original buffer stays put.
    (with-temp-buffer
      (insert htmlText)
      (goto-char (point-min))
      (while (re-search-forward "<a.+?href=\"\\([^\"]+?\\)\".+?>" nil t)
        (setq urlList (cons (match-string 1) urlList))))
    ;; Only show the output buffer for interactive calls.
    (when (called-interactively-p 'any)
      (with-output-to-temp-buffer "*extract URL output*"
        (mapc (lambda (ξx) (princ ξx) (terpri))
              (reverse urlList))))
    urlList))
Here's how it works.
First note that the function takes 2 optional arguments: the beginning and end positions of a region in the buffer.
When called interactively, we want these to be the region's beginning and end if the region is active. Otherwise, we use the boundaries of the current paragraph.
Emacs's “interactive” form is how a command gets its arguments when called interactively. Its argument can be a lisp expression that evaluates to a list. Emacs will use the list elements as arguments. Both branches must return a list of 2 positions, because the function body passes p1 and p2 to buffer-substring-no-properties. In our case, it's done by:
(interactive
 (if (region-active-p)
     (list (region-beginning) (region-end))
   (let ((bds (bounds-of-thing-at-point 'paragraph)))
     (list (car bds) (cdr bds)))))
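To see the mechanism in isolation, here is a minimal sketch of a different command that fills its arguments the same way. The name my-region-or-paragraph-length is made up for illustration and is not part of extract-url.

```elisp
;; Hypothetical helper, not part of extract-url. It only demonstrates
;; how the list built inside `interactive' becomes the arguments P1 P2.
(defun my-region-or-paragraph-length (p1 p2)
  "Return the number of characters between positions P1 and P2.
Interactively, use the active region, else the current paragraph."
  (interactive
   (if (region-active-p)
       (list (region-beginning) (region-end))
     (let ((bds (bounds-of-thing-at-point 'paragraph)))
       (list (car bds) (cdr bds)))))
  (let ((len (- p2 p1)))
    ;; Report in the echo area only when invoked as a command.
    (when (called-interactively-p 'any)
      (message "%d characters" len))
    len))
```

When called from lisp code, such as (my-region-or-paragraph-length 1 6), the “interactive” form is skipped and the arguments are used as given.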
To output to a separate buffer, we use with-output-to-temp-buffer and princ, like this:
(with-output-to-temp-buffer "*extract URL output*" … (princ urlStr) … )
Inside (with-output-to-temp-buffer ‹buffer› …), the printing functions print to that buffer. The printing function we use is princ, which prints lisp objects in a human readable form (for example, a string is printed without its surrounding quotes).
〔☛ Emacs Lisp: print, princ, prin1, format, message〕
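As a small standalone sketch of this output technique, the following prints each element of a list on its own line. The function name my-show-lines and the buffer name are assumptions chosen for this example only.

```elisp
;; Hypothetical demo function; the buffer name is arbitrary.
(defun my-show-lines (items)
  "Display each element of ITEMS on its own line in a separate buffer."
  (with-output-to-temp-buffer "*my demo output*"
    (dolist (item items)
      (princ item)   ; print without quotes
      (terpri))))    ; end the line

(my-show-lines '("../cats.html" "computing.html"))
```

Using dolist here instead of mapc with a lambda is only a matter of taste. Both walk the list and call princ on each element.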
To extract the URLs, there are many approaches. Here, we use a simple regex.
(while (re-search-forward "<a.+?href=\"\\([^\"]+?\\)\".+?>" nil t)
  (setq urlList (cons (match-string 1) urlList)))
The regex without elisp string escapes is this:
<a.+?href="\([^"]+?\)".+?>
.+? means one or more chars, any char except newline, matched as few times as possible. "\([^"]+?\)" captures the string between the double quotes. 〔☛ Emacs: Text Pattern Matching (regex) tutorial〕
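You can try the pattern on a string before using it in a buffer search. This is a minimal sketch. The helper name my-first-href is made up for illustration, and it uses string-match rather than re-search-forward.

```elisp
;; Hypothetical helper for experimenting with the regex on a string.
(defun my-first-href (html)
  "Return the first double-quoted href value of an <a> tag in HTML, or nil."
  (let ((regex "<a.+?href=\"\\([^\"]+?\\)\".+?>"))
    (when (string-match regex html)
      ;; For string-match, match-string needs the string as 2nd argument.
      (match-string 1 html))))

(my-first-href "<a class=\"b\" href=\"computing.html\"></a>")
```

A string with single-quoted href='…' makes it return nil, which is the limitation noted in the docstring.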
In this line:
(setq urlList (cons (match-string 1) urlList))
The (match-string 1) gets the captured string. Then we prepend it to the list. Because each new URL goes to the front, the list ends up in reverse order, which is why we call reverse before printing.
Note that this solution is very simple and practically useful, but isn't a fully correct solution. For example, it does not get a URL that's enclosed by 'single straight quotes'. Nor does it get URLs inside “img” tags or JavaScript tags with the “src” attribute src="…".
You can easily modify it by adding more while blocks, changing the double quote to single, and also “href” to “src”. However, with that approach the URLs won't be in the order they appear in the text.
Can you fix it?
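Here's one possible sketch of an answer, not the only one. It uses a single regex with alternation, so one pass over the text finds href and src values in both quote styles and keeps them in the order they appear. The function name my-extract-url-in-order is made up for this example. Unlike extract-url, it takes a string and returns the URLs first to last. It is still a regex hack: it also matches href= or src= text outside of tags, and it ignores unquoted attribute values.

```elisp
;; Hypothetical variant of the extraction loop, for illustration only.
(defun my-extract-url-in-order (html)
  "Return a list of href and src values in string HTML, in text order.
Handles double-quoted and single-quoted attribute values."
  (let ((url-list '())
        ;; Group 1 captures a double-quoted value, group 2 a single-quoted
        ;; one. The leading whitespace avoids matching e.g. data-href.
        (regex (concat "[ \t\n]\\(?:href\\|src\\)="
                       "\\(?:\"\\([^\"]+\\)\"\\|'\\([^']+\\)'\\)")))
    (with-temp-buffer
      (insert html)
      (goto-char (point-min))
      (while (re-search-forward regex nil t)
        ;; Only one of the two groups matched; take whichever is non-nil.
        (push (or (match-string 1) (match-string 2)) url-list)))
    (nreverse url-list)))

(my-extract-url-in-order
 "<a href=\"../cats.html\">cats</a> <img src='pic.png'> <a class=\"b\" href=\"computing.html\"></a>")
```

To use it from the command, you could pass it the result of (buffer-substring-no-properties p1 p2) and print the returned list without calling reverse.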