Emacs: Text Pattern Matching (regex) tutorial

Xah Lee

This page is a tutorial on emacs regex.

Emacs's regex syntax is not based on Perl's or Python's, but it is very similar. In emacs regex, the parenthesis characters ( and ) are literal. To capture a pattern, escape the parens with backslashes, like this: \(myPattern\).
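As a small sketch of the difference between literal and escaped parens, the elisp below uses the standard functions string-match and match-string. The helper name my-regex-demo-capture and the sample strings are made up for illustration. Inside a lisp string, each regex backslash is written twice.

```elisp
;; Unescaped parens match literal paren characters in the text.
(string-match "(cat)" "a (cat) sat")   ; => 2, the index where the match starts

(defun my-regex-demo-capture (text)
  "Return the letters after name= in TEXT, or nil.  Hypothetical helper."
  ;; Escaped parens \( \) form a capture group, numbered from 1.
  (when (string-match "name=\\([a-z]+\\)" text)
    (match-string 1 text)))

(my-regex-demo-capture "name=cat")     ; => "cat"
```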

Common Patterns

Here are some common patterns:

Pattern           Matches
.                 Any single character except newline
\.                One period
[0-9]+            Sequence of digits
[A-Za-z]+         Sequence of letters
[-A-Za-z0-9]+     Sequence of letters, digits, hyphens
[_A-Za-z0-9]+     Sequence of letters, digits, underscores
[-_A-Za-z0-9]+    Sequence of letters, digits, hyphens, underscores
[[:ascii:]]+      Sequence of ASCII chars
[[:nonascii:]]+   Sequence of non-ASCII chars (e.g. Unicode chars outside ASCII)
[\t\n ]+          Sequence of {tab, newline, space}. In elisp code only. For interactive regex commands, type the literal characters; insert a newline with 【Ctrl+q Ctrl+j】.

Pattern           Matches
"\([^"]+\)"       Capture text between double quotes (greedy; matches as far to the right as possible, up to the closing quote)
“\([^”]+\)”       Capture text between curly double quotes
(\([^)]+\))       Capture text between parentheses

Pattern           Matches
+                 Match previous pattern 1 or more times
*                 Match previous pattern 0 or more times
?                 Match previous pattern 0 or 1 time
+?                Match previous pattern 1 or more times, but with minimal match (aka non-greedy)

Pattern           Matches
^…                Beginning of {line, string, buffer}
…$                End of {line, string, buffer}
\`…               Beginning of {string, buffer}
…\'               End of {string, buffer}
\b                Word boundary marker
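A few of the patterns above can be tried on plain strings. This is only a sketch: my-first-match is a hypothetical helper written for this page, and the sample strings are invented. Since this is lisp code, regex backslashes are doubled.

```elisp
(defun my-first-match (regex text)
  "Return the first substring of TEXT that matches REGEX, or nil.
Hypothetical helper for trying out patterns."
  (when (string-match regex text)
    (match-string 0 text)))

(my-first-match "[0-9]+" "room 42, floor 7")          ; => "42"
(my-first-match "[-A-Za-z0-9]+" "my-file-2 is here")  ; => "my-file-2"
(my-first-match "<.+>" "<b>bold</b>")    ; greedy     => "<b>bold</b>"
(my-first-match "<.+?>" "<b>bold</b>")   ; non-greedy => "<b>"
(my-first-match "\\`[0-9]+\\'" "123")    ; whole string is digits => "123"
(my-first-match "\\`[0-9]+\\'" "123a")   ; => nil
```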

Matching Newline

• When using interactive commands for find/replace, emacs won't understand \t, \n. To enter a Tab character, press 【Ctrl+q Tab ↹】. To enter a new line, press 【Ctrl+q Ctrl+j】. (For explanation, see: Emacs's Key Notations Explained (/r, ^M, C-m, RET, <return>, M-, meta)).

• When a file is opened in Emacs, newline is always \n, regardless of whether your file is from Unix, Windows, or Mac. Do NOT manually do find/replace on newline chars to change a file's newline convention. 〔☛ Emacs: Newline Representations ^M ^J ^L〕
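In elisp code, by contrast, \t and \n can be written directly inside the string. A minimal sketch, using the standard function replace-regexp-in-string; the helper name my-collapse-whitespace and the sample text are made up for illustration:

```elisp
(defun my-collapse-whitespace (text)
  "Return TEXT with each run of tabs, newlines, spaces replaced by one space.
Hypothetical helper."
  ;; \t and \n are string escapes: the lisp reader turns them into a real
  ;; tab and a real newline before the regex engine sees the pattern.
  (replace-regexp-in-string "[\t\n ]+" " " text))

(my-collapse-whitespace "one\t\ttwo\n\n  three")   ; => "one two three"
```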

Case Sensitivity

By default, matching is not case sensitive: [a-z] also matches uppercase letters. Case sensitivity is controlled by the variable “case-fold-search”. Call toggle-case-fold-search to toggle it.

Do not use [A-z]. It'll match punctuation chars between ASCII “Z” and “a”. Use [A-Za-z].
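In elisp code, the same variable can be bound with let around a search. The sketch below illustrates both points; my-match-p is a hypothetical helper and the sample strings are invented.

```elisp
(defun my-match-p (regex text fold)
  "Return non-nil if REGEX matches TEXT with `case-fold-search' bound to FOLD.
Hypothetical helper."
  (let ((case-fold-search fold))
    (string-match regex text)))

(my-match-p "[a-z]+" "HELLO" t)     ; => 0, case is ignored
(my-match-p "[a-z]+" "HELLO" nil)   ; => nil, case sensitive
;; The underscore sits between Z and a in ASCII, so [A-z] matches it.
(my-match-p "[A-z]" "_" nil)        ; => 0
(my-match-p "[A-Za-z]" "_" nil)     ; => nil
```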

Perl Regex vs Emacs Regex

Here are some practical major differences.

           Perl     Emacs
Capture    (…)      \(…\)
digit      \d       [[:digit:]]
word       \w       [[:word:]]
space      \s       [[:space:]]
newline    \n       \n in elisp string, or 【Ctrl+q Ctrl+j】 in interactive command

For example, Perl's \d+ is emacs's [[:digit:]]+.

Also, the meaning of a character class in emacs may depend on the current major mode's syntax table. For example, which chars count as “word” in [[:word:]] depends on how it's defined in the syntax table of the current major mode. Syntax tables are hard to work with. It is best to put the chars you want explicitly in your regex, e.g. [A-Za-z].
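As an illustration, here is a sketch that collects every match of a character class in a string. The helper my-all-matches is hypothetical, and it assumes the regex never matches the empty string (otherwise the loop would not advance).

```elisp
(defun my-all-matches (regex text)
  "Return a list of all non-overlapping matches of REGEX in TEXT.
Hypothetical helper.  Assumes REGEX never matches the empty string."
  (let ((start 0)
        (result '()))
    (while (string-match regex text start)
      (push (match-string 0 text) result)
      (setq start (match-end 0)))
    (nreverse result)))

;; Perl's \d+ written the emacs way.
(my-all-matches "[[:digit:]]+" "width 795, height 183")   ; => ("795" "183")
;; An explicit class does not depend on any syntax table.
(my-all-matches "[A-Za-z]+" "my cat_2")                   ; => ("my" "cat")
```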

Interactive Emacs Regex Mode

Emacs has an interactive regex mode. It shows matches as you type. To go into the mode, call regexp-builder. (I don't use this)

Alternatively, call query-replace-regexp to test your pattern. I prefer this.

Regex in Emacs Lisp Code

Regex is used in elisp code too, much as it is in Perl code.

How to Test Regex in Elisp Code

To test regex in your elisp code, open an empty file and place the regex function at top and the text you want to match below it, like this:

(search-forward-regexp "yourRegex")

whatever text here

Then put your cursor to the right of the closing parenthesis and call eval-last-sexp 【Ctrl+x Ctrl+e】. If your regex matches, the cursor moves to the end of the matched text. If you get a lisp error saying search failed, your regex didn't match. If you get a lisp syntax error, you probably made a mistake with the backslashes.
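The same kind of check can also be done without touching a file, by running the search in a temporary buffer. This is a sketch only: my-regex-test is a hypothetical helper, not part of emacs.

```elisp
(defun my-regex-test (regex text)
  "Search for REGEX in TEXT inside a temporary buffer.
Return the matched text, or nil if there is no match.  Hypothetical helper."
  (with-temp-buffer
    (insert text)
    (goto-char (point-min))
    ;; Third argument t: return nil on failure instead of signaling an error.
    (when (search-forward-regexp regex nil t)
      (match-string 0))))

(my-regex-test "wh[a-z]+" "whatever text here")   ; => "whatever"
(my-regex-test "[0-9]+" "whatever text here")     ; => nil
```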

Double Backslash in Lisp Code

In a lisp function that takes a regex string (e.g. search-forward-regexp), you need to use double backslashes. This is because in an elisp string a backslash must itself be escaped with a backslash. The lisp reader first turns each \\ into a single \, and the resulting string is then passed to emacs's regex engine.
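One way to see the two layers is to look at the string the lisp reader actually produces. A small sketch with invented sample text:

```elisp
;; Written with double backslashes, the string holds the 6 characters \(a+\)
(length "\\(a+\\)")                  ; => 6
(string-match "\\(a+\\)" "baaad")    ; => 1, the run of a's is captured

;; With single backslashes the reader drops them, leaving the 4 characters (a+)
;; which the regex engine treats as literal parens around a+.
(length "\(a+\)")                    ; => 4
(string-match "\(a+\)" "baaad")      ; => nil, the text has no paren chars
```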

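Here's the general rule: any backslash that is part of the regex syntax, such as \( or \b, is typed as two backslashes inside a lisp string.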

For example, suppose you have this text:

<img src="cat.jpg" alt="my cat" width="795" height="183" />

When you call a command such as query-replace-regexp, you can type the regex in the prompt. Example:

<img src="\([^"]+?\)" alt="\([^"]+?\)" width="\([0-9]+\)" height="\([0-9]+\)" />

But in lisp code, the same regex needs to have many backslash escapes, like this:

(search-forward-regexp
"<img src=\"\\([^\"]+?\\)\" alt=\"\\([^\"]+?\\)\" width=\"\\([0-9]+\\)\" height=\"\\([0-9]+\\)\" />" )

The following take a single backslash only, because they are escapes of the lisp string itself and not of the regex: {\n, \t, \"}.
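To show what the captures are for, here is a sketch that pulls the four values out of the img tag above with the same regex. The helper name my-img-attributes is made up, and it assumes the attributes appear in exactly this order and spacing.

```elisp
(defun my-img-attributes (tag)
  "Return (SRC ALT WIDTH HEIGHT) parsed from the img TAG string, or nil.
Hypothetical helper.  Assumes the attribute order shown in the example."
  (when (string-match
         "<img src=\"\\([^\"]+?\\)\" alt=\"\\([^\"]+?\\)\" width=\"\\([0-9]+\\)\" height=\"\\([0-9]+\\)\" />"
         tag)
    (list (match-string 1 tag)
          (match-string 2 tag)
          (string-to-number (match-string 3 tag))
          (string-to-number (match-string 4 tag)))))

(my-img-attributes
 "<img src=\"cat.jpg\" alt=\"my cat\" width=\"795\" height=\"183\" />")
;; => ("cat.jpg" "my cat" 795 183)
```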
