Emacs: Text Pattern Matching (regex) tutorial


This page is a tutorial on emacs regex.

Emacs's regex is not based on Perl or Python's, but is very similar. In emacs regex, the parenthesis characters () are literal. If you want to capture a pattern, you need to escape the paren like this: \(pattern\).
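A small sketch of the difference between literal and capturing parentheses. The regex is written as an elisp string here, so each regex backslash appears doubled. The sample text is made up for illustration.

```elisp
;; Unescaped parens match literal "(" and ")" characters.
(string-match "(cat)" "my (cat) sat")   ; matches at index 3

;; Escaped parens form a capture group and match no characters themselves.
(let ((text "my cat sat"))
  (when (string-match "\\(c[a-z]+\\)" text)
    (match-string 1 text)))             ; the captured text, "cat"
```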


Common Patterns

.                   Any single character
\.                  One period
[0-9]+              Sequence of digits
[A-Za-z]+           Sequence of letters
[-A-Za-z0-9]+       Sequence of letters, digits, hyphens
[_A-Za-z0-9]+       Sequence of letters, digits, underscores
[-_A-Za-z0-9]+      Sequence of letters, digits, hyphens, underscores
[[:ascii:]]+        Sequence of ASCII chars
[[:nonascii:]]+     Sequence of non-ASCII chars (⁖ Unicode chars)
"\([^"]+\)"         Capture text between double quotes
+                   Match previous pattern 1 or more times
*                   Match previous pattern 0 or more times
?                   Match previous pattern 0 or 1 time
+?                  Match previous pattern 1 or more times, but with minimal match (aka non-greedy)
^…                  Beginning of {line, string, buffer}
…$                  End of {line, string, buffer}
\`…                 Beginning of {string, buffer}
…\'                 End of {string, buffer}
\b                  Word boundary marker
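A few of these patterns in action, as a rough sketch. The regexes are written as elisp strings (regex backslashes doubled, double quotes escaped), and the sample strings are invented for the example.

```elisp
;; "\"\\([^\"]+\\)\"" is the elisp-string form of  "\([^"]+\)"
(let ((text "say \"hello\" and \"bye\""))
  (when (string-match "\"\\([^\"]+\\)\"" text)
    (match-string 1 text)))            ; "hello"

;; + is greedy, +? stops at the first chance.
(let ((text "<a><b>"))
  (list (progn (string-match "<.+>" text) (match-end 0))     ; 6
        (progn (string-match "<.+?>" text) (match-end 0))))  ; 3

;; ^ matches at the start of any line, \` only at the start of the string.
(string-match "^b" "a\nb")             ; 2
(string-match "\\`b" "a\nb")           ; nil
```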

Matching Newline

When using interactive commands for find & replace, emacs won't understand \t, \n. Type 【Ctrl+q Tab】 to insert a literal tab and 【Ctrl+q Ctrl+j】 to insert a literal newline.

(For explanation, see: Emacs's Key Notations Explained (/r ^M C-m RET <return> M- meta)).

In elisp code, you can use \t for tab and \n for newline inside a string. You can use [\t\n ]+ for sequence of {tab, newline, space}. (literal tab/newline will also work.)
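As a minimal sketch, here is [\t\n ]+ used to collapse each run of whitespace into one space. The input string is invented for the example.

```elisp
;; \t and \n are ordinary elisp string escapes, so single backslashes here.
(replace-regexp-in-string "[\t\n ]+" " " "one\t\ttwo \n\n three")
;; returns "one two three"
```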

When a file is opened in Emacs, newline is always \n, regardless of whether your file is from Unix, Windows, or Mac. Do NOT manually do find/replace on newline chars to change a file's newline convention. 〔➤ Emacs: Newline Representations ^M ^J ^L〕

Case Sensitivity

Search is not case sensitive by default, so [a-z] also matches uppercase letters. Case sensitivity is controlled by the variable “case-fold-search”. Call toggle-case-fold-search to toggle it.
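In elisp code, one way to get a predictable result is to bind case-fold-search around the search. A small sketch:

```elisp
(let ((case-fold-search t))
  (string-match "[a-z]+" "HELLO"))     ; 0, letter case is ignored

(let ((case-fold-search nil))
  (string-match "[a-z]+" "HELLO"))     ; nil, no lowercase letter to match
```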

Do not use [A-z], because that'll match some punctuation chars too. Use [A-Za-z].
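To see why, note that the chars [ \ ] ^ _ ` sit between Z and a in ASCII. A quick check with the underscore:

```elisp
(let ((case-fold-search nil))
  (list (string-match "[A-z]" "_")       ; 0, the underscore is inside the range
        (string-match "[A-Za-z]" "_")))  ; nil
;; returns (0 nil)
```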

Perl Regex vs Emacs Regex

Here are some practical major differences.

newline: \n in Perl. In emacs, \n in elisp string, or 【Ctrl+q Ctrl+j】 in interactive command.

For example, Perl's \d+ is emacs's [[:digit:]]+.
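A short sketch of [[:digit:]]+ in elisp, pulling the first run of digits out of a made-up string:

```elisp
(let ((text "order 4521 shipped"))
  (when (string-match "[[:digit:]]+" text)
    (match-string 0 text)))   ; "4521"
```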

IMPORTANT: the meaning of some character classes in emacs depends on the current major mode's syntax table. For example, what chars are considered “word” in [[:word:]] depends on how it's defined in the syntax table of the current major mode. Syntax tables are hard to work with. Best is just to put the chars you want explicitly in your regex, ⁖ [A-Za-z].
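The sketch below shows the same [[:word:]]+ regex giving different results under two syntax tables. The helper my-first-word and both tables are hypothetical, built only for this illustration. The two tables are copies of the standard syntax table that differ only in how they classify the underscore.

```elisp
(defun my-first-word (str table)
  "Return the first [[:word:]]+ match in STR while TABLE is the syntax table.
`my-first-word' is a hypothetical helper for this illustration."
  (with-syntax-table table
    (when (string-match "[[:word:]]+" str)
      (match-string 0 str))))

(let ((symbol-table (copy-syntax-table (standard-syntax-table)))
      (word-table   (copy-syntax-table (standard-syntax-table))))
  (modify-syntax-entry ?_ "_" symbol-table) ; underscore is a symbol char
  (modify-syntax-entry ?_ "w" word-table)   ; underscore is a word char
  (list (my-first-word "foo_bar" symbol-table)
        (my-first-word "foo_bar" word-table)))
;; returns ("foo" "foo_bar")
```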

Interactive Emacs Regex Mode

Emacs has an interactive regex mode. It shows matches as you type. To go into the mode, call regexp-builder. (I don't use this)

Alternatively, call query-replace-regexp to test your pattern. I prefer this.

Regex in Emacs Lisp Code

Regex is used in elisp code too, much as it is in Perl programs.

How to Test Regex in Elisp Code

To test regex in your elisp code, open an empty file and place the regex function at top and the text you want to match below it, like this:

(search-forward-regexp "yourRegex")

whatever text here

Then, put your cursor to the right of the closing parenthesis, then call eval-last-sexp 【Ctrl+x Ctrl+e】. If your regex matches, it'll move cursor to the end of the matched text. If you get a lisp error saying search failed, then your regex didn't match. If you get a lisp syntax error, then you probably screwed up on the backslashes.
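The same test can also be run without moving your cursor around, by searching in a temporary buffer. This is only a sketch: my-regex-first-match is a hypothetical helper name, not something that ships with emacs.

```elisp
(defun my-regex-first-match (regex text)
  "Return the first text matched by REGEX in TEXT, or nil if none.
`my-regex-first-match' is a hypothetical helper, not a built-in."
  (with-temp-buffer
    (insert text)
    (goto-char (point-min))
    ;; The third argument t makes a failed search return nil instead of
    ;; signaling a "Search failed" error.
    (when (search-forward-regexp regex nil t)
      (match-string 0))))

(my-regex-first-match "[0-9]+" "whatever text 42 here")   ; "42"
(my-regex-first-match "[0-9]+" "no digits here")          ; nil
```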

Double Backslash in Lisp Code

In a lisp function that takes a regex string (⁖ search-forward-regexp), you will need to use double backslash. This is because in an elisp string, a literal backslash must itself be written as two backslashes. The lisp reader turns each pair into one backslash, and the resulting string is what gets passed to emacs's regex engine.
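A quick way to see the two layers, using the one-period pattern \. as a small example:

```elisp
;; The elisp string "\\." holds two characters: a backslash and a period.
(length "\\.")                         ; 2

;; The regex engine sees \. and matches one literal period.
(string-match "\\." "a.b")             ; 1

;; With a single backslash, the lisp reader produces just ".",
;; which as a regex matches any character.
(string-match "\." "a.b")              ; 0
```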

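Here's the general rule: write the regex as you would type it in an interactive prompt, then double every backslash and put a backslash in front of every double quote.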

For example, suppose you have this text:

<img src="cat.jpg" alt="my cat" width="795" height="183" />

When you call a command such as query-replace-regexp, you can type the regex in the prompt. Example:

<img src="\([^"]+?\)" alt="\([^"]+?\)" width="\([0-9]+\)" height="\([0-9]+\)" />

But in lisp code, the same regex needs to have many backslash escapes, like this:

(search-forward-regexp "<img src=\"\\([^\"]+?\\)\" alt=\"\\([^\"]+?\\)\" width=\"\\([0-9]+\\)\" height=\"\\([0-9]+\\)\" />")

The following should have single backslash only: {\n, \t, \"}.
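Putting it together, here is a sketch that runs the escaped regex against the sample tag and collects the four captured groups. The function name my-parse-img-tag is hypothetical, made up for this example.

```elisp
(defun my-parse-img-tag (text)
  "Return (SRC ALT WIDTH HEIGHT) captured from an img tag in TEXT, or nil.
`my-parse-img-tag' is a hypothetical helper for this illustration."
  (let ((regex "<img src=\"\\([^\"]+?\\)\" alt=\"\\([^\"]+?\\)\" width=\"\\([0-9]+\\)\" height=\"\\([0-9]+\\)\" />"))
    (when (string-match regex text)
      (list (match-string 1 text)      ; src
            (match-string 2 text)      ; alt
            (match-string 3 text)      ; width
            (match-string 4 text)))))  ; height

(my-parse-img-tag
 "<img src=\"cat.jpg\" alt=\"my cat\" width=\"795\" height=\"183\" />")
;; returns ("cat.jpg" "my cat" "795" "183")
```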

Unicode Representation in String

Emacs Lisp: Unicode Representation in String
