Emacs: Regex Tutorial

By Xah Lee.

Regex lets you find text by pattern, such as any character that is repeated exactly twice. Regex is like a shell wildcard, but more flexible.

Regex Commands

The most commonly used commands that use regex are:

What's Regex? What's Wrong with Wildcards * ?

In a shell, you can use wildcards to match file names, for example:

This system is called unix “glob”.

Unix glob is simple and easy to understand.
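
To get a feel for glob matching from inside Emacs, here is a minimal sketch. It relies on the built-in function wildcard-to-regexp, which turns a shell wildcard pattern into an Emacs regex. The helper name my-glob-match-p is made up for illustration.

```elisp
;; my-glob-match-p is a hypothetical helper, not part of Emacs.
(defun my-glob-match-p (glob file-name)
  "Return t if FILE-NAME matches the shell wildcard pattern GLOB."
  (let ((case-fold-search nil))         ; compare letter case exactly
    (if (string-match-p (wildcard-to-regexp glob) file-name) t nil)))

(my-glob-match-p "*.png" "cat.png")     ; t
(my-glob-match-p "*.png" "house.jpg")   ; nil
(my-glob-match-p "?at.png" "cat.png")   ; t, ? stands for any one character
```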

But what if you want to match any file name that has a run of 5 to 10 repeated B?

Or, for example, in a HTML file, you have lines like these:

<img src="img/cat.png" alt="my cat" />
<img src="img/dog.png" alt="her dog" />
<img src="img/house.jpg" alt="friend's house" />

You want to get the file names from all the image tags.

You cannot do that with glob. You need a more powerful pattern language: regex.
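
Both tasks can be sketched in Emacs Lisp. The function names below are made up for illustration, and the code assumes the text is given as a string. Note that inside a Lisp string every backslash of the regex is doubled, and \{5,10\} means “repeat 5 to 10 times”.

```elisp
;; Hypothetical helpers, for illustration only.
(defun my-has-b-run-p (file-name)
  "Return t if FILE-NAME has a run of 5 to 10 consecutive capital B."
  (let ((case-fold-search nil))
    ;; The run must be bounded by a non-B character or by the string edges,
    ;; so that a run of 11 or more B does not count.
    (if (string-match-p
         "\\(?:\\`\\|[^B]\\)B\\{5,10\\}\\(?:[^B]\\|\\'\\)"
         file-name)
        t nil)))

(defun my-image-file-names (html)
  "Return a list of the src values of the img tags in the string HTML."
  (let ((start 0) (result '()))
    (while (string-match "<img src=\"\\([^\"]+\\)\"" html start)
      (push (match-string 1 html) result) ; text captured by \( \)
      (setq start (match-end 0)))         ; continue after this match
    (nreverse result)))

(my-has-b-run-p "aBBBBBc.txt")  ; t
(my-has-b-run-p "BBBB.txt")     ; nil
(my-image-file-names "<img src=\"img/cat.png\" alt=\"my cat\" />")
;; ("img/cat.png")
```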

(Note: Regex is short for Regular Expression. It came from formal language theory. Regular expressions describe exactly the “regular languages”: roughly, sets of strings that can be recognized without keeping track of arbitrarily deep nesting, such as balanced parentheses.)

Emacs Regex Syntax

Here are commonly used patterns.

Pattern            Matches
.                  any single character except newline ("\n")
\.                 one period
[0-9]+             one or more digits
[^0-9]+            one or more non-digit characters
[A-Za-z]+          one or more letters
[-A-Za-z0-9]+      one or more {letter, digit, hyphen}
[_A-Za-z0-9]+      one or more {letter, digit, underscore}
[-_A-Za-z0-9]+     one or more {letter, digit, hyphen, underscore}
[[:ascii:]]+       one or more ASCII chars (codepoint 0 to 127, inclusive)
[[:nonascii:]]+    one or more non-ASCII characters (for example, Unicode characters)
[\n\t ]+           one or more {newline character, tab, space} (in Lisp code; for interactive use, see Newline and Tab below)

Pattern            Matches
"\([^"]+\)"        capture text between double quotes

repetition

Pattern            Matches
+                  match previous pattern 1 or more times. For example, a+ means 1 or more occurrences of “a”. [0-9]+ means 1 or more occurrences of a digit.
*                  match previous pattern 0 or more times
?                  match previous pattern 0 or 1 time
+?                 match previous pattern 1 or more times, but with minimal match (aka non-greedy)

boundary anchors

Pattern            Matches
^…                 beginning of {line, string, buffer}
…$                 end of {line, string, buffer}
\`…                beginning of {string, buffer}
…\'                end of {string, buffer}
\b                 word boundary marker

A Unicode character can be used literally. For example, → matches the right arrow character.

You can also represent any character by a coded syntax such as \u2192. See: Elisp: Unicode Representation in String.

For the complete list of regex syntax, see: (info "(elisp) Syntax of Regexps")
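
Here is a small sketch for trying some of these patterns in Emacs Lisp. The helper my-first-match is made up for illustration. It uses the built-in string-match and match-string. In a Lisp string, each backslash of the regex is written twice.

```elisp
;; my-first-match is a hypothetical helper, not part of Emacs.
(defun my-first-match (regex text &optional group)
  "Return the part of TEXT matched by REGEX, or nil if there is no match.
GROUP is the capture group number. It defaults to 0, the whole match."
  (when (string-match regex text)
    (match-string (or group 0) text)))

(my-first-match "[0-9]+" "room 42, floor 3")            ; "42"
(my-first-match "\"\\([^\"]+\\)\"" "alt=\"my cat\"" 1)  ; "my cat"
(my-first-match "a+" "caaat")                           ; "aaa", greedy
(my-first-match "a+?" "caaat")                          ; "a", non-greedy
(my-first-match "\\`[A-Za-z]+" "hello world")           ; "hello"
(my-first-match "[A-Za-z]+\\'" "hello world")           ; "world"
```

To try a line, put the cursor after its closing parenthesis and press Ctrl+x Ctrl+e.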

Example Regex Usage

  1. Open a big file with many lines.
  2. Alt+x list-matching-lines
  3. Type th.t
  4. It will list all lines that contain a match for the regex pattern th.t, such as lines containing “that”. (It does not match “this”, because the fourth character must be “t”.)
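
The same idea can be sketched in Emacs Lisp on a plain string. The function my-matching-lines is made up for illustration. It is only a rough imitation of what list-matching-lines does in a buffer.

```elisp
;; my-matching-lines is a hypothetical helper, not part of Emacs.
(defun my-matching-lines (regex text)
  "Return a list of the lines in TEXT that contain a match for REGEX."
  (let ((result '()))
    (dolist (line (split-string text "\n"))
      (when (string-match-p regex line)
        (push line result)))
    (nreverse result)))

(my-matching-lines "th.t" "that one\nthe cat\nwith two")
;; ("that one" "with two")
```

The line “with two” is listed because the dot also matches a space: “th t”.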

Newline and Tab

When using interactive commands, Emacs won't understand \n or \t as newline or tab.

For example, suppose you want to match 2 lines, where the first line starts with A and the second line starts with B:

A, something
B, other thing

You can use this regex:

A.+
B.+

But to type the line break between the two lines, you need to press Ctrl+q Ctrl+j. To type a tab, press Ctrl+q Tab.

(For explanation of why Ctrl+q Ctrl+j is for return, see: Emacs's Key Syntax Explained).
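
In Emacs Lisp code the situation is different: inside a Lisp string, \n is read as a newline and \t as a tab, so they can be written directly in the regex. A minimal sketch, with a made-up function name:

```elisp
;; my-a-line-then-b-line-p is a hypothetical helper, not part of Emacs.
(defun my-a-line-then-b-line-p (text)
  "Return t if TEXT has a line starting with A followed by a line starting with B."
  (let ((case-fold-search nil))
    ;; "\n" in the Lisp string is one newline character.
    (if (string-match-p "^A.+\nB.+" text) t nil)))

(my-a-line-then-b-line-p "A, something\nB, other thing")  ; t
(my-a-line-then-b-line-p "A, something\nC, other thing")  ; nil
```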

Case Sensitivity

By default, Emacs's interactive search commands ignore letter case unless the pattern contains capital letters. That is, dragon will match “dragon” and “Dragon” and “DRAGON” and “draGON”. But Dragon will match only “Dragon”.

Case sensitivity is controlled by the variable case-fold-search. Alt+x toggle-case-fold-search to toggle it. Remember to toggle it back when you are done, because case-fold-search is also used by isearch and basically all search and find/replace commands.

Do not use [A-z], because that'll match some punctuation chars too. Use [A-Za-z].
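
In Emacs Lisp code, a common way to control this is to bind case-fold-search with let, so the change is undone automatically. Here is a minimal sketch with a made-up helper. Note that in Lisp functions such as string-match-p, a capital letter in the pattern does not switch off case folding. Only the variable matters.

```elisp
;; my-match-p is a hypothetical helper, not part of Emacs.
(defun my-match-p (regex text fold)
  "Return t if REGEX matches somewhere in TEXT.
If FOLD is non-nil, ignore letter case."
  (let ((case-fold-search fold))
    (if (string-match-p regex text) t nil)))

(my-match-p "dragon" "DRAGON" t)    ; t
(my-match-p "dragon" "DRAGON" nil)  ; nil
(my-match-p "Dragon" "dragon" t)    ; t, capital D does not disable folding here

;; [A-z] covers codepoints 65 to 122, which includes [ \ ] ^ _ `
(my-match-p "[A-z]" "_" nil)        ; t
(my-match-p "[A-Za-z]" "_" nil)     ; nil
```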

JavaScript vs Emacs Regex

Regex syntax in most languages, such as JavaScript, Python, Ruby, Perl, is similar. Emacs regex is different from them.

Here are the major practical differences.

Major differences between Emacs regex and other languages' regex

              JavaScript    Emacs Lisp
Capture       (…)           \(…\)
digit         \d            [[:digit:]]
word          \w            [[:word:]]
whitespace    \s            [[:space:]]

For example, JavaScript's \d+ is emacs's [[:digit:]]+.
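
As a small sketch of these differences, here is roughly how JavaScript's (\d+)-(\d+) could be written in Emacs Lisp. The function name is made up for illustration.

```elisp
;; my-year-and-month is a hypothetical helper, not part of Emacs.
(defun my-year-and-month (text)
  "Return a list of the two numbers around the first digits-hyphen-digits in TEXT.
Return nil if there is no match."
  ;; JavaScript: /(\d+)-(\d+)/
  (when (string-match "\\([[:digit:]]+\\)-\\([[:digit:]]+\\)" text)
    (list (match-string 1 text) (match-string 2 text))))

(my-year-and-month "backup 2024-06.zip")  ; ("2024" "06")

;; In Emacs regex, a plain parenthesis matches itself.
(string-match-p "(x)" "f(x)")             ; 1, the position of the match
```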

Note: the meaning of a character class in Emacs depends on the current buffer's syntax table. For example, what chars are considered “word” in [[:word:]] depends on how it is defined in the syntax table of the current major mode. But practically, it's what you'd expect.

For an example showing the difference, see: Elisp: Regex Patterns and Syntax Table.

Syntax tables are hard to work with, and a regex that depends on them may be unpredictable. Best is just to put the chars you want explicitly in your regex. For example, use [A-Za-z] instead of [[:word:]], unless you also need to match Chinese characters or Russian characters etc.
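
To see the dependency on the syntax table, here is a minimal sketch. It assumes the standard syntax table, where the underscore has symbol syntax and not word syntax. The function name is made up for illustration.

```elisp
;; my-underscore-is-word-p is a hypothetical helper, not part of Emacs.
(defun my-underscore-is-word-p (make-word)
  "Return t if [[:word:]] matches the underscore character.
If MAKE-WORD is non-nil, first give underscore word syntax in a
temporary syntax table."
  (let ((table (make-syntax-table)))  ; inherits from the standard syntax table
    (when make-word
      (modify-syntax-entry ?_ "w" table))
    (with-syntax-table table
      (if (string-match-p "[[:word:]]" "_") t nil))))

(my-underscore-is-word-p nil)  ; nil with the standard syntax table
(my-underscore-is-word-p t)    ; t after the change
```

The same regex gives a different answer, depending only on the syntax table in effect.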

Interactive Emacs Regex Mode

Emacs has an interactive regex mode. It shows matches as you type. To go into the mode, Alt+x regexp-builder. (I don't like this.)

Alternatively, Alt+x query-replace-regexp to test your pattern. I prefer this.

Regex in Emacs Lisp Code

Elisp: Regex Tutorial

Reference

(info "(emacs) Regexps")
