Problems of Calling Unix grep in Emacs


This page describes the problems of calling unix grep in emacs, and why an emacs lisp version (or a Perl/Python/Ruby script) is more flexible and superior.

The unix grep util is quite useful, and in emacs it's even better, because emacs has commands that act as wrappers to unix grep, with the advantage that the output is colored and file names are linked. 〔☛ Emacs: Searching for Text in Files (grep, find)〕

However, calling unix grep inside emacs has some problems, whether directly by shell-command or indirectly by {grep, rgrep, lgrep, grep-find, …}.

External Program Problem

On Windows, calling an external unix util is a major problem, because by default there's no unix grep installed. This makes a large part of critical emacs features unusable.

The user has to install Cygwin or something similar. Then emacs goes thru several layers (Cygwin, Windows) to reach grep. Unicode, or text that contains quote chars or escapes, almost always gets mangled along the way. 〔☛ Installing Cygwin Tutorial〕

Unix Shell Quote Escape Problem

Today, i want to grep for HTML image tags with this regex: height="[0-9]+" />

On my Microsoft Windows machine with Cygwin installed, when calling emacs's grep command, i gave this to the prompt:

grep -ie -nH 'height\=\"[0\-9]\+\" \/\>' *html

It doesn't work. I tried many variations: with the backslash in different places, double backslash, single/double quotes. Sometimes the error is about Cygwin detecting DOS style slash. Sometimes it silently creates a file of 0 length named ' in your directory, due to your bad escapes.
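For comparison, here is a minimal sketch of the same search done in a script, where the regex is typed once and never passes thru a shell. It is in Python, not the elisp version this page talks about. The names IMG_HEIGHT and matching_lines are made up for illustration.

```python
import re

# The regex from the text, written once, with no shell quoting layer.
# The raw string keeps every character exactly as typed.
IMG_HEIGHT = re.compile(r'height="[0-9]+" />', re.IGNORECASE)

def matching_lines(text):
    """Return (line_number, line) pairs for lines that match IMG_HEIGHT."""
    return [(n, line)
            for n, line in enumerate(text.splitlines(), start=1)
            if IMG_HEIGHT.search(line)]

sample = '<img src="a.png" width="10" height="20" />\n<p>no image here</p>\n'
print(matching_lines(sample))
```

The only quoting to think about is the one pair of single quotes around the pattern. There is no second or third layer of escapes.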

Dependence on Familiarity of Unix Shell Syntax

Unix utils syntax is incomprehensible to non-unix users, but emacs depends on them for basic features such as searching text in a dir. Users not familiar with the cryptic shell syntax won't be able to use it. For example, the emacs grep command prompts this: grep -nH -e ▮.

Also, to list files in emacs dired, the only command for that is find-dired. This depends on familiarity with the unix find/xargs commands. For example, it prompts “Run find (with args):”, where the user is supposed to type something like -name "*html" 〔☛ Inconsistency of Emacs Text-Searching Features〕

Regex Not Compatible to Emacs Regex

The regex syntax of grep is different from the regex syntax that users normally use in emacs, such as when calling query-replace-regexp. 〔☛ Emacs: regex tutorial〕

Unicode String Problem

When using unix grep on MS Windows for processing Unicode text, there are many encoding problems. On Windows with Cygwin, the char encoding in the stream gets messed up thru the various layers.

For example, grep fails when searching for “│” (U+2502). This is calling Cygwin grep from emacs on Windows. It's too complex to figure out exactly why it fails.

With Unicode, you have to deal with the unix environment variable “locale”, emacs's own various encoding settings, and MS Windows's locale and “codepage” setup. There's a complex interplay of environment variables among {emacs, emacs's inferior shells, Cygwin, Windows}. 〔☛ Emacs and Microsoft Windows Tips〕
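When the search runs inside one program, the encoding is decided in one place. The sketch below is Python, with a made-up function name files_containing, and it assumes the files are UTF-8. That assumption is stated in the code, so locale and codepage settings play no part.

```python
from pathlib import Path

MARKER = "\u2502"  # BOX DRAWINGS LIGHT VERTICAL, the "│" character

def files_containing(directory, needle=MARKER, pattern="*.html"):
    """Return sorted names of files in `directory` whose text contains `needle`.

    Files are decoded as UTF-8 explicitly (an assumption about the files),
    so the result does not depend on the locale or the Windows codepage.
    """
    hits = []
    for path in sorted(Path(directory).glob(pattern)):
        # errors="replace" keeps the scan going if a file is not valid UTF-8
        text = path.read_text(encoding="utf-8", errors="replace")
        if needle in text:
            hits.append(path.name)
    return hits
```

Calling files_containing(".") would scan the html files in the current dir. The search string is a Python string, not bytes passed thru a pipe, so there is no layer in between to mangle it.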

Problem with Long Search String

Sometimes you want to search for a string that's part of source code. It may be long, containing 300 chars or more. (⁖ a snippet of HTML that contains JavaScript and spans multiple lines.) You could put your search string in a file and call grep with --file=file name, but this is not convenient. Here's an example of a string i need to do a literal search for:

<div class="chtk"><script>ch_client="polyglut";ch_width=550;ch_height=90;ch_type="mpu";ch_sid="Chitika Default";ch_backfill=1;ch_color_site_link="#00C";ch_color_title="#00C";ch_color_border="#FFF";ch_color_text="#000";ch_color_bg="#FFF";</script><script src="http://scripts.chitika.net/eminimalls/amm.js"></script></div>

Shell-escaping this string would be very inconvenient, and it gets more complex when the shell is used inside emacs.
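In a script, a long literal like that can be pasted as is. Here is a small Python sketch with a made-up function name find_literal. NEEDLE is a shortened, 2-line stand-in for the snippet above, not the real string.

```python
def find_literal(text, needle):
    """Return the 1-based line numbers where a literal `needle` begins."""
    lines = []
    start = text.find(needle)
    while start != -1:
        # Line number = count of newlines before the match, plus 1
        lines.append(text.count("\n", 0, start) + 1)
        start = text.find(needle, start + 1)
    return lines

# A raw triple-quoted string can hold both quote characters and line breaks,
# so nothing in the needle needs escaping.
NEEDLE = r'''<div class="chtk"><script>ch_client="polyglut";
ch_width=550;</script></div>'''

page = "<html>\n<body>\n" + NEEDLE + "\n</body>\n</html>\n"
print(find_literal(page, NEEDLE))
```

This is a plain substring search, not a regex, so chars like “#”, “;” and “"” in the needle have no special meaning.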

Grep Not Flexible for Specifying Files in Directories

grep is not very flexible for working with all files in a directory. There's the -r option, but then you can't give the file pattern (⁖ *html) as a plain argument. You have to do it like this: grep -r 'xyz' --include='*html' dirname 〔☛ Linux Shell Text Processing Tutorial: grep, cat, awk, sort, uniq, …〕

Sometimes you need to work on a list of files, sometimes by a pattern, sometimes you want to exclude some files by list or by pattern, sometimes only the first 2 levels, or a combination of the above in a specific order. Some unix tools provide these options, sometimes by a combination of tools (⁖ find/xargs), but their order and syntax are complex and tool specific. With a script in Perl, Python, or elisp, it's much easier to control.
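Here is a rough Python sketch of that kind of control: include by pattern, exclude by pattern, and limit the depth. The function name select_files and its parameters are made up for illustration. Depth 1 means the starting dir itself.

```python
import fnmatch
import os

def select_files(root, include="*.html", exclude=(), max_depth=None):
    """Return sorted paths (relative to root, "/" separated) of files that
    match `include`, match none of the `exclude` patterns, and sit at most
    `max_depth` directory levels deep (1 = files directly in root)."""
    selected = []
    for dirpath, dirnames, filenames in os.walk(root):
        rel_dir = os.path.relpath(dirpath, root)
        depth = 1 if rel_dir == "." else rel_dir.count(os.sep) + 2
        if max_depth is not None and depth >= max_depth:
            dirnames[:] = []  # tell os.walk not to descend any further
        for name in filenames:
            if not fnmatch.fnmatch(name, include):
                continue
            if any(fnmatch.fnmatch(name, pat) for pat in exclude):
                continue
            rel_path = os.path.normpath(os.path.join(rel_dir, name))
            selected.append(rel_path.replace(os.sep, "/"))
    return sorted(selected)
```

For example, select_files("web", exclude=("draft_*",), max_depth=2) would pick html files in a hypothetical dir “web” and its immediate subdirs, skipping drafts. Each rule is one line of code, and the order of the rules is the order of the lines.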

Too Many Incompatible Versions of Grep

There are too many versions and varieties of grep. The primary 2 are BSD and GNU. Mac OS X comes with BSD versions, but some utils are GNU versions. Linuxes typically come with GNU versions. The different versions accept different options. Also, GNU grep supports a variety of confusing regex flavors (“--basic-regexp”, “--extended-regexp”, “--perl-regexp”). It's too painful to figure them out and remember their details.

grep is Not Powerful Enough for Nested Syntax (⁖ HTML/XML)

Unix grep and the associated tools (sort, wc, uniq, pipe, sed, awk, …) are not flexible enough when your need is slightly more complex. For example, suppose i need to find all occurrences of the HTML “img” tag that are not wrapped by a <div> tag. This is practically impossible with line-oriented unix tools, because it requires keeping track of nesting. (Extending the limit of unix tools is how Perl was born in 1987.)
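The “img” not inside <div> case can be sketched in a few lines once you have a real parser. The following uses Python's standard html.parser module. The class name ImgOutsideDiv and function img_outside_div are made up for illustration, and the sketch assumes the <div> tags in the input are properly closed.

```python
from html.parser import HTMLParser

class ImgOutsideDiv(HTMLParser):
    """Collect line numbers of <img> tags that have no enclosing <div>."""

    def __init__(self):
        super().__init__()
        self.div_depth = 0  # how many <div> elements are currently open
        self.found = []     # line number of each unwrapped <img>

    def handle_starttag(self, tag, attrs):
        if tag == "div":
            self.div_depth += 1
        elif tag == "img" and self.div_depth == 0:
            self.found.append(self.getpos()[0])

    def handle_endtag(self, tag):
        if tag == "div" and self.div_depth > 0:
            self.div_depth -= 1

def img_outside_div(html_text):
    """Return line numbers of <img> tags not nested inside any <div>."""
    parser = ImgOutsideDiv()
    parser.feed(html_text)
    parser.close()
    return parser.found

sample = ('<p><img src="a.png" /></p>\n'
          '<div>\n<img src="b.png" />\n</div>\n'
          '<img src="c.png">\n')
print(img_outside_div(sample))
```

The counter div_depth is the piece that a line-by-line tool has no place to keep.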

Example of a Real World Problem Using Grep Inside Emacs

Here's a concrete example of a grep problem.

In my vocabulary page Wordy English — the Making of Belles-Lettres, i use the Unicode BOX DRAWINGS LIGHT VERTICAL “│” as a temp marker for processing the word list. Today i need to grep pages containing that character.

Calling 【Meta+x grep】 in emacs with grep -inH -e "│" *html returns an error:

-*- mode: grep; default-directory: "c:/Users/xah/web/xahlee_org/emacs/" -*-
Grep started at Tue Apr 05 15:37:47

grep "│" *html
warning: extra args ignored after 'grep "│\'

Grep finished with no matches found at Tue Apr 05 15:37:47

Starting a shell in emacs (which runs Microsoft cmd.exe in Windows Vista) doesn't work either. (It works fine when grepping an ASCII string.) Here's a session log:

Microsoft Windows [Version 6.0.6002]
Copyright (c) 2006 Microsoft Corporation.  All rights reserved.

c:\Users\xah\web\xahlee_org\emacs>grep "│" *html
grep "â\224? *html

It got stuck there. 【Ctrl+c Ctrl+c】 doesn't get out. I had to kill the buffer.

Calling msys-shell works. (msys-shell is bundled with ErgoEmacs. It calls bash in MinGW, which is a subset of Cygwin port.) Here's a log:

sh-3.2$ grep "│" *html
antonymous_synonyms.html:<li> cry, decry │ you can cry, as in crying out loud, but you can also decry, by crying out loud </li>
antonymous_synonyms.html:<li> linear, rectilinear │ linear algebra, rectilinear motion. Rectilinear is the linearness of motion.</li>
…

Calling it in Cygwin Bash running inside Windows Console also works.

So, this means the problem isn't grep not understanding Unicode. Something went wrong when emacs talks to Cygwin. Though, what exactly is the problem? Well, i'm not about to spend a few hours to find out.

In PowerShell, it also works. ⁖ with this command: select-string -path *.html -pattern "│". However, calling PowerShell thru emacs does not work. 〔☛ Emacs PowerShell Modes〕

Here's my system setup:

O, Complexity & Tedium of Software Engineering.

Addendum: Adding the option -P also worked. ⁖ call emacs “grep” command, then give grep -inH -e -P "│" *html. Thanks to “blandest” (gnu.emacs.help).

Emacs Lisp Solves All Problems

If grep is written in elisp, which is trivial to do, all these problems would be solved. I've been using an elisp version of grep i wrote myself for a number of years now.

This page started when i wrote a grep in emacs lisp (How to Write grep in Emacs Lisp), and people asked why.
