Elisp: Replace Invisible Unicode Chars

By Xah Lee.

When copying text from Twitter or Google Plus, you'll often get an invisible ZERO WIDTH NO-BREAK SPACE character (U+FEFF, also used as the byte order mark, BOM). That's really annoying.

Here's a command to query-replace the BOM character and a few other invisible Unicode characters.

(defun xah-replace-BOM-mark-etc ()
  "Query replace some invisible Unicode chars.
The chars to be searched are:
 ZERO WIDTH NO-BREAK SPACE (codepoint 65279, #xfeff)
 RIGHT-TO-LEFT MARK (codepoint 8207, #x200f)
 RIGHT-TO-LEFT OVERRIDE (codepoint 8238, #x202e)
 OBJECT REPLACEMENT CHARACTER (codepoint 65532, #xfffc)

Search begins at cursor position. (respects `narrow-to-region')

This is useful for text copied from Twitter or Google Plus, because it often contains a BOM mark. See URL `http://xahlee.info/comp/unicode_BOM_byte_orde_mark.html'

URL `http://ergoemacs.org/emacs/elisp_unicode_replace_invisible_chars.html'
Version 2016-07-24"
  (interactive)
  (query-replace-regexp "\u200f\\|\u202e\\|\ufeff\\|\ufffc" ""))
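
If you don't want to be asked at each match, here's a minimal sketch of a non-interactive variant. The names `my-invisible-chars-regexp`, `my-strip-invisible-chars`, and `my-delete-invisible-chars-in-buffer` are made up for illustration and are not part of any package. Unlike the command above, the buffer version starts at the beginning of the accessible portion of the buffer, not at the cursor.

```elisp
;; Sketch only. All names here are hypothetical.
(defconst my-invisible-chars-regexp "[\u200f\u202e\ufeff\ufffc]"
  "Regexp matching the same 4 invisible chars as `xah-replace-BOM-mark-etc'.")

(defun my-strip-invisible-chars (text)
  "Return a copy of string TEXT with the invisible chars removed."
  (replace-regexp-in-string my-invisible-chars-regexp "" text t t))

(defun my-delete-invisible-chars-in-buffer ()
  "Delete the invisible chars in the accessible part of the current buffer.
Respects `narrow-to-region'. Returns the number of chars deleted."
  (interactive)
  (let ((count 0))
    (save-excursion
      (goto-char (point-min))
      (while (re-search-forward my-invisible-chars-regexp nil t)
        (replace-match "" t t)
        (setq count (1+ count))))
    count))
```

The string version is handy when cleaning text before inserting it. The buffer version can be called from other Lisp code.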

To do this for all files in a directory, you can use the following temp hack.

You'll need the xah-find.el package. See Emacs: xah-find.el, Find Replace in Pure Elisp.

(defun xah-replace-BOM-mark-dir ()
  "Temp hack. Remove BOM (codepoint 65279) from .html files in a directory.
Prompts for the directory. See `xah-replace-BOM-mark-etc'.
Version 2015-10-11"
  (interactive)
  (require 'xah-find)
  (let (($dir (ido-read-directory-name "Directory: " default-directory default-directory "MUSTMATCH")))
    (xah-find-replace-text (char-to-string 65279) "" $dir "\\.html\\'" t t t t)))

The above does the replacement in batch, makes a backup of each changed file, and generates a report of the replacements and changed files, which you can jump to.
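
If you don't have xah-find.el, the same idea (batch replace, backup, list of changed files) can be sketched with builtin functions only. This is not how xah-find.el does it. The name `my-delete-bom-in-dir` is made up, the sketch assumes the files are UTF-8, and the backup is simply a copy with `~` appended to the file name.

```elisp
;; Sketch only. `my-delete-bom-in-dir' is a hypothetical name.
;; Assumes all matched files are encoded in UTF-8.
(defun my-delete-bom-in-dir (dir file-regexp)
  "Delete U+FEFF from files under DIR whose name matches FILE-REGEXP.
Searches subdirectories too. Each changed file is first copied to a
backup named FILE~. Returns the list of changed files."
  (let ((changed '())
        ;; plain utf-8 keeps a leading BOM as a normal char in the buffer
        (coding-system-for-read 'utf-8)
        (coding-system-for-write 'utf-8))
    (dolist (file (directory-files-recursively dir file-regexp))
      (with-temp-buffer
        (insert-file-contents file)
        (goto-char (point-min))
        (when (search-forward "\ufeff" nil t)
          (goto-char (point-min))
          (while (search-forward "\ufeff" nil t)
            (replace-match "" t t))
          ;; the file on disk is still the original here, so back it up now
          (copy-file file (concat file "~") t)
          (write-region (point-min) (point-max) file nil 'silent)
          (push file changed))))
    (nreverse changed)))
```

For example, `(my-delete-bom-in-dir "~/web/" "\\.html\\'")` would process the .html files under the hypothetical directory `~/web/` and return the paths of those that were changed.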

You can also use Emacs's builtin find-replace commands, but the annoying part is typing the invisible char (either by copy and paste, or by remembering the Unicode codepoint〔►see Emacs: Unicode Tutorial〕).
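
One way around the typing problem is to let Lisp build the character for you. At an interactive prompt, Alt+x `insert-char` (Ctrl+x 8 Enter) accepts a hex codepoint such as `feff`. In Lisp code, here's a small sketch. The `my-` names are made up for illustration, and `char-from-name` needs Emacs 26 or later.

```elisp
;; Sketch only. The my- names are hypothetical.
;; Three ways to get U+FEFF as a string without pasting the invisible char.
(defconst my-bom-from-codepoint (char-to-string #xfeff))
(defconst my-bom-from-escape "\ufeff")
(defconst my-bom-from-name
  (char-to-string (char-from-name "ZERO WIDTH NO-BREAK SPACE")))

(defun my-query-replace-bom ()
  "Query replace U+FEFF with nothing, from cursor position to end of buffer."
  (interactive)
  (query-replace my-bom-from-codepoint ""))
```

All three constants hold the same one-character string, so use whichever is easiest to read.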
