May 10, 2014

Tinysnip #1: Colored diff with mksh or awk

For the record, when I started this article I wanted to "share" a snippet related to awk. It was almost ready for publishing, but I had another idea just before pushing the "nuke" button. I'm going to show you how to proceed in two ways (mksh and awk).

NOTE: This article has been improved since its first version. I now tend to use the awk(1) snippet. The mksh(1) variant is still provided but is less "powerful". Therefore, the first snippet was removed from the repository.

I very often need to compare two files and I haven't found anything better than diff -u. The output is pretty clear and it can be improved with colordiff, a perl(1) wrapper which adds color. Now, you have no excuse to misunderstand the changes.

However, there is a problem… perl(1) is not installed by default on all machines (although some distros ship it in the base system) and I find it quite "heavy" for this trivial task. Anyway, my "dirty" and "hacky" mind found ~~a solution~~ two solutions.

* mksh

Why not use mksh(1), the awesome shell?

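while IFS= read -r LINE; do
	# The first character of the line decides the color
	CHAR="${LINE::1}"
	# print -r does not expand backslashes, so the escapes are written
	# as $'...' strings, which turn \033 into a real ESC character
	case $CHAR in
		-) print -r -- $'\033[1;31m'"${LINE}"$'\033[0m' ;;
		+) print -r -- $'\033[0;32m'"${LINE}"$'\033[0m' ;;
		@) print -r -- $'\033[1;35m'"${LINE}"$'\033[0m' ;;
		*) print -r -- "$LINE" ;;
	esac
done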

We analyze the first character of each line using CHAR="${LINE::1}" and we apply highlighting according to this CHAR (green for an addition, red for a deletion and bright magenta for the @@ lines, which carry the range information). For non-matching lines, print just shows them as they are.

Once a line has been colored, the default colors must be restored. Fortunately \033[0m does the trick. The script can be invoked like this: diff -u file1 file2 | yiff (yiff is the script itself).

* awk

Why not use awk(1), the "venerable" tool?

{
	# First character of the line: -, +, @ or a space.
	# Assigned inside the action so that empty lines are still printed.
	symb = substr($0,1,1)
	# Highlight end line space(s)/tab(s)
	gsub(/[\t ]+$/,"\033[0;41m \033[0m")
	if (symb == "@") {
		printf("\033[1;35m%s\033[0m\n", $0)
	} else if (symb == "-") {
		printf("\033[0;31m%s\033[0m\n", $0)
	} else if (symb == "+") {
		printf("\033[0;32m%s\033[0m\n", $0)
	} else {
		printf("%s\n", $0)
	}
}

The more I work with awk(1), the more I love it... It's really simple, powerful, fast and more understandable than sed(1), IMHO.

symb = substr($0,1,1) stores the first character of every line (-, +, @ or just a space) in a variable. It will be used later (don't be in a hurry, comrade). Then, we wrap the whole line in ANSI escape codes to color it: red for a deletion, green for an addition and magenta for range information. The colors can easily be changed.

When the line ends, we have to restore the default colors. If not, the next line will be highlighted even if it didn't change and you'll be confused. That's why "\033[0m" is added at the end of "special" lines. It's not needed with unchanged lines. For testing purposes, let's replace "\033[0m" with " endline":

--- file1	2014-05-09 14:58:12.066021088 +0200 endline
+++ file2	2014-05-09 14:59:16.299352783 +0200 endline
@@ -1,5 +1,6 @@ endline
 line 1 coco
 line 2 moo
-line 3 what endline
+line 8 nothing endline
 line 4 shell
-line 5 bro endline
+line 10 hehe endline
+line 24 limit endline

As you can see, endline is only "echoed" on the lines starting with the required glyphs. I also wanted to color the spaces / tabs at the end of the line(s), because they are otherwise almost invisible. Trailing whitespace is matched and shown with a red background. That was useful to me several times.
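Both checks can be reproduced with visible markers instead of escape codes, which is handy when the output goes to a file or a pager. This is only a sketch, assuming a POSIX shell and one of the awk implementations mentioned in this article; the function name mark_diff and the [WS] marker are made up for the illustration. It also differs from the snippet on one point: it only looks for whitespace after the marker character, so a blank context line (a single space in diff -u output) is not flagged.

```shell
# mark_diff: hypothetical helper, reads diff -u output on stdin.
# Visible text replaces the ANSI escapes so the result can be read or grepped.
mark_diff() {
	awk '{
		symb = substr($0, 1, 1)
		# Only the text after the marker character is checked
		rest = substr($0, 2)
		# Trailing spaces/tabs become a visible [WS] marker
		sub(/[\t ]+$/, "[WS]", rest)
		if (symb == "@" || symb == "-" || symb == "+") {
			# " endline" stands in for the reset sequence \033[0m
			printf("%s%s endline\n", symb, rest)
		} else {
			printf("%s%s\n", symb, rest)
		}
	}'
}

# Usage: diff -u file1 file2 | mark_diff
```

A deleted line ending in two spaces then shows up with [WS] followed by endline, while context lines come out untouched.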

The snippet was created on Debian with mawk(1). It works with mawk(1), nawk(1) and gawk(1). As usual, it's available on git.

Assuming you prefer "traditional" diff(1) output, replace + and - with > and <. By the way, do I need to mention that my "hacks" are faster than colordiff(1)?
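For the "traditional" format, the swap could look like the sketch below. The name yiffold is invented here, and coloring the change commands (lines such as 3c3, which start with a digit) in magenta is my own assumption about what plays the role of the @@ lines; the --- separator is left uncolored.

```shell
# yiffold: hypothetical name; colors "traditional" diff(1) output read from stdin.
yiffold() {
	awk '{
		symb = substr($0, 1, 1)
		if (symb == "<") {
			# Line only in the first file: red
			printf("\033[0;31m%s\033[0m\n", $0)
		} else if (symb == ">") {
			# Line only in the second file: green
			printf("\033[0;32m%s\033[0m\n", $0)
		} else if (symb ~ /[0-9]/) {
			# Change command such as 3c3 or 5a6,7: magenta (assumption)
			printf("\033[1;35m%s\033[0m\n", $0)
		} else {
			# The --- separator and anything else stay uncolored
			printf("%s\n", $0)
		}
	}'
}

# Usage: diff file1 file2 | yiffold
```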

$ time diff -u file{1,2} | yiff
0m0.09s real     0m0.00s user     0m0.00s system

$ time diff -u file{1,2} | yiffawk
0m0.08s real     0m0.00s user     0m0.00s system

$ time colordiff -u file{1,2}
0m0.14s real     0m0.03s user     0m0.00s system

It was tested on a file with ~ 900 lines (20 differences). Yeah, I know the test isn't representative at all. colordiff(1) includes more options, but I'm not using them (big deal, right?). I'm glad to see I can do things faster...

N.B.: A reader asked me "Why not add a POSIX version?". Hum, on this blog I usually try to focus on (m)ksh(1) (or awk(1)), because I really "love" it and also because ksh is not widely used or known in the Linux world. In this article, the awk(1) version is a solid and efficient alternative when your shell doesn't support ${var::1} (or just use cut(1) instead).
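For the curious, a POSIX take on the shell loop might look like this. It is a sketch rather than a tested replacement: it swaps ${var::1} for plain case patterns (so neither the substring expansion nor cut(1) is needed), uses printf(1) instead of the ksh print builtin, and the function name yiff_posix is invented here.

```shell
# yiff_posix: hypothetical name; POSIX sh variant of the mksh loop.
yiff_posix() {
	# The extra test keeps a last line that has no trailing newline
	while IFS= read -r LINE || [ -n "$LINE" ]; do
		case $LINE in
			'-'*) printf '\033[1;31m%s\033[0m\n' "$LINE" ;;
			'+'*) printf '\033[0;32m%s\033[0m\n' "$LINE" ;;
			'@'*) printf '\033[1;35m%s\033[0m\n' "$LINE" ;;
			*) printf '%s\n' "$LINE" ;;
		esac
	done
}

# Usage: diff -u file1 file2 | yiff_posix
```

The pattern match on the whole line avoids spawning an external command per line, which is what a cut(1) call inside the loop would do.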