The Most Confusing Grep Mistakes I’ve Ever Made

2020-11-02 – By Robert Elder

     In this article, I’ll discuss 5 mistakes that have cost me the most time when using the command-line tool ‘grep’, each of which gave me wrong search results.  The main reasons for these mistakes are:  Not knowing which flavour of regular expression grep is currently using (and/or not understanding what features that flavour supports); Not considering the escaping rules of your shell; Issues with character encodings.

     Here is a file containing a few lines of text that we’ll place inside the file ‘hello.txt’:

Hello World.
Hello There World.
Hello Some World.
Hello The World.
HelloWorld.
Goodbye World.

     Let’s say that you wanted to use the ‘grep’ command to find all lines in this file that contain the word ‘Hello’ followed by the word ‘World’.  You could use a grep command like this:

grep "Hello.*World" hello.txt

     and, as you’d expect, this grep command finds all of the matching lines:

Hello World.
Hello There World.
Hello Some World.
Hello The World.
HelloWorld.

     But now you might consider adding the additional requirement that there be at least one character between ‘Hello’ and ‘World’ so that the line with ‘HelloWorld’ is not included in the matches.  Since you know that ‘*’ is a regex pattern for ‘zero or more’ and ‘+’ is a regex pattern for ‘one or more’, you decide to try the following:

grep "Hello.+World" hello.txt

     But this doesn’t match anything at all!  What’s going on here?  Isn’t ‘+’ a regular expression symbol for ‘one or more’?  The answer is related to the default regular expression mode that grep uses.  If you don’t specify any flags to grep, it will use ‘BRE’ or ‘Basic Regular Expressions’ which are very old and quite primitive.  In fact, the official standard for BRE doesn’t even support the ‘+’ quantifier!  This can lead to very confusing behaviour since you might just try escaping the ‘+’ and find that it gives you the result you expect:

grep "Hello.\+World" hello.txt

     gives the following:

Hello World.
Hello There World.
Hello Some World.
Hello The World.

     But since we’re still using ‘BRE’ regular expressions, the official standard says that this is actually undefined behaviour!  You can learn more about this in Undefined Behaviour With Grep -E.
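     If you want ‘one or more’ without leaning on undefined behaviour, two options are sketched below: switch to ERE with the -E flag, where ‘+’ is a defined quantifier, or stay in BRE and use the interval expression ‘\{1,\}’.  The temporary directory and the variable names are only for illustration, and the sketch assumes a POSIX-style shell with ‘mktemp’ available:

```shell
# Recreate the sample file in a throwaway directory (location is arbitrary).
tmpdir=$(mktemp -d)
printf '%s\n' 'Hello World.' 'Hello There World.' 'Hello Some World.' \
    'Hello The World.' 'HelloWorld.' 'Goodbye World.' > "$tmpdir/hello.txt"

# ERE: with -E, '+' is a defined "one or more" quantifier.
ere_matches=$(grep -E 'Hello.+World' "$tmpdir/hello.txt")

# BRE: the interval expression \{1,\} also means "one or more".
bre_matches=$(grep 'Hello.\{1,\}World' "$tmpdir/hello.txt")

printf '%s\n' "$ere_matches"
rm -r "$tmpdir"
```

     Both commands should select the same four lines and leave out ‘HelloWorld.’.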

     Let’s say we have a file called ‘sometext.txt’ with the following text in it:

Make sure you write out the `date`.

Today's date is Oct 21, 2020.

     If you wanted to find all lines in this file that contain the word ‘date’, you could use this grep command:

grep date sometext.txt

     and you’ll get the following result which is what you expect:

Make sure you write out the `date`.
Today's date is Oct 21, 2020.

     But now, let’s assume you wanted to use grep to find only the line that contains the backtick characters around the word ‘date’.  You might try doing the following:

grep `date` sometext.txt

     But this just generates a bunch of error messages in the shell:

grep: Oct: No such file or directory
grep: 21: No such file or directory
grep: 13:30:57: No such file or directory
grep: EST: No such file or directory
grep: 2020: No such file or directory

     You might be thinking, “Oh, no problem, I’ll just use double quotes” and try something like this:

grep "`date`" sometext.txt

     But that still doesn’t work (at least not in bash)!  It doesn’t find any matches at all!  The problem in this case is related to the fact that the backtick character has a special meaning in our shell, even when used inside double quotes.  To illustrate this point, we can run the following two echo commands:

echo "date"
echo "`date`"

     and the output of these echo statements is:

date
Mon Oct 21 13:30:57 EST 2020

     So from the results of the example above, you can see why the last grep command we used didn’t find anything: We were literally searching for the current date instead of the word ‘date’ surrounded by backticks!  The solution (in bash shell) is to use single-quotes instead:

grep '`date`' sometext.txt

     which will match correctly as expected:

Make sure you write out the `date`.
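     If you’d rather keep the double quotes (say, because the pattern also needs a shell variable), escaping each backtick with a backslash should work too.  Here’s a small sketch comparing the two forms; the temporary directory and variable names are made up for the example:

```shell
# Recreate the sample file in a throwaway directory (location is arbitrary).
tmpdir=$(mktemp -d)
printf '%s\n' 'Make sure you write out the `date`.' '' \
    "Today's date is Oct 21, 2020." > "$tmpdir/sometext.txt"

# Single quotes: the shell passes the backticks to grep untouched.
single_quoted=$(grep '`date`' "$tmpdir/sometext.txt")

# Double quotes: each backtick has to be escaped with a backslash.
double_quoted=$(grep "\`date\`" "$tmpdir/sometext.txt")

printf '%s\n' "$single_quoted" "$double_quoted"
rm -r "$tmpdir"
```

     Both searches should return only the line that has the backticks around ‘date’.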

     This isn’t the only issue that you can encounter where your shell could unexpectedly change the meaning of the search string that you pass into grep.  You can also encounter an issue with unexpected ‘globbing’ when you attempt to use a regular expression containing the ‘*’ character without using quotes.  For example, consider this simple echo statement that just prints out ‘asdf’:

echo "asdf"

     If you filter this echo statement through a grep search for the character ‘a’ like this:

echo "asdf" | grep a

     the search will pass the line ‘asdf’ through as expected.  And similarly, if you do a regex search with grep for an ‘a’ followed by any number of other characters like this:

echo "asdf" | grep a.*

     this will also let the ‘asdf’ through.  However, if you create a new file in the current directory called ‘a.txt’:

touch a.txt

     the following search won’t work anymore!  It doesn’t find anything:

echo "asdf" | grep a.*

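     What!?  How can creating a new file change how our grep commands run in the shell???  Because the pattern isn’t quoted, the shell treats ‘a.*’ as a filename glob and, now that a matching file exists, replaces it with ‘a.txt’ before grep ever runs, so grep ends up searching for the regex ‘a.txt’ instead.  This problem is explained in detail in this article on shell globbing.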
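     To see the effect for yourself, here’s a sketch that runs the same search with and without quotes in a throwaway directory containing only ‘a.txt’.  The directory and the variable names are just for this example:

```shell
# A throwaway directory that contains nothing but a.txt.
tmpdir=$(mktemp -d)
touch "$tmpdir/a.txt"

# What the shell turns the unquoted pattern into once a.txt exists.
expanded=$(cd "$tmpdir" && echo a.*)

# Unquoted: grep really receives the regex 'a.txt', which 'asdf' doesn't match.
unquoted=$(cd "$tmpdir" && echo "asdf" | grep a.* || true)

# Quoted: grep receives 'a.*' exactly as typed.
quoted=$(cd "$tmpdir" && echo "asdf" | grep 'a.*')

printf 'expanded=%s unquoted=%s quoted=%s\n' "$expanded" "$unquoted" "$quoted"
rm -r "$tmpdir"
```

     Quoting the pattern keeps the shell from touching it, so the result no longer depends on which files happen to be in the current directory.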

     This mistake isn’t specific to grep since it’s really about regular expressions in general, but it’s common enough to include in this article.  Consider a case where you’re trying to use grep to extract all instances of numbers that include a decimal point.  In your search, you’re looking for one or more digits, followed by a period, followed by one or more digits.  You might try writing a grep command like this:

echo "234.328" | grep -Eo "[0-9]+.[0-9]+"

     which looks like it works just fine because it does match all of the things you do want.  The problem is that it also matches things that you don’t want:

echo "234A328" | grep -Eo "[0-9]+.[0-9]+"

     In the above case, our regular expression will match the string ‘234A328’, which isn’t a number with a decimal point.  This becomes obvious once it’s pointed out, since the ‘.’ character represents “any character except for newline” in most regular expression engines.  In order to match a ‘literal’ period character in a regular expression, you need to escape it:


echo "234A328" | grep -Eo "[0-9]+\.[0-9]+"

echo "234.328" | grep -Eo "[0-9]+\.[0-9]+"

     The lesson is to be careful when using searches that include a ‘.’ character, since it may not always literally mean a period character.
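     If you’d like to avoid backslashes altogether (and any worry about how your shell handles them), a bracket expression such as ‘[.]’ is another way to write a literal period.  A quick sketch comparing the three forms, with variable names chosen just for this example:

```shell
# Unescaped '.': matches any character, so the 'A' is accepted.
loose=$(echo "234A328" | grep -Eo '[0-9]+.[0-9]+')

# Escaped '\.': only a literal period is accepted, so nothing matches here.
escaped=$(echo "234A328" | grep -Eo '[0-9]+\.[0-9]+' || true)

# Bracket expression '[.]': another spelling of a literal period.
bracketed=$(echo "234.328" | grep -Eo '[0-9]+[.][0-9]+')

printf 'loose=%s escaped=%s bracketed=%s\n' "$loose" "$escaped" "$bracketed"
```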

     Here is some text that we’ll place inside a file called ‘animals.txt’.  Take note that the two ‘columns’ in this file are separated with tab (\t) characters:

Person	Favourite AnimalPet
Robert	Cat
Alexander	Dog
Sam	Monkey
Michael	Snake

     Let’s say that we wanted to write a grep statement to extract the first column from this file.  We could do this quickly and crudely by writing a regular expression that will extract anything up to and including the tab character.  Here’s an attempt to do this with the following grep command:

grep -o ".*\t" animals.txt

     But if you run this, you’ll get results that are completely wrong:

Person  Favourite AnimalPet
Robert  Cat

     The reason is, again, because of grep’s default regular expression mode: BRE or ‘Basic Regular Expressions’.  However, if we try using the -E flag for ‘Extended Regular Expressions’, this doesn’t fix the problem:

grep -Eo ".*\t" animals.txt

     still gives:

Person  Favourite AnimalPet
Robert  Cat

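     In fact, if you check the official standard for BRE and ERE, you’ll see that neither has any support for matching just a ‘tab’ character!  In POSIX BRE or ERE, there are just a handful of characters that you can escape with a backslash, and they don’t include tab.  In the output above, grep ended up treating ‘\t’ as a plain ‘t’, which is why only the lines containing a ‘t’ were printed.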

     Confusingly, GNU grep does support things like ‘\s’ in ERE even though it’s not officially supported by the POSIX standard.

     The solution in our case is to use the -P flag for ‘Perl-Compatible Regular Expressions’:

grep -Po ".*\t" animals.txt

     which gives us the expected result:

Person
Robert
Alexander
Sam
Michael

     Unfortunately, the ‘-P’ flag is not supported by all versions of grep, so this solution isn’t always available.
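     When ‘-P’ isn’t an option, one workaround is to have the shell place a real tab character into the pattern instead of asking grep to interpret ‘\t’.  The sketch below does this with ‘printf’; the temporary directory and variable names are made up for the example, and it assumes your grep supports ‘-o’:

```shell
# Recreate animals.txt with real tab characters between the two columns.
tmpdir=$(mktemp -d)
printf '%s\t%s\n' 'Person' 'Favourite AnimalPet' 'Robert' 'Cat' \
    'Alexander' 'Dog' 'Sam' 'Monkey' 'Michael' 'Snake' > "$tmpdir/animals.txt"

# Command substitution only strips trailing newlines, so the tab survives.
tab=$(printf '\t')

# The pattern now contains an actual tab byte, which BRE matches literally.
first_column=$(grep -o ".*$tab" "$tmpdir/animals.txt")

printf '%s\n' "$first_column"
rm -r "$tmpdir"
```

     As with the ‘-P’ version, each match includes the trailing tab.  In bash specifically, writing the pattern as $'.*\t' should have the same effect.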

     This issue is one that you won’t encounter every day, but when you do it can be extremely confusing to figure out what’s going on.  If you ever happen to work with files that are encoded in UTF-16, you’ll have to be mindful of the fact that grep isn’t aware of character encodings, so whatever you grep for will likely only be found if it’s in a character encoding that matches the current encoding of the terminal where you type the grep command.

     For example, imagine that you have two files: the first file encoded in UTF-8 contains this text:

Hello World 123!

     and the second file encoded in UTF-16 contains this text:

Hello World 456!

     On my machine, if I do a grep search over both of these files using this grep command:

grep World file1.txt file2.txt

     this will only match the statement in the first file!

     This fact isn’t too surprising when you know what’s going on, but the difficult part is noticing that you’ve got a file that’s encoded in a different format in the first place.  If you take regular ASCII characters and re-encode them as UTF-16, the file you get will look like regular ASCII encoded text with nulls placed between the characters.  Therefore, if you print the file onto the terminal, the nulls will be ignored and what you see printed will look indistinguishable from regular ASCII text (except for the byte order mark).  Programs like vim will automatically recognize the encoding and display the file as normal text, so you likely won’t notice the encoding.

     One way to identify the encoding of files is to use the ‘file’ command:

file file1.txt file2.txt

     which gives this output in our example:

file1.txt: ASCII text
file2.txt: Little-endian UTF-16 Unicode text, with no line terminators

     Here is an example of hex dump from those two files:

xxd file1.txt

00000000: 4865 6c6c 6f20 576f 726c 6420 3132 3321  Hello World 123!
00000010: 0a                                       .

xxd file2.txt

00000000: fffe 4800 6500 6c00 6c00 6f00 2000 5700  ..H.e.l.l.o. .W.
00000010: 6f00 7200 6c00 6400 2000 3400 3500 3600  o.r.l.d. .4.5.6.
00000020: 2100 0a00                                !...

     As you can see, the UTF-16 encoded file looks just like ASCII text with null characters between every character.

     So, how do we actually find matches in a UTF-16 encoded file using grep?  Well, this is one of the few situations where grep isn’t the best tool for the job.  One option would be to normalize your files to UTF-8/ASCII encoding.  You can convert files between different encodings using the ‘iconv’ command:

iconv -f UTF-16 file2.txt -t UTF-8 -o file3.txt

     Since the file is now encoded as ASCII/UTF-8 in file3.txt, your original grep command should find the expected matches.
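     If you don’t want to keep a converted copy around, iconv can also write to standard output so that grep reads the converted text from a pipe.  Here’s a sketch of that idea; the sample file is generated with iconv itself (so its byte order may differ from the hex dump above), and the directory and variable names are only for illustration:

```shell
# Build a UTF-16 sample file; iconv chooses the byte order.
tmpdir=$(mktemp -d)
printf 'Hello World 456!\n' | iconv -f UTF-8 -t UTF-16 > "$tmpdir/file2.txt"

# Searching the raw UTF-16 bytes: the number of matching lines is 0.
raw_count=$(grep -c World "$tmpdir/file2.txt" || true)

# Converting on the fly: grep sees UTF-8 text and finds the line.
converted=$(iconv -f UTF-16 -t UTF-8 "$tmpdir/file2.txt" | grep World)

printf 'raw_count=%s converted=%s\n' "$raw_count" "$converted"
rm -r "$tmpdir"
```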

     Another less ideal option is to use the ‘-P’ flag with grep and explicitly include the null characters for the UTF-16 encoding in your grep command:

grep -Pa 'W\x00o\x00r\x00l\x00d\x00' file2.txt

     This looks quite messy, and since ‘-P’ is not supported by all versions of grep, you can’t always use this option.  It also requires you to do a separate search every time you suspect there might be a UTF-16 file present (or more if there are even more encodings present).

     Another thing to note is that the ‘-a’ flag in the command above is necessary, otherwise grep will treat the UTF-16 files as binary data and refuse to search them.

     Hopefully, you’ve learned a few things about grep and the shell environment in this article.  I feel like I need to write a conclusion section to avoid ending the article too abruptly, but there’s really nothing more to say at this point, and if I keep writing then I’ll just be rambling.  I guess we can talk about the weather if you want.  How are things going with you?
