Grep Tutorial: Searching File Contents

The grep command allows searching the contents of a file from the
command line. It’s a very useful tool to find a particular line in,
say, a log file or a conf file. And because it’s a command line
program, you can combine it with other commands in various ways to
produce powerful results. In this tutorial, you will learn both the
basics and some more advanced applications of grep.

Looking for a needle in a haystack

Suppose you have a large configuration file in which you want to
find a particular setting.
For example, you might want to know the current maximum upload file
size of your PHP installation. The following grep command will quickly
give you your answer:

grep upload_max_filesize php.ini

In my case, it outputs “upload_max_filesize = 2M”. Pretty neat,
isn’t it? Let’s have a closer look at this command and what it does.
Obviously, “grep” invokes the grep command. The second part,
“upload_max_filesize”, is the needle we’re looking for, and the third
part, “php.ini”, is the file that is our haystack. When invoked, grep
reads the haystack file line by line, looking for the needle, and
prints every line that contains the needle. The general way of invoking
grep is:

grep <needle> <haystack>

As you see, there are two places of interest in this format. The
first is the needle, and the second is the haystack. Read on to learn
more about both of these, and how you can use them to your advantage.
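
If you’d like to try this form without touching a real configuration file, here is a minimal sketch. The file name sample.ini and its contents are made up for illustration; they are not taken from a real PHP installation.

```shell
# Create a throwaway haystack (hypothetical contents, not a real php.ini)
tmpdir=$(mktemp -d)
cat > "$tmpdir/sample.ini" <<'EOF'
memory_limit = 128M
post_max_size = 8M
upload_max_filesize = 2M
EOF

# grep <needle> <haystack>: print every line that contains the needle
result=$(grep upload_max_filesize "$tmpdir/sample.ini")
echo "$result"

# Clean up the temporary directory
rm -r "$tmpdir"
```

Only the third line of the sample file contains the needle, so that is the only line grep prints.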

On the use of needles

The needle, or search pattern as it is more commonly called,
specifies what a line should contain to be printed by grep. This can be
a word, or a certain character, but also a so-called regular
expression. Now, if you’re not familiar with regular expressions, there
are plenty of wonderful tutorials on the web, and it could take you a
while before you master them. For this tutorial, I will just show you
the most commonly used example, and warn you about some pitfalls.

The simplest search pattern is just a literal word or character. You
can find all recent pages requested by Googlebot in your web server’s
access log by executing this:

grep Googlebot access.log

Note that this search is case sensitive. A line containing
“googlebot” or “GoogleBot” is not printed. If you’d like to ignore case, add
the -i option:

grep -i Googlebot access.log

Note that options are placed before the search pattern. This command
will also print lines containing “googlebot”, “GoogleBot” or even
“gOoGLeBoT”.
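
To see the difference for yourself, here is a small sketch that runs both commands on a made-up access log. The log lines are invented and much shorter than what a real web server writes.

```shell
# Build a tiny, hypothetical access log
tmpdir=$(mktemp -d)
cat > "$tmpdir/access.log" <<'EOF'
GET /index.html "Googlebot/2.1"
GET /about.html "Mozilla/5.0"
GET /news.html "googlebot-image"
EOF

# Case sensitive: only the exact spelling "Googlebot" matches
sensitive=$(grep Googlebot "$tmpdir/access.log")
# Case insensitive: "googlebot" matches as well
insensitive=$(grep -i Googlebot "$tmpdir/access.log")

echo "Without -i:"
echo "$sensitive"
echo "With -i:"
echo "$insensitive"

rm -r "$tmpdir"
```

Without -i, the line with the lowercase “googlebot-image” is skipped; with -i, both bot lines are printed.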

Now that we’re looking at a web server access log anyway, you might
also want to view all requests by actual people. A logical way to do
this would be to filter out all lines containing requests by known
bots. An interesting option of grep is the -v option, which inverts the search
results by printing all lines that do not
contain the pattern:

grep -iv Googlebot access.log

Remember that -i made the search case insensitive, and that “-iv” is
equivalent to “-i -v”, so this command prints all lines not containing
“Googlebot”, “gOoGLeBoT” and so on. Everything that’s printed now was
not requested by Googlebot. But there are more known bots, like bingbot
for example. How do you exclude both from the output of grep? This is
where the most commonly used regular expression comes in:

grep -Eiv 'Googlebot|bingbot' access.log

Notice three important things:

  1. The -E option allows the use of extended
    regular expressions as a pattern. This is necessary to make more
    advanced patterns work on any computer on which you might execute this
    command.
  2. The “|” character in the pattern means or,
    and makes the pattern match both “Googlebot” and “bingbot”. You can add
    more matching words as long as you place “|” characters between them,
    like ‘Googlebot|bingbot|Baiduspider|facebookexternalhit’.
  3. Finally, the single quotes around the pattern prevent your shell
    from interpreting the special characters in the regular expression.
    Without them, the shell treats “|” as a pipe and tries to run
    “bingbot” as a command, which may cause disaster but usually results
    in some kind of “command not found” error.
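
The three points above can be tried out with a small sketch. The log lines below are invented for illustration and only loosely resemble a real access log.

```shell
# Build a tiny, hypothetical access log with two bots and two browsers
tmpdir=$(mktemp -d)
cat > "$tmpdir/access.log" <<'EOF'
GET /index.html "Googlebot/2.1"
GET /about.html "Mozilla/5.0"
GET /news.html "bingbot/2.0"
GET /contact.html "Opera/9.80"
EOF

# -E: extended regular expressions, -i: ignore case, -v: invert the match.
# The single quotes keep the shell from treating "|" as a pipe.
humans=$(grep -Eiv 'Googlebot|bingbot' "$tmpdir/access.log")
echo "$humans"

rm -r "$tmpdir"
```

Only the two lines that mention neither bot remain in the output.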

Notice that the “|” character has a special meaning in this context.
A common pitfall when using grep is to put special characters in the
search pattern without realizing they have a special meaning. For
example, the dot character matches any single character, even if you
don’t use the -E option. Have a look at this command:

grep i.e. textfile.txt

You might want to find all lines that contain “i.e.”, the
abbreviation for “that is”. However, this command will also print lines
containing “item” or “cabinet” or even “businessman”. All of these
words contain the character sequence “i<some character>e<some
character>”, and the period matches any single character. If you want
grep to look for the search pattern exactly as it is, you should use
the -F option to look for fixed strings and put the search pattern
between single quotes to prevent your shell from interpreting the
special characters. The correct command would be:

grep -F 'i.e.' textfile.txt
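
Here is a short sketch that compares both ways of searching. The contents of textfile.txt are made up for this example.

```shell
# Build a small, hypothetical text file
tmpdir=$(mktemp -d)
cat > "$tmpdir/textfile.txt" <<'EOF'
Use a shell, i.e. bash.
Put the item in the cabinet.
Nothing to see here.
EOF

# As a regular expression, each dot matches any single character
as_regex=$(grep 'i.e.' "$tmpdir/textfile.txt")
# With -F, the pattern is a fixed string and the dots are literal
as_fixed=$(grep -F 'i.e.' "$tmpdir/textfile.txt")

echo "Regular expression:"
echo "$as_regex"
echo "Fixed string:"
echo "$as_fixed"

rm -r "$tmpdir"
```

The regular expression also catches the line with “item”, while the fixed-string search prints only the line that really contains “i.e.”.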

However, when you want to search for multiple patterns containing
special characters, you can’t use the -F option because you need the
special meaning of the “|” character. In this case you’ll need to
escape the special characters that grep should search for as they are.
For example, if you want to look for “i.e.” or “e.g.”, escape the dots
using backslashes, like this:

grep -E 'i\.e\.|e\.g\.' textfile.txt

A backslash before a special character tells grep to search for the
special character literally, without minding its special meaning. For a
full list of characters that need escaping, refer to the man page of
grep (run “man grep” in your command line).
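
To see what the backslashes change, here is a minimal sketch that runs the pattern with and without them. The sample sentences are invented for illustration.

```shell
# Build a small, hypothetical text file
tmpdir=$(mktemp -d)
cat > "$tmpdir/textfile.txt" <<'EOF'
Use a shell, i.e. bash.
Fruit, e.g. apples.
Put the item in the cabinet.
An egg is edible.
EOF

# Unescaped dots match any character, so "item" and "egg " match too
unescaped=$(grep -E 'i.e.|e.g.' "$tmpdir/textfile.txt")
# Escaped dots match only a literal "."
escaped=$(grep -E 'i\.e\.|e\.g\.' "$tmpdir/textfile.txt")

echo "Unescaped:"
echo "$unescaped"
echo "Escaped:"
echo "$escaped"

rm -r "$tmpdir"
```

The unescaped pattern prints all four lines; the escaped pattern prints only the two lines containing the actual abbreviations.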

On the selection of haystacks

So far, we have only searched single text-based files. However, grep
can search multiple files. You can just put multiple files after the
search pattern and grep will search them all. Also, you can use the -r
option to recursively search all files
under the current directory. For example:

# This command prints all lines where a comment is started
# in index.php and product.php:
grep -E '//|/\*' index.php product.php
# This command prints all lines containing TODO in all files
# under the current directory:
grep -r TODO .

Note that “.” signifies the current (working) directory in the
command line, and that the -r option also searches files in
subdirectories and their subdirectories and so on. When you search
multiple files, grep will automatically print the name of the file in
which each matching line was found before the matching line itself. The
output will look like:

./index.php: // TODO: add simpler user authentication
./header.php: /* TODO: fix the following JavaScript:

Always useful to know which TODO is in which file, isn’t it?
Finally, when you use the -r option and there are some binary files
(non-text files) under the directory being searched, you may get
results like this: “Binary file somefile matches”. Grep can search
binary files, but as they are neither line-based nor human-readable,
all that’s printed if there is a match is this line. You can stop grep
from wasting time on these files by using the -I (capital i) option to
Ignore binary files.
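
The following sketch puts multiple files, -r and -I together. The directory layout and file contents are made up, and logo.bin is just a stand-in for any binary file. I sort the recursive output because the order in which grep walks a directory isn’t guaranteed.

```shell
# Build a small, hypothetical project tree
tmpdir=$(mktemp -d)
mkdir -p "$tmpdir/site/includes"
printf '%s\n' '<?php' '// TODO: add simpler user authentication' \
    > "$tmpdir/site/index.php"
printf '%s\n' '<?php' '/* TODO: fix the following JavaScript: */' \
    > "$tmpdir/site/includes/header.php"
# A file with NUL bytes, which grep treats as binary
printf 'TODO\000\001\n' > "$tmpdir/site/logo.bin"

# Multiple haystacks: grep prefixes each match with its file name
comments=$(cd "$tmpdir/site" && grep -E '//|/\*' index.php includes/header.php)
# Recursive search that skips binary files thanks to -I
text_only=$(cd "$tmpdir/site" && grep -rI TODO . | LC_ALL=C sort)

echo "$comments"
echo "$text_only"

rm -r "$tmpdir"
```

If you drop the -I option, grep also reports a match in logo.bin; the exact wording of that message differs between grep versions.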

Now, there is one more kind of haystack that I want to tell you
about: the standard input. If you do not specify a haystack, grep will
search its stdin. This means that you can pipe the output of another
command to grep, and then search that output for a search pattern. This
allows you to quickly look for lines containing something you want to
know. For example, you can find out if a certain package is installed
on your system:

# For Ubuntu/Debian users
dpkg -l | grep libpng
# For RHEL/CentOS/Fedora users
rpm -qa | grep libpng

The commands before the “|” character generate a list of all
installed packages, and grep only prints the lines containing “libpng”.
You can also search your kernel ring buffer for errors or warnings:

dmesg | grep -Ei 'error|warning'
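
Because the output of these commands differs on every system, here is a sketch that uses printf as a stand-in for the command before the pipe. The package names and kernel messages are invented.

```shell
# printf plays the role of a package manager listing (made-up packages)
installed=$(printf '%s\n' 'libjpeg-turbo 2.1' 'libpng16 1.6' 'zlib 1.2' \
    | grep libpng)
echo "$installed"

# printf plays the role of dmesg (made-up kernel messages)
problems=$(printf '%s\n' \
    'usb 1-1: new device' \
    'ata1: ERROR: reset failed' \
    'thermal: Warning: high temperature' \
    | grep -Ei 'error|warning')
echo "$problems"
```

In both cases no haystack file is given, so grep reads the lines arriving on its standard input and prints only the matching ones.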

Additionally, you can use grep to check if a certain process is
running:

ps -ef | grep sshd

This is actually so commonly used that many distros now include a
command dedicated to this way of grepping: pgrep. If it’s installed,
the equivalent of the above command is “pgrep -fl sshd”. It prints less
information, but doesn’t print an extra line for the grep command
itself.

Bonus: a cool trick

Now, I don’t want you to finish reading this article without knowing
a not-so-obvious trick. Recently, somebody wanted me to believe that a
certain volume of songs we both knew was very egocentric and contained
the word “I” more often than the word “you”. I happened to have a
text-based digital copy of the volume. Using the following grep
commands, I found out that the volume wasn’t so egocentric after all:

grep -iorw I songvolume/ | wc -l
grep -iorw you songvolume/ | wc -l

It turned out that “I” occurred 1257 times and “you” occurred 1803
times. Let me explain how this trick works:

  • “-iorw”: these options
    mean, in the order in which they are listed:

    • be case insensitive
    • for each match, print only the matching characters
      (this means that if a word occurs multiple times on a line, grep prints
      it multiple times on separate lines, which is needed for an accurate
      count)
    • recursively search all files
      under the directory given as a haystack
    • match only whole words, not every “i”
      occurring anywhere
  • “| wc -l”: pipes the
    output of grep, containing one line for every time “I” or “you” occurs,
    to “wc -l”, which counts the number of lines it gets and prints that
    number.
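
The explanation above can be checked with a small sketch. The two “songs” below are made up, and the last command leaves out -o on purpose to show why that option is needed for an accurate count.

```shell
# Build a tiny, hypothetical song volume
tmpdir=$(mktemp -d)
mkdir "$tmpdir/songvolume"
cat > "$tmpdir/songvolume/song1.txt" <<'EOF'
I think I will wait for you
It is raining
EOF
cat > "$tmpdir/songvolume/song2.txt" <<'EOF'
You and I, you know you do
EOF

# -o prints every match on its own line, so wc -l counts occurrences
count_i=$(grep -iorw I "$tmpdir/songvolume/" | wc -l)
count_you=$(grep -iorw you "$tmpdir/songvolume/" | wc -l)
# Without -o, wc -l counts matching lines, not occurrences
lines_i=$(grep -irw I "$tmpdir/songvolume/" | wc -l)

echo "I: $count_i, you: $count_you, lines containing I: $lines_i"

rm -r "$tmpdir"
```

In this sample, “I” occurs three times but on only two lines, so leaving out -o would undercount it. Thanks to -w, the “i” in “It”, “is” and “think” is not counted at all.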

The combination of grep and wc -l is used in many more tricks,
whenever the number of lines containing something is of interest.
Think of web access logs (how many times has my site been visited by
Googlebot?), programming code (how many TODO comments are left?) or
process tables (how many instances of apache are running at this
moment?). If you know another cool way to use grep, feel free to share
it in the comments!

3 thoughts on “Grep Tutorial: Searching File Contents”

  1. Instead of using wc -l to get the count of matches in grep, why not use the “-c” parameter of grep? From the grep man page:

    -c, --count
    Suppress normal output; instead print a count of matching lines
    for each input file. With the -v, --invert-match option (see
    below), count non-matching lines. (-c is specified by POSIX.)
