
More Basic Regular Expressions: Matching an Email Address

Hands-On Lab

 


Elle Krout

Content Team Lead in Content

Length

00:30:00

Difficulty

Beginner

As our knowledge of and experience with regular expressions grow, we can begin to match more generalized items, such as an email address. To come up with a regex that matches an email address, we need to use concepts such as grouping, ranges, repetition, literal characters, and more. This learning activity also expects you to have some knowledge of grep and sort.


  1. Let's begin by considering what we have to match against. A list of example emails has been provided in the instructions, but we also want to consider how email addresses can be formatted as a whole. We know that, generally speaking, the username portion (or "local-part") of the address can contain any sort of letter or number, as well as the following characters: ! # $ % & ' * + - / = ? ^ _ ` { | } ~ . The domain is limited as any domain would be: It can use numbers, letters, or a dash for the second-level domain (SLD), and letters or numbers for the top-level domain (TLD), up to 63 characters.

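  2. Starting with the username, we can use a combination of ranges and literals to denote all the options we can match. The hyphen goes last inside the brackets so that it is read as a literal; written as +-/ it would define a range that also matches a comma. We use the + metacharacter here because the username needs to be at least one character long!

    [a-zA-Z0-9!#$%&'*+/=?^_`{|}~.-]+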
  3. Next, we can add the @ symbol, which is a literal:

    [a-zA-Z0-9!#$%&'*+/=?^_`{|}~.-]+@
  4. The second-level domain must be at least one character long, and has to start with either a letter or number:

    [a-zA-Z0-9!#$%&'*+/=?^_`{|}~.-]+@[a-zA-Z0-9]
  5. After this, the SLD can contain any letters, numbers, or a dash; we want to use * here because although most domains are longer than a single character, it's still within standards to have a single-letter domain -- you never know what the future brings!

    [a-zA-Z0-9!#$%&'*+/=?^_`{|}~.-]+@[a-zA-Z0-9][a-zA-Z0-9-]*
  6. Add the next literal dot, escaped with a backslash so that it matches only a period rather than any character:

    [a-zA-Z0-9!#$%&'*+/=?^_`{|}~.-]+@[a-zA-Z0-9][a-zA-Z0-9-]*\.
  7. Finally, match the top-level domain, which has to be between two and 63 characters:

    [a-zA-Z0-9!#$%&'*+/=?^_`{|}~.-]+@[a-zA-Z0-9][a-zA-Z0-9-]*\.[a-zA-Z0-9]{2,63}
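  8. With our expression created, we can now use grep to test. The pattern is wrapped in single quotes so that the shell leaves $, ! and the backtick alone; the apostrophe inside the brackets is written as '\'':

    grep -Po '[a-zA-Z0-9!#$%&'\''*+/=?^_`{|}~.-]+@[a-zA-Z0-9][a-zA-Z0-9-]*\.[a-zA-Z0-9]{2,63}' customer-data.txt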
  9. Then sort and output to a file:

    grep -Po '[a-zA-Z0-9!#$%&'\''*+/=?^_`{|}~.-]+@[a-zA-Z0-9][a-zA-Z0-9-]*\.[a-zA-Z0-9]{2,63}' customer-data.txt | sort > emails.txt
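
To check an expression like this before pointing it at real data, it can help to run it against a small file whose contents we already know. The sketch below is one way to do that, under a few assumptions: the comma-separated layout and every address in the sample file are invented for illustration, and the lab's actual customer-data.txt may look different. The script writes the hyphen last inside the brackets and escapes the dot before the TLD so that both are treated as literals. It also uses grep -E rather than -P, only so that it runs where grep was built without Perl regex support; the pattern needs no Perl-only features.

```shell
#!/bin/sh
# Sketch: try the email pattern on a made-up sample file.
# The file layout and all addresses below are invented for illustration.
export LC_ALL=C   # keep ranges such as a-z and the sort order predictable

workdir=$(mktemp -d)
cat > "$workdir/customer-data.txt" <<'EOF'
1,Ada Park,ada.park@example.com,Portland
2,Bo Lee,bo_lee+news@mail-host.org,Austin
3,Cy Diaz,no email on file,Denver
4,Di Roy,d@x.io,Boston
EOF

# Single quotes keep $, ! and the backtick literal; '\'' inserts the apostrophe.
# The hyphen is last in the brackets, so it is a literal and not part of a range.
pattern='[a-zA-Z0-9!#$%&'\''*+/=?^_`{|}~.-]+@[a-zA-Z0-9][a-zA-Z0-9-]*\.[a-zA-Z0-9]{2,63}'

# -o prints only the matched text, so the neighboring comma-separated
# fields are left out of the results.
grep -Eo "$pattern" "$workdir/customer-data.txt" | sort > "$workdir/emails.txt"
cat "$workdir/emails.txt"
```

With this sample, the script should print ada.park@example.com, bo_lee+news@mail-host.org, and d@x.io, one per line. Record 3 has no @ and is skipped, and the single-letter domain in record 4 still matches because of the * after the first domain character.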