Mastering Regular Expressions
November 14th, 2018
DevOps Training Architect I in Content
The second part of this course involves three different projects demonstrating ways to combine the use of regular expressions with various scripts and scripting languages to get the desired matches and results.
About the Course
Find out what we'll be covering in Mastering Regular Expressions! In this brief intro, we'll cover how this course is set up, the best way to approach regular expressions, and how we'll be getting hands-on with certain sysadmin-based tasks while learning the language of regex.
About the Training Architect
Learn a little about who is teaching this course before jumping in! Don't worry; this won't take up too much of your time, and you'll be crafting effective regexes soon.
Introducing Regular Expressions
What Are Regular Expressions?
Break down your first regular expression! This lesson gives us a general sense of what a regular expression is, then relates the concept back to some common *nix-based commands you've probably used before. We use the regex for an IP address to demonstrate some basic concepts and get you used to how we'll parse through expressions as we learn.
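To make the IP address idea concrete, here is a minimal sketch in Python (the course itself works with *nix commands). The pattern, the sample text, and the name IP_SHAPE are illustrative assumptions, not the expression used in the lesson.

```python
import re

# Loose IPv4 shape: four groups of 1-3 digits separated by literal dots.
# This only checks the shape; it does not verify that each octet is 0-255.
IP_SHAPE = re.compile(r"\b\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}\b")

sample = "Accepted connection from 192.168.0.12 on port 22"
match = IP_SHAPE.search(sample)
found_ip = match.group(0) if match else None
print(found_ip)
```

Reading it token by token (digits, a quantifier, an escaped dot, repeat) is the same parsing habit the lesson describes.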
Writing regular expressions is easy: You just open up your preferred text editor or terminal and write one! Of course, that leaves out all the testing you want to do to make sure it works. Lucky for us, there are plenty of websites that help us in our quest to master regular expressions! Check out the ones below, and pick your favorite to use with the course! Regexr, Regex101, Regex Pal
The Regex Engine
We know that when we feed our program or application a regular expression, we're trying to get it to match against some other given text. But how does this matching happen? What is the regex doing as it reads through all our characters and metacharacters? And how are the algorithms happening behind the scenes? In this lesson, we take a peek behind the curtain and look at how our expressions work.
We know there are a multitude of different engines that power our regular expressions. But the differences don't stop there; regex engines also have different, and sometimes conflicting, standards to follow, including three standards by IEEE POSIX (Basic Regular Expressions, Extended Regular Expressions, and the deprecated Simple Regular Expressions) and the common Perl Compatible Regular Expressions. In this video, we check out the differences between these standards, then watch what happens when we try to use one standard in a utility made for another.
Using Regular Expressions
Basic Pattern Matching
While most of the time we think of regular expressions as a string of symbols, it's important to remember that sometimes less is more and regex can match plain, literal characters, as well. In this lesson, we take a look at how we can match a specific IP address, as well as see why we need to watch which characters we try to use literally. Get the lesson files here!
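A small Python sketch of why literal matching needs care, assuming a made-up list of lines rather than the lesson files: an unescaped dot is a wildcard, so a "literal" IP address can match text that is not an IP address at all.

```python
import re

lines = ["10.0.0.1 web01", "10a0b0c1 junk", "10.0.0.10 web02"]

# Unescaped dots match any character, so "10a0b0c1" slips through.
unescaped = [line for line in lines if re.search(r"10.0.0.1", line)]

# re.escape turns every metacharacter in the string into a literal.
literal = re.escape("10.0.0.1")
escaped = [line for line in lines if re.search(literal, line)]

print(unescaped)
print(escaped)
```

Even the escaped version still matches 10.0.0.10, because the literal text appears inside it; limiting a match to a whole token is a separate problem from escaping.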
Characters and Words
Before we start drafting the regular expressions to convert an HTML document into something else, we need to learn each of the metacharacters we can use to break down and demonstrate a pattern, starting with the basics: How to match word characters (which are letters, digits, and the underscore). In this lesson, we look at a product list, and, using nothing but the word-related regular expressions and grep, figure out how to pull out the information we need. Get the lesson files here!
As we continue to learn the basic alphabet of regular expressions, we take a look at how we can match digits; that is, the characters 0-9. Using the product list from the last lesson, we adapt and evolve our previous regular expression to utilize the digit metacharacter, as well as learn how to use a numerical range. Get the lesson files here!
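The word and digit metacharacters can be sketched in Python as follows. The product list here is invented for illustration and is not the lesson file.

```python
import re

products = "widget_a 1499\ngadget-b 250\nthing9 75"

# \w+ matches runs of letters, digits, and underscores (a hyphen ends the run).
words = re.findall(r"\w+", products)

# \d{3,} matches three or more digits; \b keeps it from matching inside a word.
big_numbers = re.findall(r"\b\d{3,}\b", products)

# [0-4] is a numerical range: any single digit from 0 through 4.
low_digits = re.findall(r"[0-4]", "2018")

print(words, big_numbers, low_digits)
```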
We can now match most letters and digits; but what about the spaces between those characters? Whitespace includes regular spacing, tabs, newlines, carriage returns, and sometimes more. Through using sed and a sample /etc/hosts file, we use regular expressions to normalize our tabs while learning about how to target which spaces we want to capture. Get the lesson files here! Sed man page: https://linux.die.net/man/1/sed
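As a rough Python equivalent of the tab normalization described above (the lesson itself uses sed), the sketch below collapses mixed spaces and tabs into a single tab. The hosts content is a made-up sample.

```python
import re

hosts = "127.0.0.1\t\tlocalhost\n10.0.0.5   \t db01   db01.example.com\n"

# [ \t]+ targets only spaces and tabs. \s+ would also match the newlines
# and join every entry onto one line.
normalized = re.sub(r"[ \t]+", "\t", hosts)

print(normalized)
```

Choosing the narrower class instead of \s is the "target which spaces we want" decision in miniature.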
We've matched letters, numbers, and whitespace; but what about letters and characters outside of the standard ASCII set? Some regex flavors support the use of Unicode to match accented characters, umlauts, and more, opening up the ability to do things like properly verify names of all kinds and support languages besides English. In this lesson, we take a look at both matching individual characters and using Unicode character categories to match specific groups of characters. Helpful links: Regex101, Unicode Table
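Unicode support varies a lot between flavors. As one example, Python's built-in re module does not accept \p{...} category syntax, but its \w is Unicode-aware by default; the sketch below shows the difference that makes. The sample name is an assumption chosen for illustration.

```python
import re

name = "Zo\u00eb M\u00fcller"  # "Zoë Müller", written with escapes to avoid encoding issues

# By default, \w in a Python str pattern matches Unicode letters too.
unicode_words = re.findall(r"\w+", name)

# With re.ASCII, \w is limited to [A-Za-z0-9_], so accented letters split the words.
ascii_words = re.findall(r"\w+", name, flags=re.ASCII)

print(unicode_words)
print(ascii_words)
```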
Alternation and Quantifiers
We know how to match individual characters and tokens, but what about when we want to define our match by where in the file the match is located? This is where location match tokens such as ^ and $ come in. These metacharacters allow us to define whether our match comes at the beginning of a line, the end of a line, or both. In this lesson, we use grep to demonstrate the matching capability of these tokens, then learn a more practical example where we remove blank lines from a file using sed in combination with our location-based regex. Get the lesson files here!: Customer Data, Access Logs
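A Python sketch of the two anchors, using invented text. The blank-line filter approximates what sed '/^$/d' does; whether the lesson uses that exact command is an assumption.

```python
import re

text = "alpha\n\nbeta gamma\n   \nalpha beta\n"

# With re.MULTILINE, ^ and $ match at the start and end of every line,
# not just at the start and end of the whole string.
starts_with_alpha = re.findall(r"^alpha.*$", text, flags=re.MULTILINE)

# ^$ matches a line with nothing between its start and its end.
kept = [line for line in text.splitlines() if not re.search(r"^$", line)]
no_blank = "\n".join(kept)

print(starts_with_alpha)
print(no_blank)
```

The line containing only spaces survives, because ^$ requires the line to be truly empty; ^\s*$ would catch it as well.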
Now that we know how to define the start and end of a line using regular expressions, we want to consider how to limit our matches that don't have such static start or endpoints. Boundaries allow us to define the set end or beginning of a match, ensuring that our match is not part of a larger word or text; so when we search for cat, we don't get results like catastrophe or concatenate. This lesson covers ERE, PCRE, and most modern-day regex boundaries, as well as an older method of marking boundaries that works in programs such as vim. Get the lesson files here!: Customer Data
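The cat example can be sketched in Python with the \b word boundary. The sentence is made up for the demonstration.

```python
import re

text = "The cat sat; catastrophe and concatenate do not count, but cat. does"

# Without boundaries, "cat" is found inside longer words too.
loose = re.findall(r"cat", text)

# \b asserts a position between a word character and a non-word character.
bounded = re.findall(r"\bcat\b", text)

print(len(loose), len(bounded))
```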
Up until this point, we've focused on translating each part of our matches one at a time into the best general option, whether that be metacharacter or literal. But regular expressions aren't limited to one-to-one matches. Alternation allows us to provide regex options for what we want a token to match. In this lesson, we learn how to use the vertical bar (|) metacharacter to tell our regex engine when we want it to consider multiple matches. Get the lesson files here!: Customer Data
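A short Python sketch of alternation, with invented status lines standing in for the customer data.

```python
import re

lines = ["status: active", "status: inactive", "status: pending", "status: closed"]

# The parentheses limit the alternation to the two status words.
# Without them, the pattern would mean "status: active" OR "pending$".
pattern = re.compile(r"status: (active|pending)$")
matched = [line for line in lines if pattern.search(line)]

print(matched)
```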
As we move past letter-by-letter translations into regex, the ability to take our metacharacters and subexpressions and either make them optional or have them repeat for determinate or indeterminate amounts becomes necessary. When using repetition, it's the metacharacter or subexpression that's repeated, not the text it matched: \d+ matches any run of digits, not the same digit over and over. In this lesson, we look at the various repetition metacharacters, then use them in conjunction with what we've learned in previous lessons to create a regular expression for a website. Get the lesson files here!: Access Logs
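Here is one way the repetition metacharacters might combine into a website pattern, sketched in Python. The pattern is a simplified assumption (it stops at the host name and ignores paths and ports), and the log line is invented.

```python
import re

# ? makes the preceding token optional; + repeats it one or more times.
URL = re.compile(r"https?://(www\.)?[\w-]+(\.[\w-]+)+")

log_line = "GET http://example.com/index.html ref https://www.example.org/a"
urls = [m.group(0) for m in URL.finditer(log_line)]

# The quantifier repeats the token, not the text it matched:
# \d{3} accepts any three digits, not one digit three times over.
three_digits = re.fullmatch(r"\d{3}", "404") is not None

print(urls, three_digits)
```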
While quantifiers are amazing when we need to repeat a token or expression, they can also prove to be a little too greedy when paired with the . wildcard, causing some unintended results with our matches. In this lesson, we learn how to pair the question mark metacharacter (?) with our two quantifiers, + and *, to create lazy or non-greedy matches and restrict the behavior of the quantifier. There are no files needed for this lesson.
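The difference between greedy and lazy matching is easiest to see side by side. A minimal Python sketch with a made-up HTML fragment:

```python
import re

html = "<b>one</b> and <b>two</b>"

# Greedy: .* grabs as much as it can, so the match runs to the LAST </b>.
greedy = re.findall(r"<b>.*</b>", html)

# Lazy: .*? grabs as little as it can, so each match stops at the NEXT </b>.
lazy = re.findall(r"<b>.*?</b>", html)

print(greedy)
print(lazy)
```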
Classes and Groups
More Character Classes
We've used square brackets frequently in previous lessons, but there's more to them than we've addressed. In this lesson, we get in-depth with character classes and learn how to use them not just to match a range of characters, but also how to negate which characters we want to be matched. We also address how to use subtraction in regex, a feature limited to XML, .NET, JGSoft, and XPath implementations. Get the lesson files here!: Customer Data
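A Python sketch of negated classes. Python's re module is not one of the flavors with class subtraction, so the second pattern uses a negative lookahead as a stand-in; that workaround is an assumption of this example, not something from the lesson.

```python
import re

# [^...] negates the class: anything that is not a digit or whitespace.
non_digits = re.findall(r"[^0-9\s]+", "order 66 shipped")

# Stand-in for "lowercase letters minus vowels": reject a vowel with a
# negative lookahead, then match any lowercase letter.
consonants = re.findall(r"(?![aeiou])[a-z]", "regex")

print(non_digits, consonants)
```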
Just like there's more to our square-bracketed character classes, there's more to the use of parentheses in regular expressions, too. Parentheses work as capturing groups, which capture the match that the regular expression has made. Later, we can reference that match with a backreference, ensuring that the text matched by the backreference is identical to the match from the capturing group. This allows us more detailed and accurate matches when we create our expressions. Get the lesson files here!: Customer Data
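A backreference can be sketched in Python like this. The heading pattern and sample strings are illustrative assumptions.

```python
import re

# Group 1 captures the tag name; \1 requires the closing tag to repeat
# exactly the text that group 1 matched.
heading = re.compile(r"<(h[1-6])>(.*?)</\1>")

good = heading.search("<h2>Anchors</h2>")
bad = heading.search("<h2>Anchors</h3>")

print(good.group(1), good.group(2))
print(bad)
```

Writing h[1-6] a second time instead of \1 would accept the mismatched pair, which is exactly what the backreference prevents.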
In the last lesson, we learned how to use capturing groups and backreferences to match the previously-matched text. However, we do not have to rely on the numerical reference style previously shown. Instead, we can use one of the many ways to define a named group (depending on our regex implementation). This allows us to give our groups human-readable names and is especially useful when we have multiple captured matches (both because we can better understand what the group references and because, in tools such as sed, numerical references only go up to nine). Get the lesson files here!: Study Guide
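Named group syntax differs by implementation. The sketch below uses Python's form, (?P<name>...) to define a group and (?P=name) to refer back to it; the group names and sample heading are assumptions.

```python
import re

heading = re.compile(r"<(?P<tag>h[1-6])>(?P<title>.*?)</(?P=tag)>")

m = heading.search("<h3>Named Groups</h3>")
tag = m.group("tag")
title = m.group("title")

print(tag, title)
```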
We've learned about capturing groups: Now what about when we want to use parentheses as visual helpers for when humans read our regexes and not capture anything with them? Luckily, we can turn our capturing groups into non-capturing groups for just this reason. Non-capturing groups allow us to use grouping without that group being counted towards any of our numerical backreferences. To demonstrate this, we further refactor our table of contents one-liner, switching from grep and sed to Perl to print out our captured heading, all while ignoring the matched HTML tags. Get the lesson files here!: Customer Data
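The effect on group numbering can be sketched in Python (the lesson itself uses Perl). The patterns are simplified assumptions.

```python
import re

line = "<h1>Intro</h1>"

# Every pair of plain parentheses captures, so the heading text is group 2.
capturing = re.search(r"<(h1|h2)>(.*?)</(h1|h2)>", line)

# (?:...) groups without capturing, so the heading text becomes group 1.
non_capturing = re.search(r"<(?:h1|h2)>(.*?)</(?:h1|h2)>", line)

print(capturing.groups())
print(non_capturing.groups())
```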
With regular expressions, we don't just have to limit ourselves to crafting expressions that capture everything in the expression. Instead, we can create groups within our expression that work similarly to boundaries, called lookarounds. In this lesson, we start with lookaheads, which let us write a subexpression then ensure our regex is either followed by that expression (called a positive lookahead) or NOT followed by that expression (called a negative lookahead), allowing for more fine-tuned matching and capturing. Get the lesson files here!: Customer Data, Access Logs
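Both kinds of lookahead, sketched in Python with an invented price list. Note that the lookahead text never becomes part of the match.

```python
import re

prices = "apple 30USD, pear 45EUR, fig 12USD"

# Positive lookahead: digits that ARE followed by USD.
usd = re.findall(r"\d+(?=USD)", prices)

# Negative lookahead: digits NOT followed by USD. The extra |\d stops the
# engine from backtracking to "3" in "30USD" and calling that a match.
not_usd = re.findall(r"\b\d+(?!USD|\d)", prices)

print(usd, not_usd)
```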
We know we can use lookaheads to match (or not match), but not capture, anything following our regular expression; we can also do the same for anything that comes before. Lookbehinds allow us to craft a subexpression to be used as a boundary, either by ensuring the text comes before our captured match or by making sure it doesn't. Lookbehinds do have some limitations, however: in most flavors they must be fixed-width, so no repetition can be used. We also pair our lookbehinds with a lookahead to further restrict our matches. Get the lesson files here!: Study Guide
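A Python sketch of positive and negative lookbehinds, one paired with a lookahead, plus the fixed-width restriction as Python's re module enforces it. The sample strings are made up.

```python
import re

log = "user=alice id=42 user=bob id=7"

# Positive lookbehind: word characters preceded by "user=".
names = re.findall(r"(?<=user=)\w+", log)

# Lookbehind paired with a lookahead: digits after "id=" and before
# whitespace or the end of the string.
ids = re.findall(r"(?<=id=)\d+(?=\s|$)", log)

# Negative lookbehind: numbers that are not preceded by a dollar sign.
quantities = re.findall(r"(?<!\$)\b\d+", "cost $15 qty 3")

# Repetition inside a lookbehind makes it variable-width, which re rejects.
try:
    re.compile(r"(?<=user=+)\w+")
    variable_width_allowed = True
except re.error:
    variable_width_allowed = False

print(names, ids, quantities, variable_width_allowed)
```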
Regular expressions allow us to craft rudimentary if-then-else statements within the expression itself. If a specific condition is met (for example, an earlier group took part in the match), the engine tries the "then" subexpression; if not, it tries the subexpression in the "else" section instead. This allows us to refine our matches further and opens up the opportunity to craft simple if statements in programs like sed and grep (where the regex flavor in use supports them), without needing to use a more extensive scripting language. Get the lesson files here!: Highstate File
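Conditional syntax is flavor-specific. In Python it is written (?(1)then|else), where 1 is the number of a capturing group. The sketch below, which uses an address-like pattern chosen purely for illustration and a hypothetical helper named accepts, requires a closing > only if an opening < was matched.

```python
import re

# If group 1 (the "<") took part in the match, require ">";
# otherwise require the end of the string.
ADDRESS = re.compile(r"(<)?(\w+@\w+(?:\.\w+)+)(?(1)>|$)")

def accepts(text):
    """Return True if text matches from its first character."""
    return ADDRESS.match(text) is not None

for candidate in ["<user@host.com>", "user@host.com", "<user@host.com", "user@host.com>"]:
    print(candidate, accepts(candidate))
```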
Named and Nested Conditionals
We aren't just limited to writing our regex if statements in the manner of the last lesson. Instead, we can use any type of named capturing group alongside our statement; this is especially useful when we want to include multiple match options in our if statements. We can do that by nesting if statements within the else portion of the regex. In this lesson, we expand on our prior statement to use named grouping for clarity and pull more information from our file. Get the lesson files here!: Highstate File
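Named and nested conditionals can be sketched in Python as below. The group names, the bracket-matching task, and the helper unwrap are all assumptions for illustration; the lesson works on a highstate file instead.

```python
import re

# If "angle" matched, require ">"; else, if "square" matched, require "]";
# else require the end of the string. The second test is nested in the
# else branch of the first.
WRAPPED = re.compile(
    r"(?P<angle><)?(?P<square>\[)?(?P<word>\w+)(?(angle)>|(?(square)\]|$))"
)

def unwrap(text):
    """Return the inner word if the whole text matches, else None."""
    m = WRAPPED.fullmatch(text)
    return m.group("word") if m else None

for candidate in ["<salt>", "[salt]", "salt", "<salt]", "[salt>"]:
    print(candidate, unwrap(candidate))
```

The names make the condition readable at a glance, which matters more as the branches nest deeper.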
Using a sed Script to Generate Human-Readable Files - Part 1
Now that we know how to speak fluent regex, we need to start using it for more than just simple grepping! We're going to start small by taking a JSON output file and making it into a human-readable report we can use. Since we've already used sed through part of the course, we're going to start by writing a full sed script we can use across a myriad of files. Get the lesson files here: Highstate File
Using a sed Script to Generate Human-Readable Files - Part 2
Watch part 1 first! Now that we know how to speak fluent regex, we need to start using it for more than just simple grepping! We're going to start small by taking a JSON output file and making it into a human-readable report we can use. Since we've already used sed through part of the course, we're going to start by writing a full sed script we can use across a myriad of files. Get the lesson files here: Highstate File
Using Perl to Convert an HTML Document - Part 1
We've wrangled with sed, but now it's time to tango with Perl. In this project, we're going to go through that same HTML study guide we've worked with previously, but this time we're writing a script to convert it to markdown. We'll get a sense of how regular expressions work within Perl, as well as learn how to use line matches in if statements to get the exact results we want.
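The "line match inside an if statement" idea looks roughly like this in Python (the project itself is written in Perl). The function name and the heading-only scope are assumptions; a real converter would handle many more tags.

```python
import re

def html_heading_to_markdown(line):
    """Convert a one-line <h1>..<h6> heading to Markdown; pass other lines through."""
    m = re.match(r"\s*<h([1-6])>(.*?)</h\1>\s*$", line)
    if m:
        # The captured digit decides how many # characters to emit.
        return "#" * int(m.group(1)) + " " + m.group(2)
    return line

print(html_heading_to_markdown("<h2>Anchors</h2>"))
print(html_heading_to_markdown("<p>text</p>"))
```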
Using Perl to Convert an HTML Document - Part 2
Be sure to watch part 1 first! We've wrangled with sed, but now it's time to tango with Perl. In this project, we're going to go through that same HTML study guide we've worked with previously, but this time we're writing a script to convert it to markdown. We'll get a sense of how regular expressions work within Perl, as well as learn how to use line matches in if statements to get the exact results we want.
For our final activity, we're going to move to the front end and take a look at using regular expressions to validate a username and password on a registration form for a website. We'll make use of classes, lookarounds, and both basic and advanced expressions we can use to achieve our desired results, as well as take a look at any common issues we can experience with regex validation. Get the git repo!
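A sketch of lookahead-based validation, written in Python rather than front-end code. The rules here (username length, required character types, minimum password length) and the names USERNAME, PASSWORD, and is_valid are hypothetical and are not taken from the course repo.

```python
import re

# Username: starts with a letter, then 2-15 letters, digits, or underscores.
USERNAME = re.compile(r"[A-Za-z][A-Za-z0-9_]{2,15}")

# Password: each lookahead checks one requirement without consuming text,
# then .{8,} enforces the minimum length.
PASSWORD = re.compile(r"(?=.*[a-z])(?=.*[A-Z])(?=.*\d)(?=.*[^\w\s]).{8,}")

def is_valid(username, password):
    """Return True only if both values match their whole pattern."""
    return bool(USERNAME.fullmatch(username)) and bool(PASSWORD.fullmatch(password))

print(is_valid("regex_fan", "Sup3r$ecret"))
print(is_valid("9lives", "Sup3r$ecret"))
print(is_valid("regex_fan", "password"))
```

Matching the whole string (fullmatch here) is one of the common pitfalls with regex validation: a partial match would let invalid input through.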
Final Thoughts and Next Steps
Congratulations! You've finished Mastering Regular Expressions! And if you're wondering where to go from here, don't worry: we've got you covered with suggestions for future regex tasks and other courses to check out.