01-05-2021

Ruby Regex Cheat Sheet

Now that you're bobbing along atop the waves, it's time to relax and explore your surroundings. Get your swim fins on, and head on out into deeper waters.

Thus far, our explorations have given us a good handle on the different types of patterns that can appear in a regex. You know how to match specific characters and classes of characters, how to anchor your matches, and even how to match strings of different sizes and content. However, you've seen only a handful of examples that show what this looks like in real code. We'll rectify that in this section by introducing a handful of Ruby and JavaScript methods that use regex. This discussion isn't comprehensive, but it provides the tools you'll reach for most often; many developers rarely need anything more.

You can study these methods at your own pace and come back to this page as a cheat sheet later, when you are experimenting with your own regular expressions. As a reminder, a regular expression, or 'regex', is a pattern used to match parts of a string.

Oddly, the Regexp (Ruby) and RegExp (JavaScript) classes don't provide the regex methods you'll use most often. Instead, the String class does.

Matching Strings

We've already seen match in some of our examples. This method returns a value that indicates whether a match occurred, and what substrings matched. When a match occurs, the return value is 'truthy'; when none occurs, it is 'falsy'. You can therefore test it in a conditional expression in either Ruby or JavaScript to determine whether a given string matched a regex. At its most basic, we use it like this:

Ruby
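The original listing did not survive on this page, so what follows is a minimal Ruby sketch of the idea rather than the author's exact code. The URL pattern, the sample strings, and the fetch_url stub are all assumptions made for illustration.

```ruby
# Hypothetical stand-in for a real download routine; it only records the call.
FETCHED = []

def fetch_url(text)
  FETCHED << text
end

# Assumed pattern: the whole string is an http or https URL with no whitespace.
URL_REGEX = /\Ahttps?:\/\/\S+\z/

['https://example.com/page', 'not a url'].each do |text|
  # match returns a truthy value on success and nil on failure
  fetch_url(text) if text.match(URL_REGEX)
end

p FETCHED
```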

JavaScript

Here we call fetch_url(text) when match returns a value that indicates a match: that is when text contains something that looks like a URL.

We won't discuss the return value of match in detail; see the documentation instead. For now, it is enough to know that a successful match returns an Array-like value that contains the matched portion of the string, along with any capture groups defined in the regex. If we name this value capture, then capture[0] represents the entire matched portion of text, while capture[1], capture[2], etc. correspond to the capture groups. (We discuss capture groups below.) If the regex doesn't match text, then Ruby returns nil, while JavaScript returns null.

In Ruby, the return value of match isn't an Array, but a MatchData object that responds to [0], [1], [2], and so on. You cannot apply most Array methods to this object directly.
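As a small sketch of that difference, the snippet below indexes a MatchData object and then converts it when Array behavior is needed. The sample string and pattern are made up for illustration.

```ruby
result = 'Order 66 shipped'.match(/(\w+) (\d+)/)

p result.class     # MatchData, not Array
p result[0]        # the entire matched portion
p result[1]        # first capture group
p result[2]        # second capture group

# Convert explicitly when you need Array methods.
as_array = result.to_a       # whole match followed by the groups
groups   = result.captures   # only the groups
p as_array
p groups
```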

In Ruby, you sometimes see something like this:
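The original snippet is missing here; a plausible form, using an assumed URL pattern and sample text, looks like this:

```ruby
text = 'See http://example.com for details'

# =~ returns the index at which the match starts, or nil when nothing matches.
index = text =~ /https?:\/\/\S+/
puts "URL starts at index #{index}" if index

no_match = 'plain text' =~ /https?:\/\/\S+/
p no_match
```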

=~ is similar to match, except that it returns the index within the string at which the regex matched, or nil if there was no match. =~ is measurably faster than match, so some Rubyists prefer to use it when they can. Others dislike it because it is unfamiliar, or simply because it reminds them of Perl, where it saw widespread use.

Rubyists should also investigate the String#scan method; it is a global form of match that returns an Array of all matching substrings.
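Here is a brief sketch of scan with invented sample strings. Note that when the regex contains capture groups, scan returns an Array of Arrays holding the captures instead of the whole matches.

```ruby
all_matches = 'cat bat rat mat'.scan(/\wat/)
p all_matches   # every matching substring, in order of appearance

# With capture groups, each element is an Array of that match's captures.
pairs = 'a=1, b=2'.scan(/(\w)=(\d)/)
p pairs
```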

Splitting Strings

Applications that process text often must analyze data composed of records and fields separated by special delimiter characters. A typical format has records separated by newlines, and fields separated by tabs. Such data often needs parsing before you can use it in your program; the split method is a handy parsing tool.

split is frequently used with a simple string as a delimiter:

Ruby
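The listing that belonged here is missing, so this is a minimal sketch with an invented tab-delimited record:

```ruby
record = "xyzzy\t3456\t334\tabc"

# Split on a literal tab character.
fields = record.split("\t")
p fields
```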

JavaScript

As you can see, split returns an Array that contains the values from each of the split fields.

Not all delimiters are as simple as that, though. Sometimes, formatting is much more relaxed. For example, you may encounter data where arbitrary whitespace characters separate fields, and there may be more than one whitespace character between each pair of items. The regex form of split comes in handy in such cases:

Ruby
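Again the original listing is missing; the sketch below assumes a record whose fields are separated by irregular runs of spaces and tabs.

```ruby
record = "xyzzy  3456  \t  334\t\t\tabc"

# \s+ treats any run of one or more whitespace characters as a single delimiter.
fields = record.split(/\s+/)
p fields
```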

JavaScript

Beware of regex like /:*/ and /\t?/ when using split. Recall that the * quantifier matches zero or more occurrences of the pattern it modifies, and ? matches zero or one, so both patterns can match an empty string. In the case of split, the result may be totally unexpected:
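A short sketch of the pitfall, using an invented colon-delimited string, alongside the + quantifier that avoids it:

```ruby
text = 'abc:xyz'

# /:*/ also matches the empty string, so split breaks between every character.
surprising = text.split(/:*/)
p surprising

# /:+/ requires at least one colon, so only the real delimiter splits the string.
intended = text.split(/:+/)
p intended
```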

The result is a six-element array instead of the two-element array you may have expected. This happens because the regex also matches the empty gaps between letters: zero occurrences of : occur between each pair of adjacent characters.

Capture Groups: A Diversion

Before moving on to the final methods in our whirlwind tour, we first need to talk about capture groups. (Regex also have non-capture groups, but we won't cover them here.) You've already encountered capture groups, though we called them something different at the time: grouping parentheses. We didn't mention it then, but these meta-characters have another function: they provide capture and non-capture groups.

Capture groups capture the matching characters that correspond to part of a regex. You can reuse these matches later in the same regex, and when constructing new values based on the matched string.

We'll start with a simple example. Suppose you need to match quoted strings inside some text, where either single or double quotes delimit the strings. How would you do that using the regex patterns you know? You might consider a pattern such as /['"].+?['"]/ as your first attempt, but you'll soon find that it also matches strings that mix single and double quotes. This may not be what you want. Instead, you need a way to capture the opening quote and reuse that character for the closing quote. It's time to call on capture groups:
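The original regex was lost from this page. Based on the description that follows, it was presumably close to the strict pattern below; treat both patterns and the sample text as a reconstruction, not the author's exact listing.

```ruby
text = %q(say "hello' world" now)

naive  = /['"].+?['"]/   # any quote, then any quote
strict = /(['"]).+?\1/   # any quote, then the same quote again

naive_match  = text[naive]
strict_match = text[strict]

puts naive_match    # stops at the mismatched single quote
puts strict_match   # runs on to the matching double quote
```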

Here the group captures the part of the string that matches the pattern between parentheses; in this case, either a single or double quote. We then match one or more of any other character and end with \1: we call this sequence a backreference because it references the first capture group in the regex. If the first group matches a double quote, then \1 matches a double quote, but not a single quote.

It may be more reasonable to use two regex to solve this problem:
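One way this might look in Ruby is sketched below. The constant names and the quoted? helper are hypothetical, introduced only for this illustration.

```ruby
DOUBLE_QUOTED = /".+?"/
SINGLE_QUOTED = /'.+?'/

# Hypothetical helper: true when text contains a consistently quoted string.
def quoted?(text)
  !!(text.match(DOUBLE_QUOTED) || text.match(SINGLE_QUOTED))
end

p quoted?(%q(a "quoted" word))
p quoted?(%q(a 'quoted' word))
p quoted?(%q(a "mismatched' word))
```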

It's easier to read and maintain when written like this. However, you will almost certainly encounter problems where a single regex with a backreference is the preferred solution.

A regex may contain multiple capture groups, numbered from left to right as groups 1, 2, 3, and so on, up to 9. As you might expect, the backreferences are \1, \2, \3, ..., and \9.
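As a small illustration of numbering, the made-up pattern below uses two groups and refers back to them in reverse order, so it matches four-character sequences that read the same in both directions.

```ruby
mirror = /(\w)(\w)\2\1/   # group 1, group 2, then group 2 and group 1 again

p 'abba'.match?(mirror)
p 'abab'.match?(mirror)

found = 'xx noon yy'.match(mirror)
p found[0]   # the whole match
p found[1]   # group 1
p found[2]   # group 2
```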

Note that there are patterns in Ruby that allow for named groups and named backreferences, but this is beyond the scope of this book. If you find yourself needing multiple groups in Ruby regex, you may want to investigate these named groups and backreferences.

While you can use capture groups in any regex, they are most useful in conjunction with methods that use regex to transform strings. We'll see this in the next two sections.

By the way: did you notice that lazy quantifier in our regex? Why do you think we used that here?

Transformations in Ruby

While regex-based transformations in Ruby and JavaScript are conceptually similar, the implementations are different. We'll cover these transformations in separate sections.

Transforming a string with regex involves matching that string against the regex, and using the results to construct a new value. In Ruby, we typically use String#sub and String#gsub. #sub transforms the first part of a string that matches a regex, while #gsub transforms every part of a string that matches.

Here's a simple example:
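The original listing is missing; a minimal sketch with an assumed sample string shows gsub next to sub for comparison.

```ruby
text = 'Four score and seven'

# gsub replaces every match.
all_vowels = text.gsub(/[aeiou]/, '*')
puts all_vowels

# sub replaces only the first match.
first_vowel = text.sub(/[aeiou]/, '*')
puts first_vowel
```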

Here we replace every vowel in text with an *.

We can use backreferences in the replacement string (the second argument):
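A sketch of what this might look like, reusing the quote-matching idea from the capture groups discussion. The sample sentence and book titles are invented.

```ruby
text = %(We read "War of the Worlds".)

# \1 in the replacement inserts whatever the first capture group matched,
# so the new title keeps the same kind of quote as the old one.
updated = text.sub(/(['"]).+\1/, '\1The Time Machine\1')
puts updated
```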

One thing to note here is that if you double quote the replacement string, you need to double up on the backslashes:
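The sketch below, with the same invented sentence, compares the two quoting styles and shows what goes wrong when the backslash is not doubled.

```ruby
text = %(We read "War of the Worlds".)
quoted = /(['"]).+\1/

single = text.sub(quoted, '\1The Time Machine\1')
double = text.sub(quoted, "\\1The Time Machine\\1")
p single == double   # both forms yield the same result

# In double quotes, "\1" is an escape for the character with code 1,
# not a backreference, so the quotes are lost.
broken = text.sub(quoted, "\1The Time Machine\1")
p broken
```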

When possible, try to use single quotes to avoid leaning toothpick syndrome.

Transformations in JavaScript

Transforming a string with regex involves matching that string against the regex, and using the results of the match to construct a new value. In JavaScript, we can use the replace method which transforms the matched part of a string. If the regex includes a g option, the transformation applies to every match in the string.

Here's a simple example:

Here, a call such as text.replace(/[aeiou]/g, '*') replaces every vowel in text with an *. The transformation applies globally because of the g option on the regex.

We can use backreferences in the replacement string (the second argument):

One thing to note here is that the backreferences in the replacement string use $1, $2, etc. instead of \1, \2, etc.