New JavaScript Features That Will Change How You Write Regex
Faraz Kelhini, 2019-02-08
There’s a good reason the majority of programming languages support regular expressions: they are extremely powerful tools for manipulating text. Text processing tasks that require dozens of lines of code can often be accomplished with a single line of regular expression code. While the built-in functions in most languages are usually sufficient to perform search and replace operations on strings, more complex operations — such as validating text inputs — often require the use of regular expressions.
Regular expressions have been part of the JavaScript language since the third edition of the ECMAScript standard, which was introduced in 1999. ECMAScript 2018 (or ES2018 for short) is the ninth edition of the standard and further improves the text processing capability of JavaScript by introducing four new features:
- Lookbehind assertions
- Named capture groups
- `s` (`dotAll`) flag
- Unicode property escapes
These new features are explained in detail in the subsections that follow.
Lookbehind Assertions
The ability to match a sequence of characters based on what follows or precedes it enables you to discard potentially undesired matches. This is especially important when you need to process a large string and the chance of undesired matches is high. Fortunately, most regular expression flavors provide the lookbehind and lookahead assertions for this purpose.
Prior to ES2018, only lookahead assertions were available in JavaScript. A lookahead allows you to assert that a pattern is immediately followed by another pattern.
There are two versions of lookahead assertions: positive and negative. The syntax for a positive lookahead is `(?=...)`. For example, the regex `/Item(?= 10)/` matches `Item` only when it is followed, with an intervening space, by the number 10:
const re = /Item(?= 10)/;
console.log(re.exec('Item'));
// → null
console.log(re.exec('Item5'));
// → null
console.log(re.exec('Item 5'));
// → null
console.log(re.exec('Item 10'));
// → ["Item", index: 0, input: "Item 10", groups: undefined]
This code uses the `exec()` method to search for a match in a string. If a match is found, `exec()` returns an array whose first element is the matched string. The `index` property of the array holds the index of the matched string, and the `input` property holds the entire string that the search was performed on. Finally, if named capture groups are used in the regular expression, they are placed on the `groups` property. In this case, `groups` has a value of `undefined` because there is no named capture group.
The construct for a negative lookahead is `(?!...)`. A negative lookahead asserts that a pattern is not followed by a specific pattern. For example, the pattern `/Red(?!head)/` matches `Red` only if it is not followed by `head`:
const re = /Red(?!head)/;
console.log(re.exec('Redhead'));
// → null
console.log(re.exec('Redberry'));
// → ["Red", index: 0, input: "Redberry", groups: undefined]
console.log(re.exec('Redjay'));
// → ["Red", index: 0, input: "Redjay", groups: undefined]
console.log(re.exec('Red'));
// → ["Red", index: 0, input: "Red", groups: undefined]
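Because a lookahead consumes no characters, several of them can be stacked at the same position to express independent conditions. The sketch below is only an illustration, not a recommended password policy; the helper name `hasDigitAndLowercase` and the rules it checks are assumptions made for this example:

```javascript
// Hypothetical helper: true when the string has at least 8 characters,
// at least one digit, and at least one lowercase ASCII letter.
function hasDigitAndLowercase(str) {
  // Both lookaheads are evaluated at the start of the string and consume
  // nothing, so .{8,} still has to match the entire string afterwards.
  return /^(?=.*\d)(?=.*[a-z]).{8,}$/.test(str);
}

console.log(hasDigitAndLowercase('abcdefg1')); // → true
console.log(hasDigitAndLowercase('abcdefgh')); // → false (no digit)
console.log(hasDigitAndLowercase('a1'));       // → false (too short)
```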
ES2018 complements lookahead assertions by bringing lookbehind assertions to JavaScript. Denoted by `(?<=...)`, a lookbehind assertion allows you to match a pattern only if it is preceded by another pattern.
Let’s suppose you need to retrieve the price of a product in euro without capturing the euro symbol. With a lookbehind, this task becomes a lot simpler:
const re = /(?<=€)\d+(\.\d*)?/;
console.log(re.exec('199'));
// → null
console.log(re.exec('$199'));
// → null
console.log(re.exec('€199'));
// → ["199", undefined, index: 1, input: "€199", groups: undefined]
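With the `g` flag, the same kind of lookbehind can pull every euro amount out of a longer string. This is a small sketch; the sample text and the variable names are made up for illustration, and the pattern is written with its backslashes (`\d`, `\.`):

```javascript
// Digits (with an optional decimal part) that directly follow a euro sign.
const euroAmount = /(?<=€)\d+(\.\d*)?/g;

// Made-up input mixing euro and dollar prices.
const text = 'Laptop €199, cable $15, mouse €20.5';

// With the g flag, match() returns every full match and omits capture groups.
const prices = text.match(euroAmount);
console.log(prices); // → ["199", "20.5"]
```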
Note: Lookahead and lookbehind assertions are often referred to as “lookarounds”.
The negative version of lookbehind is denoted by `(?<!...)` and enables you to match a pattern that is not preceded by the pattern specified within the lookbehind. For example, the regular expression `/(?<!\d{3}) meters/` matches the word “meters” if three digits do not come before it:
const re = /(?<!\d{3}) meters/;
console.log(re.exec('10 meters'));
// → [" meters", index: 2, input: "10 meters", groups: undefined]
console.log(re.exec('100 meters'));
// → null
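A negative lookbehind is handy for skipping values that carry a particular prefix. The following sketch, which uses a made-up order string, collects the numbers that are not dollar amounts. The `\d` inside the character class is there so that a match cannot begin in the middle of a number such as 30:

```javascript
// A run of digits that is preceded by neither "$" nor another digit.
const notDollar = /(?<![$\d])\d+/g;

// Made-up input for illustration.
const order = 'cost $30, qty 12, discount $45, 7 items';

const quantities = order.match(notDollar);
console.log(quantities); // → ["12", "7"]
```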
As with lookaheads, you can use several lookbehinds (negative or positive) in succession to create a more complex pattern. Here’s an example:
const re = /(?<=\d{2})(?<!35) meters/;
console.log(re.exec('35 meters'));
// → null
console.log(re.exec('meters'));
// → null
console.log(re.exec('4 meters'));
// → null
console.log(re.exec('14 meters'));
// → [" meters", index: 2, input: "14 meters", groups: undefined]
This regex matches a string containing meters only if it is immediately preceded by any two digits other than 35. The positive lookbehind ensures that the pattern is preceded by two digits, and then the negative lookbehind ensures that the digits are not 35.
Named Capture Groups
You can group a part of a regular expression by encapsulating the characters in parentheses. This allows you to restrict alternation to a part of the pattern or apply a quantifier to the whole group. Furthermore, you can extract the value matched by the parentheses for further processing.
The following code gives an example of how to find a file name with a .jpg extension in a string and then extract the file name:
const re = /(\w+)\.jpg/;
const str = 'File name: cat.jpg';
const match = re.exec(str);
const fileName = match[1];
// The second element in the resulting array holds the portion of the string that parentheses matched
console.log(match);
// → ["cat.jpg", "cat", index: 11, input: "File name: cat.jpg", groups: undefined]
console.log(fileName);
// → cat
In more complex patterns, referencing a group using a number just makes the already cryptic regular expression syntax more confusing. For example, suppose you want to match a date. Since the position of day and month is swapped in some regions, it’s not clear which group refers to the month and which group refers to the day:
const re = /(\d{4})-(\d{2})-(\d{2})/;
const match = re.exec('2020-03-04');
console.log(match[0]); // → 2020-03-04
console.log(match[1]); // → 2020
console.log(match[2]); // → 03
console.log(match[3]); // → 04
ES2018’s solution to this problem is named capture groups, which use a more expressive syntax in the form of `(?<name>...)`:
const re = /(?<year>\d{4})-(?<month>\d{2})-(?<day>\d{2})/;
const match = re.exec('2020-03-04');
console.log(match.groups); // → {year: "2020", month: "03", day: "04"}
console.log(match.groups.year); // → 2020
console.log(match.groups.month); // → 03
console.log(match.groups.day); // → 04
Because the resulting object may contain a property with the same name as a named group, all named groups are defined under a separate object called `groups`.
A similar construct exists in many new and traditional programming languages. Python, for example, uses the `(?P<name>...)` syntax for named groups. Not surprisingly, Perl supports named groups with syntax identical to JavaScript (JavaScript has imitated its regular expression syntax from Perl). Java also uses the same syntax as Perl.
In addition to being able to access a named group through the `groups` object, you can access a group using a numbered reference, similar to a regular capture group:
const re = /(?<year>\d{4})-(?<month>\d{2})-(?<day>\d{2})/;
const match = re.exec('2020-03-04');
console.log(match[0]); // → 2020-03-04
console.log(match[1]); // → 2020
console.log(match[2]); // → 03
console.log(match[3]); // → 04
The new syntax also works well with destructuring assignment:
const re = /(?<year>\d{4})-(?<month>\d{2})-(?<day>\d{2})/;
const [match, year, month, day] = re.exec('2020-03-04');
console.log(match); // → 2020-03-04
console.log(year); // → 2020
console.log(month); // → 03
console.log(day); // → 04
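Array destructuring still depends on the order of the groups. Destructuring the `groups` object instead picks values by name, so the code keeps working if the groups are rearranged in the pattern. A brief sketch (the variable name `isoDate` is chosen for this example, and the code assumes the input matches, since `exec()` returns `null` otherwise):

```javascript
const isoDate = /(?<year>\d{4})-(?<month>\d{2})-(?<day>\d{2})/;

// Nested destructuring: take `groups` from the match, then pull out each name.
const { groups: { year, month, day } } = isoDate.exec('2020-03-04');

console.log(year);  // → 2020
console.log(month); // → 03
console.log(day);   // → 04
```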
The `groups` property is always created on the result, even if no named group exists in the regular expression (in that case its value is `undefined`):
const re = /\d+/;
const match = re.exec('123');
console.log('groups' in match); // → true
If an optional named group does not participate in the match, the `groups` object will still have a property for that named group but the property will have a value of `undefined`:
const re = /\d+(?<ordinal>st|nd|rd|th)?/;
let match = re.exec('2nd');
console.log('ordinal' in match.groups); // → true
console.log(match.groups.ordinal); // → nd
match = re.exec('2');
console.log('ordinal' in match.groups); // → true
console.log(match.groups.ordinal); // → undefined
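Since a group that does not participate yields `undefined`, a destructuring default offers a compact fallback. In this sketch the helper name `parseOrdinal`, the extra `number` group, and the fallback string `'none'` are all assumptions made for illustration:

```javascript
// Hypothetical helper: splits a string such as "2nd" into number and suffix.
function parseOrdinal(str) {
  const match = /(?<number>\d+)(?<ordinal>st|nd|rd|th)?/.exec(str);
  if (match === null) return null; // no digits found at all

  // A destructuring default applies only when the property is undefined,
  // which is exactly what a non-participating named group produces.
  const { number, ordinal = 'none' } = match.groups;
  return { number, ordinal };
}

console.log(parseOrdinal('2nd')); // → {number: "2", ordinal: "nd"}
console.log(parseOrdinal('2'));   // → {number: "2", ordinal: "none"}
console.log(parseOrdinal('abc')); // → null
```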
You can refer to a regular captured group later in the pattern with a backreference in the form of `\1`. For example, the following code uses a capture group that matches two letters in a row, then recalls it later in the pattern:
console.log(/(\w\w)\1/.test('abab')); // → true
// if the last two letters are not the same
// as the first two, the match will fail
console.log(/(\w\w)\1/.test('abcd')); // → false
To recall a named capture group later in the pattern, you can use the `\k<name>` syntax. Here is an example:
const re = /\b(?<dup>\w+)\s+\k<dup>\b/;
const match = re.exec("I'm not lazy, I'm on on energy saving mode");
console.log(match.index); // → 18
console.log(match[0]); // → on on
This regular expression finds consecutive duplicate words in a sentence. If you prefer, you can also recall a named capture group using a numbered back reference:
const re = /\b(?<dup>\w+)\s+\1\b/;
const match = re.exec("I'm not lazy, I'm on on energy saving mode");
console.log(match.index); // → 18
console.log(match[0]); // → on on
It’s also possible to use a numbered back reference and a named backreference at the same time:
const re = /(?<digit>\d):\1:\k<digit>/;
const match = re.exec('5:5:5');
console.log(match[0]); // → 5:5:5
Similar to numbered capture groups, named capture groups can be inserted into the replacement value of the `replace()` method. To do that, you will need to use the `$<name>` construct. For example:
const str = 'War & Peace';
console.log(str.replace(/(War) & (Peace)/, '$2 & $1'));
// → Peace & War
console.log(str.replace(/(?<War>War) & (?<Peace>Peace)/, '$<Peace> & $<War>'));
// → Peace & War
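The `$<name>` construct combines well with a named backreference. As a sketch, the duplicate-word pattern shown earlier can be given the `g` flag and used to collapse each repeated pair into a single word (the variable names are chosen for this example):

```javascript
// A word, whitespace, then the same word again, anywhere in the string.
const dupWord = /\b(?<dup>\w+)\s+\k<dup>\b/g;

const sentence = "I'm not lazy, I'm on on energy saving mode";

// Replace each "word word" pair with one copy of the captured word.
const cleaned = sentence.replace(dupWord, '$<dup>');
console.log(cleaned); // → I'm not lazy, I'm on energy saving mode
```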
If you want to use a function to perform the replacement, you can reference the named groups the same way you would reference numbered groups. The value of the first capture group will be available as the second argument to the function, and the value of the second capture group will be available as the third argument:
const str = 'War & Peace';
const result = str.replace(/(?<War>War) & (?<Peace>Peace)/, function(match, group1, group2, offset, string) {
  return group2 + ' & ' + group1;
});
console.log(result); // → Peace & War
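When a pattern contains named groups, the replacement function also receives the `groups` object as its final argument, after the offset and the input string. The sketch below relies on that to reorder a date by name rather than by position; the function name `toDayFirst` and the sample sentence are made up for this example:

```javascript
const isoDate = /(?<year>\d{4})-(?<month>\d{2})-(?<day>\d{2})/;

// Hypothetical helper: rewrites the first YYYY-MM-DD date as DD/MM/YYYY.
function toDayFirst(str) {
  return str.replace(isoDate, (...args) => {
    // With named groups present, the last argument is the groups object.
    const { year, month, day } = args[args.length - 1];
    return `${day}/${month}/${year}`;
  });
}

console.log(toDayFirst('Released on 2020-03-04')); // → Released on 04/03/2020
```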
`s` (`dotAll`) Flag
By default, the dot (`.`) metacharacter in a regex pattern matches any character with the exception of line break characters, including line feed (`\n`) and carriage return (`\r`):
console.log(/./.test('\n')); // → false
console.log(/./.test('\r')); // → false
Despite this shortcoming, JavaScript developers could still match all characters by using two opposite shorthand character classes like `[\w\W]`, which instructs the regex engine to match a character that’s a word character (`\w`) or a non-word character (`\W`):
console.log(/[\w\W]/.test('\n')); // → true
console.log(/[\w\W]/.test('\r')); // → true
ES2018 aims to fix this problem by introducing the `s` (`dotAll`) flag. When this flag is set, it changes the behavior of the dot (`.`) metacharacter to match line break characters as well:
console.log(/./s.test('\n')); // → true
console.log(/./s.test('\r')); // → true
The `s` flag can be used on a per-regex basis and thus does not break existing patterns that rely on the old behavior of the dot metacharacter. Besides JavaScript, the `s` flag is available in a number of other languages such as Perl and PHP.
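A typical use of the flag is matching a block that spans several lines. The sketch below uses a made-up HTML comment as input and also reads the `dotAll` and `flags` properties, which report whether the flag is set on a given regex:

```javascript
// Made-up input: an HTML comment that contains a line break.
const html = 'before <!-- first line\nsecond line --> after';

const comment = /<!--.*?-->/;        // the dot stops at the line break
const commentDotAll = /<!--.*?-->/s; // the dot also matches \n

console.log(comment.test(html));       // → false
console.log(commentDotAll.test(html)); // → true
console.log(commentDotAll.exec(html)[0] === '<!-- first line\nsecond line -->'); // → true

console.log(commentDotAll.dotAll); // → true
console.log(commentDotAll.flags);  // → s
```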
Unicode Property Escapes
Among the new features introduced in ES2015 was Unicode awareness. However, shorthand character classes were still unable to match Unicode characters, even if the `u` flag was set.
Consider the following example:
const str = '𝟠';
console.log(/\d/.test(str)); // → false
console.log(/\d/u.test(str)); // → false
𝟠 is considered a digit, but `\d` can only match ASCII [0-9], so the `test()` method returns `false`. Because changing the behavior of shorthand character classes would break existing regular expression patterns, it was decided to introduce a new type of escape sequence.
In ES2018, Unicode property escapes, denoted by `\p{...}`, are available in regular expressions when the `u` flag is set. Now to match any Unicode number, you can simply use `\p{Number}`, as shown below:
const str = '𝟠';
console.log(/\p{Number}/u.test(str)); // → true
And to match any Unicode alphabetic character, you can use `\p{Alphabetic}`:
const str = '漢';
console.log(/\p{Alphabetic}/u.test(str)); // → true
// the \w shorthand cannot match 漢
console.log(/\w/u.test(str)); // → false
`\P{...}` is the negated version of `\p{...}` and matches any character that `\p{...}` does not:
console.log(/\P{Number}/u.test('𝟠')); // → false
console.log(/\P{Number}/u.test('漢')); // → true
console.log(/\P{Alphabetic}/u.test('𝟠')); // → true
console.log(/\P{Alphabetic}/u.test('漢')); // → false
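Property escapes are not limited to binary properties such as `Number` and `Alphabetic`. They also accept general category values like `\p{Lu}` (uppercase letter) and `Script=` expressions. A short sketch with made-up sample strings:

```javascript
// One or more characters that belong to the Greek script.
const greekRun = /\p{Script=Greek}+/u;
console.log(greekRun.exec('abc αβγ def')[0]); // → αβγ

// Every uppercase letter, in any script.
const upperCase = /\p{Lu}/gu;
const capitals = 'Ω, Я, and z'.match(upperCase);
console.log(capitals); // → ["Ω", "Я"]
```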
A full list of supported properties is available on the current specification proposal.
Note that using an unsupported property causes a `SyntaxError`:
console.log(/\p{undefined}/u.test('漢')); // → SyntaxError
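A regex literal with unsupported syntax fails when the script is parsed, so the error cannot be caught around a literal in the same script. Building the pattern with the `RegExp` constructor defers the error to run time, which allows a simple feature test before relying on the new syntax. The helper name `supportsRegex` is an assumption made for this sketch:

```javascript
// Hypothetical helper: reports whether the engine accepts a pattern/flags pair.
function supportsRegex(source, flags) {
  try {
    new RegExp(source, flags); // throws SyntaxError for unsupported syntax
    return true;
  } catch (err) {
    if (err instanceof SyntaxError) return false;
    throw err; // anything else is unexpected, so let it surface
  }
}

// Backslashes are doubled because the patterns are written as strings.
console.log(supportsRegex('\\p{undefined}', 'u')); // → false
console.log(supportsRegex('\\p{Number}', 'u'));    // true only where property escapes exist
console.log(supportsRegex('(?<=a)b', ''));         // true only where lookbehind exists
```

The result of the last two calls depends on the engine, which is why no fixed output is shown for them.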
Compatibility Table
Desktop Browsers
Feature | Chrome | Firefox | Safari | Edge
---|---|---|---|---
Lookbehind Assertions | 62 | X | X | X
Named Capture Groups | 64 | X | 11.1 | X
`s` (`dotAll`) Flag | 62 | X | 11.1 | X
Unicode Property Escapes | 64 | X | 11.1 | X
Mobile Browsers
Feature | Chrome for Android | Firefox for Android | iOS Safari | Edge Mobile | Samsung Internet | Android Webview
---|---|---|---|---|---|---
Lookbehind Assertions | 62 | X | X | X | 8.2 | 62
Named Capture Groups | 64 | X | 11.3 | X | X | 64
`s` (`dotAll`) Flag | 62 | X | 11.3 | X | 8.2 | 62
Unicode Property Escapes | 64 | X | 11.3 | X | X | 64
Node.js
- 8.3.0 (requires the `--harmony` runtime flag)
- 8.10.0 (support for the `s` (`dotAll`) flag and lookbehind assertions)
- 10.0.0 (full support)
Wrapping Up
ES2018 continues the work of previous editions of ECMAScript by making regular expressions more useful. New features include lookbehind assertions, named capture groups, the `s` (`dotAll`) flag, and Unicode property escapes. A lookbehind assertion allows you to match a pattern only if it is preceded by another pattern. Named capture groups use a more expressive syntax compared to regular capture groups. The `s` (`dotAll`) flag changes the behavior of the dot (`.`) metacharacter to match line break characters. Finally, Unicode property escapes provide a new type of escape sequence in regular expressions.
When building complicated patterns, it’s often helpful to use a regular-expressions tester. A good tester provides an interface to test a regular expression against a string and displays every step taken by the engine, which can be especially useful when trying to understand patterns written by others. It can also detect syntax errors that may occur within your regex pattern. Regex101 and RegexBuddy are two popular regex testers worth checking out.
Do you have some other tools to recommend? Share them in the comments!