Match the Same String Twice with Backreferences in Regular Expressions

Joe Maddalone
Published 8 years ago
Updated 5 years ago

Regular expression backreferences give us a way to match previously captured text a second time.

[00:01] When we want to find the same match twice using regular expressions, we can use something called backreferences. I'm going to create a string here that says, "It was the the thing." I've got "the" in there twice, which is a very common typo. I'm going to create a regular expression, and let's start off by capturing the T-H-E.

[00:22] OK, so now we've captured both T-H-Es. We're going to say that it's potentially followed by a whitespace character, and now, to identify the second instance of our first capture, we use a backreference. That's identified by a backslash and then the index of our capture. We've only got one capture group in our regular expression here, so this \1 represents whatever text we captured in that first group.
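The backreference described at this point can be sketched roughly as follows. The transcript does not show the exact code from the video, so the variable names and the global flag here are assumptions; only the sample sentence and the \1 backreference come from the lesson.

```javascript
// Sample sentence from the lesson, with "the" typed twice.
const str = "It was the the thing.";

// (the) is capture group 1, \s? allows one optional whitespace character,
// and \1 must match the exact text that group 1 captured.
// The g flag is an assumption so that match() returns every match.
const regex = /(the)\s?\1/g;

const matches = str.match(regex);
console.log(matches); // [ 'the the' ]
```

The single match covers both words, which is why the next step narrows it down.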

[00:48] So if I save that, you'll see that we've got both of them there; it's working just fine. We've matched both "the"s, but what we could do to match only the first one is use a lookahead. So now we're saying we're looking for "the", followed by the potential whitespace, followed by the exact same thing we captured in our first capture. I'll save that, and now we've just got the first "the" with its space. We could use that to clean this string up. So let's say string.replace, I'll pass in regex, and we'll replace our match with an empty string.
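A minimal sketch of the lookahead version, assuming the same sample sentence. The names str, regex, and cleaned are illustrative and not necessarily those used in the video.

```javascript
const str = "It was the the thing.";

// (?=\1) is a lookahead: it requires the text captured by group 1 to
// come next, but does not include it in the match. Only the first
// "the" and its trailing space are matched.
const regex = /(the)\s?(?=\1)/g;

console.log(str.match(regex)); // [ 'the ' ]

// Replacing the match with an empty string removes the duplicate.
const cleaned = str.replace(regex, "");
console.log(cleaned); // It was the thing.
```

Because the lookahead consumes nothing, the second "the" stays in the string after the replace.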

[01:21] If we load that up in the dev tools, we can see that we've removed the duplicated content. Now let's say that we've got a second "thing" here. Obviously we're not identifying that, because we're looking for the literal group T-H-E, so let's change that to any number of word characters. Save that, and we can see that we've identified the T-H-E with the space and the T-H-I-N-G with the space, and then in our replace call we've removed those duplicated words and come up with "It was the thing," without the duplicates.
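The generalized pattern might look like the sketch below. It assumes \w+ for "any number of word characters" and a required \s between the words; the transcript does not spell out the exact pattern. The stricter variant with \b word boundaries is not from the video. It is added here as an assumption to show one way to avoid matching the start of a longer word.

```javascript
const str = "It was the the thing thing.";

// Group 1 now captures any run of word characters, so \1 refers to
// whichever word was just matched.
const regex = /(\w+)\s(?=\1)/g;

console.log(str.match(regex));       // [ 'the ', 'thing ' ]
console.log(str.replace(regex, "")); // It was the thing.

// Not from the video: without word boundaries, "the theory" is also
// treated as a duplicate, because "theory" starts with "the".
console.log("the theory".replace(regex, "")); // theory

// A hypothetical stricter variant that only matches whole repeated words.
const strict = /\b(\w+)\s(?=\1\b)/g;
console.log("the theory".replace(strict, "")); // the theory
console.log(str.replace(strict, ""));          // It was the thing.
```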

[01:55] Now, a common use for this is pulling content out of HTML, or stripping HTML out of content. So let's say we're going to have a bold tag here that simply says "Bold", and an italics tag here that simply says "italics". Now, this isn't going to work so great in our pre output, but what we're going to do is replace it with our replace function. So let's go ahead and build up this regular expression to grab the innerHTML, or the inner content, of each of these tags.

[02:28] We know we're going to start with a tag that's going to have something in it. We know we're going to close with a tag, so we'll escape that forward slash. Now, inside of the opening tag we're going to capture any number of word characters. To match that same tag name at the end, we're simply going to use our \1. Then, to get the content inside of the tag, we're going to have another capture group here, and that's just going to be whatever's in there: any number of characters, or spaces, or whatever.

[02:56] So we'll save that, and what we'll do here is replace all of our matches with a reference to our capture group, which in this case is 2. You see this one right here, the beginning of the tag, is 1, and we reference that at the end of the tag, so this capture group here is 2. So let's just replace our matches with $2, referencing that second capture, and then we'll throw in a line break. Save that, and now in our console we can see that we've got "Bold" out of our bold tag, and we've got "italics" out of our italics tag.
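Put together, the tag-stripping step could be sketched like this. The input string, the variable names, and the use of (.*) for the inner content are assumptions reconstructed from the narration rather than code copied from the video.

```javascript
// Assumed input: one bold tag and one italics tag.
const str = "<b>Bold</b><i>italics</i>";

// Group 1 captures the tag name, group 2 captures the inner content,
// and \1 requires the closing tag to use the same name as the opening tag.
const regex = /<(\w+)>(.*)<\/\1>/g;

// In the replacement string, $2 inserts the text captured by group 2.
const result = str.replace(regex, "$2\n");

console.log(JSON.stringify(result)); // "Bold\nitalics\n"
```

Note that inside the pattern the backreference is written \1, while in the replacement string a captured group is referenced as $2.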

Shishir Arora
~ 7 years ago

what if var str = `<b>Bold<i>italics</i></b>`;
