SQL RegEx : Capturing groups 101
“Capturing groups” are parts of a RegEx, enclosed in parentheses, that group one or more characters of the pattern so that the text they match can be retrieved separately. Let me explain with a simple example.
Email address as Source Text : abc123@qsys400.com
Here is a simple RegEx to match this email address : [a-zA-Z]\w+@\w+\.\w+
This RegEx matches the given email address perfectly. If we look closely, we can break this email address into the following parts:
- User name “abc123” : a letter of the English alphabet followed by one or more word characters (letters, digits or underscore). => [a-zA-Z]\w+
- Address sign “@” : a literal character that separates the user name from the domain name. => @
- Domain name “qsys400” : one or more word characters. => \w+
- Dot “.” : a literal dot that separates the domain name from the top level domain name. The backslash is needed because an unescaped dot matches any character. => \.
- Top level domain “com” : one or more word characters. => \w+
So if we rewrite the above RegEx to group the characters as described above, it looks like this : ([a-zA-Z]\w+)(@)(\w+)(\.)(\w+)
This RegEx now has 5 groups:
- ([a-zA-Z]\w+)
- (@)
- (\w+)
- (\.)
- (\w+)
What is the use of these groups?
Practically every programming language or RegEx engine that supports capturing groups provides a way to get the value matched by each group in the RegEx.
For example, let's say I want to get the user name and the domain name (including the top level domain) from the email address.
RegEx : ([a-zA-Z]\w+)@(\w+\.\w+)
This RegEx has the following 2 groups:
- ([a-zA-Z]\w+) : for user name (like abc123)
- (\w+\.\w+) : for domain name (like qsys400.com)
“@” is not a part of any of the groups.
So when a programming language processes this RegEx, it lets you get (or capture) the values of GROUP 1 and GROUP 2: GROUP 1 holds the user name from the email and GROUP 2 holds the domain name.
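As a minimal sketch of the idea above, here is how the two groups could be read in Python. Python's `re` module is used only as a stand-in for whichever RegEx engine you work with, and the function name `split_email` is made up for this illustration.

```python
import re

# Pattern from the text: group 1 = user name, group 2 = domain name.
# The "@" sits outside both groups, so it is matched but not captured.
EMAIL_PATTERN = re.compile(r"([a-zA-Z]\w+)@(\w+\.\w+)")


def split_email(address):
    """Return (user_name, domain_name), or None if the address does not match."""
    match = EMAIL_PATTERN.fullmatch(address)
    if match is None:
        return None
    return match.group(1), match.group(2)


print(split_email("abc123@qsys400.com"))  # ('abc123', 'qsys400.com')
```

The pattern is the simple one from this post; it is not meant to validate real-world email addresses.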
Capturing Groups Numbering
Each capturing group in a RegEx gets a unique number starting from 1. The rule is simple: reading from left to right, number each opening parenthesis “(” in the order it appears.
(A)(B) ===> Group 1 (A) ===> Group 2 (B)
The same rule applies to nested capturing groups:
(A(B))(C) ==> Group 1 (A(B)) ==> Group 2 (B) ==> Group 3 (C)
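The numbering rule can be checked with a small sketch. It assumes Python's `re` module as the engine, and it uses the literal letters A, B and C in place of real sub-patterns.

```python
import re

# Literal letters stand in for the sub-patterns A, B and C from the text.
match = re.fullmatch(r"(A(B))(C)", "ABC")

# match.re.groups is the number of capturing groups in the pattern (3 here).
group_values = [match.group(n) for n in range(1, match.re.groups + 1)]

print(group_values)  # ['AB', 'B', 'C']
```

Group 1 opens first, so it holds the whole nested text “AB”; group 2 opens second and holds only “B”.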
Example:
RegEx (\w{3}) : It creates a group of exactly 3 word characters
Source Text : “abcdef ghi”
Without capturing groups, the RegEx \w{3} finds 3 matches in this text:
- abc
- def
- ghi
With “Capturing groups”, RegEx (\w{3}) finds the same 3 matches, and for each match Group 1 holds the captured value
Match# | Full Match | Group# | Group value |
1 | abc | 1 | abc |
2 | def | 1 | def |
3 | ghi | 1 | ghi |
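The table above can be reproduced with a short sketch, again assuming Python's `re` module as the engine. The variable names are only for this illustration.

```python
import re

source_text = "abcdef ghi"

# One row per match: (match number, full match, group number, group value).
rows = [
    (number, m.group(0), 1, m.group(1))
    for number, m in enumerate(re.finditer(r"(\w{3})", source_text), start=1)
]

for row in rows:
    print(row)
# (1, 'abc', 1, 'abc')
# (2, 'def', 1, 'def')
# (3, 'ghi', 1, 'ghi')
```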
Group ZERO
- Group numbering starts with number 1.
- In some RegEx engines, there is also a GROUP ZERO, which contains the value of the complete RegEx match.
Example
RegEx (\d{3})\w+
Source Text : “123abc%def”
- The complete RegEx matches “123abc” in the source text (the match stops at “%”, which is not a word character), i.e. GROUP ZERO = “123abc”
- Within this match, Group 1 contains the value “123”
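Python's `re` module is one engine that exposes GROUP ZERO, so it can serve as a quick sketch of this example (the choice of Python is an assumption for illustration).

```python
import re

match = re.search(r"(\d{3})\w+", "123abc%def")

whole_match = match.group(0)  # GROUP ZERO: the complete match
first_group = match.group(1)  # Group 1: only the three digits

print(whole_match)  # 123abc
print(first_group)  # 123
```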
Numbering with Quantifiers
If a quantifier is added to a group, the number of groups in the RegEx does not change. In other words, “quantifiers do not impact the number of groups”.
(A){3} ==> It says that there must be exactly 3 consecutive occurrences of capturing group (A)
But the number of groups in this RegEx is still one. So, you will get a value for GROUP 1 only (there is no GROUP 2 or GROUP 3). With every new repetition inside a match, the previous value of the group is overwritten by the new value, so the group ends up holding the last repetition.
Example
RegEx (\w{3})+ : The last character “+” adds a quantifier (ONE or MORE occurrences) to the group, which captures 3 characters
Source Text : “abcdef ghi”
- Because of the quantifier, the first match consumes “abcdef”. It stops there because the space before “ghi” is not a word character.
- Text “abcdef” has 2 sets of 3 characters
  - “abc”
  - “def”
- So technically the first match contains 2 repetitions of the group.
- But there is only 1 group defined in this RegEx i.e. (\w{3})
- Here is how it works
  - While consuming “abcdef”, the RegEx engine sets the group value once per repetition (it has only 1 group)
    - First repetition at “abc” ===> Group 1 = “abc”
    - Second repetition at “def” ===> Group 1 = “def”
  - The second repetition overrides the value the first one stored in GROUP 1
Match# | Full Match | Group# | Group value |
1 | abcdef | 1 | def |
2 | ghi | 1 | ghi |
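The overwriting behaviour is easy to observe in a sketch. This assumes Python's `re` module; other engines may expose the values differently, but the group count stays at one.

```python
import re

pattern = re.compile(r"(\w{3})+")

# One tuple per match: (full match, final value of group 1).
results = [(m.group(0), m.group(1)) for m in pattern.finditer("abcdef ghi")]

print(pattern.groups)  # 1, the quantifier does not add groups
print(results)         # [('abcdef', 'def'), ('ghi', 'ghi')]
```

In the first match, “abc” was stored in group 1 and then replaced by “def”, so only the last repetition is available afterwards.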
Capturing Groups Names
- Some RegEx engines allow you to name a capturing group. [Still working on this part]
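Until this part is written up, here is a rough sketch of what a named group looks like in one engine, Python's `re` module. The names `user` and `domain` are chosen for this illustration, and the `(?P<name>...)` spelling is Python's; many other engines write it as `(?<name>...)`, so check the documentation of the engine you use.

```python
import re

# Same email pattern as earlier in the post, with a name on each group.
match = re.fullmatch(
    r"(?P<user>[a-zA-Z]\w+)@(?P<domain>\w+\.\w+)",
    "abc123@qsys400.com",
)

print(match.group("user"))    # abc123
print(match.group("domain"))  # qsys400.com
print(match.group(1))         # abc123, named groups still get a number
```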