SQL RegEx : Basic
Regular expression (aka RegEx) is a way to define the pattern of any text. This pattern can represent the complete text or a part of the text. RegEx is a very powerful and handy to use option for to work with strings specially for search and replace functions.
Almost all modern programming languages support RegEx in similar manner and with latest releases IBM introduced RegEx in IBM I SQL.
Here is an example of how can RegEx be useful-
Pattern for a simple email (like abc123@gmail.com) address is something like this-
- 1st character must be an English alphabet.
- After 1st character, there can be any number or an alphanumeric characters (in some cases, few special characters are also allowed).
- Then there must be one “@” character
- @ is followed by a domain name which also has a pattern in itself-
- any number of alphanumeric characters (mostly company name like gmail)
- followed by a ‘.’ (dot)
- followed by at least two English alphabets (like com/net etc) which is called Top–level domain
Now, a RegEx rule, for this pattern, can be defined as below-
[a-zA-Z][a-zA-Z0-9]*@[a-zA-Z0-9]+\.[a-zA-Z]{2,}
At first, above string looks a string of some random characters with no meaning (like a foreign language). But after understating “Regular expression” (explained below), this will make perfect sense.
To start with, let us first look into different entities involved in this RegEx.
- Character Classes : represented using meta characters within Square brackets [ ]
- one character class matches or validates only 1 character
- [abc] : This character class will match any of the “a” or “b” or “c”.
- [146] : This will match ant of the 1 or 4 or 6.
- Use hyphen (-) to define the range (Called Range separator)
- [a-f] : will validate any one character from the range a to f (lower case) of English alphabet. (i.e. a, b, c, d, e, or f).
- [a-z] : similarly, it will validate any lower case character from a to z.
- [a-zA-Z] : validates any character from ( lower case a to z ) or (upper case A to upper Z).
- [a-zA-Z0-9] : validates any character from ( lower case a to z ), (upper case A to Z), or (digits 0 to 9).
- We will discuses few more cases for “Character Classes” later.
- one character class matches or validates only 1 character
- Meta characters
- Meta charters have special meaning in RegEx.
- For example, a hyphen (-), at first point, is a meta character to define the range and Square brackets to define a character class.
- We used other meta character * (asterisk) in our example.
- * is a quantifier to define the quantity ZERO or MORE preceding part(s) of RegEx.
- It means that preceding part of RegEx can be repeated ZERO or MORE times.
- Anther meta character in our example is + (plus)
- Same as *, it is a quantifier and defines the quantity ONE or MORE.
- Using a ‘+’ means that preceding part of RegEx must be repeated at least ONCE can be repeated for any number of times.
- Meta character escaping
- In above example, we used one meta character “.” (DOT ). A “.” (DOT), as meta character, can match any character; like an alphabet, adigit, or any special character.
- But for email, we need exactly a “.” DOT (not any other character).
- To do this we need to tell RegEx not to use DOT as meta character. So we need to escape the DOT with a backward slash “\”.
- If any meta character is escaped with a backward slash “\”, it will lose all its power to be meta character and RegEx will consider that meta as a simple literal.
- String literals
- “@” sign, in the example, is a string literal.
- There is no special meaning to this and it must match as it is on the same position and same number of times.
- {n,}
- Curly brackets are also meta character of type quantifiers.
- “n,” (character “n” followed by “,” ) defines that preceding RegEx must match atleast “n” or MORE times (n is a number).
Now let us break our RegEx into different parts-
[a-zA-Z][a-zA-Z0-9]*@[a-zA-Z0-9]+\.[a-zA-Z]{2,}
- [a-zA-Z]
- It has Square brackets [ ] and Hyphen for range; so any one character from a to z or A to Z is valid
- So this complete part will match first part of the email and makes sure that first character must be an English alphabet.
- [a-zA-Z0-9]*
- It has Square brackets [ ] and Hyphen for range
- So as per this : one character from (English alphabet) or (digit is valid)
- last character of this part (“*”) is a meta character which means any character from (English alphabet) or (digit) can be repeacted ZERO or MORE time(s) (which is defined using preceding RegEx [a-zA-Z0-9]).
- This defines the 2nd part of email which says that after 1st character, there can be any number of alphanumeric characters or digits.
- It has Square brackets [ ] and Hyphen for range
- @
- This is a string literal and must match as it is.
- One @ is required.
- [a-zA-Z0-9]+
- Square bracket part [a-zA-Z0-9] is same as defined earlier (As per Square brackets part: a character from (English alphabet) or (digit) is valid).
- Last character + is another meta character
- So there must be at least one character from (English alphabet) or (digit) defined using preceding RegEx [a-zA-Z0-9].
- This define the 4th part (company name from domain ) of email
- \.
- It is an escaped meta character “.” (DOT) and must be a string literal “.” (DOT) and must match as it is.
- It means that one “.” (DOT) is required after company name.
- [a-zA-Z]{2,}
- Square bracket part [a-zA-Z] is same as defined earlier as per Square brackets part i.e. one character from English alphabet (a to z or A to Z).
- {2,} – This part defines that there must be at least 2 character from English alphabets (defined using preceding RegEx [a-zA-Z])
- this define the final part (Top level domain) of email.
You can use website http://www.regexr.com/ to verify our RegEx. At Footer part of the webpage there is a link “Explain” which will display how each part of RegEx is used to match the given text.