Searching documents for specific char­ac­ters or strings has always been one of the most common re­pet­it­ive tasks in in­form­a­tion tech­no­logy. You often want to replace or modify the text fragments or lines of code you’re searching for. This task becomes in­creas­ingly complex the more often the string appears in the document. In the 1950s, a solution was found in the formal languages of the­or­et­ic­al computer science: Regular ex­pres­sions (regex) can dra­mat­ic­ally simplify such re­pet­it­ive tasks and are widely used in software de­vel­op­ment to this day.

What is a regular ex­pres­sion?

A regular expression (regex) is a string that describes a regular language, which is a type of formal language. Regular expressions are a central tool in theoretical computer science, where they serve as a basis for developing and running computer programs as well as constructing the necessary compilers. For this reason, regular expressions, which are based on well-defined syntax rules, are used primarily in software development.

For every regular expression, there is a finite automaton (also known as a state machine) that accepts the language specified by the expression. It can be built from the regular expression using Thompson’s construction algorithm. Conversely, for every finite automaton there is also a regular expression that describes the language accepted by the automaton. This expression can be generated by Kleene’s algorithm or Gaussian elimination.

Note

A state machine is a behaviour model con­sist­ing of states, state trans­itions, and actions. It is referred to as finite if the number of states that it can accept is finite (i.e. limited).

A well-known IT ap­plic­a­tion for regex is the search-and-replace function in text editors, which computer pioneer Ken Thompson, one of the de­velopers of the UNIX operating system, first im­ple­men­ted in the line-ori­ent­ated editor QED in the 1960s and later in its des­cend­ant ed. This function allows you to find specific strings in text and, if desired, replace them with any other string.

Defin­i­tion: Regular Ex­pres­sion (regex)

A regular expression is a string based on syntax rules that allow you to describe character strings. Regular expressions describe regular languages, a subgroup of formal languages that is especially important in information technology, particularly software development.

How does a regular ex­pres­sion work?

A regular expression can be formed by using regular characters (such as abc) only or by using a combination of regular characters and metacharacters (such as ab*c). The task of metacharacters is to describe certain character constructions or arrangements, such as whether a character should be at the beginning of the line or whether a character can or should occur exactly once or more or less frequently. The first regular expression example mentioned above works as follows:

abc: The simple regex pattern abc requires an exact match. In other words, the expression searches for all strings containing the characters “abc” in that exact order. This means the expression will match the question “Do you know your abcs?” as well as the sentence “The abcoulomb is an electromagnetic unit of charge.”

The second regular ex­pres­sion example works like this:

ab*c: By contrast, a regular expression with special characters works slightly differently because it describes a whole family of strings rather than a single one. In this example, the asterisk means that the expression searches for strings that begin with the letter “a” and end with the letter “c,” with any number of “b”s (including none) in between. As a result, “ac,” “abc” and “abbbbc” are matches, and the string “cbbabbcba” also contains a match (“abbc”).

Each regex can also be linked to a specific action such as the “replace” operation mentioned above. This action is performed wherever the regular ex­pres­sion is true, meaning wherever there is a match as described in the examples above.
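
The two patterns and the “replace” action can be tried out in a few lines. The following is only a minimal sketch, assuming Python’s built-in re module (the article does not prescribe a language), and the sample strings other than those quoted above are made up for illustration:

```python
import re

# Literal pattern: matches wherever the characters "abc" occur in this order.
found_literal = re.search(r"abc", "Do you know your abcs?") is not None

# "ab*c": an "a", then zero or more "b", then a "c".
# re.search scans the text and returns the first matching substring.
match_inside = re.search(r"ab*c", "cbbabbcba").group(0)
no_match = re.search(r"ab*c", "abd")

# Linking the pattern to a "replace" action: every match becomes "X".
replaced = re.sub(r"ab*c", "X", "abc ac abbbbc abd")

print(found_literal)  # True
print(match_inside)   # abbc
print(no_match)       # None
print(replaced)       # X X X abd
```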

What are the chal­lenges of using regex?

Regex in­struc­tions give you a lot of freedom because you always have several different options for solving any problem with a regular ex­pres­sion. However, the ability to achieve a desired result in various ways isn’t always an advantage.

For example, you can keep the in­struc­tions very general so that you always obtain the desired result in every case. But if you want to obtain the most accurate result possible, you have to form a specific regex pattern. There’s also a general rule for the length: The more compact a regular ex­pres­sion is, the less time it will take to process. Don’t lose sight of read­ab­il­ity, however. If you want to change your regular ex­pres­sions later, it’ll be a major obstacle if the original in­struc­tions are too com­plic­ated and, moreover, un­com­men­ted.

As a rule, when you create a regular ex­pres­sion, it’s important to find the optimal balance between com­pact­ness and spe­cificity.

Which syntax rules apply to regex?

Regex can be used in a variety of languages, such as Perl, Python, Ruby, JavaScript, XML, or HTML, but their usefulness or function can differ considerably. For example, in JavaScript, regex patterns are used in the search(), match(), or replace() string methods, whereas expressions in XML documents are used to delimit element content. However, in terms of syntax, there are hardly any differences between programming languages and markup languages when it comes to regex.

A regular ex­pres­sion can consist of up to three parts, re­gard­less of the language in which it is used:

Patterns: The central element is the pattern, i.e. the general search pattern. As explained in the previous section, the pattern can be composed solely of simple characters or of a combination of simple characters and special characters.

Delimiters: Delimiters mark the beginning and end of the pattern. Basically, all non-alphanumeric characters (except backslashes) can be used as delimiters. For example, PHP supports hashtags (#pattern#), percent signs (%pattern%), plus signs (+pattern+), or tildes (~pattern~) as delimiters. However, most languages use straight quotes ("pattern") or slashes (/pattern/).

Modifiers: Modifiers can be appended to a search pattern to modify the regular expression. For example, the modifier i switches off case sensitivity, so that upper- and lowercase letters are treated the same. Without this modifier, case sensitivity applies to all regular expressions by default.
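
How the three parts look in practice depends on the language. As a rough sketch, assuming Python’s re module: Python uses no delimiters at all (the pattern is an ordinary raw string), and the role of the i modifier is played by the re.IGNORECASE flag. The sample text is invented for illustration:

```python
import re

# Python has no pattern delimiters; the pattern is a plain (raw) string.
pattern = r"regex"

# Without a modifier, the search is case-sensitive.
case_sensitive = re.search(pattern, "REGEX tutorial")

# re.IGNORECASE corresponds to the "i" modifier described above.
case_insensitive = re.search(pattern, "REGEX tutorial", flags=re.IGNORECASE)

print(case_sensitive)             # None
print(case_insensitive.group(0))  # REGEX
```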

The following are typical syntax symbols used for adding specific options to patterns:

Special character   Function
[]                  A pair of square brackets denotes a character class that always represents a single character in a search pattern.
()                  A pair of round brackets denotes a group that consists of one or more characters; groups can be nested within one another.
-                   Specifies a range inside a character class (for example, [a-z]) if it is between two regular characters.
^                   Limits the search to the beginning of a line (also functions as a negator in character classes).
$                   Limits the search to the end of a line.
.                   Represents any character.
*                   The character, class, or group in front of an asterisk can occur any number of times (zero included).
+                   The character, class, or group in front of a plus sign must occur at least once.
?                   The character, class, or group in front of a question mark is optional and may occur at most once.
{n}                 The preceding character, class, or group occurs exactly n times.
{n,m}               The preceding character, class, or group occurs at least n times, but not more than m times.
{n,}                The preceding character, class, or group occurs n or more times.
\b                  Matches at the edge of a word (word boundary).
\B                  Matches at any position that is not the edge of a word.
\d                  Any decimal digit; short notation for character class [0-9]
\D                  Any character that is not a decimal digit; short notation for character class [^0-9]
\w                  Any alphanumeric character or the underscore; short notation for character class [a-zA-Z_0-9]
\W                  Any character that is not matched by \w; short notation for character class [^\w]
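
A few of these symbols can be combined in a short experiment. The snippet below is only an illustrative sketch in Python’s re module, and the sample sentence is made up:

```python
import re

text = "Order 66 shipped on 2019-10-11"

digits = re.findall(r"\d+", text)               # runs of decimal digits
words = re.findall(r"\w+", text)                # runs of word characters
year = re.search(r"\b\d{4}\b", text).group(0)   # exactly four digits between word boundaries
starts_with_order = re.search(r"^Order", text) is not None  # anchored at the beginning
ends_with_11 = re.search(r"11$", text) is not None          # anchored at the end

print(digits)  # ['66', '2019', '10', '11']
print(words)   # ['Order', '66', 'shipped', 'on', '2019', '10', '11']
print(year)    # 2019
```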

Tutorial: A regular ex­pres­sion example or two to explain the pos­sib­il­it­ies

The previous sections of this article explained the fundamentals of regex. The following tutorial illustrates various possibilities and syntax tricks, using specific regular expression examples for both simple and complex searches.

Single-element regex

The simplest form of regex is a search pattern that only matches a single element. As long as you’re not searching for a specific element, you can easily define a single-element regular ex­pres­sion using a character class. The following ex­pres­sion allows the digits “1,” “2,” “3,” “4,” “5,” “6” or “7” as possible matches:

[1234567]

Since the numbers are con­sec­ut­ive in this case, you could also use the following sim­pli­fied notation:

[1-7]

If you want to change the regular ex­pres­sion to exclude the digit “4” from the search, you can also use the simpler version with the minus sign:

[1-35-7]

Note

The individual characters of a regex pattern are not separated by spaces.
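
To see which digits each of the three character classes accepts, you can test them against all ten digits. This is a small sketch assuming Python’s re module; the helper name matching_digits is invented for this example:

```python
import re

def matching_digits(pattern: str) -> str:
    """Return the digits 0-9 that the given single-character pattern accepts."""
    return "".join(d for d in "0123456789" if re.fullmatch(pattern, d))

print(matching_digits(r"[1234567]"))  # 1234567
print(matching_digits(r"[1-7]"))      # 1234567
print(matching_digits(r"[1-35-7]"))   # 123567
```

The last class is read as two ranges, 1-3 and 5-7, written directly next to each other.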

Multi-element regex

With a multi-element regular ex­pres­sion, you can also use character classes to allow for a selection of different matches. For example, if you want the ex­pres­sion to capture two elements for which different matches are possible, simply string together two character classes:

[1-7][a-c]

The first element, a digit between “1” and “7,” is followed by one of the letters “a,” “b,” or “c.” Because regular expressions are case-sensitive by default, lower case is mandatory here. Even before you start using modifiers, you can include capital letters by making the following minor change to the expression:

[1-7][a-cA-C]
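
The difference between the two multi-element patterns can be checked as follows. The snippet is a sketch in Python’s re module, and the candidate strings are made up:

```python
import re

candidates = ["1a", "7c", "3B", "8a", "2d"]

# fullmatch requires the whole candidate to fit the pattern.
lower_only = [c for c in candidates if re.fullmatch(r"[1-7][a-c]", c)]
both_cases = [c for c in candidates if re.fullmatch(r"[1-7][a-cA-C]", c)]

print(lower_only)  # ['1a', '7c']
print(both_cases)  # ['1a', '7c', '3B']
```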

Regex with optional elements

Re­gard­less of whether you search for multiple elements within a single regular ex­pres­sion or search with the help of multiple sets of char­ac­ters, it’s possible that certain elements may or may not be included under certain cir­cum­stances. This could happen with a regular ex­pres­sion example that’s supposed to filter out all street numbers. In some cases, the street number may consist of a single digit, whereas in other matches, the number may consist of two or even three digits. Ad­di­tion­ally, there may be addresses where a letter is added to the street number. You can capture the total set of possible com­bin­a­tions using the following regex in­struc­tions:

[1-9][0-9]?[0-9]?[a-z]?

The only mandatory element in this search pattern is a number between “1” and “9.” Two digits between “0” and “9” and any letter may follow, as indicated by the sub­sequent question mark in each case.

The con­struc­tion for three-digit numbers plus ad­di­tion­al letters is still very clear, but it would look much different for numbers with up to ten digits. In this case, curly brackets are re­com­men­ded, as in the following regular ex­pres­sion example:

[1-9][0-9]{0,9}

As in the previous example, the ex­pres­sion must start with a number between “1” and “9.” However, this number can be followed either by no digits or up to nine digits between “0” and “9.” This means that the search result can consist of up to ten digits.
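
Both patterns from this section can be tested against a handful of candidates. The following is a minimal sketch assuming Python’s re module; the variable names and test values are invented for illustration:

```python
import re

house_number = re.compile(r"[1-9][0-9]?[0-9]?[a-z]?")
up_to_ten_digits = re.compile(r"[1-9][0-9]{0,9}")

candidates = ["7", "42", "221b", "0", "1234"]
valid_house_numbers = [c for c in candidates if house_number.fullmatch(c)]
print(valid_house_numbers)  # ['7', '42', '221b']

# Ten digits are accepted, eleven digits or a leading zero are not.
ten_ok = up_to_ten_digits.fullmatch("1234567890") is not None
eleven_ok = up_to_ten_digits.fullmatch("12345678901") is not None
leading_zero_ok = up_to_ten_digits.fullmatch("0123") is not None
print(ten_ok, eleven_ok, leading_zero_ok)  # True False False
```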

Regular ex­pres­sion with any number of re­pe­ti­tions

In the previous examples of single- and multi-element expressions, both the minimum and maximum number of characters were known. But there are also scenarios where you shouldn’t precisely define the character set of a regex in advance. In this case, the necessary metacharacters are the asterisk and the plus sign, which allow for any number of repetitions of a character, character class, or group. You can capture all strings with any number of digits (even zero digits) using the following regular expression:

[0-9]*

The same applies if you’re searching for a specific combination of characters in which one (or more) characters can occur any number of times, as in the following example:

ab*

Possible matches include the “a” in “apple,” the “ab” in “abnormal” and the “abb” in “abbey.” If you want to exclude the first match because the specified character must occur at least once, you should use the plus sign instead:

ab+
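
The effect of swapping the asterisk for the plus sign can be made visible by printing the matched part of each word. This is a sketch assuming Python’s re module; the helper first_match and the word “cherry” are added for illustration:

```python
import re

def first_match(pattern, text):
    """Return the first substring matched by pattern, or None if there is none."""
    m = re.search(pattern, text)
    return m.group(0) if m else None

words = ["apple", "abnormal", "abbey", "cherry"]

star = [first_match(r"ab*", w) for w in words]  # zero or more "b"
plus = [first_match(r"ab+", w) for w in words]  # at least one "b"

print(star)  # ['a', 'ab', 'abb', None]
print(plus)  # [None, 'ab', 'abb', None]
```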

Negating character classes

You have to use the negator “^” (caret) if you want to use a regular expression with character classes that represent one or more characters, but you want to exclude one or more specific characters as matches. This sign is always placed directly after the opening bracket of a character class and only applies within these brackets. The following instruction is a good example of a negated character class:

F[^u]n

In this example, the second character can be any character other than “u.” Matches would therefore include the word “Fan.” However, the word “Fun” would not be matched because its second character is the excluded “u.”
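
A quick check of the negated class, sketched in Python’s re module with a few made-up test words:

```python
import re

words = ["Fan", "Fun", "Fin", "fan", "F n"]

# [^u] accepts any single character except "u", including a space.
accepted = [w for w in words if re.fullmatch(r"F[^u]n", w)]

print(accepted)  # ['Fan', 'Fin', 'F n']
```

Note that “fan” is rejected for a different reason than “Fun”: the pattern is case-sensitive and requires a capital “F.”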

Wildcards

Regex also allows you to use wildcards that represent one, more than one or no char­ac­ters within a search pattern (depending on the metachar­ac­ter you’re using). You create the wildcard using a dot combined with the above-mentioned special char­ac­ters for re­pe­ti­tions if you want a result other than a single character. A regular ex­pres­sion example such as this one would allow you to search a database for a person if you know the person’s first and last name but you don’t know whether a middle name was also entered for the person:

John.*Smith

In this case, possible matches would include “John William Smith” (as well as any other combination with a middle name) or “John W. Smith” and “John Smith.” Keep in mind that the dot also matches the space character. If you only want to include variants with a middle name, write out the two spaces and use a plus sign instead of an asterisk:

John .+ Smith

The following search pattern matches both “Back” and “Buck” and is a good example of how to use a wildcard for a single character:

B.ck
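
Because the dot also matches a space, it is worth checking exactly which names each wildcard pattern accepts. The following is a sketch assuming Python’s re module; the helper accepted_by and the test strings “JohnSmith” and “Bck” are added for illustration:

```python
import re

names = ["John William Smith", "John W. Smith", "John Smith", "JohnSmith"]

def accepted_by(pattern):
    """Return the names that the pattern matches in full."""
    return [n for n in names if re.fullmatch(pattern, n)]

any_gap = accepted_by(r"John.*Smith")        # zero or more characters in between
one_or_more = accepted_by(r"John.+Smith")    # at least one character (a space is enough)
middle_name = accepted_by(r"John .+ Smith")  # something between the two spaces

print(any_gap)      # all four names
print(one_or_more)  # ['John William Smith', 'John W. Smith', 'John Smith']
print(middle_name)  # ['John William Smith', 'John W. Smith']

# Single-character wildcard: exactly one character between "B" and "ck".
single = [w for w in ["Back", "Buck", "Bck", "back"] if re.fullmatch(r"B.ck", w)]
print(single)       # ['Back', 'Buck']
```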

Al­tern­at­ives

You can form a regular ex­pres­sion so that there are two or more al­tern­at­ives for a match. The al­tern­at­ives are separated with a vertical bar, as in the following example:

Tree|Flower

This ex­pres­sion would find matches for both “Tree” and “Flower.”

You can also use groups to form al­tern­at­ives within words or strings:

(Sun|Mon|Tues|Wednes|Thurs|Fri|Satur)day

In this example, each day of the week is a potential match. Because the alternatives are grouped in brackets, the shared ending “day” only has to be written once. Abbreviated forms such as “Sun” on their own are not matched, since the ending “day” is mandatory.
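
Both kinds of alternatives can be tried out as follows. This is a sketch in Python’s re module, with sample sentences invented for illustration:

```python
import re

# Plain alternatives: either word counts as a match.
found = re.findall(r"Tree|Flower", "A Tree, a Flower and a Bush")
print(found)  # ['Tree', 'Flower']

# Grouped alternatives followed by the shared ending "day".
weekday = re.compile(r"(Sun|Mon|Tues|Wednes|Thurs|Fri|Satur)day")
hit = weekday.search("See you on Friday")
print(hit.group(0))  # Friday  (the whole match)
print(hit.group(1))  # Fri     (the alternative chosen inside the group)

# The ending "day" is part of the pattern, so a bare "Sun" is rejected.
bare_prefix = weekday.fullmatch("Sun")
print(bare_prefix)   # None
```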

Groups

Like character classes, the character groups in the example in the previous section are structural elements of regex. They are defined by a pair of brackets and basically represent a pattern consisting of one or more characters. Strictly speaking, each regex is therefore a group, but it is not identified using brackets in this case. Groups allow you to apply quantifiers such as the plus sign or the asterisk to a subexpression within an expression:

ab(cd)+

In this case, the desired unlimited re­pe­ti­tion applies to the character group “cd.” Written in the same notation without brackets, it would apply only to the “d.” There are no re­stric­tions on the number of groups within a regex.
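
The difference between the grouped and ungrouped notation can be verified directly. This is a sketch assuming Python’s re module, with test strings chosen for illustration:

```python
import re

# With the group, the plus sign repeats the whole sequence "cd".
grouped_cdcd = re.fullmatch(r"ab(cd)+", "abcdcd") is not None
grouped_cddd = re.fullmatch(r"ab(cd)+", "abcddd") is not None

# Without the group, the plus sign repeats only the "d".
plain_cdcd = re.fullmatch(r"abcd+", "abcdcd") is not None
plain_cddd = re.fullmatch(r"abcd+", "abcddd") is not None

print(grouped_cdcd, grouped_cddd)  # True False
print(plain_cdcd, plain_cddd)      # False True
```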

Nested groups

A regular ex­pres­sion can not only contain any number of groups. It can also contain any number of nested groups in order to express complex re­la­tion­ships between simple char­ac­ters and special char­ac­ters without un­ne­ces­sar­ily long strings. Possible matches for the regex pattern in the following example are the four car models “VW Golf,” “VW Jetta,” “Ford Explorer” or “Ford Focus”:

(VW (Golf|Jetta)|Ford (Explorer|Focus))
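
The nested pattern can be checked against valid and invalid combinations. This is a sketch in Python’s re module; the two mixed-up model names are invented as counterexamples:

```python
import re

cars = re.compile(r"(VW (Golf|Jetta)|Ford (Explorer|Focus))")

models = ["VW Golf", "VW Jetta", "Ford Explorer", "Ford Focus", "VW Focus", "Ford Golf"]
accepted = [m for m in models if cars.fullmatch(m)]
print(accepted)  # ['VW Golf', 'VW Jetta', 'Ford Explorer', 'Ford Focus']

# Each pair of brackets is numbered by its opening bracket;
# a group that did not take part in the match is reported as None.
groups = cars.fullmatch("Ford Focus").groups()
print(groups)    # ('Ford Focus', None, 'Focus')
```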

Word bound­ar­ies

If you want to include word bound­ar­ies, meaning the beginning or end of an al­pha­nu­mer­ic sequence, in a regular ex­pres­sion, you have to specify this with a metachar­ac­ter. Many languages use the com­bin­a­tion “\b,” which can be added at the beginning, end or beginning and end of a search pattern.

The first option requires that the search sequence be at the beginning of the word:

\band

The word “andromeda” is included in the matches for this regular expression example. On the other hand, the word “band” is not matched because the characters being searched are preceded by the letter “b.” To flip things around, use the second option and add the special characters at the end:

and\b

Finally, with the third option, you make both word bound­ar­ies a re­quire­ment. In this example, the only possible match is the con­junc­tion “and.”

\band\b
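
The three options can be compared side by side. The snippet is a sketch assuming Python’s re module; note that in Python the pattern should be written as a raw string (r"..."), because "\b" in an ordinary string literal means a backspace character. The helper hits and the word “sandy” are added for illustration:

```python
import re

words = ["andromeda", "band", "and", "sandy"]

def hits(pattern):
    """Return the words in which the pattern is found."""
    return [w for w in words if re.search(pattern, w)]

at_start = hits(r"\band")     # "and" at the beginning of a word
at_end = hits(r"and\b")       # "and" at the end of a word
whole_word = hits(r"\band\b") # "and" as a complete word

print(at_start)    # ['andromeda', 'and']
print(at_end)      # ['band', 'and']
print(whole_word)  # ['and']
```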

Ignoring the meta-meaning of special char­ac­ters

In the previous section, we used the backslash to ensure that the “b” following it was treated as a metachar­ac­ter and not as a letter. If you combine it with char­ac­ters that are standard special char­ac­ters for regex syntax, it has exactly the opposite effect and the character is treated as an ordinary literal. Thanks to this option, you can easily search for a specific date with a regular ex­pres­sion.

11\.10\.2019

In this case, the date “11.10.2019” is the only character string that matches the required search criteria. Without the backslash, the two dots would be in­ter­preted as wildcards for any character, which is why matches such as “1101092019” or “11a10b2019” would be possible.
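
The effect of the backslash can be reproduced as follows. This is a sketch assuming Python’s re module, where re.escape is a standard helper that adds the backslashes for you:

```python
import re

escaped = r"11\.10\.2019"   # dots are literal dots
unescaped = r"11.10.2019"   # dots are wildcards

exact_date = re.fullmatch(escaped, "11.10.2019") is not None
escaped_other = re.fullmatch(escaped, "11a10b2019") is not None
unescaped_other = re.fullmatch(unescaped, "11a10b2019") is not None

print(exact_date, escaped_other, unescaped_other)  # True False True

# re.escape produces the escaped pattern from the plain text.
auto_escaped = re.escape("11.10.2019")
print(auto_escaped == escaped)  # True
```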

“Re­strict­ing” greedy regex

Quantifiers (“?,” “+,” “*,” “{}”) are “greedy” by default, meaning that they try to find the longest possible match. However, since this behaviour isn’t always desired, you can modify quantifiers in a regular expression to make it less “greedy.” The following example illustrates this modification process:

A.*B

When applied to the string “ABCDEB,” this greedy expression would include the entire string in the match instead of stopping after “AB.” If you want the search to stop as soon as the first “B” is found, you have to modify the quantifier. In many languages (including Perl, Tcl, HTML), you add a question mark after the quantifier for this purpose:

A.*?B

Al­tern­at­ively, you can replace the original greedy ex­pres­sion with the following equi­val­ent “non-greedy” ex­pres­sion to arrive at the same result:

A[^B]*B

Note

Restricting greedy regex can make processing a search pattern more complicated and increase search time.
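
The three variants from this section can be compared on the sample string. This is a minimal sketch assuming Python’s re module, which supports the question-mark modifier for quantifiers:

```python
import re

text = "ABCDEB"

greedy = re.search(r"A.*B", text).group(0)      # longest possible match
lazy = re.search(r"A.*?B", text).group(0)       # stops at the first "B"
negated = re.search(r"A[^B]*B", text).group(0)  # cannot run past a "B"

print(greedy)   # ABCDEB
print(lazy)     # AB
print(negated)  # AB
```

The third pattern still uses a greedy quantifier, but because the character class excludes “B,” the repetition has to end at the first “B.”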
