Regular expression
Definition
A regular expression (often written as regex or regexp) is a sequence of characters that defines a search pattern. It is used to match, locate, and manipulate strings based on specific rules and symbols. A regex can represent simple literal text or complex patterns using metacharacters, quantifiers, character classes, anchors, and groups.
Example:
- Pattern: `cat`
- Matches: `cat`, `category`, `scatter`
- It finds the substring `cat` inside larger text.
- Pattern: `^\d{10}$`
- Matches a string containing exactly 10 digits and nothing else.
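The two patterns above can be tried out with a minimal sketch. It assumes Python's standard `re` module; the choice of Python and the sample strings are assumptions made for illustration only.

```python
import re

# Substring search: `cat` is found anywhere inside each word.
words = ["cat", "category", "scatter", "dog"]
contains_cat = [w for w in words if re.search(r"cat", w)]
# contains_cat == ['cat', 'category', 'scatter']

# Whole-string validation: the anchors ^ and $ force the ten digits
# to span the entire input.
ten_digits = re.compile(r"^\d{10}$")
is_valid = ten_digits.search("9876543210") is not None    # True
is_too_short = ten_digits.search("12345") is not None     # False
```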
Main Content
1. Basic Pattern Matching
- Literal matching: The simplest form of regular expression matches exact text. For example, the pattern `apple` matches the word `apple` wherever it appears.
- Case sensitivity and partial search: Depending on the programming language or tool, matching may be case-sensitive by default. Pattern matching can also find a text fragment inside a larger string, such as matching `cat` in `The cat sat on the mat.`
Regular expression matching begins with the idea that certain characters have special meanings while others represent themselves. If a regex contains only ordinary characters, it behaves like a normal text search.
Examples:
- `dog` matches `dog`
- `Java` matches `Java`
- `2024` matches `2024`
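Literal matching and case sensitivity might look like this in practice. The sketch assumes Python's `re` module, where matching is case-sensitive unless a flag says otherwise; other tools may default differently.

```python
import re

text = "The cat sat on the mat."

# Literal pattern: ordinary characters match themselves.
found = re.search(r"cat", text)
position = found.span() if found else None   # (4, 7)

# Case-sensitive by default in Python, so "CAT" is not found...
sensitive = re.search(r"CAT", text)          # None
# ...unless the IGNORECASE flag is passed.
insensitive = re.search(r"CAT", text, re.IGNORECASE)
matched_text = insensitive.group() if insensitive else None   # 'cat'
```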
However, regex becomes much more useful when it is used to identify patterns instead of exact words:
- `\d\d\d` can match any three consecutive digits
- `[A-Za-z]` can match any single English letter, uppercase or lowercase
- `colou?r` can match both `color` and `colour`
This type of pattern matching is essential when text is not fixed but follows a structure.
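As a rough illustration of matching by structure rather than by exact word, the three patterns above can be run against sample text. This is a sketch in Python's `re` module, and the input strings are invented for the example.

```python
import re

# Three consecutive digits, wherever they occur.
three_digits = re.findall(r"\d\d\d", "Call 555 or 911")    # ['555', '911']

# The first single letter in a string that starts with digits.
first_letter = re.search(r"[A-Za-z]", "42nd Street").group()   # 'n'

# The optional `u` lets one pattern cover both spellings.
spellings = re.findall(r"colou?r", "color and colour")     # ['color', 'colour']
```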
2. Metacharacters and Character Classes
- Metacharacters: These are special symbols with predefined meanings in regex. Common metacharacters include `.`, `^`, `$`, `*`, `+`, `?`, `|`, `()`, `[]`, `{}`, and `\`.
- Character classes: These define a set of acceptable characters. For example, `[abc]` matches one of `a`, `b`, or `c`, while `[0-9]` matches any digit.
Metacharacters give regular expressions their flexibility. Instead of searching for exact text only, they allow the pattern to describe a family of possible strings.
Common examples:
- `.` matches any single character except newline in many regex engines
- `\d` matches any digit from `0` to `9`
- `\w` matches letters, digits, and underscore
- `\s` matches whitespace characters such as space, tab, and newline
- `\D`, `\W`, and `\S` are the negated forms
Character classes are especially useful for controlled matching:
- `[aeiou]` matches any lowercase vowel
- `[A-Z]` matches any uppercase English letter
- `[0-9a-fA-F]` matches any hexadecimal digit
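The shorthand and bracketed classes listed above can be compared side by side on one short string. The sketch below assumes Python's `re` module and a made-up `sample` value; results may differ slightly in engines with other Unicode or newline defaults.

```python
import re

sample = "id_7 ok\tB2"

digits = re.findall(r"\d", sample)        # ['7', '2']
words = re.findall(r"\w+", sample)        # ['id_7', 'ok', 'B2']
spaces = re.findall(r"\s", sample)        # [' ', '\t']
vowels = re.findall(r"[aeiou]", sample)   # ['i', 'o']
capitals = re.findall(r"[A-Z]", sample)   # ['B']

# The dot does not match a newline unless a flag such as re.DOTALL is set.
dot_hits = re.findall(r"a.c", "abc a\nc a-c")   # ['abc', 'a-c']

# Every character must be a hexadecimal digit for fullmatch to succeed.
is_hex = re.fullmatch(r"[0-9a-fA-F]+", "1aF3") is not None    # True
not_hex = re.fullmatch(r"[0-9a-fA-F]+", "1aG3") is not None   # False
```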
Example:
- Pattern: `[A-Z][a-z]+`
- Meaning: one uppercase letter followed by one or more lowercase letters
- Matches: `John`, `Paris`, `Laptop`
Negated character classes:
- `[^0-9]` means any character except a digit
- `[^aeiou]` means any character except the listed vowels
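A short sketch of the capitalized-word pattern and the negated classes, assuming Python's `re` module. The sentence and the phone-style string are invented inputs.

```python
import re

# One uppercase letter followed by one or more lowercase letters.
names = re.findall(r"[A-Z][a-z]+", "John flew to Paris with a Laptop")
# names == ['John', 'Paris', 'Laptop']

# [^0-9] matches every non-digit, so removing those leaves only digits.
digits_only = re.sub(r"[^0-9]", "", "+1 (555) 010-9999")   # '15550109999'

# [^aeiou] keeps only the characters that are not lowercase vowels.
non_vowels = re.findall(r"[^aeiou]", "audio")              # ['d']
```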
3. Quantifiers, Anchors, and Grouping
- Quantifiers: These control how many times a character or group may repeat. Examples include `*` for zero or more, `+` for one or more, `?` for zero or one, and `{m,n}` for between m and n repetitions.
- Anchors and grouping: Anchors such as `^` and `$` match at the beginning and end of a string. Grouping with parentheses `()` allows patterns to be combined and repeated as a unit.
Quantifiers are essential for expressing variable-length patterns.
Examples:
- `a*` matches the empty string, `a`, `aa`, `aaa`, and so on
- `a+` matches `a`, `aa`, `aaa`, but not empty text
- `colou?r` matches `color` and `colour`
- `\d{4}` matches exactly four digits
- `\d{2,4}` matches 2 to 4 digits
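Each quantifier example above can be checked against whole strings. This is a minimal sketch in Python; the helper name `matches` is hypothetical and simply wraps `re.fullmatch`, which succeeds only if the entire string fits the pattern.

```python
import re

def matches(pattern: str, text: str) -> bool:
    """Return True if the whole of `text` fits `pattern`."""
    return re.fullmatch(pattern, text) is not None

star_empty = matches(r"a*", "")          # True: zero repetitions are allowed
plus_empty = matches(r"a+", "")          # False: at least one `a` is required
plus_many = matches(r"a+", "aaa")        # True
optional_u = matches(r"colou?r", "colour") and matches(r"colou?r", "color")  # True
exactly_four = matches(r"\d{4}", "2024")       # True
three_not_four = matches(r"\d{4}", "202")      # False
in_range = matches(r"\d{2,4}", "123")          # True
out_of_range = matches(r"\d{2,4}", "12345")    # False
```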
Anchors help validate complete strings:
- `^Hello` means the string must start with `Hello`
- `world$` means the string must end with `world`
- `^\d{5}$` matches a string of exactly five digits, such as a five-digit postal code
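The effect of anchors is easiest to see by comparing anchored and unanchored searches. The sketch below assumes Python's `re.search`, which scans the whole string unless the pattern itself is anchored; the inputs are illustrative.

```python
import re

starts_with = bool(re.search(r"^Hello", "Hello there"))   # True
not_at_start = bool(re.search(r"^Hello", "Say Hello"))    # False
ends_with = bool(re.search(r"world$", "Hello world"))     # True

# With both anchors, the five digits must be the entire string.
whole_code = bool(re.search(r"^\d{5}$", "90210"))         # True
embedded = bool(re.search(r"^\d{5}$", "zip 90210 ok"))    # False

# Without anchors, the same digits are found inside longer text.
unanchored = bool(re.search(r"\d{5}", "zip 90210 ok"))    # True
```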
Grouping allows repeated sequences or subpatterns:
- `(ab)+` matches `ab`, `abab`, `ababab`
- `(\d{2})/(\d{2})/(\d{4})` can capture a date in `dd/mm/yyyy` format
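Grouping serves two purposes here: repeating a unit and capturing parts of a match. A small sketch, assuming Python's `re` module and an invented sentence containing a date:

```python
import re

# The group (ab) is repeated as a unit by the + quantifier.
repeated = re.fullmatch(r"(ab)+", "ababab") is not None   # True
incomplete = re.fullmatch(r"(ab)+", "aba") is not None    # False

# Each pair of parentheses captures one part of the date.
date = re.search(r"(\d{2})/(\d{2})/(\d{4})", "Due 25/12/2024.")
day, month, year = date.groups()   # ('25', '12', '2024')
```

The pattern checks only the shape of the date, so a string such as `99/99/9999` would also be captured; validating real calendar dates needs extra logic.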
A simple visual idea of a regex match for an email-like pattern:
start -> letters/digits -> @ -> domain -> . -> extension -> end
         (user name)            (site)         (type)
This shows how regex can model structure rather than a single word.
Working / Process
1. Identify the text requirement
- Decide what kind of string you want to search, validate, or extract.
- Example: a phone number, email, date, or keyword list.
- Determine whether you need an exact match, partial search, or data extraction.
2. Build the regular expression pattern
- Choose literal characters, metacharacters, character classes, quantifiers, anchors, and groups.
- Example for a 10-digit number: `^\d{10}$`
- Example for a basic email-like pattern: `^[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}$`
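The two example patterns from this step could be compiled once and reused. The sketch below assumes Python's `re` module; the constant names `PHONE` and `EMAIL_LIKE` and the sample addresses are hypothetical.

```python
import re

# Patterns taken from the examples above, compiled for reuse.
PHONE = re.compile(r"^\d{10}$")
EMAIL_LIKE = re.compile(r"^[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}$")

phone_ok = PHONE.search("9876543210") is not None               # True
email_ok = EMAIL_LIKE.search("user.name@example.com") is not None   # True

# No dot-separated extension after the domain, so this is rejected.
no_extension = EMAIL_LIKE.search("user@example") is not None    # False
# The part before @ must contain at least one character.
no_user = EMAIL_LIKE.search("@example.com") is not None         # False
```

The email-like pattern is deliberately basic: it checks the general shape of an address and does not cover the full range of addresses that mail systems accept.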
3. Test and refine the pattern
- Apply the regex to sample inputs and check whether it matches the desired strings only.
- Adjust the pattern to avoid false positives or false negatives.
- Example: If a pattern is too strict, valid strings may fail; if too broad, invalid strings may pass.
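One way to make the test-and-refine step concrete is to keep two lists of samples and report which ones a candidate pattern gets wrong. This is only a sketch in Python: the helper `check` and the sample lists are hypothetical, and a 10-digit number is used as the target format.

```python
import re

def check(pattern, should_match, should_not_match):
    """Return (false_negatives, false_positives) for a candidate pattern."""
    regex = re.compile(pattern)
    false_negatives = [s for s in should_match if not regex.fullmatch(s)]
    false_positives = [s for s in should_not_match if regex.fullmatch(s)]
    return false_negatives, false_positives

valid = ["9876543210", "0123456789"]
invalid = ["12345", "98765432101", "98765 43210"]

# Too broad: any run of digits passes, so wrong lengths slip through.
broad = check(r"\d+", valid, invalid)
# broad == ([], ['12345', '98765432101'])

# Too strict: requiring a leading 9 rejects a valid sample.
strict = check(r"9\d{9}", valid, invalid)
# strict == (['0123456789'], [])

# Refined: exactly ten digits gives no errors on these samples.
refined = check(r"\d{10}", valid, invalid)
# refined == ([], [])
```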
A general regex workflow:
Need text rule
↓
Write pattern
↓
Test with examples
↓
Fix mistakes
↓
Use in program/tool
In practice, regular expressions are often used with functions such as the following (exact names vary by language and library):
- `search()` to find a pattern anywhere in text
- `match()` to check at the start of text
- `findall()` to extract all matches
- `replace()` or substitution functions to modify matched text
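In Python's `re` module, for instance, these operations correspond to `re.search`, `re.match`, `re.findall`, and `re.sub` (the substitution function). The sketch below uses an invented `text` value to show how they differ.

```python
import re

text = "Order 66 shipped, order 67 pending"

# search: first match anywhere in the text.
anywhere = re.search(r"\d+", text)
first_number = anywhere.group() if anywhere else None   # '66'

# match: only succeeds if the pattern fits at the very start.
at_start = re.match(r"\d+", text)          # None, the text starts with 'Order'

# findall: every non-overlapping match, as a list of strings.
all_numbers = re.findall(r"\d+", text)     # ['66', '67']

# sub: replace each match with new text.
masked = re.sub(r"\d+", "#", text)         # 'Order # shipped, order # pending'
```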
Advantages / Applications
- Efficient text searching and validation: Regex quickly checks whether input follows a required format, such as dates, postal codes, usernames, or passwords.
- Data extraction and cleanup: It is very useful for pulling out structured information from large text, logs, HTML-like content, or documents.
- Wide use across tools and languages: Regular expressions are supported in many programming languages, text editors, databases, and command-line tools, which makes them practical and largely portable, although syntax details vary between engines.
Applications include:
- Form validation on websites
- Searching and replacing text in editors
- Parsing log files
- Extracting phone numbers, emails, URLs, and IDs
- Splitting text into parts
- Filtering records in databases
- Tokenizing text in basic language-processing tasks
Examples of common use:
- Validate email input
- Check password strength rules
- Find all words beginning with a capital letter
- Remove extra spaces from text
- Extract dates from a document
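A few of the common uses listed above can be sketched in a handful of lines. The example assumes Python's `re` module; the input strings are made up, and the date pattern assumes a `dd/mm/yyyy` layout.

```python
import re

# Remove extra spaces: collapse each run of whitespace into one space.
messy = "Regex   removes    extra spaces"
clean = re.sub(r"\s+", " ", messy).strip()
# clean == 'Regex removes extra spaces'

# Find all words beginning with a capital letter.
capitalized = re.findall(r"\b[A-Z]\w*", "Alice met Bob in Rome today")
# capitalized == ['Alice', 'Bob', 'Rome']

# Extract dates written as dd/mm/yyyy from a document.
dates = re.findall(r"\b\d{2}/\d{2}/\d{4}\b", "Paid 01/02/2023, due 15/03/2024.")
# dates == ['01/02/2023', '15/03/2024']
```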
Summary
- A regular expression is a pattern used to match and process text.
- It uses symbols, classes, quantifiers, anchors, and groups to describe string patterns.
- It is widely used for searching, validation, extraction, and text manipulation.
- Important terms to remember: regex, metacharacter, character class, quantifier, anchor, group, match, and pattern.