Regular expression

Definition

A regular expression (often written as regex or regexp) is a sequence of characters that defines a search pattern. It is used to match, locate, and manipulate strings based on specific rules and symbols. A regex can represent simple literal text or complex patterns using metacharacters, quantifiers, character classes, anchors, and groups.

Example:

  • Pattern: cat
  • Matches: cat, category, scatter
    It finds the substring cat inside larger text.

Example:

  • Pattern: ^\d{10}$
  • Matches: a string containing exactly 10 digits and nothing else.
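
The two example patterns can be tried with Python's built-in re module. This is a minimal sketch; the sample words and the phone-like number are made up for illustration.

```python
import re

# Literal pattern: re.search finds "cat" anywhere inside each word.
words = ["cat", "category", "scatter", "dog"]
contains_cat = [w for w in words if re.search(r"cat", w)]

# Anchored pattern: the whole string must be exactly 10 digits.
ten_digits = re.compile(r"^\d{10}$")
is_valid = ten_digits.match("9876543210") is not None
is_too_long = ten_digits.match("98765432101") is not None

print(contains_cat)           # ['cat', 'category', 'scatter']
print(is_valid, is_too_long)  # True False
```

The anchors ^ and $ are what turn a substring search into a whole-string check: without them, the 11-digit input would also contain a 10-digit match.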

Main Content

1. Basic Pattern Matching

  • Literal matching: The simplest form of regular expression matches exact text. For example, the pattern apple matches the word apple wherever it appears.

  • Case sensitivity and partial search: Most regex engines are case-sensitive by default, and many provide a flag or option to ignore case. A pattern can also find a text fragment inside a larger string, such as matching cat in The cat sat on the mat.

Regular expression matching begins with the idea that certain characters have special meanings while others represent themselves. If a regex contains only ordinary characters, it behaves like a normal text search.

Examples:

  • dog matches dog
  • Java matches Java
  • 2024 matches 2024

However, regex becomes much more useful when it is used to identify patterns instead of exact words:

  • \d\d\d can match any three digits
  • [A-Za-z] can match any single alphabetic letter
  • colou?r can match both color and colour

This type of pattern matching is essential when text is not fixed but follows a structure.
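
As a rough illustration, the three patterns above behave as follows in Python's re module. The input strings are invented for this sketch.

```python
import re

# \d\d\d: three consecutive digits anywhere in the text.
three_digits = re.search(r"\d\d\d", "Room 204 is free")

# [A-Za-z]: the first single alphabetic letter in the text.
first_letter = re.search(r"[A-Za-z]", "42 apples")

# colou?r: the "u" is optional, so both spellings are found.
spellings = re.findall(r"colou?r", "color and colour")

print(three_digits.group())  # 204
print(first_letter.group())  # a
print(spellings)             # ['color', 'colour']
```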

2. Metacharacters and Character Classes

  • Metacharacters: These are special symbols with predefined meanings in regex. Common metacharacters include ., ^, $, *, +, ?, |, (), [], {}, and \.

  • Character classes: These define a set of acceptable characters. For example, [abc] matches one of a, b, or c, while [0-9] matches any digit.

Metacharacters give regular expressions their flexibility. Instead of searching for exact text only, they allow the pattern to describe a family of possible strings.

Common examples:

  • . matches any single character except newline in many regex engines
  • \d matches any digit from 0 to 9
  • \w matches letters, digits, and underscore
  • \s matches whitespace characters such as space, tab, and newline
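  • \D, \W, and \S are the negated forms of \d, \w, and \s: they match any character that is not a digit, not a word character, and not whitespace, respectively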

Character classes are especially useful for controlled matching:

  • [aeiou] matches any lowercase vowel
  • [A-Z] matches any uppercase English letter
  • [0-9a-fA-F] matches hexadecimal digits

Example:

  • Pattern: [A-Z][a-z]+
  • Meaning: one uppercase letter followed by one or more lowercase letters
  • Matches: John, Paris, Laptop

Negated character classes:

  • [^0-9] means any character except a digit
  • [^aeiou] means any character except the listed vowels
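
The character classes above can be checked with a short Python sketch. The sentence and the phone-like string are illustrative inputs, not taken from the notes.

```python
import re

# [A-Z][a-z]+ : one uppercase letter followed by one or more lowercase letters.
names = re.findall(r"[A-Z][a-z]+", "John flew to Paris with a Laptop")

# [0-9a-fA-F] : individual hexadecimal digits ("x" is not one of them).
hex_digits = re.findall(r"[0-9a-fA-F]", "0x1F")

# [^0-9] : negated class, here used to strip every non-digit character.
digits_only = re.sub(r"[^0-9]", "", "+1 (555) 010-9999")

# [^aeiou] : any character except the listed lowercase vowels.
non_vowels = re.findall(r"[^aeiou]", "audio")

print(names)        # ['John', 'Paris', 'Laptop']
print(hex_digits)   # ['0', '1', 'F']
print(digits_only)  # 15550109999
print(non_vowels)   # ['d']
```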

3. Quantifiers, Anchors, and Grouping

  • Quantifiers: These control how many times a character or group may repeat. Examples include * for zero or more, + for one or more, ? for zero or one, and {m,n} for at least m and at most n repetitions.

  • Anchors and grouping: Anchors such as ^ and $ mark the beginning and end of a string (or of each line, when the engine's multiline mode is enabled). Grouping with parentheses () allows patterns to be combined and repeated as a unit.

Quantifiers are essential for expressing flexible length patterns.

Examples:

  • a* matches the empty string, a, aa, aaa, and so on
  • a+ matches a, aa, aaa, but not empty text
  • colou?r matches color and colour
  • \d{4} matches exactly four digits
  • \d{2,4} matches 2 to 4 digits

Anchors help validate complete strings:

  • ^Hello means the string must start with Hello
  • world$ means the string must end with world
  • ^\d{5}$ matches a string consisting of exactly five digits, such as a five-digit postal code

Grouping allows repeated sequences or subpatterns:

  • (ab)+ matches ab, abab, ababab
  • (\d{2})/(\d{2})/(\d{4}) can capture a date in dd/mm/yyyy format
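
A small Python sketch of these quantifier and grouping examples follows. It uses re.fullmatch so that each test applies to the whole string; the date sentence is an invented sample.

```python
import re

# Quantifiers, checked against the entire string with fullmatch.
star_matches_empty = re.fullmatch(r"a*", "") is not None       # zero repetitions allowed
plus_matches_empty = re.fullmatch(r"a+", "") is not None       # at least one "a" required
two_to_four = re.fullmatch(r"\d{2,4}", "123") is not None      # 3 digits is within 2..4
five_is_too_many = re.fullmatch(r"\d{2,4}", "12345") is not None

# Grouping: (ab)+ repeats the two-character unit "ab".
repeated_unit = re.fullmatch(r"(ab)+", "ababab") is not None

# Capturing groups: each pair of parentheses stores one part of the date.
date_match = re.search(r"(\d{2})/(\d{2})/(\d{4})", "Due on 25/12/2024.")
day, month, year = date_match.groups()

print(star_matches_empty, plus_matches_empty)  # True False
print(two_to_four, five_is_too_many)           # True False
print(repeated_unit)                           # True
print(day, month, year)                        # 25 12 2024
```

Note that the date pattern only checks the shape dd/mm/yyyy; it does not verify that the day and month are valid calendar values.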

A simple visual idea of a regex match for an email-like pattern:

start -> letters/digits -> @ -> domain -> . -> extension -> end
         (user name)        (site)      (type)

This shows how regex can model structure rather than a single word.
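
The diagram can be mirrored one stage at a time with named groups in Python. This is a deliberately simplified sketch: the group names user, site, and ext are hypothetical labels chosen to match the diagram, and the pattern accepts only letters and digits, so it is not a real email validator.

```python
import re

# start -> letters/digits -> @ -> domain -> . -> extension -> end
EMAIL_LIKE = re.compile(
    r"^(?P<user>[A-Za-z0-9]+)"   # user name: letters/digits
    r"@"
    r"(?P<site>[A-Za-z0-9]+)"    # site: domain name
    r"\."
    r"(?P<ext>[A-Za-z]+)$"       # type: extension, then end of string
)

parts = EMAIL_LIKE.match("alice42@example.com").groupdict()
rejected = EMAIL_LIKE.match("alice@@example.com")

print(parts)     # {'user': 'alice42', 'site': 'example', 'ext': 'com'}
print(rejected)  # None
```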


Working / Process

1. Identify the text requirement

  • Decide what kind of string you want to search, validate, or extract.
  • Example: a phone number, email, date, or keyword list.
  • Determine whether you need an exact match, partial search, or data extraction.

2. Build the regular expression pattern

  • Choose literal characters, metacharacters, character classes, quantifiers, anchors, and groups.
  • Example for a 10-digit number: ^\d{10}$
  • Example for a basic email-like pattern: ^[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}$

3. Test and refine the pattern

  • Apply the regex to sample inputs and check whether it matches the desired strings only.
  • Adjust the pattern to avoid false positives or false negatives.
  • Example: If a pattern is too strict, valid strings may fail; if too broad, invalid strings may pass.
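
One way to carry out this testing step is to keep a small table of sample inputs with the expected result for each, then list any disagreements. The sketch below does this in Python for the email-like pattern from step 2; the sample addresses are made up.

```python
import re

EMAIL = re.compile(r"^[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}$")

# Each sample maps to the result we expect from the pattern.
samples = {
    "user@example.com": True,
    "first.last@mail.example.org": True,
    "no-at-sign.example.com": False,   # missing @
    "user@example": False,             # missing .extension
    "user@example.c": False,           # extension shorter than 2 letters
}

# Collect every input where the pattern disagrees with the expectation.
failures = [
    text for text, expected in samples.items()
    if (EMAIL.match(text) is not None) != expected
]

print(failures)  # []
```

An empty list means the pattern agrees with every sample. A false positive or false negative would show up here by name, which makes it easier to see which part of the pattern to adjust.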

A general regex workflow:

Need text rule
      ↓
Write pattern
      ↓
Test with examples
      ↓
Fix mistakes
      ↓
Use in program/tool

In practice, regular expressions are often used with functions such as the following (the names vary by language; these follow Python's re module):

  • search() to find a pattern anywhere in text
  • match() to check at the start of text
  • findall() to extract all matches
  • sub(), replace(), or a similar substitution function to modify matched text
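
A brief sketch of how these functions differ, using Python's re module (where the substitution function is called sub). The order text is an invented sample.

```python
import re

text = "Order 66 shipped; order 67 pending"

found_anywhere = re.search(r"\d+", text)  # first run of digits, anywhere in text
at_start = re.match(r"\d+", text)         # None: the text starts with "Order"
all_numbers = re.findall(r"\d+", text)    # every run of digits
masked = re.sub(r"\d+", "#", text)        # replace each run of digits

print(found_anywhere.group())  # 66
print(at_start)                # None
print(all_numbers)             # ['66', '67']
print(masked)                  # Order # shipped; order # pending
```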

Advantages / Applications

  • Efficient text searching and validation: Regex quickly checks whether input follows a required format, such as dates, postal codes, usernames, or passwords.

  • Data extraction and cleanup: It is very useful for pulling out structured information from large text, logs, HTML-like content, or documents.

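  • Wide use across tools and languages: Regular expressions are supported in many programming languages, text editors, databases, and command-line tools. The core syntax carries over between them, although details such as supported shorthand classes and flags differ from one regex flavor to another.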

Applications include:

  • Form validation on websites
  • Searching and replacing text in editors
  • Parsing log files
  • Extracting phone numbers, emails, URLs, and IDs
  • Splitting text into parts
  • Filtering records in databases
  • Tokenizing text in basic language-processing tasks

Examples of common use:

  • Validate email input
  • Check password strength rules
  • Find all words beginning with a capital letter
  • Remove extra spaces from text
  • Extract dates from a document
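
Several of these common uses can be sketched in a few lines of Python. The sample strings are invented, and the password rule (at least 8 characters, one uppercase letter, one digit) is a hypothetical policy used only for illustration.

```python
import re

# Find all words beginning with a capital letter.
capitalized = re.findall(r"\b[A-Z][a-z]*\b", "Alice met Bob in Paris last May")

# Remove extra spaces: collapse each run of whitespace to one space.
cleaned = re.sub(r"\s+", " ", "  too    many   spaces ").strip()

# Extract dates written as dd/mm/yyyy.
dates = re.findall(r"\d{2}/\d{2}/\d{4}", "Paid 01/02/2023, due 15/03/2023")

# Check a password against the hypothetical strength rule described above.
def is_strong(password: str) -> bool:
    return (
        len(password) >= 8
        and re.search(r"[A-Z]", password) is not None
        and re.search(r"\d", password) is not None
    )

print(capitalized)             # ['Alice', 'Bob', 'Paris', 'May']
print(cleaned)                 # too many spaces
print(dates)                   # ['01/02/2023', '15/03/2023']
print(is_strong("Secret123"))  # True
print(is_strong("secret"))     # False
```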

Summary

  • A regular expression is a pattern, written in a compact pattern language, that is used to match and process text.
  • It uses symbols, classes, quantifiers, anchors, and groups to describe string patterns.
  • It is widely used for searching, validation, extraction, and text manipulation.
  • Important terms to remember: regex, metacharacter, character class, quantifier, anchor, group, match, and pattern.