Regular expressions: a précis

Regular expressions are patterns that describe text. They are particularly useful for describing parts of a file system, where you are looking for, or "matching", directories and files of a particular form. This helps in intrusion detection systems (e.g. you can describe types of files and directories that should never change, and types where certain changes can be ignored) and in systems like SELinux. This is my quick reference/reminder on some of the regular expression syntax often used in these filesystem contexts.

Simple text

Simple (literal) text such as "home" is itself a regular expression. This example would match /home/domenico, and it would also match /var/www/homepages. It matches any text that contains the sequence of letters "home".

There are lots of special characters used in regular expressions, such as [ ] ^ $ . | * + ? ( ) { } and the backslash itself. To use any of these as literal text there must be a way of indicating that they do not take on their special meaning. This is done by escaping them ("escape from their special meaning") with a preceding backslash \. For example, the expression "home\+" will match /var/www/home+pages.
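As a reminder of how literal text and escaping behave in practice, here is a minimal sketch using Python's re module (my choice for illustration; other tools behave similarly but not identically). The path /etc/passwd is an extra sample I added as a non-matching case.

```python
import re

# Sample paths; /etc/passwd is included only as a non-matching case.
paths = [
    "/home/domenico",
    "/var/www/homepages",
    "/var/www/home+pages",
    "/etc/passwd",
]

# A literal pattern is found anywhere inside the text.
literal = re.compile("home")
literal_hits = [p for p in paths if literal.search(p)]

# Escaping "+" makes it a literal plus sign instead of a repetition operator.
escaped = re.compile(r"home\+")
escaped_hits = [p for p in paths if escaped.search(p)]

# re.escape() adds the backslashes for you when the text comes from elsewhere.
auto_escaped = re.escape("home+")

print(literal_hits)
print(escaped_hits)
print(auto_escaped)
```

Note that re.search looks for the pattern anywhere in the text, which is the behaviour described above; some tools instead require the pattern to match the whole path.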

Character sets

You can specify a set of characters, any one of which will match, using the square brackets [ ]. For example, the expression "dhcp6[rs]" will match both /usr/sbin/dhcp6r and /usr/sbin/dhcp6s. Note it will not match /usr/sbin/dhcp6, as exactly one character from the set must be present at that position.

Some implementations predefine common character sets. For example, \d is short for [0123456789], i.e. it matches a single digit, and \s matches a single white space character.

Also, the dot . matches any single character (except line breaks). For example, "dhcp.[rs]" matches /usr/sbin/dhcp6r, /usr/sbin/dhcp6s and /usr/sbin/dhcp4s.
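The difference between a set, the dot and a predefined set such as \d is easiest to see side by side. This is a small sketch in Python's re module, which is one of the implementations that supports \d; the helper name matching and the path /usr/sbin/dhcpxs are made up for the illustration.

```python
import re

# /usr/sbin/dhcpxs is a hypothetical path added to separate "." from "\d".
paths = [
    "/usr/sbin/dhcp6r",
    "/usr/sbin/dhcp6s",
    "/usr/sbin/dhcp6",
    "/usr/sbin/dhcp4s",
    "/usr/sbin/dhcpxs",
]


def matching(pattern, candidates):
    """Return the candidates in which the pattern is found anywhere."""
    return [c for c in candidates if re.search(pattern, c)]


# One character from the set must follow "dhcp6".
set_hits = matching(r"dhcp6[rs]", paths)

# Any single character may sit between "dhcp" and the set.
dot_hits = matching(r"dhcp.[rs]", paths)

# Only a digit may sit between "dhcp" and the set.
digit_hits = matching(r"dhcp\d[rs]", paths)

print(set_hits)
print(dot_hits)
print(digit_hits)
```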


Anchors

Anchors match positions in the text, not characters themselves. The hat ^ matches the start of a line and the dollar $ the end of a line (in many implementations an option switches between matching at each line and matching only at the start and end of the whole text, which may contain multiple lines). For example, "\.odt$" will match all text/lines that end with ".odt". In essence the hat and dollar anchor the matching text to a certain position.

In some systems anchor shorthands also exist, such as \b, which matches at a word boundary: a position between a word character (typically a letter, digit or underscore) and a non-word character, or the start or end of the text.
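Anchors are easy to get subtly wrong, so here is a short sketch of ^, $ and \b in Python's re module. The file names and the listing text are invented for the example, and re.MULTILINE is Python's name for the option that makes ^ and $ work per line; other tools use different switches.

```python
import re

# Hypothetical file names for illustration.
names = ["report.odt", "report.odt.bak", "notes_odt", "/home/domenico/cv.odt"]

# "$" pins the match to the end, and "\." is a literal dot.
ends_with_odt = [n for n in names if re.search(r"\.odt$", n)]

# "^" pins the match to the start.
absolute = [n for n in names if re.search(r"^/", n)]

# By default Python anchors to the whole text; re.MULTILINE anchors per line.
listing = "a.odt\nb.txt\nc.odt"
whole_text = re.findall(r"^.+\.odt$", listing)
per_line = re.findall(r"^.+\.odt$", listing, flags=re.MULTILINE)

# "\b" is a boundary between word and non-word characters, so "/" and "-"
# count as boundaries just as white space does.
whole_words = re.findall(r"\bdhcp\b", "dhcp dhcpd xdhcp")
in_path = re.search(r"\bdhcp\b", "/usr/sbin/dhcp-helper") is not None

print(ends_with_odt, absolute)
print(whole_text, per_line)
print(whole_words, in_path)
```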


Repetition

The question mark ? marks the preceding token as optional. For example, "dhcp6?[rs]" will match /usr/sbin/dhcp6r and /usr/sbin/dhcp6s and /usr/sbin/dhcps (i.e. the character "6" is optional).

The asterisk * matches the preceding token 0 or more times. For example, "dhcp.*[rs]$" will match any text with "dhcp" followed by 0 or more characters and ending in either "r" or "s". So /usr/sbin/dhcp6r, /usr/sbin/dhcp6s and /usr/sbin/dhcps all match. Note that the preceding token in this example is not a literal character but the special character . which matches any character.

The plus sign + is similar to the asterisk except that it matches the preceding token 1 or more times. For example, "dhcp.+[rs]$" would match /usr/sbin/dhcp6r and /usr/sbin/dhcp6s but not /usr/sbin/dhcps, because at least one character must sit between "dhcp" and the final "r" or "s".
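The three repetition operators above can be compared on the same set of paths. This is a minimal sketch in Python's re module; the dictionary keys are just labels I picked, and /usr/sbin/dhcp is an extra sample that none of the patterns should match.

```python
import re

# /usr/sbin/dhcp is added as a case that no pattern below matches.
paths = [
    "/usr/sbin/dhcp6r",
    "/usr/sbin/dhcp6s",
    "/usr/sbin/dhcps",
    "/usr/sbin/dhcp",
]

patterns = {
    "optional": r"dhcp6?[rs]",       # "6" may be present or absent
    "zero_or_more": r"dhcp.*[rs]$",  # any run of characters, possibly empty
    "one_or_more": r"dhcp.+[rs]$",   # at least one character before the end
}

# Map each label to the list of paths its pattern matches.
results = {
    label: [p for p in paths if re.search(pattern, p)]
    for label, pattern in patterns.items()
}

for label, hits in results.items():
    print(label, hits)
```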

Using the curly brackets { } it is possible to specify an exact number of repetitions. For example, "(/.+){3}" would match any directory structure containing at least three levels ("at least" because the expression is not anchored, so deeper paths also contain three repetitions). So this would match /usr/sbin/dhcp6r but not /usr/sbin.
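To see why an exact repetition count still behaves like "at least" unless the expression is anchored, here is a small sketch in Python's re module. The digit strings and the path /usr are samples invented for the illustration.

```python
import re

# Three repetitions of "/" followed by one or more characters.
three_levels = re.compile(r"(/.+){3}")
paths = ["/usr/sbin/dhcp6r", "/usr/sbin", "/usr"]
deep = [p for p in paths if three_levels.search(p)]

# Unanchored, {3} only requires three digits somewhere in the text.
samples = ["12", "123", "1234"]
loose = [s for s in samples if re.search(r"\d{3}", s)]

# Anchored at both ends, {3} means exactly three digits and nothing else.
strict = [s for s in samples if re.search(r"^\d{3}$", s)]

print(deep)
print(loose)
print(strict)
```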

Note that the unix/linux grep command has "basic" and "extended" regular expression syntax. The asterisk works in both, but ? + { } (and likewise ( ) and |) are treated as special characters without a preceding backslash only in the extended format, which is enabled with the "-E" option and is the type summarised here.


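Grouping

Round brackets ( ) are used to group tokens together so that an operator applies to the whole group. For example, in "/usr/sbin(/.*)?" the question mark makes the whole group "/.*" optional. Where the expression has to match the entire path (as it does in some tools, or when anchored with ^ and $), this matches the path /usr/sbin and any files, directories and sub-directories that lie under it. Where it may match anywhere in the text, it matches any text containing /usr/sbin.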
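Whether the optional group limits the match to /usr/sbin and its contents depends on whether the whole path has to match. This sketch uses Python's re module, where fullmatch requires the entire text to match and search does not; /usr/sbin2 and the other sample paths are made up to show the difference.

```python
import re

under_sbin = re.compile(r"/usr/sbin(/.*)?")

# Hypothetical paths; /usr/sbin2 is the interesting edge case.
paths = [
    "/usr/sbin",
    "/usr/sbin/dhcpd",
    "/usr/sbin/dhcp/conf.d",
    "/usr/sbin2",
    "/usr/bin",
]

# The whole path must match: /usr/sbin itself or anything beneath it.
whole_path = [p for p in paths if under_sbin.fullmatch(p)]

# A match anywhere is enough: /usr/sbin2 slips through because the
# optional group is allowed to match nothing.
anywhere = [p for p in paths if under_sbin.search(p)]

print(whole_path)
print(anywhere)
```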

This or that

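The vertical line | acts as a logical or: the text may match the expression on either side of it. For example, "/usr/sbin/(dhcpd|named)" will match /usr/sbin/dhcpd and /usr/sbin/named. The round brackets limit the choice to the two names; without them the | would split the whole expression in two.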
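The effect of the round brackets around an alternation is worth checking once by hand. Here is a minimal sketch in Python's re module; /usr/sbin/sshd and /etc/named.conf are sample paths I added to show a non-match and the ungrouped pitfall.

```python
import re

# Grouped: "/usr/sbin/" followed by either name.
daemons = re.compile(r"/usr/sbin/(dhcpd|named)")

# Ungrouped: either "/usr/sbin/dhcpd" or just "named" anywhere in the text.
ungrouped = re.compile(r"/usr/sbin/dhcpd|named")

paths = ["/usr/sbin/dhcpd", "/usr/sbin/named", "/usr/sbin/sshd"]
hits = [p for p in paths if daemons.search(p)]

# The ungrouped form also accepts a path outside /usr/sbin.
grouped_elsewhere = daemons.search("/etc/named.conf") is not None
ungrouped_elsewhere = ungrouped.search("/etc/named.conf") is not None

print(hits)
print(grouped_elsewhere, ungrouped_elsewhere)
```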