regcmp() — Compile regular expression

Standards

Standards / Extensions    C or C++    Dependencies
XPG4.2                    C only

Format

#define _XOPEN_SOURCE_EXTENDED  1
#include <libgen.h>

char *regcmp(const char *pattern[,...], (char *)0);

char *regex(const char *cmppat, const char *subject[,subexp,...]);

extern char *__loc1;

General description

Restriction: This function is not supported in AMODE 64.

The regcmp() function concatenates regular expression (RE) patterns specified by a list of one or more pattern arguments. The end of this list must be delimited by a NULL pointer. The regcmp() function then converts the concatenated RE pattern into an internal form suitable for use by the pattern matching regex() function. If conversion is successful, regcmp() returns a pointer to the converted pattern. Otherwise, it returns a NULL pointer. The regcmp() function uses malloc() to obtain storage for the converted pattern. The application is responsible for calling free() on the returned pointer when the converted pattern is no longer needed.

The regex() function executes a converted pattern cmppat against a subject string. If cmppat matches all or part of the subject string, the regex() function returns a pointer to the next unmatched character in the subject string and sets the external variable __loc1 to point to the first matched character in the subject string. If no match is found between cmppat and the subject string, the regex() function returns a NULL pointer.

The regcmp() and regex() functions are supported in any locale. However, results are unpredictable if they are not run in the same locale.
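
The compile, match, and free sequence described above can be sketched as follows. This is a minimal illustration, not a definitive implementation: find_span() and the HAVE_REGCMP macro are hypothetical names, and the guard assumes that regcmp() and regex() are available only in a z/OS build (detected through __MVS__) that is not AMODE 64 (detected through _LP64). On other platforms the helper simply reports that the functions are unavailable.

```c
#define _XOPEN_SOURCE_EXTENDED 1
#include <stddef.h>
#include <stdlib.h>

/* Assumption: regcmp()/regex() exist only on z/OS outside AMODE 64. */
#if defined(__MVS__) && !defined(_LP64)
#include <libgen.h>
#define HAVE_REGCMP 1
#endif

/*
 * find_span (hypothetical helper): concatenate the pattern pieces p1 and p2,
 * compile them with regcmp(), and search subject with regex().
 * On a match, *start and *end receive the offsets of the first matched
 * character and of the next unmatched character.
 * Returns 1 on a match, 0 on no match, -1 if the pattern could not be
 * compiled or regcmp() is not available in this build.
 */
int find_span(const char *p1, const char *p2, const char *subject,
              size_t *start, size_t *end)
{
#ifdef HAVE_REGCMP
    char *compiled;
    char *next;

    /* The argument list must end with a null pointer. */
    compiled = regcmp(p1, p2, (char *)0);
    if (compiled == NULL)
        return -1;

    next = regex(compiled, subject);
    if (next != NULL) {
        *start = (size_t)(__loc1 - subject);  /* first matched character */
        *end = (size_t)(next - subject);      /* next unmatched character */
    }

    free(compiled);  /* storage was obtained by regcmp() through malloc() */
    return next != NULL;
#else
    (void)p1; (void)p2; (void)subject; (void)start; (void)end;
    return -1;
#endif
}
```

Where the guard is active, a call such as find_span("[0-9]", "+", "abc123def", &s, &e) compiles the concatenated pattern [0-9]+. Under the rules described in this topic, it should report s as 3 and e as 6.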

Following are valid RE symbols and their meaning to the regcmp() and regex() functions:
Expression
Meaning
NUL
Terminate RE pattern and text string
c
Any non-special character, c, is a one-character RE which matches itself.
\s
A backslash (\) followed by a special character, s, is a one-character RE which matches the special character itself.
The following characters are special:
  • period, ., asterisk, *, plus, +, dollar, $, left square bracket, [, left brace, {, right brace, }, left parenthesis, (, right parenthesis, ), and backslash, \, are always special except when they appear within square brackets ([]).
  • caret (^) is special at the beginning of an entire RE (which is another name for a pattern).
Note: A non-special character preceded by \ is a one-character RE which matches the non-special character.
yz
Concatenation of REs y and z matches concatenation of strings matched by y and z.
.
The period (.) special character RE matches any single character except the <newline> character.
^
The caret (^) at the beginning of an entire RE is an RE which matches the beginning of a string. Thus, it anchors or limits matches by the entire RE to the beginning of strings.
$
The dollar ($) at the end of an entire RE is an RE which matches only the end of a string (delimited by the <NUL> character). Thus, it anchors or limits matches by the entire RE to the end of strings.
Note: \n (the C language designation for a <newline> character) must be used in an entire RE to match any embedded or trailing <newline> character in a text string.
(...)
Parentheses are used to delimit a sub-expression which matches whatever the REs comprising the sub-expression would have matched without the delimiting parentheses.
(...)$n
$n, where n is a digit between 0 and 9, inclusive, may be used to tag a sub-expression. The tag tells the regex() function to return the substring matched by the sub-expression at the address specified by the (n+1)th argument after subject.
*
A one-character RE or sub-expression followed by an asterisk (*) is an RE that matches zero or more occurrences of the one-character RE or sub-expression. If there is any choice, the longest leftmost string that permits a match is chosen.
+
A one-character RE or sub-expression followed by a plus (+) is an RE that matches one or more occurrences of the one-character RE or sub-expression. Whenever a choice exists, the RE matches as many occurrences as possible.
{m,n}
A one-character RE or sub-expression followed by integer values, m and n, enclosed in braces is an RE which matches repeated occurrences of whatever the preceding one-character RE or sub-expression matched. The value of m, which must be in the range 0 to 255, inclusive, is the minimum number of occurrences required for a match. The value of n, which, if specified, must also be in the range 0 to 255, inclusive, is the maximum. The value of n, if specified, must be greater than or equal to the value of m. The following brace expressions are valid:
{m}
Matches exactly m occurrences of the preceding one-character RE or sub-expression.
{m,}
Matches m or more occurrences of the preceding one-character RE or sub-expression. There is no limit on the number of occurrences which will be matched. The plus (+) and asterisk (*) operations are equivalent to {1,} and {0,}, respectively.
{m,n}
Matches between m and n occurrences, inclusive.

Whenever a choice exists, the RE matches as many occurrences as possible.

[...]
A non-empty list of characters enclosed by square brackets is a one-character RE that matches any one character in the list.
[^...]
A non-empty list of characters preceded by a caret (^) enclosed by square brackets is a one-character RE that matches any character except <newline> and the characters in the list. The ^ has special meaning only if it is the first character after the left bracket ([).
[c1-c2]
The hyphen (-) between two characters c1 and c2 within square brackets designates the list of characters whose collating values fall between the collating values of c1 and c2 in the current locale. The collating value of c2 must be greater than or equal to that of c1. Also, c2 may not be used as the ending point of one range and the starting point of another range. In other words, c1-c2-c3 is invalid.

The - loses special meaning if it occurs first or last in the bracket expression or if it is used for c1 or c2.

The right bracket, ], does not terminate a bracket expression when it is the first character within it (after an initial ^, if any). For example, the expression []0-9] matches a right bracket or a digit in the range 0-9, inclusive.
Notes:
  1. Multiple duplication symbols applied to the same RE will be interpreted in the following order of precedence:
    1. *
    2. +
    3. {}
  2. RE Order of precedence is as follows, from high to low:
    1. escaped character \character
    2. bracket expression [...]
    3. sub-expression (...)
    4. duplication * + {}
    5. concatenation yz
    6. anchors ^ $
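
Several of the symbols above (bracket ranges, a brace expression, a sub-expression, and a $0 tag) can be combined in one pattern, as in the following sketch. extract_identifier() and HAVE_REGCMP are hypothetical names. As an assumption, the guard treats regcmp() and regex() as available only in a z/OS build that is not AMODE 64. The sketch also assumes that regex() stores the tagged substring as a null-terminated string.

```c
#define _XOPEN_SOURCE_EXTENDED 1
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Assumption: regcmp()/regex() exist only on z/OS outside AMODE 64. */
#if defined(__MVS__) && !defined(_LP64)
#include <libgen.h>
#define HAVE_REGCMP 1
#endif

/*
 * extract_identifier (hypothetical helper): find the first letter followed by
 * up to seven letters or digits in subject and copy that substring to out.
 * The sub-expression is tagged with $0, so regex() stores the matched text at
 * the first argument after subject. out must hold at least 9 bytes
 * (8 characters plus the terminating null character).
 * Returns 1 on a match, 0 on no match, -1 if the pattern could not be
 * compiled or regcmp() is not available in this build.
 */
int extract_identifier(const char *subject, char out[9])
{
#ifdef HAVE_REGCMP
    char *compiled;
    char *next;

    out[0] = '\0';
    compiled = regcmp("([A-Za-z][A-Za-z0-9]{0,7})$0", (char *)0);
    if (compiled == NULL)
        return -1;

    next = regex(compiled, subject, out);  /* out receives the $0 substring */
    free(compiled);
    return next != NULL;
#else
    (void)subject;
    out[0] = '\0';
    return -1;
#endif
}
```

Where the guard is active, calling extract_identifier("012Testing345", buf) should leave "Testing3" in buf: the leading digits cannot start a match, and {0,7} takes as many of the following characters as it is allowed.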
Note:

The regcmp() and regex() functions are provided for historical reasons. These functions were part of the Legacy Feature in Single UNIX Specification, Version 2. They have been withdrawn and are not supported as part of Single UNIX Specification, Version 3. New applications should use the newer functions fnmatch(), glob(), regcomp() and regexec(), which provide full internationalized regular expression functionality compatible with IEEE Std 1003.1-2001.

If it is necessary to continue using these functions in an application written for Single UNIX Specification, Version 3, define the feature test macro _UNIX03_WITHDRAWN before including any standard system headers. The macro exposes all interfaces and symbols removed in Single UNIX Specification, Version 3.
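
For new applications, the recommended regcomp() and regexec() functions can reproduce the regex() result. The following is a minimal sketch, assuming POSIX extended RE syntax is acceptable for the pattern; posix_search() is a hypothetical helper name. The POSIX functions have no $n tag syntax. Sub-expression offsets are reported through the regmatch_t array instead.

```c
#include <regex.h>
#include <stddef.h>

/*
 * posix_search (hypothetical helper): search subject for the first match of
 * the extended RE pattern. On a match, store the offset of the first matched
 * character in *start (the role played by __loc1 for regex()) and return a
 * pointer to the next unmatched character. Return NULL if there is no match
 * or the pattern cannot be compiled.
 */
const char *posix_search(const char *pattern, const char *subject,
                         size_t *start)
{
    regex_t re;
    regmatch_t m[1];
    const char *next = NULL;

    if (regcomp(&re, pattern, REG_EXTENDED) != 0)
        return NULL;

    if (regexec(&re, subject, 1, m, 0) == 0) {
        *start = (size_t)m[0].rm_so;      /* first matched character */
        next = subject + m[0].rm_eo;      /* next unmatched character */
    }

    regfree(&re);  /* release storage held by the compiled pattern */
    return next;
}
```

Unlike regcmp(), regcomp() fills in a caller-supplied regex_t, so the compiled pattern is released with regfree() rather than free().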

Returned value

If the pattern formed by concatenating the list of pattern arguments is successfully converted, regcmp() returns a pointer to the converted pattern. Otherwise, it returns a NULL pointer. If regcmp() is unable to allocate storage for the converted pattern, it sets errno to ENOMEM.

If regex() successfully matches the converted pattern cmppat to all or part of the subject string, it returns a pointer to the next unmatched character in subject. Otherwise, it returns a NULL pointer.

Related information