compile() — Compile regular expression

Standards

Standards / Extensions	C or C++	Dependencies
XPG4 XPG4.2	both

Format

#define INIT declarations
#define GETC() getc_code
#define PEEK() peek_code
#define UNGETC() ungetc_code
#define RETURN(ptr) return_code
#define ERROR(val) error_code

#define _XOPEN_SOURCE
#include <regexp.h>

char *compile(char *instring, char *expbuf, const char *endbuf, int eof);

General description

Restriction: This function is not supported in AMODE 64.

The compile() function takes as input a simple regular expression and produces a compiled expression that can be used with the step() and advance() functions.

The first parameter instring is never used explicitly by compile(). It is a pointer to a character string defining a source regular expression. It is useful for programs that pass down different pointers to input characters. Programs which invoke functions to input characters or have characters in an external array can pass down (char *)0 for this parameter.

expbuf is a pointer to the place where the compiled regular expression will be placed.

endbuf points to one more than the highest address where the compiled regular expression may be placed. If the compiled expression cannot fit in (endbuf-expbuf) bytes, a call to ERROR(50) is made. (See "Returned Value" below.)

eof is the character which marks the end of the regular expression.

The z/OS® UNIX services implementation of the compile() function does not accept internationalized simple expressions as input. Internationalized simple expressions (for example, [[=c=]] (an equivalence class)) may yield unpredictable results.

Programs must have the following five macros declared before the #include <regexp.h> statement. The macros GETC(), PEEKC() and UNGETC() operate on the regular expression given as input to compile().

GETC(): This macro returns the value of the next character (byte) in the regular expression pattern. Successive calls to GETC() should return successive characters of the regular expression.
PEEK(): This macro returns the next character (byte) in the regular expression pattern. Immediate successive calls to PEEK() should return the same byte, which should also be the next character returned by GETC().
UNGETC(c): This macro causes the argument c to be returned by the next call to GETC(). No more than one character of pushback is ever needed, and this character is guaranteed to be the last character read by GETC(). The value of the macro UNGETC() is always ignored.
RETURN(ptr): This macro is used on normal exit of the compile() function. The value of the argument ptr is a pointer to the character after the last character of the compiled regular expression.
ERROR(val): This macro is the abnormal return from compile(). The argument val is an error number. (See "Returned Value" below for meanings.) This call should never return.

Notes:

z/OS UNIX services do not provide any default macros if the above user macros are not provided.
Each program that includes <regexp.h> must have a #define statement for INIT. It is used for dependent declarations and initializations. For example, it can be used to set a variable to point to the beginning of the regular expression so that this variable can be used in the declarations for GETC(), PEEK(), and UNGETC().
The external variables cirf, sed, and nbra are reserved.
The application must provide the proper serialization for the compile(), step(), and advance() functions if they are run under a multithreaded environment.
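
As a rough illustration of the macro contract described above, the following sketch defines INIT and the five macros over an in-memory pattern. Because <regexp.h> exists only in legacy XPG4 environments, the sketch does not include it: toy_compile() and last_error are hypothetical stand-ins, and toy_compile() merely copies pattern bytes instead of compiling them. Both PEEKC() and PEEK() are defined because the text above uses both spellings. In a real program these definitions would precede #include <regexp.h>, and the header's own compile() would consume them.

```c
#include <assert.h>
#include <stddef.h>

static int last_error = 0;   /* hypothetical: records the ERROR() code */

/* Application-side definitions, as they would appear before the include. */
#define INIT        const char *sp = instring;  /* cursor shared by macros */
#define GETC()      (*sp++)                     /* read and advance */
#define PEEKC()     (*sp)                       /* look without advancing */
#define PEEK()      PEEKC()                     /* alternate spelling */
#define UNGETC(c)   (--sp)                      /* push back the last byte */
#define RETURN(ptr) return (ptr)
#define ERROR(val)  do { last_error = (val); return NULL; } while (0)

/* Hypothetical stand-in for the header-embedded compile(). It copies the
 * pattern up to the eof delimiter, treating a backslash followed by the
 * delimiter as an escaped delimiter, and uses the macros the way the
 * description above says compile() does. */
static char *toy_compile(char *instring, char *expbuf,
                         const char *endbuf, int eof)
{
    INIT
    char *ep = expbuf;
    int c;

    assert(instring != NULL && expbuf != NULL && endbuf != NULL);
    while ((c = GETC()) != eof) {
        if (c == '\0') {            /* pattern ended without the delimiter */
            UNGETC(c);
            ERROR(36);
        }
        if (c == '\\' && PEEKC() == eof)
            c = GETC();             /* keep the escaped delimiter */
        if (ep >= endbuf)           /* result does not fit in expbuf */
            ERROR(50);
        *ep++ = (char)c;
    }
    RETURN(ep);                     /* one past the last byte written */
}
```

Because ERROR() expands to a return statement inside the function body, it never falls through to the code that follows it, which is one simple way to satisfy the rule that the call should never return.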

Simple regular expressions

A Simple Regular Expression (SRE) specifies a set of character strings. The simplest form of regular expression is a string of characters with no special meaning. A small set of special characters, known as metacharacters, do have special meaning when encountered in patterns.

Expression

Meaning

c

The character c where c is not a special character.

\c

The character c where c is any special character. For example, a\.e matches only the literal string a.e.

^

The caret symbol matches the beginning of the string being compared.

$

The dollar symbol matches the end of the string.

.

The period symbol matches any one character.

[string]

A string within square brackets specifies any of the characters in string. Thus, [abc], if compared to other strings, would match any which contained a, b, or c.

The ] (right bracket) can be used alone within a pair of brackets, but only if it immediately follows either the opening left bracket or [^.

Ranges may be specified as c-c. The hyphen symbol, within square brackets, means "through". It fills in the intervening characters according to the collating sequence. For example, [a-z] is equivalent to [abc...xyz]. If the end character in the range is lower in the collating sequence than the start character, then only the range start and range end characters are accepted in the search pattern. For example, [9-1] is equivalent to [91]. Note that ranges in Simple Regular Expressions are only valid if the LC_COLLATE category is set to the C locale.

The - (hyphen) can be used by itself, but only if it is the first or last character in the expression. For example, the expression []a-f] matches either the ] or one of the characters a through f.

[^string]

The caret symbol, when it is the first character inside square brackets, negates the set of characters within the square brackets. Thus, [^abc] matches any single character other than a, b, or c.

Note: Characters ., *, [, and \ (period, asterisk, left square bracket, and backslash, respectively) have special meaning, except when they appear within square brackets ([]), or are preceded by \.
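
The bracket rules above can be tried out on a system that lacks <regexp.h> with a minimal sketch built on POSIX regcomp() and regexec(). These are swapped in here as a stand-in, not as the compile() interface, and they are assumed to treat these bracket expressions the same way in the C locale. bre_search() is a hypothetical helper name.

```c
#include <assert.h>
#include <regex.h>
#include <stddef.h>

/* Returns 1 if the basic regular expression `pattern` matches somewhere in
 * `text`, 0 if it does not, and -1 if the pattern fails to compile. */
static int bre_search(const char *pattern, const char *text)
{
    regex_t re;
    int rc;

    assert(pattern != NULL && text != NULL);
    /* No REG_EXTENDED flag, so the pattern is read as a basic RE. */
    if (regcomp(&re, pattern, REG_NOSUB) != 0)
        return -1;
    rc = regexec(&re, text, 0, NULL, 0);
    regfree(&re);
    return rc == 0;
}
```

With this helper, bre_search("[]a-f]", "]") and bre_search("[]a-f]", "xcx") both report a match, while bre_search("[^abc]", "abc") does not, because every character of the text is in the negated set.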

*

The asterisk symbol indicates zero or more occurrences of the preceding character. For example, (a*e) will match any of the following: e, ae, aae, aaae, .... The longest leftmost match is chosen.

rx

The occurrence of regular expression r followed by the occurrence of regular expression x.

\{m\} \{m,\} \{m,u\}

Integer values enclosed in \{\} indicate the number of times to apply the preceding regular expression. m is the minimum number and u is the maximum number. u must be less than 256. If you specify only m, it indicates the exact number of times to apply the regular expression.

\{m,\} is equivalent to \{m,255\}. They both match m or more occurrences of the expression. The * (asterisk) operation is equivalent to \{0,\}.

The maximum number of occurrences is matched.

\(r\)

The regular expression r. The \( and \) sequences are ignored.

\n

When \n (where 1 <= n <= 9) appears in a concatenated regular expression, it stands for the regular expression x, where x is the nth regular expression enclosed in \( and \) sequences that appeared earlier in the concatenated regular expression. For example, in the pattern \(c\)onc\(ate\)n\2, the \2 is equivalent to ate, giving concatenate.
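
The back-reference example can be checked with a short sketch that extracts what a given \( \) group captured. POSIX regcomp() and regexec() are used as a stand-in for compile() and step(), on the assumption that they handle \( \) and \n in the same way. bre_capture() is a hypothetical helper name.

```c
#include <assert.h>
#include <regex.h>
#include <string.h>

/* Copies the text captured by the nth \( \) group (1 <= n <= 9) into out.
 * Returns 0 on success. Returns -1 if the pattern fails to compile, the
 * text does not match, the group took no part in the match, or out is too
 * small to hold the capture plus its terminator. */
static int bre_capture(const char *pattern, const char *text,
                       int n, char *out, size_t outsz)
{
    regex_t re;
    regmatch_t m[10];       /* m[0] is the whole match, m[1..9] the groups */
    int rc = -1;

    assert(pattern != NULL && text != NULL);
    assert(n >= 1 && n <= 9 && out != NULL && outsz > 0);
    if (regcomp(&re, pattern, 0) != 0)      /* flags 0: basic RE */
        return -1;
    if (regexec(&re, text, 10, m, 0) == 0 && m[n].rm_so != -1) {
        size_t len = (size_t)(m[n].rm_eo - m[n].rm_so);
        if (len < outsz) {
            memcpy(out, text + m[n].rm_so, len);
            out[len] = '\0';
            rc = 0;
        }
    }
    regfree(&re);
    return rc;
}
```

Applied to the text concatenate with the pattern \(c\)onc\(ate\)n\2, group 2 yields ate. The same pattern rejects concatenXYZ, because the text after n does not repeat the second group.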

The character ^ at the beginning of an expression permits a successful match only immediately after a newline or at the beginning of each string to which a match is applied. The character $ at the end of an expression requires a trailing newline.

Notes:

The compile() function is physically embedded in the regexp.h header. Like other C headers, this header is protected against multiple inclusion.
The compile(), step(), and advance() functions are provided for historical reasons. These functions were part of the Legacy Feature in Single UNIX Specification, Version 2. They have been withdrawn and are not supported as part of Single UNIX Specification, Version 3. New applications should use the newer functions fnmatch(), glob(), regcomp() and regexec(), which provide full internationalized regular expression functionality compatible with IEEE Std 1003.1-2001.
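
For new code, the regcomp() and regexec() route recommended above might look like the following sketch. posix_search() is a hypothetical helper name, and the choice of REG_NOSUB with basic-RE syntax is an assumption made for the example. regerror() takes over the reporting role that the user-supplied ERROR(val) macro plays for compile().

```c
#include <assert.h>
#include <regex.h>
#include <stddef.h>

/* Hypothetical helper: returns 1 on a match, 0 on no match, and -1 when
 * the pattern is rejected. On -1, errbuf holds the regerror() message. */
static int posix_search(const char *pattern, const char *text,
                        char *errbuf, size_t errsz)
{
    regex_t re;
    int rc;

    assert(pattern != NULL && text != NULL);
    assert(errbuf != NULL && errsz > 0);
    errbuf[0] = '\0';
    rc = regcomp(&re, pattern, REG_NOSUB);  /* no REG_EXTENDED: basic RE */
    if (rc != 0) {
        regerror(rc, &re, errbuf, errsz);   /* human-readable reason */
        return -1;
    }
    rc = regexec(&re, text, 0, NULL, 0);
    regfree(&re);
    return rc == 0;
}
```

Unlike compile(), this approach needs no macros and no caller-sized expbuf, since regcomp() allocates the compiled form itself and regfree() releases it.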

Returned value

If successful, compile() exits using the user-provided macro RETURN(ptr). The value of the argument ptr is a pointer to the character after the last character of the compiled regular expression.

If unsuccessful, compile() exits using the user-provided macro ERROR(val). The argument val is an error number identifying the error. The following error numbers are defined:

Errcode: Description String
11: Range endpoint too large
16: Bad number
25: \digit out of range
36: Illegal or missing delimiter
41: No remembered search string
42: \( \) imbalance
43: Too many \(
44: More than two numbers given in \{ \}
45: } expected after \
46: First number exceeds second in \{ \}
49: [ ] imbalance
50: Regular expression overflow
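
An application that wants readable diagnostics could map the error numbers listed above to their description strings. The following is a minimal sketch: compile_error_text() and format_compile_error() are hypothetical helper names and are not part of <regexp.h>.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Returns the description string for an ERROR(val) code, or NULL for a
 * value that is not in the list above. */
static const char *compile_error_text(int val)
{
    switch (val) {
    case 11: return "Range endpoint too large";
    case 16: return "Bad number";
    case 25: return "\\digit out of range";
    case 36: return "Illegal or missing delimiter";
    case 41: return "No remembered search string";
    case 42: return "\\( \\) imbalance";
    case 43: return "Too many \\(";
    case 44: return "More than two numbers given in \\{ \\}";
    case 45: return "} expected after \\";
    case 46: return "First number exceeds second in \\{ \\}";
    case 49: return "[ ] imbalance";
    case 50: return "Regular expression overflow";
    default: return NULL;
    }
}

/* Writes "error <val>: <text>" into buf. Returns 0 on success, or -1 for
 * an unlisted code or a buffer that is too small for the message. */
static int format_compile_error(int val, char *buf, size_t n)
{
    const char *text = compile_error_text(val);
    int len;

    assert(buf != NULL && n > 0);
    if (text == NULL)
        return -1;
    len = snprintf(buf, n, "error %d: %s", val, text);
    return (len >= 0 && (size_t)len < n) ? 0 : -1;
}
```

An ERROR(val) definition could record val and return, leaving the caller to pass the recorded value to format_compile_error() when it reports the failure.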