Using regular expressions

//by Richard Russell, December 2006//

Regular expressions provide a means to specify a pattern of characters, or syntax rule, which a string (or part of a string) must match. Certain //metacharacters// have special significance; for example a dot (.) matches any single character, square brackets [] enclose a list of matching characters, a caret (^) at the start of such a list negates it, and so on. Here are some simple examples:


 * a..d||matches "abcd", "axyd", "a12d" etc.||
 * [abc]||matches "a", "b" or "c"||
 * [a-z]||matches any lowercase letter||
 * [^b]at||matches "cat", "fat", "hat" etc. but not "bat"||

For more information on the syntax of regular expressions see this [|Wikipedia article].

You can make use of regular expressions in your BBC BASIC program by means of the **gnu_regex** DLL which can be downloaded from [|here][1]. To start with you must load the DLL in the usual way:

code format="bb4w"
        SYS "LoadLibrary", "gnu_regex.dll" TO gnu_regex%
        IF gnu_regex% = 0 ERROR 100, "Cannot load gnu_regex.dll"
        SYS "GetProcAddress", gnu_regex%, "regcomp" TO regcomp%
        SYS "GetProcAddress", gnu_regex%, "regexec" TO regexec%
code

For this to work **gnu_regex.dll** needs to be in the current directory, the Windows directory (often C:\WINDOWS), the Windows system directory (often C:\WINDOWS\SYSTEM32) or one of the directories specified in the PATH environment variable. Alternatively you can copy the file to your BBC BASIC for Windows library folder and load it explicitly from there:

code format="bb4w"
        SYS "LoadLibrary", @lib$+"gnu_regex.dll" TO gnu_regex%
code

The code below illustrates a very simple example of setting up a pattern and inputting strings from the user which are tested against this pattern:

code format="bb4w"
        DIM buffer% 255
        pattern$ = "[abcxyz]"
        SYS regcomp%, buffer%, pattern$, 0 TO result%
        IF result% ERROR 101, "Failed to compile regular expression"
        REPEAT
          INPUT "Enter a string: " test$
          SYS regexec%, buffer%, test$, 0, 0, 0 TO result%
          IF result% PRINT "Not matched" ELSE PRINT "Matched"
        UNTIL FALSE
code

You should ensure that **buffer%** points to a memory buffer large enough to contain the //compiled// regular expression (although it's not clear how you are supposed to ascertain this!). As always, make sure you execute the **DIM** statement only once, or use **DIM LOCAL**, to avoid a memory leak and an eventual **No room** error.
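
To illustrate the **DIM LOCAL** advice, here is a minimal sketch of a helper function that allocates its buffer on the stack, so the memory is released when the function returns. The name **FNmatches** is hypothetical, and the sketch assumes **regcomp%** and **regexec%** have already been obtained as shown earlier:

code format="bb4w"
        REM Sketch only: FNmatches is a hypothetical helper, not part of the DLL.
        REM Returns TRUE if test$ matches pattern$, otherwise FALSE.
        DEF FNmatches(pattern$, test$)
        LOCAL buffer%, result%
        REM Stack allocation: freed automatically on exit from the function
        DIM buffer% LOCAL 255
        SYS regcomp%, buffer%, pattern$, 0 TO result%
        IF result% ERROR 101, "Failed to compile regular expression"
        SYS regexec%, buffer%, test$, 0, 0, 0 TO result%
        = (result% = 0)
code

It might be called as **IF FNmatches("[abcxyz]", test$) PRINT "Matched"**. Note that POSIX implementations normally provide a **regfree** routine to release any memory that **regcomp** allocates internally; whether this DLL exports it has not been confirmed here, so the sketch omits it and is best suited to occasional rather than repeated use.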

In this example the pattern matches the characters "a", "b", "c", "x", "y" or "z" anywhere in the string. The program as listed provides no information on //where// in the string the match occurred. You can discover that information by amending the program as follows:

code format="bb4w"
        DIM offsets{start%, finish%}
        REPEAT
          INPUT "Enter a string: " test$
          SYS regexec%, buffer%, test$, 1, offsets{}, 0 TO result%
          IF result% PRINT "Not matched" ELSE PRINT "Matched at ";offsets.start%
        UNTIL FALSE
code

Here **offsets.start%** is set to the offset from the beginning of the string of the first match.
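
The structure also has a **finish%** member, which suggests the matched text itself can be recovered. The sketch below assumes the DLL follows the POSIX convention, in which the second member holds the offset of the first character //after// the match; that is not stated above, so treat it as an assumption to verify:

code format="bb4w"
        REM Sketch: assumes regcomp has already succeeded as shown earlier, and
        REM that offsets.finish% is the offset just past the match (POSIX rm_eo).
        REM Skip the DIM if offsets{} has already been declared.
        DIM offsets{start%, finish%}
        INPUT "Enter a string: " test$
        SYS regexec%, buffer%, test$, 1, offsets{}, 0 TO result%
        IF result% THEN
          PRINT "Not matched"
        ELSE
          length% = offsets.finish% - offsets.start%
          REM MID$ counts from 1, whereas the offsets count from 0
          PRINT "Matched """; MID$(test$, offsets.start% + 1, length%); """"
        ENDIF
code

With the pattern "[abcxyz]" the match is a single character, so **length%** would be expected to be 1.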

You can specify that the matching is //case insensitive// by changing the final parameter of **regcomp** from 0 to 2 as follows:

code format="bb4w"
        _REG_ICASE = 2
        SYS regcomp%, buffer%, pattern$, _REG_ICASE TO result%
code

You can also specify the use of **extended regular expressions** by setting the final parameter to 1:

code format="bb4w"
        _REG_EXTENDED = 1
        SYS regcomp%, buffer%, pattern$, _REG_EXTENDED TO result%
code

In this mode additional //metacharacters// are recognised, for example the vertical bar (|) signifies alternatives:


 * abc|def||matches "abc" or "def"||
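
The two option values shown above are distinct bits, so under the usual POSIX convention they can be combined with **OR** to request a pattern that is both extended and case insensitive. That the DLL honours the combination is an assumption rather than something confirmed here; a minimal sketch:

code format="bb4w"
        _REG_EXTENDED = 1
        _REG_ICASE = 2
        pattern$ = "abc|def"
        REM Assumption: the flags combine bitwise, as in POSIX regcomp
        SYS regcomp%, buffer%, pattern$, _REG_EXTENDED OR _REG_ICASE TO result%
        IF result% ERROR 101, "Failed to compile regular expression"
code

If the assumption holds, strings such as "ABC" and "Def" would then be expected to match as well as "abc" and "def".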

[1] When last checked, the file **gnu_regex.exe** was corrupted (missing the last byte). To repair it you can use this simple BBC BASIC program:

code format="bb4w"
        F% = OPENUP("gnu_regex.exe")
        PTR#F% = EXT#F%
        BPUT #F%,0
        CLOSE #F%
code