Tokeniser

//by JGH, June 2006//

BBC BASIC programs are tokenised, that is, BASIC keywords are stored as one-byte values. This results in programs which execute faster and are more compact.

A tokenised line can easily be detokenised, or expanded, as there is a one-to-one mapping between token values and the expanded string. For example, code similar to the following would expand a tokenised line:
code format="bb4w"
quote%=FALSE
REPEAT
  IF (?addr%<128 AND ?addr%>31) OR quote% THEN VDU ?addr% ELSE PRINT token$(?addr%);
  IF ?addr%=34 quote%=NOT quote%
  addr%=addr%+1
UNTIL ?addr%=13
code

Tokenising, however, is more fiddly. Tokens can be abbreviated on entry and characters are only tokenised at certain parts of the line. For instance, in the following line:
code format="bb4w"
ON NOON GOTO 1,2
code

the first 'ON' is the token ON, but the second 'ON' is part of the variable 'NOON'. The second 'ON' must be left untokenised.
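
To illustrate why position matters, here is a minimal sketch of one simple rule that reproduces the 'ON NOON' behaviour: try to match a keyword only at the start of a word, and if none matches, copy the whole name untouched. It is not BASIC's own tokeniser. The function name FNminitok, the two-entry keyword table and the token values &E5 and &EE are assumptions made for the example, only upper-case names are handled, and line numbers are left as plain digits.
code format="bb4w"
REM Illustrative keyword table (hypothetical, two entries only)
DIM kw$(1),tok%(1)
kw$(0)="GOTO":tok%(0)=&E5
kw$(1)="ON":tok%(1)=&EE
T$=FNminitok("ON NOON GOTO 1,2")
FOR I%=1 TO LEN(T$)
  PRINT ;~ASC(MID$(T$,I%,1));" ";
NEXT
PRINT
END
:
DEF FNminitok(A$)
LOCAL I%,K%,O$,C$,hit%
I%=1
WHILE I%<=LEN(A$)
  C$=MID$(A$,I%,1)
  IF C$<"A" OR C$>"Z" THEN
    REM Not an upper-case letter: copy the character unchanged
    O$+=C$:I%+=1
  ELSE
    REM Start of a word: see whether a keyword begins here
    hit%=FALSE
    FOR K%=0 TO 1
      IF NOT hit% THEN IF MID$(A$,I%,LEN(kw$(K%)))=kw$(K%) THEN hit%=TRUE:O$+=CHR$(tok%(K%)):I%+=LEN(kw$(K%))
    NEXT
    IF NOT hit% THEN
      REM No keyword here, so copy the whole name; "ON" inside "NOON" is never examined
      REPEAT
        O$+=C$:I%+=1:C$=MID$(A$,I%,1)
      UNTIL C$<"0" OR (C$>"9" AND C$<"A") OR C$>"Z"
    ENDIF
  ENDIF
ENDWHILE
=O$
code

With this table the sketch prints EE 20 4E 4F 4F 4E 20 E5 20 31 2C 32: the leading ON becomes the single byte EE, while NOON survives as the four characters 4E 4F 4F 4E.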

The **EVAL** function tokenises the supplied string and evaluates it as an expression. Usefully, the tokenised string can be retrieved from where BASIC has stored it.

In Windows BASIC:
code format="bb4w"
B%=EVAL("0:"+A$)
token$=$(!332+2)
code

This code may fail if an event interrupt (e.g. ON TIME) occurs between the two statements. To avoid this, use the following alternative which (in //BBC BASIC for Windows// version 6 only) does not allow an intervening interrupt:
code format="bb4w"
IF EVAL("1:"+A$) token$=$(!332+2)
code

The input and output share the same memory buffer, which is fine so long as tokenising shortens the code (which is almost always the case) but can cause a crash if it lengthens the code. That can happen only in exceptional circumstances, such as the following line:
code format="bb4w"
ON A% GOTO 10,20,30,40,50
code

The tokenising process encodes the line numbers in a special internal format, which increases the overall length from 25 to 31 bytes. To reduce the chance of this causing a crash, the tokenising routine can be adapted as follows:
code format="bb4w"
IF EVAL("1RECTANGLE:"+A$) token$=$(!332+3)
code

Here the long keyword RECTANGLE tokenises to a single byte, so the prefix shrinks from eleven characters to three and leaves spare room in the buffer for the rest of the line to grow.
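
The growth from 25 to 31 bytes can be accounted for if each line number after GOTO is stored as a marker byte followed by three encoded bytes. The sketch below uses the encoding commonly documented for Acorn BBC BASIC; that encoding is an assumption here rather than something stated in this article, and FNlineref is a hypothetical name.
code format="bb4w"
REM Sketch of the commonly documented line-number encoding (assumed, not taken from this article)
T$=FNlineref(10)
FOR I%=1 TO LEN(T$)
  PRINT ;~ASC(MID$(T$,I%,1));" ";
NEXT
PRINT
END
:
DEF FNlineref(N%)
LOCAL B1%,B2%,B3%
REM Top two bits of each byte of the line number, scrambled with EOR &54
B1%=(((N% AND &C0) DIV 4) OR ((N% AND &C000) DIV 4096)) EOR &54
REM Low six bits of the low byte and of the high byte, each forced into &40-&7F
B2%=(N% AND &3F) OR &40
B3%=((N% AND &3F00) DIV 256) OR &40
=CHR$(&8D)+CHR$(B1%)+CHR$(B2%)+CHR$(B3%)
code

For line 10 the sketch prints 8D 54 4A 40, which is four bytes in place of the two characters "10". Under that assumption the example line loses one byte for ON and three for GOTO but gains two bytes for each of its five line numbers: 25 - 1 - 3 + 10 = 31, which agrees with the figure above.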

In ARM BASIC:
code format="bb4w"
SYS "XOS_GenerateError",0,STRING$(255,"*") TO ,A%
B%=EVAL("0:"+A$)
token$=$(A%-14)
code

In 6502 BASIC:
code format="bb4w"
A%=EVAL("0:"+A$)
token$=$((!4 AND &FFFF)-LENA$-1)
code

By preceding the code you want to tokenise with "0:" you can safely pass it to **EVAL** without provoking a Syntax error. You can then extract the tokenised code from memory, so long as you do it immediately after calling **EVAL**.

This can be written as functions as follows:
code format="bb4w"
DEF FNTokenise_Win(A$):LOCAL A%,B%
WHILELEFT$(A$,1)=" ":A$=MID$(A$,2):ENDWHILE
B%=EVAL("0:"+A$):=$(!332+2)
:
DEF FNTokenise_ARM(A$):LOCAL A%,B%
SYS "XOS_GenerateError",0,STRING$(255,"*") TO ,A%
B%=EVAL("0:"+A$):=$(A%-13)
:
DEF FNTokenise_65(A$):LOCAL A%
A%=EVAL("0:"+A$):=$((!4 AND &FFFF)-LENA$-1)
code
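
As a quick check, the tokenised bytes can be displayed in hexadecimal. This is only a sketch: it assumes FNTokenise_Win from above is present in the program, and T$ and I% are illustrative names.
code format="bb4w"
REM Tokenise a short statement and show each resulting byte in hex
T$=FNTokenise_Win("PRINT A%")
FOR I%=1 TO LEN(T$)
  PRINT ;~ASC(MID$(T$,I%,1));" ";
NEXT
PRINT
code

The keyword should appear as a single byte, while the variable name is left as plain characters.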

These functions appear in full in the 'Tokenise' BASIC library at mdfs.net.

A text file can then be tokenised using the following code:
code format="bb4w"
in%=OPENIN(text$)
out%=OPENOUT(basic$)
line%=1                                 :REM Start from an arbitrary line number
REPEAT
  line$=FNTokenise_Win(GET$#in%)        :REM Read line and tokenise it
  BPUT#out%,LENline$+4                  :REM Output line length
  BPUT#out%,line%:BPUT#out%,line%DIV256 :REM Output line number
  BPUT#out%,line$;:BPUT#out%,13         :REM Output line and CR terminator
  line%+=1                              :REM Increment line number
UNTIL EOF#in%
BPUT#out%,0:BPUT#out%,&FF:BPUT#out%,&FF :REM Output program terminator
CLOSE#out%:out%=0
CLOSE#in%:in%=0
code
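
The output can be checked by walking the file back in the same layout the code above writes: a length byte, a two-byte line number, the tokenised text, then a byte of 13, with a zero length byte marking the end. The following is a minimal sketch that assumes basic$ still holds the output filename; len%, lo%, hi%, n%, dummy% and cr% are illustrative names.
code format="bb4w"
in%=OPENIN(basic$)
REPEAT
  len%=BGET#in%                      :REM Line length byte, 0 marks the end
  IF len%<>0 THEN
    lo%=BGET#in%:hi%=BGET#in%        :REM Line number, low byte first
    n%=len%-4                        :REM Number of bytes of tokenised text
    WHILE n%>0:dummy%=BGET#in%:n%-=1:ENDWHILE
    cr%=BGET#in%                     :REM Should be 13
    PRINT lo%+256*hi%;" ";len%-4;" ";cr%
  ENDIF
UNTIL len%=0 OR EOF#in%
CLOSE#in%:in%=0
code

Each printed row gives the line number, the length of its tokenised text, and the terminating byte. A final value other than 13 suggests that a length byte is wrong.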