2.3. Lexer

2.3.1. Functions implemented in this module

2.3.1.1. lex_SymbolicName()
2.3.1.2. lex_HandleContinuationLines()
2.3.1.3. lex_RemoveSkipSymbols()
2.3.1.4. lex_RemoveComments()
2.3.1.5. lex_NextLexeme()
2.3.1.6. lex_SavePosition()
2.3.1.7. lex_RestorePosition()
2.3.1.8. lex_StartIteration()
2.3.1.9. lex_EOF()
2.3.1.10. lex_Type()
2.3.1.11. lex_Double()
2.3.1.12. lex_String()
2.3.1.13. lex_StrLen()
2.3.1.14. lex_Long()
2.3.1.15. lex_LineNumber()
2.3.1.16. lex_FileName()
2.3.1.17. lex_XXX()
2.3.1.18. lex_Finish()
2.3.1.19. lex_DumpLexemes()
2.3.1.20. lex_ReadInput()
2.3.1.21. lex_InitStructure()

The module lexer is implemented in the C source file `lexer.c'. This module converts the characters it reads into a list of tokens. The lexer recognizes the basic lexical elements, like numbers, strings or keywords. It reads the characters provided by the reader and groups them into lexical elements. For example, whenever the lexical analyzer sees a " character it starts to process a string until it finds the closing ". When it does, the module creates a new token, links it to the end of the list and goes on.
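
The step "create a new token and link it to the end of the list" could be sketched as follows. This is only an illustration: the names DemoToken, DemoTokenList and append_token are hypothetical, and the sketch uses malloc, while the real lexer has its own lexeme structure and uses the memory allocating function configured in the module object.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical token node; the real lexer stores more than the text. */
typedef struct DemoToken {
  char *text;                 /* private copy of the token text */
  struct DemoToken *next;     /* next token in the list or NULL */
} DemoToken;

/* Hypothetical list header that remembers both ends of the list. */
typedef struct {
  DemoToken *head;
  DemoToken *tail;
} DemoTokenList;

/* Appends a copy of text[0..len) to the end of the list.
   Returns 1 on success and 0 if memory could not be allocated. */
int append_token(DemoTokenList *list, const char *text, size_t len) {
  DemoToken *t;
  assert(list != NULL && text != NULL);
  t = malloc(sizeof *t);
  if (t == NULL) return 0;
  t->text = malloc(len + 1);
  if (t->text == NULL) { free(t); return 0; }
  memcpy(t->text, text, len);
  t->text[len] = '\0';
  t->next = NULL;
  if (list->tail != NULL) list->tail->next = t;  /* link after the last token */
  else list->head = t;                           /* first token in the list */
  list->tail = t;
  return 1;
}
```

Keeping a pointer to the last element makes each append a constant time operation, which matters because the list is extended once for every token in the source.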

To do this the lexical analyzer has to know what counts as a keyword, a string or a number.

Because general purpose, table driven lexical analyzers are usually rather slow, ScriptBasic uses a proprietary lexical analyzer that is partially table driven, but not as general purpose as one created using the program LEX.

Some rules are coded into the C code of the lexical analyzer, while others are defined in tables. Even the rules coded into the C program are usually parameterized in the module object.

Let's see the module object definition from the file `lexer.c'. (Note that the C .h header files are extracted from the .c files, thus there is no need to maintain function prototypes twice.)

Note however that this is a copy of the actual definition from the file `lexer.c', and it may have changed since I wrote this manual. At the time of writing the lexer object was:

typedef struct _LexObject {
  int (*pfGetCharacter)(void *);
  char * (*pfFileName)(void *);
  long (*pfLineNumber)(void *);
  void *pvInput;
  void *(*memory_allocating_function)(size_t, void *);
  void (*memory_releasing_function)(void *, void *);
  void *pMemorySegment;

  char *SSC;
  char *SCC;
  char *SFC;
  char *SStC;
  char *SKIP;

  char *ESCS;
  long fFlag;

  pReportFunction report;
  void *reportptr;
  int iErrorCounter;
  unsigned long fErrorFlags;

  char *buffer;
  long cbBuffer;

  pLexNASymbol pNASymbols;
  int cbNASymbolLength;

  pLexNASymbol pASymbols;

  pLexNASymbol pCSymbols; 
  pLexeme pLexResult;
  pLexeme pLexCurrentLexeme;
  struct _PreprocObject *pPREP;
  }LexObject, *pLexObject;

This struct contains the global variables of the lexer module. In the first "section" of the structure you can see the variables that may already sound familiar from the module reader. These parameterize the memory allocation and the input source for the module. The input functions are usually set so that the characters come from the module reader, but there is no objection in principle to using another character source for the purpose.

The variable pvInput is not altered by the module. It is only passed to the input functions. The function pointer name pfGetCharacter speaks for itself. Like getc, it returns the next character. However, when this function pointer is set to point to the function reader_NextCharacter, the input is already preprocessed a bit. Namely, the include and import directives have already been processed.

This has an interesting consequence that you may notice if you read the descriptions of the reader module and of this module carefully: include and import work inside multi-line strings. (OK, I have not talked about multi-line strings so far, so do not feel ashamed if you did not realize this.)

The function pointers pfFileName and pfLineNumber should point to functions that return the file name and the line number of the last read character. This is something that a getc will not provide, but the reader functions do. This will allow the lexical analyzer to store the file name and the line number for each token.
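
To show that the character source does not have to be the reader, here is a minimal sketch of three functions that have the same shape as pfGetCharacter, pfFileName and pfLineNumber but read from a string in memory. The type StringSource and the functions str_GetCharacter, str_FileName and str_LineNumber are hypothetical names invented for this example; they are not part of ScriptBasic.

```c
#include <assert.h>
#include <stdio.h>   /* EOF, size_t */

/* Hypothetical in-memory input source. */
typedef struct {
  const char *text;    /* NUL terminated program text */
  size_t pos;          /* index of the next character to return */
  long line;           /* line number of the last character returned */
  int after_newline;   /* non-zero if the last character was a new-line */
  char *name;          /* pseudo file name reported to the lexer */
} StringSource;

/* Same shape as pfGetCharacter: returns the next character or EOF. */
int str_GetCharacter(void *pv) {
  StringSource *s = pv;
  int ch;
  assert(s != NULL && s->text != NULL);
  if (s->text[s->pos] == '\0') return EOF;
  ch = (unsigned char)s->text[s->pos++];
  /* The line counter advances when the first character of a new line is read. */
  if (s->after_newline) { s->line++; s->after_newline = 0; }
  if (ch == '\n') s->after_newline = 1;
  return ch;
}

/* Same shape as pfFileName: name of the "file" the last character came from. */
char *str_FileName(void *pv) {
  return ((StringSource *)pv)->name;
}

/* Same shape as pfLineNumber: line number of the last character read. */
long str_LineNumber(void *pv) {
  return ((StringSource *)pv)->line;
}
```

With such functions one would presumably store their addresses in pfGetCharacter, pfFileName and pfLineNumber and the address of the StringSource variable in pvInput before starting the lexical analysis.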

The next group of variables seems frightening and unreadable at first, but this book is here to explain them. These variables define what a string is, what a symbol is, what has to be treated as unimportant space and so on. In most programming languages symbols start with an alpha character and continue with alphanumeric characters. But what is an alpha character? Is _ one, or is $ a valid alphanumeric character? Well, for the lexer module, if any of these characters appears in the variable SSC then the answer is yes. The name stands for Symbol Start Characters. But let's go through all these variables one by one.

char *SSC;
This Symbol Start Character variable contains all the characters that may be used to start a symbol. This symbol can be a variable or a symbol that stands for itself in the code, like in the command SET FILE. (See the user's guide.) The default value for this variable is
```
QWERTZUIOPASDFGHJKLYXCVBNMqwertzuiopasdfghjklyxcvbnm_:$
```
char *SCC;
This Symbol Continuation Character variable contains all the characters that may be used inside a symbol after the opening first character. The default value for this variable is
```
QWERTZUIOPASDFGHJKLYXCVBNMqwertzuiopasdfghjklyxcvbnm_1234567890:$
```
char *SFC;
This Symbol Finishing Character variable contains all the characters that may be used as the last character inside a symbol. The default value for this variable is
```
QWERTZUIOPASDFGHJKLYXCVBNMqwertzuiopasdfghjklyxcvbnm_1234567890$
```
which works fine for ScriptBasic. Note that this prevents ScriptBasic variables from ending with a colon.
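
The interplay of the three character sets can be sketched as follows. The function symbol_length and the DEFAULT_ constants are hypothetical names used only here. The sketch assumes that the lexer consumes continuation characters greedily and then backs off until the last character is a valid finishing character; the manual does not say how `lexer.c' does this, so treat it as one possible reading.

```c
#include <assert.h>
#include <string.h>

/* The default values quoted above. */
const char DEFAULT_SSC[] = "QWERTZUIOPASDFGHJKLYXCVBNMqwertzuiopasdfghjklyxcvbnm_:$";
const char DEFAULT_SCC[] = "QWERTZUIOPASDFGHJKLYXCVBNMqwertzuiopasdfghjklyxcvbnm_1234567890:$";
const char DEFAULT_SFC[] = "QWERTZUIOPASDFGHJKLYXCVBNMqwertzuiopasdfghjklyxcvbnm_1234567890$";

/* Returns the length of the symbol that starts at s, or 0 if no symbol
   starts there. ssc, scc and sfc play the roles of SSC, SCC and SFC. */
size_t symbol_length(const char *s, const char *ssc,
                     const char *scc, const char *sfc) {
  size_t n;
  assert(s != NULL && ssc != NULL && scc != NULL && sfc != NULL);
  /* The first character has to be a symbol start character. */
  if (*s == '\0' || strchr(ssc, *s) == NULL) return 0;
  n = 1;
  /* Consume continuation characters as long as there are any. */
  while (s[n] != '\0' && strchr(scc, s[n]) != NULL) n++;
  /* Assumption: give back trailing characters that may not finish a symbol. */
  while (n > 0 && strchr(sfc, s[n - 1]) == NULL) n--;
  return n;
}
```

With the default sets the text main::a$ is one symbol of eight characters, while in label: only the five letters belong to the symbol because the colon is not a finishing character.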
char *SStC;
This Start String Character variable contains the characters that may start a string. The ScriptBasic value contains only the " character, thus ScriptBasic strings can only start and end with the " character. However, some other languages may use different string starting and finishing characters.
If there is more than one character in this variable then a string opened with one of these characters has to be closed with the same character. This is hard coded into the C program of the lexer.
The lexer recognizes both single-line strings and multi-line strings. A single-line string starts with a single " (or whatever character is allowed in the SStC field) and finishes with a single ". There cannot be a new-line character in a single-line string, and any " character in the string has to be quoted using the \ character. The \ character is not hard-coded; it is configured in the field ESCS, as you will see later.
A multi-line string starts with """, that is three " characters, and finishes the same way. A multi-line string may span several lines. This notation for multi-line strings was inherited from the language Python. (At least I did not see it anywhere else.)
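
Deciding whether a string is a single-line or a multi-line one only needs a look at the first three characters. A minimal sketch, with the hypothetical function name opening_quote_length, could be:

```c
#include <assert.h>
#include <string.h>

/* Returns 3 if a multi-line string opens at s, 1 if a single-line string
   opens at s and 0 if s does not start a string at all.
   sstc plays the role of the SStC field. */
int opening_quote_length(const char *s, const char *sstc) {
  assert(s != NULL && sstc != NULL);
  if (*s == '\0' || strchr(sstc, *s) == NULL) return 0;
  /* Three identical string start characters open a multi-line string.
     s[2] is examined only when s[1] is not the terminating zero. */
  if (s[1] == s[0] && s[2] == s[0]) return 3;
  return 1;
}
```

The character s[0] is also the one that has to close the string, which is how the sketch mirrors the rule that a string is closed by the same character that opened it.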
char *SKIP;
This Skip variable contains all characters that are to be skipped. For ScriptBasic these are the space, the tab and the carriage-return characters.
Skipping these characters does not mean that they are not taken into account. They serve a very important role: they terminate tokens, thus no space can appear inside the name of a variable, for example. However, no token is generated from these characters.
Note that including the carriage-return character in this string allows ScriptBasic to compile, under UNIX, files that were edited under DOS and transferred in binary mode. However, the operating system may have a problem with the terminating carriage-return on the very first line.
char *ESCS;
This Escape String variable lists all the characters that can be escaped in a string. The line that initializes this variable in lex_InitStructure is:
```
pLex->ESCS = "\\n\nt\tr\r\"\"\'\'";
```
The first character of the ESCS string is the character used to escape other characters. This is the \ character for ScriptBasic. The remaining characters come in pairs: the character as written after the escape character, followed by its replacement. For example, the second character of this string is n and the character after it is a new-line character, thus \n will be a new-line in any single- or multi-line string in a BASIC program.
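
Looking up a replacement in a string laid out this way is a simple walk over the pairs. The following sketch uses the hypothetical function name escape_replacement; what the real lexer does with a character that is not listed is not described here, so the sketch just reports it to the caller.

```c
#include <assert.h>
#include <stddef.h>

/* Returns the replacement for the character c that followed the escape
   character, or -1 if c is not listed. escs has the layout of the ESCS
   field: escs[0] is the escape character and the (written, replacement)
   pairs start at index 1. */
int escape_replacement(const char *escs, int c) {
  size_t i;
  assert(escs != NULL && escs[0] != '\0');
  for (i = 1; escs[i] != '\0' && escs[i + 1] != '\0'; i += 2) {
    if ((unsigned char)escs[i] == c) return (unsigned char)escs[i + 1];
  }
  return -1;
}
```

With the ScriptBasic value the call escape_replacement(escs, 'n') gives the new-line character, and a quote character maps to itself.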
long fFlag;
This variable is a bit field that controls how escaped numbers are treated in strings. The lines that initialize this variable are
```
  pLex->fFlag = LEX_PROCESS_STRING_NUMBER       |
                LEX_PROCESS_STRING_OCTAL_NUMBER |
                LEX_PROCESS_STRING_HEX_NUMBER   |
                0;
```
The constants, also defined in `lexer.c', tell the lexical analyzer that an escape character in a string followed by numeric characters should be converted to the character with that code. Thus the string "a\10a" will contain two a characters separated by a new-line. When the first character following the escape character is 0, the number is treated as octal. If this character is x (lower case only, not X), the number is treated as hexadecimal. The escaped number lasts as long as digits follow each other without a space. If the number is hexadecimal, the letters a-f and A-F are also treated as digits.

The default values for these variables are set in the function lex_InitStructure. Interestingly these default values are perfectly ok for ScriptBasic.

The field pNASymbols points to an array that contains the non-alpha symbols list. Each element of this array contains a string that is the textual representation of the symbol and a code, which is the token code of the symbol. For example the table NASYMBOLS in file `syntax.c' is:


LexNASymbol NASYMBOLS[] = {
{ "@\\" , CMD_EXTOPQN } ,
{ "@`" , CMD_EXTOPQO } ,
{ "@'" , CMD_EXTOPQP } ,
{ "@" , CMD_EXTOPQQ } ,

...

{ "@" , CMD_EXTOPQ } ,
{ "^" , CMD_POWER } ,
{ "*" , CMD_MULT } ,
{ NULL, 0 }
  };

When the lexical analyzer finds something that is not a string, number or alphanumeric symbol, it tries to read forward and recognize one of the non-alpha tokens listed in this table. It is extremely important that the symbols are ordered in this table so that longer symbols come first: a symbol abc must not be listed before abcd. Otherwise abcd will never be found!
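
Why the order matters is easy to see if we assume that the table is scanned from the top and the first entry that matches the input wins. The following sketch makes exactly this assumption; the type DemoSymbol and the function match_symbol are hypothetical stand-ins and not the code of `lexer.c'.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical stand-in for an element of the symbol table:
   the text of the symbol and its token code. */
typedef struct {
  const char *text;
  int code;
} DemoSymbol;

/* Returns the first table entry whose text is a prefix of input, or NULL.
   The table is closed by an entry that has a NULL text, like NASYMBOLS. */
const DemoSymbol *match_symbol(const DemoSymbol *table, const char *input) {
  assert(table != NULL && input != NULL);
  for (; table->text != NULL; table++) {
    if (strncmp(input, table->text, strlen(table->text)) == 0) return table;
  }
  return NULL;
}
```

If a table lists the symbol <= before < then the input <= 3 gives the entry of <=. If < comes first, the same input already matches at < and the entry of <= is never reached.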

The variable cbNASymbolLength is nothing to care about. This is used internally and is calculated automatically by the lexical analyzer.

The variable pASymbols is similar to the variable pNASymbols and points to the same kind of table. This variable, however, should point to an array that contains the alphanumeric symbols. For ScriptBasic it points to the array ASYMBOLS, which you can find in the file `syntax.c'.

The order of the words in this array is not important, except that listing more frequent words earlier results in faster compilation.

The field pCSymbols points to an array that is used only for debugging purposes. I mean debugging ScriptBasic code itself and not debugging BASIC programs.

The rest of the variables are used by the functions that iterate through the list of tokens when the syntax analyzer reads the token list or to report errors during lexical analysis. Error reporting is detailed in a separate section.

The tables that list the lexical elements are not maintained "by hand". The source for ScriptBasic syntax is maintained in the file `syntax.def' and the program `syntaxer.pl' creates the C syntax file `syntax.c' from the syntax definition.

The program `syntaxer.pl' is so complex that two years after I wrote it I had a hard time understanding it, and I rather treat it as holy code: blessed and untouchable. (OK, see: that code is quite complex, but if any bug were found in it I could work out what I did in a few hours. Anyway, the brain that created that code once belonged to me.)
