2. Interpreter Architecture

This chapter tells you the architecture of the interpreter. It is not a must to read this chapter, and you may find that some topic is irrelevant or not needed to learn to embed or to extend ScriptBasic. However understanding the internal working order of ScriptBasic should help you understand why some of the extending or embedding interfaces work the way they actually do. So I recommend that you read on and not skip this chapter.

To read this chapter and to understand the internal working of the interpreter is vital when you decide to write an internal preprocessor. Internal preprocessors interact with not only the execution system but also the reader, lexer, syntaxer and builder modules thus internal preprocessor writers have to understand how these modules work.

ScriptBasic is not a real scripting language. This is a mix of a scripting language and compiled languages. The language is scripting in the sense that it is very easy to write small programs, there is no need for long variable and function declarations. On the other hand the language is compiled into an internal code that it executed afterwards. This is the technique for example the implementations of the language Java.

ScriptBasic as a language is a BASIC dialect that implements most BASIC features that BASIC implementations usually do. However ScriptBasic variables are not typed, and dynamic storage, like arrays are automatically allocated and released. A ScriptBasic variable can store a string, an integer, a real value or an array. Naturally a variable can not store more than one of any of these types of values. But you need not declare a variable to be INTEGER or REAL, or STRING. A variable may store a string at a time and the next assignment command may release the original value and store a different value in the variable.

When a program is executed it goes through several steps. The individual steps are implemented in different modules each being coded in a separate C language source file. The modules are

EXTERNAL PREPROCESSOR
This module executes external preprocessors. These preprocessors are standalone executable programs that read the source program and create another file that is read and processed by the ScriptBasic interpreter. If an external preprocessor is used the source file is usually not BASIC but rather some other language, usually a BASIC like language, which is extended some way and the preprocessor creates the pure ScriptBasic conformant BASIC program. The sample preprocessor supplied with ScriptBasic is the HEB preprocessor that reads HTML embedded BASIC code and creates BASIC program. This HEB source file is a kind of HTML with embedded program fragments, which you may be familiar with in case you program PHP or Microsoft BASIC ASP pages. The HEB preprocessor itself is written in BASIC and is executed by ScriptBasic. Thus when a HEB "language" is executed by ScriptBasic it starts a separate instance of the interpreter and executes the HEB preprocessor on the source file. Of course the HEB preprocessor could be implemented in any language that can be compiled or some way executed on the target machine. Actually the very first version of the HEB preprocessor was written in Perl so when it was first tested the ScriptBasic interpreter started a Perl interpreter before reading the generated BASIC code.
READER
This module reads the source file into the computer memory. Usually source programs are not too big compared to computer memory and thus can be read into the operational memory (RAM). ScriptBasic source code is approximately 1MB and I develop it on a station that has 386MB memory.
The source code is stored in memory pieces that form a linked list. Each element of the list contains one line of the source memory and the information of the line for debugging and error reporting purposes. This information includes the file name that the line was read and the line number. Later when the lexer (read further) performs lexical analysis it will inherit this information and when there is a lexical or syntactical error the line number is reported correct.
The reader module also handles the include and import directives that are used to include files into the source file. (Note that import inserts the content of the file only if it was not loaded yet.)
The module also processes the lines that look
```
use sample_preprocessor
```
and loads the internal preprocessor named on the line. Preprocessors
When the module is ready the latter modules have the full source file in memory ready to be processed. The module also provides getc and ungetc like functions to get the read characters one by one.
LEXER
The lexer module uses the line stream (or the character stream if we view it from a different point of view) provided by the reader. It reads the characters and builds up a linked list. Each element of the list contains a token, like BASIC keyword, a real or integer number, symbol, string, multi-line string, or character. The list of tokens is stored in a form of linked list in the order the tokens appear in the input. Each element also contains extra information about the token that identifies the name of the file and the line number inside the file where the token originally was.
When the lexer is finished the list of lines is not really needed any more and the reader is ready to release the memory occupied by the source lines read into memory.
The lexer also provides functions that are used by the syntax analyzer to read the tokens in sequence one after the other as needed by the syntax analysis.
SYNTAXER
The syntaxer reads the list of tokens provided by the lexical analysis module and creates an internal structure that is already very similar to the internal code of ScriptBasic, which is executed. The syntax analyzer finds any programming error that is not syntactically correct and when it is ready the result is a huge, cross-linked memory structure that contains the almost-executable code.
The syntax analyzer is responsible building up the evaluation trees of the expressions, the execution nodes, variable numbering and so on.
When the code refers to a variable named for example variable the syntax analyzer is responsible to allocate a slot for the variable and to convert the name to a serial number that identifies the variable whenever it is used. Beyond the syntax analyzer there are no named variables anymore (except in case of debuggers). There are global variables listed from 1 to n and local variables also listed by numbers. There are also no names for the functions. Each function is identified by a C pointer to the node where the execution of the function starts. (It is not really true. This node is not used for execution. Rather the builder will convert this node to another node that is used to execute. But for the builder read later.)
To ease the life of those who want to embed ScriptBasic the symbol table that list the global variables and the functions and subroutines is appended to the byte-code and there are functions in the scriba_* interface that handles these symbol tables. However ScriptBasic itself does not use variable or functions/subroutine names whenever it executes.
BUILDER
The builder is the module that creates the code, which is used by the execution system. Why do we have a separate builder? Isn't it the role of the syntax analyzer to build the code?
Yes, and no. The code that was created by the syntax analyzer could be used to execute the BASIC program, but ScriptBasic still inserts an extra transformation before executing the program. The reason for this extra step is to create a byte code that can be stored in a continuous memory area and thus can easily be saved to or loaded from disk.
When the syntax analyzer creates the nodes it does not know the actual number of nodes of the byte-code, nor the number of different strings, or size of the string table. While the code is created the syntax analyzer allocates memory for each new block it creates one by one. The nodes are linked together using C pointers. This means that the final memory structure is neither continuous in memory nor can be saved or loaded back to disk.
When the builder starts the number of the nodes is known. The builder allocates the memory needed for the whole code and fills in the actual code. The node size is a bit smaller than that of the syntax analyzer and they refer to each other using node serial numbers instead of pointers. This is almost as efficient as using pointers and the actual value does not depend on the location of the node in memory and this way the code can be saved to disk and loaded again for execution.
EXECUTOR
The executor kills the code. Oh no! I am just kidding. It actually executes the code. It gets the code that was generated by the module builder and executes the nodes one by one and finally exits.

The following sections detail these modules and also some other modules that help these modules to perform their actual tasks.

[<<<] [>>>]