Translators
- A translator converts a program from one language into another (often lower-level) language
- Compiler (high level into low level with object code)
- Interpreter (high level into low level without object code)
- Assembler (assembly language into machine code)
Object Code
- "Object code" is a relative term: it means the output of a translator, whatever form that takes
- Often binary machine code, but sometimes assembly code, depending on the programmer’s needs
- Compilers produce object code for a specific operating system, e.g. a C program compiled for Windows would need to be recompiled for Linux
- Programs make calls to OS system calls, which ties the compiled code to that OS
Bytecode
- Java compiles to bytecode, a generalised intermediate language, which the JVM for a specific OS then interprets into machine code
- The JVM acts as a virtual processor, interpreting the compiled and optimised bytecode
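
The same idea can be seen from Python, used here only as an analogy: CPython compiles source code to its own bytecode (not JVM bytecode) and then runs it on its own virtual machine. A minimal sketch, where the source line and variable names are made up for illustration:

```python
import dis

# Hypothetical source text used only for this illustration.
source = "total = price * quantity + 1"

# compile() translates the source into a code object holding bytecode.
code_obj = compile(source, "<example>", "exec")

# The raw bytecode is a bytes object: one numeric instruction after another.
print(code_obj.co_code.hex())

# dis prints a readable listing of the same instructions.
dis.dis(code_obj)

# The virtual machine (CPython's here, not the JVM) executes the bytecode.
namespace = {"price": 3, "quantity": 4}
exec(code_obj, namespace)
print(namespace["total"])  # 13
```

The exact instructions printed differ between Python versions, which is itself a reminder that bytecode is tied to a virtual machine rather than to real hardware.
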
Binary object code
- Binary object code is a portion of machine code that has not yet been linked into a complete program or executable
- The linker is a program that takes objects (such as object code, libraries, etc.), links them together, and maps their memory addresses
- Static linking is done at build time, straight after compilation. Dynamic linking is done by the OS at load time or run time, which saves rebuilding the program if shared libraries change. Windows’ model is the DLL (dynamic-link library).
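
A toy sketch of what a static linker does, under heavy simplification: each "object file" is just a dictionary naming the symbols it defines (with a size in bytes) and the symbols it needs. The file names, symbol names, sizes and base address are all hypothetical.

```python
# Each hypothetical "object file" lists the symbols it defines (with their
# size in bytes) and the symbols it refers to but does not define itself.
main_obj = {"name": "main.o", "defines": {"main": 40}, "needs": ["print_total", "add"]}
math_obj = {"name": "math.o", "defines": {"add": 16}, "needs": []}
io_obj = {"name": "io.o", "defines": {"print_total": 24}, "needs": ["add"]}


def link(objects, base_address=0x1000):
    """Lay the objects out one after another and resolve every reference."""
    addresses = {}
    next_free = base_address
    # Pass 1: assign a memory address to every defined symbol.
    for obj in objects:
        for symbol, size in obj["defines"].items():
            if symbol in addresses:
                raise ValueError(f"duplicate symbol: {symbol}")
            addresses[symbol] = next_free
            next_free += size
    # Pass 2: check that every reference can be resolved to an address.
    for obj in objects:
        for symbol in obj["needs"]:
            if symbol not in addresses:
                raise ValueError(f"undefined symbol {symbol} in {obj['name']}")
    return addresses


symbol_map = link([main_obj, math_obj, io_obj])
for symbol, address in symbol_map.items():
    print(f"{symbol}: {address:#x}")
```

Leaving `math.o` out of the list makes the second pass fail with an undefined symbol, which is the kind of error a real linker reports when a library is missing.
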
Compilation process
- 4 Main stages:
- Lexical analysis (code split into tokens)
  - Scans the input stream (source code), extracts the individual words or lexemes, and passes the corresponding tokens to the parser
  - A token is a word or symbol in the source code, e.g.
    - Identifiers: count, x, sum, etc.
    - Operators: +, /, =, :=, etc.
  - Spaces, line breaks and comments are removed
  - Unrecognised symbols (not belonging to the language’s vocabulary) will halt the compilation process
  - TLDR: Work out what is present in the source code, using the keyword table and building a symbol table, to produce a tokenised version of the source code with all the non-essential parts stripped out
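
The lexical analysis steps above can be sketched as a small regex-based scanner. This is an illustration only: the keyword table, the `#` comment syntax, the operator set and the choice to start token IDs at hexadecimal A0 are all assumptions, not part of any real language.

```python
import re

# Assumed keyword table for a made-up toy language.
KEYWORDS = {"if", "then", "else", "while", "print"}

# Order matters: the two-character operator := is tried before single ones.
TOKEN_PATTERN = re.compile(r"""
    (?P<SKIP>\s+|\#[^\n]*)          # whitespace and comments are discarded
  | (?P<NUMBER>\d+)
  | (?P<NAME>[A-Za-z_]\w*)
  | (?P<OPERATOR>:=|[+\-*/=<>()])
  | (?P<ERROR>.)                     # anything else is not in the vocabulary
""", re.VERBOSE)


def tokenise(source):
    """Return (tokens, symbol_table) for the given source text."""
    tokens = []
    symbol_table = {}  # identifier -> hexadecimal token ID
    for match in TOKEN_PATTERN.finditer(source):
        kind, text = match.lastgroup, match.group()
        if kind == "SKIP":
            continue
        if kind == "ERROR":
            raise SyntaxError(f"unrecognised symbol {text!r}")
        if kind == "NAME":
            if text in KEYWORDS:
                kind = "KEYWORD"
            else:
                kind = "IDENTIFIER"
                # First sighting: give the identifier the next free ID.
                symbol_table.setdefault(text, f"{0xA0 + len(symbol_table):X}")
        tokens.append((kind, text))
    return tokens, symbol_table


tokens, symbols = tokenise("sum := count + 1  # running total")
print(tokens)
print(symbols)
```

The comment and the spaces never reach the token list, and a character such as `@` stops the scan with an error, matching the behaviour described in the bullets.
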
- Syntax analysis (tokens used to build abstract syntax tree)
  - Looks at the tokens and checks that they match what the rules (syntax) of the language expect; the result is stored as a syntax tree
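
One common way to do this check is a recursive-descent parser. The sketch below assumes a tiny made-up grammar (`IDENTIFIER := expression`, with `+ - * /` and brackets) and assumes the tokens arrive as `(kind, text)` tuples; the syntax tree is represented as nested tuples purely for brevity.

```python
def parse_assignment(tokens):
    """Parse `IDENTIFIER := expression` and return a nested-tuple syntax tree."""
    position = 0

    def peek():
        return tokens[position] if position < len(tokens) else ("END", "")

    def expect(kind, text=None):
        nonlocal position
        found_kind, found_text = peek()
        if found_kind != kind or (text is not None and found_text != text):
            raise SyntaxError(
                f"expected {text or kind}, found {found_text or 'end of input'}"
            )
        position += 1
        return found_text

    def factor():
        kind, _ = peek()
        if kind == "NUMBER":
            return ("num", int(expect("NUMBER")))
        if kind == "IDENTIFIER":
            return ("var", expect("IDENTIFIER"))
        expect("OPERATOR", "(")
        inner = expression()
        expect("OPERATOR", ")")
        return inner

    def term():
        # * and / bind more tightly than + and -.
        node = factor()
        while peek() in (("OPERATOR", "*"), ("OPERATOR", "/")):
            operator = expect("OPERATOR")
            node = (operator, node, factor())
        return node

    def expression():
        node = term()
        while peek() in (("OPERATOR", "+"), ("OPERATOR", "-")):
            operator = expect("OPERATOR")
            node = (operator, node, term())
        return node

    target = expect("IDENTIFIER")
    expect("OPERATOR", ":=")
    tree = (":=", ("var", target), expression())
    expect("END")  # nothing may follow a complete statement
    return tree


# Hand-written token list standing in for the lexer's output.
example_tokens = [
    ("IDENTIFIER", "sum"), ("OPERATOR", ":="), ("IDENTIFIER", "count"),
    ("OPERATOR", "+"), ("NUMBER", "2"), ("OPERATOR", "*"), ("NUMBER", "3"),
]
syntax_tree = parse_assignment(example_tokens)
print(syntax_tree)
```

A token sequence that breaks the rules, such as `sum := +`, raises a `SyntaxError` instead of producing a tree.
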
- Code generation (syntax tree converted to low-level code)
  - Transforms the annotated syntax tree into object code, either as three-address code or as code for a virtual machine (bytecode)
  - Note: Object code can be intermediate code or machine code / assembly code. At most a basic level of optimisation occurs at this stage; no true optimisation has been carried out yet
  - TLDR: Code generation converts the syntax tree into machine code or intermediate code, with the ability to link to shared libraries, etc.
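
As a rough illustration of three-address code, the sketch below walks a syntax tree and emits one instruction per operator, each with at most two operands and one result. The nested-tuple tree shape and the temporary names `t1`, `t2`, ... are assumptions made for this example.

```python
from itertools import count


def generate(tree):
    """Return a list of three-address instructions for a nested-tuple tree."""
    instructions = []
    temporaries = count(1)

    def emit(node):
        """Emit code for node and return the name that holds its value."""
        kind = node[0]
        if kind == "num":
            return str(node[1])
        if kind == "var":
            return node[1]
        _, left, right = node
        left_name = emit(left)
        right_name = emit(right)
        if kind == ":=":
            instructions.append(f"{left_name} := {right_name}")
            return left_name
        # Every other operator stores its result in a fresh temporary.
        temporary = f"t{next(temporaries)}"
        instructions.append(f"{temporary} := {left_name} {kind} {right_name}")
        return temporary

    emit(tree)
    return instructions


# Hand-written tree for: sum := count + 2 * 3
example_tree = (":=", ("var", "sum"), ("+", ("var", "count"), ("*", ("num", 2), ("num", 3))))
generated = generate(example_tree)
for line in generated:
    print(line)
```

Note that `2 * 3` is still computed at run time here: nothing has been optimised yet.
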
- Optimisation (code made more efficient (memory / speed))
  - Analyses the generated code to see whether its use of memory, CPU, I/O, or other resources can be reduced. This is a challenging task and extremely rare in interpreters
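
One of the simplest optimisations is constant folding: any sub-expression made up only of constants is worked out once at compile time. The sketch below applies it to a nested-tuple syntax tree (an assumed representation) and deliberately leaves division alone to avoid dealing with division by zero.

```python
import operator

# Operators that are safe to evaluate at compile time in this sketch.
FOLDABLE = {"+": operator.add, "-": operator.sub, "*": operator.mul}


def fold_constants(node):
    """Return an equivalent tree with constant sub-expressions pre-computed."""
    if node[0] in ("num", "var"):
        return node
    kind, left, right = node
    left, right = fold_constants(left), fold_constants(right)
    if kind in FOLDABLE and left[0] == "num" and right[0] == "num":
        return ("num", FOLDABLE[kind](left[1], right[1]))
    return (kind, left, right)


# Hand-written tree for: sum := count + 2 * 3
unoptimised = (":=", ("var", "sum"), ("+", ("var", "count"), ("*", ("num", 2), ("num", 3))))
optimised = fold_constants(unoptimised)
print(optimised)
```

The multiplication disappears from the tree, so the generated program has one instruction fewer to execute every time it runs.
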
Symbol Table
- This is a table that stores information on identifiers (names of variables, constants, functions & subroutines, objects, etc.)
- Identifiers are stored in the symbol table and used in various stages of translation
- The symbol table maintains an entry for each identifier in the following format:
- <symbol name, token value (ID), type, attribute>, e.g. <Count, A2, integer, variable>
- The symbol table exists for many reasons, including:
- Structured storage of all identifiers (entities) in one place
- To verify if a variable has been declared
- To implement type checking (semantic checking of data type, etc.)
- To determine scope of an identifier
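
A minimal sketch of a symbol table that covers the uses listed above (declaration checks, type lookup, scope). The class name, method names and the starting token ID of hexadecimal A2 are assumptions chosen to echo the example entry; real compilers store far more per entry.

```python
class SymbolTable:
    """Scoped symbol table: a stack of dictionaries, innermost scope last."""

    def __init__(self):
        self.scopes = [{}]    # start with the global scope
        self.next_id = 0xA2   # assumed starting value for token IDs

    def enter_scope(self):
        self.scopes.append({})

    def exit_scope(self):
        self.scopes.pop()

    def declare(self, name, data_type, attribute):
        if name in self.scopes[-1]:
            raise NameError(f"{name} is already declared in this scope")
        entry = (name, f"{self.next_id:X}", data_type, attribute)
        self.next_id += 1
        self.scopes[-1][name] = entry
        return entry

    def lookup(self, name):
        # Search from the innermost scope outwards.
        for scope in reversed(self.scopes):
            if name in scope:
                return scope[name]
        raise NameError(f"{name} has not been declared")


table = SymbolTable()
outer_entry = table.declare("Count", "integer", "variable")
table.enter_scope()
table.declare("Count", "real", "variable")   # shadows the outer Count
inner_type = table.lookup("Count")[2]
table.exit_scope()
outer_type = table.lookup("Count")[2]
print(outer_entry, inner_type, outer_type)
```

Type checking then becomes a matter of comparing the `type` field of the entries involved in an expression, and an undeclared name shows up as a failed `lookup`.
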
Tokens, Keywords & Symbols
- A token is any valid word / symbol within the source code
- Each unique token is given a hexadecimal token ID (value)
- A symbol
Interpreters vs Compilers
Compiler Characteristics | Interpreter Characteristics |
---|---|
Works on the complete program at once. Takes the entire program as input. | Works line-by-line. Takes one statement at a time as input. |
Generates object code or machine code. | Does not generate object code or machine code. |
Executes conditional control statements (like if-else and switch-case) faster. | Executes conditional control statements at a much slower rate. |
Compiled programs take more memory because the entire object code has to reside in memory. | More memory efficient as it does not generate intermediate object code. |
Compile once and run anytime. Does not need to be compiled every time. | Interpreted line-by-line every time they are run. |
Errors are reported after the entire program is checked for syntactical and other errors. | An error is reported as soon as it is encountered. The rest of the program is not checked until that error is removed. |
Does not allow a program to run until it is completely error-free. | Runs the program from the first line and stops execution only if it encounters an error. |
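
The last two rows of the table can be imitated in Python. This is only a rough sketch: `run_compiled` and `run_interpreted` are hypothetical helpers, the three-line program is made up, and Python's built-in `compile()` stands in for the translation step in both cases (it reports only the first syntax error it finds, unlike a compiler that lists them all).

```python
# Hypothetical three-line program; the last line has a syntax error.
program = [
    "total = 2 + 3",
    "doubled = total * 2",
    "broken = doubled +",
]


def run_compiled(lines):
    """Translate the whole program first; run it only if that succeeds."""
    variables = {}
    try:
        code_obj = compile("\n".join(lines), "<program>", "exec")
    except SyntaxError as error:
        return variables, f"syntax error: {error.msg}"
    exec(code_obj, variables)
    return variables, None


def run_interpreted(lines):
    """Translate and run one statement at a time, stopping at the first error."""
    variables = {}
    for number, line in enumerate(lines, start=1):
        try:
            exec(compile(line, "<line>", "exec"), variables)
        except SyntaxError as error:
            return variables, f"syntax error on line {number}: {error.msg}"
    return variables, None


compiled_vars, compiled_error = run_compiled(program)
interpreted_vars, interpreted_error = run_interpreted(program)

print("compiled run set total:", "total" in compiled_vars)
print("interpreted run set total:", "total" in interpreted_vars)
```

In the compiled-style run nothing executes because the program as a whole is rejected, whereas the line-by-line run has already executed the first two statements by the time it reaches the faulty one.
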