Translators

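  • A translator converts a program written in one language into another (often lower-level) language
    • Compiler (translates a whole high-level program into low-level object code before it is run)
    • Interpreter (translates and executes high-level code statement by statement, without producing object code)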
    • Assembler (assembly language into machine code)

Object Code

  • Object code is a relative term
    • Often binary machine code, or sometimes assembly code depending on the programmer’s needs
    • Compilers produce object code for a specific operating system, e.g. a C program compiled for Windows would need to be recompiled for Linux
      • Programs make OS system calls, which ties the compiled object code to that OS

Bytecode

  • The Java compiler produces bytecode, a generalised intermediate language that is then interpreted into machine code by the JVM (Java Virtual Machine) for a specific OS
    • The JVM acts as a virtual processor, interpreting the compiled and optimised bytecode
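
The notes describe Java, but the same idea can be seen from Python, whose reference implementation (CPython) also compiles source to bytecode that a virtual machine then executes. The sketch below is only an analogy to the JVM, not Java itself; the variable names (`price`, `quantity`, `total`) are made up for illustration, and the exact instruction names printed differ between Python versions.

```python
import dis

# Hypothetical one-line program used purely for illustration.
source = "total = price * quantity"

# Step 1: compile the source into a code object holding bytecode.
code_object = compile(source, "<example>", "exec")
raw_bytecode = code_object.co_code  # the bytecode itself, as raw bytes

# Step 2: list the virtual machine instructions in readable form.
instruction_names = [ins.opname for ins in dis.get_instructions(code_object)]

# Step 3: let the virtual machine execute the compiled bytecode.
namespace = {"price": 3, "quantity": 4}
exec(code_object, namespace)

print(instruction_names)   # names vary between Python versions
print(namespace["total"])  # 3 * 4
```

The same code object can be executed again with different values in `namespace` without recompiling the source, which mirrors the compile-once, run-on-a-virtual-machine model described above.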

Binary object code

  • Binary object code is a portion of machine code that has not yet been linked into a complete program or executable
    • The linker is a program that takes objects (such as object code, libraries, etc.) and links them together along with mapping memory addresses
    • Static linking is done at build time. Dynamic linking is done by the OS linker when the program is loaded or run, which saves re-compiling if shared libraries change. Windows’ model is the DLL (dynamic-link library).
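
As a rough illustration of what "linking objects together and mapping memory addresses" means, here is a toy static linker. Everything in it is an assumption made for the example: the module layout (`code` and `defines`), the `@name` notation for an unresolved reference, and the instruction names are all hypothetical and do not match any real object file format.

```python
def link(modules):
    """Join object modules into one image and resolve symbol references."""
    symbols = {}
    address = 0
    # Pass 1: give every module a base address and record where each
    # symbol it defines will end up in the final image.
    for module in modules:
        for name, offset in module["defines"].items():
            if name in symbols:
                raise NameError(f"symbol {name!r} defined twice")
            symbols[name] = address + offset
        address += len(module["code"])

    # Pass 2: copy the code, replacing each "@name" reference with the
    # absolute address found in pass 1.
    image = []
    for module in modules:
        for word in module["code"]:
            if isinstance(word, str) and word.startswith("@"):
                name = word[1:]
                if name not in symbols:
                    raise NameError(f"unresolved symbol {name!r}")
                image.append(symbols[name])
            else:
                image.append(word)
    return image, symbols


# Two hypothetical object modules: a main program and a small library.
main_object = {"code": ["CALL", "@square", "HALT"], "defines": {"main": 0}}
library_object = {"code": ["MUL", "RET"], "defines": {"square": 0}}

image, symbols = link([main_object, library_object])
print(image)    # the reference to square is now a concrete address
print(symbols)
```

Because the library is copied into the image here, this models static linking. A dynamic linker would leave the reference unresolved in the executable and patch it when the program is loaded.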

Compilation process

  • 4 Main stages:
    • Lexical analysis (code split into tokens)

      • The extraction (scanning) of individual words, or lexemes, from an input stream (the source code), passing the corresponding tokens on to the parser
      • The source code is scanned and split into tokens. A token is a word or symbol in source code e.g.
        • Identifiers: count, x, sum, etc
        • Operators: +, /, =, :=, etc.
      • Spaces, line breaks and comments are removed
      • Unrecognised symbols (not belonging to the language’s vocabulary) will halt the compilation process
      • TLDR: Figure out what is present in the source code by using the keyword table and generating a symbol table to generate a tokenised version of the source code
        • Strip out all the non-essential parts
    • Syntax analysis (tokens used to build abstract syntax tree)

      • Checks that the tokens appear in the order expected by the rules (syntax) of the language, building the syntax tree as it goes. Tokens that break the rules are reported as syntax errors
    • Code generation (Syntax tree converted to low-level code)

      • Transforms the annotated syntax tree into object code, either as three-address code or as code for a virtual machine (bytecode)
        • Note: Object code can be intermediate code or machine code / assembly code. However, no true optimisation has been carried out
      • A basic level of optimisation can occur at this stage
      • TLDR: Code generation converts the syntax tree into machine code or intermediate code, with the ability to link to shared libraries, etc.
    • Optimisation (Code made more efficient (memory / speed))

      • Analyses the generated code to see if memory, CPU, IO, or other resources can be optimised. This is a challenging task and extremely rare in interpreters
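
The four stages above can be sketched end to end for a tiny arithmetic language. This is a minimal illustration under stated assumptions, not how a production compiler is built: the token categories, the tuple-based tree, the `$t1`-style temporary names, and the choice of constant folding as the only optimisation are all made up for the example.

```python
import operator
import re

# Token patterns for a tiny, hypothetical expression language.
TOKEN_SPEC = [
    ("NUMBER", r"\d+"),
    ("IDENT", r"[A-Za-z_]\w*"),
    ("OP", r"[+\-*/()]"),
    ("SKIP", r"\s+|#[^\n]*"),  # spaces, line breaks and comments
    ("ERROR", r"."),           # anything outside the vocabulary
]
TOKEN_RE = re.compile("|".join(f"(?P<{n}>{p})" for n, p in TOKEN_SPEC))

def lex(source):
    """Lexical analysis: split the source into (kind, text) tokens."""
    tokens = []
    for match in TOKEN_RE.finditer(source):
        kind, text = match.lastgroup, match.group()
        if kind == "ERROR":
            raise SyntaxError(f"unrecognised symbol {text!r}")
        if kind != "SKIP":
            tokens.append((kind, text))
    return tokens

def parse(tokens):
    """Syntax analysis: build a syntax tree or reject the token sequence."""
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else ("EOF", "")

    def advance():
        nonlocal pos
        token = peek()
        pos += 1
        return token

    def factor():
        kind, text = advance()
        if kind == "NUMBER":
            return ("num", int(text))
        if kind == "IDENT":
            return ("var", text)
        if text == "(":
            node = expr()
            if advance()[1] != ")":
                raise SyntaxError("expected ')'")
            return node
        raise SyntaxError(f"unexpected token {text!r}")

    def binary(operand, ops):
        node = operand()
        while peek()[1] in ops:
            node = ("bin", advance()[1], node, operand())
        return node

    def term():
        return binary(factor, ("*", "/"))

    def expr():
        return binary(term, ("+", "-"))

    tree = expr()
    if peek()[0] != "EOF":
        raise SyntaxError(f"unexpected token {peek()[1]!r}")
    return tree

def generate(tree):
    """Code generation: flatten the tree into three-address instructions."""
    code = []

    def emit(node):
        if node[0] in ("num", "var"):
            return node[1]
        _, op, left, right = node
        a, b = emit(left), emit(right)
        temp = f"$t{len(code) + 1}"  # "$" cannot appear in an identifier
        code.append((temp, a, op, b))
        return temp

    return code, emit(tree)

def optimise(code, result):
    """Optimisation: constant folding on the generated instructions."""
    fold = {"+": operator.add, "-": operator.sub, "*": operator.mul}
    known, kept = {}, []
    for temp, a, op, b in code:
        a, b = known.get(a, a), known.get(b, b)
        if op in fold and isinstance(a, int) and isinstance(b, int):
            known[temp] = fold[op](a, b)  # computed at compile time
        else:
            kept.append((temp, a, op, b))
    return kept, known.get(result, result)

tokens = lex("count + 2 * 3  # comments are stripped")
tree = parse(tokens)
code, result = generate(tree)
optimised, optimised_result = optimise(code, result)
```

For the sample input, `code` holds two instructions (`$t1 = 2 * 3`, then `$t2 = count + $t1`), and after folding only `$t2 = count + 6` remains. Division is deliberately left unfolded here to avoid dealing with division by zero at compile time.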

Symbol Table

  • Identifiers are also stored in the symbol table and used in various stages of translation
    • This is a table that stores information on identifiers (names of variables, constants, functions & subroutines, objects, etc.)
  • The symbol table maintains an entry for each identifier in the following format:
    • <symbol name, token value (ID), type, attribute> e.g. <Count, A2, integer, variable>
  • The symbol table exists for many reasons, including:
    • Structured storage of all identifiers (entities) in one place
    • To verify if a variable has been declared
    • To implement type checking (semantic checking of data type, etc.)
    • To determine scope of an identifier
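
The uses listed above (checking declarations, type checking, and scope) can be sketched with a small symbol table. This is a minimal sketch, assuming one table per scope with a link to the enclosing scope; the class and method names are hypothetical, and the second entry (`Total`, `A3`) is invented to sit alongside the `<Count, A2, integer, variable>` example from the notes.

```python
from dataclasses import dataclass

@dataclass
class Symbol:
    name: str
    token_id: str   # e.g. "A2"
    type: str       # e.g. "integer"
    attribute: str  # e.g. "variable"

class SymbolTable:
    def __init__(self, parent=None):
        self.entries = {}
        self.parent = parent  # enclosing scope, if any

    def declare(self, name, token_id, type_, attribute):
        """Add an identifier to this scope, rejecting duplicates."""
        if name in self.entries:
            raise NameError(f"{name!r} already declared in this scope")
        self.entries[name] = Symbol(name, token_id, type_, attribute)

    def lookup(self, name):
        """Search this scope, then each enclosing scope in turn."""
        scope = self
        while scope is not None:
            if name in scope.entries:
                return scope.entries[name]
            scope = scope.parent
        raise NameError(f"{name!r} has not been declared")

    def check_assignment(self, name, value_type):
        """Very simple type check: the types must match exactly."""
        symbol = self.lookup(name)
        if symbol.type != value_type:
            raise TypeError(f"cannot assign {value_type} to {symbol.type} {name!r}")
        return symbol

global_scope = SymbolTable()
global_scope.declare("Count", "A2", "integer", "variable")

function_scope = SymbolTable(parent=global_scope)
function_scope.declare("Total", "A3", "real", "variable")

print(function_scope.lookup("Count"))  # found in the enclosing scope
```

Looking up `Total` from `global_scope` raises `NameError`, because the inner scope is not visible from outside, and `check_assignment("Count", "string")` raises `TypeError`.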

Tokens, Keywords & Symbols

  • A token is any valid word / symbol within the source code
    • Each unique token is given a hexadecimal token ID (value)
  • A symbol

Interpreters VS Compilers

Compiler Characteristics | Interpreter Characteristics
Works on the complete program at once. Takes the entire program as input. | Works line-by-line. Takes one statement at a time as input.
Generates object code or machine code. | Does not generate object code or machine code.
Executes conditional control statements (like if-else and switch-case) faster. | Executes conditional control statements at a much slower rate.
Compiled programs take more memory because the entire object code has to reside in memory. | More memory efficient as it does not generate intermediate object code.
Compile once and run anytime. Does not need to be compiled every time. | Interpreted line-by-line every time they are run.
Errors are reported after the entire program is checked for syntactical and other errors. | Error is reported as soon as the first error is encountered. Rest of the program will not be checked until the existing error is removed.
Does not allow a program to run until it is completely error-free. | Runs the program from the first line and stops execution only if it encounters an error.
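
The difference in error reporting can be imitated in a few lines. This is a toy model only: the two function names are hypothetical, Python's own `compile` and `exec` stand in for a real compiler and interpreter, and `print` is replaced with a function that records output so the two runs can be compared.

```python
# A three-line program whose last line has a syntax error (missing ")").
program = [
    "print('step 1')",
    "print('step 2')",
    "print('step 3'",
]

def make_environment(output):
    """Globals in which print() records its text instead of displaying it."""
    return {"print": lambda *args: output.append(" ".join(map(str, args)))}

def run_compiled(lines):
    """Compiler style: check the whole program first, then run it."""
    output = []
    try:
        code = compile("\n".join(lines), "<program>", "exec")
    except SyntaxError:
        return output, "syntax error found, nothing was run"
    exec(code, make_environment(output))
    return output, None

def run_interpreted(lines):
    """Interpreter style: translate and run one statement at a time."""
    output = []
    environment = make_environment(output)
    for number, line in enumerate(lines, start=1):
        try:
            exec(line, environment)
        except Exception:
            return output, f"error on line {number}"
    return output, None

print(run_compiled(program))     # no output produced before the error
print(run_interpreted(program))  # first two lines ran before the error
```

With the compiler-style run, the error is found before anything executes, so the output list is empty. With the interpreter-style run, the first two statements execute and the error is only reported on reaching line 3.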