A compiler transforms source code written in a high-level programming language (like C) into executable machine code that can be directly understood by the computer’s processor. This transformation is complex and involves multiple stages, each with its own theoretical foundations and practical implementations.
Before diving into the specific components of our compiler project, let’s establish some foundational concepts.
A typical compiler operates in several phases:
- Lexical analysis (scanning): grouping characters into tokens
- Syntax analysis (parsing): checking the token stream against the grammar and building a parse tree
- Semantic analysis: type checking and symbol-table management
- Intermediate code generation
- Optimization
- Target code generation
Now, let’s explore each component of our compiler in detail.
Lexical analysis is based on the theory of regular languages and finite automata. Regular expressions define patterns for tokens, and these patterns are converted into finite state machines that can efficiently recognize tokens in the input stream.
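For intuition, the identifier pattern used later (a letter or underscore followed by letters, digits, or underscores) corresponds to a simple two-state automaton. The following stand-alone C++ sketch (hypothetical, not part of the project) hand-codes that automaton to show the kind of recognizer Flex builds automatically from a regular expression:
#include <cctype>
#include <string>

// Minimal sketch (not project code): a two-state automaton for the
// identifier pattern [A-Za-z_][A-Za-z0-9_]*.
bool isIdentifier(const std::string &lexeme) {
    if (lexeme.empty()) return false;
    // State 1: the first character must be a letter or underscore.
    if (!(std::isalpha(static_cast<unsigned char>(lexeme[0])) || lexeme[0] == '_'))
        return false;
    // State 2: every remaining character must be a letter, digit, or underscore.
    for (std::size_t i = 1; i < lexeme.size(); ++i) {
        unsigned char c = static_cast<unsigned char>(lexeme[i]);
        if (!(std::isalnum(c) || c == '_'))
            return false;
    }
    return true;  // accepting state reached at end of input
}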
The lexical analyzer (or scanner) reads the source code character by character and groups the characters into meaningful units called tokens. Each token has:
- a type (keyword, identifier, constant, operator, and so on)
- an attribute value (for example, the text of an identifier or the value of a constant)
Our project uses Flex (a modern version of the classic Lex tool) to generate the lexical analyzer. Here’s an in-depth look at how we define tokens:
%option noyywrap
%{
#include "LexUtils.h"
#include<iostream>
#include<string.h>
#include<stdio.h>
using namespace std;
FILE *input;
%}
NEWLINE \r?\n
NOT_NEWLINE ([^\r\n])
WHITESPACE [ \t\f\r\v]+
LETTER [a-zA-Z]
DIGIT [0-9]
ALPHABET {LETTER}|_
ALPHANUM {ALPHABET}|{DIGIT}
ID {ALPHABET}({ALPHANUM})*
INTEGER {DIGIT}+
FLOAT ({INTEGER})?(\.?{DIGIT}*)([Ee][+-]?{INTEGER})?
%%
{COMMENT} |
{STAR_COMMENT} {handle_comment(yytext);}
"if" {return IF;}
"else" {return ELSE;}
"for" {return FOR;}
// More keywords...
{INTEGER} {handle_const_int(yytext); return CONST_INT;}
{FLOAT} {handle_const_float(yytext); return CONST_FLOAT;}
{ID} {handle_id(yytext); return ID;}
{ADDOP} {handle_operator(yytext, "ADDOP"); return ADDOP;}
// More operators...
{NEWLINE} {line_count++;}
{WHITESPACE} {}
. {handle_error(yytext, "Unrecognized character");}
%%
Let’s break down what’s happening here:
- The %{...%} section contains C/C++ code that will be included in the generated scanner.
- Definitions like LETTER [a-zA-Z] create named patterns.
- In the rules section (after the first %%), we define what action to take when a pattern is matched.
For example, when the scanner encounters "if", it returns the token IF to the parser. When it finds an identifier that matches the pattern {ID}, it calls handle_id(yytext) and returns ID.
The functions like handle_const_int()
and handle_id()
create SymbolInfo objects for the tokens and set their attributes. These objects are then passed to the parser.
When the scanner recognizes a token, it creates a SymbolInfo object with the token’s attributes:
void handle_const_int(char *str) {
string s(str);
assignSymbol("CONST_INT");
}
void assignSymbol(string type) {
yylval.symbol = new SymbolInfo((string)yytext, type);
}
This creates a new SymbolInfo object with the token’s text and type, and assigns it to yylval.symbol
, which the parser will access.
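handle_id() itself is not listed in this excerpt; presumably it follows the same pattern as handle_const_int(), perhaps with extra bookkeeping for the symbol table. A minimal sketch (the body is an assumption):
// Hypothetical sketch -- the real handle_id() is not shown. It likely just
// wraps the lexeme in a SymbolInfo, as handle_const_int() does above.
void handle_id(char *str) {
    assignSymbol("ID");
}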
Parsing is based on the theory of context-free grammars (CFGs) and pushdown automata. A CFG consists of:
- a set of terminal symbols (the tokens produced by the scanner)
- a set of non-terminal symbols (syntactic categories such as expression or statement)
- a set of production rules describing how each non-terminal can be expanded
- a start symbol
The parser uses these rules to determine if the input sequence of tokens forms a valid program according to the grammar, and if so, to build a parse tree or abstract syntax tree (AST).
There are two main parsing approaches:
- Top-down parsing (e.g., recursive descent and LL parsers), which grows the parse tree from the start symbol toward the tokens
- Bottom-up parsing (e.g., LR and LALR parsers), which grows the tree from the tokens toward the start symbol
Our project uses Bison (a modern version of the classic Yacc tool) to generate the parser. Bison generates LALR(1) parsers, which are a type of bottom-up parser.
Here’s a detailed look at our grammar definition:
%{
#ifndef PARSER
#define PARSER
#include "ParserUtils.h"
#endif
int yyparse(void);
int yylex(void);
extern FILE *yyin;
extern SymbolTable st;
%}
%union{
SymbolInfo* symbol;
}
%token IF FOR DO INT FLOAT VOID SWITCH DEFAULT ELSE WHILE BREAK CHAR DOUBLE RETURN CASE CONTINUE
%token INCOP DECOP NOT
%token LPAREN RPAREN LCURL RCURL LTHIRD RTHIRD COMMA SEMICOLON
%token PRINTLN
%token STRING
%token <symbol> ID
%token <symbol> CONST_INT
%token <symbol> CONST_FLOAT
%token <symbol> CONST_CHAR
%token <symbol> ADDOP
%token <symbol> MULOP
%token <symbol> LOGICOP
%token <symbol> RELOP
%token <symbol> BITOP
%token <symbol> ASSIGNOP
%type <symbol> program
%type <symbol> unit
// More non-terminal types...
%nonassoc second_prec
%nonassoc ELSE
%%
start : program
{
if (!error_count){
cout<<"No error, generating assembly code"<<endl;
addDataSegment();
startCodeSegment();
printCode($$);
addPrintFunc();
endCodeSegment();
optimize();
}
}
;
program : program unit
{
$$ = new SymbolInfo($1->getName() + "\n" + $2->getName(), "NON_TERMINAL");
$$->setCode($1->getCode() + "\n" +$2->getCode());
printRule("program : program unit");
printSymbol($$);
}
| unit
{
$$ = $1;
$$->setCode($1->getCode());
printRule("program : unit");
printSymbol($$);
st.printAll();
}
;
// More grammar rules...
%%
Let’s examine what’s happening here:
- The %{...%} section contains C/C++ code that will be included in the generated parser.
- The %union declaration specifies the type of semantic values (in our case, SymbolInfo pointers).
- The %token and %type declarations specify the types of tokens and non-terminals.
- The section after the second %% contains the grammar rules, with semantic actions enclosed in braces.
Let's look at a more complex rule, the if-statement:
statement : IF LPAREN expression RPAREN statement %prec second_prec
{
$$ = handle_if($3, $5);
printRule("IF LPAREN expression RPAREN statement");
printSymbol($$);
}
| IF LPAREN expression RPAREN statement ELSE statement
{
$$ = handle_if_else($3, $5, $7);
printRule("IF LPAREN expression RPAREN statement ELSE statement");
printSymbol($$);
}
This rule says that a statement can be an if-statement without an else clause (IF LPAREN expression RPAREN statement
) or an if-else statement (IF LPAREN expression RPAREN statement ELSE statement
).
The %prec second_prec
part resolves the classic “dangling else” ambiguity by specifying that this rule has lower precedence than the ELSE token.
The semantic actions (code in braces) create a new SymbolInfo object for the statement, generate code for it, and log the rule application for debugging.
As the parser recognizes grammatical constructs, it builds a parse tree (implicitly, through the semantic values). Each node in the tree is a SymbolInfo object that carries the construct's textual name, its type, and the assembly code generated for it so far.
A symbol table is a data structure used by compilers to keep track of identifiers (variables, functions, etc.) and their attributes (type, scope, etc.). It must support operations like:
- inserting a new symbol
- looking up a symbol by name
- entering a new scope and exiting the current one
Symbol tables can be implemented using various data structures like hash tables, trees, or lists, with hash tables being the most common due to their efficiency.
Our project uses a sophisticated symbol table that supports multiple scopes and has specialized handling for different types of symbols:
class SymbolTable {
ScopeTable *current = NULL;
ofstream *log;
public:
SymbolTable(ofstream *log) {
this->log = log;
enterScope();
}
void enterScope(int buckets = SYMBOL_TABLE_SIZE) {
ScopeTable *st = new ScopeTable(buckets, current, log);
current = st;
*log << "\nNew ScopeTable #" << st->getID() << " created" << endl;
}
void exitScope() {
if (current == NULL) {
return;
}
ScopeTable *temp = current;
current = current->getParentScope();
delete temp;
}
bool insertSymbol(string name, string type) {
if (current == NULL) {
return false;
}
return current->insertSymbol(name, type);
}
bool insertSymbol(SymbolInfo* symbol) {
if (current == NULL) {
return false;
}
// A valid scope exists, so it is safe to tag variables with its ID.
if (symbol->getIdType() == "VARIABLE" || symbol->getIdType() == "ARRAY")
symbol->setAsmVar(symbol->getName()+current->getID());
return current->insertSymbol(symbol);
}
SymbolInfo *lookup(string name) {
if (current == NULL) {
return NULL;
}
ScopeTable *temp = current;
SymbolInfo *symbol = NULL;
while (temp != NULL) {
symbol = temp->lookUp(name);
if (symbol != NULL) {
return symbol;
}
temp = temp->getParentScope();
}
return NULL;
}
// More methods...
};
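The ScopeTable class that SymbolTable delegates to is not shown above. A minimal sketch of what it might look like, assuming a chained hash table keyed on the symbol name (the hash function, the string scope ID, and the getNext/setNext accessors on SymbolInfo are all assumptions):
// Hypothetical sketch of ScopeTable -- the real class is not shown in the excerpt.
class ScopeTable {
    int numBuckets;
    vector<SymbolInfo *> buckets;  // each bucket heads a singly linked chain
    ScopeTable *parent;            // enclosing scope, NULL for the global scope
    ofstream *log;
    string id;                     // printed in the log as the scope's identifier
    int hashOf(const string &name) const {
        unsigned long h = 0;
        for (char c : name) h = h * 31 + (unsigned char)c;
        return (int)(h % numBuckets);
    }
public:
    ScopeTable(int n, ScopeTable *parentScope, ofstream *logFile)
        : numBuckets(n), buckets(n, NULL), parent(parentScope), log(logFile) {
        static int serial = 0;
        id = to_string(++serial);
    }
    string getID() const { return id; }
    ScopeTable *getParentScope() const { return parent; }
    SymbolInfo *lookUp(const string &name) {
        for (SymbolInfo *cur = buckets[hashOf(name)]; cur != NULL; cur = cur->getNext())
            if (cur->getName() == name) return cur;
        return NULL;
    }
    bool insertSymbol(SymbolInfo *symbol) {
        if (lookUp(symbol->getName()) != NULL) return false;  // already declared in this scope
        int b = hashOf(symbol->getName());
        symbol->setNext(buckets[b]);  // push onto the front of the chain
        buckets[b] = symbol;
        return true;
    }
    bool insertSymbol(const string &name, const string &type) {
        return insertSymbol(new SymbolInfo(name, type));
    }
};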
A key aspect of the symbol table is scope management. When the parser enters a new block (e.g., a function body or a compound statement), it creates a new scope table:
inline void enterScope() {
st.enterScope();
if (currentFunction != "") {
SymbolInfo *funcVal = st.lookup(currentFunction);
for (param p : paramList) {
SymbolInfo *sym = new SymbolInfo(p.name, "ID");
sym->setIdType("VARIABLE");
for (auto &c : p.type)
c = toupper(c);
sym->setVarType(p.type);
insertSymbol(sym);
asmVarList.push_back(sym->getAsmVar());
funcVal->paramSymList.push_back(sym);
}
if (currentFunction != "main") {
// Set up function prologue and epilogue
funcVal->setFuncStart(funcVal->getName() + " PROC\n");
funcVal->setFuncStart(funcVal->getFuncStart() + "\n" + "POP return_loc");
for (int i = funcVal->paramSymList.size() - 1; i >= 0; --i) {
funcVal->setFuncStart(funcVal->getFuncStart() + "\nPOP " +
funcVal->paramSymList[i]->getAsmVar());
}
funcVal->setFuncStart(funcVal->getFuncStart() + "\nPUSH BX\nPUSH DX\n");
funcVal->setFuncEnd(funcVal->funcEndLabel + ": \n" +
"POP DX\nPOP BX\nPUSH return_loc\nRET\n" +
funcVal->getName() + " ENDP\n\n");
}
paramList.clear();
}
}
When it exits a block, it removes the current scope table:
inline void exitScope() {
st.printAll();
st.exitScope();
}
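As a quick, hypothetical illustration (not project code) of how these calls interact, a driver exercising the table might look like this:
// Hypothetical driver -- shows the scoping behaviour only.
void demoScopes() {
    ofstream logFile("scope_log.txt");
    SymbolTable table(&logFile);            // the constructor already enters the global scope
    table.insertSymbol("x", "ID");          // x lives in the global scope
    table.enterScope();                     // entering a compound statement
    table.insertSymbol("y", "ID");          // y is visible only inside this block
    SymbolInfo *outer = table.lookup("x");  // found: lookup walks up through parent scopes
    table.exitScope();                      // leaving the block discards y's scope
    SymbolInfo *inner = table.lookup("y");  // NULL: y is no longer visible
}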
Each symbol in the table is represented by a SymbolInfo object, which contains detailed information about the symbol:
class SymbolInfo {
string name = "";
string type = "";
string idType = ""; // FUNCTION, VARIABLE, ARRAY
string varType = ""; // INT, FLOAT, VOID
string returnType = ""; // INT, FLOAT, VOID
bool funcDefined = false;
int arrSize = 0;
int arrIndex = 0;
int defaultInt = -1;
float defaultFloat = -1.0;
SymbolInfo *next;
string code = " ";
string asmVar = "";
bool isConst = false;
string funcStart;
string funcEnd;
vector<string> asmVarList;
public:
vector<int> intData;
vector<float> floatData;
bool isDummy = false;
SymbolInfo *real = NULL;
vector<SymbolInfo *> paramSymList;
vector<SymbolInfo *> varList;
string funcEndLabel = "";
string arrAsmVar = "";
bool isFunctionCall = false;
// Methods...
};
This rich structure allows the compiler to handle complex language features like arrays, functions with parameters, and type checking.
Type checking ensures that operations are applied to compatible operands. For example, you can't add a string to a number or call a non-function. Type systems can be:
- static (checked at compile time) or dynamic (checked at run time)
- strong (few implicit conversions) or weak (many implicit conversions)
- nominal (compatibility based on declared names) or structural (compatibility based on shape)
C has a static, relatively weak, nominal type system.
Our compiler performs type checking during parsing. For example, when handling binary operations:
inline SymbolInfo *handleADDOP(SymbolInfo *sym1, SymbolInfo *op, SymbolInfo *sym2) {
SymbolInfo *result = new SymbolInfo("", "");
if (sym1->getVarType() == "VOID" || sym2->getVarType() == "VOID") {
printError("Operand of void type");
return nullSym();
}
if (sym1->getVarType() == "FLOAT" || sym2->getVarType() == "FLOAT") {
result->setVarType("FLOAT");
} else {
result->setVarType("INT");
}
result->setIdType("VARIABLE");
// Code generation...
return result;
}
This function checks that neither operand is void, determines the result type based on the operand types, and generates appropriate code.
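The code-generation part of handleADDOP is elided above. Judging from the patterns used in the other handlers (getTempVar, addCode, setAsmVar), it plausibly resembles the following sketch; the register choice and exact helper usage are assumptions:
// Hypothetical continuation of handleADDOP -- the real code is elided above.
result->setCode(sym1->getCode() + "\n" + sym2->getCode());
result->addCode("MOV AX, " + sym1->getAsmVar());
if (op->getName() == "+")
    result->addCode("ADD AX, " + sym2->getAsmVar());
else
    result->addCode("SUB AX, " + sym2->getAsmVar());
string tempVar = vm.getTempVar();              // grab (or reuse) a temporary
result->addCode("MOV " + tempVar + ", AX");
result->setAsmVar(tempVar);                    // later rules refer to the result by this name
vm.freeTempVar(sym1->getAsmVar());             // only names starting with "temp" are recycled
vm.freeTempVar(sym2->getAsmVar());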
Intermediate code is a representation of the program that's lower-level than the source code but higher-level than machine code. Common forms include:
- abstract syntax trees (ASTs)
- three-address code (often stored as quadruples or triples)
- static single assignment (SSA) form
- stack-based bytecode
Our compiler generates a form of three-address code, which is then translated to x86 assembly.
An important aspect of code generation is register allocation: deciding which values to keep in registers and which to store in memory. Our compiler uses a simple approach where temporary values are assigned to registers or memory locations:
class VarManager {
int size = 0;
stack<string> free;
public:
string getTempVar() {
string tempVar;
if (free.empty()) {
tempVar = "temp" + to_string(size);
size++;
asmVarList.push_back(tempVar);
} else {
tempVar = free.top();
free.pop();
}
return tempVar;
}
void freeTempVar(string tempVar) {
if (tempVar.substr(0, 4) == "temp") {
free.push(tempVar);
}
}
int getSize() { return size; }
};
This class manages temporary variables, reusing them when possible to minimize memory usage.
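A short, hypothetical usage illustration (assuming the global asmVarList that getTempVar appends to is in scope):
// Hypothetical illustration -- not project code; shows how temporaries are recycled.
void demoTempVars() {
    VarManager vm;
    string t0 = vm.getTempVar();   // "temp0" -- newly created
    string t1 = vm.getTempVar();   // "temp1" -- newly created
    vm.freeTempVar(t0);            // temp0 goes back onto the free stack
    string t2 = vm.getTempVar();   // reuses "temp0" instead of allocating "temp2"
}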
Let’s look at how our compiler generates code for various language constructs:
inline SymbolInfo *getArrayIndexVar(SymbolInfo *arr, SymbolInfo *index) {
SymbolInfo *arrIdxVar = st.lookup(arr->getName());
SymbolInfo *var;
// Error checking...
var = new SymbolInfo(*arrIdxVar);
var->setArrIndex(index->getIntValue());
var->setName(arr->getName() + "[" + to_string(index->getIntValue()) + "]");
var->setReal(arrIdxVar);
var->addCode(index->getCode());
var->addCode("MOV BX, " + index->getAsmVar());
var->addCode("SHL BX, 1"); // Multiply by 2 (size of int)
if (SIorBX) {
var->addCode("MOV SI, BX");
var->setAsmVar(arrIdxVar->getAsmVar() + "[SI]");
} else {
var->setAsmVar(arrIdxVar->getAsmVar() + "[BX]");
}
vm.freeTempVar(index->getAsmVar());
var->arrAsmVar = var->getAsmVar();
SIorBX = !SIorBX; // Alternate between BX and SI for nested array accesses
return var;
}
This function generates code to access an array element. It computes the byte offset (SHL BX, 1
multiplies by 2 since each int is 2 bytes) and uses either BX or SI as the index register to allow for nested array accesses.
inline SymbolInfo *handle_assign(SymbolInfo *sym1, SymbolInfo *sym2) {
SymbolInfo *result;
// Type checking...
result = new SymbolInfo(*sym1);
result->setName(sym1->getName() + "=" + sym2->getName());
result->setIdType("VARIABLE");
// Generate code
if (sym2->getIsConst()) {
result->setCode(sym1->getCode() + "\n" + sym2->getCode() + "\n" +
constToMem(sym1, sym2));
} else if (sym1->isArray()) {
result->setCode(sym2->getCode());
result->addCode(sym1->getCode());
result->addCode("MOV AX, " + sym2->getAsmVar());
result->addCode("MOV " + sym1->arrAsmVar + ", AX");
} else {
result->setCode(sym1->getCode() + "\n" + sym2->getCode() + "\n" +
memToMem(sym1, sym2));
}
result->setAsmVar(sym1->getAsmVar());
vm.freeTempVar(sym2->getAsmVar());
return result;
}
This function generates code for assignment statements. It handles different cases (constant values, arrays, variables) and generates appropriate MOV instructions.
inline SymbolInfo *handle_if_else(SymbolInfo *exp, SymbolInfo *ifstmnt, SymbolInfo *elsestmnt) {
SymbolInfo *result =
new SymbolInfo("if(" + exp->getName() + ")" + ifstmnt->getName() +
"else " + elsestmnt->getName(),
"NON_TERMINAL");
string label1 = newLabel(); // Label for else part
string label2 = newLabel(); // Label for end of if-else
result->setCode(exp->getCode());
result->addCode("MOV AX, " + exp->getAsmVar());
result->addCode("CMP AX, 1");
result->addCode("JNE " + label1); // Jump to else if condition is false
result->addCode(ifstmnt->getCode());
result->addCode("JMP " + label2); // Skip else part
result->addCode(label1 + ":"); // Start of else part
result->addCode(elsestmnt->getCode());
result->addCode(label2 + ":\n"); // End of if-else
vm.freeTempVar(exp->getAsmVar());
return result;
}
This function generates code for if-else statements. It creates two labels (one for the else part, one for the end), generates code to evaluate the condition and jump accordingly, and includes the code for both branches.
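The simpler handle_if used in the grammar rule earlier is not listed. It presumably mirrors handle_if_else with a single label; a sketch under that assumption:
// Hypothetical sketch of handle_if -- the real function is not shown in the excerpt.
inline SymbolInfo *handle_if(SymbolInfo *exp, SymbolInfo *ifstmnt) {
    SymbolInfo *result = new SymbolInfo(
        "if(" + exp->getName() + ")" + ifstmnt->getName(), "NON_TERMINAL");
    string label = newLabel();                // label for the end of the if
    result->setCode(exp->getCode());
    result->addCode("MOV AX, " + exp->getAsmVar());
    result->addCode("CMP AX, 1");
    result->addCode("JNE " + label);          // skip the body if the condition is false
    result->addCode(ifstmnt->getCode());
    result->addCode(label + ":\n");           // end of if
    vm.freeTempVar(exp->getAsmVar());
    return result;
}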
inline SymbolInfo *handle_for(SymbolInfo *init, SymbolInfo *termination, SymbolInfo *inc, SymbolInfo *statement) {
SymbolInfo *result =
new SymbolInfo("for(" + init->getName() + termimation->getName() +
inc->getName() + ")" + statement->getName(),
"NON_TERMINAL");
string loop = newLabel(); // Start of loop
string loopExit = newLabel(); // End of loop
result->addCode(";for loop start");
result->addCode(init->getCode()); // Initialization
result->addCode(loop + ":"); // Loop start label
result->addCode(termination->getCode()); // Condition
result->addCode("MOV AX, " + termination->getAsmVar());
result->addCode("CMP AX, 0");
result->addCode("JE " + loopExit); // Exit if condition is false
result->addCode(statement->getCode()); // Loop body
result->addCode(inc->getCode()); // Increment
result->addCode("JMP " + loop); // Jump back to start
result->addCode(loopExit + ":"); // Loop exit label
result->addCode(";for loop end");
return result;
}
This function generates code for for-loops. It creates two labels (loop start and loop exit), generates code for initialization, condition checking, loop body, and increment, and adds jumps to create the loop structure.
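The token list also declares WHILE; a while-loop handler would follow the same label-and-jump pattern as handle_for, just without the initialization and increment parts. A sketch (not the project's actual code, helper names assumed to be the same):
// Hypothetical sketch of a while-loop handler -- not shown in the excerpt.
inline SymbolInfo *handle_while(SymbolInfo *condition, SymbolInfo *statement) {
    SymbolInfo *result = new SymbolInfo(
        "while(" + condition->getName() + ")" + statement->getName(), "NON_TERMINAL");
    string loop = newLabel();                     // start of loop
    string loopExit = newLabel();                 // end of loop
    result->addCode(loop + ":");                  // re-evaluate the condition every iteration
    result->addCode(condition->getCode());
    result->addCode("MOV AX, " + condition->getAsmVar());
    result->addCode("CMP AX, 0");
    result->addCode("JE " + loopExit);            // exit when the condition is false
    result->addCode(statement->getCode());        // loop body
    result->addCode("JMP " + loop);               // jump back to the condition
    result->addCode(loopExit + ":");
    vm.freeTempVar(condition->getAsmVar());
    return result;
}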
inline SymbolInfo *handle_function(SymbolInfo *funcVal, SymbolInfo *argList) {
SymbolInfo *func = st.lookup(funcVal->getName());
SymbolInfo *sym = new SymbolInfo(
funcVal->getName() + "(" + argList->getName() + ")", "NON_TERMINAL");
// Error checking...
sym->addCode(argList->getCode());
// Save state for recursive calls
for (int i = 0; i < func->paramSymList.size(); i++) {
sym->addCode("PUSH " + func->paramSymList[i]->getAsmVar());
}
for (int i = 0; i < func->varList.size(); i++) {
sym->addCode("PUSH " + func->varList[i]->getAsmVar());
}
for (int i = 0; i < vm.getSize(); i++) {
sym->addCode("PUSH temp" + to_string(i));
}
// Push return address and arguments
sym->addCode("PUSH return_loc");
for (int i = 0; i < func->paramSymList.size(); i++) {
sym->addCode("PUSH " + asmArgList[i]);
vm.freeTempVar(asmArgList[i]);
}
asmArgList.clear();
// Call function
sym->addCode("CALL " + funcVal->getName());
sym->addCode("POP return_loc");
// Restore saved state
for (int i = vm.getSize() - 1; i >= 0; i--) {
sym->addCode("POP temp" + to_string(i));
}
for (int i = func->varList.size() - 1; i >= 0; i--) {
sym->addCode("POP " + func->varList[i]->getAsmVar());
}
for (int i = func->paramSymList.size() - 1; i >= 0; i--) {
sym->addCode("POP " + func->paramSymList[i]->getAsmVar());
}
sym->isFunctionCall = true;
sym->setAsmVar("CX"); // Return value in CX
sym->setIdType("VARIABLE");
sym->setVarType("INT");
return sym;
}
This function generates code for function calls. It saves the current state (for recursive calls), pushes arguments and the return address, calls the function, and restores the state afterward. The function’s return value is placed in the CX register.
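The matching return-statement handler is not shown either; given this calling convention it would have to move the value into CX and jump to the function's epilogue label. A sketch under those assumptions:
// Hypothetical sketch of a return-statement handler -- not shown in the excerpt.
inline SymbolInfo *handle_return(SymbolInfo *expr) {
    SymbolInfo *result = new SymbolInfo("return " + expr->getName(), "NON_TERMINAL");
    SymbolInfo *func = st.lookup(currentFunction);     // the function being compiled
    result->setCode(expr->getCode());
    result->addCode("MOV CX, " + expr->getAsmVar());   // return value travels in CX
    result->addCode("JMP " + func->funcEndLabel);      // fall into the shared epilogue
    vm.freeTempVar(expr->getAsmVar());
    return result;
}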
x86 assembly language is a low-level programming language specific to the x86 family of processors. It uses:
- a small set of general-purpose registers (AX, BX, CX, DX, SI, DI, and so on)
- mnemonic instructions such as MOV, ADD, CMP, JMP, PUSH, and POP
- program segments (.DATA for variables, .CODE for instructions) and a stack for procedure calls
The data segment contains all variables and constants. Our compiler generates it like this:
inline void addDataSegment() {
code << ".MODEL MEDIUM \n.STACK 100H \n.DATA" << endl << endl;
asmVarList.push_back("return_loc");
for (string s : asmVarList) {
code << s << " DW ?" << endl;
}
for (auto p : asmArrList) {
code << p.first << " DW " << p.second << " DUP (?)" << endl;
}
}
This creates a .DATA
section with all variables (DW ?
for uninitialized words) and arrays (DUP (?)
for arrays of uninitialized words).
The code segment contains the executable instructions. Our compiler generates it like this:
inline void startCodeSegment() { code << endl << ".CODE" << endl; }
inline void endCodeSegment() { code << endl << "END MAIN\n"; }
The .CODE
directive starts the code segment, and END MAIN
marks the end of the program, with MAIN as the entry point.
Each function needs setup and cleanup code:
// Function prologue
funcVal->setFuncStart(funcVal->getName() + " PROC\n");
funcVal->setFuncStart(funcVal->getFuncStart() + "\n" + "POP return_loc");
for (int i = funcVal->paramSymList.size() - 1; i >= 0; --i) {
funcVal->setFuncStart(funcVal->getFuncStart() + "\nPOP " +
funcVal->paramSymList[i]->getAsmVar());
}
funcVal->setFuncStart(funcVal->getFuncStart() + "\nPUSH BX\nPUSH DX\n");
// Function epilogue
funcVal->setFuncEnd(funcVal->funcEndLabel + ": \n" +
"POP DX\nPOP BX\nPUSH return_loc\nRET\n" +
funcVal->getName() + " ENDP\n\n");
The prologue pops the return address and parameters from the stack, and saves registers BX and DX. The epilogue restores the registers, pushes the return address back onto the stack, and returns.
After generating assembly code, our compiler performs several optimizations:
inline bool optimize_mov(string inst) {
inst.erase(std::remove(inst.begin(), inst.end(), ','), inst.end());
vector<string> tokens = split(inst, " ");
if (tokens.size() == 3 && tokens[0] == "MOV") {
string movLhs = tokens[1];
string movRhs = tokens[2];
if (movLhs == prevMovRhs && movRhs == prevMovLhs) {
return true; // Remove redundant MOVs like "MOV AX, BX" followed by "MOV BX, AX"
} else if (movLhs == "BX") {
BXval = movRhs; // Track BX value for arithmetic optimizations
} else if (movLhs == movRhs) {
return true; // Remove MOVs with same source and destination
}
prevMovLhs = movLhs;
prevMovRhs = movRhs;
}
return false;
}
inline bool optimize_arithmetic(string inst) {
inst.erase(std::remove(inst.begin(), inst.end(), ','), inst.end());
vector<string> tokens = split(inst, " ");
if (tokens.size() == 3 && (tokens[0] == "ADD" || tokens[0] == "SUB") &&
tokens[2] == "0") {
return true; // Remove ADDs and SUBs with 0
} else if (tokens.size() == 2 &&
(tokens[0] == "IMUL" || tokens[0] == "IDIV") && BXval == "1") {
return true; // Remove MULs and DIVs by 1
}
return false;
}
inline void optimize() {
ifstream srcFile("code.asm");
string line;
int line_count = 0, line_removed = 0;
vector<string> lines;
while (std::getline(srcFile, line, '\n')) {
lines.push_back(line);
}
log << "-------------------------------------" << endl;
log << "Optimizer log: " << endl;
for (string s : lines) {
line_count++;
if (s.substr(0, 1) == "L") {
// Clear tracking when entering a new label
prevMovLhs = "";
prevMovRhs = "";
}
if (s == " ") {
log << "Removed blank line : " << line_count << endl;
line_removed++;
} else if (optimize_mov(s)) {
log << "Optimized redundant MOV operation: " << line_count << endl;
line_removed++;
} else if (optimize_arithmetic(s)) {
log << "Optimized redundant arithmetic operation : " << line_count
<< endl;
line_removed++;
} else {
optimized << s << endl;
}
}
log << "Line removed:" << line_removed << endl;
log << "-------------------------------------" << endl;
optimized << "END MAIN" << endl;
srcFile.close();
}
These functions identify and remove:
- blank lines
- redundant MOV instructions (a move whose source and destination are identical, or one that immediately undoes the previous move)
- arithmetic no-ops such as adding or subtracting 0 and multiplying or dividing by 1
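One helper these routines rely on, split, is not shown; a minimal whitespace tokenizer would do, since it is only ever called with a space as the delimiter (the exact signature is an assumption):
// Hypothetical sketch -- the real split() used by the optimizer is not shown.
inline vector<string> split(const string &line, const string & /*delim*/) {
    vector<string> tokens;
    stringstream ss(line);      // requires <sstream>
    string tok;
    while (ss >> tok)           // skips any run of whitespace
        tokens.push_back(tok);
    return tokens;
}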
Let’s examine how our compiler handles a more complex example: a recursive function to calculate Fibonacci numbers.
int fib(int n) {
if (n <= 1)
return n;
return fib(n - 1) + fib(n - 2);
}
int main() {
int x;
x = fib(7);
println(x);
}
The compiler will:
1. Generate code for the fib function:
- evaluate the condition n <= 1 and return n when it holds
- compute n - 1 and call fib recursively
- compute n - 2 and call fib recursively
- add the two results and return the sum
2. Generate code for the main function:
- declare the variable x
- call fib(7) and assign the result to x
- print x
The recursive calls are particularly challenging because they require saving and restoring the state of the function for each call. Our compiler handles this by pushing the function's parameters, local variables, and temporaries onto the stack before each call and popping them back afterward, as shown in handle_function above.
Building a compiler is a complex task with many challenges.
Compilers must detect and report errors in the source code. Our compiler checks for semantic problems such as:
- undeclared or redeclared identifiers
- type mismatches (for example, using a void expression as an operand)
- invalid array accesses and invalid function calls
Each error is reported with the offending line number:
inline void printError(string msg) {
error << "Error at line no: " << line_count << " - " << msg << endl;
log << "Error at line no: " << line_count << " - " << msg << endl;
error_count++;
}
inline void yyerror(const char *s) { printError(s); }
Compilers create many temporary objects during compilation, which can lead to memory leaks if not properly managed. Our compiler uses a VarManager class to reuse temporary variables and functions to free resources when they’re no longer needed.
Generating efficient code requires careful consideration of register allocation, control flow, and optimization. Our compiler uses a fairly simple approach, but more advanced compilers might use techniques like:
- graph-coloring or linear-scan register allocation
- common subexpression elimination and constant folding
- loop optimizations such as loop-invariant code motion and unrolling
- instruction scheduling and more aggressive peephole optimization
Our compiler is designed with clear separation of concerns:
- the Flex-generated scanner handles lexical analysis
- the Bison-generated parser handles syntax analysis and drives the semantic actions
- the SymbolTable, ScopeTable, and SymbolInfo classes manage identifiers and scopes
- the ParserUtils helpers perform type checking and assembly code generation
- a separate optimizer post-processes the generated assembly
This modular approach makes the code easier to understand, maintain, and extend.
Building a compiler is a challenging but rewarding endeavor that requires deep understanding of programming languages, data structures, algorithms, and computer architecture. Our compiler demonstrates the essential components of a modern compiler: lexical analysis, parsing, symbol-table management, semantic analysis and type checking, code generation, and optimization.
While our compiler only implements a subset of C, the principles and techniques apply to more complex languages as well. Understanding how compilers work provides invaluable insight into programming languages and computer systems.
If you’re interested in learning more about compilers, many excellent resources provide deeper explorations of compiler theory and practice, from basic concepts to advanced techniques used in modern production compilers.