Yet Another C Compiler


Building a C Compiler: A Comprehensive Guide to Intermediate Code Generation
Introduction to Compiler Design
A compiler transforms source code written in a high-level programming language (like C) into executable machine code that can be directly understood by the computer’s processor. This transformation is complex and involves multiple stages, each with its own theoretical foundations and practical implementations.
Before diving into the specific components of our compiler project, let’s establish some foundational concepts.
The Compilation Process
A typical compiler operates in several phases:
- Lexical Analysis: Breaking down the source code into tokens (identifiers, keywords, operators, etc.)
- Syntax Analysis: Organizing these tokens into a hierarchical structure that represents the grammatical structure of the program
- Semantic Analysis: Checking if the program makes logical sense (type checking, scope checking, etc.)
- Intermediate Code Generation: Creating a representation that’s closer to machine code but still independent of specific hardware
- Code Optimization: Improving the intermediate code to make it more efficient
- Code Generation: Translating the optimized intermediate code into machine code
Now, let’s explore each component of our compiler in detail.
Lexical Analysis: Tokenizing the Source Code
Theoretical Background
Lexical analysis is based on the theory of regular languages and finite automata. Regular expressions define patterns for tokens, and these patterns are converted into finite state machines that can efficiently recognize tokens in the input stream.
The lexical analyzer (or scanner) reads the source code character by character and groups them into meaningful units called tokens. Each token has:
- A type (e.g., identifier, keyword, operator)
- A value (the actual text)
- Often, position information (line number, column number)
Implementation with Flex
Our project uses Flex (a modern version of the classic Lex tool) to generate the lexical analyzer. Here’s an in-depth look at how we define tokens:
%option noyywrap
%{
#include "LexUtils.h"
#include<iostream>
#include<string.h>
#include<stdio.h>
using namespace std;
FILE *input;
%}
NEWLINE \r?\n
NOT_NEWLINE ([^\r\n])
WHITESPACE [ \t\f\r\v]+
LETTER [a-zA-Z]
DIGIT [0-9]
ALPHABET {LETTER}|_
ALPHANUM {ALPHABET}|{DIGIT}
ID {ALPHABET}({ALPHANUM})*
INTEGER {DIGITS}
FLOAT ({INTEGER})?(\.?{DIGITS}*)([Ee][+-]?{INTEGER})?
%%
{COMMENT} |
{STAR_COMMENT} {handle_comment(yytext);}
"if" {return IF;}
"else" {return ELSE;}
"for" {return FOR;}
// More keywords...
{INTEGER} {handle_const_int(yytext); return CONST_INT;}
{FLOAT} {handle_const_float(yytext); return CONST_FLOAT;}
{ID} {handle_id(yytext); return ID;}
{ADDOP} {handle_operator(yytext, "ADDOP"); return ADDOP;}
// More operators...
{NEWLINE} {line_count++;}
{WHITESPACE} {}
. {handle_error(yytext, "Unrecognized character");}
%%
Let’s break down what’s happening here:
- The
%{...%}
section contains C/C++ code that will be included in the generated scanner. - Definitions like
LETTER [a-zA-Z]
create named patterns. - In the rules section (
%%
), we define what action to take when a pattern is matched.
For example, when the scanner encounters “if”, it returns the token IF to the parser. When it finds an identifier that matches the pattern {ID}
, it calls handle_id(yytext)
and returns ID.
The functions like handle_const_int()
and handle_id()
create SymbolInfo objects for the tokens and set their attributes. These objects are then passed to the parser.
Token Processing
When the scanner recognizes a token, it creates a SymbolInfo object with the token’s attributes:
void handle_const_int(char *str) {
string s(str);
assignSymbol("CONST_INT");
}
void assignSymbol(string type) {
yylval.symbol = new SymbolInfo((string)yytext, type);
}
This creates a new SymbolInfo object with the token’s text and type, and assigns it to yylval.symbol
, which the parser will access.
Parsing: Building the Syntax Tree
Theoretical Background
Parsing is based on the theory of context-free grammars (CFGs) and pushdown automata. A CFG consists of:
- Terminal symbols (tokens)
- Non-terminal symbols (syntactic categories)
- Production rules (defining how non-terminals can be expanded)
- A start symbol
The parser uses these rules to determine if the input sequence of tokens forms a valid program according to the grammar, and if so, to build a parse tree or abstract syntax tree (AST).
There are two main parsing approaches:
- Top-down parsing: Start from the root (start symbol) and try to derive the input (e.g., LL parsing)
- Bottom-up parsing: Start from the input and try to reduce it to the start symbol (e.g., LR parsing)
Implementation with Bison
Our project uses Bison (a modern version of the classic Yacc tool) to generate the parser. Bison generates LALR(1) parsers, which are a type of bottom-up parser.
Here’s a detailed look at our grammar definition:
%{
#ifndef PARSER
#define PARSER
#include "ParserUtils.h"
#endif
int yyparse(void);
int yylex(void);
extern FILE *yyin;
extern SymbolTable st;
%}
%union{
SymbolInfo* symbol;
}
%token IF FOR DO INT FLOAT VOID SWITCH DEFAULT ELSE WHILE BREAK CHAR DOUBLE RETURN CASE CONTINUE
%token INCOP DECOP NOT
%token LPAREN RPAREN LCURL RCURL LTHIRD RTHIRD COMMA SEMICOLON
%token PRINTLN
%token STRING
%token <symbol> ID
%token <symbol> CONST_INT
%token <symbol> CONST_FLOAT
%token <symbol> CONST_CHAR
%token <symbol> ADDOP
%token <symbol> MULOP
%token <symbol> LOGICOP
%token <symbol> RELOP
%token <symbol> BITOP
%token <symbol> ASSIGNOP
%type <symbol> program
%type <symbol> unit
// More non-terminal types...
%nonassoc second_prec
%nonassoc ELSE
%%
start : program
{
if (!error_count){
cout<<"No error, generating assembly code"<<endl;
addDataSegment();
startCodeSegment();
printCode($$);
addPrintFunc();
endCodeSegment();
optimize();
}
}
;
program : program unit
{
$$ = new SymbolInfo($1->getName() + "\n" + $2->getName(), "NON_TERMINAL");
$$->setCode($1->getCode() + "\n" +$2->getCode());
printRule("program : program unit");
printSymbol($$);
}
| unit
{
$$ = $1;
$$->setCode($1->getCode());
printRule("program : unit");
printSymbol($$);
st.printAll();
}
;
// More grammar rules...
%%
Let’s examine what’s happening here:
- The
%{...%}
section contains C/C++ code that will be included in the generated parser. - The
%union
declaration specifies the type of semantic values (in our case, SymbolInfo pointers). - The
%token
and%type
declarations specify the types of tokens and non-terminals. - The
%%
section contains the grammar rules, with semantic actions enclosed in braces.
Grammar Rules and Semantic Actions
Let’s look at a more complex rule, the if-statement:
statement : IF LPAREN expression RPAREN statement %prec second_prec
{
$$ = handle_if($3, $5);
printRule("IF LPAREN expression RPAREN statement");
printSymbol($$);
}
| IF LPAREN expression RPAREN statement ELSE statement
{
$$ = handle_if_else($3, $5, $7);
printRule("IF LPAREN expression RPAREN statement ELSE statement");
printSymbol($$);
}
This rule says that a statement can be an if-statement without an else clause (IF LPAREN expression RPAREN statement
) or an if-else statement (IF LPAREN expression RPAREN statement ELSE statement
).
The %prec second_prec
part resolves the classic “dangling else” ambiguity by specifying that this rule has lower precedence than the ELSE token.
The semantic actions (code in braces) create a new SymbolInfo object for the statement, generate code for it, and log the rule application for debugging.
Parse Tree Construction
As the parser recognizes grammatical constructs, it builds a parse tree (implicitly, through the semantic values). Each node in the tree is a SymbolInfo object that contains:
- The textual representation of the construct
- Its type (e.g., “NON_TERMINAL”)
- The generated code for that construct
- Other attributes specific to the construct
Symbol Table: Managing Identifiers and Scopes
Theoretical Background
A symbol table is a data structure used by compilers to keep track of identifiers (variables, functions, etc.) and their attributes (type, scope, etc.). It must support operations like:
- Insert a symbol with its attributes
- Look up a symbol to retrieve its attributes
- Handle scopes (blocks of code with their own local symbols)
- Check if a symbol is already defined in the current scope
Symbol tables can be implemented using various data structures like hash tables, trees, or lists, with hash tables being the most common due to their efficiency.
Implementation
Our project uses a sophisticated symbol table that supports multiple scopes and has specialized handling for different types of symbols:
class SymbolTable {
ScopeTable *current = NULL;
ofstream *log;
public:
SymbolTable(ofstream *log) {
this->log = log;
enterScope();
}
void enterScope(int buckets = SYMBOL_TABLE_SIZE) {
ScopeTable *st = new ScopeTable(buckets, current, log);
current = st;
*log << "\nNew ScopeTable #" << st->getID() << " created" << endl;
}
void exitScope() {
if (current == NULL) {
return;
}
ScopeTable *temp = current;
current = current->getParentScope();
delete temp;
}
bool insertSymbol(string name, string type) {
if (current == NULL) {
return false;
}
return current->insertSymbol(name, type);
}
bool insertSymbol(SymbolInfo* symbol) {
if (symbol->getIdType() == "VARIABLE" || symbol->getIdType() == "ARRAY")
symbol->setAsmVar(symbol->getName()+current->getID());
if (current == NULL){
return false;
}
return current->insertSymbol(symbol);
}
SymbolInfo *lookup(string name) {
if (current == NULL) {
return NULL;
}
ScopeTable *temp = current;
SymbolInfo *symbol = NULL;
while (temp != NULL) {
symbol = temp->lookUp(name);
if (symbol != NULL) {
return symbol;
}
temp = temp->getParentScope();
}
return NULL;
}
// More methods...
};
Scope Management
A key aspect of the symbol table is scope management. When the parser enters a new block (e.g., a function body or a compound statement), it creates a new scope table:
inline void enterScope() {
st.enterScope();
if (currentFunction != "") {
SymbolInfo *funcVal = st.lookup(currentFunction);
for (param p : paramList) {
SymbolInfo *sym = new SymbolInfo(p.name, "ID");
sym->setIdType("VARIABLE");
for (auto &c : p.type)
c = toupper(c);
sym->setVarType(p.type);
insertSymbol(sym);
asmVarList.push_back(sym->getAsmVar());
funcVal->paramSymList.push_back(sym);
}
if (currentFunction != "main") {
// Set up function prologue and epilogue
funcVal->setFuncStart(funcVal->getName() + " PROC\n");
funcVal->setFuncStart(funcVal->getFuncStart() + "\n" + "POP return_loc");
for (int i = funcVal->paramSymList.size() - 1; i >= 0; --i) {
funcVal->setFuncStart(funcVal->getFuncStart() + "\nPOP " +
funcVal->paramSymList[i]->getAsmVar());
}
funcVal->setFuncStart(funcVal->getFuncStart() + "\nPUSH BX\nPUSH DX\n");
funcVal->setFuncEnd(funcVal->funcEndLabel + ": \n" +
"POP DX\nPOP BX\nPUSH return_loc\nRET\n" +
funcVal->getName() + " ENDP\n\n");
}
paramList.clear();
}
}
When it exits a block, it removes the current scope table:
inline void exitScope() {
st.printAll();
st.exitScope();
}
Symbol Information
Each symbol in the table is represented by a SymbolInfo object, which contains detailed information about the symbol:
class SymbolInfo {
string name = "";
string type = "";
string idType = ""; // FUNCTION, VARIABLE, ARRAY
string varType = ""; // INT, FLOAT, VOID
string returnType = ""; // INT, FLOAT, VOID
bool funcDefined = false;
int arrSize = 0;
int arrIndex = 0;
int defaultInt = -1;
float defaultFloat = -1.0;
SymbolInfo *next;
string code = " ";
string asmVar = "";
bool isConst = false;
string funcStart;
string funcEnd;
vector<string> asmVarList;
public:
vector<int> intData;
vector<float> floatData;
bool isDummy = false;
SymbolInfo *real = NULL;
vector<SymbolInfo *> paramSymList;
vector<SymbolInfo *> varList;
string funcEndLabel = "";
string arrAsmVar = "";
bool isFunctionCall = false;
// Methods...
};
This rich structure allows the compiler to handle complex language features like arrays, functions with parameters, and type checking.
Type Checking and Semantic Analysis
Theoretical Background
Type checking ensures that operations are applied to compatible operands. For example, you can’t add a string to a number or call a non-function. Type systems can be:
- Static (checked at compile time) vs. dynamic (checked at runtime)
- Strong (few implicit conversions) vs. weak (many implicit conversions)
- Nominal (type names matter) vs. structural (structure matters)
C has a static, relatively weak, nominal type system.
Implementation
Our compiler performs type checking during parsing. For example, when handling binary operations:
inline SymbolInfo *handleADDOP(SymbolInfo *sym1, SymbolInfo *op, SymbolInfo *sym2) {
SymbolInfo *result = new SymbolInfo("", "");
if (sym1->getVarType() == "VOID" || sym2->getVarType() == "VOID") {
printError("Operand of void type");
return nullSym();
}
if (sym1->getVarType() == "FLOAT" || sym2->getVarType() == "FLOAT") {
result->setVarType("FLOAT");
} else {
result->setVarType("INT");
}
result->setIdType("VARIABLE");
// Code generation...
return result;
}
This function checks that neither operand is void, determines the result type based on the operand types, and generates appropriate code.
Intermediate Code Generation
Theoretical Background
Intermediate code is a representation of the program that’s lower-level than the source code but higher-level than machine code. Common forms include:
- Three-address code: Operations with at most three operands per instruction
- Static single assignment (SSA): Each variable is assigned exactly once
- Abstract syntax tree (AST): A tree representation of the program
- Control flow graph (CFG): A graph showing the flow of control
Our compiler generates a form of three-address code, which is then translated to x86 assembly.
Register Allocation
An important aspect of code generation is register allocation: deciding which values to keep in registers and which to store in memory. Our compiler uses a simple approach where temporary values are assigned to registers or memory locations:
class VarManager {
int size = 0;
stack<string> free;
public:
string getTempVar() {
string tempVar;
if (free.empty()) {
tempVar = "temp" + to_string(size);
size++;
asmVarList.push_back(tempVar);
} else {
tempVar = free.top();
free.pop();
}
return tempVar;
}
void freeTempVar(string tempVar) {
if (tempVar.substr(0, 4) == "temp") {
free.push(tempVar);
}
}
int getSize() { return size; }
};
This class manages temporary variables, reusing them when possible to minimize memory usage.
Code Generation for Different Constructs
Let’s look at how our compiler generates code for various language constructs:
Variables and Array Accesses
inline SymbolInfo *getArrayIndexVar(SymbolInfo *arr, SymbolInfo *index) {
SymbolInfo *arrIdxVar = st.lookup(arr->getName());
SymbolInfo *var;
// Error checking...
var = new SymbolInfo(*arrIdxVar);
var->setArrIndex(index->getIntValue());
var->setName(arr->getName() + "[" + to_string(index->getIntValue()) + "]");
var->setReal(arrIdxVar);
var->addCode(index->getCode());
var->addCode("MOV BX, " + index->getAsmVar());
var->addCode("SHL BX, 1"); // Multiply by 2 (size of int)
if (SIorBX) {
var->addCode("MOV SI, BX");
var->setAsmVar(arrIdxVar->getAsmVar() + "[SI]");
} else {
var->setAsmVar(arrIdxVar->getAsmVar() + "[BX]");
}
vm.freeTempVar(index->getAsmVar());
var->arrAsmVar = var->getAsmVar();
SIorBX = !SIorBX; // Alternate between BX and SI for nested array accesses
return var;
}
This function generates code to access an array element. It computes the byte offset (SHL BX, 1
multiplies by 2 since each int is 2 bytes) and uses either BX or SI as the index register to allow for nested array accesses.
Assignment
inline SymbolInfo *handle_assign(SymbolInfo *sym1, SymbolInfo *sym2) {
SymbolInfo *result;
// Type checking...
result = new SymbolInfo(*sym1);
result->setName(sym1->getName() + "=" + sym2->getName());
result->setIdType("VARIABLE");
// Generate code
if (sym2->getIsConst()) {
result->setCode(sym1->getCode() + "\n" + sym2->getCode() + "\n" +
constToMem(sym1, sym2));
} else if (sym1->isArray()) {
result->setCode(sym2->getCode());
result->addCode(sym1->getCode());
result->addCode("MOV AX, " + sym2->getAsmVar());
result->addCode("MOV " + sym1->arrAsmVar + ", AX");
} else {
result->setCode(sym1->getCode() + "\n" + sym2->getCode() + "\n" +
memToMem(sym1, sym2));
}
result->setAsmVar(sym1->getAsmVar());
vm.freeTempVar(sym2->getAsmVar());
return result;
}
This function generates code for assignment statements. It handles different cases (constant values, arrays, variables) and generates appropriate MOV instructions.
If-Else Statements
inline SymbolInfo *handle_if_else(SymbolInfo *exp, SymbolInfo *ifstmnt, SymbolInfo *elsestmnt) {
SymbolInfo *result =
new SymbolInfo("if(" + exp->getName() + ")" + ifstmnt->getName() +
"else " + elsestmnt->getName(),
"NON_TERMINAL");
string label1 = newLabel(); // Label for else part
string label2 = newLabel(); // Label for end of if-else
result->setCode(exp->getCode());
result->addCode("MOV AX, " + exp->getAsmVar());
result->addCode("CMP AX, 1");
result->addCode("JNE " + label1); // Jump to else if condition is false
result->addCode(ifstmnt->getCode());
result->addCode("JMP " + label2); // Skip else part
result->addCode(label1 + ":"); // Start of else part
result->addCode(elsestmnt->getCode());
result->addCode(label2 + ":\n"); // End of if-else
vm.freeTempVar(exp->getAsmVar());
return result;
}
This function generates code for if-else statements. It creates two labels (one for the else part, one for the end), generates code to evaluate the condition and jump accordingly, and includes the code for both branches.
For Loops
inline SymbolInfo *handle_for(SymbolInfo *init, SymbolInfo *termimation, SymbolInfo *inc, SymbolInfo *statement) {
SymbolInfo *result =
new SymbolInfo("for(" + init->getName() + termimation->getName() +
inc->getName() + ")" + statement->getName(),
"NON_TERMINAL");
string loop = newLabel(); // Start of loop
string loopExit = newLabel(); // End of loop
result->addCode(";for loop start");
result->addCode(init->getCode()); // Initialization
result->addCode(loop + ":"); // Loop start label
result->addCode(termimation->getCode()); // Condition
result->addCode("MOV AX, " + termimation->getAsmVar());
result->addCode("CMP AX, 0");
result->addCode("JE " + loopExit); // Exit if condition is false
result->addCode(statement->getCode()); // Loop body
result->addCode(inc->getCode()); // Increment
result->addCode("JMP " + loop); // Jump back to start
result->addCode(loopExit + ":"); // Loop exit label
result->addCode(";for loop end");
return result;
}
This function generates code for for-loops. It creates two labels (loop start and loop exit), generates code for initialization, condition checking, loop body, and increment, and adds jumps to create the loop structure.
Function Calls
inline SymbolInfo *handle_function(SymbolInfo *funcVal, SymbolInfo *argList) {
SymbolInfo *func = st.lookup(funcVal->getName());
SymbolInfo *sym = new SymbolInfo(
funcVal->getName() + "(" + argList->getName() + ")", "NON_TERMINAL");
// Error checking...
sym->addCode(argList->getCode());
// Save state for recursive calls
for (int i = 0; i < func->paramSymList.size(); i++) {
sym->addCode("PUSH " + func->paramSymList[i]->getAsmVar());
}
for (int i = 0; i < func->varList.size(); i++) {
sym->addCode("PUSH " + func->varList[i]->getAsmVar());
}
for (int i = 0; i < vm.getSize(); i++) {
sym->addCode("PUSH temp" + to_string(i));
}
// Push return address and arguments
sym->addCode("PUSH return_loc");
for (int i = 0; i < func->paramSymList.size(); i++) {
sym->addCode("PUSH " + asmArgList[i]);
vm.freeTempVar(asmArgList[i]);
}
asmArgList.clear();
// Call function
sym->addCode("CALL " + funcVal->getName());
sym->addCode("POP return_loc");
// Restore saved state
for (int i = vm.getSize() - 1; i >= 0; i--) {
sym->addCode("POP temp" + to_string(i));
}
for (int i = func->varList.size() - 1; i >= 0; i--) {
sym->addCode("POP " + func->varList[i]->getAsmVar());
}
for (int i = func->paramSymList.size() - 1; i >= 0; i--) {
sym->addCode("POP " + func->paramSymList[i]->getAsmVar());
}
sym->isFunctionCall = true;
sym->setAsmVar("CX"); // Return value in CX
sym->setIdType("VARIABLE");
sym->setVarType("INT");
return sym;
}
This function generates code for function calls. It saves the current state (for recursive calls), pushes arguments and the return address, calls the function, and restores the state afterward. The function’s return value is placed in the CX register.
Assembly Code Generation and Optimization
x86 Assembly Basics
x86 assembly language is a low-level programming language specific to the x86 family of processors. It uses:
- Registers: Small, fast storage locations in the CPU (AX, BX, CX, DX, SI, DI, BP, SP, etc.)
- Memory operands: Addresses in memory, often with complex addressing modes
- Instructions: Operations like MOV, ADD, SUB, MUL, DIV, JMP, CALL, etc.
- Directives: Instructions to the assembler, like .DATA, .CODE, PROC, ENDP, etc.
Data Segment
The data segment contains all variables and constants. Our compiler generates it like this:
inline void addDataSegment() {
code << ".MODEL MEDIUM \n.STACK 100H \n.DATA" << endl << endl;
asmVarList.push_back("return_loc");
for (string s : asmVarList) {
code << s << " DW ?" << endl;
}
for (auto p : asmArrList) {
code << p.first << " DW " << p.second << " DUP (?)" << endl;
}
}
This creates a .DATA
section with all variables (DW ?
for uninitialized words) and arrays (DUP (?)
for arrays of uninitialized words).
Code Segment
The code segment contains the executable instructions. Our compiler generates it like this:
inline void startCodeSegment() { code << endl << ".CODE" << endl; }
inline void endCodeSegment() { code << endl << "END MAIN\n"; }
The .CODE
directive starts the code segment, and END MAIN
marks the end of the program, with MAIN as the entry point.
Function Prologue and Epilogue
Each function needs setup and cleanup code:
// Function prologue
funcVal->setFuncStart(funcVal->getName() + " PROC\n");
funcVal->setFuncStart(funcVal->getFuncStart() + "\n" + "POP return_loc");
for (int i = funcVal->paramSymList.size() - 1; i >= 0; --i) {
funcVal->setFuncStart(funcVal->getFuncStart() + "\nPOP " +
funcVal->paramSymList[i]->getAsmVar());
}
funcVal->setFuncStart(funcVal->getFuncStart() + "\nPUSH BX\nPUSH DX\n");
// Function epilogue
funcVal->setFuncEnd(funcVal->funcEndLabel + ": \n" +
"POP DX\nPOP BX\nPUSH return_loc\nRET\n" +
funcVal->getName() + " ENDP\n\n");
The prologue pops the return address and parameters from the stack, and saves registers BX and DX. The epilogue restores the registers, pushes the return address back onto the stack, and returns.
Optimization
After generating assembly code, our compiler performs several optimizations:
inline bool optimize_mov(string inst) {
inst.erase(std::remove(inst.begin(), inst.end(), ','), inst.end());
vector<string> tokens = split(inst, " ");
if (tokens.size() == 3 && tokens[0] == "MOV") {
string movLhs = tokens[1];
string movRhs = tokens[2];
if (movLhs == prevMovRhs && movRhs == prevMovLhs) {
return true; // Remove redundant MOVs like "MOV AX, BX" followed by "MOV BX, AX"
} else if (movLhs == "BX") {
BXval = movRhs; // Track BX value for arithmetic optimizations
} else if (movLhs == movRhs) {
return true; // Remove MOVs with same source and destination
}
prevMovLhs = movLhs;
prevMovRhs = movRhs;
}
return false;
}
inline bool optimize_arithmetic(string inst) {
inst.erase(std::remove(inst.begin(), inst.end(), ','), inst.end());
vector<string> tokens = split(inst, " ");
if (tokens.size() == 3 && (tokens[0] == "ADD" || tokens[0] == "SUB") &&
tokens[2] == "0") {
return true; // Remove ADDs and SUBs with 0
} else if (tokens.size() == 2 &&
(tokens[0] == "IMUL" || tokens[0] == "IDIV") && BXval == "1") {
return true; // Remove MULs and DIVs by 1
}
return false;
}
inline void optimize() {
ifstream srcFile("code.asm");
string line;
int line_count = 0, line_removed = 0;
vector<string> lines;
while (std::getline(srcFile, line, '\n')) {
lines.push_back(line);
}
log << "-------------------------------------" << endl;
log << "Optimizer log: " << endl;
for (string s : lines) {
line_count++;
if (s.substr(0, 1) == "L") {
// Clear tracking when entering a new label
prevMovLhs = "";
prevMovRhs = "";
}
if (s == " ") {
log << "Removed blank line : " << line_count << endl;
line_removed++;
} else if (optimize_mov(s)) {
log << "Optimized redundant MOV operation: " << line_count << endl;
line_removed++;
} else if (optimize_arithmetic(s)) {
log << "Optimized redundant arithmetic operation : " << line_count
<< endl;
line_removed++;
} else {
optimized << s << endl;
}
}
log << "Line removed:" << line_removed << endl;
log << "-------------------------------------" << endl;
optimized << "END MAIN" << endl;
srcFile.close();
}
These functions identify and remove:
- Redundant MOV instructions
- Arithmetic operations with identity values (adding 0, multiplying by 1)
- Blank lines
Testing with Example Programs
Let’s examine how our compiler handles a more complex example: a recursive function to calculate Fibonacci numbers.
int fib(int n) {
if (n <= 1)
return n;
return fib(n - 1) + fib(n - 2);
}
int main() {
int x;
x = fib(7);
println(x);
}
The compiler will:
- Parse the function declarations and definitions
- Generate code for the
fib
function:- Evaluate
n <= 1
- If true, return n
- If false, compute
n - 1
and callfib
recursively - Compute
n - 2
and callfib
recursively - Add the results and return
- Evaluate
- Generate code for the
main
function:- Declare
x
- Call
fib(7)
- Assign the result to
x
- Print
x
- Declare
The recursive calls are particularly challenging because they require saving and restoring the state of the function for each call. Our compiler handles this by:
- Pushing all local variables and parameters onto the stack before each call
- Popping them off the stack after each call returns
- Passing parameters by pushing them onto the stack
- Returning values in the CX register
Challenges and Best Practices
Building a compiler is a complex task with many challenges:
Challenge: Error Handling
Compilers must detect and report errors in the source code. Our compiler checks for:
- Lexical errors (invalid tokens)
- Syntax errors (invalid grammar)
- Semantic errors (invalid types, undeclared variables, etc.)
inline void printError(string msg) {
error << "Error at line no: " << line_count << " - " << msg << endl;
log << "Error at line no: " << line_count << " - " << msg << endl;
error_count++;
}
inline void yyerror(const char *s) { printError(s); }
Challenge: Memory Management
Compilers create many temporary objects during compilation, which can lead to memory leaks if not properly managed. Our compiler uses a VarManager class to reuse temporary variables and functions to free resources when they’re no longer needed.
Challenge: Code Generation Complexity
Generating efficient code requires careful consideration of register allocation, control flow, and optimization. Our compiler uses a fairly simple approach, but more advanced compilers might use techniques like:
- Register allocation based on liveness analysis
- Instruction scheduling to minimize stalls
- Peephole optimization to improve instruction sequences
- Loop optimization to reduce iteration overhead
- Inline expansion to eliminate function call overhead
Best Practice: Modular Design
Our compiler is designed with clear separation of concerns:
- Lexical analysis is handled by Flex and LexUtils
- Parsing is handled by Bison and ParserUtils
- Symbol management is handled by SymbolTable and SymbolInfo
- Code generation is integrated into the parser actions
- Optimization is a separate pass after code generation
This modular approach makes the code easier to understand, maintain, and extend.
Conclusion
Building a compiler is a challenging but rewarding endeavor that requires deep understanding of programming languages, data structures, algorithms, and computer architecture. Our compiler demonstrates the essential components of a modern compiler:
- Lexical analysis with Flex to break the source code into tokens
- Parsing with Bison to verify the grammar and build a parse tree
- Symbol table management to track variables, functions, and their attributes
- Type checking to ensure operations are applied to compatible operands
- Intermediate code generation to create x86 assembly code
- Optimization to improve the generated code
While our compiler only implements a subset of C, the principles and techniques apply to more complex languages as well. Understanding how compilers work provides invaluable insight into programming languages and computer systems.
Further Resources
If you’re interested in learning more about compilers, here are some excellent resources:
- “Compilers: Principles, Techniques, and Tools” by Alfred V. Aho, Monica S. Lam, Ravi Sethi, and Jeffrey D. Ullman (also known as the “Dragon Book”)
- “Engineering a Compiler” by Keith D. Cooper and Linda Torczon
- “Modern Compiler Implementation in C” by Andrew W. Appel
- “Advanced Compiler Design and Implementation” by Steven S. Muchnick
- The LLVM project: llvm.org
These resources provide deeper explorations of compiler theory and practice, from basic concepts to advanced techniques used in modern production compilers.