Getting Acquainted with LLVM, Part 2
Hello again! In the last blog post: Getting Acquainted with LLVM Codegen Part 1, we split up the code from the tutorial into separate modules, and also got rid of the global LLVM context, moving it locally to the codegen
module. This blog post will continue the tutorial, and we will add some extensions to the code.
Challenge 2: Fixing Signature Validation Bug
This challenge is presented in the tutorial, and it has to do with the codegen method for the FunctionAST. Let’s briefly review the Function AST codegen, and see what’s going on so far.
So far, we can define really basic functions like this:
ready> def foo(b) b;
ready> Read function definition:define double @foo(double %b) {
entry:
ret double %b
}
Let’s make it a little more complicated:
ready> extern cos(a);
ready> Read extern: declare double @cos(double)
ready> extern sin(a);
ready> Read extern: declare double @sin(double)
ready> def one(x) (cos(x)*cos(x)) + (sin(x)*sin(x));
Read function definition:define double @one(double %x) {
entry:
%calltmp = call double @cos(double %x)
%calltmp1 = call double @cos(double %x)
%multmp = fmul double %calltmp, %calltmp1
%calltmp2 = call double @sin(double %x)
%calltmp3 = call double @sin(double %x)
%multmp4 = fmul double %calltmp2, %calltmp3
%addtmp = fadd double %multmp, %multmp4
ret double %addtmp
}
Here I showcase the extern facility our language has so far. When we generate the prototype for extern functions, we set an external linkage flag to signify that the function may be defined outside the current module, and also that it is callable by functions outside of the module. This is useful for interfacing with C functions, for example. LLVM uses the C calling convention by default, and so we can use C functions in our code, and the linker will resolve these references to find the actual implementation.
Okay, now to the bug in our code. The issue is as follows:
ready> extern foo(a);
ready> Read extern: declare double @foo(double)
ready> def foo(b) b;
ready> Error: Unknown variable name
}
We have an extern definition of foo(a), but when we try to define foo(b), we get an unknown variable name error. This shouldn’t really matter since a and b are both the same type (implicitly, since we only have double arguments right now).
This is actually a common pattern, since using the extern keyword and providing the definition just means that we are extending the visibility of the prototype globally, and not that we are expecting to get the definition from somewhere else. Side note: it seems that extern really serves two purposes, to either broadcast a prototype globally, or to assume the definition of a prototype is being provided. It is interesting that the extern keyword handles both use cases.
Why are we getting an error here? Here is a snippet of code from the FunctionAST function, wherein the problem lies:
Function *FunctionAST::codegen(CodeGen &CG) {
// First, check for an existing function from a previous 'extern' declaration.
Function *TheFunction = CG.getModule()->getFunction(Proto->getName());
...
// Record the function arguments in the NamedValues map.
CG.getNamedValues().clear();
for (auto &Arg : TheFunction->args())
CG.getNamedValues()[std::string(Arg.getName())] = &Arg;
if (Value *RetVal = Body->codegen(CG)) {
// Finish off the function.
CG.getBuilder()->CreateRet(RetVal);
// Validate the generated code, checking for consistency.
verifyFunction(*TheFunction);
return TheFunction;
}
...
}
First we fetch the function prototype from the CodeGen context, and then we clear the symbol table information and start adding in variables from the prototype into the symbol table. Note that we’re iterating over TheFunction->args(), so we are iterating over the variables of the extern prototype that was declared before. This makes the prototype’s variable available in the symbol table when we perform codegen for the function body, as in the call to Body->codegen(CG).
This actually points to a more nefarious bug – what if we have new variable names in our function definition parameters, but use the prototype variable names in the body of the function? Here is an example:
ready> extern foo(a);
ready> Read extern: declare double @foo(double)
ready> def foo(x) a;
Read function definition:define double @foo(double %a) {
entry:
ret double %a
}
This bug is more serious than we first thought. This function is completely nonsensical, since we are defining a function foo(x) that doesn’t even use the named parameter x, and uses a instead, which shouldn’t even exist at that point in the function definition! This is more nefarious because unlike the previous bug that quits with an error when we try to make the definition, this actually translates the code into IR! However, since the prototype defined a as the name for the parameter, and we are using the prototype to populate the symbol table for the codegen pass on the function body, this is what happens. So let’s fix the bug!
Without delving too deeply into theory, this problem in our code has to do with alpha conversion. The basic idea is that the name of the variable doesn’t matter, the functions f(a) = a and f(b) = b mean the same thing. For most programming languages, this has actually been a notoriously tricky thing to get right. Actually, many mathematicians and programmer’s have gotten this issue wrong before. For a full exposiiton I recommend the first chapter of “Practical Foundations for Programming Languages”, by Robert Harper.
However, our language is so simple at this point that we wont run into some of the trickier edge cases with alpha conversion. However, here is an example of something that can go wrong without doing alpha conversion properly: suppose our language had global variables:
extern foo(x)
def foo(y)
x + y
I replicated our bug scenario: we have an extern function foo(x), and then we define the function foo ourselves, using the name y instead. What if we tried to resolve the discrepancy by changing y to x, so that it matches the extern prototype? Then we get the following:
def foo(x)
x + x
Now we have a completely different meaning of the function, instead of adding y to some x defined globally, we first lose the x that was defined globally, as it becomes captured by the parameter x, and we are doubling the input instead of adding the function input to a globally defined x. They are two completely different things!
Right now, our language doesn’t have anything that will cause subtle alpha conversion bugs, so for the time being we will go with a quick and dirty alpha conversion, and revisit the issue when we can have these kinds of bugs. What we are going to do for now is to simply change the function prototype itself to match up with the function definition. This works because we can only have one function definition. When a function is defined, we first check that the type signature matches the prototype signature. This is very simple right now, as we only have double types. So we just check that they have the same number of arguments. Then we iterate over the prototype arguments and definition arguments in tandem, and overwrite the prototype arguments if they are not equal to the definition arguments. Here is the full code listing for the FunctionAST codegen function now, with the changes in the first part of the code:
Function *FunctionAST::codegen(CodeGen &CG) {
// First, check for an existing function from a previous 'extern' declaration.
Function *TheFunction = CG.getModule()->getFunction(Proto->getName());
if (!TheFunction) {
TheFunction = Proto->codegen(CG);
} else {
// We have an extern prototype, so we do a quick and dirty alpha conversion hack
// and rename the prototype variables if they conflict. Note that this needs to be
// changed if we introduce more features into the language, since the alpha conversion
// may not work anymore.
// First check if there is an argument mismatch.
// NOTE: this is a "type equality" check between the prototype and the definition.
// Right now this works because type equality is just determined by number of double
// arguments, and all functions return double so we do not check that.
// This will need to be modified to be actual type equality later on.
if (TheFunction->arg_size() != Proto->getArgs().size())
return nullptr;
// Now we iterate over the function arguments and change names in the prototype
// as necessary.
unsigned int idx = 0;
auto defArgs = Proto->getArgs();
for (auto &Arg: TheFunction->args()) {
if (Arg.getName() != defArgs[idx]) {
Arg.setName(defArgs[idx]);
}
idx++;
}
}
if (!TheFunction)
return nullptr;
if (!TheFunction->empty())
return (Function*)CG.LogErrorV("Function cannot be redefined.");
// Create a new basic block to start insertion into.
BasicBlock *BB = BasicBlock::Create(CG.getContext(), "entry", TheFunction);
CG.getBuilder()->SetInsertPoint(BB);
// Record the function arguments in the NamedValues map.
CG.getNamedValues().clear();
for (auto &Arg : TheFunction->args())
CG.getNamedValues()[std::string(Arg.getName())] = &Arg;
if (Value *RetVal = Body->codegen(CG)) {
// Finish off the function.
CG.getBuilder()->CreateRet(RetVal);
// Validate the generated code, checking for consistency.
verifyFunction(*TheFunction);
return TheFunction;
}
// Error reading body, remove function.
TheFunction->eraseFromParent();
return nullptr;
}
And now we have no issue with different names (and also the bad bug that we had before can’t happen anymore):
ready> extern foo(a);
ready> Read extern: declare double @foo(double)
ready> def foo(b) b + b;
ready> Read function definition:define double @foo(double %b) {
entry:
%addtmp = fadd double %b, %b
ret double %addtmp
}
Thanks for tuning in, see you for the next challenge!