Some notes on computer stuff

Clang by example: Detecting wrong first include

April 23, 2015
[c++] [clang] [llvm] [tutorial]

TOC

  1. Detecting postfix operators in for loops
  2. Detecting unused functions
  3. Detecting wrong first include (you are here)

Introduction

There is a well known (or at least I hope that it's well known) good practice to include header that corresponds to the current source file before including any other headers. It's recommended by multiple C++ coding style guides (for example by Google C++ Style Guide) and indeed makes sense to follow.

To put it short: it makes finding incomplete headers really easy. If you forget to include required header, corresponding source file just won't compile. This way there will be no hidden (implicit) dependencies among modules of a project as well as dependencies on system headers. Below is a simple example that demonstrates such situation.

Let's look at three files:

// util.hpp

#ifndef BAD_FIRST_INCLUDE__UTIL_HPP__
#define BAD_FIRST_INCLUDE__UTIL_HPP__

std::string truncatePath(const std::string &path);

#endif // BAD_FIRST_INCLUDE__UTIL_HPP__
// util.cpp

#include <string>

#include "util.hpp"

// ...
// main.cpp

#include <string>

#include "util.hpp"

// ...

Although both cpp-files should compile without errors, it's easy to see that util.hpp file lacks include of <string> system header. There is no errors because both cpp-files include <string> before including util.hpp, which hides the fact that util.hpp is not standalone as it depends on source files it's included in. Simple swapping two includes in util.cpp file like this:

// util.cpp

#include "util.hpp"

#include <string>

// ...

Makes the issue pop up almost immediately, precisely when util.cpp file is compiled. As std::string is unknown at point of inclusion of util.hpp, compiler will issue errors on its usages in util.hpp.

This discussion is meant to outline usefulness of the approach when "self include" is placed at the top of the file above all other includes. Hope everyone are convinced now. Even though this usage of the approach clearly provides some benefits, not everyone use it and some don't even know about it. Here is where a need for tool that checks whether programmer followed described approach or not arises.

Such a tool could also check that corresponding header file is included at all. Since implementing this check doesn't add much interesting details to the code and requires some heuristics to check whether the header exists in the first place, it's left out of this memo. Nevertheless, implementing it later requires only additions to the code.

New type of FrontendAction

Comparing to previous articles, here we don't need AST at all. Parsing performed by preprocessor at lexical analysis stage of source file processing is enough to extract include directives. That is the reason why we need to get more familiar with FrontendAction. There are plenty of them and it's important to pick one that fits better.

So far we've indirectly met with the ASTFrontendAction at the very end of our main() functions. Here is the code:

return tool.run(newFrontendActionFactory(&finder.getMatchFinder()));

finder.getMatchFinder() returns MatchFinder &. So where is the FrontendAction? In this case it's implicitly created by one of several overloads of newFrontendActionFactory() function. If you look at MatchFinder closer, you'll see that it doesn't inherit from any class. It works thanks to duck typing, overloaded used in this case assumes that its argument implements method with the following signature (or the similar one that can be used without compilation errors):

clang::ASTConsumer *newASTConsumer();

Then value returned by this function is wrapped within a local class derived from ASTFrontendAction.

To implement our current task another overload of newFrontendActionFactory() function is needed. In this case it takes type of FrontendAction derived class as its only template parameter and instantiates it when required.

If you can't understand/recollect why do we need to use any factory at all, here is fast answer: each file is processed by new instance of FrontendAction, hence the need for a factory of such objects.

Now that we know how to create new FrontendActions, let's pick concrete type of action we want to use. There are already several actions related to preprocessor stage, which can be viewed at documentation page for PreprocessorFrontendAction. It shouldn't be hard to guess that we'd like to use PreprocessOnlyAction class, it successfully prevents creation of AST which should make processing of really huge sources fast and smooth for our tool.

For curious ones, here is how PreprocessOnlyAction prevents AST generation in code:

virtual bool usesPreprocessorOnly() const { return true; }

Here is documentation on usesPreprocessorOnly(). Yes, it's that easy, still not really obvious when you don't know it.

Implementing IncludeFinderAction

It should be clear now that IncludeFinderAction needs to derive from PreprocessOnlyAction, but what methods should it implement or override? Documentation mentions only PreprocessOnlyAction::ExecuteAction, so the choice is somewhat limited here and we need to overwrite exactly this method:

void
IncludeFinderAction::ExecuteAction()
{
    IncludeFinder includeFinder(getCompilerInstance());
    getCompilerInstance().getPreprocessor().addPPCallbacks(
        includeFinder.createPreprocessorCallbacks()
    );

    clang::PreprocessOnlyAction::ExecuteAction();

    includeFinder.diagnoseAndReport();
}

Version of ExecuteAction() from parent class is invoked in the middle of our code as we want this to happen:

  1. Create our auxiliary class that does the most interesting part of job (IncludeFinder).
  2. Subscribe it to preprocessor actions.
  3. Let preprocessor process input file as it would do normally.
  4. Analyze preprocessing results and print out results to a user.

Note usage of getCompilerInstance() in the code above. It comes from FrontendAction class and returns reference to CompilerInstance class which holds root objects that allow retrieval of all necessary information about source file processing.

Preprocessor callbacks

Some questions might arise when one is looking at the following lines:

getCompilerInstance().getPreprocessor().addPPCallbacks(
    includeFinder.createPreprocessorCallbacks()
);

Particularly the following ones:

  1. Why don't we "remove" any callbacks later?
  2. Why callbacks are "created"?

There is one answer for both questions (well, almost): Preprocessor takes ownership over PPCallbacks object(s) passed to it. Actually, that was the first part of the answer, there would be no need to create anything if we could unsubscribe listener. So the second part is: callbacks can't be removed after they are added. This means that callbacks handler must be allocated on heap.

It seems to be a design decision, because internally PPChainedCallbacks class is used and there are other similar classes, e.g. ChainedASTReaderListener. Such chained structures doesn't provide means for removing chains, only to add them and to call corresponding methods.

Implementing callbacks

Due to callback details described in the previous section callbacks need to be implemented with a proxy class. Here is how it works in general:

Figure 1.  Using proxy for preprocessor callbacks.

And here is more detailed description:

  1. ExecuteAction() method creates CallbacksProxy class with a reference to corresponding IncludeFinder instance and hands it over to an instance of Preprocessor class (solid arrows).
  2. When interesting callbacks occur Preprocessor calls CallbacksProxy, which dispatches the call to IncludeFinder (dashed arrows).

As for the list of callbacks, PPCallbacks currently has 30 of them and we need only one: PPCallbacks::InclusionDirective. It is called once for every #include (and alike) directive in the source code. The callback is called even when the file inclusion directive refers to doesn't exist. Seems to be exactly what we need.

Processing inclusions

There are two phases of processing here:

  1. Collecting all #include directives in a file.
  2. Deciding whether self-include exists and is at correct place in the file.

The first phase is quite straightforward:

clang::SourceManager &sm = compiler.getSourceManager();

if (sm.isInMainFile(hashLoc)) {
    const unsigned int lineNum = sm.getSpellingLineNumber(hashLoc);
    includes.push_back(std::make_pair(lineNum, fileName));
}

Note SourceManager::isInMainFile() method, you might want to use it one day too. It's an easy way to filter out only those elements that are located in the main file that is processed (one of source files passed to the tool). This way we do not print files referenced by our includes, includes of our includes and so on.

Apart from that we also extract line number in the source file using SourceManager::getSpellingLineNumber() method, which is also quite handy and simpler than another way of obtaining same information using FullSourceLoc as was shown in previous articles.

The second phase doesn't have anything new about clang, but for completeness here are main points how check for include file is performed:

  1. Backward slashes (\) in full path to the source file are replaced with forward slashes (/).
  2. Trailing component of the path is extracted, which should be filename.
  3. Extension of the file is removed to get its name only (part of file name without extension is usually called "root").
  4. For each of found inclusion directives:
    1. Root of file is obtained and checked to match root of the source file. Skip this header if name doesn't match.
    2. Extension is extracted and checked against list of known extensions of header files. Headers with unknown extensions are skipped after issuing corresponding warning.
    3. We get to this step only if match is found. If we found more than one match, print a warning.

The last check that is left is the primitive one: just check that the first element of our collection of includes is the one that corresponds to the source file.

Testing

Repository of the tool contains some [test files]. Let's try the tool on a couple of them. First with tests/good.cpp file:

#include "good.hpp"

#include <utility>
> self-inc-first tests/good.cpp --

No output, as expected.

Now take a look at tests/bad.cpp file:

#include <string>

#include "bad.hpp"

Output with truncated paths:

> self-inc-first tests/bad.cpp --
.../tests/bad.cpp:3:should be the first include in the file

Which is correct as well.

Conclusion

We've wrote another tool that enforces following good programming practices in C++ code. Although as of now it doesn't automatically guesses cases when header inclusion is completely missing in the file, the tool is already useful.

Expected workflow with it can be as follows:

  1. Run self-inc-first over all files of the project.
  2. Move includes according to obtained output.
  3. Fix compilation errors.

After doing this actions, all source files should become self-contained. A simple rule that can significantly simplify building in the future.

Resources