Chapter 3 Architectural Overview

3.1 Introduction

In this chapter, I explain the main architectural decisions that I made in Config4*.

3.2 Hiding Implementation Details

The public API of Config4* is defined in the Configuration class. This is an abstract base class containing very little code. This class provides a static create() operation that creates an instance of a concrete subclass. In this way, the implementation details of Config4* are kept separate from its public API.

The concrete subclass is called ConfigurationImpl. Its most important instance variable is a hash table. When a ConfigurationImpl object is created, its hash table is empty initially. The hash table can then be populated by calling insertString(), insertList() and ensureScopeExists() directly. Alternatively (and more commonly), you can call parse(), which, internally, calls those update-style operations.
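The separation between the public API and its implementation can be sketched in Java as follows. This is a minimal illustration rather than the actual Config4J source: the class names MiniConfiguration and MiniConfigurationImpl are hypothetical, a flat table of strings stands in for the real hash table, and only two operations are shown.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical, cut-down public API: an abstract class with almost no code.
abstract class MiniConfiguration {
    // The only place in the public API that names the concrete subclass.
    static MiniConfiguration create() {
        return new MiniConfigurationImpl();
    }

    abstract void insertString(String name, String value);

    abstract String lookupString(String name);
}

// Hypothetical concrete subclass: owns the hash table.
class MiniConfigurationImpl extends MiniConfiguration {
    // Empty when the object is created; populated by update-style operations.
    private final Map<String, String> table = new HashMap<>();

    @Override
    void insertString(String name, String value) {
        table.put(name, value);
    }

    @Override
    String lookupString(String name) {
        return table.get(name); // null if the entry does not exist
    }
}
```

Application code written against this sketch would mention only MiniConfiguration, so the concrete subclass could be changed without affecting callers.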

3.3 Use of Multiple Hash Tables

I know of two potential ways in which a configuration parser might use a hash table to store name=value pairs. I use the configuration file below to illustrate the two approaches:

foo = "a string";
bar = ["a list", "of", "strings"];
acme {
    widget = "another value";
}

The first approach is to use a single hash table to store all the entries. The entries in this hash table can be represented as follows:

foo → (STRING, "a string")
bar → (LIST, ["a list", "of", "strings"])
acme → (SCOPE, null)
acme.widget → (STRING, "another value")

The above notation indicates that each entry in the hash table is a name → tuple mapping, in which the tuple contains two fields: a type (STRING, LIST or SCOPE) and a value (if appropriate).

The other approach is to use a separate hash table for each scope. With this second approach, the hash table for the root scope can be represented as follows:

foo → (STRING, "a string")
bar → (LIST, ["a list", "of", "strings"])
acme → (SCOPE, <another-hash-table>)

The hash table for the acme scope contains:

widget → (STRING, "another value")

When I wrote my first configuration parser, I used the first approach, that is, a monolithic hash table. I did this for three reasons. First, it was simpler to implement. Second, it was slightly more memory-efficient. Finally, it meant that the implementation of a lookup<Type>() operation required a lookup on just one hash table. In contrast, the “separate hash table for each scope” approach can require multiple lookups on hash tables. For example, looking up the value of "acme.widget" requires two invocations of lookup():

value = rootScopeOfHashTable.lookup("acme").lookup("widget");

Several years later, I added the @copyFrom statement to my configuration parser and, unfortunately, this introduced a severe performance problem. When using the “monolithic hash table” approach, the implementation of @copyFrom has to iterate over the entire contents of the hash table to find the relevant entries that should be copied. The worst-case scenario for this is when a configuration file has a scope called, say, defaults, and many other scopes, each of which contains the following statement:

@copyFrom "defaults";

In such a scenario, parsing the configuration file takes O(N²) time, where N is the number of entries in the configuration file. That O(N²) performance problem disappears if, instead, a separate hash table is used for each scope. For that reason, I redesigned Config4* to use a separate hash table for each scope.

The hash table used by the ConfigurationImpl class is implemented by the ConfigScope class. The (type, value) tuple used in the above discussion of hash tables is implemented by the ConfigItem class.
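To make the "separate hash table for each scope" approach more concrete, here is a minimal Java sketch. It is not the real ConfigScope or ConfigItem code: the names SketchScope and SketchItem are hypothetical, and error handling is reduced to the bare minimum. The point to notice is that copyFrom() iterates over only the entries of the scope being copied, rather than over every entry in the configuration.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for the (type, value) tuple.
class SketchItem {
    enum Type { STRING, LIST, SCOPE }

    final Type type;
    final Object value; // a String, a List<String> or a SketchScope

    SketchItem(Type type, Object value) {
        this.type = type;
        this.value = value;
    }
}

// Hypothetical stand-in for a scope: one hash table per scope.
class SketchScope {
    final Map<String, SketchItem> entries = new HashMap<>();

    void insertString(String name, String value) {
        entries.put(name, new SketchItem(SketchItem.Type.STRING, value));
    }

    void insertList(String name, List<String> value) {
        entries.put(name, new SketchItem(SketchItem.Type.LIST, new ArrayList<>(value)));
    }

    SketchScope ensureScopeExists(String name) {
        SketchItem item = entries.get(name);
        if (item == null) {
            item = new SketchItem(SketchItem.Type.SCOPE, new SketchScope());
            entries.put(name, item);
        } else if (item.type != SketchItem.Type.SCOPE) {
            throw new IllegalStateException("'" + name + "' is not a scope");
        }
        return (SketchScope) item.value;
    }

    // Performs one hash table lookup per component of a name such as "acme.widget".
    SketchItem lookup(String fullyScopedName) {
        SketchScope scope = this;
        SketchItem item = null;
        for (String part : fullyScopedName.split("\\.")) {
            if (scope == null) {
                return null; // the previous component was not a scope
            }
            item = scope.entries.get(part);
            if (item == null) {
                return null;
            }
            scope = (item.type == SketchItem.Type.SCOPE) ? (SketchScope) item.value : null;
        }
        return item;
    }

    // Copies the entries of 'from' into this scope. The cost depends on the
    // size of 'from', not on the size of the whole configuration.
    void copyFrom(SketchScope from) {
        for (Map.Entry<String, SketchItem> e : from.entries.entrySet()) {
            SketchItem src = e.getValue();
            if (src.type == SketchItem.Type.SCOPE) {
                ensureScopeExists(e.getKey()).copyFrom((SketchScope) src.value);
            } else {
                // This sketch never modifies an item after insertion, so sharing is safe.
                entries.put(e.getKey(), src);
            }
        }
    }
}
```

With a monolithic table, the equivalent of copyFrom() would have to scan every key looking for those that start with the source scope's name, which is what made the many-scopes-copying-from-defaults case so expensive.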

3.4 Why Creation and Parsing are Separate Steps

With Config4*, parsing a configuration file is kept separate from constructing the (initially empty) Configuration object. For example, in C++, you write:

cfg = Configuration::create();
cfg->parse("foo.cfg");

Things were not always that way. When I wrote my first configuration parser, parsing of a configuration file was performed in the constructor. This resulted in slightly shorter application code:

cfg = Configuration::create("foo.cfg");

Unfortunately, performing parsing in the constructor turned out to be a source of memory leaks. This is because the parser might encounter an error in the configuration file and, as a result, throw an exception. Throwing an exception from (the parser called from within) the constructor means that the object’s destructor is not called, so heap-allocated instance variables become memory leaks. In theory, all I had to do was write the constructor as shown below:

ConfigurationImpl::ConfigurationImpl(const char * fileName)
{
    ... // allocate memory for instance variables
    try {
        parse(fileName);
    } catch(const ConfigurationException & ex) {
        ... // free memory of instance variables
        throw; // re-throw the exception
    }
}

However, on several occasions, as the project matured, memory leaks crept in due to me adding new heap-allocated instance variables but forgetting to free them in the above catch clause. Eventually, I grew tired of that source of recurring memory leaks, and I decided to prevent future re-occurrences by keeping parsing separate from object construction.

Several years after I made that change, I discovered two extra benefits of keeping parsing separate from construction. First, it makes it possible to preset configuration variables. Second, it makes it possible to set a security policy before parsing a configuration file.

3.5 Limitations

There are very few arbitrary limitations in the implementation of Config4*. For example:

Aside from available RAM, there is no arbitrary limit on the size of a configuration file, or the length of lines within a configuration file.
There is no arbitrary limit on the length of an identifier (that is, the name of a scope or variable), on the length of a string value, or on the maximum number of strings in a list.
There is no arbitrary limit on the maximum number of nested @include statements. (However, operating systems typically place a limit on the number of open file descriptors within a process; that will limit the number of nested @include statements.)
There is no arbitrary limit on the number of scopes or how deeply they can be nested. There is no arbitrary limit on the number of entries in a scope. The scope’s hash table will resize itself when it starts to fill up.

I think you get the idea: arbitrary limitations are not common in Config4*. Having said that, Config4* does have some limitations, as I now discuss.

3.5.1 Number of uid- Entries

There can be no more than 10⁹ uid- entries in a configuration file. That is an arbitrary limitation, albeit a large one. It arises because Config4* uses a 32-bit integer to store the uid- counter, and the maximum value of such a (signed) integer is 2³¹−1 = 2,147,483,647. That value is a 10-digit number. I decided to round down the maximum value of the uid- counter to 999,999,999 so that the expanded form of a uid- identifier contains nine digits instead of ten.

How likely is it that a configuration file will exceed the limit on uid- entries? I don’t think many people will be creating big enough configuration files to have to worry about exceeding this limit within the next few years (I’m writing this statement in 2011). But the software and databases that underpin an Internet search engine, such as Google, might. If you work for such a company and wish to increase this limit, then you should do the following. Edit the UidIdentifierProcessor class, change the declaration of the instance variable from being a 32-bit integer to being a 64-bit one, and modify the code so that when the value of this instance variable is formatted as a string, the string contains more than nine digits.
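The relationship between the counter's type and the nine-digit format can be illustrated with the following Java sketch. It is not the UidIdentifierProcessor code: the class name SketchUidCounter is hypothetical, and both the expanded format ("uid-", nine digits, "-", then the rest of the name) and the choice to start counting at zero are assumptions made for the sake of the example.

```java
// Hypothetical sketch of a bounded uid- counter.
class SketchUidCounter {
    static final int MAX_UID = 999_999_999; // largest nine-digit value

    private int count = 0; // a 32-bit integer, as described in the text

    // Expands, say, "uid-foo" into "uid-<nine digits>-foo" (assumed format).
    String expand(String name) {
        if (!name.startsWith("uid-")) {
            throw new IllegalArgumentException("not a uid- identifier: " + name);
        }
        if (count > MAX_UID) {
            throw new IllegalStateException("too many uid- entries");
        }
        String suffix = name.substring("uid-".length());
        // %09d pads the counter with leading zeros to exactly nine digits.
        String expanded = String.format("uid-%09d-%s", count, suffix);
        count++;
        return expanded;
    }
}
```

In this sketch, raising the limit would involve changing the type of count (and MAX_UID) to long and widening the %09d format specifier, which mirrors the two changes described above.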

3.5.2 Lack of File name and Line Number Information

Consider the following scenario involving two configuration files: foo.cfg and bar.cfg. The foo.cfg file contains the following:

@include "bar.cfg";
... # define some configuration variables

The bar.cfg file contains the following:

x = "2"    # missing semicolon
y = "tru"; # misspelling of "true"

Now let’s consider what happens if we run a program that contains the following code:

cfg = Configuration.create();
try {
    cfg.parse("foo.cfg");
    boolean myBool = cfg.lookupBoolean("", "y");
} catch(ConfigurationException ex) {
    System.out.println(ex.getMessage());
}

When we run the program, the call to parse() fails because of a syntax error, and the following message is printed:

bar.cfg, line 2: expecting ’;’ or ’+’ near ’y’
(included from foo.cfg, line 1)

The error message is very informative. Not only does it correctly report the missing semicolon, it also specifies the location of that problem: line 2 of file bar.cfg, which was included from line 1 of foo.cfg.

Let’s assume we insert the missing semicolon and run the program again. Now, parse() succeeds, but the call to lookupBoolean() fails, and the following message is printed:

foo.cfg: bad boolean value (’tru’) specified for ’y’; should be one of:
’false’, ’true’

That error message is less informative than the previous one. It correctly describes the problem, but it does not accurately specify the file name and line number of the problematic configuration variable. Instead, it just assumes (inaccurately, in this case) that the problematic variable is defined somewhere in foo.cfg rather than in an included file.

The lack of accurate location information in this second error message is due to that information not being recorded in Config4*’s internal hash tables. That information is not recorded because of a combination of my laziness and my concern for efficient memory use, as I now explain.

When the Config4* parser encounters a name=value statement or the opening of a scope, it enters information into the internal hash tables by calling one of the following operations: insertString(), insertList() or ensureScopeExists(). The following discussion applies to all those operations, so, for conciseness, I will discuss just insertString().

The first configuration parser I implemented—the original ancestor of Config4*—did not have @include or @copyFrom statements. The insertString() operation took an extra parameter that indicated the line number at which the configuration variable was defined:

void insertString(String scope, String name, String value, int lineNum);

That line number was recorded in the hash table entry for the variable. If an operation, say, lookupBoolean(), could not translate a variable’s value into the appropriate type, then the text message in the exception thrown could specify the line number (obtained from the entry in the hash table) and the file name (obtained by calling cfg.fileName()) of the problematic variable. This approach worked well, and it had minimal memory overhead: just a 4-byte integer (to store the line number) for each entry in a hash table.

Several years later, I added the @include statement. I realised that if error messages were to specify accurate location information, then it would no longer be sufficient to pass a line number to insertString(). That operation would have to be modified to take a parameter that specified a list of (fileName, lineNumber) tuples, as shown in the following pseudocode:

void insertString(String scope, String name, String value,
                  List[(fileName, lineNum)] locationInformation);

That list of tuples could be stored in the hash table entry for a variable. Then an error message produced by, say, lookupBoolean() could indicate the file name and line number of the problematic variable, plus the path, if any, that traces the @include statements from the main configuration file to the file that contains the problematic variable. (Ideally, the path would trace not just @include statements, but also @copyFrom statements.)

Implementing that enhancement could result in a significant memory overhead. For example, let’s assume there are 100 variables defined in bar.cfg, which is included from foo.cfg. Would the enhancement result in there being 100 copies of the string "bar.cfg" and another 100 copies of "foo.cfg"—separate copies for each entry in the hash table? Avoiding such redundant copies would require the implementation of a pool of unique strings, which would add complexity to the implementation of Config4*.
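A pool of unique strings need not be large. The following Java sketch shows one possible shape for it, assuming that location information is stored as (fileName, lineNum) tuples. The names FileNamePool and SketchLocation are hypothetical and do not appear in Config4*.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical pool: returns one shared String object per distinct file name.
class FileNamePool {
    private final Map<String, String> pool = new HashMap<>();

    String unique(String fileName) {
        // putIfAbsent() returns the previously stored string, or null if
        // this is the first time the file name has been seen.
        String existing = pool.putIfAbsent(fileName, fileName);
        return (existing != null) ? existing : fileName;
    }

    int size() {
        return pool.size();
    }
}

// Hypothetical (fileName, lineNum) tuple that refers to a pooled file name.
class SketchLocation {
    final String fileName;
    final int lineNum;

    SketchLocation(FileNamePool pool, String fileName, int lineNum) {
        this.fileName = pool.unique(fileName);
        this.lineNum = lineNum;
    }
}
```

With such a pool, 100 variables defined in bar.cfg would share a single "bar.cfg" string, so the per-entry cost would be a reference and a line number rather than a copy of the file name. Java's String.intern() offers a comparable facility, but a C++ implementation would have to provide its own.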

Would such memory overhead and/or complexity be a worthwhile investment to obtain more informative error messages? I don’t know. So far, I have found it straightforward to search through a file (and included files, if any) in a text editor to find a problematic variable. But then, I have been dealing mainly with configuration files that contain only a few hundred or a few thousand lines of text. Perhaps, in a few years’ time, somebody will be working with configuration files that contain millions of lines of text and a complex interaction of deeply nested and re-opened scopes, all compounded with @include and @copyFrom statements. In such a scenario, accurate location information in error messages might improve ergonomics significantly.

3.5.3 Information lost with round-trip parse() and dump()

If you parse() a configuration file and dump() it back out again, then you do not get back the full contents of the original configuration file.

At first sight, this might appear to be a limitation of the dump() operation. However, that view is inaccurate. To better understand the issues involved, consider a configuration file that contains the following statement:

log_dir = getenv("FOO_HOME") + "/logs/" + exec("hostname");

The Config4* parser evaluates the expression and stores the result in the hash table for the configuration scope. For example, if FOO_HOME has the value "/opt/foo" and hostname returns "host1", then the hash table will contain the following entry:

log_dir → (STRING, "/opt/foo/logs/host1")

The dump() operation simply dumps the contents of the hash table, and thus produces:

log_dir = "/opt/foo/logs/host1";

So, the limitation is not actually with the dump() operation, since it is faithfully reproducing the contents of the hash table. Instead, the limitation is with the parser and hash table representation, because they record a processed (rather than the original) version of what was in the input configuration file.

You might think this limitation would be easy to overcome: just have the hash table store the original expression rather than the result of evaluating the expression. However, such an approach would suffer from two significant problems.

The first problem is an increased performance overhead. The cost of evaluating the expression would no longer be incurred just once, when parsing the input file. Instead, it would be incurred every time a lookup<Type>() operation is invoked (which might be many times in an application).

The second problem is that the internal architecture of Config4* would have to be redesigned completely to enable dump() to reproduce the input configuration file exactly. In particular, something more complex than a hash table would be required to store the parsed information. This is because:

A hash table does not preserve the order in which entries were added to it, but such an order-preservation guarantee would be required for dump() to reproduce the input file accurately.
The parser discards comments when parsing the input file. These would have to be preserved in the internal representation for dump() to be able to reproduce the input file accurately.
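The order-preservation point can be demonstrated with the Java standard library. The sketch below inserts the same names into a HashMap and into a LinkedHashMap (a hash table that additionally threads a linked list through its entries). The class name OrderDemo is hypothetical. It illustrates only the ordering issue and does nothing about preserving comments.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class OrderDemo {
    // Inserts the names in the given order and returns the table's iteration order.
    static List<String> keysAfterInserting(Map<String, String> table, String... names) {
        for (String name : names) {
            table.put(name, "value of " + name);
        }
        return new ArrayList<>(table.keySet());
    }

    public static void main(String[] args) {
        String[] names = {"zebra", "apple", "mango", "banana"};

        // HashMap: the iteration order is unspecified and need not match insertion order.
        System.out.println(keysAfterInserting(new HashMap<String, String>(), names));

        // LinkedHashMap: iteration follows insertion order.
        System.out.println(keysAfterInserting(new LinkedHashMap<String, String>(), names));
    }
}
```

The extra linked list is an example of the "something more complex than a hash table" that an order-preserving dump() would need, at the cost of additional memory per entry.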

In addition, it is difficult to see how an efficient internal representation might preserve commands such as @include, @copyFrom, @remove, @if-then-@else, and conditional assignment ("?=") statements rather than just the name=value pairs resulting from executing those commands.

In summary, it would require a significant amount of rework to the architecture of Config4* to be able to implement a dump() operation that could reproduce the input configuration file accurately. In my opinion, the benefits would not justify the amount of work involved.

The preceding discussion invites a question: Why did I implement a dump() operation that reproduces the input configuration file so inaccurately? The answer is that my original intention in implementing dump() was to provide a debugging tool: the output of dump() helped me to check that I had implemented the hash table-based internal representation correctly. It was only later that I realised dump() might be useful for other purposes too, such as converting, say, an XML file into Config4* format, or storing the user preferences of a GUI-based application.

3.6 The Multi-step Build Process

Some software projects have a straightforward build system: compile all the source-code files, and then combine them to form a library, executable or jar file. Some other software projects require a multi-step build system, for example:

Compile a subset of the source-code files to produce a utility program, such as a code generator.
Run that utility program to generate additional source-code files.
Compile the newly generated files plus the remaining source-code files, and combine them to form a library, executable or jar file.

Config4Cpp requires that type of multi-step build system. This is due to the default security policy, which must be embedded within the Config4Cpp library.

The first step of the build system is to compile a few source-code files to produce a simplified version of config2cpp called config2cpp-nocheck. In a moment, I will explain how and why this “no check” version of the utility is simplified. But before that, I will discuss the remaining steps of the build system.

The second step of the build system is to run the newly compiled utility on the DefaultSecurity.cfg file to produce a C++ class called DefaultSecurity.

The third step of the build system is to compile this newly generated class plus the remaining source-code files, and combine them to form a library and executable.

The (non-simplified) config2cpp cannot be used in step 2 of the build system because it makes use of the Config4Cpp library, which is not built until step 3 of the build system. (In particular, it is the schema-generation functionality of the utility that makes use of the Config4Cpp library.)

The (simplified) config2cpp-nocheck utility does not contain any schema-generation functionality. This simplification means it avoids any dependency on the Config4Cpp library. This simplified utility is used only by the build system: it is not copied into the bin directory for use by regular users of Config4*.

Originally, Config4J used a similar multi-step build process. However, Version 1.2 of Config4J introduced support for strings of the form "classpath#path/to/file.cfg" that can be passed as a parameter to the Configuration.parse() operation. This Java-specific enhancement means that the DefaultSecurity.cfg file can now be found by searching for it on the classpath (which is guaranteed to work since the file is embedded as a resource file in config4j.jar). In turn, this means that Config4J can now make use of a simpler, single-step build system: just compile all .java files and create config4j.jar.

3.7 Features Implemented with Delegation

Two important pieces of functionality (fallback configuration and security policies) are implemented by having the user-created Configuration object delegate to another, but internal, Configuration object. In this section, I briefly explain how the delegation works.

3.7.1 Fallback Configuration

One of the instance variables in the ConfigurationImpl class is a C++ pointer or Java reference to another ConfigurationImpl object. In Config4J, this instance variable is called fallbackCfg, while in Config4Cpp it is called m_fallbackConfig. (In general, Config4Cpp uses "m_" as a prefix on the names of member, that is, instance, variables.) The constructor initialises this instance variable to be a C++ null pointer or Java null reference. The setFallbackConfiguration() operation sets it to point to another Configuration object.

The Configuration class defines many type-specific lookup operations, such as lookupList(), lookupString() and lookupBoolean(). The implementations of those operations, either directly or indirectly, invoke a more primitive operation called lookup(), which looks for the desired entry in the hash tables. If lookup() finds the entry, then it returns a pointer/reference to the relevant hash table’s ConfigItem; otherwise, it continues the search by delegating to the fallback configuration object. This can be seen in the abridged pseudocode algorithm shown below:

ConfigItem lookup(String fullyScopedName, String localName, ...)
{
    ConfigItem    item;
    item = ...; // search for fullyScopedName in the hash tables
    if (item == null && fallbackCfg != null) {
        item = fallbackCfg.lookup(localName, localName, ...);
    }
    return item;
}
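
The pseudocode above can be turned into a small, runnable Java sketch. To keep it short, the sketch makes some simplifying assumptions: the class name FallbackSketch is hypothetical, a single flat table keyed by fully scoped name stands in for the per-scope hash tables, and lookup() returns the string value rather than a ConfigItem. As in the pseudocode, the fallback object is searched using the local name.

```java
import java.util.HashMap;
import java.util.Map;

class FallbackSketch {
    // Flat table keyed by fully scoped name (a simplification for this sketch).
    private final Map<String, String> table = new HashMap<>();

    // Null until setFallbackConfiguration() is called.
    private FallbackSketch fallbackCfg = null;

    void setFallbackConfiguration(FallbackSketch cfg) {
        fallbackCfg = cfg;
    }

    void insertString(String fullyScopedName, String value) {
        table.put(fullyScopedName, value);
    }

    // Searches this object first, and delegates to the fallback only on a miss.
    String lookup(String fullyScopedName, String localName) {
        String item = table.get(fullyScopedName);
        if (item == null && fallbackCfg != null) {
            item = fallbackCfg.lookup(localName, localName);
        }
        return item;
    }
}
```

A value present in the main object always wins. The fallback object is consulted only when the main object has no entry for the fully scoped name.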

3.7.2 Security Policy

The enforcement of Config4*’s security policy relies on the interaction between three items: (1) a singleton object representing the default security policy; (2) two instance variables in the ConfigurationImpl class; and (3) an operation called isExecAllowed(). I will discuss each of those in turn.

In Section 3.6, I explained how the build system embeds a DefaultSecurity.cfg file into the Config4* library. That embedded configuration file provides the default security policy. A class called DefaultSecurityConfiguration: (1) inherits from ConfigurationImpl; (2) uses its constructor to parse the embedded DefaultSecurity configuration file; and (3) provides a singleton object. That singleton object is the default security policy used by all Configuration objects.

The ConfigurationImpl class contains two instance variables that are used to implement the security policy:

// Java instance variables
Configuration    securityCfg;
String           securityCfgScope;

// C++ instance variables
Configuration *  m_securityCfg;
StringBuffer     m_securityCfgScope;

The ConfigurationImpl constructor initialises the (m_)securityCfg variable to point to the DefaultSecurityConfiguration singleton object, and initialises (m_)securityCfgScope to be an empty string (denoting the root scope). A programmer can update those instance variables by calling the setSecurityConfiguration() operation.

Recall that there are three ways Config4* can execute an external command:

cfg.parse("exec#command");
@include "exec#command";
name = exec("command");

Whenever Config4* is asked to execute an external command, it calls isExecAllowed() to determine if the security policy in effect allows the specified command to be executed. That operation makes its decision by comparing details of the specified command to the allow_patterns, deny_patterns and trusted_directories variables that appear in the (m_)securityCfgScope scope of the (m_)securityCfg configuration object.
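The following Java sketch shows how a check of this kind might combine the three variables. It is an illustration under stated assumptions, not the Config4* implementation. The class name ExecPolicySketch is hypothetical. It assumes that a command is permitted only if it matches at least one allow pattern, matches no deny pattern, and names an executable by an absolute path whose directory is listed as trusted. It also assumes that "*" in a pattern matches any run of characters. The real operation may differ on all of those points.

```java
import java.util.List;
import java.util.regex.Pattern;

class ExecPolicySketch {
    private final List<String> allowPatterns;
    private final List<String> denyPatterns;
    private final List<String> trustedDirectories;

    ExecPolicySketch(List<String> allowPatterns, List<String> denyPatterns,
                     List<String> trustedDirectories) {
        this.allowPatterns = allowPatterns;
        this.denyPatterns = denyPatterns;
        this.trustedDirectories = trustedDirectories;
    }

    // Assumed wildcard rule: '*' matches any run of characters; all else is literal.
    static boolean matches(String pattern, String text) {
        StringBuilder regex = new StringBuilder();
        boolean first = true;
        for (String piece : pattern.split("\\*", -1)) {
            if (!first) {
                regex.append(".*");
            }
            regex.append(Pattern.quote(piece));
            first = false;
        }
        return Pattern.matches(regex.toString(), text);
    }

    // Assumed decision rule: allowed, not denied, and located in a trusted directory.
    boolean isExecAllowed(String command) {
        String executable = command.trim().split("\\s+")[0];
        int slash = executable.lastIndexOf('/');
        if (slash < 0) {
            return false; // this sketch insists on a path that names a directory
        }
        String dir = (slash == 0) ? "/" : executable.substring(0, slash);
        boolean allowed = allowPatterns.stream().anyMatch(p -> matches(p, command));
        boolean denied = denyPatterns.stream().anyMatch(p -> matches(p, command));
        return allowed && !denied && trustedDirectories.contains(dir);
    }
}
```

In Config4* itself, the three lists would be obtained by looking up allow_patterns, deny_patterns and trusted_directories in the (m_)securityCfgScope scope of the (m_)securityCfg object, rather than being passed to a constructor.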

3.8 Thread safety

Implementations of Config4* are not thread safe. The lack of thread safety was a deliberate design decision, and was based on two considerations.

First, some programming languages do not provide portable synchronisation facilities. Thus, avoiding reliance on such facilities helps to keep the architecture of Config4* portable across programming languages.

Second, all the operations in the API of Config4* fall into one of two categories: either they are query operations such as lookup<Type>(), or they are update operations such as parse(), ensureScopeExists(), insert<Type>(), remove() and empty(). I imagine that most multi-threaded, Config4*-based applications will use a single thread to call one or more update operations to initialise a Configuration object. Once initialisation is complete, the Configuration object can then be made available to other threads within the application, but those threads will invoke only query operations on it. It is safe for multiple threads to invoke query operations concurrently.
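
The usage pattern described above can be sketched in Java as follows. The class ReadOnlyAfterInit is a hypothetical stand-in for a Configuration object, with a plain, unsynchronised hash table inside. One thread performs all the updates, and only afterwards are worker threads started to perform queries.

```java
import java.util.HashMap;
import java.util.Map;

class ReadOnlyAfterInit {
    // Deliberately unsynchronised, like the hash tables discussed in this chapter.
    private final Map<String, String> table = new HashMap<>();

    void insertString(String name, String value) { // update operation
        table.put(name, value);
    }

    String lookupString(String name) {             // query operation
        return table.get(name);
    }

    // Initialises in the calling thread, then lets worker threads query concurrently.
    // Returns the number of workers that saw the expected value.
    static int countSuccessfulLookups(int numThreads) {
        ReadOnlyAfterInit cfg = new ReadOnlyAfterInit();
        cfg.insertString("log_level", "info"); // all updates finish before any worker starts

        int[] hits = new int[numThreads];
        Thread[] workers = new Thread[numThreads];
        for (int i = 0; i < numThreads; i++) {
            final int id = i;
            workers[i] = new Thread(() -> {
                if ("info".equals(cfg.lookupString("log_level"))) {
                    hits[id] = 1; // each worker writes only to its own slot
                }
            });
            // Thread.start() guarantees the worker sees the writes made before it.
            workers[i].start();
        }

        int total = 0;
        try {
            for (int i = 0; i < numThreads; i++) {
                workers[i].join(); // join() makes the worker's write to hits[i] visible
                total += hits[i];
            }
        } catch (InterruptedException ex) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for workers", ex);
        }
        return total;
    }
}
```

If an application needed to call an update operation after the worker threads had started, it would have to provide its own synchronisation around every access to the object.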