The layering of abstractions has served us well, but it’s now generating the sorts of complexity it was designed to solve. Time for a re-think?
Anyone who’s built a large piece of software knows that much of the effort goes into managing the complexity of the project: which other software a piece of code relies on, how to keep the various aspects separate, how to manage changes and upgrades, and so on. This isn’t something that’s got easier over time. For a given code size and style it has, as we’ve come to understand build processes and dependency management better; but code sizes have relentlessly increased to cancel out our improved understanding, and modern practices don’t make life any easier. Downloaded code, dynamic modules and classes, client-server architectures and the like all generate their own intrinsic complexity.
One of the biggest sources of complexity is the use of multiple applications, especially in enterprise systems. A typical e-commerce system, for example, will make use of a web server to present pages (which themselves might contain embedded JavaScript for client-side processing), a database to track orders and inventory, a procurement system to fulfil orders, and possibly a supply-chain management system to order new inventory. That’s the application. Then there’ll be the operating system, a logging facility, a facilities management system, and a load of administrative tools and scripts. And the operating system may itself be virtualised and running as a guest within another, host operating system and hypervisor, which needs its own toolset. The interactions between these tools can be mind-boggling.
Someone once asked: who knows how to work the Apache web server? It sounds like a simple question — any competent web manager? the main developers? — but the sting in the tail is that Apache is very configurable: so configurable, in fact, that it’s pretty much impossible to work out what a given combination of options will do (or, conversely, what combination of options to use to achieve a given effect). The interactions are just too complicated, and the web abounds with examples where interactions between (for example) the thread pool size, the operating system block size, and the Java virtual machine parameters conspire to crash a system that looks like it should be working fine. If you can’t work one server properly — one component of the system — what hope is there to get a complete system humming along?
Al Dearle and I have been talking about this for a while. The basic issue seems to be an interaction between decomposition and dependency. In other words, the complexity comes at the “seams” between the various sub-systems, and is magnified the more configurable the components on either side of the seam are. This is important, because systems are becoming more finely decomposed: the move to component software, software-as-a-service and the like all increase the number of seams. Al’s image of this is that modern systems are like Russian dolls, where each supposedly independent component contains more components that influence the size and complexity of the component containing them. You can only simplify any individual piece so far, because it depends on so many other pieces.
Actually a lot of the seams are now unnecessary anyway. Going back to the e-commerce example, the operating system goes to great pains to provide a process abstraction to keep the components separate — to stop faults in the database from affecting the web server, for example. Historically this made perfect sense and prevented a single faulty process in a time-sharing system from affecting the processes of other users. Nowadays, however, it makes considerably less sense, for a number of reasons. Firstly, all the components are owned by a single nominal user (although there are still good reasons for separating the root user from the application user), so the security concerns are less pronounced. Secondly, all the components depend on each other, so a crash in the database will effectively terminate the web server anyway. (We’re simplifying, but you get the idea.) Finally, there’s a good chance that the web server, database and so on are each running in their own virtual machine, so there’s only one “real” process per machine (plus all the supporting processes). The operating system is offering protection that isn’t needed, because it’s being provided (again) by the hypervisor running the virtual machines and perhaps (again) by the host operating system(s) involved.
We also tend to build very flexible components (like Apache), which can deal with multiple simultaneous connections, keep users separate, allow modules to be loaded and unloaded dynamically — behave like small operating systems, in other words, replicating the OS functionality again at application level. This is despite the fact that, in enterprise configurations, you’ll probably know in advance the modules to be loaded and have a single user (or small user population) and a fixed set of interactions: the flexibility makes the component more complex for no net gain during operation. Although it might simplify configuration and evolution slightly, there are often other mechanisms for this: in a cloud environment one can spin up a replacement system in an evolved state and then swap the set of VMs over cleanly.
It’s easy to think that this makes no difference on modern machines, but that’s probably not the case. All these layers still need to be resourced; more importantly, they still need to be managed, maintained and secured, all of which take time to do well — with the result that they typically get done badly (if at all).
Can we do anything about it? One thought is that the decomposition that makes thinking about systems and programming easier makes executing those systems more complex and fragile. In many cases, once the system is configured appropriately, flexibility becomes an enemy: it’ll often be too complicated to re-configure or optimise in a live environment anyway. There may be a reason to have Russian dolls when designing a system, but once the system is designed it’s better to make each doll solid, to remove the possibility of it then opening up and falling apart.
So it’s not decomposition that’s the issue, it’s decomposition manifested at run-time. When we add new abstractions to systems, we typically add them in the form of components or libraries that can be called from other components. These components are often general, with lots of parameters and working with multiple clients — sound familiar? This is all good for the component-writer, as it lets the same code be re-used: but it bloats each system that uses the component, adding complexity and interactions.
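As a purely illustrative sketch (every name here is hypothetical, not taken from any real system), here is roughly what such a general, run-time-flexible component looks like in OCaml: a configuration record is threaded through every call and re-examined each time, even though any given deployment fixes it once and never changes it.

```ocaml
(* A minimal sketch (all names hypothetical) of a component that is general
   at run-time: its behaviour is driven by a configuration record that is
   consulted on every call, even though a given deployment fixes it once. *)

type config = {
  compress  : bool;  (* compress responses? *)
  log_level : int;   (* 0 = silent, higher = chattier *)
  pool_size : int;   (* notional connection-pool size *)
}

(* Every request pays for the generality: the branches on [cfg] run each
   time, and the whole configuration machinery ships with the system. *)
let handle_request cfg payload =
  if cfg.log_level > 0 then print_endline "handling request";
  let body = if cfg.compress then "compressed:" ^ payload else payload in
  Printf.sprintf "[pool=%d] %s" cfg.pool_size body

let () =
  let cfg = { compress = true; log_level = 0; pool_size = 64 } in
  print_endline (handle_request cfg "hello")
```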
So one thought for tackling complexity is to change where decomposition manifests itself. If, instead of placing new functionality in the run-time system, we placed it in the compiler used to build that run-time, we could use compilation techniques to optimise out the unnecessary functionality, so that what results is optimised for the configuration it’s actually being placed in, rather than being general enough to represent any configuration. There’s substantial work on these ideas in the fields of staged compilation and partial evaluation (for example MetaOCaml, Template Haskell, Flask and the like): the flexibility is manifested at compile-time as compile-time abstractions that, in the course of compilation, are removed and replaced with inflexible — but more efficient and potentially more dependable — specialised code. Think of taking the source code for Linux, Apache and MySQL, accelerating them together at high speed, and getting out a single program that’d run on a bare machine, had nothing it didn’t actually need, and had all the options for the various (conceptual) sub-systems set correctly to work together.
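Continuing the same hypothetical example from above, here is a sketch of what a staged version might look like using BER MetaOCaml's quoting syntax (brackets .< ... >. and escapes .~), assuming that extension is available: the configuration is consumed while generating code, so the tests on it never appear in the specialised handler at all.

```ocaml
(* A sketch using BER MetaOCaml's staging annotations (assumed available):
   the same hypothetical component as before, but the configuration is used
   at code-generation time. The generated handler contains no tests on the
   configuration, just the straight-line code this deployment needs. *)

type config = {
  compress  : bool;
  log_level : int;
  pool_size : int;
}

(* [gen_handler cfg] builds code for a handler specialised to [cfg]. *)
let gen_handler cfg =
  .< fun payload ->
       .~(if cfg.log_level > 0
          then .< print_endline "handling request" >.
          else .< () >.);
       let body =
         .~(if cfg.compress
            then .< "compressed:" ^ payload >.
            else .< payload >.)
       in
       Printf.sprintf "[pool=%d] %s" cfg.pool_size body >.

(* For a concrete deployment the flexibility disappears at this point;
   Runcode.run turns the generated code into an ordinary function. *)
let handler =
  Runcode.run (gen_handler { compress = true; log_level = 0; pool_size = 64 })
```

The flexibility still exists, but only at code-generation time; what actually runs is the inflexible, specialised residue.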
Don’t believe it’s possible? Neither do I. There’s too much code and especially too much legacy code for this to work at enterprise (or even desktop) level. However, for embedded systems and sensor networks it’s a different story. For these systems, every extra abstraction that makes the programmer’s life easier is a menace if it increases the code size hitting the metal: there just isn’t the memory. But there also isn’t the legacy code base, and there is a crying need for better abstractions. So an approach to the Russian dolls that moves the abstractions out of the run-time and into the languages and compilers might work, and might considerably improve the robustness and ease of use for many systems we need to develop. It also works well with modern language technology, and with other trends like ever-more-specialised middleware that remove bloat and overhead at the cost of generality. Keeping that reduction in bloat and overhead without giving up the generality seems like a worthwhile goal.