Last month I read an excellent article in the May issue of Communications of the ACM (CACM) about the future direction of microprocessors. Shekhar Borkar and Andrew Chien’s argument is that energy efficiency is now the limiting factor in processor development and performance – simply adding more cores running at ever higher frequencies would lead to prohibitive power consumption. With an expectation of 150 million logic transistors by 2018, they further argue that it will not be feasible simply to replicate identical cores to achieve the 30x performance gain predicted by the improvements seen over the last two decades. To deal with this they suggest that processor designers will instead provide multiple types of core, trading a combination of number, size and complexity against throughput.
In a related blog article at Dr. Dobb’s, Cameron and Tracey Hughes raise the question of transparency for software developers. They ask whether they will be able to “write once and run everywhere” or will need to get out the “Trick Bag” to deal explicitly with the potential variance from processor to processor and machine to machine.
The quick answer, I believe, is that the effect of the availability of, and differences between, cores will be felt first by system programmers – those who create and maintain operating systems and compiler chains. Having said that, I don’t believe there is actually an easy answer to the Hugheses’ question: the CACM article, whilst solidly grounded in the hardware aspects, does raise questions on the software side. This is especially true if it is not just the cores that differ, but the established environment (for example, a flat coherent address space) that changes.
The article’s “key insights” summary also talks about application-customised hardware, but that, to me, implies that the gains from multiple cores will only be apparent in specific fields. In such a situation the aim of energy efficiency may not translate to the wider market, or at least may not achieve its full potential there.
So we have lots of questions with no easy answers to any of them. Why is this?
If we leave aside processors designed for specific niche applications and look to the wider market, we find machines running many threads or applications at the same time: an email client receiving mail whilst an image editor removes red-eye, database servers handling multiple queries and updates, low-level device drivers responding to key presses or mouse movements whilst network packets arrive. If processors contain specialist cores that only suit specific types of computation, then much of the hardware may sit unused in low-power or off states (especially if, for example, you never perform image editing but there is a core optimised for that function). In such a case the throughput won’t achieve the figures given in the article’s example scenarios, because of the mismatch between capability and processing load. The extension of this argument is potential confusion for consumers as different processor variants are aimed at different market segments.
Leaving that issue aside, software engineers already face a problem with the range of different processor capabilities “live in the field”. The first example that springs to mind is the disparate range of SIMD operations available on ostensibly compatible processors. It is already necessary to implement mechanisms that trap unhandled instructions (for example, SSE3 on a processor that only supports SSE2) and emulate them, combining multiple supported instructions on one variant to achieve what a single instruction does on another. This shields the programmer from differences in the cores by exposing a unified model in which all the cores an application may run on (including those on different processors or machines) appear identical.
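The same idea can also be applied up front, at user level, as runtime dispatch rather than trap-and-emulate. Here is a minimal sketch in C, assuming a GCC or Clang toolchain on x86 – the pair_sums* function names are mine, purely for illustration:

```c
/* Runtime dispatch between an SSE3 instruction and its SSE2
 * equivalent – the same job as trap-and-emulate, done up front.
 * Assumes a GCC or Clang toolchain on x86. */
#include <cpuid.h>      /* __get_cpuid(), bit_SSE3 (GCC/Clang) */
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <pmmintrin.h>  /* SSE3 intrinsics */

static int cpu_has_sse3(void)
{
    unsigned int eax, ebx, ecx, edx;
    if (!__get_cpuid(1, &eax, &ebx, &ecx, &edx))
        return 0;
    return (ecx & bit_SSE3) != 0;
}

/* SSE3: one horizontal-add instruction produces {a0+a1, b0+b1}. */
static __m128d __attribute__((target("sse3")))
pair_sums_sse3(__m128d a, __m128d b)
{
    return _mm_hadd_pd(a, b);
}

/* SSE2-only core: the same result built from three supported
 * instructions – exactly the emulation described above. */
static __m128d pair_sums_sse2(__m128d a, __m128d b)
{
    __m128d lo = _mm_unpacklo_pd(a, b);  /* {a0, b0} */
    __m128d hi = _mm_unpackhi_pd(a, b);  /* {a1, b1} */
    return _mm_add_pd(lo, hi);           /* {a0+a1, b0+b1} */
}

/* Callers see one operation; the core difference is hidden here. */
static __m128d pair_sums(__m128d a, __m128d b)
{
    return cpu_has_sse3() ? pair_sums_sse3(a, b)
                          : pair_sums_sse2(a, b);
}
```

The point is not the three-instruction fallback itself but that the caller of pair_sums never knows, or cares, which variant ran.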
We can also look at development for the Cell processor, which combines two different architectures (PowerPC and SPE) and therefore requires the programmer to break the design and algorithms down for development targeted at a specific core type. That may be considered a specialist environment, as is the allocation of processing to an available GPU for both graphics and application work (for example, NVIDIA’s CUDA system). These approaches treat the specialist cores as co-processors or peripherals to the main core, and the programmer has, of necessity, to consider them.
In these examples, the Hugheses don’t need a trick bag to write their software: they are either explicitly targeting an environment of heterogeneous cores or being shielded by lower-level mechanisms.
Turning again to the small-core model suggested in the article, there are many ways such cores could be created. They could omit out-of-order execution or branch prediction, use shorter pipelines, allow fewer parallel operations, or drop execution units such as those needed for SIMD or floating-point support. In addition, energy efficiency (the goal of the new architecture) could be pursued through different clock speeds or cache sizes. Some of these choices would, in general, become apparent as throughput or performance variations between cores, and that does impact programmers who need to guarantee latency.
If, as Borkar and Chien suggest, all the cores were to share a unified instruction set, albeit only partially implemented by some core types, then the differences could be handled in the same way as current SIMD differences. This approach, however, leads to inefficient use of the cores, not to mention the hardware (transistor) overhead of fetching and decoding the full instruction set on cores that execute only a subset of it. We need, therefore, to consider some possible solutions and their impact on application programmers.
A reasonable assumption is that any processor implementing multiple core types will provide a mechanism to identify each core’s capabilities – Intel’s CPUID instruction already does this, returning a result that is valid for all the available symmetric cores. It would, therefore, be relatively simple for the compiler chain to include, in the header section of a compiled program, a similar set of flags indicating the features the program requires. Operating systems could use these flags to allocate the program to a capable core (or, at the least, the closest match, with emulation for the remainder), and the application programmer would once again be shielded from core differences.
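To make the idea concrete, here is a hedged sketch of how an operating system might match a program’s required-feature flags against each core’s advertised capabilities. Every name here is invented for illustration; no real OS or toolchain is being described:

```c
/* Hypothetical: the compiler chain records required features as a
 * bitmask in the program header; the scheduler picks a core whose
 * capability mask covers it. */
#include <stdint.h>
#include <stddef.h>

#define FEAT_FPU   (1u << 0)   /* floating-point unit present  */
#define FEAT_SIMD  (1u << 1)   /* SIMD execution units present */
#define FEAT_OOO   (1u << 2)   /* out-of-order execution       */

struct core_desc {
    int      id;
    uint32_t features;   /* capabilities reported by this core */
};

/* Return the id of a core implementing every required feature, or -1,
 * meaning the OS must pick the closest match and emulate the rest. */
static int pick_core(const struct core_desc *cores, size_t n,
                     uint32_t required)
{
    for (size_t i = 0; i < n; i++)
        if ((cores[i].features & required) == required)
            return cores[i].id;
    return -1;
}
```

A real scheduler would presumably rank candidate cores by energy cost rather than take the first full match, but the matching principle is the same.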
If we want to go further – and we do, since we are looking for energy-proportional computing – system programmers could allow any thread of execution to run on any of the cores until it encounters an instruction that the current core does not support. At that point the fault could be handled by reallocating the thread to a core that does support the instruction. This does, of course, still incur an overhead and a time penalty as the current core state is migrated to the new core and, possibly, private caches are reloaded. Questions remain, such as how to identify a thread that can be migrated to a less capable core, but these most likely sit in the system programmer’s domain; our application programmer is still shielded from making any explicit choices about the instructions or cores used.
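As a thought experiment, the mechanism can even be approximated at user level on Linux: re-pin the thread to a more capable core from a SIGILL handler and let the faulting instruction retry there. A sketch, assuming core 0 stands in for the fully featured core (a real kernel would do this transparently, and far more efficiently):

```c
/* User-level approximation of "fault and migrate": when an
 * unimplemented instruction raises SIGILL, re-pin the thread to a
 * fully featured core and return, so the instruction is retried
 * there. Treating core 0 as the "big" core is an assumption. */
#define _GNU_SOURCE
#include <sched.h>
#include <signal.h>
#include <string.h>

static void on_sigill(int sig)
{
    (void)sig;
    cpu_set_t big;
    CPU_ZERO(&big);
    CPU_SET(0, &big);                        /* assumed big core */
    /* sched_setaffinity() is a thin syscall wrapper, which keeps
     * this tolerable inside a signal handler for a sketch. */
    sched_setaffinity(0, sizeof big, &big);  /* migrate this thread */
    /* Returning from the handler re-executes the faulting
     * instruction, which now lands on the capable core. */
}

int main(void)
{
    struct sigaction sa;
    memset(&sa, 0, sizeof sa);
    sa.sa_handler = on_sigill;
    sigaction(SIGILL, &sa, NULL);
    /* ... run the workload: an unsupported instruction now costs a
     * one-off migration (state transfer, cold caches), not a crash. */
    return 0;
}
```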
The examples we have looked at affirm the simple answer I gave to the Hugheses’ question, but they also expose the need for application developers to understand the available architecture and to design software that uses the capabilities of complex, power-hungry large cores in sustained bursts rather than in a scatter-gun fashion. There are still, however, a large number of questions that need answering – even if only in the system programmer’s domain.
In fact, it appears to me that Cameron and Tracey’s question is not the one we should be asking. Much of the development of processors, and of the system-level software around them, has been done in such a way that application programmers have been largely unaffected by (one could even argue protected from) architectural change. Whilst this allows continued application development without radical redesign, it also perpetuates a distinction between hardware and software development and progress. Furthermore, it places the onus on compiler and operating-system developers to make the best use of new architectures and instructions, and to develop the necessary glue that lets applications continue to grow in the manner we are used to.
We will only achieve energy-proportional (or energy-efficient, if you prefer) computing if all the participants work far more closely together and understand each other’s aims and limitations. It is from this closeness that radical new architectures and algorithms will grow, especially in these first steps of choosing the feature sets to make available in heterogeneous collections of processing cores.