“If we reduce the execution clocks in these successful processors by a half, reduce the cache sizes by four or eight times, eliminate the L3 cache completely, reduce instruction execution width to one, remove any branch prediction, can we expect spectacular server consolidation performance with these nearly choked processors? Not a chance! This characterizes the SPARC T processor. To be fair, there are applications that lend themselves to a processor that switches available thin thread contexts on L1 cache misses, but those are generally associated with applications such as specific web farms and functions such as the UNIX dd command.”
Sun-cum-Oracle predicated its SPARC T1-T3 Chip Multi Threading (CMT) architecture on a wishful perception of the modern commercial workload. From 2005 until more or less a couple of weeks ago (Hot Chips 23), the company described modern data center applications as having (verbatim):
A high-degree of thread level parallelism (TLP)
Large working data sets resulting in poor locality of reference leading to high cache miss rates
Significant data sharing among threads resulting in coherence misses
Low instruction level parallelism (ILP) due to high cache miss rates, difficult to predict branches etc...
Performance bottle necks due to stalls on memory access
Addressing these perceptions defined the architecture of Sun's CMT family, the SPARC T1 through T3. These processors were characterized by poor single-thread (“thin” thread) performance, yet they excelled at copying and moving data. The cores were very simple, with only a handful of pipeline stages. Sun claimed that single-threaded, ILP-centric, high-clock processors did not address the demands of the modern data center because performance was limited by memory access latency. In essence, Sun covered up memory latency by switching to another available thread on a cache miss.
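To make the trade-off concrete, here is a rough back-of-the-envelope model of switch-on-miss multithreading. It is only a sketch under simplifying assumptions, not a description of how any SPARC T core is actually scheduled: every thread is assumed to run for a fixed number of cycles, then stall for a fixed miss latency, and thread switches are assumed to cost nothing. The function names and the cycle counts (20 cycles of work, 80 cycles of stall) are made up for illustration.

```python
def core_utilization(threads, run_cycles, miss_latency):
    """Fraction of cycles in which the core does useful work.

    Assumed model: each thread computes for run_cycles, then stalls for
    miss_latency on a cache miss. While it stalls, other ready threads
    may use the pipeline. Switching is treated as free.
    """
    busy = threads * run_cycles          # work available per miss period
    period = run_cycles + miss_latency   # one thread's compute + stall time
    return min(1.0, busy / period)       # the pipeline cannot exceed 100%


def per_thread_rate(threads, run_cycles, miss_latency):
    """Share of the core's peak rate that a single thread receives."""
    return core_utilization(threads, run_cycles, miss_latency) / threads


if __name__ == "__main__":
    # Illustrative numbers only: 20 cycles of work per 80-cycle miss.
    for n in (1, 4, 8):
        util = core_utilization(n, 20, 80)
        rate = per_thread_rate(n, 20, 80)
        print(f"{n} threads: core {util:.1%} busy, each thread at {rate:.1%}")
```

With these made-up numbers, one thread keeps the core 20% busy, four threads reach 80%, and eight threads saturate it, at which point each individual thread gets only 12.5% of the core. Aggregate throughput goes up while single-thread speed goes down, which is exactly the complaint above.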
At Hot Chips 23 (Stanford University, August 19, 2011), Oracle took the covers off the next generation of “CMT” SPARC processors. The SPARC T4 has a “feature” called the critical thread API, which allows a single thread to use all the resources of an entire core. Sound familiar? Yes, it's called maximizing the execution of a single ILP-rich thread, and the T4 will do this at clocks around 3GHz. Each core is now superscalar, with 16-stage integer and 11-stage floating point pipelines, and the chip adds an L3 cache as well!
One should wonder what was going on during Sun-cum-Oracle's six years of telling the world that designers of single-threaded, ILP-rich, high-clock-speed processors were confronting a technological barrier. What other stories does Oracle want us to believe now? Now that SPARC has been reset to address real-world data center applications, it will have to play catch-up with IBM, Intel, and AMD, who, for some reason, never had such a technological barrier to overcome!