In the previous blog entry, we read:
“If we reduce the execution clocks in
these successful processors by a half, reduce the cache sizes by four
or eight times, eliminate the L3 cache completely, reduce instruction
execution width to one, remove any branch prediction, can we expect
spectacular server consolidation performance with these nearly
choked processors? Not a chance! This characterizes the SPARC T
processor. To be fair, there are applications that lend themselves
to a processor that switches available thin thread contexts on L1
cache misses, but those are generally associated with applications
such as specific web farms and functions such as the UNIX dd command.”
Sun-cum-Oracle predicated their SPARC
T1-T3 Chip Multi Threading (CMT) architecture on a wishful perception
of the modern commercial work load. From 2005 to more or less a
couple of weeks ago (Hot Chips 23), they claimed modern data center
applications as having: (verbatim)
- A high-degree of thread level parallelism
- Large working data sets resulting
in poor locality of reference leading to high cache miss rates
- Significant data sharing among
threads resulting in coherence misses
- Low instruction level parallelism
(ILP) due to high cache miss rates, difficult to predict branches
- Performance bottlenecks due to stalls on memory access
Addressing these perceptions defined
the architecture of Sun's family of CMT SPARC T1 to T3 processors.
These processors were characterized by poor, single “thin” thread
performance, yet rather excelled in copying and moving data. The
cores were very simple with a handful of stages. Sun claimed single
thread, ILP-centric, high clocked processors did not address the
demands of the modern data center, for performance was limited by
memory access latencies. Sun essentially covered up memory latency by
switching to another available thread at a cache miss.
At Hot Chips 23, August 19, 2011
(Stanford University), Oracle took the covers off the next generation
of “CMT” SPARC processors. The SPARC T4 has a “feature”
called the critical thread API, allowing a single thread to use all
the resources of an entire core. Sound familiar? Yes, it's called
maximizing the execution of a single ILP-rich thread, and it will do
this at clocks around 3GHz. Each core is now a superscalar with
16-stage integer and 11-stage floating point pipelines – and does
so with the addition of an L3 cache!
One should wonder what was going on
with Sun-cum-Oracle's 6 years of telling the world that designers of
single-threaded, thick-ILP, high-clock-speed processors were confronting a
technological barrier. What other stories does Oracle want us to
believe now? Now that the SPARC has been reset to address real world
data center applications, it will have to play catch-up with IBM,
Intel, and AMD who, for some reason, never had such a technological
barrier to overcome!
Recently, industry trade journals
have announced Oracle's ability to migrate Virtual Machines (VMs)
safely and securely from one execution environment to another on
SPARC T-based servers. Actually, the OracleVM Server for SPARC Data
Sheet claims what might be considered rather run-of-the-mill for
state-of-the-art server virtualization. Even assuming basic
virtualization capabilities, the architecture of the Sun-cum-Oracle
SPARC T processors does not lend itself well to “tacking on”
virtualization and expecting low-overhead performance. If one is stuck with
either unportable Solaris source code or SPARC binary images and
simply needs the code to run, albeit degraded, using OracleVM to host
an old Solaris 8 VM on a SPARC T server could be a temporary
solution. However, as for OracleVM and a SPARC T being a powerful
virtualized consolidation platform, one needs to think twice. Here is
why.
When considering server consolidation
it soon becomes clear that without good planning, including adding
multiple paths to network connections and storage spindles, having
more than sufficient RAM (roughly, closing one's eyes and at least
summing up what each stovepipe server used) and choosing an appropriate
target execution architecture, problems soon arise. Modern
conventional processors such as IBM POWER and Intel x86 were designed
for wide instruction execution capability using compilers that extract
maximum Instruction Level Parallelism. Effective execution of wide
ILP code requires a processor design including advanced
branch prediction, large out of order execution windows, large and
extremely fast caches, enhanced with the ability to execute more than
one execution thread per core -- simultaneously. A single 8-core IBM
POWER7 processor, for example, can execute 32 threads all in the same
clock tick. This is in contrast with Oracle's SPARC T
processors, which at any time execute only one thread per clock per
core regardless of the number of thread contexts that are pending.
Modern, multi-threaded operating
systems such as AIX, Solaris, Linux, even Windows keep track of each
thread context status, switch between threads on the order of
milliseconds, enable prioritized thread preemption, manage tasks
and their switching, and page virtualized memory in and out. It is not a
coincidence that processors that have clever branch prediction,
large and sophisticated caching, fast clocks, and large memory spaces
not only have the highest performance (as seen on www.spec.org,
etc.) but are also successful platforms for server consolidation.
If we reduce the execution clocks in
these successful processors by a half, reduce the cache sizes by four
or eight times, eliminate the L3 cache completely, reduce instruction
execution width to one, remove any branch prediction, can we expect
spectacular server consolidation performance with these nearly
choked processors? Not a chance! This characterizes the SPARC T
processor. To be fair, there are applications that lend themselves
to a processor that switches available thin thread contexts on L1
cache misses, but those are generally associated with applications
such as specific web farms and functions such as the UNIX dd command.
In virtualized environments, system RAM
is extra virtual: on top of the operating system's address
translation (via the HW MMU), the hypervisor must satisfy and keep track of
address translation for multiple VMs. Threads should not switch on an L1
cache miss, but rather when the VMs demand it. Even on the latest SPARC T3
processor, with 8KB instruction, 6KB data L1 cache sizes, and 16
cores sharing a 6MB L2 cache, cache thrashing and thread stalling must be
tremendous in a virtualized environment.
In stark contrast, IBM PowerVM
is not only based on one of the highest performance processor
architectures (IBM POWER7: 32KB L1, 256KB L2 and 8 cores sharing a 32
MB L3), but has the ability to virtualize each core into 10 logical
processor increments – including the four simultaneously executing
threads, create processor pools, cap or uncap logical domains,
and migrate domains, all without a single reported hypervisor security fault.
Go to http://web.nvd.nist.gov/view/vuln/search
and enter powervm, oraclevm, and vmware separately to see the
An amusing example of how much Oracle
cares about customer experiences with OracleVM for SPARC is how a
simple question on its OracleVM blog
has remained unanswered since 2008 and has instead filled up with spam.
There has been quite a lot of talk
about ARM Holdings and the ARM processor lately. Some of this is due
to the pervasiveness of its architecture in many mobile devices, some
of it is due to extensive hype over “new technology” versus “old
technology” – an unfortunate metaphor.
Are we to believe processor designers
who license the rights to the ARM processor technology are going to
“one up” traditional server processor architectures simply
because they started out with a stripped down, energy-efficient CPU? Let's take a look at why not!
Benchmark results specifically
targeting these low-power processors have begun to be published.
Many of these benchmarks are based on the Dhrystone benchmark, run on
8088-class processors back in the 1980s! Performance for this class
of processor is usually measured in DMIPS (Dhrystone Millions of
Instructions/Sec), roughly based on a VAX 11/780 MIPS. These benchmarks
are a far cry from industry standard benchmarks such as the SPEC
suites or the TPC-C warehouse database benchmark. Before one starts yelling,
“How can one expect the ARM class of processors to do well on these
benchmarks?”, note that one cannot simultaneously reject so-called “old
technology” while extolling the wonders of a 30 battery-hour handheld
tablet processor in micro servers. It would indeed be interesting to see
SPECint2006 results for these processors, but none seem to exist. The
same goes for a TPC-C result. It is noteworthy that a dual core 1.6 GHz
Atom processor generates about 8000 DMIPS and a dual core Cortex A9
about 4000. This means that if Intel had to drop its clock to say
1GHz to be in the same heat dissipation range as the Cortex A9, they
would have “similar” performance – in a single socket.
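The clock-scaling argument above can be checked with a few lines of arithmetic. This is only a rough sketch: it uses the DMIPS figures quoted in the text and assumes, as a simplification that is not a measured result, that Dhrystone throughput scales linearly with clock frequency. The function name is illustrative.

```python
# Figures quoted in the text; linear scaling with clock is an assumption.
ATOM_DMIPS = 8000         # dual core Atom at 1.6 GHz
ATOM_CLOCK_GHZ = 1.6
CORTEX_A9_DMIPS = 4000    # dual core Cortex A9


def scale_dmips(dmips: float, from_ghz: float, to_ghz: float) -> float:
    """Scale a DMIPS score linearly with clock frequency (a simplification)."""
    return dmips * to_ghz / from_ghz


atom_at_1ghz = scale_dmips(ATOM_DMIPS, ATOM_CLOCK_GHZ, 1.0)
ratio_to_a9 = atom_at_1ghz / CORTEX_A9_DMIPS

print(f"Atom scaled to 1 GHz: {atom_at_1ghz:.0f} DMIPS")
print(f"Ratio to Cortex A9:   {ratio_to_a9:.2f}x")
```

Under that assumption the down-clocked Atom lands at roughly 5000 DMIPS, within about 25% of the Cortex A9 figure, which is the sense in which the two are “similar”.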
In reality, “new technology” (ARM)
and “old technology” (Intel, AMD, IBM POWER) are two different
technologies, neither chronologically distinct. If we expect to see a
farm of micro servers each with 100 ARM or ARM-like Systems on a Chip
in 1U form factors, one should expect they will be running commercial
grade applications, the least of which would be web and database
servers. Would we see a SPECweb2005 result published for 1024 socket
ARM-based micro web server? We had better.
Is one supposed to assume that
designers of Intel x86 or IBM POWER are simply wasting millions of
transistors due to negligence? No! Will the processors in the “new
technology” micro servers use a new way for cache coherency
heretofore unknown to the world? I doubt it. SMP cache coherency uses
transistors and consumes bandwidth. As more performance is demanded
from these ARM-class micro servers, processor designers will slowly
be incorporating techniques from “old technology” such as huge
out of order execution windows, complex caches, novel inter-socket
communications, multi-threaded execution and the ability to address
huge memory spaces. All these require complexity, transistors, and
watts. By the time all this has been accomplished the wheel will have
been re-invented again, with these micro servers dissipating about the
same heat as the “old technology” processors. If it takes a
given number of transistors to perform some advanced function such as
wide instruction execution and complex branch prediction, etc., the
ARM-class of processors will not perform such functions while
simultaneously violating the laws of solid state physics.
The hype surrounding this “new
technology” sounds strikingly familiar to what Sun Microsystems
claimed in the last half of the previous decade regarding its
“disruptive” Niagara “technology”. Sun said Thread Level
Parallelism was taking over the data center, since single thread
(Instruction Level Parallelism) was out of gas. Intel didn't think
so! AMD didn't think so! IBM didn't think so! Sun placed eight very
simplistic SPARC cores on a die, each executing, at any given
clock tick, one of up to eight thread contexts. Sun claimed clock
speed didn't matter because slow memory interfaces and long latencies
determined system throughput, not clock. Sun could claim something on
the order of a watt per [thin] thread context, versus perhaps 25W per
[heavy] thread from its competition. Well, about half a decade later
Sun+Oracle have reached a point where their processors now dissipate
basically the same amount of heat as established Intel, AMD, or IBM
POWER processors, and are considering reducing thread count and
cranking up the clock – to be competitive.
Sun's [now Oracle's] competition never felt the need to sacrifice
single thread performance, all the while adding cores and real
Simultaneous Multi Threading. The IBM POWER7 now has eight cores,
each capable of executing 4 instruction threads at the same time. A
single POWER7 can execute 32 threads simultaneously at a clock rate
nearly triple that of Oracle's Niagara-based processors. So much for
the hype! Something similar will have to happen with the “new
technologies” such as ARM-class processors in micro servers if they
expect to play with “old technology” big boys.
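The difference between holding thread contexts and actually executing threads can be made concrete with a small model. This is a sketch using only the core and thread counts stated in the paragraph above; the `Chip` class and its field names are illustrative and describe no vendor tool.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Chip:
    name: str
    cores: int
    contexts_per_core: int   # hardware thread contexts held per core
    issuing_per_core: int    # threads that execute in the same clock tick

    @property
    def total_contexts(self) -> int:
        return self.cores * self.contexts_per_core

    @property
    def threads_per_tick(self) -> int:
        return self.cores * self.issuing_per_core


# Counts as described in the text: Niagara runs one of up to eight
# contexts per core per tick; POWER7 runs four threads per core at once.
niagara = Chip("Niagara", cores=8, contexts_per_core=8, issuing_per_core=1)
power7 = Chip("POWER7", cores=8, contexts_per_core=4, issuing_per_core=4)

for chip in (niagara, power7):
    print(f"{chip.name}: {chip.total_contexts} contexts, "
          f"{chip.threads_per_tick} threads executing per clock tick")
```

On these figures Niagara holds twice as many contexts as POWER7 but executes a quarter as many threads in any given tick, before clock rate is even considered.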
As with most things, you don't get
something for nothing.
In Oracle® Database 11g Running on
Oracle’s SPARC Enterprise M9000 Server Sets World Record TPC-H
Three Terabyte Non-Clustered Benchmark Result,
Oracle's tradition of making claims, hoping the reader will not
examine the details, continues.
In last week's disclosure, Oracle
on a SPARC Enterprise M9000 server, equipped with 64 SPARC64 VII+ 3.0
GHz processors, and Sun Storage 6180 arrays, Oracle Database 11g
Release 2 with Oracle Solaris achieved a world record TPC-H 3 TB
non-clustered performance result of 386,478 QphH@3000GB with a price
Oracle is NOT the category winner in either QphH
or USD/QphH, the only two benchmark metrics. In fact, tpc.org has it
ranked number 2, clustered or not. Perhaps what Oracle referred to as a record is the
system cost record for this category, which it does win at
demonstrates that Oracle Database 11g Release 2 running on the SPARC
Enterprise M9000 server was 2.4 times faster than the IBM Power 595
system(3) and loaded the entire database 3.3 times faster than the
above system while maintaining the highest level of data protection
at a lower cost per transaction(3).
Upon looking at the
actual Oracle and IBM
benchmark disclosures, it must be noted:
- The Oracle result
used 256 cores to achieve their latest result
- IBM used 64 cores
and it was run on a POWER6, IBM's prior generation processor.
Oracle required 4X
the number of Fujitsu cores that IBM needed in POWER6 cores for a 2.4X
performance differential. Since TPC-H is a business analytics
benchmark, and business analytic applications have per core
licensing, even taking into consideration the M9000 “performance”
advantage, Oracle's M9000 SW cost would be 1.67 times that of IBM's prior generation p595 server (normalizing per core performance).
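The 1.67 figure follows directly from the core counts and the claimed speedup. A minimal sketch of the normalization, assuming software is licensed strictly per core at the same price on both systems (an assumption made for illustration; actual license terms vary by vendor and platform):

```python
ORACLE_CORES = 256   # SPARC Enterprise M9000 result
IBM_CORES = 64       # IBM POWER6 result
PERF_RATIO = 2.4     # Oracle's claimed speedup over the IBM system

core_ratio = ORACLE_CORES / IBM_CORES            # cores Oracle used per IBM core
per_core_perf = PERF_RATIO / core_ratio          # Oracle per-core throughput relative to IBM
license_cost_ratio = core_ratio / PERF_RATIO     # per-core license cost for equal throughput

print(f"Core ratio:                    {core_ratio:.1f}x")
print(f"Per-core performance vs IBM:   {per_core_perf:.2f}x")
print(f"License cost at equal work:    {license_cost_ratio:.2f}x")
```

Each Fujitsu core delivers about 0.6 of the work of a POWER6 core on this benchmark, so matching throughput needs about 1.67 times as many licensed cores.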
Oracle also claimed
it loaded its database 3.3 times faster than IBM. However, the
details show Oracle's Total
Storage / Database Size is 102.6, whereas IBM's
Total Storage / Database Size is 6.58. This means:
- To store 3TB of data, Oracle used 308TB (102.6TB of storage per TB of data set size),
- For the same 3TB of data, IBM only needed 20TB (6.58TB of storage per TB of data set size)
Oracle had massive storage, almost 16 times that of IBM per database size.
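The storage totals can be reproduced from the two disclosed ratios. A short sketch, using only the figures quoted above and treating the 3TB scale factor as the database size:

```python
DATABASE_TB = 3                  # TPC-H scale factor used by both results
ORACLE_STORAGE_RATIO = 102.6     # Total Storage / Database Size, Oracle
IBM_STORAGE_RATIO = 6.58         # Total Storage / Database Size, IBM

oracle_storage_tb = DATABASE_TB * ORACLE_STORAGE_RATIO
ibm_storage_tb = DATABASE_TB * IBM_STORAGE_RATIO
storage_multiple = ORACLE_STORAGE_RATIO / IBM_STORAGE_RATIO

print(f"Oracle total storage: {oracle_storage_tb:.1f} TB")
print(f"IBM total storage:    {ibm_storage_tb:.1f} TB")
print(f"Oracle vs IBM:        {storage_multiple:.1f}x")
```

This gives roughly 308TB against roughly 20TB, a multiple of about 15.6.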
While this is totally legal and legitimate, storage has its costs. The
end result is that the price/performance metric for this benchmark,
USD per QphH@3000GB for
Oracle and IBM is:
- Oracle M9000: 19.25
- IBM p595: 20.60
Oracle's latest TPC-H price/performance result is 1.07X
better than IBM's prior generation result. Not much to brag about,
considering Oracle's system costs $7.4M and IBM's is $3.2M. One
might conclude Oracle simply kept adding cores and storage until it
just passed IBM's result. In any case, Oracle provides 1.07x better price/performance for 2.3x the price.
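Both ratios in the paragraph above come from the disclosed figures. A quick sketch of the arithmetic, noting that the 1.07 figure is a ratio of the USD per QphH metric (lower is better) and not of raw throughput:

```python
ORACLE_USD_PER_QPHH = 19.25
IBM_USD_PER_QPHH = 20.60
ORACLE_SYSTEM_COST_MUSD = 7.4    # system cost in millions of USD, as quoted
IBM_SYSTEM_COST_MUSD = 3.2

# Lower USD/QphH is better, so IBM's figure goes in the numerator.
price_perf_advantage = IBM_USD_PER_QPHH / ORACLE_USD_PER_QPHH
system_cost_ratio = ORACLE_SYSTEM_COST_MUSD / IBM_SYSTEM_COST_MUSD

print(f"Oracle price/performance advantage: {price_perf_advantage:.2f}x")
print(f"Oracle system cost vs IBM:          {system_cost_ratio:.1f}x")
```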
benchmark underscores the ability of SPARC Enterprise M-series
servers to deliver near-linear scalability and handle the
increasingly large databases required of decision support and data
warehousing systems. Neither IBM nor HP matched this level of
performance in the 3TB scale factor category using a single system,
further highlighting the performance capabilities of multi-processor
SPARC systems in the most demanding enterprise application
As noted above, in this 3TB category,
Oracle is not even the best performer. Fujitsu's
own RX300 X4, a 640-core Xeon server wins.
In the 10TB category, IBM beats Oracle.
Only HP bothered to report results for
the 30TB category, and it did that in 2007.
Two weeks ago on an HP blog,
blogger John Pickett based much of his anti-IBM System z196 zBX claims
on what was “heard” rather than on hard evidence. The zEnterprise
BladeCenter Extension (zBX) is the new infrastructure for extending
tried and true System z qualities of service and management
capabilities across a set of integrated, fit-for-purpose POWER7 and
IBM x86 compute blades.
Pickett claimed: (responses are underlined)
IBM will use non-standard POWER7 and x86 blades. This is false,
IBM plans on using its standard blades.
Pickett then lists, numerically, half-truths, rumors and then
bases conclusions on them:
1) Will the zBX Blades be a replacement for the mainframe specialty
engines? No... so you'll have to determine when to run a workload on a mainframe general processor, a specialty engine, a zBX Power blade or a zBX x86 blade...and those are just the mainframe-centric options. The IBM blades are enhancements to the existing System z
infrastructure, which will utilize System z's existing application
administration. This is no different than any other application.
2) Why the need for unique Power7 and x86 blades specific to the
zBX? Doesn’t that defeat the purpose of an open environment? It
might if there were indeed unique IBM blades – but they are
standard IBM blades. zBX will not support every blade IBM ever
designed, but does support specific GA blades.
3) Will the zBX have the same availability as the mainframe? No,
Just because the zBX is connected to the mainframe does not mean the
availability from the mainframe is transferred. This assumption is
not based on anything published by IBM. Since the assumption is
false, the conclusion is false as well. The zBX chassis itself has
been "hardened" to be more like a mainframe in its
availability characteristics. All features are replicated - this
redundancy provides higher availability than standard blades. Even
the high speed private network is redundant. Redundancy allows
continued service in the case of an outage of a particular feature.
In addition, the zBX is monitored for availability and in case of
outage - a call home is initiated automatically. Do any HP blade
chassis have full feature replication?
4) Isn’t the business justification more than a little
challenging? Much of the cost parameters are based on soft
calculations to “increase operational efficiency” (the same which
can also be said of non-mainframe platforms). Should the application
run on a blade in the zBX, a specialty engine such as an IFL or a
mainframe general purpose processor? And don’t get me started on
the mainframe pricing schemes from WLC, AWLC, PSLC, zNALC, etc.
Notwithstanding Pickett never answering his question, the operational
efficiency comes from the housekeeping required of standard blades.
Things that are time consuming for standard blades include OS and
virtualization upgrades to keep all blades at the same release levels
and security levels. This is all done through rules and automatically
by zManager and zBX.
5) How about investment protection? Can you use pre-existing IBM
Power Blades? No. Pre-existing IBM x86 Blades? No. Pre-existing z10?
No. These options were withdrawn by IBM prior to the zBX even being
shipped. Why force mainframe owners to upgrade to a z196 just to
evaluate the zBX? First, one can use pre-existing IBM blades if they
are the specified type. This means that these boards exist today.
Second, are we to conclude that HP supports any blades it ever
sold in any of its blade chassis? No. And for a third party view, “IBM
has been successful in making their chassis totally backward
compatible with their older modules and blades and most of their
newer modules and blades fit in their older chassis with performance
restrictions in rare cases, but that offer a great investment
protection to customers who is upgrading their chassis comparing to
HP which forcing their customers to toss their old blades and modules
out as none of it is compatible across chassis. Who knows if the next
HP chassis will follow up the same path as their current one, which
mean a total lost of investment when upgrading.”
6) Will ISV applications need to be retested and recertified?
Unknown. Perhaps unknown to Pickett is that ISV applications (and
customer apps) will work unchanged. If they ran on AIX before, they
will work in this environment. No retesting or recertification. ISV
applications are certified for an OS - on the zBX, the OS is the same
as on standard blades.
7) What about Windows Server and SQL support? Not available. This
is actually true, as of this date.
8) Is VMware supported? Nope - not there either. Nor is it
available on System z or on POWER7 systems. Both have superior and
more secure virtualization than is offered by VMware. However, the
point is that VMware is not required -- zManager provides most of
these functions, and the customer saves on license costs for VMware, the
administration of VMware, its setup, upgrading, securing, etc. All
this is provided and managed by zManager.
9) The new URM (Unified Resource Manager) will simplify your
management, right? Not exactly. URM handles the hardware, but you
will still need other products such as Tivoli Provisioning Manager,
Tivoli Service Automation Manager and OMEGAMON for automation,
control and service management. Not required. This is a customer's
choice in terms of the service management functions they want to add
to the environment.
Pickett concludes: That really does not sound like something that
reduces complexity. Sure, if one bases a conclusion on wrong, poor,
and incomplete facts, as is the case here.
Oracle: Beyond Benchmarksmanship -- SPARC T3-1 takes JD Edwards "Day In the Life" benchmark lead, beats IBM Power7 by 25%.
This one is for the books! It appears
Sun has passed the “report an obscure benchmark you do well on”
tradition to Oracle. Sun used to report success with the
Manugistics NetWORKS Fulfillment Benchmark. Good luck finding the
latest results for that “industry standard” benchmark.
While the JD Edwards "Day In the
Life" is an active benchmark, it certainly is not in the
category of industry standard, such as the SPEC suites or TPC-C. It is
so obscure that Oracle didn't bother to provide a direct reference in
their announcement for the reader to make sense of the results. A
googling of “JD Edwards "Day In the Life" benchmark”
produced an IBM white paper that provided the following reference in Appendix B:
Minimum Technical Requirements (MTRs)
for JD Edwards EnterpriseOne are hosted on the MyOracle Web site: https://support.oracle.com/
- Log in to Oracle’s Partner
Connection with your userid and password:
- After signing in, click on More,
then click on Certifications
- Double click on JD Edwards
- Double click on the Note Link for
MTRs, such as Note:745831.1
- Then scroll down to the MTR needed
and double click it
We are not going to register as an
Oracle Partner to analyze Oracle's claims. We need only view what
Oracle has published in its announcement to examine their claim.
- The SPARC T3-1 server is 25% faster and has better response
time than the IBM P750 POWER7 system, when executing the JD Edwards
EnterpriseOne 9.0.1 Day in the Life test, online component.
- The SPARC T3-1 server had 25% better space/performance than
the IBM P750 POWER7 server.
- The SPARC T3-1 server is 5x faster than the x86-based IBM
x3650 M2 server system, when executing the JD Edwards EnterpriseOne
9.0.1 Day in the Life test, online component.
- The SPARC T3-1 server had 2.5x better space/performance than
the x86-based IBM x3650 M2 server.
What Oracle did not say was that:
- For a single socket SPARC T3 to
have 25% better results than a single socket POWER7, Oracle needed
twice the number of cores and four times the threads as the IBM
- Oracle compared their just
released SPARC T3-1 results with that of an IBM POWER6, a product
announced almost FOUR YEARS ago. This is very disingenuous of Oracle
and assumes their customers will not bother to check if Oracle is
making apples-to-apples comparisons. Sun used to assume this.
- Oracle then compared their new
SPARC T3-1 server results to an IBM x3650M2, 2x2.93 GHz X5570, with
64GB of memory – half the RAM of the Oracle T3 machine. It is
incumbent on Oracle to compare their new machine with a comparably
configured IBM x86 server, that is, one with 128GB, or provide
results for SPARC T3-1 server with 64GB of RAM. Neglecting to do so
will result in more of Oracle's performance claims coming under
their SPARC T3-1 is 5X faster than the IBM x3650M2. This claim
is not conclusive. A server cannot be 5X faster simply because the
benchmark reports it serviced 5X the number of users. Moreover, the
IBM x3650 M2's response time is 0.29 sec compared with the latest
SPARC T3-1's 0.523 sec. If response time is more important to the
user, the year-and-a-half-old IBM x3650 M2 with half the RAM and half
the core count is about 2X as responsive as the latest SPARC T3-1
servers. In fact, the amount of available system RAM usually has a
direct bearing on the number of users supported. It will be
interesting to see what the latest IBM X3650 M3 with 128GB of RAM will
have for results.
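The responsiveness comparison is a simple ratio of the two published response times. A sketch using the figures quoted above; the variable names are illustrative:

```python
X3650_M2_RESPONSE_SEC = 0.29
SPARC_T3_1_RESPONSE_SEC = 0.523
X3650_M2_RAM_GB = 64
SPARC_T3_1_RAM_GB = 128

# How many times longer a user waits on the SPARC T3-1 than on the x3650 M2.
response_time_ratio = SPARC_T3_1_RESPONSE_SEC / X3650_M2_RESPONSE_SEC
ram_ratio = SPARC_T3_1_RAM_GB / X3650_M2_RAM_GB

print(f"SPARC T3-1 response time vs x3650 M2: {response_time_ratio:.2f}x")
print(f"SPARC T3-1 RAM vs x3650 M2:           {ram_ratio:.0f}x")
```

The ratio works out to about 1.8, which the text rounds to “about 2X”.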
If core license
costs are important, IBM's year old POWER7 750 uses half the number
of cores for about the same benchmark performance as the just
released Oracle SPARC T3-1 server. This Oracle benchmark
announcement demonstrates that the latest SPARC T3-1 server, at a
minimum, has a business application suite licensing cost of 2x over
that for IBM POWER7 750 servers. Oracle is telling us that SW costs
based on cores could double by using their HW over POWER7 servers from IBM.
Oracle should not
assume that the readers of its benchmark results will believe their
claims without investigation!
In the article “EMC kills SPEC benchmark
with all-flash VNX”,
Chris Mellor calls this a “watershed benchmark”, and continues,
“The previous top SPECsfs2008 NFS v3 score was 403,326 ops/sec from
an IBM SONAS (Scale-Out NAS). EMC's result was 497623
“SPECsfs2008 is the
latest version of the Standard Performance Evaluation Corporation
benchmark suite measuring file server throughput and response time,
providing a standardized method for comparing performance across
different vendor platforms. SPECsfs2008 results summarize the
server's capabilities with respect to the number of operations that
can be handled per second, as well as the overall latency of the
Clearly, EMC's results are better
than IBM's published results.
However, without getting into minutiae, in comparing the basic
storage technology used by EMC with that used by IBM,
almost (93%) all of EMC's drives are solid state disks (SSD) and all of
IBM's storage uses 15K rpm hard disks. The advantages of SSDs are
well known, and SSD is certainly an acceptable storage technology for this
benchmark. An observation that should be noted is that SSD technology
provides from over one to approximately two orders of magnitude in
random I/O ops/sec performance over 15K rpm drives (1), yet
EMC only reported a slight improvement over IBM's result in this
benchmark. Is cost perhaps the reason?
The cost difference between
200GB SAS Flash and 450-600 GB 15K SAS drives is in the wide range of
5-30X. The performance capability of SSDs, in this benchmark,
allowed EMC to use about ¼ the number of overall drives that IBM
used. Since the dollar cost of actual storage per
SPECsfs2008_nfs.v3 ops/sec
appears significantly higher for the EMC result than for IBM's, it is
not clear why a customer would spend so much extra for EMC's SSDs
rather than standard high performance spindles for a 23% performance advantage. It almost appears as though EMC simply wanted a
benchmark result to be slightly higher than IBM's.
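The 23% figure and a rough sense of the drive spend can be worked out from the numbers quoted above. This is a back-of-the-envelope sketch under loud assumptions: the 5-30X range is read as a per-drive price multiple, every EMC drive is treated as flash (the text says 93%), and controllers, enclosures and capacity differences are ignored.

```python
IBM_OPS_PER_SEC = 403_326
EMC_OPS_PER_SEC = 497_623
EMC_DRIVE_FRACTION = 0.25          # EMC used about 1/4 the drives IBM used
PRICE_MULTIPLE_RANGE = (5, 30)     # assumed per-drive flash vs 15K SAS price multiple

performance_advantage = EMC_OPS_PER_SEC / IBM_OPS_PER_SEC - 1

# Relative drive spend = price multiple per drive x fraction of drives used.
relative_drive_spend = tuple(m * EMC_DRIVE_FRACTION for m in PRICE_MULTIPLE_RANGE)

print(f"EMC performance advantage: {performance_advantage:.1%}")
print(f"EMC drive spend vs IBM:    {relative_drive_spend[0]:.2f}x "
      f"to {relative_drive_spend[1]:.2f}x")
```

Under these assumptions the drive spend lands somewhere between 1.25 and 7.5 times IBM's for a throughput gain of about 23%; where in that range a real configuration falls depends on actual street prices, which the benchmark disclosures do not settle.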
IBM Watson not only succeeded in
subjugating its human opponents on Jeopardy but Watson might someday
be the motivator for Sonny and the re-programmed robots
to move from Chicago, USA to Toronto, CA :-)
All kidding aside, industry pundits
have been seriously speculating about alternate uses for IBM Watson.
While dramatic extrapolations to HAL9000, SkyNet, and I, Robot
abound, less has been said about Watson's immediate usefulness.
Rather than dwell on the Jeopardy “version” of Watson with its
ninety IBM POWER7 servers
(http://www-03.ibm.com/systems/power/advantages/watson/) clustered together with an aggregate
memory size of 15TB, it may be more useful to look at non-game show
“versions”. We generally know what Watson had to accomplish
within the three second (see previous blog) Jeopardy response time
rule. What may be more interesting is to consider classes of problems
where response time demands are on the order of minutes, where a 15TB
data set is not required, or where a Watson-like construct can aid in
narrowing down solutions to sets of possibilities, rather than a
single exact response – and let humans decide and execute on a
critical choice. Versions of IBM Watson/DeepQA can be architected and
have their data digested as a function of the problem being addressed.
Alternate, larger data set versions or versions requiring a response
in less than a second can be designed. Let's look at some examples
from a 50,000 foot level.
Medical and Health Services
A researcher is confronted with a set
of patient symptoms never learned in school; some symptoms look
familiar to seasoned colleagues, but not all of them at the same
time. Is this a new disease, a mutation, something that may have
always existed but was never categorized in a way that was recognized?
Traditional hospital databases can be scanned, results correlated as
best can be, but still nothing definite. This is what might take
place today, assuming these databases were constructed for more
than just billing purposes. A Watson-like derivative could be
designed to ingest patient data with specific annotations allowing
correlations that would greatly enhance the chance of narrowing down
the knowledge required to identify and eventually treat what appears
at first to be an uncategorized disease. This capability may be vital
for health services located in rural areas where a Watson-like system
has the proven knowledge of millions of medical experts and studies.
Imagine making a query on a surgical procedure and finding out that a
technique abandoned twenty years ago has a better chance of being
successful than what is used today because of a heretofore
unsuspected interaction from combinations of patient symptoms or new
hormonal balances resulting from subsets of prescribed modern
medicines. Replace bacterial, microbial or oncological diseases,
etc., with determining patterns between psychiatric symptoms and
effectiveness of classes of past treatments – this is another
variant of a Watson solution. Would Watson completely replace a
doctor? Probably not, but it can start off as a trusted advisor and
the role of a doctor may be changed forever.
Financial and Economic Analysis
Pumping through piles of financial and
economic data looking for patterns, uncovering relationships between
seemingly unrelated events already consumes vast amounts of computing
power in such places as Wall St, London, and Hong Kong. The ability
to act on distilled, structured information is generally left to
analysts – except for programmed trading. Programmed trading
systems react faster than humans to prevailing conditions, but lack
the capability to respond to exogenous events outside of their
rule-based model. They act more like a chess program than Watson on
Jeopardy. The experts on Wall St cannot possibly program all the
rules of the particular game with the hope that combinations of
dynamic market and economic data will hit one of them. A system
designed to dynamically digest unstructured data (examples including
libraries of texts on economics, university lecture notes, radio and
TV programs, blogs, etc.), create relationships with static data, and
purposely distribute this information across processing nodes to
minimize redundancy and maximize processing is much more capable of
efficiently ascertaining risk than having a rule for every possible
combination of known financial and economic data. Having a machine
with near instantaneous access to machine learned data from the world's
leading economists and financial analysts might make a nice companion
on the trading floor, considering it appears reluctant to bet the
house if it is not confident of a position. Investors Business Daily
has an interesting take on Watson-like capabilities.
Tech Support and Help Desks
IBM Watson could cannibalize most forms
of current consumer technical support. Could it be worse than what
passes for telephone-based tech support today?
Law, Patent and Trademarks
Not only could a Watson-like capability
minimize filing through existing databases on laws, prior cases,
rulings, hearings, and opinions, it could also be used as a method of
testing witness questions or suggesting a series of inquiries and
questions for litigation. It could be used to simulate certain
judges, prosecution and defense lawyers, based on prior cases.
A Watson-like system could generate
questions for a prospective patent claim based on its ingestion of
the entire patent and trademark database.
National Defense Planning and
The amount of structured data and
especially dynamic unstructured data that can be associated with
military and defense planning is enormous and expanding rapidly.
Military decision support systems could be augmented by a system that has data
on all previous military campaigns, past and current international
relationships, and archives of all international military school
generated data including books, theses, lectures, military doctrines,
etc. Continuous ingestion of real-time data would expand
existing relationships. An inquiry that might be forwarded to such a
system might include, “What would Sun Tzu do given the immediate
Taking just the U.S. Internal Revenue
Service – the most serious problem at the IRS for taxpayers is
getting somebody at the agency to answer telephone calls.
Judging from Watson's ability to address very nuanced Jeopardy
questions, it could address this problem now.
Other areas include manufacturing,
homeland security, local law enforcement agencies, etc
A clear pattern is emerging here. Tasks
that traditionally involve humans remembering, making intelligent
guesses and informed estimates, even if backed up by filing through
mountains of data, could be greatly enhanced or even replaced by
Watson and its derivatives.
As with most new technologies,
something is gained but something is lost. Most people reading this
used to remember important telephone numbers. Today, with perhaps
hundreds stored in a cell phone, the ability to recall telephone
numbers is almost a lost talent. However, nobody seems to be
Today, 2/14/2011, the first of three Jeopardy! sessions between the top two Jeopardy! champions and IBM Watson will air on national TV. As each question is asked a lot will be taking place, and many of us will be wondering just what is going on inside IBM Watson. Just what is going on?
While IBM Watson's entire execution infrastructure has not been
published, we do know that each compute element consists of a
commercially available IBM POWER 750.
The entire interconnected cluster looks like a set of library shelves.
Many of us will wonder each time a question is asked what is going on in the three seconds given to the contestants. As humans, we can more or less understand being a Jeopardy! contestant. Many people will invariably not know the answer in three seconds but will retort after the correct response is made -- "Oh, I knew that!". Watson is not doing that, although it has been reported that IBM Watson has a good idea of the types of questions and answers that have been previously asked on Jeopardy! In contrast, Watson's POWER7 processors are pumping through 15 TB of data (equivalent to about 200 million pages of text) at a rate of 500 GB/s each, concurrently. But first, Watson has to understand the question. It has to determine verbs, nouns, objects and, moreover, nuances in the English language not generally part of the standard English 101 class. Next, Watson must look for the best answer. What might be the basic applications that are used to accomplish this massive task?
It has been reported that Watson runs on Linux, and that DeepQA (Watson's SW application stack) also uses Hadoop and UIMA. UIMA stands for Unstructured Information Management Architecture, and according to Wikipedia, "UIMA is a component software architecture for the development, discovery, composition, and deployment of multi-modal analytics for the analysis of unstructured information and its integration with search technologies developed by IBM. The source code for a reference implementation of this framework has been made available on SourceForge, and later on Apache Software Foundation website." This is an application that intelligently digests and correlates information that otherwise appears amorphous.
As noted in my previous blog entry, https://www-950.ibm.com/blogs/davidian/entry/what_runs_watson_and_why16?lang=en_us, IBM Watson has 4 TB of storage but 16 TB of system-wide memory. Such an architecture suggests in-memory databases or at least in-memory data structures. Indeed, Watson uses Apache's Hadoop framework to facilitate preprocessing the large volume of data in order to create in-memory datasets. To provide effective CPU scheduling, the Hadoop file system includes location awareness, that is, awareness of the physical location of each node, rack and network switch. Hadoop applications can use this information to schedule work on the node where the data is and, failing that, on the same rack/switch, reducing backbone traffic. The Hadoop file system also uses this information when replicating data, trying to keep different copies of the data on different racks.
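The "node first, then rack, then anywhere" preference described above can be sketched in a few lines of Python. This is only an illustration of the idea, not Hadoop's actual scheduler or API; the names Node, locality_rank and pick_node, and the node and rack labels, are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Node:
    """A compute node and the rack it sits in (hypothetical labels)."""
    name: str
    rack: str


def locality_rank(candidate: Node, replicas: list[Node]) -> int:
    """Lower is better: 0 = data is on this node, 1 = same rack, 2 = elsewhere."""
    if any(candidate.name == r.name for r in replicas):
        return 0
    if any(candidate.rack == r.rack for r in replicas):
        return 1
    return 2


def pick_node(idle: list[Node], replicas: list[Node]) -> Node:
    """Choose the idle node closest to a copy of the data."""
    return min(idle, key=lambda n: locality_rank(n, replicas))


# Two copies of a data block, kept on different racks.
replicas = [Node("n1", "rackA"), Node("n7", "rackB")]

# n7 holds a copy, so it wins over a same-rack or off-rack node.
idle = [Node("n9", "rackC"), Node("n2", "rackA"), Node("n7", "rackB")]
chosen = pick_node(idle, replicas)
print(chosen.name)

# If no idle node holds a copy, fall back to one on the same rack.
fallback = pick_node([Node("n9", "rackC"), Node("n2", "rackA")], replicas)
print(fallback.name)
```

Work that lands on the node holding the data never touches the network; work on the same rack stays behind one switch. Only the last case crosses the backbone.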
"Watson’s DeepQA UIMA annotators were deployed as mappers in the Hadoop map-reduce framework, which distributed them across processors in the cluster. Hadoop contributes to optimal CPU utilization and also provides convenient tools for deploying, managing, and monitoring the data "analysis process." For more information see: http://www.ibm.com/common/ssi/cgi-bin/ssialias?infotype=SA&subtype=WH&htmlfid=POW03061USEN&attachment=POW03061USEN.PDF&appname=STGE_PO_PO_USEN_WH
When watching Jeopardy! tonight try to keep in mind that for every question, IBM Watson has to, as a minimum, within 3 seconds:
- Take the stated question and parse its components
- Determine relationships between grammatical elements
- Create items that it must look for or relationships that may expand its search
- Have Hadoop dispatch work to access information that UIMA has intelligently digested and annotated
- 2880 POWER7 cores processing through TBs of data looking for the best set of results
- DeepQA then determining what it considers the best response, and
- Press a mechanical button as do the human contestants and express the answer in English.
Let the best "man" win!
IBM Watson http://www-03.ibm.com/innovation/us/watson/index.shtml, the computer that will
compete against the top two Jeopardy! champions on February 14-16, 2011, is constructed using a
commercially available computing platform from IBM. The IBM Watson is a massively parallel
system based on the IBM POWER7 750 in a standard rack mounted configuration.
The IBM Power 750, featuring IBM's POWER7 processor, is a server that runs AIX, IBM i and
Linux, and has been on the market since Feb 2010. This is the same unit that has been described on
http://www03.ibm.com/systems/power/hardware/750/index.html for about a year.
The IBM Watson itself looks like what could be described as a set of books in a book shelf. See:
IBM Watson is comprised of ninety IBM POWER 750 servers, 16 Terabytes of memory, and 4
Terabytes of clustered storage. This is enclosed in ten racks including the servers, networking, shared
disk system, and cluster controllers. Each of these ninety POWER 750 servers has four POWER7 processors,
each with eight cores. IBM Watson has a total of 2880 POWER7 cores.
Watson runs IBM DeepQA software, http://www.research.ibm.com/deepqa/deepqa.shtml, which scales
out with and searches vast amounts of unstructured information. Effective execution of this software,
corresponding to a less than three second response time to a Jeopardy! question, is not just based on
raw execution power. Effective system throughput includes having available data to crunch on.
Without an efficient memory sub-system, no amount of compute power will yield effective results. A
balanced design is comprised of main memory, several levels of local cache and execution power.
The IBM POWER 750's scalable design is capable of filling execution pipelines with instructions and
data, keeping all the POWER7 processor cores busy. At 3.55 GHz, the on-chip bandwidth of each of Watson's POWER7
processors is 500 Gigabytes per second. The total on-chip bandwidth for Watson's 360
POWER7 processors is an astounding 180,000 Gigabytes per second! It is no accident that IBM
POWER7-based technology serves as the basic hardware building block for IBM Watson.
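The totals quoted above follow directly from the per-server figures. A short Python check, using only numbers stated in this entry (the variable names are just labels for this sketch):

```python
# Figures as stated in the text above.
SERVERS = 90                 # IBM POWER 750 servers in Watson
PROCESSORS_PER_SERVER = 4    # POWER7 processors per server
CORES_PER_PROCESSOR = 8      # cores per POWER7 processor
ON_CHIP_BW_GB_S = 500        # on-chip bandwidth per processor, GB/s

processors = SERVERS * PROCESSORS_PER_SERVER
cores = processors * CORES_PER_PROCESSOR
total_on_chip_bw_gb_s = processors * ON_CHIP_BW_GB_S

print(f"POWER7 processors: {processors}")
print(f"POWER7 cores:      {cores}")
print(f"Aggregate on-chip bandwidth: {total_on_chip_bw_gb_s:,} GB/s")
```

Note that the aggregate is a sum of per-chip on-chip bandwidths, not a measure of what can move between servers.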
If there were an industry standard performance benchmark for playing Jeopardy!, such as
specJeopardy!2011, there would be only one published result.
Watch an in-depth discussion of IBM Watson on your local PBS channel on February 9, 2011,
Watson's human-like artificial
intelligence beat both Jeopardy rival champions in a dry run as
reported in the trade sites on January 14, 2011. Canadian Broadcasting in Computer
beats Jeopardy! champs
reported, “Later, the human contestants made jokes about the
Terminator movies and robots from the future.”
Timothy Prickett Morgan in
referred to IBM Watson's avatar as the evil Skynet. In that article,
Watson beats humans in Jeopardy! dry run,
Morgan noted that Watson is a Linux cluster of IBM POWER7-based p750
servers. For a pundit who mentioned how the Watson Jeopardy event is
perhaps a veiled marketing ploy, he gratuitously added, “Watson QA
software is running on 10 racks of these machines, which have a total
of 2,880 Power7 cores and 15 TB of main memory spread across this
system. The Watson QA system is not linked to any external data
sources, but has a database of around 200 million pages of "natural
language content," which IBM says is roughly equivalent to the
data stored in 1 million books.”
Several reports stated that
Watson finds language ambiguity a challenge.
Perhaps this makes Watson more human-like than we think, as
interpreting ambiguity in speech is generally a learned ability.
Myth has it that Discovery One
Spacecraft's HAL9000 computer name in 2001: A Space Odyssey
is a one-letter-shift from the letters IBM as in IBM9000. I suspect
we should start worrying when the next generation of IBM lip-reading,
human-like technology argues with us as in the classic “open the
pod bay doors”... http://www.youtube.com/watch?v=kkyUMmNl4hk
So, who cares? What difference does it
make to me – I don't even watch Jeopardy? While what IBM is
pursuing may be somewhat dismissed as veiled gratuitous public
relations by some pundits,
this human-like intelligence is a demonstration of what will permeate
our lives well within a generation. The research and development
prowess that IBM will demonstrate on nationwide TV, regardless of whether it
“wins”, represents the type of gating technology that will be the
progenitor for an industry that doesn't even exist yet: human
assistants ranging from intelligent prosthetics to nannies to
soldiers and home robots. Sound wild? A generation ago who would have
thought there would be multiple computing platforms in the home?
The types of technological challenges
that have to be overcome for the realization of a “home robot
market” include: multiple simultaneous emotion extraction from
enhanced speech and facial recognition, natural language interfaces,
cognitive abilities, symbolic interpretation of live vision objects,
tactile grasping, near instantaneous database access or some digital
neural equivalent, etc., etc.
What was learned from IBM Deep Blue's
victory over Grandmaster Garry Kasparov in 1997
was an early step. Today's ability to take on the best of
Jeopardy allows us to learn and define the technological hurdles that
must be solved that will usher in the next revolution in computing.
IBM taking on the best of Jeopardy is important to anybody who has an
interest in their high-technology career over the next twenty years.
New markets and industries will be created that are unimaginable
more information see:
When Sun Microsystems' native SPARC
processors were sucking wind, Sun marketing began talking down single
threaded, high-clocked, large-fast cache-based execution environments
in favor of a mythical transformation of most all applications into
thread-rich execution environments. Sun made the term Thread Level
Parallelism [TLP] ubiquitous. Now that Oracle has purchased Sun we read
that single threaded, high-clock rate execution is being demanded by
Oracle applications. Changing horses twice mid-stream does not
impress data center managers.
Timothy Prickett Morgan noted,
“Oracle has been promising a 3X
improvement in "single strand" performance, which everyone
mean clock speed.”
“...Oracle might be overclocking the
Sparc chips to reach the 5 GHz stratosphere of chip clock speeds.
While this might not be the case, the question we need to be asking
Oracle - and remember, Oracle
doesn't answer questions - is: if not,
In the 2000s, Sun's customers were expecting explanations for its
traditional UltraSPARC processors lacking in performance. In reality
Sun, via Texas Instruments [TI], was not able to successfully
fabricate traditional high-clocked, large cache, state-of-the-art
processors. Traditional processors, such as IBM POWER or Intel x86,
were designed to maximize Instruction Level Parallelism [ILP] with
fast single thread execution.
In mid-2002 Sun purchased Afara, the firm that designed processors with
slow clocks and simple cores able to maximize the execution of many
threads. TI was able to fabricate these processors with simple cores and
small caches, and placed identical copies on a single die. This
created Sun's Niagara processor line, known today as the UltraT1, 2,
3, etc. Sun began its CMT marketing campaign claiming that processor
clocks have reached an asymptote and memory performance was scaling
at 1/3 that of processor clocks, condemning traditional execution to
the dust bin of history. Sun's CMT technology was purported to save
the data center and do so at a low heat dissipation per thread
regime. Sun's argument was that ILP has reached the end of the line,
processor clocking had reached the point of creating unimaginable
power densities, and memory technology was never going to catch up.
This contrarian CMT market hype was taking place as IBM POWER4, the first
commercial general purpose multi-core processor, was setting
performance records and Intel's Xeons were approaching 4 GHz.
IBM's POWER6 hit 5GHz several years ago, and today's IBM System z
(mainframe) processors run at 5.2 GHz. What Sun proclaimed as a
semiconductor technology wall was torn down with clever designs by
IBM, Intel and AMD. Sun sacrificed single thread performance as the
cost of keeping a processor line alive. Sun paid the price as it
lost market share. IBM and Intel today have multiple core
processors running multiple simultaneous threads, never having to
sacrifice single thread performance in the interim.
As we enter this decade it appears that Sun+Oracle plans on cranking up
the clocks on their CMT processors while keeping the core count
constant. In addition, Sun+Oracle appears to be adopting the
capability to dynamically alter the number of threads per core
allowing more of the CPU core to execute the thread (contrary to its
CMT market hype) and enabling more cache per thread! Sound familiar?
It should, considering IBM introduced it earlier last year calling it
Intelligent Threading (see:
has basically contradicted nearly all of Sun's CMT marketing hype.
The following link is one of the few remaining original CMT
justification presentations still on the web, outside of oracle.com:
Most of Sun's CMT processor presentations seem to have been excised
from the web. A June 2005 blog that is still active at Oracle, http://blogs.sun.com/esaxe/entry/cmt_performance_enhancements_in_solaris
“Rather than butting heads with the
laws of physics in an attempt to quickly burn though a single
instruction stream (stumbling and stalling along the way), CMT
processors do more by allowing multiple threads to execute in
It wasn't that Sun's processors didn't
meet performance expectations due to the laws of physics. Rather,
Sun failed to meet the challenge of designing and fabricating
processors within the limits of solid state physics. It appears that
Sun+Oracle are playing catch-up again against IBM and Intel –
neither of which waited around for the “laws of physics” to ease
For almost the entire decade following Y2K, Sun Microsystems claimed the
TPC-C benchmark was irrelevant, not representative of the modern data
center and, moreover, could not be used for sizing. Subsequently, Sun
didn't publish any TPC-C results. This benchmark alienation came just
after Sun claimed its final world record E10K TPC-C results with
UltraSPARC-II processors and just before Sun introduced the
UltraSPARC-III, circa 2001. These actions were not accidents nor is
the recent Oracle+Sun's claim of a TPC-C result of 30,249,688 tpmC
UltraSPARC-III had a blocking L1 cache, designed to optimize SPEC
CPU95 benchmark execution. The UltraSPARC-III was late enough that
the SPEC CPU95 was retired and replaced by SPEC CPU2000. SPEC CPU2000
had a larger footprint and a different execution pattern than its
predecessor. Throughout the last decade, Sun's UltraSPARC processors
were plagued by poor single processor industry-standard benchmark
results. For Sun, publishing any TPC-C results would have been very
embarrassing. (I know, having been a member of Sun's benchmark council.) When
industry standard benchmark results were good, Sun would publish them.
When results turned out poor, the benchmark was attacked. When
results became good “again”, they were published, as they were by
Oracle+Sun on December 2, 2010.
While the TPC-C benchmark could be characterized by light-weight thread
processing representing, “... the principal activities
(transactions) of an order-entry environment. These transactions
include entering and delivering orders, recording payments, checking
the status of orders, and monitoring the level of stock at the
warehouses” (see: http://www.tpc.org/tpcc/default.asp),
this benchmark does provide a relative measure of the ability of a
system to move data with processing capability secondary (The handful
of SQL statements are rather trivial). Rapid data movement with
low-quality processing is a forte of Sun's T1, T2, T3, and T4
processors. Interestingly, it was only after Oracle purchased Sun
that TPC-C benchmarks on Sun SPARC were published again. It was known
as far back as 2005 that the UltraT1 generated relatively good TPC-C
results, but because the TPC-C benchmark had been deemed worthless, Sun
could not publish them lest it be called on the carpet for blatant
duplicity. Oracle must think today's customers have no medium-term
memory, a poor assumption for a database software company.
TPC-C results come in two flavors, single or clustered. A single result
represents the capability of a single server with its storage. A
clustered result approximates a cumulative sum of all the machines in
the cluster. The larger the cluster, the better the result. Of course
clustering like this has its mechanical and networking asymptotes,
but generally you can pick a desired tpmC and then cluster servers
and storage until that result is achieved. Sun made this argument a
decade ago as a reason to avoid the TPC-C clustered results. In fact,
Sun used to claim that IBM and others had to cluster their servers to
get even publishable results.
TPC-C results can be used for certain comparisons. For example, the
latest Sun+Oracle TPC-C result was achieved using a cluster of
twenty-seven servers with 1726 SPARC processor cores. They then
compared the result with the best IBM result, which is a cluster of
three p780 servers with 192 POWER7 cores. Sun+Oracle has a 3X better
result than IBM with 9X the cores and 9X the servers. The quotient is left as
an exercise for the reader!
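For readers who would rather not do the division by hand, here is a small Python sketch. It uses the server and core counts given above together with the two tpmC figures quoted elsewhere in this blog (30,249,688 for Sun+Oracle and 10,366,254 for IBM's POWER7-based 780). Pairing those tpmC figures with these two clusters is an assumption of this sketch, and the class name TpccResult is purely illustrative.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TpccResult:
    """One clustered TPC-C result, using figures quoted in this blog."""
    vendor: str
    tpmc: int
    servers: int
    cores: int

    @property
    def tpmc_per_core(self) -> float:
        return self.tpmc / self.cores

    @property
    def tpmc_per_server(self) -> float:
        return self.tpmc / self.servers


# Assumption: these tpmC values belong to the clusters described above.
oracle = TpccResult("Sun+Oracle", 30_249_688, servers=27, cores=1726)
ibm = TpccResult("IBM", 10_366_254, servers=3, cores=192)

throughput_ratio = oracle.tpmc / ibm.tpmc
per_core_advantage = ibm.tpmc_per_core / oracle.tpmc_per_core
per_server_advantage = ibm.tpmc_per_server / oracle.tpmc_per_server

for r in (oracle, ibm):
    print(f"{r.vendor:>10}: {r.tpmc_per_core:,.0f} tpmC/core, "
          f"{r.tpmc_per_server:,.0f} tpmC/server")
print(f"Sun+Oracle total throughput vs IBM: {throughput_ratio:.2f}x")
print(f"IBM per-core advantage:   {per_core_advantage:.2f}x")
print(f"IBM per-server advantage: {per_server_advantage:.2f}x")
```

On these inputs the larger cluster delivers roughly three times the total throughput while each IBM core and each IBM server does roughly three times the work.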
a heritage of duplicity, note the title of another blog on the same
blogs.sun.com site that Oracle's latest TPC-C claim was made:
What to believe is up to the imagination of the reader!
The title of this blog entry contains a direct quote from Larry
Ellison, Oracle CEO. See: www.youtube.com/watch?v=3WPOrdUGteE
(50-54 seconds into the clip) and deserves investigation considering
Oracle also claims its Exadata appliance provides “Extreme
Offhand, one could ask to whom has
Oracle been selling its database software? One might wonder what
credit card, supply chain, etc., OLTP database systems have been
doing the past 20 or 30 years!
Given Ellison's theatrics and
hyperbole, it is worth perusing industry standard On Line
Transaction Processing (OLTP) as well as Data Warehousing benchmark
results to determine at least the relative “extreme performance”
of Oracle's Exadata product.
The accepted independent industry
standard benchmark for OLTP database systems is the Transaction
Processing Performance Council's TPC-C benchmark. TPC-C is one of the
TPC's two OLTP benchmarks. “TPC-C simulates a complete computing
environment where a population of users executes transactions against
a database. The benchmark is centered around the principal activities
(transactions) of an order-entry environment. These transactions
include entering and delivering orders, recording payments, checking
the status of orders, and monitoring the level of stock at the
warehouses. While the benchmark portrays the activity of a wholesale
supplier, TPC-C is not limited to the activity of any particular
business segment, but, rather represents any industry that must
manage, sell, or distribute a product or service.” (see:
TPC-C results are categorized as
Considering Oracle claims Exadata is characterized by “extreme
performance”, one would expect to see Exadata results among the Top
Ten Results by Performance. However, Oracle has no published TPC-C
Exadata results. Looking at All Results, under Oracle, again,
there are no Exadata results. The only Oracle or Sun-related TPC-C
result in almost a decade is for a Sun SPARC Enterprise T5440 Server
Cluster – no Exadata TPC-C results. In contrast, however, IBM has
over 50 published benchmark results.
Sun Microsystems, sold to Oracle last year, rejected the
applicability of TPC-C in representing real OLTP a decade ago. It
championed the development of an alternative OLTP database benchmark,
the TPC-E. “The TPC-E benchmark uses a database to model a
brokerage firm with customers who generate transactions related to
trades, account inquiries, and market research. The brokerage firm in
turn interacts with financial markets to execute orders on behalf of
the customers and updates relevant account information.
The benchmark is “scalable,” meaning that the number of
customers defined for the brokerage firm can be varied to represent
the workloads of different-size businesses. The benchmark defines the
required mix of transactions the benchmark must maintain. The TPC-E
metric is given in transactions per second (tps). It specifically
refers to the number of Trade-Result transactions the server can
sustain over a period of time.” (see:
Oracle has no published TPC-E Exadata results either. IBM has 9
results out of a total of 39 published TPC-E results.
There is no relative measure of Oracle's OLTP claims.
The independent TPC also provides the industry standard Data
Warehousing benchmark, TPC-H. “The TPC Benchmark™H (TPC-H) is a
decision support benchmark. It consists of a suite of business
oriented ad-hoc queries and concurrent data modifications. The
queries and the data populating the database have been chosen to have
broad industry-wide relevance. This benchmark illustrates decision
support systems that examine large volumes of data, execute queries
with a high degree of complexity, and give answers to critical
business questions.” (see: http://tpc.org/tpch/default.asp)
Since the TPC-H benchmark tests Data Warehousing characteristics,
there are multiple database size results, ranging from 100GB to
Oracle has no published Exadata TPC-H results. The latest
Oracle (Sun) result was published over a year ago, and that was for a
single Sun Fire x4600.
Even looking at Oracle Benchmark Results (see:
there is no mention of Exadata benchmark results in
either the Data Warehousing or the Online Transaction Processing
section. There are no Exadata results for Oracle SAP or
Oracle Application Server benchmarks.
The absence of published industry standard database benchmarks
for Oracle's Exadata does not preclude this server and storage
combination from having “extreme performance”; it is just that we
have to take Ellison's word for it!
IBM’s POWER7-based 780 has the current TPC-C world record of
10,366,254 tpmC (see:
IBM holds six of the Top Ten positions. 
 Results current as of August 17, 2010. TPC, TPC Benchmark, TPC-C and
tpmC are trademarks of the Transaction Processing Performance Council.
To claim comparability
between the available IBM POWER7 and Oracle's yet-to-be-released
UltraSPARC T3 (Niagara 3) is like juxtaposing a BMW X6 and a school
bus, respectively. Certainly both vehicles transport people, both are made of
metal and burn hydrocarbon fuel – but this is where the comparison
ends. Interestingly, Oracle claims a school bus analogy for their Chip Multi
Threading (CMT) architecture, saying it represents computing
requirements in today's data center. Oracle says it is more efficient
to transport, say, 40 students in a school bus at one time, although
slowly, than to transport 8 groups of 5 students in an X6 running
back and forth at lightning speed. Unfortunately – we don't have
40 students to transport, but perhaps fewer than 5. A school bus is
an application-specific vehicle, as is Oracle's CMT
application-specific processor architecture.
Oracle's CMT argument also
claims that single, heavy weight thread performance (the BMW X6)
is not as important as the ability to execute multiple,
low-performance threads (the school bus). In contrast, IBM's POWER
and Intel's x86 are designed for general purpose computing
requirements: heavy weight thread processing (ability to execute the
maximum number of instructions/clock) with fast clocks, large
low-latency local caches, branch prediction, and out of order
execution. Today, these general purpose processors also execute many
HW threads simultaneously without having been designed to sacrifice thread execution
quality for thread quantity. One of the few widespread application-specific execution environments demanding the efficient execution
of scores of low-demanding threads is a web server under heavy load.
Another is shuffling around streams of data. UltraSPARC T3-based
systems are good web servers, but are architecturally challenged in heavy processing of that data. Real-life benchmarks speak for
themselves – see my previous blog entry.
Oracle claims that by
doubling the HW thread context count in the UltraSPARC T3 over its
predecessor, the UltraSPARC T2 (Niagara2), overall performance will
double. Any increase in performance could only occur if the
execution environment was thread starved. Conversely, since few
applications spawn scores of threads, executing that application on a
processor that has double the thread contexts of its predecessor
will not provide any more performance. This is similar to designing a
new school bus that now holds 80 students, but is still only
Oracle's UltraSPARC T3 and
IBM's POWER7 are both two billion transistor processors and dissipate
about the same amount of heat. As seen in the UltraSPARC T3 die
photograph just below, the processor has sixteen (1.6GHz) cores, each
holding 8 HW thread contexts, providing a total of 128 HW thread
contexts per socket. However, each core only executes one thread at
any given time, if a thread is actually available for that core.
Literature on this topic tends to be blurred – giving the
impression that at any given time all 128 thread contexts are
executing simultaneously. In fact, each core is so simple that even
branch prediction is non-existent, forcing a thread switch on any
cache miss. Cores communicate with a shared 6MB L2 cache via crossbar
switches. The processor has on-board memory, PCIe, Ethernet, and SMP
coherency controllers. With all pumping at full blast, a theoretical
maximum BW of 2.4Tb/sec is achieved but can only be sustained with a
large number of available threads and full bore I/O running.
Oracle UltraSPARC T3
In contrast, the POWER7
(see die below) has eight cores, each with 4 fully simultaneously
executing threads. The POWER7 can execute twice the number of threads
simultaneously as can the UltraSPARC T3. In order to decrease memory
latency and ensure the cores are fed with instructions and data, the
POWER7 has a huge, on-board 32MB L3 cache feeding eight dedicated
256KB, 8-cycle latency L2 caches, pumping data into 2-cycle latency
32KB L1 data caches. Combined dual memory and SMP coherence
controllers aggregate 2.9 Tb/sec of BW. The POWER7 has as many
floating point units as threads.
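The difference between thread contexts and threads actually executing is easy to lose, so here is the arithmetic as a Python sketch. It takes the description in this entry at face value (one executing thread per UltraSPARC T3 core, four per POWER7 core); the Chip class and its field names are hypothetical labels, not vendor terminology.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Chip:
    """Per-socket thread figures as described in the text above."""
    name: str
    cores: int
    contexts_per_core: int    # HW thread contexts held by each core
    executing_per_core: int   # threads a core executes at any instant

    @property
    def thread_contexts(self) -> int:
        return self.cores * self.contexts_per_core

    @property
    def executing_threads(self) -> int:
        return self.cores * self.executing_per_core


t3 = Chip("UltraSPARC T3", cores=16, contexts_per_core=8, executing_per_core=1)
power7 = Chip("POWER7", cores=8, contexts_per_core=4, executing_per_core=4)

for chip in (t3, power7):
    print(f"{chip.name}: {chip.thread_contexts} contexts, "
          f"{chip.executing_threads} executing at once")

ratio = power7.executing_threads / t3.executing_threads
print(f"POWER7 executes {ratio:.0f}x as many threads simultaneously")
```

The 128 figure counts contexts a socket can hold, not instructions streams in flight at any one moment.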
IBM's POWER7 is the latest
product in a successful road map of general purpose processors
designed with the horsepower to pound through heavy-weight,
compute-intensive tasks at nearly 4GHz.
It is worth noting that
Oracle's UltraSPARC T3 is curiously missing from this month's Hot
Chips Conference agenda (see:
even though its general availability is set for later this year. At
least two IBM POWER-related sessions are scheduled at Hot Chips.
On August 17, 2010, IBM continues its roll-out of new POWER7-based systems,
software, and solutions. Register for webcasts: http://www-03.ibm.com/systems/power/advantages/
In HP: Last Itanium man Standing;
Nehalem Lives the Dream
Timothy Prickett Morgan wrote, “Without the threat of Itanium,
which was never really fulfilled, perhaps IBM would have never
knuckled down and put some money into decent Power chip development,
which allowed the company to go from joke to dominance in the Unix
server racket.” While the title of his article may be somewhat
accurate, this and some of its other claims need to be addressed.
The Itanium was developed to address
what was perceived at the time to be a classic “technology wall”.
From the late 1980s into mid 1990s, some CPU architects assumed that
RISC processor technology was running out of intrinsic processing
capability. (Ironically, RISC was developed to address an earlier
“technology wall”.) It appeared there was more instruction level
parallelism (ILP) pumped out by RISC compilers than could be processed
by 1980s-vintage RISC processors, leaving on the table mutually
exclusive operations that could be otherwise executed. It was
incorrectly assumed that RISC technology could not mature. It indeed
did mature, using capabilities such as instruction pipelining,
superscalar execution, out of order execution, register renaming, speculative
execution, advanced branching algorithms, etc. These capabilities,
combined with ever-advancing fabrication technology, allowed RISC
processors to address and exceed all their earlier perceived
limitations. CISC processors have also adopted such capabilities.
The Itanium's Explicitly Parallel
Instruction Computing (EPIC) architecture used Very Long Instruction
Word (VLIW -- the same architecture used in the Elbrus, the last
Soviet supercomputer of the late 1980s) technology and included many
more execution units than could ever be effectively used. The Itanium
has so many execution units (both integer and floating point) that it was actually designed to execute instructions down both
sides of a conditional branch simultaneously while the branch
condition was being evaluated. A 33% longer instruction word forced
the adoption of larger caches, created code bloat, and mandated
higher data bandwidths, all relative to RISC processors.
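One way to arrive at a figure of about 33% is sketched below in Python. It assumes the comparison is between the IA-64 bundle format (three 41-bit instruction slots plus a 5-bit template packed into a 128-bit bundle) and a fixed 32-bit RISC instruction. That reading is an assumption here; the text does not say how the number was derived.

```python
# IA-64 packs instructions into 128-bit bundles.
BUNDLE_BITS = 128
SLOTS_PER_BUNDLE = 3
SLOT_BITS = 41
TEMPLATE_BITS = 5

# Assumed baseline: a classic fixed-width 32-bit RISC instruction.
RISC_INSTRUCTION_BITS = 32

# The slots and the template together fill the bundle exactly.
assert SLOTS_PER_BUNDLE * SLOT_BITS + TEMPLATE_BITS == BUNDLE_BITS

bits_per_instruction = BUNDLE_BITS / SLOTS_PER_BUNDLE
overhead_pct = (bits_per_instruction / RISC_INSTRUCTION_BITS - 1) * 100

print(f"Effective bits per Itanium instruction: {bits_per_instruction:.1f}")
print(f"Longer than a 32-bit RISC instruction by: {overhead_pct:.0f}%")
```

This counts only encoding width. Any padding a compiler inserts when it cannot fill all three slots of a bundle would add to the code size on top of this.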
Intel and HP actually overshot the
ability of compilers to extract enough ILP for the Itanium to execute
code effectively. There is only so much ILP inherent in human-scribed
source code. Because of the complexity of Intel's EPIC architecture,
it could not be run at a state-of-the-art clock. The Merced (the
first Itanium) was introduced at 733 MHz in 2001, when most
state-of-the-art processors were running at 1 GHz or more. Almost a decade
after the Merced was released, today’s Itanium operates at only 1.7
GHz. This is in stark contrast with IBM's POWER having hit 5 GHz
several years ago. Had the Merced been available in the late 1990s,
things might have been radically different.
Interestingly enough, today's Itanium
uses hyperthreading (two simultaneous threads/core) to help utilize
empty execution units. Even so, Itanium's benchmarks are on the order
of 0.5X relative to the Nehalem EX. On the RISC side, current benchmarks
of IBM's POWER7, with eight cores and four threads/core, demonstrate a 4X
performance capability over the Itanium.
To even hint that the Itanium was the
motivating force behind IBM's “joke to dominance” in the RISC
UNIX marketplace ignores advances in process technology, intelligent
architecting, and the ability of IBM to model and simulate the
dynamics of code compilation and execution. In the very late 1990s
Sun Microsystems spent millions of dollars developing a VLIW processor
similar to the Itanium. That chip was called the MAJC. It was so poor
in performance that Solaris wasn't even ported to it. The MAJC ended
up a high-end graphics processor, eventually being dropped as another
Morgan also suggests, “And without Intel
relegating 64-bit processing to Itaniums and leaving Xeons to
32-bits, there would not have been a gap in which Advanced Micro
Devices could leap and create the Opterons, which are the inspiration
for the Nehalem family of processors that have put Intel back in the
driver's seat when it comes to server CPUs .”
Intel clearly learned lessons from the
Itanium as seen in such places as microcode fusing on the Nehalem,
but high volume 64-bit computing was on its way with or without the
existence or demise of the Itanium. Intel clearly hiccupped staying
with very long pipeline Xeons with AMD's Opteron and 64-bits filling
To claim that the Opteron “inspired” the
Nehalem as the Itanium enters the coffin is a rather wide leap. The
Itanium was simply too little too late and should be afforded a proper
burial. Of more interest would be to imagine a 3.5 GHz HP PA-RISC --
a processor the Itanium really did eliminate.
A clear classification
of processing capabilities is beginning to take shape across CPU vendors, at
least in the enterprise space. Sometimes comments in the blogosphere tend to
overshadow simple benchmark comparisons. We are not discussing some exotic
benchmark run on a hot box, or world record benchmarks claimed in some edge case
thanks to an application-specific characteristic of a processor. Look
at what an established benchmark suite, SPEC CPU2006 (SPECint_rate2006 and SPECfp_rate2006; see www.spec.org), tells us as of April 13, 2010
about the clear performance classes forming among enterprise processors
from different vendors:
* IBM POWER7, at more than 2x overall processing capability relative to its nearest rival,
* Intel Nehalem EX, at about 2x the performance of Itanium, Sun, and Fujitsu, and
* Itanium, Sun UltraT, and Fujitsu SPARC64.
Some argue that one should not look at simply one benchmark or suite to estimate
performance. Note, however, that the SPEC CPU benchmark is a very good initial indicator
of overall system performance, centered on the CPU, its caches, and its
interconnects, looking outward.
This three-horse race is very significant for many reasons. Market pressure will
build on the lowest performance class, forcing either price cuts or
risky design and fabrication activity in an attempt to make up for the
performance shortfall. In the case of Itanium, one would expect to see it put out
to pasture soon, as even Windows has dropped support for most of its products. Itanium’s
architecture tried unsuccessfully to address the issue of not enough execution
capability in RISC architectures, an issue subsequently solved by several generations
of RISC processors. Sun (Oracle) is attempting to modify its current
application-specific (highly threaded, low ILP) UltraSPARC-T
architecture, which only runs well in lightweight, highly threaded applications
such as web servers or simple OLTP. Sun assumed thread level parallelism would
supplant instruction level parallelism and has had five years to prove it. We
are still waiting. Fujitsu’s current generation of SPARC64 processors was
built on what remained of Amdahl’s S/390 clone processor. The rest is history
and solid state physics.
There appear to be two distinct leaders in the CPU race: IBM POWER and Intel’s x86, both
pulling away from the pack and from each other.
Paul Venezia's InfoWorld exposé [see: http://www.infoworld.com/t/mergers-and-acquisitions/oracle-customers-sun-sun-who-751] details what is in store for Sun's current customer base after the Oracle acquisition. Oracle's CEO, Larry Ellison, claims Oracle+Sun will form the basis for Oracle to emulate the IBM of the 1960s. This is a clear endorsement of IBM today, yet it disregards that the market and technology realities in question existed fifty years ago, not today.
In the meantime, Oracle is engaging in tactics that will drive the last nail in the coffin for even SPARC/Solaris zealots. It was bad enough that Sun+Texas Instruments could not meet market performance windows for Sun's native UltraSPARC line of processors, leaving Sun SPARC customers having to choose between application specific UltraSPARC-Tx, Niagara-class based products or SPARC64 based servers from Fujitsu. Now, Oracle is increasing the service and maintenance costs of aging Sun products -- forcing Sun customers to upgrade. Upgrade to what?
Since the introduction of IBM's POWER5 in 2004, Sun's IBM sales playbook was predicated on the following golden rules:
* IBM will force you to upgrade your systems every two years, while at Sun we design our systems for a five year refresh cycle. [Translation: Sun needs to create a plausible excuse in face of IBM's technology and product development cycle being 2x faster than that of Sun.]
* IBM makes servers that are benchmark machines, while Sun servers are balanced systems. [Translation: Need non sequitur FUD in light of IBM's superior performance.]
* When IBM brings GBS into your data center, System Admins, Systems Programmers, DBAs, etc., will lose their jobs. [Translation: Make it personal, do whatever you can to stay with Sun and keep away from IBM -- or be laid off!]
Oracle, the company that wants to emulate IBM, is now violating Sun's own stated golden rules. This is not surprising. For the past decade Sun has claimed that the TPC-C OLTP industry-standard database benchmark is archaic and does not represent any aspect of the modern data center, and that, besides, when it is run on a cluster one can simply add servers and storage until the desired numerical result is achieved. This was convenient for Sun since it had poor results on this benchmark. Then Oracle came along last fall, using a huge cluster of Sun's Niagara-based application-specific processors, and claimed a world record TPC-C benchmark result.
If I were a Sun customer, I would be seriously considering transitioning away from another decade of techno-deception.
As IBM announces the POWER7 today, Sun-Oracle will be telling their fleeing customer base that IBM finally, albeit after five years, validates Sun’s Chip Multi Threading (CMT) processor architecture.
Sun-Oracle will attempt to equate their five-year-old CMT processor, the UltraSPARC T1 (Niagara-1), with 8 cores and 4 threads per core, with today’s IBM POWER7. Indeed, the IBM POWER7 has 8 cores and 4 threads per core, but that is where the numerical similarities end and Sun-generated FUD (Fear, Uncertainty, and Doubt) begins.
Sun-Oracle will argue that not only does it market a second-generation CMT processor, the UltraSPARC T2 (Niagara-II), with 8 cores and 8 threads per core, but that its next incarnation will have 16 cores with similar threading. If one were to treat the number of cores on a piece of silicon as the test of greatness, one might note, among the multitudes, Intel’s 16-core, multi-threaded network processors, the IXP2400/2800, available well before Sun’s CMT, or Cavium's Octeon encryption processor with 16 MIPS64 cores. Alas, Sun would say that both these processors, and many others like them, are application specific.
However, Sun-Oracle’s current CMT processors are also application specific. They only perform well in thread-rich environments. Such environments are typically web servers and lightweight databases where strong single thread performance is not necessary. One need only note the types of benchmarks Sun publishes - and those it does not - for confirmation of Sun’s CMT application specific resonance.
Sun attempted to design, and Texas Instruments to manufacture, a CMT processor that would address both thread-rich and heavy single-threaded execution requirements. Internally it was called the ROCK processor. That project ended in failure. Its chief architect left Sun and joined Microsoft last year. The reason Sun attempted to design such a CMT processor is that many of today’s applications still require swift execution of heavy single threads. Sun’s available CMT processors are so poor at executing single-threaded code that Sun doesn’t even publish industry-standard single-core benchmarks for them. POWER7’s published industry-standard benchmarks speak for themselves.
Both IBM and Intel could have easily designed, manufactured, and marketed processors that were both highly multi-cored and multi-threaded but would have done so by sacrificing the execution quality on a vast array of standard single-threaded data center applications. In contrast, it was not necessary for IBM to sacrifice the execution quality of existing data center applications. IBM evolved its multi-core RISC architecture beginning with its dual core POWER4 in 2001 to today’s 8-core POWER7 with a continual positive impact on data center execution quality and price-performance.
There is no better indication of how divergent multi-core and multi-threaded processor architecture and performance can be than to note that DARPA selected IBM’s POWER7 for its Supercomputing Grand Challenge (see: http://www-03.ibm.com/press/us/en/pressrelease/20671.wss). Sun was dropped from the competition, based on the broken promise of the ROCK processor by 2010. (See: http://m.channelregister.co.uk/2006/11/21/darpa_petascale/). Today’s announced POWER7 is part of DARPA’s Petascale Challenge, not Sun’s non-existent ROCK and certainly not its little brother, Sun’s UltraSPARC T2 processor.