In the article EMC kills SPEC benchmark with all-flash VNX, Chris Mellor calls this a “watershed benchmark” and continues, “The previous top SPECsfs2008 NFS v3 score was 403,326 ops/sec from an IBM SONAS (Scale-Out NAS). EMC's result was 497,623 ops/sec.”
“SPECsfs2008 is the latest version of the Standard Performance Evaluation Corporation benchmark suite measuring file server throughput and response time, providing a standardized method for comparing performance across different vendor platforms. SPECsfs2008 results summarize the server's capabilities with respect to the number of operations that can be handled per second, as well as the overall latency of the server.”
Clearly, EMC's results are better than IBM's published results.
However, without getting into minutiae, compare the basic storage technology used by each vendor: almost all (93%) of EMC's drives are solid-state disks (SSDs), while all of IBM's storage uses 15K rpm hard disks. The advantages of SSDs are well known, and SSD is certainly an acceptable storage technology for this benchmark. It should be noted, however, that SSD technology provides from one to approximately two orders of magnitude better random I/O ops/sec performance than 15K rpm drives, yet EMC reported only a slight improvement over IBM's result in this benchmark. Is cost perhaps the reason?
The cost difference between 200GB SAS Flash drives and 450-600GB 15K SAS drives spans a wide range of 5-30X. The performance capability of SSDs in this benchmark allowed EMC to use about one quarter the number of drives IBM used. Even so, the dollar cost of actual storage per SPECsfs2008_nfs.v3 op/sec appears significantly higher for EMC's result than for IBM's. It is not clear why a customer would spend so much extra for EMC's SSDs rather than standard high-performance spindles for a 23% performance advantage. It almost appears as though EMC simply wanted a benchmark result slightly higher than IBM's.
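To put rough numbers on that intuition, here is a back-of-the-envelope sketch. The drive-count ratio and the 15X price premium are illustrative assumptions pulled from the ranges above, not figures from either disclosure:

```python
# Back-of-the-envelope comparison of storage cost per SPECsfs2008 op/sec.
# The drive ratio and price premium are illustrative assumptions; only the
# ops/sec results come from the published disclosures.

emc_ops = 497_623   # EMC all-flash VNX result
ibm_ops = 403_326   # IBM SONAS result

drive_ratio = 0.25  # EMC used roughly 1/4 the drives IBM did (per the text)
price_ratio = 15    # assume SSDs cost ~15X per drive (middle of the 5-30X range)

# Relative storage cost of the EMC configuration versus IBM's, per op/sec delivered
relative_cost = (drive_ratio * price_ratio) / (emc_ops / ibm_ops)
print(f"EMC storage cost per op/sec is ~{relative_cost:.1f}X IBM's")  # ~3.0X
```

Under these assumptions EMC paid roughly three times as many storage dollars per op/sec, which is the crux of the argument above.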
Today, 2/14/2011, the first of three Jeopardy! sessions between the top two Jeopardy! champions and IBM Watson will air on national TV. As each question is asked, a lot will be taking place, and many of us will be wondering what is happening inside IBM Watson. Just what is going on?
While IBM Watson's entire execution infrastructure has not been published, we do know that each compute element consists of a commercially available IBM POWER 750 server. The entire interconnected cluster looks like a set of library shelves.
Many of us will wonder, each time a question is asked, what is going on during the three seconds given to the contestants. As humans, we can more or less understand being a Jeopardy! contestant. Many people will invariably not know the answer within three seconds but will retort after the correct response is revealed, "Oh, I knew that!" Watson is not doing that, although it has been reported that IBM Watson has a good idea of the types of questions and answers that have previously been asked on Jeopardy! In contrast, Watson's POWER7 processors are pumping through 15 TB of data (equivalent to about 200 million pages of text) at a rate of 500 GB/s each, concurrently. But first, Watson has to understand the question. It has to determine verbs, nouns, objects and, moreover, nuances of the English language not generally part of the standard English 101 class. Next, Watson must look for the best answer. What might be the basic applications used to accomplish this massive task?
It has been reported that Watson runs on Linux, and that DeepQA (Watson's software application stack) uses Hadoop and UIMA. UIMA stands for Unstructured Information Management Architecture; according to Wikipedia, "UIMA is a component software architecture for the development, discovery, composition, and deployment of multi-modal analytics for the analysis of unstructured information and its integration with search technologies developed by IBM. The source code for a reference implementation of this framework has been made available on SourceForge, and later on the Apache Software Foundation website." In other words, it is a framework that intelligently digests and correlates information that otherwise appears amorphous.
Upon reviewing my previous blog entry, https://www-950.ibm.com/blogs/davidian/entry/what_runs_watson_and_why16?lang=en_us, IBM Watson has 4 TB of storage but 16 TB of system-wide memory. Such an architecture suggests an in-memory database, or at least in-memory data structures. Indeed, Watson uses Apache's Hadoop framework to facilitate preprocessing the large volume of data in order to create in-memory datasets. To provide effective CPU scheduling, the file system includes location awareness: the physical location of each node, rack, and network switch. Hadoop applications can use this information to schedule work on the node where the data is and, failing that, on the same rack or switch, reducing backbone traffic. The Hadoop file system uses this same information when replicating data, trying to keep different copies of the data on different racks.
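To illustrate the locality preference described above, here is a minimal sketch of rack-aware task placement. This is not Hadoop's actual scheduler code; the topology, node names, and function are hypothetical stand-ins for the concept:

```python
# Simplified illustration of Hadoop-style locality-aware scheduling:
# prefer the node holding the data, then any node on the same rack,
# then anywhere in the cluster. Topology and block locations are hypothetical.

topology = {  # node -> rack
    "node1": "rack A", "node2": "rack A",
    "node3": "rack B", "node4": "rack B",
}
block_replicas = ["node1", "node3"]  # nodes holding copies of the data block

def pick_node(free_nodes):
    """Choose where to run a task that reads block_replicas."""
    # 1. Node-local: the data already lives on a free node
    for node in free_nodes:
        if node in block_replicas:
            return node, "node-local"
    # 2. Rack-local: a free node shares a rack with a replica
    replica_racks = {topology[n] for n in block_replicas}
    for node in free_nodes:
        if topology[node] in replica_racks:
            return node, "rack-local"
    # 3. Off-rack: the data must cross the backbone
    return free_nodes[0], "off-rack"

print(pick_node(["node2", "node4"]))  # ('node2', 'rack-local')
```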
"Watson’s DeepQA UIMA annotators were deployed as mappers in the Hadoop map-reduce framework, which distributed them across processors in the cluster. Hadoop contributes to optimal CPU utilization and also provides convenient tools for deploying, managing, and monitoring the data "analysis process." For more information see: http://www.ibm.com/common/ssi/cgi-bin/ssialias?infotype=SA&subtype=WH&htmlfid=POW03061USEN&attachment=POW03061USEN.PDF&appname=STGE_PO_PO_USEN_WH
When watching Jeopardy! tonight, keep in mind that for every question, IBM Watson has to, at a minimum, within 3 seconds (a conceptual sketch follows the list):
- Take the stated question and parse its components
- Determine relationships between grammatical elements
- Create items that it must look for or relationships that may expand its search
- Have Hadoop dispatch work to access information that UIMA has intelligently digested and annotated
- Process through terabytes of data on its 2880 POWER7 cores, looking for the best set of results
- Have DeepQA determine what it considers the best response, and
- Press a mechanical button, as do the human contestants, and express the answer in English.
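Purely as a conceptual sketch of that flow, with every function name a hypothetical stand-in rather than IBM's actual DeepQA API:

```python
# Conceptual skeleton of the per-question flow listed above. All names are
# hypothetical illustrations; the stub logic stands in for enormously
# complex NLP, search, and ranking stages.

BUZZ_THRESHOLD = 0.5

def parse_question(clue):          # grammatical decomposition
    return clue.lower().split()

def generate_queries(tokens):      # items to look for, expanded relationships
    return [t for t in tokens if len(t) > 4]

def map_reduce_search(queries):    # Hadoop-style fan-out over UIMA annotations
    return [(q, 1.0 / (i + 1)) for i, q in enumerate(queries)]

def best_response(candidates):     # DeepQA-style confidence ranking
    return max(candidates, key=lambda c: c[1])

def answer_clue(clue):
    candidates = map_reduce_search(generate_queries(parse_question(clue)))
    answer, confidence = best_response(candidates)
    return answer if confidence > BUZZ_THRESHOLD else None  # buzz only when confident

print(answer_clue("This computer played Jeopardy! in 2011"))  # 'computer'
```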
Let the best "man" win!
IBM Watson, http://www-03.ibm.com/innovation/us/watson/index.shtml, the computer that will compete against the top two Jeopardy! champions on February 14-16, 2011, is constructed using a commercially available computing platform from IBM. IBM Watson is a massively parallel system based on the POWER7-based IBM Power 750 in a standard rack-mounted configuration.
The IBM Power 750, featuring IBM's POWER7 processor, is a server that runs AIX, IBM i, and Linux, and has been on the market since February 2010. This is the same unit that has been described at http://www-03.ibm.com/systems/power/hardware/750/index.html for about a year.
IBM Watson itself looks like what could be described as a set of books on a bookshelf.
IBM Watson comprises ninety IBM POWER 750 servers, 16 terabytes of memory, and 4 terabytes of clustered storage, enclosed in ten racks including the servers, networking, shared disk system, and cluster controllers. Each of the ninety POWER 750 servers has four POWER7 processors, each with eight cores, giving IBM Watson a total of 2880 POWER7 cores.
Watson runs IBM DeepQA software, http://www.research.ibm.com/deepqa/deepqa.shtml, which scales out across the cluster and searches vast amounts of unstructured information. Effective execution of this software, corresponding to a less-than-three-second response time to a Jeopardy! question, is not based on raw execution power alone. Effective system throughput includes having data available to crunch on. Without an efficient memory subsystem, no amount of compute power will yield effective results. A balanced design comprises main memory, several levels of local cache, and execution power. IBM's POWER 750's scalable design is capable of filling execution pipelines with instructions and data, keeping all the POWER7 processor cores busy. At 3.55 GHz, each of Watson's POWER7 processors has an on-chip bandwidth of 500 gigabytes per second. The total on-chip bandwidth for Watson's 360 POWER7 processors is an astounding 180,000 gigabytes per second! It is no accident that IBM POWER7-based technology serves as the basic hardware building block for IBM Watson.
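The configuration arithmetic above is easy to verify; a quick sketch using only the figures already cited:

```python
# Sanity-checking the Watson configuration math from the text above.
servers = 90            # IBM Power 750 servers
chips_per_server = 4    # POWER7 processors per server
cores_per_chip = 8
per_chip_bw_gb_s = 500  # on-chip bandwidth, GB/s, at 3.55 GHz

chips = servers * chips_per_server   # 360 POWER7 processors
cores = chips * cores_per_chip       # 2880 cores

print(cores)                     # 2880
print(chips * per_chip_bw_gb_s)  # 180000 GB/s aggregate on-chip bandwidth
```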
If there were an industry standard performance benchmark for playing Jeopardy!, such as
specJeopardy!2011, there would be only one published result.
Watch an in-depth discussion of IBM Watson on your local PBS channel on February 9, 2011.
Watson's human-like artificial intelligence beat both rival Jeopardy! champions in a dry run, as reported on the trade sites on January 14, 2011. The Canadian Broadcasting Corporation, in Computer beats Jeopardy! champs, reported, “Later, the human contestants made jokes about the Terminator movies and robots from the future.”
Timothy Prickett Morgan, in Watson beats humans in Jeopardy! dry run, referred to IBM Watson's avatar as the evil Skynet. In that article, Morgan noted that Watson is a Linux cluster of IBM POWER7-based p750 servers. For a pundit who suggested the Watson Jeopardy! event is perhaps a veiled marketing ploy, he gratuitously added, “Watson QA software is running on 10 racks of these machines, which have a total of 2,880 Power7 cores and 15 TB of main memory spread across this system. The Watson QA system is not linked to any external data sources, but has a database of around 200 million pages of "natural language content," which IBM says is roughly equivalent to the data stored in 1 million books.”
Several reports stated that Watson has some issues with language ambiguity. Perhaps this makes Watson more human-like than we think, as interpreting ambiguity in speech is generally a learned ability.
Myth has it that the name of the Discovery One spacecraft's HAL 9000 computer in 2001: A Space Odyssey is a one-letter shift from the letters IBM, as in IBM 9000. I suspect we should start worrying when the next generation of IBM lip-reading, human-like technology argues with us, as in the classic “open the pod bay doors”... http://www.youtube.com/watch?v=kkyUMmNl4hk
So, who cares? What difference does it make to me? I don't even watch Jeopardy! While what IBM is pursuing may be somewhat dismissed as veiled, gratuitous public relations by some pundits, this human-like intelligence is a demonstration of what will permeate our lives well within a generation. The research and development prowess that IBM will demonstrate on nationwide TV, regardless of whether it “wins”, represents the type of gating technology that will be the progenitor of an industry that doesn't even exist yet: human assistants ranging from intelligent prosthetics to nannies to soldiers and home robots. Sound wild? A generation ago, who would have thought there would be multiple computing platforms in the home?
The types of technological challenges that have to be overcome for the realization of a “home robot market” include: extracting multiple simultaneous emotions from enhanced speech and facial recognition, natural language interfaces, cognitive abilities, symbolic interpretation of live vision objects, tactile grasping, and near-instantaneous database access or some digital neural equivalent, among others.
What was learned from IBM Deep Blue's victory over Grandmaster Garry Kasparov in 1997 was an early step. Today's ability to take on the best of Jeopardy! allows us to learn and define the technological hurdles that must be overcome to usher in the next revolution in computing. IBM taking on the best of Jeopardy! is important to anybody who has an interest in their high-technology career over the next twenty years. New markets and industries will be created that are unimaginable today.
When Sun Microsystems' native SPARC processors were sucking wind, Sun marketing began talking down single-threaded, high-clocked, large-and-fast-cache execution environments in favor of a mythical transformation of most applications into thread-rich execution environments. Sun made the term Thread Level Parallelism [TLP] prolific. Now that Oracle has purchased Sun, we read that single-threaded, high-clock-rate execution is being demanded by Oracle applications. Changing horses twice mid-stream does not impress data center managers.
Timothy Prickett Morgan noted, “Oracle has been promising a 3X improvement in "single strand" performance, which everyone [takes to] mean clock speed,” and that “...Oracle might be overclocking the Sparc chips to reach the 5 GHz stratosphere of chip clock speeds.” While this might not be the case, the question we need to be asking Oracle (and remember, Oracle doesn't answer questions) is: if not, then how?
During the 2000s, Sun's customers were expecting explanations for its traditional UltraSPARC processors' lack of performance. In reality, Sun, via Texas Instruments [TI], was not able to successfully fabricate traditional high-clocked, large-cache, state-of-the-art processors. Traditional processors, such as IBM POWER or Intel x86, were designed to maximize Instruction Level Parallelism [ILP] with fast single-thread execution.
In mid-2002, Sun purchased Afara, the firm that designed processors with slow clocks and simple cores able to maximize the execution of many threads. TI was able to fabricate these processors with simple cores and small caches, placing identical copies on a single die. This created Sun's Niagara processor line, known today as the UltraSPARC T1, T2, T3, etc. Sun began its CMT marketing campaign claiming that processor clocks had reached an asymptote and that memory performance was scaling at one third the rate of processor clocks, condemning traditional execution to the dust bin of history. Sun's CMT technology was purported to save the data center, and to do so with low heat dissipation per thread. Sun's argument was that ILP had reached the end of the line, processor clocking had reached the point of creating unimaginable power densities, and memory technology was never going to catch up. This contrarian CMT market hype was taking place as IBM's POWER4, the first commercial general-purpose multi-core processor, was setting performance records and Intel's Xeons were approaching 4 GHz.
IBM's POWER6 hit 5 GHz several years ago, and today's IBM System z (mainframe) processors run at 5.2 GHz. What Sun proclaimed as a semiconductor technology wall was torn down with clever designs by IBM, Intel, and AMD. Sun sacrificed single-thread performance as the cost of keeping a processor line alive, and paid the price as it lost market share. IBM and Intel today have multi-core processors running multiple simultaneous threads, never having had to sacrifice single-thread performance in the interim.
As we enter this decade, it appears that Sun+Oracle plans on cranking up the clocks on their CMT processors while keeping the core count constant. In addition, Sun+Oracle appears to be adopting the capability to dynamically alter the number of threads per core, allowing more of the CPU core to execute the thread (contrary to its CMT market hype) and enabling more cache per thread! Sound familiar? It should, considering IBM introduced this earlier last year, calling it Intelligent Threading. Sun+Oracle has basically contradicted nearly all of Sun's CMT marketing hype. Few of the original CMT justification presentations remain on the web outside of oracle.com; most of Sun's CMT processor presentations seem to have been excised. A June 2005 blog that is still active at Oracle, http://blogs.sun.com/esaxe/entry/cmt_performance_enhancements_in_solaris, states:
“Rather than butting heads with the laws of physics in an attempt to quickly burn through a single instruction stream (stumbling and stalling along the way), CMT processors do more by allowing multiple threads to execute in parallel.”
It wasn't that Sun's processors failed to meet performance expectations because of the laws of physics. Rather, Sun failed to meet the challenge of designing and fabricating processors given the limits of solid-state physics. It appears that Sun+Oracle is playing catch-up again against IBM and Intel, neither of which waited around for the “laws of physics” to ease.
For almost the entire decade following Y2K, Sun Microsystems claimed the TPC-C benchmark was irrelevant, not representative of the modern data center, and, moreover, unusable for sizing. Subsequently, Sun didn't publish any TPC-C results. This benchmark alienation came just after Sun claimed its final world-record E10K TPC-C results with UltraSPARC-II processors, and just before Sun introduced the UltraSPARC-III, circa 2001. These actions were not accidents, and neither is Oracle+Sun's recent claim of a TPC-C result of 30,249,688 tpmC.
The UltraSPARC-III had a blocking L1 cache, designed to optimize SPEC CPU95 benchmark execution. The UltraSPARC-III was late enough that SPEC CPU95 had been retired and replaced by SPEC CPU2000, which had a larger footprint and a different execution pattern than its predecessor. Throughout the last decade, Sun's UltraSPARC processors were plagued by poor single-processor industry-standard benchmark results. For Sun, publishing any TPC-C results would have been very embarrassing (I know, as a member of Sun's benchmark council). When industry-standard benchmark results were good, Sun would publish them. When results turned out poor, the benchmark was attacked. When results became good “again”, they were published, as was done by Oracle+Sun on December 2, 2010.
While the TPC-C benchmark could be characterized by light-weight thread processing representing “... the principal activities (transactions) of an order-entry environment. These transactions include entering and delivering orders, recording payments, checking the status of orders, and monitoring the level of stock at the warehouses” (see: http://www.tpc.org/tpcc/default.asp), this benchmark does provide a relative measure of a system's ability to move data, with processing capability secondary (the handful of SQL statements is rather trivial). Rapid data movement with low-quality processing is a forte of Sun's T1, T2, T3, and T4 processors. Interestingly, it was only after Oracle purchased Sun that TPC-C benchmarks on Sun SPARC were published again. It was known as far back as 2005 that the UltraSPARC T1 generated relatively good TPC-C results, but because the TPC-C benchmark had been deemed worthless, Sun could not publish them lest it be called on the carpet for blatant duplicity. Oracle must think today's customers have no medium-term memory, a poor assumption for a database software company.
TPC-C results come in two flavors: single or clustered. A single result represents the capability of a single server with its storage. A clustered result approximates the cumulative sum of all the machines in the cluster. The larger the cluster, the better the result. Of course, clustering like this has its mechanical and networking asymptotes, but generally you can pick a desired tpmC and then cluster servers and storage until that result is achieved. Sun made this argument a decade ago as a reason to avoid TPC-C clustered results. In fact, Sun used to claim that IBM and others had to cluster their servers to get even publishable results.
TPC-C results can nevertheless be used for certain comparisons. For example: the latest Sun+Oracle TPC-C result was achieved using a cluster of twenty-seven servers with 1726 SPARC processor cores. They then compared that result with the best IBM result, which comes from a cluster of three p780 servers with 192 POWER7 cores. Sun+Oracle has a 3X better result than IBM, with 9X the cores and 9X the servers. The quotient is left as an exercise for the reader!
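Working that quotient, using only the figures cited above and taking IBM's result as one third of Oracle's per the 3X claim (an approximation, not the published tpmC):

```python
# The "exercise for the reader", using only the figures quoted in the text.
oracle_tpmc, oracle_cores = 30_249_688, 1726   # 27-server SPARC cluster
ibm_ratio, ibm_cores = 1 / 3, 192              # IBM ~1/3 of Oracle's result, 3 p780 servers

oracle_per_core = oracle_tpmc / oracle_cores
ibm_per_core = (oracle_tpmc * ibm_ratio) / ibm_cores

print(f"Oracle: {oracle_per_core:,.0f} tpmC/core")  # ~17,526
print(f"IBM:    {ibm_per_core:,.0f} tpmC/core")     # ~52,517
print(f"IBM per-core advantage: ~{ibm_per_core / oracle_per_core:.0f}X")  # ~3X
```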
Given this heritage of duplicity, note the title of another blog on the same blogs.sun.com site where Oracle's latest TPC-C claim was made. What to believe is up to the imagination of the reader!
To claim comparability between the available IBM POWER7 and Oracle's yet-to-be-released UltraSPARC T3 (Niagara 3) is like juxtaposing a BMW X6 and a school bus, respectively. Certainly both vehicles transport people, and both are made of metal and burn hydrocarbon fuel, but this is where the comparison ends. Interestingly, Oracle embraces the school bus analogy for its Chip Multi Threading (CMT) architecture, saying it represents computing requirements in today's data center. Oracle says it is more efficient to transport, say, 40 students in a school bus at one time, although slowly, than to transport 8 groups of 5 students in an X6 running back and forth at lightning speed. Unfortunately, we don't have 40 students to transport, but perhaps fewer than 5. A school bus is an application-specific vehicle, as is Oracle's CMT application-specific processor architecture.
Oracle's CMT argument also claims that single, heavyweight-thread performance (the BMW X6) is not as important as the ability to execute multiple low-performance threads (the school bus). In contrast, IBM's POWER and Intel's x86 are designed for general-purpose computing requirements: heavyweight thread processing (the ability to execute the maximum number of instructions per clock) with fast clocks, large low-latency local caches, branch prediction, and out-of-order execution. Today, these general-purpose processors also execute many hardware threads simultaneously, without having been designed to sacrifice thread execution quality for thread quantity. One of the few widespread application-specific execution environments demanding the efficient execution of scores of low-demand threads is a web server under heavy load. Another is shuffling around streams of data. UltraSPARC T3-based systems are good web servers, but they are architecturally challenged when it comes to heavy processing of that data. Real-life benchmarks speak for themselves; see my previous blog entry.
Oracle claims that by doubling the hardware thread context count in the UltraSPARC T3 over its predecessor, the UltraSPARC T2 (Niagara 2), overall performance will double. Any increase in performance could only occur if the execution environment were thread-starved. Since few applications spawn scores of threads, executing an application on a processor that has double the thread contexts of its predecessor will not provide any more performance. This is similar to designing a new school bus that holds 80 students when there is still only a handful of students to transport.
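A toy model makes the thread-starvation point concrete; the thread counts and rates below are illustrative, not measurements:

```python
# Toy model of the thread-starvation argument: doubling hardware thread
# contexts only helps when more runnable threads exist than contexts.
# All numbers are illustrative.

def throughput(runnable_threads, hw_contexts, per_thread_rate=1.0):
    return min(runnable_threads, hw_contexts) * per_thread_rate

app_threads = 5  # a typical application, far from "scores" of threads
print(throughput(app_threads, hw_contexts=64))   # 5.0   (T2-class context count)
print(throughput(app_threads, hw_contexts=128))  # 5.0   (T3 doubles contexts: no gain)
print(throughput(200, hw_contexts=128))          # 128.0 (only a thread-rich load benefits)
```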
Oracle's UltraSPARC T3 and IBM's POWER7 are both two-billion-transistor processors and dissipate about the same amount of heat. As seen in the UltraSPARC T3 die photograph just below, the processor has sixteen (1.6 GHz) cores, each holding 8 hardware thread contexts, providing a total of 128 hardware thread contexts per socket. However, each core only executes one thread at any given time, if a thread is actually available for that core. Literature on this topic tends to be blurred, giving the impression that at any given time all 128 thread contexts are executing simultaneously. In fact, each core is so simple that even branch prediction is non-existent, forcing a thread switch on any cache miss. Cores communicate with a shared 6MB L2 cache via crossbar switches. The processor has on-board memory, PCIe, Ethernet, and SMP coherency controllers. With all of them pumping at full blast, a theoretical maximum bandwidth of 2.4 Tb/sec is achieved, but it can only be sustained with a large number of available threads and full-bore I/O running.
Oracle UltraSPARC T3
In contrast, the POWER7 (see die below) has eight cores, each with 4 fully simultaneously executing threads. The POWER7 can thus execute twice the number of threads simultaneously as the UltraSPARC T3. In order to decrease memory latency and ensure the cores are fed with instructions and data, the POWER7 has a huge, on-board 32MB L3 cache feeding eight dedicated 256KB, 8-cycle-latency L2 caches, which in turn pump data into 2-cycle-latency 32KB L1 data caches. Combined dual memory and SMP coherence controllers aggregate 2.9 Tb/sec of bandwidth. The POWER7 has as many floating-point units as threads.
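The contrast between holding thread contexts and actually executing threads reduces to simple arithmetic, using the figures above:

```python
# Threads actually executing at once, per the text: each T3 core runs one
# thread at a time despite holding 8 contexts, while each POWER7 core
# executes 4 threads simultaneously.

t3_executing = 16 * 1   # 16 cores x 1 executing thread = 16 (of 128 contexts)
p7_executing = 8 * 4    # 8 cores x 4 SMT threads       = 32

print(t3_executing, p7_executing)  # 16 32 -> POWER7 executes 2X as many
```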
IBM's POWER7 is the latest product in a successful road map of general-purpose processors designed with the horsepower to pound through heavyweight, compute-intensive tasks at nearly 4 GHz.
It is worth noting that Oracle's UltraSPARC T3 is curiously missing from this month's Hot Chips Conference agenda, even though its general availability is set for later this year. At least two IBM POWER-related sessions are scheduled at Hot Chips.
On August 17, 2010, IBM continues its roll-out of new POWER7-based systems, software, and solutions. Register for the webcasts: http://www-03.ibm.com/systems/power/advantages/
Paul Venezia's InfoWorld exposé [see: http://www.infoworld.com/t/mergers-and-acquisitions/oracle-customers-sun-sun-who-751] details what is in store for Sun's current customer base following the Oracle acquisition. Oracle's CEO, Larry Ellison, claims Oracle+Sun will form the basis for Oracle to emulate the IBM of the 1960s. This is a clear endorsement of IBM today, yet it disregards the fact that those market and technology realities existed fifty years ago, not today.
In the meantime, Oracle is engaging in tactics that will drive the last nail into the coffin for even SPARC/Solaris zealots. It was bad enough that Sun+Texas Instruments could not meet market performance windows for Sun's native UltraSPARC line of processors, leaving Sun SPARC customers having to choose between application-specific UltraSPARC-Tx, Niagara-class-based products or SPARC64-based servers from Fujitsu. Now Oracle is increasing the service and maintenance costs of aging Sun products, forcing Sun customers to upgrade. Upgrade to what?
Since the introduction of IBM's POWER5 in 2004, Sun's sales playbook against IBM has been predicated on the following golden rules:
* IBM will force you to upgrade your systems every two years, while at Sun we design our systems for a five-year refresh cycle. [Translation: Sun needs to create a plausible excuse in the face of IBM's technology and product development cycle being 2X faster than Sun's.]
* IBM makes servers that are benchmark machines, while Sun servers are balanced systems. [Translation: Need non sequitur FUD in light of IBM's superior performance.]
* When IBM brings GBS into your data center, System Admins, Systems Programmers, DBAs, etc., will lose their jobs. [Translation: Make it personal, do whatever you can to stay with Sun and keep away from IBM -- or be laid off!]
Oracle, the company that wants to emulate IBM, is now violating Sun's own stated golden rules. This is not surprising. For the past decade Sun claimed that the TPC-C OLTP industry-standard database benchmark was archaic and did not represent any aspect of the modern data center, and that, besides, when run in a cluster you can simply add servers and storage until the desired numerical result is achieved. This was convenient for Sun, since it had poor results on this benchmark. Then Oracle came along last fall, used a huge cluster of Sun's Niagara-based application-specific processors, and claimed a world-record TPC-C benchmark result.
If I were a Sun customer, I would be seriously considering transitioning away from another decade of techno-deception.