In the article EMC kills SPEC benchmark
with all-flash VNX,
Chris Mellor calls this a “watershed benchmark” and continues,
“The previous top SPECsfs2008 NFS v3 score was 403,326 ops/sec from
an IBM SONAS (Scale-Out NAS).”
EMC's result was 497,623 ops/sec.
“SPECsfs2008 is the
latest version of the Standard Performance Evaluation Corporation
benchmark suite measuring file server throughput and response time,
providing a standardized method for comparing performance across
different vendor platforms. SPECsfs2008 results summarize the
server's capabilities with respect to the number of operations that
can be handled per second, as well as the overall latency of the
operations.” Clearly, EMC's results are better
than IBM's published results.
However, without getting into minutiae, in comparing the basic
storage technology used by EMC and IBM,
almost all (93%) of EMC's drives are solid state disks (SSDs) and all of
IBM's storage uses 15K rpm hard disks. The advantages of SSDs are
well known, and SSDs are certainly an acceptable storage technology for this
benchmark. It should be noted that SSD technology
provides from over one to approximately two orders of magnitude more
random I/O ops/sec than 15K rpm drives (1), yet
EMC only reported a slight improvement over IBM's result in this
benchmark. Is cost perhaps the reason?
The cost difference between
200 GB SAS flash and 450-600 GB 15K SAS drives is in the wide range of
5-30X. The performance capability of SSDs, in this benchmark,
allowed EMC to use about ¼ the number of overall drives that IBM
used. Since the dollar cost of actual storage per
SPECsfs2008_nfs.v3 ops/sec
for the EMC result appears significantly higher than for IBM's, it is
not clear why a customer would spend so much extra for EMC's SSDs
rather than standard high-performance spindles for a 23% performance advantage. It almost appears as though EMC simply wanted a
benchmark result slightly higher than IBM's.
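The arithmetic behind this comparison can be sketched in a few lines of Python. The throughput figures, the ¼ drive-count ratio, and the 5-30X price range all come from the text above. Treating that range as a per-drive price ratio is an assumption made here purely for illustration, and the function names are hypothetical.

```python
# Published SPECsfs2008 NFS v3 throughput quoted above (ops/sec).
EMC_OPS = 497_623
IBM_OPS = 403_326


def relative_advantage(new: float, old: float) -> float:
    """Fractional improvement of `new` over `old`."""
    return new / old - 1.0


def storage_cost_per_op_ratio(drive_count_ratio: float,
                              price_per_drive_ratio: float,
                              ops_ratio: float) -> float:
    """Relative storage cost per op/sec, EMC versus IBM.

    Assumption: the 5-30X range quoted in the text is a per-drive
    price ratio, and drive cost is the only storage cost considered.
    """
    return drive_count_ratio * price_per_drive_ratio / ops_ratio


advantage = relative_advantage(EMC_OPS, IBM_OPS)
ops_ratio = EMC_OPS / IBM_OPS

# About 1/4 the drives, at 5X and at 30X the per-drive price.
low = storage_cost_per_op_ratio(0.25, 5, ops_ratio)
high = storage_cost_per_op_ratio(0.25, 30, ops_ratio)

print(f"EMC throughput advantage: {advantage:.1%}")
print(f"Storage cost per op/sec, EMC relative to IBM: {low:.2f}x to {high:.2f}x")
```

Under these assumptions the SSD configuration costs about the same per op/sec at the bottom of the price range and several times more at the top.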
Two weeks ago on an HP blog,
blogger John Pickett based much of his anti-IBM System z196 zBX claims
on what was “heard” rather than on hard evidence. The zEnterprise
BladeCenter Extension (zBX) is the new infrastructure for extending
tried and true System z qualities of service and management
capabilities across a set of integrated, fit-for-purpose POWER7 and
IBM x86 compute blades.
Pickett claimed: (responses are underlined)
IBM will use non-standard POWER7 and x86 blades. This is false;
IBM plans on using its standard blades.
Pickett then lists, numerically, half-truths, rumors and then
bases conclusions on them:
1) Will the zBX Blades be a replacement for the mainframe specialty
engines? No... so you'll have to determine when to run a workload on a mainframe general processor, a specialty engine, a zBX Power blade or a zBX x86 blade...and those are just the mainframe-centric options. The IBM blades are enhancements to the existing System z
infrastructure, which will utilize System z's existing application
administration. This is no different than any other application.
2) Why the need for unique Power7 and x86 blades specific to the
zBX? Doesn’t that defeat the purpose of an open environment? It
might if there were indeed unique IBM blades – but they are
standard IBM blades. zBX will not support every blade IBM ever
designed, but does support specific GA blades.
3) Will the zBX have the same availability as the mainframe? No.
Just because the zBX is connected to the mainframe does not mean the
availability from the mainframe is transferred. This assumption is
not based on anything published by IBM. Since the assumption is
false, the conclusion is false as well. The zBX chassis itself has
been "hardened" to be more like a mainframe in its
availability characteristics. All features are replicated - this
redundancy provides higher availability than standard blades. Even
the high speed private network is redundant. Redundancy allows
continued service in the case of an outage of a particular feature.
In addition, the zBX is monitored for availability and in case of
outage - a call home is initiated automatically. Do any HP blade
chassis have full feature replication?
4) Isn’t the business justification more than a little
challenging? Much of the cost parameters are based on soft
calculations to “increase operational efficiency” (the same which
can also be said of non-mainframe platforms). Should the application
run on a blade in the zBX, a specialty engine such as an IFL or a
mainframe general purpose processor? And don’t get me started on
the mainframe pricing schemes from WLC, AWLC, PSLC, zNALC, etc.
Notwithstanding Pickett never answering his question, the operational
efficiency comes from the housekeeping required of standard blades.
Things that are time consuming for standard blades include OS and
virtualization upgrades to keep all blades at the same release levels
and security levels. This is all done automatically, through rules,
by zManager and zBX.
5) How about investment protection? Can you use pre-existing IBM
Power Blades? No. Pre-existing IBM x86 Blades? No. Pre-existing z10?
No. These options were withdrawn by IBM prior to the zBX even being
shipped. Why force mainframe owners to upgrade to a z196 just to
evaluate the zBX? First, one can use pre-existing IBM blades if they
are the specified type. This means that these boards exist today.
Second, are we to conclude that HP supports any blades it ever
sold in any of its blade chassis? No. And for a third party view, “IBM
has been successful in making their chassis totally backward
compatible with their older modules and blades and most of their
newer modules and blades fit in their older chassis with performance
restrictions in rare cases, but that offer a great investment
protection to customers who is upgrading their chassis comparing to
HP which forcing their customers to toss their old blades and modules
out as none of it is compatible across chassis. Who knows if the next
HP chassis will follow up the same path as their current one, which
mean a total lost of investment when upgrading.”
6) Will ISV applications need to be retested and recertified?
Unknown. Perhaps unknown to Pickett is that ISV applications (and
customer apps) will work unchanged. If they ran on AIX before, they
will work in this environment. No retesting or recertification. ISV
applications are certified for an OS - on the zBX, the OS is the same
as on standard blades.
7) What about Windows Server and SQL support? Not available. This
is actually true, as of this date.
8) Is VMware supported? Nope, not there either. Nor is it
available on System z or on POWER7 systems. Both have superior and
more secure virtualization than is offered by VMware. However, the
point is that VMware is not required -- zManager provides most of
these functions, and the customer saves on license costs for VMware and on the
administration of VMware: its setup, upgrading, securing, etc. All
this is provided and managed by zManager.
9) The new URM (Unified Resource Manager) will simplify your
management, right? Not exactly. URM handles the hardware, but you
will still need other products such as Tivoli Provisioning Manager,
Tivoli Service Automation Manager and OMEGAMON for automation,
control and service management. Not required. This is a customer's
choice in terms of the service management functions they want to add
to the environment.
Pickett concludes: That really does not sound like something that
reduces complexity. Sure, if one bases a conclusion on wrong, poor,
and incomplete facts, as is the case here.
For almost the entire decade following Y2K, Sun Microsystems claimed the
TPC-C benchmark was irrelevant, not representative of the modern data
center and, moreover, could not be used for sizing. Subsequently, Sun
didn't publish any TPC-C results. This benchmark alienation came just
after Sun claimed its final world record E10K TPC-C results with
UltraSPARC-II processors and just before Sun introduced the
UltraSPARC-III, circa 2001. These actions were not accidents, nor is
the recent Oracle+Sun claim of a TPC-C result of 30,249,688 tpmC.
The UltraSPARC-III had a blocking L1 cache, designed to optimize SPEC
CPU95 benchmark execution. The UltraSPARC-III was late enough that
SPEC CPU95 had been retired and replaced by SPEC CPU2000. SPEC CPU2000
had a larger footprint and a different execution pattern than its
predecessor. Throughout the last decade, Sun's UltraSPARC processors
were plagued by poor single processor industry-standard benchmark
results. For Sun, publishing any TPC-C results would have been very
embarrassing. (I know, as a member of Sun's benchmark council.) When
industry standard benchmark results were good, Sun would publish them.
When results turned out poor, the benchmark was attacked. When
results became good “again”, they were published, as was done by
Oracle+Sun on December 2, 2010.
While the TPC-C benchmark could be characterized by light-weight thread
processing representing, “... the principal activities
(transactions) of an order-entry environment. These transactions
include entering and delivering orders, recording payments, checking
the status of orders, and monitoring the level of stock at the
warehouses” (see: http://www.tpc.org/tpcc/default.asp),
this benchmark does provide a relative measure of the ability of a
system to move data, with processing capability secondary (the handful
of SQL statements are rather trivial). Rapid data movement with
low-quality processing is a forte of Sun's T1, T2, T3, and T4
processors. Interestingly, it was only after Oracle purchased Sun
that TPC-C benchmarks on Sun SPARC were published again. It was known
as far back as 2005 that the UltraT1 generated relatively good TPC-C
results, but because the TPC-C benchmark had been deemed worthless, Sun
could not publish them lest it be called on the carpet for blatant
duplicity. Oracle must think today's customers have no medium-term
memory, a poor assumption for a database software company.
TPC-C results come in two flavors, single or clustered. A single result
represents the capability of a single server with its storage. A
clustered result approximates a cumulative sum of all the machines in
the cluster. The larger the cluster, the better the result. Of course,
clustering like this has its mechanical and networking asymptotes,
but generally you can pick a desired tpmC and then cluster servers
and storage until that result is achieved. Sun made this argument a
decade ago as a reason to avoid the TPC-C clustered results. In fact,
Sun used to claim that IBM and others had to cluster their servers to
get even publishable results.
TPC-C results can be used for certain comparisons. For example, the
latest Sun+Oracle TPC-C result was achieved using a cluster of
twenty-seven servers with 1726 SPARC processor cores. They then
compared the result with the best IBM result, which is a cluster of
three p780 servers with 192 POWER7 cores. Sun+Oracle has a 3X better
result than IBM with 9X the cores and 9X the servers. The quotient is left as
an exercise for the reader!
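For readers who would rather not do the exercise by hand, here is a minimal Python sketch of the quotient. The server and core counts are the ones quoted above. The 3X throughput ratio is the rounded figure stated in the text, not a value computed from the published tpmC numbers, and the function name is hypothetical.

```python
# Cluster sizes quoted in the text above.
ORACLE_SERVERS, ORACLE_CORES = 27, 1726
IBM_SERVERS, IBM_CORES = 3, 192

# Assumption: the rounded "3X better result" stated in the text.
RESULT_RATIO = 3.0


def normalized(result_ratio: float, resource_ratio: float) -> float:
    """Throughput ratio per unit of resource (core or server)."""
    return result_ratio / resource_ratio


core_ratio = ORACLE_CORES / IBM_CORES
server_ratio = ORACLE_SERVERS / IBM_SERVERS

per_core = normalized(RESULT_RATIO, core_ratio)
per_server = normalized(RESULT_RATIO, server_ratio)

print(f"Core ratio:   {core_ratio:.2f}x")
print(f"Server ratio: {server_ratio:.2f}x")
print(f"Sun+Oracle throughput per core, relative to IBM:   {per_core:.2f}")
print(f"Sun+Oracle throughput per server, relative to IBM: {per_server:.2f}")
```

On these figures each Sun+Oracle core and each Sun+Oracle server delivers roughly a third of the throughput of its IBM counterpart.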
a heritage of duplicity, note the title of another blog on the same
blogs.sun.com site where Oracle's latest TPC-C claim was made:
What to believe is up to the imagination of the reader!
All Virtualization was not Created Equal
Server virtualization, or the ability to run more than one operating environment on a single piece of hardware, while simultaneously letting each operating environment (virtual machine) think it is the only one running, has clearly reached the data center mainstream. However, all server virtualization was not created equal.
Virtualization was first commercially available in the 1960s with CP/CMS on IBM’s System/360 mainframes and has matured over four decades into today’s z/VM on IBM’s System z. System z can run 1964 COBOL in one virtual machine and state-of-the-art Linux in another. IBM’s System z is the most virtualizable server ever designed. For example, any one 4GHz processor core can, at any time, become:
* Central Processors -- Running the z/OS operating system (and others)
* System Assist Processors -- Offloading I/O processing from the z/OS central processors
* Internal Coupling Facility -- Running special z/OS clustering microcode
* Integrated Facility for Linux -- Running Linux at a lower cost than on z/OS central processors
* Application Assist Processor -- Lowering costs by off-loading Java from z/OS central processors
* Integrated Information Processor -- Lowering costs by off-loading selected DB2 work from central processors
* Any spares -- Can dynamically replace any failing core
Moreover, this architecture allows multiple concurrent hypervisors, whereas typically one hypervisor controls an entire server environment. A hypervisor lies between the physical hardware and virtual machines and makes one physical resource look like multiple virtual resources. It abstracts the physical hardware resources into logical representations used by virtual machines and regulates virtual machine access to these abstracted resources. Advanced hypervisor functionality can combine multiple physical resources into shared pools from which users receive virtual resources on demand.
The System z PR/SM hypervisor has been available since 1988, and:
* Has a CC EAL5 security rating
* Creates separate pools of z/OS and z/VM LPARs.
* Allows CPU resources to be shared within these pools
* Has memory dedicated to an LPAR
* Manages low latency virtual networks
* Facilitates I/O sharing between or dedicated to LPARs
A classic, single operating system running on a physical server timeshares the execution of threads and tasks. Add virtualization, and multiple virtual machines must not only timeshare the underlying environment, but each must also switch its own threads and tasks. Until recently, it would have been difficult to implement a viable virtualization infrastructure atop commercial off-the-shelf hardware. The switching and housekeeping overhead was prohibitive on older, slower, general purpose server technology with limited memory footprints. Historically, servers and operating systems were not virtualization-ready, except for System z [IBM Mainframe]. Most high performance enterprise processors did not even have a hypervisor execution mode, which is necessary for efficient virtualization.
Modern high performance processors have user and kernel (used only by the operating system) execution modes. This was the classic execution architecture: user space and [single] operating system kernel space. This worked fine when server performance was a primary marketing feature, regardless of low system utilization. Attempts to run multiple OSs on the same server, given constraints such as processor designs without a hypervisor privilege level, Non-Uniform Memory Architecture, and lack of I/O management units, resulted in servers with hardware-partitioned domains, such as Sun’s SPARC & Solaris Domains or HP’s SuperDomes & nPars. These classic limitations, clearly present on x86 processors, were eventually addressed by novel technologies from VMware and others that would trap and translate (and cache) executing x86 kernel opcodes into user mode opcodes, allowing the hypervisor to run as if in kernel mode.
Hardware partitioning is crude compared with today’s virtualization, for the underlying hardware is not shared; rather, the partitions are electrically isolated from one another, expressly lacking the ability to share processing capability or other system resources. In such an environment a single OS/hard partition could be running at 100% while other partitions idle, wasting overall capacity.
Until very recently, Intel and AMD x86 processors had almost no capability of aiding hypervisors in abstracting hardware. The latest Intel Nehalem-class processors, with Virtual Machine Extensions and chipset extensions, greatly help hypervisors abstract underlying hardware and keep track of virtual machine status. Eventually commercial x86 virtualization packages will make more use of these new capabilities. It would have been preferable if all x86 processors had been designed with virtualization in mind, but that was not the case.
The IBM POWER processor was specifically designed to operate in a virtualized environment. The processor has three execution modes: hypervisor, kernel, and user, has peripheral management units, etc. Having been designed for a virtualized environment, the POWER hypervisor, PowerVM, is a firmware product. Since PowerVM functions synergistically with POWER’s architecture, it has the ability to abstract a single POWER core down to 1/10 of a processor, (alternatively create 10 virtual or logical processors out of one physical core) within multiple shared processor pools containing up to 254 logical processors. These kinds of capabilities would be rather difficult to match on current x86 class processors and virtualization suites.
System virtualization enables the consolidation of systems, workloads, and operating environments, optimizes resource use, and improves data center flexibility and responsiveness. Virtualization provides the following benefits:
* Consolidation and reduction of hardware costs -- virtualization enables efficient access to and management of resources, reducing operation and system management costs while maintaining needed capacity. Typical server-wide utilization of ~20% is increased to over 80% with well implemented virtualization.
* Optimization of workloads -- virtualization enables dynamic response to the application needs of its users. Virtualization increases the use of existing resources by enabling dynamic sharing of resource pools.
* IT flexibility and responsiveness -- virtualization provides a single, consolidated view of, and easy access to, all available resources in the network, regardless of location.
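To put the utilization figures above into numbers, the following Python sketch estimates how many servers remain after consolidation. It is a back-of-the-envelope model only: it assumes identical servers and perfectly additive workloads, and both the function name and the 100-server starting point are hypothetical.

```python
import math


def servers_after_consolidation(server_count: int,
                                current_utilization: float,
                                target_utilization: float) -> int:
    """Estimate servers needed once workloads are consolidated.

    Assumptions: all servers have the same capacity and workloads
    add up linearly, with no virtualization overhead.
    """
    busy_capacity = server_count * current_utilization
    # Round first so floating-point noise cannot push ceil() up by one.
    return math.ceil(round(busy_capacity / target_utilization, 9))


# Hypothetical data center: 100 servers at ~20% utilization,
# consolidated to run at 80% utilization.
remaining = servers_after_consolidation(100, 0.20, 0.80)
print(f"Servers needed after consolidation: {remaining}")
```

In this simplified model, moving from 20% to 80% utilization lets one server do the work of four.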
When contemplating virtualization, consider the consolidation of logical resources rather than physical resources into an environment designed to support server, storage, and network virtualization. Adding virtualization technologies to your data center creates an on demand, secure, and flexible infrastructure prepared to automatically handle dynamic workload changes in your data center environment.
Refer to http://www-03.ibm.com/systems/virtualization/ for further information on IBM virtualization.
There has been quite a lot of talk
about ARM Holdings and the ARM processor lately. Some of this is due
to the pervasiveness of its architecture in many mobile devices, some
of it is due to extensive hype over “new technology” versus “old
technology” – an unfortunate metaphor.
Are we to believe processor designers
who license the rights to the ARM processor technology are going to
“one up” traditional server processor architectures simply
because they started out with a stripped down, energy-efficient CPU? Let's take a look at why not!
Benchmark results specifically
targeting these low-power processors have begun to be published.
Many of these benchmarks are based on the Dhrystone benchmark, run on
8088-class processors back in the 1980s! Performance for this class
of processor is usually measured in DMIPS (Dhrystone Millions of
Instructions per Second), roughly based on a VAX 11/780 MIPS. These benchmarks
are a far cry from industry standard benchmarks such as the SPEC or
TPC-C warehouse database suites. Before one starts yelling, how can
one expect the ARM class of processors to do well on these
benchmarks? One cannot simultaneously reject so-called “old
technology” while extolling the wonders of a 30-battery-hour handheld
tablet processor in micro servers. It would indeed be interesting to see
SPECint2006 results for these processors, but none seem to exist. The
same goes for a TPC-C result. It is noteworthy that a dual core 1.6 GHz
Atom processor generates about 8000 DMIPS and a dual core Cortex A9
about 4000. This means that if Intel had to drop its clock to, say,
1 GHz to be in the same heat dissipation range as the Cortex A9, they
would have “similar” performance – in a single socket.
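The clock-scaling estimate above can be sketched in Python as follows. The DMIPS figures and clock rates are those quoted in the text. The assumption that Dhrystone throughput scales linearly with clock frequency is a simplification introduced here, and the function name is hypothetical.

```python
# Figures quoted in the text above.
ATOM_DMIPS = 8000        # dual core Atom at 1.6 GHz
ATOM_GHZ = 1.6
CORTEX_A9_DMIPS = 4000   # dual core Cortex A9
TARGET_GHZ = 1.0         # the "say, 1 GHz" figure from the text


def scale_with_clock(dmips: float, from_ghz: float, to_ghz: float) -> float:
    """Rescale a DMIPS figure to a different clock.

    Assumption: DMIPS scales linearly with clock frequency, which
    ignores memory and cache effects.
    """
    return dmips * to_ghz / from_ghz


atom_at_target = scale_with_clock(ATOM_DMIPS, ATOM_GHZ, TARGET_GHZ)
ratio = atom_at_target / CORTEX_A9_DMIPS

print(f"Atom at {TARGET_GHZ} GHz: about {atom_at_target:.0f} DMIPS")
print(f"Relative to Cortex A9: {ratio:.2f}x")
```

Under the linear-scaling assumption, the down-clocked Atom lands within about a quarter of the Cortex A9 figure, which is the sense in which the two are "similar".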
In reality, “new technology” (ARM)
and “old technology” (Intel, AMD, IBM POWER) are two different
technologies, neither chronologically distinct. If we expect to see a
farm of micro servers each with 100 ARM or ARM-like Systems on a Chip
in 1U form factors, one should expect they will be running commercial
grade applications, the least of which would be web and database
servers. Would we see a SPECweb2005 result published for a 1024-socket
ARM-based micro web server? We had better.
Is one supposed to assume that
designers of Intel x86 or IBM POWER are simply wasting millions of
transistors due to negligence? No! Will the processors in the “new
technology” micro servers use a new way for cache coherency
heretofore unknown to the world? I doubt it. SMP cache coherency uses
transistors and consumes bandwidth. As more performance is demanded
from these ARM-class micro servers, processor designers will slowly
be incorporating techniques from “old technology” such as huge
out of order execution windows, complex caches, novel inter-socket
communications, multi-threaded execution and the ability to address
huge memory spaces. All these require complexity, transistors, and
watts. By the time all this has been accomplished the wheel will have
been re-invented again, with these micro servers dissipating about the
same heat as the “old technology” processors. If it takes a
given number of transistors to perform some advanced function such as
wide instruction execution and complex branch prediction, etc., the
ARM-class of processors will not perform such functions while
simultaneously violating the laws of solid state physics.
The hype surrounding this “new
technology” sounds strikingly familiar to what Sun Microsystems
claimed in the last half of the previous decade regarding its
“disruptive” Niagara “technology”. Sun said Thread Level
Parallelism was taking over the data center, since single thread
(Instruction Level Parallelism) was out of gas. Intel didn't think
so! AMD didn't think so! IBM didn't think so! Sun placed eight very
simplistic SPARC cores on a die, with each executing, at any given
clock tick, one of up to eight thread contexts. Sun claimed clock
speed didn't matter because slow memory interfaces and long latencies
determined system throughput, not clock. Sun could claim something on
the order of a watt per [thin] thread context, versus perhaps 25W per
[heavy] thread from its competition. Well, about half a decade later
Sun+Oracle have reached a point where their processors now dissipate
basically the same amount of heat as established Intel, AMD, or IBM
POWER processors, and are considering reducing thread count and
cranking up the clock – to be competitive with their competition.
Sun's [now Oracle's] competition never felt the need to sacrifice
single thread performance, all the while adding cores and real
Simultaneous Multi Threading. The IBM POWER7 now has eight cores,
each capable of executing 4 instruction threads at the same time. A
single POWER7 can execute 32 threads simultaneously at a clock rate
nearly triple that of Oracle's Niagara-based processors. So much for
the hype! Something similar will have to happen with the “new
technologies” such as ARM-class processors in micro servers if they
expect to play with “old technology” big boys.
As with most things, you don't get
something for nothing.
In Oracle® Database 11g Running on
Oracle’s SPARC Enterprise M9000 Server Sets World Record TPC-H
Three Terabyte Non-Clustered Benchmark Result,
Oracle's tradition of making claims, hoping the reader will not
examine the details, continues.
In last week's disclosure, Oracle
on a SPARC Enterprise M9000 server, equipped with 64 SPARC64 VII+ 3.0
GHz processors, and Sun Storage 6180 arrays, Oracle Database 11g
Release 2 with Oracle Solaris achieved a world record TPC-H 3 TB
non-clustered performance result of 386,478 QphH@3000GB with a price
Oracle is NOT the category winner in either QphH
or USD/QphH, the only two benchmark metrics. In fact, tpc.org has it
ranked number 2, clustered or not. Perhaps what Oracle referred to as a record is the
system cost record for this category, which it does win at
demonstrates that Oracle Database 11g Release 2 running on the SPARC
Enterprise M9000 server was 2.4 times faster than the IBM Power 595
system(3) and loaded the entire database 3.3 times faster than the
above system while maintaining the highest level of data protection
at a lower cost per transaction(3).
Upon looking at the actual Oracle and IBM benchmark disclosures, it must be noted:
- Oracle used 256 cores to achieve its latest result
- IBM used 64 cores, and its result was run on POWER6, IBM's prior generation processor.
Oracle required 4X
as many Fujitsu cores as IBM used POWER6 cores for a 2.4X
performance differential. Since TPC-H is a business analytics
benchmark, and business analytic applications have per core
licensing, even taking into consideration the M9000 “performance”
advantage, Oracle's M9000 SW cost would be 1.67 times that of IBM's prior generation p595 server (normalizing per core performance).
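The 1.67 figure can be reproduced with a short Python sketch. The core counts and the 2.4X performance ratio are the ones cited above. The sketch assumes the same per-core license price on both platforms, which is a simplification made here for illustration.

```python
# Core counts from the two benchmark disclosures cited above.
ORACLE_CORES = 256
IBM_CORES = 64

# Oracle's claimed performance advantage over the IBM Power 595.
PERFORMANCE_RATIO = 2.4

core_ratio = ORACLE_CORES / IBM_CORES

# Assumption: software is licensed per core at the same price per core
# on both systems, so license cost tracks core count. Dividing by the
# performance ratio normalizes the cost to equal delivered performance.
relative_sw_cost = core_ratio / PERFORMANCE_RATIO

print(f"Core ratio: {core_ratio:.0f}x")
print(f"Software cost per unit of performance, Oracle vs. IBM: {relative_sw_cost:.2f}x")
```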
Oracle also claimed
it loaded its database 3.3 times faster than IBM. However, the
details show Oracle's Total
Storage / Database Size is 102.6, whereas IBM's
Total Storage / Database Size is 6.58. This means:
- To store 3TB of data, Oracle used 308TB (a storage-to-database-size ratio of 102.6),
- For the same 3TB of data, IBM only needed 20TB (a storage-to-database-size ratio of 6.58).
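The storage totals follow directly from the two disclosed ratios, as this small Python sketch shows. The ratios and the 3 TB database size come from the text; the variable names are illustrative only.

```python
# Database size for this TPC-H scale factor, in TB.
DATABASE_TB = 3.0

# "Total Storage / Database Size" ratios quoted from the disclosures.
ORACLE_STORAGE_RATIO = 102.6
IBM_STORAGE_RATIO = 6.58

oracle_tb = DATABASE_TB * ORACLE_STORAGE_RATIO
ibm_tb = DATABASE_TB * IBM_STORAGE_RATIO

# How many times more storage Oracle configured than IBM.
storage_multiple = oracle_tb / ibm_tb

print(f"Oracle total storage: {oracle_tb:.1f} TB")
print(f"IBM total storage:    {ibm_tb:.2f} TB")
print(f"Oracle used {storage_multiple:.1f}x the storage of IBM")
```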
Oracle had massive storage, almost 16 times that of IBM per database size.
While this is totally legal and legitimate, storage has its costs. The
end result is that the price/performance metric for this benchmark,
USD per QphH@3000GB, for
Oracle and IBM is:
- Oracle M9000: 19.25
- IBM p595: 20.60
Oracle's latest TPC-H price/performance is 1.07X
better than IBM's prior generation result. Not much to brag about,
considering that Oracle's system costs $7.4M and IBM's $3.2M. One
might conclude Oracle simply kept adding cores and storage until it
just passed IBM's result. In any case, Oracle provides 1.07x better price/performance for 2.3x the price.
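Both ratios can be checked with a few lines of Python. The USD/QphH values and system costs are the ones quoted above; the variable names are illustrative only.

```python
# Price/performance (USD per QphH@3000GB), lower is better.
ORACLE_USD_PER_QPHH = 19.25
IBM_USD_PER_QPHH = 20.60

# Total system cost in millions of USD, as quoted in the text.
ORACLE_SYSTEM_COST_MUSD = 7.4
IBM_SYSTEM_COST_MUSD = 3.2

# How much better Oracle's price/performance is than IBM's.
price_perf_advantage = IBM_USD_PER_QPHH / ORACLE_USD_PER_QPHH

# How much more the Oracle configuration costs overall.
system_cost_ratio = ORACLE_SYSTEM_COST_MUSD / IBM_SYSTEM_COST_MUSD

print(f"Oracle price/performance advantage: {price_perf_advantage:.2f}x")
print(f"Oracle system cost relative to IBM: {system_cost_ratio:.1f}x")
```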
benchmark underscores the ability of SPARC Enterprise M-series
servers to deliver near-linear scalability and handle the
increasingly large databases required of decision support and data
warehousing systems. Neither IBM nor HP matched this level of
performance in the 3TB scale factor category using a single system,
further highlighting the performance capabilities of multi-processor
SPARC systems in the most demanding enterprise application
As noted above, in this 3TB category,
Oracle is not even the best performer. Fujitsu's
own RX300 X4, a 640-core Xeon server wins.
In the 10TB category, IBM beats Oracle.
Only HP bothered to report results for
the 30TB category, and it did that in 2007.
When Sun Microsystems' native SPARC
processors were sucking wind, Sun marketing began talking down single
threaded, high-clocked, large, fast cache-based execution environments
in favor of a mythical transformation of almost all applications into
thread-rich execution environments. Sun popularized the term Thread Level
Parallelism [TLP]. Now that Oracle has purchased Sun, we read
that single threaded, high-clock rate execution is being demanded by
Oracle applications. Changing horses twice mid-stream does not
impress data center managers.
Timothy Prickett Morgan noted,
“Oracle has been promising a 3X
improvement in "single strand" performance, which everyone
mean clock speed.”
“...Oracle might be overclocking the
Sparc chips to reach the 5 GHz stratosphere of chip clock speeds.
While this might not be the case, the question we need to be asking
Oracle - and remember, Oracle
doesn't answer questions - is: if not,
In the 2000s, Sun's customers were expecting explanations for its
traditional UltraSPARC processors' lack of performance. In reality
Sun, via Texas Instruments [TI], was not able to successfully
fabricate traditional high-clocked, large cache, state-of-the-art
processors. Traditional processors, such as IBM POWER or Intel x86,
were designed to maximize Instruction Level Parallelism [ILP] with
fast single thread execution.
In mid-2002 Sun purchased Afara, the firm that designed processors with
slow clocks and simple cores able to maximize the execution of many
threads. TI was able to fabricate these processors with simple cores,
small caches, and identical copies placed on a single die. This
created Sun's Niagara processor line, known today as the UltraT1, 2,
3, etc. Sun began its CMT marketing campaign claiming that processor
clocks had reached an asymptote and memory performance was scaling
at 1/3 that of processor clocks, condemning traditional execution to
the dust bin of history. Sun's CMT technology was purported to save
the data center and do so at a low heat dissipation per thread
regime. Sun's argument was that ILP had reached the end of the line,
processor clocking had reached the point of creating unimaginable
power densities, and memory technology was never going to catch up.
CMT contrarian market hype was taking place as IBM POWER4, the first
commercial general purpose multi-core processor, was setting
performance records and Intel's Xeons were approaching 4 GHz.
IBM's POWER6 hit 5GHz several years ago, and today's IBM System z
(mainframe) processors run at 5.2 GHz. What Sun proclaimed as a
semiconductor technology wall was torn down with clever designs by
IBM, Intel and AMD. Sun sacrificed single thread performance as the
cost of keeping a processor line alive. Sun paid the price as it
lost market share. IBM and Intel today have multiple core
processors running multiple simultaneous threads, never having to
sacrifice single thread performance in the interim.
As we enter this decade, it appears that Sun+Oracle plans on cranking up
the clocks on their CMT processors while keeping the core count
constant. In addition, Sun+Oracle appears to be adopting the
capability to dynamically alter the number of threads per core
allowing more of the CPU core to execute the thread (contrary to its
CMT market hype) and enabling more cache per thread! Sound familiar?
It should, considering IBM introduced it earlier last year calling it
Intelligent Threading (see:
has basically contradicted nearly all of Sun's CMT marketing hype.
The following link is one of the few remaining original CMT
justification presentations still on the web, outside of oracle.com:
Most of Sun's CMT processor presentations seem to have been excised
from the web. A June 2005 blog that is still active at Oracle, http://blogs.sun.com/esaxe/entry/cmt_performance_enhancements_in_solaris
“Rather than butting heads with the
laws of physics in an attempt to quickly burn though a single
instruction stream (stumbling and stalling along the way), CMT
processors do more by allowing multiple threads to execute in
It wasn't that Sun's processors didn't
meet performance expectations due to the laws of physics. Rather,
they failed in meeting the challenge in designing and fabricating
processors given the limits of solid state physics. It appears that
Sun+Oracle are playing catch up again against IBM and Intel –
neither of which waited around for the “laws of physics” to ease
So, who cares? What difference does it
make to me – I don't even watch Jeopardy? While what IBM is
pursuing may be somewhat dismissed as veiled gratuitous public
relations by some pundits,
this human-like intelligence is a demonstration of what will permeate
our lives well within a generation. The research and development
prowess that IBM will demonstrate on nation-wide TV, regardless if it
“wins”, represents the type of gating technology that will be the
progenitor for an industry that doesn't even exist yet: human
assistants ranging from intelligent prosthetics to nannies to
soldiers and home robots. Sound wild? A generation ago who would have
thought there would be multiple computing platforms in the home?
The types of technological challenges
that have to be overcome to realize a “home robot
market” include: simultaneous extraction of multiple emotions from
enhanced speech and facial recognition, natural language interfaces,
cognitive abilities, symbolic interpretation of live vision objects,
tactile grasping, near-instantaneous database access or some digital
neural equivalent, etc.
What was learned from IBM Deep Blue's
victory over Grandmaster Garry Kasparov in 1997
was an early step. Today's ability to take on the best of
Jeopardy allows us to learn and define the technological hurdles that
must be cleared to usher in the next revolution in computing.
IBM taking on the best of Jeopardy is important to anybody who has an
interest in their high-technology career over the next twenty years.
New markets and industries will be created that are unimaginable
more information see:
Today, 2/14/2011, the first of three Jeopardy! sessions between the top two Jeopardy! champions and IBM Watson will air on national TV. As each question is asked, a lot will be taking place, and many of us will be wondering what is happening inside IBM Watson. Just what is going on?
While IBM Watson's entire execution infrastructure has not been
published, we do know that each compute element consists of a
commercially available IBM POWER 750 server.
The entire interconnected cluster looks like a set of library shelves.
Many of us will wonder, each time a question is asked, what is going on in the three seconds given to the contestants. As humans, we can more or less understand being a Jeopardy! contestant. Many people will invariably not know the answer within three seconds but will retort after the correct response is given: "Oh, I knew that!" Watson is not doing that, although it has been reported that IBM Watson has a good idea of the types of questions and answers that have previously appeared on Jeopardy! Instead, Watson's POWER7 processors are pumping through 15 TB of data (equivalent to about 200 million pages of text) at a rate of 500 GB/s each, concurrently. But first, Watson has to understand the question. It has to determine verbs, nouns, objects and, moreover, nuances of the English language that are not generally part of a standard English 101 class. Next, Watson must look for the best answer. What might be the basic applications used to accomplish this massive task?
It has been reported that Watson runs on Linux, and that DeepQA (Watson's SW application stack) also uses Hadoop and UIMA applications. UIMA stands for Unstructured Information Management Architecture, and according to Wikipedia, "UIMA is a component software architecture for the development, discovery, composition, and deployment of multi-modal analytics for the analysis of unstructured information and its integration with search technologies developed by IBM. The source code for a reference implementation of this framework has been made available on SourceForge, and later on Apache Software Foundation website." This is an application that intelligently digests and correlates information that otherwise appears amorphous.
According to my previous blog entry, https://www-950.ibm.com/blogs/davidian/entry/what_runs_watson_and_why16?lang=en_us, IBM Watson has 4 TB of storage but 16 TB of system-wide memory. Such an architecture suggests an in-memory database or at least in-memory data structures. Indeed, Watson uses Apache's Hadoop framework to preprocess the large volume of data in order to create in-memory datasets. To provide effective CPU scheduling, the Hadoop file system includes location awareness, that is, knowledge of the physical location of each node, rack and network switch. Hadoop applications can use this information to schedule work on the node where the data is and, failing that, on the same rack/switch, reducing backbone traffic. The Hadoop file system also uses this information when replicating data, trying to keep different copies of the data on different racks.
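To make the location-aware scheduling idea concrete, here is a minimal Python sketch. It is not Hadoop's actual scheduler or API; the `Node` class, the `pick_node` function and the locality labels are names made up for illustration. It only shows the preference order described above: the node holding the data first, then a node on the same rack, then anything that is free.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Node:
    """Hypothetical cluster node, identified by its name and the rack it sits in."""
    name: str
    rack: str


def pick_node(replica_nodes, free_nodes):
    """Pick a free node for a task, preferring data locality.

    replica_nodes: nodes that hold a copy of the task's input data.
    free_nodes: nodes that currently have capacity to run the task.
    Returns (node, locality_label); node is None when nothing is free.
    """
    replica_names = {n.name for n in replica_nodes}
    replica_racks = {n.rack for n in replica_nodes}

    # First choice: run where the data already is.
    for node in free_nodes:
        if node.name in replica_names:
            return node, "node-local"

    # Second choice: stay on the same rack/switch to spare the backbone.
    for node in free_nodes:
        if node.rack in replica_racks:
            return node, "rack-local"

    # Last resort: any free node, paying for cross-rack traffic.
    if free_nodes:
        return free_nodes[0], "off-rack"
    return None, "none"
```

For example, with replicas on node n1 (rack r1) and node n5 (rack r2), a free node n2 on rack r1 would be chosen as rack-local ahead of a free node on an unrelated rack.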
"Watson’s DeepQA UIMA annotators were deployed as mappers in the Hadoop map-reduce framework, which distributed them across processors in the cluster. Hadoop contributes to optimal CPU utilization and also provides convenient tools for deploying, managing, and monitoring the data analysis process." For more information see: http://www.ibm.com/common/ssi/cgi-bin/ssialias?infotype=SA&subtype=WH&htmlfid=POW03061USEN&attachment=POW03061USEN.PDF&appname=STGE_PO_PO_USEN_WH
When watching Jeopardy! tonight, try to keep in mind that for every question, IBM Watson has to, at a minimum, within 3 seconds:
- Take the stated question and parse its components
- Determine relationships between grammatical elements
- Create items that it must look for or relationships that may expand its search
- Have Hadoop dispatch work to access information that UIMA has intelligently digested and annotated
- Run 2,880 POWER7 cores through TBs of data looking for the best set of results
- Have DeepQA determine what it considers the best response, and
- Press a mechanical button, as the human contestants do, and express the answer in English.
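The Hadoop dispatch step in the list above, with UIMA annotators deployed as mappers, can be sketched roughly as follows. Everything here is an assumption for illustration: the toy annotator merely flags capitalized words, `map_reduce` runs in a single process, and none of it is the DeepQA, UIMA or Hadoop API.

```python
from collections import defaultdict


def capitalized_word_annotator(doc_id, text):
    """Toy stand-in for an annotator: emit (word, doc_id) for each capitalized word."""
    for word in text.split():
        token = word.strip(".,!?")
        if token[:1].isupper():
            yield token, doc_id


def map_reduce(documents, mapper):
    """Run `mapper` over every document, then group the emitted values by key.

    In a real cluster the map phase would be spread across many processors;
    here it is a plain loop so the data flow is easy to follow.
    """
    grouped = defaultdict(set)
    for doc_id, text in documents.items():      # map phase
        for key, value in mapper(doc_id, text):
            grouped[key].add(value)
    # Reduce phase: collapse each key's values into a sorted list of documents.
    return {key: sorted(values) for key, values in grouped.items()}


documents = {
    "d1": "Watson plays Jeopardy.",
    "d2": "IBM built Watson",
}
index = map_reduce(documents, capitalized_word_annotator)
```

The resulting `index` maps each annotated term to the documents that mention it, which is the kind of pre-digested structure a question-answering pass could then search in memory.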
Let the best "man" win!
To suggest comparability
between the available IBM POWER7 and Oracle's yet-to-be-released
UltraSPARC T3 (Niagara 3) is like juxtaposing a BMW X6 and a school
bus, respectively. Certainly both vehicles transport people, both are made of
metal and both burn hydrocarbon fuel – but this is where the comparison
ends. Interestingly, Oracle claims a school bus analogy for its Chip Multi
Threading (CMT) architecture, saying it represents computing
requirements in today's data center. Oracle says it is more efficient
to transport, say, 40 students in a school bus at one time, although
slowly, than to transport 8 groups of 5 students in an X6 running
back and forth at lightning speed. Unfortunately, we don't have
40 students to transport, but perhaps fewer than 5. A school bus is
an application-specific vehicle, as is Oracle's CMT
application-specific processor architecture.
Oracle's CMT argument also
claims that single, heavyweight thread performance (the BMW X6)
is not as important as the ability to execute multiple,
low-performance threads (the school bus). In contrast, IBM's POWER
and Intel's x86 are designed for general-purpose computing
requirements: heavyweight thread processing (the ability to execute the
maximum number of instructions/clock) with fast clocks, large
low-latency local caches, branch prediction, and out-of-order
execution. Today, these general-purpose processors also execute many
HW threads simultaneously without having been designed to sacrifice thread execution
quality for thread quantity. One of the few widespread application-specific execution environments demanding the efficient execution
of scores of low-demand threads is a web server under heavy load.
Another is shuffling around streams of data. UltraSPARC T3-based
systems are good web servers, but are architecturally challenged in heavy processing of that data. Real-life benchmarks speak for
themselves – see my previous blog entry.
Oracle claims that by
doubling the HW thread context count in the UltraSPARC T3 over its
predecessor, the UltraSPARC T2 (Niagara 2), overall performance will
double. Such an increase in performance could only occur if the
execution environment were thread-starved, that is, if more threads were ready to run than there were contexts to hold them. Conversely, since few
applications spawn scores of threads, executing such an application on a
processor that has double the thread contexts of its predecessor
will not provide any more performance. This is similar to designing a
new school bus that now holds 80 students, but is still only
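The thread-context argument can be put into a deliberately simple model. The sketch below assumes every busy thread context contributes the same amount of throughput, which real hardware does not guarantee, and the function names are hypothetical. It is meant only to show when doubling the context count can and cannot help.

```python
def busy_contexts(runnable_threads: int, hw_contexts: int) -> int:
    """Number of HW thread contexts that actually have work to do."""
    if runnable_threads < 1 or hw_contexts < 1:
        raise ValueError("need at least one runnable thread and one HW context")
    return min(runnable_threads, hw_contexts)


def speedup_from_more_contexts(runnable_threads: int,
                               old_contexts: int,
                               new_contexts: int) -> float:
    """Idealized speedup when only the context count changes.

    Assumption: throughput is proportional to the number of busy contexts.
    """
    return (busy_contexts(runnable_threads, new_contexts)
            / busy_contexts(runnable_threads, old_contexts))


# Going from 64 contexts (UltraSPARC T2) to 128 contexts (UltraSPARC T3):
few_threads = speedup_from_more_contexts(16, 64, 128)    # application with 16 threads
many_threads = speedup_from_more_contexts(256, 64, 128)  # workload with 256 threads
```

Under this model the 16-thread application gains nothing from the extra contexts, while only a workload with at least 128 runnable threads sees the full doubling.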
Oracle's UltraSPARC T3 and
IBM's POWER7 are both two billion transistor processors and dissipate
about the same amount of heat. As seen in the UltraSPARC T3 die
photograph just below, the processor has sixteen (1.6GHz) cores, each
holding 8 HW thread contexts, providing a total of 128 HW thread
contexts per socket. However, each core only executes one thread at
any given time, if a thread is actually available for that core.
Literature on this topic tends to be blurred – giving the
impression that at any given time all 128 thread contexts are
executing simultaneously. In fact, each core is so simple that even
branch prediction is non-existent, forcing a thread switch on any
cache miss. Cores communicate with a shared 6MB L2 cache via crossbar
switches. The processor has on-board memory, PCIe, Ethernet, and SMP
coherency controllers. With all of them running at full blast, a theoretical
maximum BW of 2.4Tb/sec is achieved, but it can only be sustained with a
large number of available threads and full-bore I/O running.
Oracle UltraSPARC T3
In contrast, the POWER7
(see die below) has eight cores, each with 4 fully simultaneously
executing threads. The POWER7 can execute twice as many threads
simultaneously as the UltraSPARC T3. In order to decrease memory
latency and ensure the cores are fed with instructions and data, the
POWER7 has a huge, on-board 32MB L3 cache feeding eight dedicated
256KB, 8-cycle latency L2 caches, pumping data into 2-cycle latency
32KB L1 data caches. Combined dual memory and SMP coherence
controllers aggregate 2.9 Tb/sec of BW. The POWER7 has as many
floating point units as threads.
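Using only the figures quoted in this post, the difference between thread contexts and simultaneously executing threads works out as follows. The `Chip` class and its field names are illustrative assumptions; in particular, "one executing thread per core" for the T3 is this post's characterization rather than a vendor specification.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Chip:
    """Per-socket thread figures as quoted in this post (not vendor data sheets)."""
    name: str
    cores: int
    contexts_per_core: int    # HW thread contexts held by each core
    executing_per_core: int   # threads each core executes at any given instant

    @property
    def contexts(self) -> int:
        return self.cores * self.contexts_per_core

    @property
    def simultaneous(self) -> int:
        return self.cores * self.executing_per_core


t3 = Chip("UltraSPARC T3", cores=16, contexts_per_core=8, executing_per_core=1)
power7 = Chip("POWER7", cores=8, contexts_per_core=4, executing_per_core=4)

for chip in (t3, power7):
    print(f"{chip.name}: {chip.contexts} contexts, "
          f"{chip.simultaneous} threads executing simultaneously")
```

On these numbers the T3 holds four times as many contexts per socket, yet the POWER7 has twice as many threads actually executing at any instant.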
IBM's POWER7 is the latest
product in a successful road map of general purpose processors
designed with the horsepower to pound through heavy-weight,
compute-intensive tasks at nearly 4GHz.
It is worth noting that
Oracle's UltraSPARC T3 is curiously missing from this month's Hot
Chips Conference agenda,
even though its general availability is set for later this year. At
least two IBM POWER-related sessions are scheduled at Hot Chips.
On August 17, 2010, IBM continues its roll-out of new POWER7-based systems,
software, and solutions. Register for webcasts: http://www-03.ibm.com/systems/power/advantages/
Watson's human-like artificial
intelligence beat both rival Jeopardy champions in a dry run, as
reported on trade sites on January 14, 2011. Canadian Broadcasting, in “Computer
beats Jeopardy! champs”,
reported, “Later, the human contestants made jokes about the
Terminator movies and robots from the future.”
Timothy Prickett Morgan
referred to IBM Watson's avatar as the evil Skynet. In that article,
“Watson beats humans in Jeopardy! dry run”,
Morgan noted that Watson is a Linux cluster of IBM POWER7-based p750
servers. For a pundit who suggested that the Watson Jeopardy event is
perhaps a veiled marketing ploy, he gratuitously added, “Watson QA
software is running on 10 racks of these machines, which have a total
of 2,880 Power7 cores and 15 TB of main memory spread across this
system. The Watson QA system is not linked to any external data
sources, but has a database of around 200 million pages of "natural
language content," which IBM says is roughly equivalent to the
data stored in 1 million books.”
It was stated in several reports that
Watson finds language ambiguity a challenge.
Perhaps this makes Watson more human-like than we think, as
interpreting ambiguity in speech is generally a learned ability.
Myth has it that the name of the Discovery One
spacecraft's HAL 9000 computer in 2001: A Space Odyssey
is a one-letter shift from the letters IBM, as in IBM 9000. I suspect
we should start worrying when the next generation of IBM lip-reading,
human-like technology argues with us as in the classic “open the
pod bay doors”... http://www.youtube.com/watch?v=kkyUMmNl4hk
As the late 1990s approached, a “new” CPU architecture was going to take the enterprise data center by storm – HP/Intel’s Itanium. Has the last chapter in this epic been written? Ironically, Itanium’s execution architecture is called Explicitly Parallel Instruction Computing, or EPIC. It was conceived in 1988 by HP to address one in a long line of “technology barriers” – the supposed Reduced Instruction Set Computing (RISC) barrier. As with most “barriers”, the semiconductor industry found successful ways of addressing it without resorting to disruptive quantum leaps.
By the late 1980s, it appeared that RISC architectures were running out of steam. RISC architectures had been designed years earlier to help compilers create more efficient object code. Eventually, RISC processors were designed as superscalars, having more than one instruction execution unit and thus able to execute more than one simple instruction in parallel. This in turn allowed compilers to extract and schedule the execution of multiple instructions in parallel. Yet compilers and superscalar execution architectures gave the impression that they were still not able to extract enough Instruction Level Parallelism (ILP) from the source code to sustain a predictable performance roadmap. If true, this meant hardware was being wasted, and thus a “perceived technology barrier” was created.
The Itanium’s architecture was adapted from a previously identified execution technique known as VLIW (Very Long Instruction Word) architecture. Basically, the Itanium has so many instruction execution units that, for example, it can actually execute code down both sides of a code branch and discard the results of the losing side. However, it turns out there is only so much inherent ILP in modern source coding techniques that compilers can extract. The Itanium is an example of trying to fix something that was never really broken in the first place.
Had the Itanium been available as promised (mid/late-1990s), it could have dominated the market, but:
• It was years late
• Again, its EPIC architecture was intended to address performance concerns that were moot by the time of its release, resulting in:
• Itanium delivering parallel instruction scheduling no better than competitor processors while requiring a 33% larger memory image for the same program space. Memory includes caches and system RAM. (This is one of the main reasons Itanium processors have massive on-board caches.)
• The first Itanium became GA just as the Dot Com bubble burst, and was not well adopted because:
• Cost of rip-and-replace hardware was prohibitive
• Cost of re-working applications was very high
• Poor performance relative to IBM’s POWER4, 5, 6
• HP had already announced the end of both the Alpha and PA-RISC based servers. Consequently, HP is stuck with Itanium as its only “high-end” processor, even though Intel’s own Xeon outperformed Itanium in most applications
• 1989: HP conceived Itanium to address RISC performance shortfalls
• 1989-2001: Itanium was so late that RISC shortfalls were already addressed by competitive offerings in the market
• HP and Intel spent perhaps billions on research and development for what emerged as inferior technology
• 2001: A 733 MHz Itanium 1 (Merced) released by HP and Intel
• Poorly accepted by the market because of its lack of performance
• 2002: 900 – 1700 MHz Itanium 2 (McKinley, Madison, Deerfield, Fanwood) released
• Continued poor acceptance, outperformed by IBM’s POWER4
• 2006: Intel & ISV Consortium claimed they will spend $10B more on Itanium through 2010 to build an ecosystem (which normal market forces failed to create)
• 2006: Dual Core Itanium 2 (Montecito, Millington) finally released
• 2009: Quad Core Itanium, Tukwila, delayed again.
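One plausible source for the 33% larger memory image cited in the list above, offered as an assumption since the derivation of that figure is not given here, is instruction encoding. The Itanium (IA-64) packs three 41-bit instructions plus a 5-bit template into each 128-bit bundle, while a classic RISC uses fixed 32-bit instructions. A short calculation, which ignores any NOP padding a compiler may add to fill bundles:

```python
from fractions import Fraction

# IA-64 instruction bundle layout.
BUNDLE_BITS = 128
SLOTS_PER_BUNDLE = 3
SLOT_BITS = 41
TEMPLATE_BITS = 5

# Fixed-width instruction of a classic 32-bit RISC encoding.
RISC_INSTRUCTION_BITS = 32


def code_size_ratio() -> Fraction:
    """Bits Itanium spends on three instructions relative to a 32-bit RISC."""
    # Sanity check: three slots plus the template fill the bundle exactly.
    assert SLOTS_PER_BUNDLE * SLOT_BITS + TEMPLATE_BITS == BUNDLE_BITS
    return Fraction(BUNDLE_BITS, SLOTS_PER_BUNDLE * RISC_INSTRUCTION_BITS)


ratio = code_size_ratio()
overhead_percent = float((ratio - 1) * 100)
print(f"Itanium/RISC code size ratio: {ratio} (about {overhead_percent:.0f}% larger)")
```

The ratio is 128/96, or 4/3, which lines up with the one-third overhead quoted above. Whether this is the intended basis of the 33% claim is a guess.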
As a long-time Sun customer, you decide enough is enough: with Trusted Solaris 8 having been EOL-ed in May of 2009, you want to move your MLS-centric Oracle RAC implementation to the next version of Trusted Solaris, which is Solaris 10 11/06 with Trusted Extensions.
You order Sun Solaris 10 11/06 w/Trusted Extensions, update your Oracle RAC license, and buy a Sun SPARC server (probably some Niagara-based product, because Sun can “prove” it’s the best there is). Two weeks later, your sales and technical support team from Sun is laid off, the result of stalled EU issues with the Oracle acquisition.
Subsequently, you note there seems to be a problem installing Oracle RAC in a Solaris Container (Zone) unless it’s installed in the global Zone, and even then your customer finds out that Oracle only supports non-RAC databases in Solaris Containers. Not only is RAC not supported, but even if one installs “regular” Oracle, running it in the global zone will violate Sun's MLS implementation, because Sun basically replaced its traditional Trusted Solaris 8 fine-grained MLS with a scheme in which each Classification (sensitivity) is set in its own Zone, that is, its own instantiation of Solaris.
Even extracting a tar-ed Trusted Solaris 8 MLS file system would fill a Solaris 10 w/TE screen with permission errors due to the radically different, coarse-grained method of classification used in Solaris 10 w/TE.
Frantically you call “an old friend” from IBM and explain your Sun problem. IBM responds that it has no problem, nor makes any excuses, about running Oracle RAC in an MLS AIX or MLS Linux environment.
Sun and Oracle References
 see: www.orafaq.com/wiki/Solaris
“Starting with Solaris10, Oracle can be installed within a Solaris Container (soft OS partition). However, only non-RAC databases are supported (RAC is only supported in a global zone)”
 see: http://hub.opensolaris.org/bin/view/Community+Group+zones/faq#H
“Q. Can I use Oracle RAC in a Container?
A: This is really three questions: (1) does it work (2) in what configurations does Sun support the Solaris components (3) in what configurations does Oracle support this? The short answers are:
Oracle RAC has been demonstrated consistently using the Solaris OS, the Solaris Zones Cluster feature of Solaris Cluster software and Oracle RAC
Sun supports "Solaris Zone Clusters" using Solaris Cluster
Only Oracle can determine the level of support available for Oracle RAC in Solaris Zones Clusters.”
[note: Result: Cannot run RAC under Solaris10 w/TE with MLS. Running everything in the same zone (especially the global zone) violates MLS classifications (sensitivities). A Solaris Cluster Zone running RAC is not an MLS Solaris Zone.]
As reported in The Register (http://www.theregister.co.uk/2009/12/18/redhat_rhel6_itanium_dead/), Red Hat pulls plug on Itanium with RHEL 6.
It appears there may be little life left in Intel/HP's Itanium experiment. Referring to an earlier blog, Itanium -- Too Little, Too Late, not only was Itanium an attempt to solve a problem that didn't really exist, but Intel's own x86 Nehalem outperforms the Itanium both in absolute terms and price/performance.
It's interesting to note that Timothy
says, "By the way, RHEL 6 will be supported on IBM's Power-based servers and its
mainframes, which have been supported with RHEL 4 and 5, as well as x64 servers." as he proceeds to suggest that IBM simply purchase Novell and go head to head with Red Hat.
See the entire line of IBM products, service and support for Linux: http://www-03.ibm.com/linux/prod_svc.html
Paul Venezia's InfoWorld exposé [see: http://www.infoworld.com/t/mergers-and-acquisitions/oracle-customers-sun-sun-who-751] details what is in store for Sun's current customer base after the Oracle acquisition. Oracle's CEO, Larry Ellison, claims Oracle+Sun will form the basis for Oracle to emulate the IBM of the 1960s. This is a clear endorsement of IBM today, yet it disregards the fact that the market and technology realities of fifty years ago are not those of today.
In the meantime, Oracle is engaging in tactics that will drive the last nail in the coffin for even SPARC/Solaris zealots. It was bad enough that Sun+Texas Instruments could not meet market performance windows for Sun's native UltraSPARC line of processors, leaving Sun SPARC customers having to choose between application-specific UltraSPARC-Tx (Niagara-class) products and SPARC64-based servers from Fujitsu. Now, Oracle is increasing the service and maintenance costs of aging Sun products -- forcing Sun customers to upgrade. Upgrade to what?
Since the introduction of IBM's POWER5 in 2004, Sun's IBM sales playbook was predicated on the following golden rules:
* IBM will force you to upgrade your systems every two years, while at Sun we design our systems for a five year refresh cycle. [Translation: Sun needs to create a plausible excuse in face of IBM's technology and product development cycle being 2x faster than that of Sun.]
* IBM makes servers that are benchmark machines, while Sun servers are balanced systems. [Translation: Need non sequitur FUD in light of IBM's superior performance.]
* When IBM brings GBS into your data center, System Admins, Systems Programmers, DBAs, etc., will lose their jobs. [Translation: Make it personal, do whatever you can to stay with Sun and keep away from IBM -- or be laid off!]
Oracle, the company that wants to emulate IBM, is now violating Sun's own stated golden rules. This is not surprising. For the past decade Sun has claimed that the TPC-C OLTP industry-standard database benchmark is archaic and does not represent any aspect of the modern data center, and that, besides, when it is run in a cluster you can simply add servers and storage until the desired numerical result is achieved. This was convenient for Sun since it had poor results on this benchmark. Then Oracle came along last fall, using a huge cluster of Sun's Niagara-based application-specific processors, and claimed a world record TPC-C benchmark result.
If I were a Sun customer, I would be seriously considering transitioning away from another decade of techno-deception.