I cannot think of a more exciting (that is to say, tumultuous) period in high-performance computing since the introduction in the mid-80s of massively parallel computers such as the Intel iPSC and nCUBE, followed by the IBM SP and Beowulf clusters. But now, instead of benefiting only high-performance applications parallelized with MPI or other distributed-memory models, the revolution is happening within a node and is benefiting everyone, whether running a laptop or a multi-petaFLOP/s supercomputer. In sharp contrast to the GPGPUs that fired our collective imagination in the last decade, the Intel® Xeon Phi™ product family (its first product already delivering one teraFLOP/s double-precision peak speed!) brings supercomputer performance right into everyone's office while employing standard programming tools fully compatible with the desktop environment, including a full suite of numerical software and the complete GNU/Linux stack. With both architectural and software unity from multi-core Intel® Architecture processors to many-core Intel Xeon Phi products, we have for the first time a holistic path for portable, high-performance computing that is based on a familiar and proven threaded, scalar-vector programming model. And Intel's vision for the Intel® Many Integrated Core (Intel® MIC) architecture takes us from the early petaFLOP era in 2012 into the exaFLOP era in 2020; indeed, the Intel Xeon Phi coprocessor based on the Intel MIC architecture is credibly a glimpse into that future.
So what's the catch? There is really no news here: sequential (more specifically, single-threaded and non-vectorized) computation is dead, even on the desktop. Long dead. Pipelined functional units, multiple instruction issue, SIMD extensions, and multi-core architectures killed it years ago. But if you have one of the 99 percent of applications that are not yet both multi-threaded and vectorized, then on a multicore Intel Xeon with AVX SIMD units you could be missing a factor of up to 100x in performance, and the highly threaded Intel MIC architecture implies a factor of up to 1000x. Yes, you are reading those numbers correctly. A scalar, single-threaded application, depending on what is limiting its performance, could be leaving several orders of magnitude of single-socket performance on the table. All modern processors, whether CPU or GPGPU, require very large amounts of parallelism to attain high performance.
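To make those factors concrete, here is a minimal sketch (mine, not an example from this book) of the same loop written scalar versus threaded and vectorized; it assumes an OpenMP 4.0 compiler (for example, icc -qopenmp or gcc -fopenmp), and the core counts and SIMD widths in the comments are illustrative arithmetic, not measurements.

#include <stddef.h>

/* Scalar, single-threaded baseline: one core, one SIMD lane. */
void saxpy_scalar(size_t n, float a, const float *x, float *y) {
    for (size_t i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];
}

/* Threaded and vectorized: OpenMP spreads iterations across cores and
 * the simd clause asks the compiler to fill the SIMD lanes. The headroom
 * is roughly cores x lanes: for example, 16 cores x 8 single-precision
 * AVX lanes is on the order of 100x, while ~60 Xeon Phi cores x 16 lanes
 * of its 512-bit vectors is on the order of 1000x. */
void saxpy_parallel(size_t n, float a,
                    const float *restrict x, float *restrict y) {
    #pragma omp parallel for simd
    for (size_t i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];
}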
Some good news about Intel Xeon Phi coprocessors is that, with your application running out of the box thanks to the standard programming environment (again, in contrast to GPGPUs, which require recoding even to run), you can use familiar tools to analyze performance and have a robust path of incremental transformations to optimize the code, and those optimizations will carry directly over to mainstream processors. But how to optimize the code? Which algorithms, data structures, numerical representations, loop constructs, languages, compilers, and so on, are a good match for Intel Xeon Phi products? And how to do all of this in a way that is not necessarily specific to the current Intel MIC architecture but instead positions you for future and even non-Intel architectures?
This book generously and accessibly puts the answers to all of these questions and more into your hands. A key point is that early in the "killer-micro" revolution that replaced custom vector processors with commodity CPUs, application developers ceased to write vectorized algorithms because the first generations of those CPUs could indeed attain a good fraction of peak performance on operation-rich sequential code. Fast-forward two decades and this is now far from true, but the passage of time has erased from our collective memory much of the wisdom and folklore of the vectorized algorithms so successful on Cray and other vector computers. However, the success of that era should give us great confidence that a multi-threaded, scalar-vector programming model supported by a rich vector instruction set is a great match for a very broad range of algorithms and disciplines. Needless to say, there are new challenges, such as the deeper and more complex memory hierarchy of modern processors, an order of magnitude more threads, the lack of true hardware gather/scatter, and compilers still catching up with (rediscovering?) what was possible 25 years ago.
In October 2010, I gave a talk entitled "DSLs, Vectors, and Amnesia" at an excellent workshop on language tools in Houston organized by John Mellor-Crummey. Amnesia referred to the loss of vectorization skills mentioned above. I used the example of rationalizing the bizarrely large speedups claimed in the early days of GPGPUs as a platform for identifying successes and failures in mainstream programming. Inconsistent optimization of applications on the two platforms is the trivial explanation, with the GPGPU realizing a much greater fraction of its peak speed. But why was this so, and what was to be learned from it about writing code with portable performance?
There were two factors underlying the performance discrepancy. First, data-parallel programming languages such as OpenCL and NVIDIA's CUDA forced programmers to write massively data-parallel code that, with a good compiler and some tuning of vector lengths and data layout, successfully matched the underlying hardware and attained high performance. Second, the comparison was usually made against a scalar, non-threaded x86 code that we now understand to be far from optimal. The universal solution was to back-port the GPGPU code to the multi-core CPU, retuning for the different numbers of cores and vector/cache sizes; indeed, with care and some luck the same code base could serve on both platforms (and certainly so if programming with OpenCL). All of the optimizations for locality, bandwidth, concurrency, and vectorization carry straight over, and the cleaner, simplified, dependency-free code is more readily analyzed by compilers. Thus, there is every reason to expect that nearly all algorithms that work well on current GPGPUs can, with a modest amount of restructuring, perform equally well on the Intel Xeon Phi coprocessor, and that algorithms requiring fine-grained concurrent control should be substantially easier to express on the coprocessor than on a GPGPU.
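A small illustration of why such code moves so freely between platforms (again my sketch, not an example from this book): the body of an OpenCL or CUDA work-item is essentially the body of a clean, dependency-free loop, so the same tuning applies on both sides.

#include <stddef.h>

/* Dependency-free, data-parallel kernel: restrict promises the compiler
 * the arrays do not alias, so it is free to vectorize, and OpenMP
 * distributes the iterations across cores on a CPU or coprocessor. */
void scale(size_t n, float alpha,
           const float *restrict in, float *restrict out) {
    #pragma omp parallel for simd
    for (size_t i = 0; i < n; i++)
        out[i] = alpha * in[i];
}

/* The equivalent OpenCL kernel maps the same loop body onto work-items:
 *
 *   __kernel void scale(const float alpha,
 *                       __global const float *in,
 *                       __global float *out) {
 *       size_t i = get_global_id(0);
 *       out[i] = alpha * in[i];
 *   }
 *
 * Optimizations for locality, bandwidth, and vector length carry across
 * both forms, which is the portable performance described above. */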
In reading this book you will come to know its authors. Through the Intel MIC architecture early customer enabling program, I have worked with Jim Jeffers, who, in addition to being articulate, clear-thinking, and genial, is an expert well worth your time and the price of this book. Moreover, he and the other leaders of the Intel Xeon Phi product development program truly feel that they are doing something significant and transformative that will shape the future of computing and is worth the large commitment of their professional and personal lives.
This book belongs on the bookshelf of every HPC professional. Not only does it successfully and accessibly teach us how to use and obtain high performance on the Intel MIC architecture, it is about much more than that. It takes us back to the universal fundamentals of high-performance computing, including how to think and reason about the performance of algorithms mapped to modern architectures, and it puts into your hands powerful tools that will be useful for years to come.