It is difficult to design successful workstation
architectures since these systems are general-purpose and used for a
large number of very diverse tasks. Performance of a workstation can be
decomposed into that of its components, such as the network,
graphics hardware, I/O, processor and memory subsystems. Successful
design of these components requires careful consideration of the
workload of workstation users. This workload is large and
diverse.
Current methods for designing these components are iterative processes
that are not well-suited to large, diverse workloads. This thesis
addresses this problem by developing a systematic
method to synthesize
prototype architectures of workstation components from large
workloads. The thesis focuses on the processor and the memory
components, although the overall approach can be applied to the
evaluation of other workstation components as well.
[...]
It is clear that there are considerable problems with the
current architectural design process. What is needed is a
systematic approach that can take workload elements such as
industry-standard benchmarks or end-user supplied applications and
derive architecture prototypes from them. The
advantages of such a computer architecture prototyping technique would
be:
But in the interim,
it became clear to me that viewing the workload as immutable was
flawed. The compiler had an equal role to play. My
students and I started to define the compiler's role and quickly
discovered the VLIW view of the world. Compiler-driven
microarchitecture seemed to us to be much more open to design
tradeoffs. I was a researcher, so "dusty deck" code compatibility
issues did not interest me. Superscalar approaches seemed stalled
(this was the early '90s) at dual-issue or perhaps four-issue machines.
We began architecting a framework to reason about compiler-driven
microarchitecture. In 1994, I was getting married. My best
man was also my former office mate and co-advisee from Illinois (Sadun
Anik). He brought with him to the wedding a technical report from
his research group at HP Labs on a framework for investigating
compiler-driven microarchitecture called "PlayDoh". We scrapped
our fledgling work on our own framework and adopted the PlayDoh semantics.
PlayDoh was interesting but it was far from implementable. My
students and I decided that there was interesting work to do on how to
make a practical VLIW that would be a commercial success. We
devised an encoding (Sergei Larin did this for his MS thesis). We
began looking at the problem of cross-generation code compatibility in
VLIW, which was then an oft-cited Achilles' heel of the technology:
binaries compiled for one generation of a VLIW ISA were not guaranteed
to run correctly on a later generation of that same ISA, even though
the encoding was unchanged. My Ph.D. student Sumedh Sathaye
and I devised a way to combine the encoding of the ISA, the compiler,
the page fault handler in the OS, and (ultimately with Kishore Menezes
and Sanjeev Banerjia's input) the instruction cache design to provide
cross-generation compatibility:
It was affirming to see that companies such as Transmeta
essentially adopted this same approach. Sumedh Sathaye realized the
potential of the technology was greater than just VLIW compatibility
and so he set out to design what we called an "evolutionary compiler,"
and what most people today refer to as a dynamic optimizer. After
he received his Ph.D., Sumedh went on to work on dynamic optimization
as part of a very influential research group at the IBM T. J. Watson
Research Center that designed the now-classic dynamic optimization
systems DAISY and TULIP.
Kishore Menezes, as a Ph.D. student, and Burzin Patel, as an M.S.
student, worked on a related problem: profile-driven optimization,
so very important for VLIW compilation, was impractical
in a commercial setting. He published several papers on what we
referred to then as "hardware-based profiling." Kishore had an
influence on what ultimately went into the Itanium-I performance
counters, which are now quite appropriate for profiling code with zero
slowdown.
Since that time in the mid to late '90s, my students and I have stayed
on the path of overcoming the supposedly insurmountable barriers to
compiler-driven technology by employing every aspect of a computer
system, from the hardware up through the OS.
At the same time, workstations were no longer the middle
market. It is relatively easy to argue that PCs were no longer
the middle market, either. That unique feature set of a heavily
constrained engineering problem migrated down to the smallest
devices, the so-called embedded
systems. These systems had (indeed, still have) crushingly tight
constraints on power, cost, form factor, performance, and more. To that end,
we researched architectural problems in embedded system design. But the flavor of
the research was the same as it had been when we began a decade earlier.
A funny thing happened on the way to the Forum: Moore's Law hit a hiccup. Note that there are really two Moore's laws. The first is what Gordon Moore was trying to express, namely that the number of transistors that can feasibly be fabricated on a monolithic device tends to double approximately every 18 months (this period has itself varied over time between 12 and 24 months). The "other" Moore's law is that the performance of single-threaded applications (i.e., of uniprocessors) doubles at the same rate, every 18 months. The latter is more of a self-fulfilling prophecy. If you think your competitors will double performance over the next 18 months, you work as hard as you can to keep up!
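To make the doubling arithmetic concrete, here is a minimal sketch of how much a quantity grows if it doubles at a fixed period. The function name `growth_factor` and the ten-year horizon are illustrative assumptions, not anything taken from the work described here.

```python
def growth_factor(years: float, doubling_months: float = 18.0) -> float:
    """Multiplicative growth after `years` for a quantity that doubles
    every `doubling_months` months (hypothetical helper for illustration)."""
    doublings = years * 12.0 / doubling_months
    return 2.0 ** doublings


if __name__ == "__main__":
    # Compare the three doubling periods mentioned in the text.
    for months in (12, 18, 24):
        factor = growth_factor(10, months)
        print(f"doubling every {months} months: about {factor:.0f}x in 10 years")
```

The spread between the 12-month and 24-month cases over a decade shows why the exact doubling period matters so much to anyone planning against a competitor's roadmap.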
So why did this second law end? It ended because we hit a power density limit: we cannot build ever-larger microarchitectures without exceeding the power envelopes of modern forced-air-cooled packages. This has placed a new and, in my view, wonderful constraint on architecture design! The answer has been a side step. Instead of relying on single-thread performance scaling, the industry has taken a gamble on multi-threaded performance and built monolithic multiprocessors on a chip. Unfortunately, this is only a partial answer. Parallel programs are harder to write, and harder to deal with architecturally, than orthodox single-threaded programs.
Two flavors of multiprocessors on a chip have emerged: the multi-cores and the manycores. It is something like the battle royal between superscalar and VLIW in the previous generation of computer architecture. Manycore is clearly the underdog, and so that is where we are focusing our energies. My students and I are working on all aspects of making thousands of processors on a chip (so-called "kilocore-scale manycores") programmable, coherent and genuinely useful.
What's interesting to me is that my history with supercomputing in the 1980s is now paying dividends. "Everything old is new again," I suppose. A lesson for any student here is never to dismiss any area of computer architecture as "boring" or "old hat." You'll be surprised by what turns out to be "hot" and interesting over the long arc of your career!
It is also important to note that the original goal of systematic
computer architecture prototyping using fast simulation techniques
hasn't been lost. It is all the more important in manycore design,
where the new challenges are the modeling and simulation of vast numbers of parallel threads.