Title: Future-pointing vectors

Opinion, published in the March 1996 issue of "@CSC@" (a journal published at that time by the Centre for Scientific Computing, Finland)

by Pekka Janhunen
FMI/GEO, P.O. Box 503, FIN-00101 Helsinki
Pekka.Janhunen@fmi.fi

---

In this text I try to shed some light on the ongoing debate and competition between vector supercomputers and the so-called massively parallel (MPP) machines. A typical situation is that the user community wants vector machines while computing centers would be eager to move to MPPs.

The vector machine was originally designed for scientific computation, whereas the microprocessor (usually RISC but also CISC; the difference between the two has all but disappeared) was designed to run popular applications such as word processing quickly. This implies that the market for the microprocessor is orders of magnitude larger. But it also suggests that microprocessors are not necessarily very good at scientific computation, because that was never a design goal.

Nowadays the most important distinction between vector and scalar processors is the way memory references are handled. Scalar machines are based on a memory hierarchy with one or more levels of cache memory. This is efficient for all small problems and for some types of large problems. Vector machines have interleaved memory banks. They are fast for moderate and large problems whenever bank conflicts can be avoided; usually this means that powers of two should be avoided as the stride when traversing an array. It just happens to be the case that most scientific problems can avoid significant bank conflicts, but many problems are essentially impossible to program in such a way that a memory hierarchy would be efficient. The notable exception is, of course, dense linear algebra, which can be programmed efficiently on any relevant machine.

The performance loss due to overflowing the cache can be quite dramatic on RISC machines. When this happens, computing speed usually drops below ten megaflops. Performance when processing large arrays need not correlate with the theoretical peak performance; see http://perelandra.cms.udel.edu:80/~mccalpin/hpc/stream/index.html for performance data. A small timing sketch illustrating these memory-system effects is given below.

One of the ingenious things that happens inside a vector machine is the nearly independent processing of the scalar and vector instruction streams. In practice this means that the bookkeeping scalar instructions in many cases execute with zero overhead. Indeed, the scalar execution unit need not be very fast and still the vector pipelines can work at full steam. This situation is to be contrasted with RISC processors, where the overlapping of instructions is very limited and every extra instruction increases the execution time a little.

A vector processor is, essentially, an easy-to-program parallel SIMD computer. Memory references and computations are overlapped to bring about a roughly tenfold speed increase. This performance boost is of the same order of magnitude as what can be achieved in typical MPP applications: codes that are more than 90 percent parallelizable are not very common. It is also possible to increase vector processor performance further by adding more execution units or by increasing the vector length (the pipelining depth). For instance, some versions of NEC vector processors have as many execution units as four C90 processors. These machines are efficient for regular problems where long vectors can be used.
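As an aside, here is a minimal C sketch (not part of the original article) of how the memory-system effects described above can be observed: it sums the same large array with different strides, so the arithmetic work is identical and only the access pattern changes. The array size and the list of strides are arbitrary choices for illustration. On a cached scalar machine, large strides defeat the cache; on a banked vector machine, it is the power-of-two strides in particular that tend to cause bank conflicts.

    /* Sketch: time strided traversals of one large array.
       N and the strides below are arbitrary illustrative choices. */
    #include <stdio.h>
    #include <time.h>

    #define N (1 << 22)   /* 4M doubles, about 32 MB: larger than most caches */

    static double a[N];

    int main(void)
    {
        const int strides[] = { 1, 2, 7, 8, 64, 512 };
        const int nstrides = sizeof strides / sizeof strides[0];

        for (int i = 0; i < N; i++)
            a[i] = 1.0;

        for (int s = 0; s < nstrides; s++) {
            int stride = strides[s];
            double sum = 0.0;
            clock_t t0 = clock();

            /* Touch all N elements exactly once regardless of stride,
               so only the memory access pattern differs between runs. */
            for (int pass = 0; pass < stride; pass++)
                for (int i = pass; i < N; i += stride)
                    sum += a[i];

            double secs = (double)(clock() - t0) / CLOCKS_PER_SEC;
            printf("stride %4d: %.3f s (sum = %.1f)\n", stride, secs, sum);
        }
        return 0;
    }

Printing the sum keeps the compiler from optimizing the loops away; the interesting output is how much the timings spread between unit stride and the larger, power-of-two strides.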
Correspondingly, one can add more processors to MPPs, but only a few applications can benefit from this.

The purpose of these considerations was to point out that there is no natural law that limits the power of a single processing unit, even though the speed of light and other factors may limit the clock frequency. If maximum performance is wanted, parallelism is the only way up, but the parallelism must be exploited at every level. This includes the pipelining of memory references and calculations, as well as parallel execution units within one processor. Exploiting only the last and most trivial stage of parallelism by adding more processors can obviously not yield the best result, because Amdahl's law is at work at every stage (a small numerical sketch of this is given at the end of the text).

Based on this view we can conclude that vector processors have a bright future. Some day they will be mass products.
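As a closing numerical note on the Amdahl's law argument (not part of the original article), the standard formula speedup = 1 / ((1 - p) + p/N) can be evaluated for a parallel fraction p = 0.9, the figure quoted earlier; the choice of processor counts below is arbitrary.

    /* Sketch: Amdahl's law for a 90 percent parallelizable code. */
    #include <stdio.h>

    int main(void)
    {
        const double p = 0.9;   /* parallelizable fraction, as quoted in the text */
        const int nprocs[] = { 1, 4, 16, 64, 256, 1024 };
        const int ncases = sizeof nprocs / sizeof nprocs[0];

        for (int i = 0; i < ncases; i++) {
            int n = nprocs[i];
            double speedup = 1.0 / ((1.0 - p) + p / n);
            printf("%5d processors: speedup %6.2f\n", n, speedup);
        }
        /* The speedup approaches but never exceeds 1/(1-p) = 10,
           no matter how many processors are added. */
        return 0;
    }

The printed table climbs quickly at first and then flattens near 10, which is the point made above: adding processors at only one level of parallelism runs into the serial fraction.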