New f90 compiler for desktop superscalar/superpipelined CPUs
- Performance of basic BLAS1 codes - Compares GNU Fortran, several hand-coded assembler libraries and output from my f90 compiler (provisionally, rkhf90) with no assists and only the basic x87 FPU. At best, rkhf90 output runs about 2x faster than the hand-coded efforts. Some codes run a little slower than with GNU Fortran. (NOTE: The AMD f90 compiler uses the SSE1 and SSE2 instruction sets.)
- Performance of selected BLAS1 codes using 3DNow! - F90 can generate any mixture of 3DNow!/SSE1/SSE2 and x87 FPU instructions to maximise performance. Although 3DNow! has no overflow handling and uses non-standard roundoff, it generally performs better than SSE1 on platforms that offer both.
- Performance of selected BLAS1 codes using SSE1 - SSE1 can be overlapped with execution in the x87 FPU.
- Performance of selected BLAS1 codes using SSE2 - SSE2 can be overlapped with execution in the x87 FPU.
- Sample codes output from the rkhf90 compiler - The output more or less follows the style of GNU C. The compiler automatically unrolls loops to optimise pipelining, but it can't re-roll loop code. The BLAS1 examples compiled by rkhf90 are therefore much simpler than the Jack Dongarra code used for the GNU Fortran compiler.
- Sample 3DNow! codes output from f90 - F90 handles any combination of instruction set offerings, and can even handle partial implementations (e.g. PADDD, which isn't wired but doesn't raise an instruction fault, when using SSE regs on XP/MP's).
- SSE1 codes output from f90.
- SSE2 codes output from f90.
- Performance comparison of the different instruction sets
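
The point about loop unrolling in the sample-codes entry can be illustrated with a small sketch in C. It is neither rkhf90 output nor the actual reference BLAS source; the names daxpy_simple and daxpy_unrolled are hypothetical, and the 4-way unroll with a leading cleanup loop is only assumed to resemble the hand-unrolled style of the reference routines. A compiler that unrolls loops itself needs only the first form as input; the second is the kind of source it cannot re-roll.

```c
#include <stddef.h>

/* Plain BLAS1-style AXPY: y := a*x + y over n contiguous doubles.
   This is the simple form a self-unrolling compiler can be given. */
void daxpy_simple(size_t n, double a, const double *x, double *y)
{
    for (size_t i = 0; i < n; i++)
        y[i] += a * x[i];
}

/* The same operation unrolled by hand, four elements per pass
   (illustrative only, not the actual library code). */
void daxpy_unrolled(size_t n, double a, const double *x, double *y)
{
    size_t m = n % 4;   /* leftover elements, handled first */
    size_t i;

    /* Cleanup loop: the n mod 4 elements that don't fill a pass. */
    for (i = 0; i < m; i++)
        y[i] += a * x[i];

    /* Main loop: the remaining count is a multiple of four. */
    for (i = m; i < n; i += 4) {
        y[i]     += a * x[i];
        y[i + 1] += a * x[i + 1];
        y[i + 2] += a * x[i + 2];
        y[i + 3] += a * x[i + 3];
    }
}
```

Both functions compute the same result for any n; the hand-unrolled version only fixes the unroll factor in the source instead of leaving that choice to the compiler.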
Kym Horsell / Kym@KymHorsell.COM
ADVISORY: Email to these sites is filtered.