|
DK8N 32-bit NUMA & non-NUMA benchmarking
Memory: bandwidth & latency
At the beginning of this review we put the question: "Is
the DK8N better at the things Opterons already do well than existing
AMD-chipset based dual-channel motherboards?"
Since we have dismally failed (as yet) to test the DK8N's NFORCE3 AGP tunnel, the major
performance area in which the DK8N can further optimise the strengths of
the Opteron platform is memory: bandwidth (the higher the better) &
latency (the lower the better).
Rather
than bore & confuse our readers with a random scattering of the usual
semi-meaningful benchmarks, we thought we'd start with the least
meaningful of all, then really focus in a bit . . . .
. . .
let's
look again at our DK8N's SiSoftSandra figures for bandwidth
& efficiency - it is remarkable for a multiprocessor platform to be said to
be over 85% efficient at exploiting its memory.
We have included for interest the
same test with 'chip kill' & all 'BG scrub' ECC options enabled,
alongside NUMA-enabled figures from XP Pro 32-bit SP2 & Windows Server 2003, Enterprise Edition
32-bit:
| 75% efficiency | 81% efficiency | 86% efficiency | 88% efficiency |
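For readers wondering what an 'efficiency' percentage of this kind means: a minimal sketch, assuming it is simply measured bandwidth divided by the theoretical peak of the memory bus. The memory configuration (DDR400 on a 128-bit dual-channel bus per CPU) and the measured figure are assumptions made for illustration; neither is taken from the screenshots above.

```python
def peak_bandwidth_mb_s(transfers_mt_s, bus_width_bits, nodes=1):
    """Theoretical peak in MB/s: transfers per second x bytes per transfer x nodes."""
    return transfers_mt_s * (bus_width_bits // 8) * nodes


def efficiency(measured_mb_s, peak_mb_s):
    """Measured bandwidth as a percentage of the theoretical peak."""
    return measured_mb_s / peak_mb_s * 100.0


# ASSUMPTION: DDR400 (400 MT/s) on a 128-bit dual-channel bus per CPU.
peak_one_node = peak_bandwidth_mb_s(400, 128)            # one memory controller
peak_two_nodes = peak_bandwidth_mb_s(400, 128, nodes=2)  # both controllers (NUMA)

# HYPOTHETICAL measured figure, for illustration only.
measured = 5500.0
print(f"{efficiency(measured, peak_one_node):.1f}% of {peak_one_node} MB/s")
```

Note that the peak itself depends on whether one or both memory controllers are counted, so a percentage is only comparable between runs that use the same baseline.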
This
widely-used benchmark is a tweaked version of STREAM, & is designed
& optimised to show what is possible under the most favourable
circumstances. The key point to have in mind when looking at SSSandra
'scores' is that they reflect the most favourable memory operation,
repeated ad nauseam to explore the limits of the silicon implementation.
Applications
rarely if ever use memory in this way - each has its own profile of
frequencies of usage of a whole range of operations; we thought it might be worth
looking a little further into what STREAM (the version in ScienceMark
v2b) might show:
. . . not quite as dramatic; tho' it seems fair to say that NUMA
configurations show overall a measured boost over XP Pro 32-bit of around 70%
in streaming operations, & that XP Pro SP2 appears to have a constant
slight edge over Server 2003.
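For reference, STREAM reports four simple kernels (Copy, Scale, Add, Triad) & counts the bytes each one moves. The sketch below shows those definitions & the byte accounting only. It is written in Python for readability, so its timings reflect interpreter overhead rather than memory hardware; the function name, array length & scalar are our own illustrative choices, & a real run uses compiled code with arrays far larger than the caches.

```python
import time

WORD = 8  # bytes per element, as for a C double


def run_stream(n=200_000, q=3.0):
    """Run the four STREAM kernels once; return MB/s per kernel and the arrays."""
    a, b, c = [1.0] * n, [2.0] * n, [0.0] * n
    rates = {}

    def timed(name, arrays_touched, kernel):
        start = time.perf_counter()
        kernel()
        elapsed = max(time.perf_counter() - start, 1e-9)
        # STREAM counts every array read or written once per element.
        rates[name] = arrays_touched * WORD * n / elapsed / 1e6

    def copy():   # c = a        (1 read, 1 write)
        c[:] = a

    def scale():  # b = q * c    (1 read, 1 write)
        b[:] = [q * x for x in c]

    def add():    # c = a + b    (2 reads, 1 write)
        c[:] = [x + y for x, y in zip(a, b)]

    def triad():  # a = b + q*c  (2 reads, 1 write)
        a[:] = [y + q * z for y, z in zip(b, c)]

    timed("Copy", 2, copy)
    timed("Scale", 2, scale)
    timed("Add", 3, add)
    timed("Triad", 3, triad)
    return rates, (a, b, c)


rates, arrays = run_stream()
for name, mb_s in rates.items():
    print(f"{name}: {mb_s:.0f} MB/s (interpreter-bound, not a hardware figure)")
```

Because Add & Triad touch three arrays rather than two, their figures are not directly comparable with Copy & Scale unless the byte counts are taken into account, as above.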
Looking a little closer still, using ScienceMark's Membench,
we can see the development put into the XP SP2 kernel:
| Algorithm Bandwidth outside L1/L2 cache region | 2003:XPSP1 | |
| +50.1% |
| +38.1% |
| +51% |
| +44.9% |
| -5.9% |
| -5.7% |
| -6.2% |
| -12% |
|
. . . & lastly, we were astonished to see this:
| Main memory latency in ns against stride in bytes |
. . . here we have a multiprocessor platform with (in NUMA guise) main memory latency
near that of a uniprocessor AMD64 system, & truly astonishing gross stream bandwidth. The
DK8N, whether in conventional or NUMA configuration, is clearly far &
away the best solution yet at emphasising the strengths of the Opteron
in a dual-CPU configuration. The other big surprise is its
performance with ECC 'chip kill' enabled . . . so far there does not
really seem to have been a price; but:
| L2 cache latency in ns against stride in bytes |
. . . there ain't no such thing as a free lunch . . .
DK8N
32-bit NUMA & non-NUMA benchmarks
| PCMark 2004/120 |
| Win XPSP1 | XPSP2 NUMA |
| (figures: larger = better; times: smaller = better) | Win XPSP1 | Win XPSP2 NUMA | speedup NUMA : SP1 |
| File Compression (MB/s) | 6.21 | 6.44 | +3.7% |
| File Encryption (MB/s) | 73.28 | 74.12 | +1.1% |
| File Decompression (MB/s) | 56.42 | 57.42 | +1.8% |
| Image Processing (Mpixels/s) | 28.92 | 29.18 | same |
| Virus Scanning (MB/s) | 5636.51 | 5529.08 | -1.9% |
| Grammar Check (KB/s) | 5.92 | 6.08 | +2.7% |
| File Decryption (MB/s) | 71.31 | 73.66 | +3.3% |
| Audio Conversion (KB/s) | 3075.49 | 3174.18 | +3.2% |
| Web Page Rendering (Pages/s) | 5.98 | 6.04 | +1.0% |
| WMV Video Compression (fps) | 86.67 | 88.88 | +2.5% |
| DivX Video Compression (fps) | 84.32 | 89.38 | +6.0% |
| Physics Calculation & 3D (fps) | 76.36 | 76.38 | same |
| Graphics Memory - 64 lines (fps) | 469.23 | 470.95 | same |
| PCMark (System Suite) | 5461 | 5564 | +1.9% |
| Memory | 4850 | 5011 | +3.3% |
| CLIBench Mk III SMP 0.7.16 |
| Dhrystone 2.1 (kDhryst) | 12603 | 12990 | +3.1% |
| Whetstone (MFLOPS) | 2545 | 2622 | +3.0% |
| Eight queens problem (pps) | 16353 | 16845 | +3.0% |
| Number crunch (k ops) | 370370 | 378726 | +2.3% |
| Floating point (k ops) | 39152 | 40554 | +3.6% |
| Memory throughput (kB/s) | 1387278 | 1476937 | +6.5% |
| Cinebench2003 |
| OpenGL h/w lighting; scene 1 (secs) | 11.3 | 11.23 | same |
| OpenGL h/w lighting; scene 2 (secs) | 5.99 | 5.99 | same |
| OpenGL s/w lighting; scene 1 (secs) | 11.12 | 11.10 | same |
| OpenGL s/w lighting; scene 2 (secs) | 4.79 | 4.4 | +8.1% |
| Cinema 4D shading; scene 1 (secs) | 30.6 | 28.34 | +7.4% |
| Cinema 4D shading; scene 2 (secs) | 11.03 | 10.68 | +3.2% |
| Single CPU render (secs) | 80.3 | 77.0 | +4.1% |
| Multiple CPU render (secs) | 42.5 | 41.6 | +2.1% |
| Povray 3.61 |
| Povray 3.5/1.02 benchmark (min:secs) | 26.53 | 26.51 | same |
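A note on how the 'speedup NUMA : SP1' column can be reproduced. The formulas are not stated with the tables, so what follows is an inference: the published percentages are consistent with the relative gain for 'larger = better' figures, & with the share of time saved for 'smaller = better' times. The function names are ours; the input figures are taken from the tables above.

```python
def speedup_rate(sp1, sp2_numa):
    """Percent gain for 'larger = better' figures (MB/s, fps, scores)."""
    return (sp2_numa - sp1) / sp1 * 100.0


def speedup_time(sp1, sp2_numa):
    """Percent gain for 'smaller = better' times: share of the time saved."""
    return (sp1 - sp2_numa) / sp1 * 100.0


rows = [
    ("File Compression (MB/s)", 6.21, 6.44, speedup_rate),
    ("DivX Video Compression (fps)", 84.32, 89.38, speedup_rate),
    ("Virus Scanning (MB/s)", 5636.51, 5529.08, speedup_rate),
    ("OpenGL s/w lighting; scene 2 (secs)", 4.79, 4.4, speedup_time),
    ("Single CPU render (secs)", 80.3, 77.0, speedup_time),
]

for name, sp1, sp2_numa, formula in rows:
    print(f"{name}: {formula(sp1, sp2_numa):+.1f}%")
```

Differences of under 1% (Image Processing works out at about +0.9%, for instance) appear in the tables as 'same'.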
DK8N & real world performance
The
next part of this review will be a detailed comparison, using our regular
workaday video & graphics applications, between this dual-Opteron Iwill DK8N
system at 2.4GHz & AMD's previous dual-CPU platform, the AthlonMP,
also at 2.4GHz. We will also continue to focus on NUMA v non-NUMA
32-bit performance.
We're using our regular applications because finding respectable, &
multithreaded, benchmarking utilities is tough: the only generalised suite
we respect is COSBI
- but this, alas, does not yet adequately support or reflect the
performance of SMP systems.
To give an idea of the very considerable real world performance of this DK8N, here are a couple
of screenshots from an XP Pro SP1 installation with two versions of
the Cinemacraft CCE-SP MPEG2 software encoder re-encoding PAL MPEG2
streams at an average of 4000 kbps (3-pass VBR, frameserving with AVISynth
2.54/mpeg2dec3dg; output settings identical):
| CCE-SP 2.5x |
| CCE-SP 2.67x |
Those who know this top-quality encoder will immediately notice two things: that it
is - er - quite fast (re-encoding a 95-minute MPEG2 stream to
DVD-quality in 25 minutes or less), & that, for the first time in our experience, CCE-SP 2.67x, which
is specifically optimised for the PIV-family & SSE2, runs faster on an
AMD platform than version 2.5x.
|
Memory tests & benchmarks: conclusions
As of this date, the DK8N's memory is as fast as it gets in 32-bit Windows. As
regards 32-bit NUMA operations, we were impressed by the measured
performance of XP Pro SP2 - this kernel has clearly enjoyed a
general optimisation relative to Server 2003. We feel the above
benchmarking utilities show a slight but persuasive edge for XP SP2,
NUMA-enabled, over XP SP1 - but the difference is slight enough that it may be
mostly due to kernel optimisations rather than NUMA. |
|