Playstation 3 Performance (May 2007)

Lately I’ve been working on porting Geekbench 2 to the Playstation 3. While I’m hoping to release a version of Geekbench that takes advantage of the Cell processor, I thought I’d share some preliminary results that show how the Playstation 3 (and the Cell processor) performs when running code that’s not optimized for the Cell processor.

Setup

Here’s the configuration of the Playstation 3.

  • Playstation 3
    • Cell Processor @ 3.2 GHz
    • 256 MB XDR RAM
    • Fedora Core 6
    • PS3 Linux Addon v1.3 (25 April 2007)
    • Geekbench 2.0.3 (32-bit and 64-bit pre-release)

I’m reporting the baseline score, rather than the raw score, for each benchmark (where a score of 1000 is the score a Power Mac G5 1.6GHz would receive). Higher is better.
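
To read the scores that follow, it can help to express each one as a fraction of the baseline. The sketch below assumes nothing about how Geekbench computes its scores internally; `relative_to_baseline` and `print_relative` are hypothetical helpers that just divide a reported score by the baseline value of 1000.

```c
#include <stdio.h>

/* Baseline score from the text: what a Power Mac G5 1.6GHz would receive. */
#define BASELINE_SCORE 1000.0

/* Hypothetical helper: returns a score as a fraction of the baseline.
   This is plain arithmetic, not Geekbench's internal scoring formula. */
double relative_to_baseline(double score) {
    return score / BASELINE_SCORE;
}

/* Prints one line showing a score and its percentage of the baseline. */
void print_relative(const char *label, double score) {
    printf("%s: %.0f (%.1f%% of baseline)\n",
           label, score, 100.0 * relative_to_baseline(score));
}
```

A score above 1000 therefore means the machine beat the G5 baseline on that section, and a score below 1000 means it fell short.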

Results

Overall Performance

  • Playstation 3 (32-bit): 956
  • Playstation 3 (64-bit): 912

Integer Performance

  • Playstation 3 (32-bit): 920
  • Playstation 3 (64-bit): 786

Floating Point Performance

  • Playstation 3 (32-bit): 702
  • Playstation 3 (64-bit): 696

Memory Performance

  • Playstation 3 (32-bit): 1568
  • Playstation 3 (64-bit): 1678

Stream Performance

  • Playstation 3 (32-bit): 749
  • Playstation 3 (64-bit): 583

Update: You can view the complete 32-bit and 64-bit Geekbench results on the Geekbench Result Browser.

Conclusions

It’s clear that the Cell processor isn’t all that impressive as a general-purpose CPU; when it’s not executing code designed for the Cell, it scores below a PowerPC G5 @ 1.6GHz (the baseline processor for Geekbench) in every category except memory.

What remains to be seen is how the Playstation 3 performs when running code designed for the Cell processor; over the next few months I’m hoping to add Cell-specific optimizations to Geekbench that will exploit all the potential the Cell processor has to offer.

  • Ranulf Doswell

    To be honest, if you're just taking standard code and re-compiling it for the PPU with any compiler you happen to find, you'll get the same results.


    The code needs to be written specifically for the SPU to get good performance. The Cell's power comes from having 8 (only 6 available on the PS3) additional processors that are very optimised for certain tasks.


    If you design your code to take advantage of them, you will see good benefit. However, it's harder than just compiling some old source from another machine - you'll want to take advantage of the vectorised instructions, doing 4 or 8 calculations at once, the huge bank of 128 registers allowing you to unroll loops better to avoid stalls waiting for a previous instruction to complete, reordering instructions to take into account the even-odd instruction units, etc.
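
The unrolling point in this comment can be illustrated in portable C. This is only a sketch of the general idea, not SPU code: the function names are hypothetical, and a real SPU version would also use vector types and intrinsics. The unrolled loop keeps four independent accumulators, so no addition has to wait for the one issued just before it.

```c
#include <stddef.h>

/* Straightforward sum: every addition depends on the previous one,
   so an in-order core must wait for each add before starting the next. */
float sum_simple(const float *values, size_t count) {
    float total = 0.0f;
    for (size_t i = 0; i < count; ++i) {
        total += values[i];
    }
    return total;
}

/* Unrolled by four with independent accumulators: the four additions in
   each iteration do not depend on one another, so they can overlap. */
float sum_unrolled4(const float *values, size_t count) {
    float acc0 = 0.0f, acc1 = 0.0f, acc2 = 0.0f, acc3 = 0.0f;
    size_t i = 0;
    for (; i + 4 <= count; i += 4) {
        acc0 += values[i];
        acc1 += values[i + 1];
        acc2 += values[i + 2];
        acc3 += values[i + 3];
    }
    float total = (acc0 + acc1) + (acc2 + acc3);
    /* Leftover elements when count is not a multiple of four. */
    for (; i < count; ++i) {
        total += values[i];
    }
    return total;
}
```

Because floating-point addition is not associative, the two functions can differ in the last bits for general inputs; they agree exactly when every partial sum is exactly representable. More registers allow more accumulators, which is where a large register file helps.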


    You can generate good code with gcc, but don't just think that because you've run it through the compiler with -O3 that it's the very best a processor can do. I recently optimised some of my SPU code to gain about 20% performance increase by reordering C instructions after looking at the generated code and seeing that unnecessary stalls had been introduced.


    As a rough idea of the power available, a Julia set renderer I wrote for the PS3 achieves over 700 FPS at 640x480 and about 75 FPS at 1920x1080 on moderately difficult sets. This is obviously higher than the display rate and so I didn't bother to optimise it further, but there were some major out-of-order stalls in that code because it was the first code I wrote for the SPU.


    More recently, I'm writing a graphics library for Python using the SPU to handle work that I would normally use a GPU for. Simple alpha blending whilst blitting an image (i.e. dest+=alpha*(source-dest)/255 done for each of RGB for each pixel) took about 42ms in each loop of my demo game when using the PPU. My first attempt at writing that as SPU code produced code that ran in about 16ms when using 4 SPU processors. When optimised to take account of stalls, I got it down to about 6ms when using 4 SPUs. There are still serious inefficiencies in my code, but it is "good enough" for now; it's seven times faster than just using the PPU, I still have 2 SPU processors unused and the PPU is basically sitting idle. It could do all of that and still run your tests above with negligible impact on the results.
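
For reference, the blend formula quoted in this comment can be written as plain scalar C. This is a sketch of the arithmetic only, not the commenter's library or its SPU code. The function names, the packed 8-bit RGB layout and the single alpha value for the whole image are all assumptions made for illustration.

```c
#include <stddef.h>
#include <stdint.h>

/* dest += alpha * (source - dest) / 255 for one 8-bit channel.
   The difference can be negative, so the arithmetic is done in int. */
uint8_t blend_channel(uint8_t dest, uint8_t source, uint8_t alpha) {
    int diff = (int)source - (int)dest;
    return (uint8_t)((int)dest + ((int)alpha * diff) / 255);
}

/* Applies the blend to every channel of a packed RGB image
   (assumed layout: 3 bytes per pixel, one alpha for the whole image). */
void blend_rgb(uint8_t *dest, const uint8_t *source,
               uint8_t alpha, size_t pixel_count) {
    size_t channel_count = pixel_count * 3;
    for (size_t i = 0; i < channel_count; ++i) {
        dest[i] = blend_channel(dest[i], source[i], alpha);
    }
}
```

An alpha of 0 leaves the destination untouched and an alpha of 255 copies the source. Every channel is computed independently of the others, which is what makes this loop a natural fit for vector units that process several values per instruction.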


    Out-of-order execution is one approach to solving the generic problem of branches causing pipeline stalls. The other, favoured by the Cell, is to calculate all possible branches. There are established logic rules for this, but for instance, consider:


    a = (b==42) ? (c+3) : (d+4)


    With an instruction cmp_eq that returns all ones if the two arguments are equal and all zeros if not, we can rewrite this as:


    cond = cmp_eq(b,42);
    a = (cond & (c+3)) | (~cond & (d+4));


    Not only does this eliminate pipeline stalls, it also processes all 4 or 8 of the vectorised variables at once. More than that, due to the even-odd execution units, these instructions can be almost free to use.
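
The mask-and-select rewrite in this comment can be tried out in portable C. The sketch below is an emulation under stated assumptions: `cmp_eq` here is an ordinary C function standing in for the compare instruction, `LANES`, `select_branchless` and `select_lanes` are hypothetical names, and the four lanes are a plain array rather than a real vector register.

```c
#include <stdint.h>

#define LANES 4  /* assumed: four 32-bit values per vector */

/* Stand-in for the compare instruction: all ones if equal, all zeros if not.
   (x == y) is 1 or 0, and 0 - 1 wraps to 0xFFFFFFFF in unsigned arithmetic. */
uint32_t cmp_eq(uint32_t x, uint32_t y) {
    return (uint32_t)0 - (uint32_t)(x == y);
}

/* Computes a = (b == 42) ? (c + 3) : (d + 4) by evaluating both sides
   and combining them with the mask. */
uint32_t select_branchless(uint32_t b, uint32_t c, uint32_t d) {
    uint32_t cond = cmp_eq(b, 42u);
    return (cond & (c + 3u)) | (~cond & (d + 4u));
}

/* Applies the same select to every lane; on vector hardware this loop
   would be a handful of instructions operating on whole registers. */
void select_lanes(const uint32_t b[LANES], const uint32_t c[LANES],
                  const uint32_t d[LANES], uint32_t a[LANES]) {
    for (int lane = 0; lane < LANES; ++lane) {
        a[lane] = select_branchless(b[lane], c[lane], d[lane]);
    }
}
```

Both `c + 3` and `d + 4` are always computed, so the trick only pays off when both sides are cheap and free of side effects.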


    In summary, the Cell processor is specifically not designed for PPU usage. The out-of-order execution units were deliberately removed to make room for the SPU cores and whilst this means "any old code" may run slower, code that is written to target the Cell processor can realise much better performance gains than the out-of-order unit can possibly hope to achieve.

  • Drew,


    Geekbench relies heavily on standard library functions for the memory benchmarks (where the Cell processor was significantly faster than the G5); most of the other benchmarks hardly use standard library functions. That said, it's been my experience that the PowerPC G5 posts somewhat lower scores (10-40% lower) for the memory benchmarks under Linux than Mac OS X.


    Geekbench was built with the compiler that ships with Fedora Core 6 (gcc 4.1.1, I believe). I've tried building Geekbench with the ppu-gcc available from the Barcelona Supercomputing Center and found little (if any) difference in performance. I'll give the IBM compiler a shot and post the results (hopefully sooner rather than later).

  • Your conclusion is absolutely right: The Cell is definitely not a general-purpose CPU. :-) Integer code is not exactly the Cell's strength. It's not bad at it, but really the processor was designed to burn through parallel vectorized floating-point operations like a monster.


    A few things to keep in mind:


      • As Lucid said, the PPE (which is what these benchmarks are running on) is just 1/8th of the chip. The PPE is about the same size on the die as a single SPE, and the Cell in the PS3 has 7 working SPEs (with 6 accessible). http://www.realworldtech.com/page.cfm?ArticleID...
      • How much do your benchmarks use library functions, and how well are Fedora Core 6's libraries optimized for the Cell? Mac OS X's libraries are certainly hand-optimized for the G5. I'm a little skeptical of a PPC Linux measuring up to that.
      • Did you use gcc 4.1 as your compiler? It would be interesting to see the results from IBM's xlc instead. The code generated by stock gcc on a relatively new processor is not usually that great. The Cell version of XLC lives here: http://www.alphaworks.ibm.com/tech/cellcompiler...
  • Lucid

    Hi John,


    We're on the same page. I wasn't trying to imply that just running a bunch of threads would show the power of the PS3. The Cell's one PPE can run at most two threads concurrently, and it can only execute one instruction per cycle, regardless. The benefit of having two threads running is that when one stalls for a memory access, the other can fill in the gaps. I'd say that probably gets you about a 40% performance boost, as you said.


    To really utilize the power of the Cell, you have to keep those SPEs busy, which can't be done by simply multithreading. They require code written specifically for them that can only be run on them. The PPE is basically the guy with the whip behind the team of horses (the SPEs). It's his job to coordinate their efforts and motivate them. If you just have that guy with the whip trying to pull the carriage all by himself, he's going to seem pretty ineffective. It's only by utilizing all of the horses that you can move the carriage at a rapid pace.


    Cheers

  • Lucid,


    Geekbench has both single-threaded and multi-threaded benchmarks; unfortunately, the multi-threaded benchmarks are, at best, 1.4x faster than the single-threaded benchmarks on the Playstation 3.


    So, really, I don't think the problem is that Geekbench isn't multi-threaded. I think the problem is that Geekbench doesn't execute any code on the Cell processor's SPUs (which is somewhat more complicated to do than multi-threading).

  • Lucid

    So you're comparing one in-order PowerPC unit to a full out-of-order G5? Of course the G5 is faster. Nobody said it wasn't. But the power of the Cell includes numerous co-processors that are actually better than you'd expect at general purpose code as long as you multithread it. Trust me: this comparison is totally irresponsible. The idea of the Cell is that to take advantage of its potential, you have to write code specifically for it. A single-threaded benchmark that is written without any Cell-specific code is wasting 90% of its power.
