forums.ps2dev.org

Glas · Joined: 06 Jan 2008 Posts: 26

Hi.

The following code segment, which is the whole main loop, is very slow.
It has just 2 for loops, which probably account for most of the time.
Please have a quick look at the code.

Glas · Joined: 06 Jan 2008 Posts: 26

I found something out.

The function

Glas · Joined: 06 Jan 2008 Posts: 26

In IBM's gmath lib, there is a function called:

Glas · Joined: 06 Jan 2008 Posts: 26

New results:

Now I'm at 50 fps.

I changed the following term in the inner loop:

Mihawk · Joined: 03 Apr 2007 Posts: 29

Looks like the color doesn't change inside the loop anyway, so why not just define an unsigned int color outside the loop and calculate it just once there?

2nd:
maybe you also need to add the "-O3" compiler flag in the Makefile?
_________________
Ask and it will be given to you; seek and you will find; knock and the door will be opened to you.

Glas · Joined: 06 Jan 2008 Posts: 26

Hi Mihawk.

The loop is empty because I was wondering why it is so slow.
And it's still too slow for just two for loops. I have no idea what's going on!

The -O3 flag is set.

Thanks

rapso · Joined: 28 Mar 2005 Posts: 147

Jim · Joined: 02 Jul 2005 Posts: 487 Location: Sydney

Refactor so the screen is just linear
ie.
When you get to
*p++ = 0x0000ff00;

then you've made real progress :)
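
A minimal sketch of what such a linear fill could look like, assuming the screen is one contiguous buffer of 32-bit pixels. The function name `fill_linear` and its parameters are made up for illustration and are not from the original code:

```c
#include <stddef.h>
#include <stdint.h>

/* Fill a w*h framebuffer of 32-bit pixels through one running pointer.
   The colour is passed in, so it is computed once outside the loop. */
void fill_linear(uint32_t *screen, size_t w, size_t h, uint32_t color)
{
    uint32_t *p = screen;
    uint32_t *end = screen + w * h;

    while (p != end)
        *p++ = color;   /* one store per pixel, no per-pixel index math */
}
```

Calling `fill_linear(screen, 1100, 800, 0x0000ff00)` would then replace the two nested loops, provided the surface has no padding between rows.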

Jim
_________________
http://www.dbfinteractive.com

Glas · Joined: 06 Jan 2008 Posts: 26

Thanks for replying.

Ok, I changed it.
But the frame rate is still 50fps.
The for loops look like this now.

rapso · Joined: 28 Mar 2005 Posts: 147

Jim · Joined: 02 Jul 2005 Posts: 487 Location: Sydney

vector float color = (vector float){1.0f,1.0f,1.0f,1.0f} ;//_load_vec_float4(1.0f,1.0f,1.0f,0.0f) ;
*p++ = _pack_color8(color) ;

this still has the potential to generate a load of fp code (depends on the optimiser/code analyser).

*p++ = 0xffffffff;

gives the same result.

How does this pair of loops compare with memset(screen, 0xff, w*h*4)?
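
One way to make that comparison, as a rough sketch. The helper names (`fill_loop_white`, `fill_memset_white`, `time_fill`) are invented here, and `clock()` measures CPU time rather than wall time, so treat the numbers as a guide only:

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <time.h>

/* Plain pointer loop writing opaque white to every pixel. */
void fill_loop_white(uint32_t *screen, size_t w, size_t h)
{
    uint32_t *p = screen;
    uint32_t *end = screen + w * h;
    while (p != end)
        *p++ = 0xffffffffu;
}

/* Same result via memset; this only works because all four
   bytes of the pixel value are identical (0xff). */
void fill_memset_white(uint32_t *screen, size_t w, size_t h)
{
    memset(screen, 0xff, w * h * 4);
}

/* Run one of the fills 'frames' times and return the CPU seconds used. */
double time_fill(void (*fill)(uint32_t *, size_t, size_t),
                 uint32_t *screen, size_t w, size_t h, int frames)
{
    clock_t t0 = clock();
    for (int i = 0; i < frames; i++)
        fill(screen, w, h);
    return (double)(clock() - t0) / CLOCKS_PER_SEC;
}
```

If both variants take about the same time for, say, 100 frames, the loop itself is probably not the problem.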

Jim
_________________
http://www.dbfinteractive.com

d-range · Joined: 26 Oct 2007 Posts: 60

Maybe this is a stupid question, but why are you doing this kind of stuff from the PPU anyway? Regardless of how you're writing your code, you're still doing ~50 million dword writes a second @25 fps, which will be pretty slow no matter how you do it.

Glas · Joined: 06 Jan 2008 Posts: 26

Hi all and thanks again.

First of all, rapso:
I cut the for loops out of the code and got a frame rate of ~50000 fps.
For a 3.2 GHz Cell, using the PPU only, I consider even this a bit too slow.
Putting the for loops back in shrinks the fps count by a factor of 1000?!

Jim:
Of course you are right. But later on, when ray tracing with color, this function will be needed anyway, so it doesn't matter whether I put it in or not. It also doesn't have a big impact on the fps count...
That is, I cut out the ray tracing algo just to see why the loops limit my frame rate to 50 fps.

d-range:
I don't really understand the question. Sorry.

Thanks

ldesnogu · Joined: 17 Apr 2004 Posts: 95

What d-range meant is that your for loops write 1100 x 800 words per frame and that doing it this way on the PPU is far from optimal.

The "standard" way to write large amounts of data on the Cell is by using DMA with multiple SPUs.

However even at 50 fps, we are talking of 1100 x 800 x 4 x 50 = 176,000,000 B/s which is low.
If you look at http://www-128.ibm.com/developerworks/forums/thread.jspa?messageID=13975586& you will see that you are not limited by the memory bandwidth from the PPU to memory. (As a side note, I got higher STREAM results than the ones quoted there by using some assembly with manual unrolling and cache preload.)
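
The arithmetic above can be written as a small helper, just to make the units explicit. `bytes_per_second` is a made-up name; the figures are the ones from this thread (1100 x 800 pixels, 4 bytes each, 50 fps):

```c
#include <stdint.h>

/* Bytes written per second when every pixel of a w*h frame
   is stored once per frame. */
uint64_t bytes_per_second(uint64_t w, uint64_t h,
                          uint64_t bytes_per_pixel, uint64_t fps)
{
    return w * h * bytes_per_pixel * fps;
}
```

With `bytes_per_second(1100, 800, 4, 50)` this gives 176,000,000 bytes per second, matching the number above.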
_________________
Laurent

Glas · Joined: 06 Jan 2008 Posts: 26

Hi ldesnogu.

Ah, you are talking about the SPUs, but unfortunately I'm not considering using the SPUs for now.
I first want to test trees and algorithms on the PPU.

I'm just wondering why these for loops are so slow!
But thanks for the link. I will consider it when it's time for SPU programming and
DMA transfers. This is an interesting discussion.

I have another question.
What do you think of the example libs from IBM?
I mean libs like the misc lib, the gmath lib, the vector lib and so on...

I ask because I'm currently intersecting tri data with rays at the resolution mentioned
above, 1100 x 600.
I get a frame time of 4 secs for just a cube with 2 tris per face, i.e.
2 * 6 tris = 12 tris.

I'm a bit frustrated, because everything is so slow!!!

Thanks
Alex

ldesnogu · Joined: 17 Apr 2004 Posts: 95

Stupid question just in case: do you use -O2 or -O3 when you compile?

And again: the Cell PPU is a very poor processor; using it alone will only give you very disappointing results.
_________________
Laurent

Glas · Joined: 06 Jan 2008 Posts: 26

Hello ldesnogu.

I compile it with -O3. I haven't tried -O2 yet. For debugging I use no optimization, of course.

ldesnogu · Joined: 17 Apr 2004 Posts: 95

In-order, no (or poor, can't remember :D) branch prediction.

IBM has posted SPECint 2k results of 423 and fp of 387 (ref).

That basically means the PPU has performance similar to an 800 MHz PIII for integers and is about 10% better than the PIII for FP.

So if you don't use the Cell "properly" you will be *very* disappointed :)
_________________
Laurent

Glas · Joined: 06 Jan 2008 Posts: 26

Hi ldesnogu.

http://spec.it.miami.edu/cgi-bin/osgresults
I found results here, but none for cbea.

ldesnogu · Joined: 17 Apr 2004 Posts: 95

1. I posted the link to IBM result in my previous post (it's close to the end of the article)
2. If you want to search SPEC 2000 results use the official site: http://www.spec.org/cgi-bin/osgresults?conf=cpu2000
_________________
Laurent

Glas · Joined: 06 Jan 2008 Posts: 26

Hi.

Thanks.

OK. I'll set the original problem aside for now. I'll use what I have and port everything to the SPUs soon.

Btw, I tried a simple console app on Windows on my Athlon X2 2.4 GHz, testing a while loop. It's the same construct as the one on the PPU.
Here I get ~600k fps. On the PPU, ~50k fps.
That is a big difference, I think, especially given that the Athlon is clocked 800 MHz lower...

#edit#
It's just like ldesnogu said. It's like a PIII ;)
#/edit#

Thanks to all for your help.
Alex

rapso · Joined: 28 Mar 2005 Posts: 147

IronPeter · Joined: 06 Aug 2007 Posts: 207

>So if you don't use the Cell "properly" you will be *very* disappointed :)

Yes, that is the point.

Use PPU as IO-processor and SPU scheduler only.

HD · Joined: 11 Mar 2008 Posts: 4

I think you can get a lot more speed out of the PPU by using AltiVec, cache clearing and loop unrolling. An optimized PPU memset similar to the one you need achieves approx. 5800 MBytes/sec, i.e. ~1650 fps. Download from here:
http://www.fh-furtwangen.de/~dersch/memcpy_cell.c
If you need to do format conversions (float4->uchar4): these can
also be done quite efficiently in AltiVec code.
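
To illustrate only the loop unrolling part, here is a portable C sketch. It is not the code from the link above and uses no AltiVec or cache instructions; `fill_unrolled` is a hypothetical name. With -O3 the compiler may already unroll or vectorize a simple loop by itself, so measure before relying on it:

```c
#include <stddef.h>
#include <stdint.h>

/* Store 'color' into 'count' 32-bit pixels, eight stores per iteration. */
void fill_unrolled(uint32_t *dst, size_t count, uint32_t color)
{
    size_t i = 0;

    /* Main body: handles blocks of eight pixels. */
    for (; i + 8 <= count; i += 8) {
        dst[i + 0] = color;
        dst[i + 1] = color;
        dst[i + 2] = color;
        dst[i + 3] = color;
        dst[i + 4] = color;
        dst[i + 5] = color;
        dst[i + 6] = color;
        dst[i + 7] = color;
    }

    /* Tail: the remaining 0..7 pixels. */
    for (; i < count; i++)
        dst[i] = color;
}
```

On an in-order core like the PPU, fewer loop branches per pixel can help, but the bigger gains mentioned above would come from the vector stores and cache handling, which this sketch leaves out.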

Regards

HD

Glas · Joined: 06 Jan 2008 Posts: 26

Hi and thanks for your replies.

Sorry that my answer has taken so long.

@rapso:
You were right!
Now I cut everything out, even the display flip, and got an fps count of
~100k - ~120k fps.
But when I also cut out the input handling, I get
~600k - ~700k fps.

This is all SDL code!
I never thought this could be the bottleneck, because the input handling is just a bit of switch-case stuff.
Is this because of the in-order processing of the PPU?

I count the fps like this:

rapso · Joined: 28 Mar 2005 Posts: 147

cheriff · Regular Joined: 23 Jun 2004 Posts: 262 Location: Sydney.au

Glas · Joined: 06 Jan 2008 Posts: 26

Hi all and thanks again.

rapso:
Here is the SDL input code. It's quite simple and short.