|
forums.ps2dev.org Homebrew PS2, PSP & PS3 Development Discussions
Glas
Joined: 06 Jan 2008 Posts: 26
|
Posted: Mon May 12, 2008 8:27 am Post subject: two for loops very low frame rate |
|
|
Hi.
The following code segment, which is the whole main loop, is very slow.
It has just two for loops, which probably account for most of the time.
Please have a quick look at the code.
Code: |
...
while( !bDone )
{
    /////////////////////////////////////////////
    // Handle input
    while( SDL_PollEvent( &event ) )
    {
        switch( event.type )
        {
        case SDL_KEYDOWN:
            switch( event.key.keysym.sym )
            {
            case SDLK_ESCAPE:
                bDone = true ;
                break ;
            case SDLK_p:
                {
                    char sz[25] ;
                    sprintf( sz, "/tmp/screenshot\0" ) ;
                    int i = SDL_SaveBMP(screen, sz) ;
                    i=0 ;
                }
                break ;
            } // switch sym
            break ;
        case SDL_MOUSEMOTION:
            x = event.motion.x ;
            y = event.motion.y ;
            break ;
        case SDL_MOUSEBUTTONDOWN:
            break ;
        } // switch type
    } // poll event
    ///////////////////////////////
    // start timing
    runtime = clock() ;
    // lets do something
    SDL_LockSurface( screen ) ;
    unsigned int *p = ((unsigned int*)(screen->pixels)) ;
    for( int j=0; j< SCREEN_HEIGHT; j++ )
    {
        for( int i=0; i< SCREEN_WIDTH; i++ ){
            vector float color = _load_vec_float4(1.0f,1.0f,0.0f,0.0f) ;
            // put the color to the screen
            p[j*SCREEN_WIDTH+i] = VEC_TO_A8R8G8B8(color) ;
        }
    }
    SDL_UnlockSurface( screen ) ;
    //time_int(1) ;
    elapsed = clock() - runtime ;
    SDL_Color sdlcolor={1,1,1,1} ;
    char szfps[50] ;
    float ftime = float(elapsed)/float(CLOCKS_PER_SEC) ;
    sprintf(szfps, "Frame Time: %.3f", ftime ) ;
    RenderText(screen, font, sdlcolor,5,50,szfps) ;
    sprintf(szfps, "FPS: %.3f", 1.0f/ftime ) ;
    RenderText(screen, font, sdlcolor,5,150,szfps) ;
    //printf( "s/frame: %f\n", float(elapsed)/float(CLOCKS_PER_SEC) ) ;
    SDL_Flip( screen ) ;
}
...
|
With this code I get a maximum frame rate of only 5-7 fps, and that is without adding anything else to it.
Screen width and height are defined as follows:
Code: |
#define SCREEN_WIDTH 1100 // 1100 max
#define SCREEN_HEIGHT 600
|
So that is 660000 loop iterations per frame on a 3.2 GHz PPC.
Is this possible?
I mean, what's wrong?
I use Fedora 7 on the PS3 with the libs SDL, SDL_ttf, zlib and truetype.
This code is executed on the PPU only.
Thanks |
|
Glas
Joined: 06 Jan 2008 Posts: 26
|
Posted: Mon May 12, 2008 9:24 pm Post subject: |
|
|
I found something out.
The function Code: |
VEC_TO_A8R8G8B8(vector float &)
|
costs about 20 fps. This function is called in the inner for loop, that is,
660000 times per frame.
This function is defined as follows:
Code: |
unsigned int VEC_TO_A8R8G8B8( vector float &x )
{
    return ((unsigned int)( x[3]* 255) << 24 ) |
           ((unsigned int)(x[0]*255) << 16) |
           ((unsigned int)(x[1]*255)<<8) |
           (unsigned int)(x[2]*255) ;
}
|
Ok, here we have four multiplications.
Do you have any suggestion how to improve this function?
In the code from the first post, I changed the code like this:
From
Code: |
p[j*SCREEN_WIDTH+i] = VEC_TO_A8R8G8B8(color) ;
|
to
Code: |
p[j*SCREEN_WIDTH+i] = 0x00ff0000 ; //VEC_TO_A8R8G8B8(color) ;
|
where p is a pointer to the screen array.
Thanks
Alex |
|
Glas
Joined: 06 Jan 2008 Posts: 26
|
Posted: Mon May 12, 2008 9:46 pm Post subject: |
|
|
In IBM's gmath lib there is a function called:
Code: |
unsigned int _pack_color8( vector float rgba )
|
With it I get nearly 20 fps.
But that is still very slow, isn't it?
Is 20 fps slow for two for loops with 660000 iterations?
OK, if I calculate the loop count per second, the result is:
Code: |
20 * 660000 = 13.2 million loops
|
On a 3.2 GHz PPC that is still too slow in my opinion!
I'd be grateful if somebody could agree, disagree or give some answers.
Thanks
Alex |
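To put the 13.2 million figure in perspective, here is a rough sketch of the cycle budget per inner-loop iteration. It only divides the clock rate by the number of iterations per second; it ignores memory stalls, the SDL calls and everything else in the frame, so treat it as an upper bound on what one iteration could be costing. The helper name `cycles_per_pixel` is made up for this illustration.

```cpp
#include <cstdint>

// Pixels written per frame in the first post (1100 x 600).
constexpr std::uint64_t kPixels = 1100ull * 600ull;  // 660000

// Clock cycles available per inner-loop iteration if the loop were the
// only thing running: clock rate / (iterations per frame * frames per second).
constexpr double cycles_per_pixel(double clock_hz,
                                  std::uint64_t pixels_per_frame,
                                  double fps) {
    return clock_hz / (static_cast<double>(pixels_per_frame) * fps);
}
```

With the numbers from the post, `cycles_per_pixel(3.2e9, kPixels, 20.0)` comes out at roughly 240 cycles per pixel, which is far more than a single store should need and supports the suspicion that something other than the loop itself is eating the time.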
|
Glas
Joined: 06 Jan 2008 Posts: 26
|
Posted: Mon May 12, 2008 9:57 pm Post subject: |
|
|
New results:
Now I'm at 50 fps.
I changed the following line in the inner loop:
Code: |
vector float color = _load_vec_float4(1.0f,1.0f,1.0f,0.0f) ;
|
to
Code: |
vector float color = (vector float){1.0f,1.0f,1.0f,1.0f} ;//_load_vec_float4(1.0f,1.0f,1.0f,0.0f) ;
|
That is a gain of 30 fps. Incredible!!!
Code: |
vector float _load_vec_float4( ... )
|
is also a function from the IBM vector lib in the example libraries.
Did I make a mistake when compiling these example libs from IBM?
Please give me your opinion about these libs and what you think.
Thanks |
|
Mihawk
Joined: 03 Apr 2007 Posts: 29
|
Posted: Mon May 12, 2008 11:06 pm Post subject: |
|
|
Looks like the color doesn't change inside the loop anyway, so why not just define an unsigned int color outside the loop and calculate it just once there?
2nd:
maybe you also need to add "-O3" compiler flag in the Makefile? _________________ Ask and it will be given to you; seek and you will find; knock and the door will be opened to you. |
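A minimal sketch of the "calculate it just once" suggestion, in portable scalar C++ rather than AltiVec. `pack_a8r8g8b8` is a hypothetical stand-in for the thread's `VEC_TO_A8R8G8B8` that takes four plain floats in [0, 1] instead of a `vector float`, and `fill_constant` is likewise an illustrative name.

```cpp
#include <cstddef>
#include <cstdint>

// Scalar stand-in for VEC_TO_A8R8G8B8: packs four floats in [0, 1]
// into one 32-bit A8R8G8B8 pixel. No clamping is done.
std::uint32_t pack_a8r8g8b8(float a, float r, float g, float b) {
    return (static_cast<std::uint32_t>(a * 255.0f) << 24) |
           (static_cast<std::uint32_t>(r * 255.0f) << 16) |
           (static_cast<std::uint32_t>(g * 255.0f) << 8)  |
            static_cast<std::uint32_t>(b * 255.0f);
}

// The colour is packed once by the caller; the loop only stores it.
void fill_constant(std::uint32_t* pixels, std::size_t count,
                   std::uint32_t packed) {
    for (std::size_t i = 0; i < count; ++i) {
        pixels[i] = packed;
    }
}
```

The float-to-integer conversions then happen once per frame instead of 660000 times. This only applies while the colour really is constant; once a ray tracer produces a different colour per pixel, the packing has to move back into the loop.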
|
Glas
Joined: 06 Jan 2008 Posts: 26
|
Posted: Wed May 14, 2008 8:38 am Post subject: |
|
|
Hi Mihawk.
The loop is empty because I was wondering why it is so slow.
And it's still too slow for just two for loops. I have no idea what's going on!
The -O3 flag is set.
Thanks |
|
rapso
Joined: 28 Mar 2005 Posts: 147
|
Posted: Wed May 14, 2008 6:20 pm Post subject: |
|
|
Code: | p[j*SCREEN_WIDTH+i] |
maybe you could avoid that math, i'm not sure whether the compiler is optimizing this or if the ppu calculates the offset every loop, but calculating it could be slow due to the register dependencies.
you could try to do the same amount of work in one loop with just
and i think it's slower using
than the declaration, because the compiler cannot optimize the intrinsic, it's done every loop, while it can take the declaration out of the loop... just a guess.
you could also check if __restrict could give you any benefits ;) |
|
Jim
Joined: 02 Jul 2005 Posts: 487 Location: Sydney
|
Posted: Wed May 14, 2008 8:15 pm Post subject: |
|
|
Refactor so the screen is just linear
ie.
When you get to
*p++ = 0x0000ff00;
then you've made real progress :)
Jim _________________ http://www.dbfinteractive.com |
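A small sketch of the linear version, assuming the rows are contiguous in memory (that is, the surface pitch equals width * 4 bytes, which the code in the first post also assumes). The function name `fill_linear` is only for illustration.

```cpp
#include <cstddef>
#include <cstdint>

// Treat the frame buffer as one flat array and walk it with a single
// running pointer, so there is no j * width + i multiply per pixel.
void fill_linear(std::uint32_t* p, std::size_t width, std::size_t height,
                 std::uint32_t value) {
    std::uint32_t* const end = p + width * height;
    while (p != end) {
        *p++ = value;
    }
}
```

If the pitch is larger than width * 4, the pointer has to be advanced by the padding at the end of each row, so it is worth checking the surface's pitch before relying on this.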
|
Glas
Joined: 06 Jan 2008 Posts: 26
|
Posted: Thu May 15, 2008 12:57 am Post subject: |
|
|
Thanks for replying.
Ok, I changed it.
But the frame rate is still 50 fps.
The for loops look like this now.
Code: |
(some code here)...
    ///////////////////////////////
    // start timing
    runtime = clock() ;
    // lets do something
    SDL_LockSurface( screen ) ;
    unsigned int *p = ((unsigned int*)(screen->pixels)) ;
    vector float vcHit = (vector float){MATHCONST_FLOATMAX,MATHCONST_FLOATMAX,MATHCONST_FLOATMAX,MATHCONST_FLOATMAX};
    for( int j=0; j< SCREEN_HEIGHT; j++ )
    {
        for( int i=0; i< SCREEN_WIDTH; i++ ){
            RAY ray1, ray2, ray3, ray4 ;
            ray1 = CreateRayFromSurfacePixel_Perspective_scalar(i,j,g_fSW,g_fSH,g_frecip_SW, g_frecip_SH,cam ) ;
            vector float color = (vector float){1.0f,1.0f,1.0f,1.0f} ;//_load_vec_float4(1.0f,1.0f,1.0f,0.0f) ;
            *p++ = _pack_color8(color) ;
        }
    }
    SDL_UnlockSurface( screen ) ;
    (some code here) ...
    SDL_Flip( screen ) ;
}
|
So, unfortunately, the pointer arithmetic has no effect at all.
Instead, another problem comes up.
If I want to trace 4 rays at once, I have to reintroduce the index calculation.
For example,
if I trace 4 rays (in screen coords):
ray 1: (y,x)
ray 2: (y, x+1)
ray 3: (y+1, x )
ray 4: (y+1, x+1)
With the array index technique I could simply write:
Code: |
p[y*width+x]= ray_color1
p[y*width+(x+1)] = ray_color2
...
|
The current results are:
The two for loops without the ray creation function
Code: |
CreateRayFromSurfacePixel_Perspective_scalar()
|
run at 50 fps.
With this function, which creates primary rays for width*height pixels, where
width = 1100
height = 800
the frame rate is 10 fps.
What should I look at next?
Do you have any other options I could try?
Thanks |
|
rapso
Joined: 28 Mar 2005 Posts: 147
|
Posted: Thu May 15, 2008 5:42 pm Post subject: |
|
|
Glas wrote: | Thanks for replying.
Ok, I changed it.
But the frame rate is still 50fps. | how fast is it without the loops, with just the rest of the code? maybe the bottleneck isn't in these loops.
Quote: |
ray 1: (y,x)
ray 2: (y, x+1)
ray 3: (y+1, x )
ray 4: (y+1, x+1)
|
prefer p[index] over *p++ for two reasons: 1. the loop-iteration counter is incremented anyway 2. if you do several *p++ in a row, you can get register RAW hazards on most cpus (on ARM *p++ is actually faster :D)
if you need offsets by one line or one pixel, just add them directly without the mul
Code: |
p[index]
p[index+1]
p[index+SCREEN_WIDTH]
p[index+SCREEN_WIDTH+1]
|
or with Jim's version
Code: |
p[0]
p[1]
p[SCREEN_WIDTH]
p[SCREEN_WIDTH+1]
p+=2;
|
but this again can result in register RAW hazards. if you want to write 4 pixels in a row, you could combine them into a 128bit quad and write them at once, this can also be beneficial for the buffer fill performance, especially if you do it with those altivec intrinsics.
of course, you won't write a pixel-quad anymore, but you can try to either trace 4*1 packets or to trace 2*2 but write them as 4*1 (swizzled), doing some post processing afterwards is a good chance to unswizzle the buffer (kinda for free).
sorry if something is incorrect, i'm not that deep into ppu assembler, i've just seen that there is a load for one indirection, so p[index] should be free, p[index+...] shouldn't.
cheers |
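The offset idea above can be sketched in portable C++ as follows. This is only the scalar 2x2 variant (one base index plus fixed offsets), not the 128-bit AltiVec store; `write_quad` and `fill_quads` are illustrative names, and the image width and height are assumed to be even.

```cpp
#include <cstddef>
#include <cstdint>

// Write one 2x2 block whose top-left pixel sits at flat index `index`.
// Only additions are needed once the base index is known.
void write_quad(std::uint32_t* p, std::size_t index, std::size_t width,
                const std::uint32_t c[4]) {
    p[index]             = c[0];  // (y,   x)
    p[index + 1]         = c[1];  // (y,   x+1)
    p[index + width]     = c[2];  // (y+1, x)
    p[index + width + 1] = c[3];  // (y+1, x+1)
}

// Walk the image in 2x2 steps: one multiply per pair of rows,
// none per pixel. Assumes width and height are even.
void fill_quads(std::uint32_t* p, std::size_t width, std::size_t height,
                const std::uint32_t c[4]) {
    for (std::size_t y = 0; y < height; y += 2) {
        std::size_t index = y * width;
        for (std::size_t x = 0; x < width; x += 2, index += 2) {
            write_quad(p, index, width, c);
        }
    }
}
```

In a ray tracer, `c` would hold the four colours of one 2x2 ray packet instead of a constant.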
|
Jim
Joined: 02 Jul 2005 Posts: 487 Location: Sydney
|
Posted: Thu May 15, 2008 8:35 pm Post subject: |
|
|
vector float color = (vector float){1.0f,1.0f,1.0f,1.0f} ;//_load_vec_float4(1.0f,1.0f,1.0f,0.0f) ;
*p++ = _pack_color8(color) ;
this still has the potential to generate a load of fp code (depends on the optimiser/code analyser).
*p++ = 0xffffffff;
gives the same result.
How does this pair of loops compare with memset(screen, 0xff, w*h*4)?
Jim _________________ http://www.dbfinteractive.com |
|
d-range
Joined: 26 Oct 2007 Posts: 60
|
Posted: Fri May 16, 2008 12:09 am Post subject: |
|
|
Maybe this is a stupid question, but why are you doing this kind of stuff from the PPU anyway? Regardless of how you're writing your code, you're still doing ±50 million dword writes a second @25 fps, which will be pretty slow no matter how you do it. |
|
Glas
Joined: 06 Jan 2008 Posts: 26
|
Posted: Fri May 16, 2008 12:06 pm Post subject: |
|
|
Hi all and thanks again.
First of all, rapso:
I cut the for loops out of the code and got a frame rate of ~50000 fps.
For a 3.2 GHz Cell, using the PPU only, I consider this a bit too slow as well.
Putting the for loops back in shrinks the fps count by a factor of 10000?!
Jim:
Of course you are right. But later on, when ray tracing with color, this function will be needed anyway, so it doesn't matter whether I put it in or not. It also doesn't have a big impact on the fps count...
In other words, I cut out the ray tracing algorithm just to see why the loops limit me to 50 fps.
d-range:
I don't really understand the question. Sorry.
Thanks |
|
ldesnogu
Joined: 17 Apr 2004 Posts: 95
|
Posted: Fri May 16, 2008 7:12 pm Post subject: |
|
|
What d-range meant is that your for loops write 1100 x 800 words per frame and that doing it this way on the PPU is far from optimal.
The "standard" way to write large amounts of data on the Cell is to use DMA with multiple SPUs.
However, even at 50 fps we are talking about 1100 x 800 x 4 x 50 = 176,000,000 B/s, which is low.
If you look at this thread http://www-128.ibm.com/developerworks/forums/thread.jspa?messageID=13975586& you will see that you are not limited by the memory BW from PPU to memory. (As a side note, I got higher results for STREAM than what the guy quoted by using some assembly with manual unrolling and cache preload.) _________________ Laurent |
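The bandwidth figure above is plain arithmetic; a short sketch that reproduces it (the function name `bytes_per_second` is just for illustration):

```cpp
#include <cstdint>

// Bytes written to the frame buffer per second:
// width * height pixels, bytes_per_pixel bytes each, fps frames per second.
constexpr std::uint64_t bytes_per_second(std::uint64_t width,
                                         std::uint64_t height,
                                         std::uint64_t bytes_per_pixel,
                                         std::uint64_t fps) {
    return width * height * bytes_per_pixel * fps;
}

// 1100 x 800 pixels, 4 bytes each, 50 fps, as in the post above.
static_assert(bytes_per_second(1100, 800, 4, 50) == 176000000ull,
              "matches the 176,000,000 B/s quoted in the thread");
```

For the 1100 x 600 resolution from the first post the same formula gives 132,000,000 B/s, so the conclusion does not change.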
|
Glas
Joined: 06 Jan 2008 Posts: 26
|
Posted: Sat May 17, 2008 11:30 am Post subject: |
|
|
Hi ldesnogu.
Ah, you are talking about the SPUs, but unfortunately I'm not considering using the SPUs at the moment.
I first want to test trees and algorithms on the PPU.
I'm just wondering why these for loops are so slow!
But thanks for the link. I will look at it when it's time for SPU programming and
DMA transfers. This is an interesting discussion.
I have another question.
What do you think of the example libs from IBM?
I mean libs like the misc lib, the gmath lib, the vector lib and so on...
I ask because I'm currently intersecting triangle data with rays at the resolution mentioned
above, 1100 x 600.
I get a frame time of 4 seconds for just a cube with 2 tris per face, that is
2 * 6 tris = 12 tris
I'm a bit frustrated, because everything is so slow!!!
Thanks
Alex |
|
ldesnogu
Joined: 17 Apr 2004 Posts: 95
|
Posted: Sat May 17, 2008 9:34 pm Post subject: |
|
|
Stupid question just in case: do you use -O2 or -O3 when you compile?
And again: the Cell PPU is a very poor processor; using it alone will only give you very disappointing results. _________________ Laurent |
|
Glas
Joined: 06 Jan 2008 Posts: 26
|
Posted: Mon May 19, 2008 2:42 am Post subject: |
|
|
Hello ldesnogu.
I compile it with -O3. I haven't tried -O2 yet. For debugging I use no optimization, of course.
ldesnogu wrote: |
And again: the Cell PPU is a very poor processor; using it alone will only give you very disappointing results.
|
But why?
What makes it so poor compared to an ordinary Intel-like CPU?
Alex |
|
ldesnogu
Joined: 17 Apr 2004 Posts: 95
|
Posted: Mon May 19, 2008 3:03 am Post subject: |
|
|
In-order execution, no (or poor, can't remember :D) branch prediction.
IBM has posted SPECint 2k results of 423 and fp of 387 (ref).
That basically means the PPU has similar performance to a PIII 800 MHz for integers and about 10% better than the PIII for FP.
So if you don't use the Cell "properly" you will be *very* disappointed :) _________________ Laurent |
|
Glas
Joined: 06 Jan 2008 Posts: 26
|
|
ldesnogu
Joined: 17 Apr 2004 Posts: 95
|
Posted: Mon May 19, 2008 7:27 am Post subject: |
|
|
1. I posted the link to IBM result in my previous post (it's close to the end of the article)
2. If you want to search SPEC 2000 results use the official site: http://www.spec.org/cgi-bin/osgresults?conf=cpu2000 _________________ Laurent |
|
Glas
Joined: 06 Jan 2008 Posts: 26
|
Posted: Mon May 19, 2008 1:06 pm Post subject: |
|
|
Hi.
Thanks.
OK. I'll set the original problem aside for now. I'll use what I have and port everything to the SPUs soon.
Btw, I tried a simple console app on Windows on my Athlon X2 2.4 GHz, testing a while loop. It's the same construct as the one on the PPU.
There I get ~600k fps. On the PPU, ~50k fps.
That is a big difference I think, especially for a CPU that is 800 MHz slower...
#edit#
It's just like ldesnogu said. It's like a PIII ;)
#/edit#
Thanks to all for your help.
Alex |
|
rapso
Joined: 28 Mar 2005 Posts: 147
|
Posted: Mon May 19, 2008 5:20 pm Post subject: |
|
|
Glas wrote: |
First of all, rapso:
I cut the for loops out of the code and got a frame rate of ~50000 fps.
For a 3.2 Ghz Cell, utilizing the ppu only, I consider this also as a bit too slow.
Putting the for loops in, shrinking the fps count 10000 time down?!
| but you kept everything else like SDL_LockSurface( screen ) ; ?
50MB/s is extremely slow, i kinda doubt it's just the loop. |
|
IronPeter
Joined: 06 Aug 2007 Posts: 207
|
Posted: Mon May 19, 2008 6:37 pm Post subject: |
|
|
>So if you don't use the Cell "properly" you will be *very* disappointed :)
Yes, that is the point.
Use PPU as IO-processor and SPU scheduler only. |
|
HD
Joined: 11 Mar 2008 Posts: 4
|
Posted: Tue May 20, 2008 2:27 am Post subject: |
|
|
I think you can get a lot more speed out of the ppu by using altivec, cache clearing and loop unrolling. An optimized ppu-memset similar to the one you need achieves appr. 5800MBytes/sec ~1650fps. Download from here:
http://www.fh-furtwangen.de/~dersch/memcpy_cell.c
If you need to do format conversions (float4->uchar4): these can
also be done quite efficiently in altivec-code.
Regards
HD |
|
Glas
Joined: 06 Jan 2008 Posts: 26
|
Posted: Wed May 21, 2008 4:51 pm Post subject: |
|
|
Hi and thanks for your replies.
Sorry that my answer has taken so long.
@rapso:
You were right!
Now I cut everything out, even the display flip, and got an fps count of
~100k - ~120k fps.
But when I also cut out the input handling, I get
~600k - ~700k fps
This is all SDL code!
I never thought that this could be the bottleneck, because the input handling is just a bit of switch-case stuff.
Is this because of the in-order processing of the PPU?
I count the fps like this:
Code: |
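unsigned long long start = __mftb() ; // read the timebase register
// ... one pass of the main loop goes here ...
unsigned long long elapsed = __mftb() - start ; // elapsed timebase ticks
// timebase_per_sec: timebase frequency in ticks per second (platform-specific)
printf( "%f\n", 1.0 / ( double(elapsed) / double(timebase_per_sec) ) ) ;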
|
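For comparison, a portable sketch of the same kind of measurement using `std::chrono` instead of `__mftb()`. It times many iterations and averages, because a single pass of a nearly empty loop is close to the timer's resolution. The helper name `seconds_per_iteration` is an assumption for this example, not something from SDL or the IBM libs.

```cpp
#include <chrono>

// Run `work` `iterations` times and return the average wall-clock
// time per iteration in seconds. steady_clock is monotonic, so the
// result is not disturbed by system clock adjustments.
template <typename F>
double seconds_per_iteration(F&& work, int iterations) {
    using clock = std::chrono::steady_clock;
    const auto start = clock::now();
    for (int i = 0; i < iterations; ++i) {
        work();
    }
    const std::chrono::duration<double> elapsed = clock::now() - start;
    return elapsed.count() / iterations;
}
```

Calling it once with the input handling in the callable and once without gives the per-frame cost of the input code directly in seconds, which is easier to reason about than a difference between two fps values.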
So, what do you suggest?
I could control the PS3 ray tracer from my PC (over the network), but later in this real-time RT project I need some joypad input.
On the other hand, I will move the code to the SPEs anyway, so this shouldn't make much
difference, because even 50k fps is enough for pure management tasks.
But what makes the SDL code so slow? I don't use any rasterization or any other compute-intensive SDL code.
Alex |
|
rapso
Joined: 28 Mar 2005 Posts: 147
|
Posted: Wed May 21, 2008 7:48 pm Post subject: |
|
|
Glas wrote: | Hi and thanks for your replies.
Sorry that my answer has taken so long.
@rapso:
You were right!
|
i'm glad I helped you find an issue, although this particular input issue wasn't what I was aiming at. so additionally:
I asked if you kept SDL_LockSurface( screen ) in your code when you tested without the loop, because although it might look like a simple function, SDL might do a lot of format conversions before returning the ptr. this can be way more expensive than your loop.
regarding input: maybe you can post some of the 'expensive' code. |
|
cheriff Regular
Joined: 23 Jun 2004 Posts: 262 Location: Sydney.au
|
Posted: Thu May 22, 2008 10:21 am Post subject: |
|
|
Glas wrote: | Now I cut everything out, even the display flip and got a fps count of
~100k - ~120k fps.
But when I cut out the input handling, I get
~600k - ~700k fps
This is all sdl code!
I have never thought that this could be the bottleneck, because the input handling is just a bit switch - case stuff. |
Hi, whilst I cant find the link right now, I do seem to remember an article on gamedev or something on the folly of relying on FPS for this kind of performance tuning (especially this early on in development) I cant recall the exact details, but it basically comes down to the fact that FPS is measuring the reciprocal of time, which is not linear, and so conflicts with intuition. Maybe you should be looking at average frame time instead
So say your app as a bare loop runs at 600kfps (really? half a million?) so each frame is taking 1/600k = 1.6e-6 seconds each.
With the SDL input routines being run, you're down to 1/100k = 1.0e-5 seconds each
So a call to input routines takes 8.3e-6 seconds each, which is the same order of magnitude as the actual loop - so it's only natural that 99% of processing time is in SDL - where else would it be? Since the code doesn't attempt to do anything else, of course the few things you ARE doing will dominate execution time
Now consider a project a bit further along, calculating graphics and stuff, running at a respectable 100fps, or 0.01 seconds per frame.
Now lets add input code, which we already know to take 8.3e-6 seconds, so each frame now requires 0.010008 seconds to render - which equates back to 99.9167 FPS. Does SDL still feel like a bottleneck to you now? :)
So the SDL code either costs 500k frames per second - or 0.08 frames per second - depending on what else is in the game loop.
In short - dont get too caught up with FPS early on in the project :) If you insist on trying to optimise this early on, at the very least, deal in seconds per frame on your mental graph paper as you plot performance - at least it is linear and will be a truer indication of the cost of features you add. _________________ Damn, I need a decent signature! |
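The arithmetic in this post can be reproduced with two small helpers. This is just a sketch of the reasoning above, with made-up function names; the input values are the ones quoted in the thread.

```cpp
// Per-frame cost in seconds implied by a drop from fps_before to
// fps_after: the difference of the two frame times.
constexpr double cost_per_frame(double fps_before, double fps_after) {
    return 1.0 / fps_after - 1.0 / fps_before;
}

// Frame rate that results when a feature costing added_seconds per
// frame is added to a loop that currently runs at base_fps.
constexpr double fps_with_cost(double base_fps, double added_seconds) {
    return 1.0 / (1.0 / base_fps + added_seconds);
}
```

`cost_per_frame(600000.0, 100000.0)` is about 8.3e-6 seconds, and feeding that into `fps_with_cost(100.0, ...)` gives about 99.92 fps, the same figures as in the post. The cost in seconds stays constant; only the fps difference changes with the baseline.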
|
Glas
Joined: 06 Jan 2008 Posts: 26
|
Posted: Fri May 23, 2008 6:16 am Post subject: |
|
|
Hi all and thanks again.
rapso:
Here is the sdl input code. Its quite simple and short.
Code: |
/////////////////////////////////////////////
// Handle input
while( SDL_PollEvent( &event ) )
{
    switch( event.type )
    {
    case SDL_KEYDOWN:
        switch( event.key.keysym.sym )
        {
        case SDLK_ESCAPE:
            bDone = true ;
            break ;
        case SDLK_p:
            {
                char sz[25] ;
                sprintf( sz, "/tmp/screenshot.png\0" ) ;
                int i = SDL_SaveBMP(screen, sz) ;
                i=0 ;
            }
            break ;
        } // switch sym
        break ;
    case SDL_MOUSEMOTION:
        x = event.motion.x ;
        y = event.motion.y ;
        break ;
    case SDL_MOUSEBUTTONDOWN:
        break ;
    } // switch type
} // poll event
|
Actually, I only need SDL for input handling. The fb utility from http://www.cellperformance.com works just fine, but I don't use it currently, as you can imagine.
But still, in the next one or two weeks I will move to the SPUs, and then this shouldn't make much difference.
cheriff:
Actually I have more code than just the two for loops.
I have a ray tracer running. The problem was that I didn't know where all the performance had gone. So I cut out pieces and ended up with just what rapso
suggested.
With 20 spheres and 6 planes (all implicit) a frame took about 4 s.
Without the ray tracer, with just input, the for loops and primary ray generation, a frame took 0.5 s.
And with just the main while loop, no for loops and no input, I got 700 kfps.
I should try the whole thing with just the fb utility from cellperformance.com, to
see how it actually performs.
So what do you say about this?
Thanks
Alex |
|