forums.ps2dev.org

cnlohr · Joined: 05 Feb 2006 Posts: 24

Since the other thread has been very eventful with important news, and it appears thousands of people are visiting it for news, I decided to post here for a general discussion for some of the noobs like myself who have been following the thread.

I had three questions, but if possible I would like this thread to be continued as a Q/A thread.

1. Do we know whether buffer swapping works the same way as on the '7800?

2. I'm having fun writing an interactive program, except it crashes very quickly, in fact, whenever wptr (ps3gpu.c:511) approaches 16384. I noticed some brief discussion about trying to place a jump in the FIFO buffer, but I haven't found any information on how to do that. It would be great if someone could provide some insight there.

3. The kernel patch provided doesn't seem to take to _any_ kernel I applied it to. It wasn't until I manually edited the files and pasted in the code that I could get 3D working on my PS3. What kernel are you guys using!?

Additionally, I ran into the following problems and solved them. I think this may be useful for any other noobs like myself.

My system got a black screen after KBOOT, or failed to do much of anything, with the 2.6.22 and 2.6.23 kernels. My solution was to use the kernel in this post http://forums.ps2dev.org/viewtopic.php?p=59961&highlight=kernel+git#59961.

I couldn't get my system to find the hard drives after I upgraded my kernel. There were two problems: (1) the name of the hard drive driver changed, and it now gets built into the kernel by default; (2) the drive /dev/sdaX was changed to /dev/ps3daX.

I couldn't check out the files with the web URL for the subversion repo. Answer: actual url of the repo was svn://svn.ps2dev.org/ps3ware/trunk/libps3rsx

IronPeter · Joined: 06 Aug 2007 Posts: 207

1.) You can flip visual screen area with

http://wiki.ps2dev.org/ps3:hypervisor:lv1_gpu_context_attribute with L1GPU_CONTEXT_ATTRIBUTE_DISPLAY_FLIP attribute.

2.) Just place OUT_RING(0x20000000 | address_to_jump ); in your push buffer. This command does not have a subchannel id or tag mask.

It is a good idea to drive the GPU with jumps in the push buffer. Do not modify control registers at runtime; a kick is very slow. There are many other ways to keep the GPU in sync with the CPU.

3.) Git head of Geoff Levand's kernel, http://git.kernel.org/gitweb.cgi?p=linux/kernel/git/geoff/ps3-linux.git ( not sure, I am away from my console ).
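The jump command from point 2 can be sketched roughly as follows. This is only a minimal model, assuming the push buffer is a plain array of 32-bit words; `push_buffer_t`, `out_ring`, `make_jump` and `emit_jump` are illustrative names and not the libps3rsx API. Only the `0x20000000 | address` encoding comes from the post.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Jump command flag, as given in the post: 0x20000000 | address_to_jump. */
#define FIFO_JUMP_FLAG 0x20000000u

/* Hypothetical stand-in for the real push buffer: a word array plus a write index. */
typedef struct {
    uint32_t *words;
    size_t    capacity; /* size in 32-bit words */
    size_t    wpos;     /* index of the next word to write */
} push_buffer_t;

/* Append one raw word to the buffer (toy version of OUT_RING). */
static void out_ring(push_buffer_t *pb, uint32_t word)
{
    assert(pb->wpos < pb->capacity);
    pb->words[pb->wpos++] = word;
}

/* Build the jump word; the target must not collide with the flag bit. */
static uint32_t make_jump(uint32_t address)
{
    assert((address & FIFO_JUMP_FLAG) == 0);
    return FIFO_JUMP_FLAG | address;
}

/* A jump is a single word: no subchannel id, no tag mask. */
static void emit_jump(push_buffer_t *pb, uint32_t address)
{
    out_ring(pb, make_jump(address));
}
```

Calling `emit_jump(&pb, target)` appends exactly one word, which matches the remark that the command carries no subchannel id or tag mask.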

cnlohr · Joined: 05 Feb 2006 Posts: 24

Thank you for your prompt response. I will continue to follow your project closely, as well as experimenting and continuing learning about this on my own time.

And, I know you've heard it a lot but:

Thank you for your excellent work, making 3D possible for all of us!

cnlohr · Joined: 05 Feb 2006 Posts: 24

Hmm -- whenever I perform the following code:

IronPeter · Joined: 06 Aug 2007 Posts: 207

Just put OUT_RING( 0x20000000 | 0xe1f0000 ); in the simple triangle code before fifo_push, and watch the nice picture :).

This loop can be broken by leave_direct ( after 3 seconds timeout or by Ctrl-C signal ).

It is a very good idea to loop your push buffer. You can create a semaphore like jump: goto jump; and release this semaphore from the CPU just by writing 0 over it. You can also use GPU DMA to insert this semaphore into the push buffer.

Any access to wptr/rptr is very slow.
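The "jump: goto jump;" semaphore can be modelled in a few lines. This is a toy sketch under loud assumptions: addresses are byte offsets into a local array rather than GPU-visible addresses, every non-jump word is treated as a one-word command, and a zero word is simply skipped (which is what the post implies). `place_spin`, `release_spin` and `toy_fetch` are made-up names.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define FIFO_JUMP_FLAG 0x20000000u /* jump encoding from the posts above */

/* Park the fetcher: write a jump whose target is the word's own offset. */
static void place_spin(volatile uint32_t *buf, size_t nwords, uint32_t offset)
{
    assert(offset % 4 == 0 && offset / 4 < nwords);
    assert((offset & FIFO_JUMP_FLAG) == 0);
    buf[offset / 4] = FIFO_JUMP_FLAG | offset;
}

/* Release it: the CPU overwrites the jump with 0. */
static void release_spin(volatile uint32_t *buf, size_t nwords, uint32_t offset)
{
    assert(offset % 4 == 0 && offset / 4 < nwords);
    buf[offset / 4] = 0;
}

/* Toy model of command fetch (NOT real hardware behaviour): follow jumps,
 * step over everything else. Returns the byte offset reached after at most
 * max_steps fetches or when the end of the buffer is passed. */
static uint32_t toy_fetch(const volatile uint32_t *buf, size_t nwords,
                          uint32_t start, unsigned max_steps)
{
    uint32_t get = start;
    while (max_steps-- > 0 && get / 4 < nwords) {
        uint32_t word = buf[get / 4];
        if (word & FIFO_JUMP_FLAG)
            get = word & ~FIFO_JUMP_FLAG; /* jump: fetch continues at target */
        else
            get += 4;                     /* anything else: next word */
    }
    return get;
}
```

In the toy model the fetcher stays parked on the spin word until `release_spin` is called, without the CPU ever touching wptr or rptr. On real hardware the CPU write would also have to become visible to the GPU, which this sketch does not address.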

cnlohr · Joined: 05 Feb 2006 Posts: 24

Oh man, I apologize. My problem was that I wasn't updating wptr and rptr before performing the jump.
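For anyone hitting the same crash near the end of the buffer, the wrap-around can be sketched like this. It is a simplified model, not the ps3gpu.c code: `fifo_t`, `fifo_reserve` and the `gpu_base` field are hypothetical, and the sketch does not check whether the GPU has already consumed the start of the buffer before the CPU starts overwriting it.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define FIFO_JUMP_FLAG 0x20000000u /* jump encoding from the posts above */

/* Hypothetical CPU-side view of the command FIFO. */
typedef struct {
    uint32_t *words;
    size_t    capacity; /* size in 32-bit words */
    size_t    wpos;     /* CPU write index, in words */
    uint32_t  gpu_base; /* address the GPU uses for words[0]; setup-specific */
} fifo_t;

/* Make sure `needed` words fit before the end of the buffer. If they do not,
 * emit a jump back to the start and reset the CPU-side write index.
 * Returns 1 if the buffer wrapped, 0 otherwise. */
static int fifo_reserve(fifo_t *f, size_t needed)
{
    assert(needed + 1 <= f->capacity);
    assert(f->wpos < f->capacity);

    /* Always keep one word spare for the jump itself. */
    if (f->wpos + needed + 1 <= f->capacity)
        return 0;

    f->words[f->wpos] = FIFO_JUMP_FLAG | f->gpu_base;
    f->wpos = 0; /* the write pointer must follow the jump */
    return 1;
}
```

The important part is the last two statements: the jump word goes in first, and the CPU-side write position is moved to match where the GPU will continue fetching.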

Even with executing all of the unnecessary setup code every frame, I'm getting 200+ FPS. I have a wonderful animated spinning bunch of triangles.

I was able to use the built-in vsyncing stuff to lock the image to 60FPS. I haven't figured out a clean way of using the display flipping using lv1_gpu_context, as I can't find anywhere in the program where the context_handle is exposed, unless it's the ctrl member of 'gpu'. I'll mess around more in the morning.

Thanks for all the help.

cnlohr · Joined: 05 Feb 2006 Posts: 24

I still can't seem to figure out how to get the GPU context and call the lv1 gpu command from userspace as root (i.e. from the simple_triangle demo). If anyone has a demo for calling those functions from outside the kernel space, any help would be appreciated. *Edit* I'm sure people other than IronPeter have got to know the answer. I feel bad taking up his valuable time answering my noob questions.

But, I was just playing around, and even with jumping and clearing the buffers, I was able to get 4,400 FPS drawing the three triangles in immediate mode. When not clearing the screen, the number was around 22,000 FPS.

I tried doing the jump every frame vs every 20 frames. The speed difference was virtually non-existent (4.484s vs 4.487s (averaged over 3 runs) for 100,000 frames). I was under the impression jumping and messing with those variables would be slow.

ralferoo · Joined: 03 Mar 2007 Posts: 122

IronPeter · Joined: 06 Aug 2007 Posts: 207

Do not worry about my time, it is fun for me.

To call an lv1 function from userspace it is a good idea to use a driver ioctl ( you can refer to glaurung's patch to ps3fb.c ). Not very fast. ~10K CPU ticks for one single ioctl call, I think.

Jumps in push buffer are our friends.

Keep in mind that we are completely missing L3 caching ( so-called TILES ), Z and color compression. We will get a good speedup with these features enabled.

We are also missing swizzle for the framebuffer; it is great for locality and caching.

It is great if you can test something. Glaurung wrote:

cnlohr · Joined: 05 Feb 2006 Posts: 24

Wow, that was easy. Thanks for the suggestion to use IOCTL to use context_attribute, I can now look at arbitrary locations easily.

I can't seem to change the COLOR0 offset. It seems like no matter where I put the following code, the actual location of where the output is drawn remains the same. Is this the wrong method for changing the offset of the draw buffer?

cnlohr · Joined: 05 Feb 2006 Posts: 24

Ok, I tried modifying the value going into the function in ps3fb.c:1370. I couldn't seem to make it run without crashing. And, I must be going about this the wrong way; each modify->compile->reboot cycle takes upwards of 15 minutes for me. Also -- when I tried making it a module, I couldn't seem to get the init3d command to work.

IronPeter · Joined: 06 Aug 2007 Posts: 207

Use only lv1_gpu_memory_allocate( 0, 0, 0, 0, 0 ... ); for context memory.

After the context is allocated you can make a sequence of lv1_gpu_memory_allocate calls ( make these calls via ioctl ). The hypervisor will place memory regions sequentially from the zero offset. Memory allocation seems to alter global GPU state by design.

So allocate the context as usual, then call lv1_gpu_memory_allocate( screen_size_up_to_megabyte, ?, ?, ?, ? ) via ioctl and test performance. Deallocate, repeat tests :).

I think modular build is possible. But I did not try it. Good ioctl is enough for me.

cnlohr · Joined: 05 Feb 2006 Posts: 24

Ok. I think I get it. I will test it in about 12 hours. I didn't really understand what you were asking for earlier.

So, if I get this right:

Never change the way the allocate( 0,0,0,0,0...) call in the kernel works.

Just modify the un-vsynced, un-flipped animated program that just executes a few thousand frames.

Add in the program, before the gfx_test call, a call to lv1_gpu_memory_allocate( <various sizes>, <various numbers>, <various numbers>, <various numbers>, handle, lpar ) via IOCTL call.

Run program over fixed number of frames (probably 5000.) And record total running time.

Modify parameters to the allocate function in code, recompile, re-run test.

If I find any interesting results (i.e. the timings are not always exactly the same) I will graph the results.
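The timing part of this plan could look something like the sketch below. It only covers "run a fixed number of frames and record the total time"; the lv1_gpu_memory_allocate ioctl is deliberately left out. `time_frames`, `frames_per_second` and `count_frame` are made-up names, and `count_frame` is just a placeholder for whatever really submits a frame. It uses C11 `timespec_get`, which is wall-clock time and not a monotonic clock.

```c
#include <assert.h>
#include <stddef.h>
#include <time.h>

/* Callback that renders one frame; `user` is passed through untouched. */
typedef void (*frame_fn)(void *user);

/* Run `frames` calls of `render` and return the elapsed wall-clock seconds. */
static double time_frames(frame_fn render, void *user, unsigned frames)
{
    struct timespec t0, t1;
    assert(render != NULL);

    timespec_get(&t0, TIME_UTC);
    for (unsigned i = 0; i < frames; ++i)
        render(user);
    timespec_get(&t1, TIME_UTC);

    return (double)(t1.tv_sec - t0.tv_sec)
         + (double)(t1.tv_nsec - t0.tv_nsec) / 1e9;
}

/* Convert a frame count and a duration into frames per second. */
static double frames_per_second(unsigned frames, double seconds)
{
    return seconds > 0.0 ? (double)frames / seconds : 0.0;
}

/* Placeholder frame: only counts calls. A real test would push the
 * triangle commands here instead. */
static void count_frame(void *user)
{
    ++*(unsigned *)user;
}
```

Running each allocation size several times and comparing the totals, as described above, would then just be a loop around `time_frames` with the allocate and deallocate calls on either side.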

If I misunderstood anything, it'd be great if you can correct me.

I'm in the EST time zone in the USA, so it's about 4:30 AM here and I have classes tomorrow, so I'm not going to get to this until the evening.

Once again, thanks for the help, and with any luck, I may be able to contribute :).

IronPeter · Joined: 06 Aug 2007 Posts: 207

Ok, everything is ok.

> And record total running time.

and unmap the old memory.

cnlohr · Joined: 05 Feb 2006 Posts: 24

Understood. I am not used to systems that don't automatically relinquish allocations and rights upon exit, it's going to take some getting used to on my part.

cnlohr · Joined: 05 Feb 2006 Posts: 24

Ok, I ran a bunch of tests with various sizes. I ran all tests at least twice, some four times. I was running it with all the triangles (Animated immediate mode and index buffer'd ones, all textured, no vsync or double buffering over 1000 frames)

IronPeter · Joined: 06 Aug 2007 Posts: 207

A negative result is also a result :).

I'll dig into it more deeply. Thank you for the hard work.

I can explain why this memory allocation routine is interesting for me.

NV40 class hardware has some "channels" of L3 cache memory ( 16 + 8 ? ). Each channel can be mapped to a memory region; you can assign the amount of cache memory dedicated to that channel, define compression flags, etc...

I think that these performance tunings are important for large scenes with posteffects, HDR rendering, etc. Not critical for now, only critical if we want to beat commercial titles :).

cnlohr · Joined: 05 Feb 2006 Posts: 24

Wouldn't it be necessary to allocate small chunks for each thing (depth, texture, framebuffer) and then set the offsets (NV40TCL_ZETA_OFFSET, NV40TCL_COLOR0_OFFSET, NV40TCL_TEX_OFFSET, etc.)? I am certainly not doing that yet.

Should I try to run another test with the last two values being different, and allocating two separate large chunks (one for FB, other for Z)? Or does that have nothing to do with it?

IronPeter · Joined: 06 Aug 2007 Posts: 207

You may try, but the performance difference must be noticeable with "tiled memory" in any case.

It seems like a broken interface from Sony.

Not a really critical thing. It is better to tame DXT textures, vertex streams, shaders.

cnlohr · Joined: 05 Feb 2006 Posts: 24

Most of my experience is in higher level programming, IE game engines, games, tools, etc.

What has interested me most right now is trying to come up with some assembler for the NV Fragment/vertex stuff. Or writing a higher level C++ engine that on its back-end directly performs the RING_ calls.

Since the first thing is kinda required for the second, I guess I have more interest in the first.

I noticed that the Nouveau dumps,

http://nouveau.freedesktop.org/tests/g70-00f5/card_10de-00f5_test_nv_fragment_program.txt.gz
and
http://nouveau.freedesktop.org/tests/g70-00f5/card_10de-00f5_test_nv_vertex_program2.txt.gz

are both very complete in the dump analysis. I was wondering if anyone knew offhand what that system is like, and if I could use it to my advantage, instead of manual transcoding. *EDIT: the renouveau CVS has a treasure trove of awesome stuff*

I have particular interest in supporting cgc (yes, I know it's intel-architecture only) simply because in my many run-ins with it, in almost every case, the assembly it put out was extremely good and tight. (Heck, once or twice, it even out-optimized me)

Being able to simply transcode the nv30 to nvidia binary seems like something that would be extremely useful. I noticed some talking about it in the other thread. Has anyone really dug into this?

IronPeter · Joined: 06 Aug 2007 Posts: 207

>Has anyone really dug into this?

Nouveau did. There is a link in the other thread to the full-featured NV_fp / NV_vp assembler. The Nouveau project has many branches; you must dig through these branches for information.

The fragment program assembler is simple: operation opcodes, src swizzles and result transforms, register opcodes, constants, stop bit, number of temporary registers. That's all.

You may start with the nouveau assembler ( MIT licensed ). Or write your own assembler; with yacc or antlr it is relatively easy.

The relatively hard thing is register compactification. You must reduce the number of temporary registers on NV40 hardware. Any program must be annotated with that number during setup.

You are welcome to commit to libps3rsx ( if you agree with MIT license terms ). You may send patches with your animated demo to me and I'll commit these patches. Or I can ask the admin to grant you svn write access.

I want to reinstall Linux and my ps3 will be closed for coding for a few days. After that I'll refactor libps3rsx into { more } usable form.

cnlohr · Joined: 05 Feb 2006 Posts: 24

Aah,

http://gitweb.freedesktop.org/?p=mesa/mesa.git;a=tree;f=src/mesa/drivers/dri/nouveau

I haven't looked into it too deeply but I am really confused already. I guess it will just take time. As far as I can tell though (which isn't very far), it looks pretty much fully featured for a shader assembler.

About 70% of everything I do is MIT, 20% New BSD, and 10% GNU. So, I have no problem putting my work under the MIT license, especially since something like AFL or GNU would put unacceptable restrictions on the work.

Why are you trying to do register compactification if we're already dealing with the assembly code? Wouldn't whatever compiler that takes us from high level (CG or GLSL) code handle the compactification for us?

And you say the number has to be reduced -- does the RSX have less temp registers than other NV40 chipsets or something?

If I can strip out and strip down the shader stuff from Nouveau effectively, then I may want to either tar you the package or have write access. I expect it to take me about a week to get a good handle on the code.

If anyone else wants to work on this as well and does it better or faster than me, I won't feel bad if my work doesn't get used.

EDIT: PS: I am starting to get excited about the prospect of writing a C++ game engine that isn't middleware.

IronPeter · Joined: 06 Aug 2007 Posts: 207

Ok, if we trust in CG we do not need register optimization.

>And you say the number has to be reduced -- does the RSX have less temp registers than other NV40 chipsets or something?

I do not think so. There is a pretty full article about NV40 hardware http://www.digit-life.com/articles2/gffx/nv40-part1-a.html

cnlohr · Joined: 05 Feb 2006 Posts: 24

Ok, it looks like my plan of attack is to use most of the header information from the nouveau mesa driver, parse the files myself (since before, mesa did that), and pop out a linked list of sorts of the torn-apart shader.

In doing so, it will get all of the opcodes collected in that list. Each element can represent one instruction. For instructions that cannot fit in a single opcode, I will generate multiple nodes, and string them together.
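A list like that could be laid out roughly as follows. This is only a sketch of the data structure, not anything taken from Nouveau: `shader_node`, `shader_list` and the helper functions are made-up names, and storing each node as four raw 32-bit words is an assumption about the encoded instruction size.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define WORDS_PER_NODE 4 /* assumed size of one encoded instruction */

/* One encoded hardware instruction. */
typedef struct shader_node {
    uint32_t words[WORDS_PER_NODE];
    int      continues; /* nonzero: next node is part of the same source instruction */
    struct shader_node *next;
} shader_node;

typedef struct {
    shader_node *head;
    shader_node *tail;
    size_t       count;
} shader_list;

/* Append one node; returns NULL if allocation fails. */
static shader_node *shader_list_append(shader_list *list,
                                       const uint32_t words[WORDS_PER_NODE],
                                       int continues)
{
    shader_node *n = malloc(sizeof *n);
    if (n == NULL)
        return NULL;
    memcpy(n->words, words, sizeof n->words);
    n->continues = continues;
    n->next = NULL;
    if (list->tail != NULL)
        list->tail->next = n;
    else
        list->head = n;
    list->tail = n;
    list->count++;
    return n;
}

/* Copy all nodes into one flat opcode stream. Returns the number of words
 * written, or 0 if `out` is too small. */
static size_t shader_list_flatten(const shader_list *list,
                                  uint32_t *out, size_t out_words)
{
    size_t pos = 0;
    if (list->count * WORDS_PER_NODE > out_words)
        return 0;
    for (const shader_node *n = list->head; n != NULL; n = n->next) {
        memcpy(out + pos, n->words, sizeof n->words);
        pos += WORDS_PER_NODE;
    }
    return pos;
}

/* Free every node and reset the list. */
static void shader_list_free(shader_list *list)
{
    shader_node *n = list->head;
    while (n != NULL) {
        shader_node *next = n->next;
        free(n);
        n = next;
    }
    list->head = list->tail = NULL;
    list->count = 0;
}
```

A source instruction that needs several hardware instructions becomes consecutive nodes with `continues` set on all but the last, and `shader_list_flatten` produces the stream that would eventually be uploaded.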

I'm going to focus on fragment programs first, then vertex programs.

I don't know how fully functional my code will be to begin with. But hopefully in a week or two, I will be able to put shader code in one side, and on the other side I will get a series of these structures that I can synthesize the opcode stream with.

One major note: Do you want me to do this using the NV assembly shading language or the GLSL Assembly shading language? Everything in Nouveau is set up for the GLSL Assembly shading language so it would be easier to code. Note that the NV asm shading language does have a little bit more expanded functionality.

I'm currently working using the GLSL Asm shading language, since that's where most of the work has been done for me.

Additionally -- if I were working in the environment of the full Nouveau-Mesa implementation, this process would be much easier, since all of the shader parsing and opcodizing would be done for us. Are you sure we don't want to try to mod the Nouveau-Mesa drivers?

I'd understand the cons: being slower, having more overhead, doing a lot of stuff we don't want, etc. So, I can understand why you probably wouldn't want to do it, but I'm just throwing it out there.

IronPeter · Joined: 06 Aug 2007 Posts: 207

Hi. Your only chance to survive is to be at very low level.

I have 10 years of experience with Mesa. When Voodoo2 launched, the Mesa roadmap was "HW accelerated Quake in a year". Now the Mesa/Nouveau roadmap is "HW accelerated Quake3 in a year".

You will die fixing bugs with GLSL.

You will die with high-level concepts also.

One small example. A high-level interface has a SetPixelShaderConstant function. That's ok, but NV40 hw does not have pixel shader constants. Pixel shader constants are embedded in the pixel shader body. To set a constant you must rewrite every location where it appears ( after each use of that constant ) in the shader microcode. You have two possibilities.

1.) Make fragment shaders double-buffered and patch the microcode with the CPU. What about many SetPixelShaderConstant calls per frame? Die...

2.) Patch shader constants with the GPU. Via DMA or 2D blitting. Die...

The solution is simple. Keep synchronization on the user side. User can use many instances of pixel shader, patching by CPU and fencing. User can use one instance for immutable shader. User can patch shaders via blit.
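CPU-side patching of one shader instance might be sketched like this. It is an illustration only: `fp_constant` and `fp_patch_constant` are invented names, the table of word offsets would have to come from whatever assembles the shader, and the sketch ignores any byte ordering the hardware expects for the microcode. Fencing (making sure the GPU is not using this instance while it is patched) stays on the user side, as described above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define MAX_CONST_SITES 8 /* arbitrary limit for this sketch */

_Static_assert(sizeof(float) == sizeof(uint32_t), "sketch assumes 32-bit float");

/* One constant of one shader instance: the constant is embedded in the
 * microcode, so we remember every word offset where a copy of it lives. */
typedef struct {
    uint32_t *microcode;              /* CPU-visible copy of the shader body */
    size_t    nwords;                 /* size of the body in 32-bit words */
    size_t    sites[MAX_CONST_SITES]; /* word offsets of each embedded copy */
    size_t    nsites;
} fp_constant;

/* Rewrite all embedded copies of the constant with a new 4-float value. */
static void fp_patch_constant(fp_constant *c, const float value[4])
{
    assert(c->nsites <= MAX_CONST_SITES);
    for (size_t i = 0; i < c->nsites; ++i) {
        assert(c->sites[i] + 4 <= c->nwords);
        memcpy(&c->microcode[c->sites[i]], value, 4 * sizeof(float));
    }
}
```

With several instances of the same shader, the user would patch an instance the GPU has finished with while another instance is still in flight.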

cnlohr · Joined: 05 Feb 2006 Posts: 24

So, don't code using mesa but do code for GLSL ARB ASM?

IronPeter · Joined: 06 Aug 2007 Posts: 207

I think, do code for NV_FRAGMENT/VERTEX_PROGRAM. It is good low-level.

CG has NV_FRAGMENT/VERTEX_PROGRAM output, so we can have full toolchain.

ps2devman · Joined: 09 Oct 2006 Posts: 265

Iron Peter, do you feel a high level language is missing out there?
I mean one, that is well adapted to describe what will become both NV_FRAGMENT/VERTEX_PROGRAM binary micro-code and the code on CPU (or SPU) side that synchronizes well with it.

Feel free to keep on being our locomotive and just describe this language to create the binary micro-code and syncing code matching it. All coders will help to build the compiler of such language once spec exists.

Maybe, Cgc may insert itself as a lower part of the micro-code side chain.

IronPeter · Joined: 06 Aug 2007 Posts: 207

ps2devman, development is now at a very early stage.

At this moment I know that the NV_FP/NV_VP language is very close to the hardware. It is a very good idea to write an assembler for it ( not very hard ).

Also I want some high-level features. These things will be critical in the production code.

For fragment programs it is static dispatching. For example, we want to have two versions of a shader: with fog and without fog. A good idea is to precompile these 2 versions of the shader. Also there are many other switches for a material like specular mapping, bump mapping, envmapping, selfillum, etc. You will have a multidimensional matrix of precompiled shaders for many switches. Without this feature you will be unable to develop fast production code with many materials/shaders.
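One way to picture the "multidimensional matrix" is a table indexed by a bitmask of material switches. The sketch below is an assumption about how that could be organized, not a libps3rsx interface; the `MAT_*` flags, `fp_variant`, `fp_variant_table` and both helpers are made-up names.

```c
#include <assert.h>
#include <stddef.h>

/* One bit per material switch named in the post. */
enum {
    MAT_FOG       = 1u << 0,
    MAT_SPECULAR  = 1u << 1,
    MAT_BUMP      = 1u << 2,
    MAT_ENVMAP    = 1u << 3,
    MAT_SELFILLUM = 1u << 4,
    MAT_SWITCH_COUNT = 5
};

#define MAT_VARIANTS (1u << MAT_SWITCH_COUNT) /* 2^5 combinations */

/* One precompiled fragment program. */
typedef struct {
    const void *microcode;
    size_t      size;
} fp_variant;

/* Every switch combination maps to one slot. */
typedef struct {
    fp_variant variants[MAT_VARIANTS];
} fp_variant_table;

/* Register the precompiled shader for one switch combination. */
static void fp_table_set(fp_variant_table *t, unsigned switches,
                         const void *code, size_t size)
{
    assert(switches < MAT_VARIANTS);
    t->variants[switches].microcode = code;
    t->variants[switches].size = size;
}

/* Static dispatch: pick the variant for a material, no runtime branching in
 * the shader. Returns NULL if that combination was never precompiled. */
static const fp_variant *fp_table_lookup(const fp_variant_table *t,
                                         unsigned switches)
{
    assert(switches < MAT_VARIANTS);
    return t->variants[switches].microcode != NULL ? &t->variants[switches]
                                                   : NULL;
}
```

Five switches already give 32 variants, which is why precompiling them offline and selecting by index at draw time is attractive.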

For the vertex pipeline it is SPU geometry processing. We want to handle skeletal animation, vertex lighting, back face culling on the SPU. It is very high-level code. I can develop such a library in the future. A few months ago I coded full-featured SPU driven skeletal animation with COLLADA export; it worked just perfectly on real in-game models ( on software MesaGL :).

So do not worry about the high level, I have some homework to do. Our specs are NV_VP/NV_FP now.

IronPeter · Joined: 06 Aug 2007 Posts: 207

PS. Of course, I can write header file for NV_VP/NV_FP assembler. With shader compiling, setting and constants setting interfaces. If cnlohr wants I can do it.