Sunday, August 28, 2011

Changing blogs

The time has come: I'm switching blogs (again). Blogger was just too frustrating to use (the editor was a royal pain, images would break, etc.). I'm now using WordPress, over at

http://volcore.limbic.com

See you there!

PS: this blog and all its posts will stay here as an archive.

Monday, July 11, 2011

Fullscreen Motion Blur on iDevices with OpenGL ES 1+

Hey iDevBlogADay,

This, sadly, is my last post for this cycle, but I promise I'll be back. It's been a lot of fun being on the rotation, and it helped me a lot to share my findings.

However, for this final post, I've picked something special. A lot of people have asked me about this: how we do the motion blur in our latest game Nuts! and our very-soon-to-be-released game Zombie Gunship. The technique is by no means new, but its simplicity and the fact that it works so beautifully on the iDevices really seal the deal for me.

Showcase

First of all, let me give you a few arguments for why the motion blur is so cool.

In the case of Nuts!, it is actually pretty hidden. The only place where you can see it is when you pick up a fireball nut. But as you can see in the screenshots (and even more so when you play the actual game), the motion blur adds a lot of "speed" feeling to those nuts. The whole fireball effect is a lot more convincing with the motion blur. Interestingly, the blur is only used in those situations and runs at half the resolution of the rest of the game. But that is not noticeable, because of the temporal blurring. Even when the resolution switches back to the full 640x960, once the effect has worn off, there is no noticeable popping.


In the case of Zombie Gunship, the visuals of the whole game are in essence built around this effect. It gives the game that 80s warplane-targeting-computer look and artificial "imperfection". Also, as you can see in the screenshots, we're actually running at quite a low resolution (480x320), and the models are quite low-res as well. But with the motion blur the game looks a lot smoother, and it's harder to make out individual pixels.

Since it is a temporal blur by its nature, it is actually harder to see in screenshots :-)

How it's done

The best thing about this technique is that it's super simple. It even works in OpenGL ES 1, and like many post-processing effects it can be dropped into a game very easily.

In a traditional rendering setup on iOS, we would bind the final framebuffer, then draw the solid geometry, the blended geometry, and then the UI on top. Finally, we would present the renderbuffer and the frame is done.

With motion-blur, instead of rendering into the final framebuffer, we render into an intermediate framebuffer that renders into a color texture. For us, this buffer is usually half the size of the final framebuffer. Once we've rendered the solid and blended geometry into this buffer, we enable alpha blending and render this intermediate texture into a so-called accumulation buffer with an alpha value smaller than one. This accumulation buffer is only cleared when the blur begins. Finally, this accumulation buffer is then rendered into the final framebuffer.

In pseudocode, it looks something like this:

Traditional Rendering:

ActivateFinalFramebuffer();
Clear();
RenderScene();
RenderUI();
Present();

With Motion Blur:
ActivateIntermediateFramebuffer();
Clear();
RenderScene();
ActivateAccumulationFramebuffer();
// No clear here!
RenderIntermediateTextureWithAlpha(alpha);
ActivateFinalFramebuffer();
RenderAccumulationTexture();
RenderUI();
Present();
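To make that a bit more concrete, here is a rough OpenGL ES 1.1 sketch of the accumulation step, which is the only unusual part. Treat it as a sketch: the framebuffer and texture handles, the function name, and the assumption that the viewport and matrices are already set up are all placeholders, not our actual engine code.

#import <OpenGLES/ES1/gl.h>
#import <OpenGLES/ES1/glext.h>

// Fullscreen quad in normalized device coordinates (assumes identity matrices).
static const GLfloat kQuadVerts[] = { -1, -1,   1, -1,   -1, 1,   1, 1 };
static const GLfloat kQuadUVs[]   = {  0,  0,   1,  0,    0, 1,   1, 1 };

void AccumulateIntermediateTexture(GLuint accumulationFBO,
                                   GLuint intermediateTex,
                                   GLfloat alpha) {
    // Blend the intermediate color texture into the accumulation buffer.
    // Important: do NOT clear the accumulation buffer here; only the very
    // first frame of the effect starts fresh (see the remark below about
    // using alpha = 1 on that frame).
    glBindFramebufferOES(GL_FRAMEBUFFER_OES, accumulationFBO);
    glMatrixMode(GL_PROJECTION); glLoadIdentity();
    glMatrixMode(GL_MODELVIEW);  glLoadIdentity();
    glEnable(GL_BLEND);
    glBlendFunc(GL_SRC_ALPHA, GL_ONE_MINUS_SRC_ALPHA);
    glColor4f(1.0f, 1.0f, 1.0f, alpha);   // smaller alpha => longer blur trail
    glEnable(GL_TEXTURE_2D);
    glBindTexture(GL_TEXTURE_2D, intermediateTex);
    glEnableClientState(GL_VERTEX_ARRAY);
    glEnableClientState(GL_TEXTURE_COORD_ARRAY);
    glVertexPointer(2, GL_FLOAT, 0, kQuadVerts);
    glTexCoordPointer(2, GL_FLOAT, 0, kQuadUVs);
    glDrawArrays(GL_TRIANGLE_STRIP, 0, 4);
    glDisable(GL_BLEND);
}

The final composite into the on-screen framebuffer is then just the same quad draw with the accumulation texture bound and blending disabled.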

As you can see, you "just" need to add a few calls to your -(void) draw method in order to add the motion blur, and you can turn it on and off on the fly.

The smaller the alpha, the longer the blur, because less of each pixel is "overwritten" every frame. In the frame a value is written, its contribution to the final pixel is alpha; one frame later it is alpha*(1-alpha), then alpha*(1-alpha)^2, and so on, so it slowly fades out over time. For example, with alpha = 0.25, a frame's contribution decays to roughly a third of its initial weight after four frames.

Of course, alpha can be varied every frame. We use that in Nuts! to slowly fade out the fireball effect at the end.

Two small remarks

One simple idea for optimization would be to use the final framebuffer as the accumulation buffer. This would save us one full-screen quad rendering operation. However, the framebuffer on iOS is at least double buffered. That means every second frame has a different render target, which leads to a very choppy and mind-twisting blur effect. Also, if you want to display non-blurred components, such as UI and text, they should be rendered into the final framebuffer, after the accumulation buffer has been rendered.

Another thing to note is that the first frame needs to use alpha=1, e.g. when the fireball nut is picked up in Nuts!. This makes sure the accumulation buffer is properly initialized and doesn't contain stale data from a previous blur.

Conclusion

If you like what you read, consider following the official Limbic Software twitter account and of course buying our great game Nuts! :-)

Cheers, see you next time!

Monday, June 27, 2011

Multithreaded Rendering in iOS Pt. 2

Hey #iDevBlogADay,

This is an update to my previous post about multi-threaded rendering, and some thoughts about leveraging the A5 processor in the iPad 2.

The Update

I've finally managed to track down the issues that broke the multi-threaded rendering. It turned out that, for reasons unknown to me, the [context renderbufferStorage:fromDrawable:] call has to be performed on the main thread. If it isn't, it will simply not work.
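In case it helps, here is a rough sketch of one way to force that call onto the main thread. The ivar names (context, colorRenderbuffer) and the surrounding view code are placeholders, so don't take it as the exact LimbicGL implementation.

#import <QuartzCore/QuartzCore.h>
#import <OpenGLES/EAGL.h>
#import <OpenGLES/ES2/gl.h>

// Assumed to be called from the render thread; dispatch_sync onto the main
// queue would deadlock if this were called from the main thread itself.
- (void)allocateStorageForLayer:(CAEAGLLayer *)layer {
    dispatch_sync(dispatch_get_main_queue(), ^{
        [EAGLContext setCurrentContext:context];
        glBindRenderbuffer(GL_RENDERBUFFER, colorRenderbuffer);
        [context renderbufferStorage:GL_RENDERBUFFER fromDrawable:layer];
        [EAGLContext setCurrentContext:nil];  // hand the context back to the render thread
    });
}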

After I found this out, I was able to get all my methods to work, and I added a new method. Here is a little summary:

  • The single-threaded method does everything on the main thread and is just for reference.
  • The GCD method uses a display link on the main thread to kick off the rendering on a serial GCD queue that runs on another thread. Display link events may get dropped if the main thread is busy.
  • The threaded method uses a display link on a separate thread that kicks off the rendering on the same thread. Display link events may get dropped when the rendering takes too long.
  • The threaded GCD method combines the GCD and threaded methods. It runs a display link on a separate thread and kicks off the rendering into a serial GCD queue that runs on yet another thread. It is completely decoupled from the main thread, and the rendering doesn't block the display link either. Hence, the display link should be very reliable.
I didn't conduct any real performance measurements to see which method is better. However, I personally like the last approach. It should minimize blocking, and one nice benefit is that it is very easy to count frame drops (the GCD queue is still busy when the display link fires again).
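For illustration, here is a rough sketch of the threaded GCD setup, with the frame-drop counting folded in. The class and queue names are made up for this post; the actual code is in the LimbicGL repository linked below.

#import <QuartzCore/QuartzCore.h>

@interface RenderDriver : NSObject {
    dispatch_queue_t renderQueue;
    dispatch_semaphore_t frameAvailable;
    NSUInteger droppedFrames;
}
- (void)start;
@end

@implementation RenderDriver

- (void)start {
    renderQueue = dispatch_queue_create("com.example.render", DISPATCH_QUEUE_SERIAL);
    frameAvailable = dispatch_semaphore_create(1);
    // A dedicated thread whose only job is to run the display link's run loop.
    [NSThread detachNewThreadSelector:@selector(displayLinkThread:)
                             toTarget:self
                           withObject:nil];
}

- (void)displayLinkThread:(id)unused {
    CADisplayLink *link = [CADisplayLink displayLinkWithTarget:self
                                                      selector:@selector(frame:)];
    [link addToRunLoop:[NSRunLoop currentRunLoop] forMode:NSDefaultRunLoopMode];
    [[NSRunLoop currentRunLoop] run];  // runs forever; stopping it cleanly is another story
}

- (void)frame:(CADisplayLink *)link {
    // If the previous frame is still rendering, count a dropped frame instead
    // of piling up more work on the queue.
    if (dispatch_semaphore_wait(frameAvailable, DISPATCH_TIME_NOW) != 0) {
        droppedFrames++;
        return;
    }
    dispatch_async(renderQueue, ^{
        // [self renderOneFrame];  // bind context, draw, present
        dispatch_semaphore_signal(frameAvailable);
    });
}

@end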

In addition to getting it to work, I've also added a very simple asynchronous .pvr texture loader.

The code is available at https://github.com/Volcore/LimbicGL .

The Thoughts

Based on the above results, I've been thinking about how to write a renderer that properly utilizes the A5 chip.

Before the A5, we had to balance three principal systems: the CPU, the tiler (which transforms the geometry and throws it at the rendering tiles), and the renderer (which renders the pixels for each tile).

Balancing between tiler and renderer is app-dependent and somewhat straightforward: if the tiler usage is low, we can use higher-poly models "for free". And if the renderer usage is low, we can do more pixel shader magic. If both are low and the game runs slowly, it's probably CPU-bound.

Now, with the A5, there is an additional component in the mix: a second CPU core. The golden question is: how can we use this in a game effectively?

Here are some of my ideas:

  • Run the game update and the rendering in parallel. This requires double buffering of the game data, either by flip-flopping, or by copying the data before every frame. Interestingly, this works well with the threaded GCD approach from above. We can just kick off a game update task for the next frame into a separate serial GCD queue at the same time we render the current frame, and they both run in parallel. (A rough sketch of this follows right after this list.)
  • After the game update is done (this should only take a fraction of a frame unless you do some fancy physics), we can pre-compute some rendering data:
  1. View Frustum Culling, Occlusion Culling, etc
  2. Precompute skinning matrices, transformations
  3. CPU skinning. Instead of handling the forward kinematics skinning in the tiler on the GPU, we could run it on the CPU. This is more flexible, since we're not bound to the limits of the vertex shaders (the limit on the number of matrices comes to mind). I'm uncertain about the performance benefits here. It's a trade-off between CPU and DMA memory bandwidth versus tiler usage. I think this may pay off very well in situations where one mesh is rendered several times (shadow mapping, deferred shading without multiple render targets, multi-pass algorithms in general). One of the biggest drawbacks is that the memory usage is (#instances of mesh * size of mesh) versus just one instance.
  4. Precompute lighting with methods such as spherical harmonic lighting, where the results can be baked into the vertex colors. This could even run over several frames, and then only be updated at a certain rate (e.g. every 10 frames).
  5. Procedural meshes and textures. This is interesting, and mostly depends on a fast memory bandwidth, which the A5 should provide.
  • Asynchronous loading of data (textures, meshes). This is mostly limited by IO though, but some interesting applications (such as re-encoding, compression) come to mind.
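To make the first idea a bit more concrete, here is a rough sketch of flip-flopped game state with the update and the render on two serial GCD queues. GameState, UpdateGame, and RenderGame are made-up placeholders, and a real implementation would additionally have to make sure the update for a frame has actually finished before that frame is rendered (e.g. with a semaphore).

#import <Foundation/Foundation.h>

// Placeholder game state; the real struct would hold whatever the renderer
// needs to draw one frame.
typedef struct {
    float cameraAngle;
    // ...
} GameState;

// Placeholders for the real engine functions.
extern void UpdateGame(GameState *next, const GameState *current);
extern void RenderGame(const GameState *current);

static GameState states[2];           // flip-flopped double buffer
static dispatch_queue_t updateQueue;  // both created once with
static dispatch_queue_t renderQueue;  // dispatch_queue_create(..., DISPATCH_QUEUE_SERIAL)

void FrameTick(unsigned frameIndex) {
    GameState *current = &states[frameIndex & 1];        // already simulated, ready to draw
    GameState *next    = &states[(frameIndex + 1) & 1];  // simulated while we draw

    // Simulate the next frame on one core while the other core renders the
    // current one.
    dispatch_async(updateQueue, ^{ UpdateGame(next, current); });
    dispatch_async(renderQueue, ^{ RenderGame(current); });
}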
I'm going to try a few of these over the next month; I hope I'll have some nice results and insights :)

As my closing words: We live in exciting times for mobile GPU programming! <3

Wednesday, June 15, 2011

Multi-threaded OpenGL

I've finally gotten it to work. One GCD version and one NSRunLoop version.

It's all open source, check it out here: https://github.com/Volcore/LimbicGL

Monday, June 13, 2011

Multithreaded Renderer on iOS

Hey #iDevBlogADay,

You've probably seen this: You start a game. After the loading is complete, the game runs smoothly for a brief period of time, then it suddenly starts stuttering significantly for a few seconds, culminating in an automated banner: "Welcome back to GameCenter". If it's an action game, you may just have lost a life to the stutter.

Last week, I tried to investigate this, and a potential cause was revealed to me: GameCenter runs in the same NSRunLoop as everything else, on the main thread. Hence, when it connects and performs the SSL authentication, encryption, and decryption, it blocks the main thread. Apparently, this costs enough time to delay the rendering.

Not only was the cause revealed to me, but also a potential solution: put the entire OpenGL rendering into a separate thread, so it's not tied to the temper of the NSRunLoop and its many potential input sources. So I set out to try this.

Multithreading OpenGL Requirements

Writing a multithreaded OpenGL renderer isn't trivial. And due to my perfectionism, I wanted to do it right. That means:
  • Clean implementation
  • Fires exactly at display refresh, using CADisplayLink
  • As simple as possible, trying to avoid any low-level multithreading if possible
  • Since a single EAGLContext can only be used on one thread at a time, ideally everything should be only on the secondary thread, and no code should run on the main thread.
These requirements led me to my first approach.

Using GCD

I love Grand Central Dispatch (GCD). It's a great way to parallelize and defer code execution. And some tests I conducted showed that the overhead caused by blocks is tiny.

Hence, I created a new serial queue (as suggested by the Apple iOS OpenGL documentation), and essentially queued all calls to OpenGL in blocks on that queue, which runs on a different thread. One of the first problems that arose was that setting up CADisplayLink on that thread doesn't work, because the GCD queue threads don't have an NSRunLoop, which is what CADisplayLink uses. Hence, my display link callback wouldn't be called at all.

However, since CADisplayLink doesn't call any OpenGL code by itself, I moved it onto the main thread, and then dispatched the draw event onto the rendering queue from there. Now the callback got triggered at 60hz, as expected. But the rendering didn't work. I'm pretty sure I enforced the right EAGLContext on every draw event. And after a couple of dispatch_asyncs, the [context presentRenderbuffer:] function would stall for 1s at a time. I fiddled around a lot with this, but couldn't get it to work.

If I changed the rendering queue to run on the main thread (by using the main GCD queue), it magically worked well. But then everything was executed on the main thread, which set me back to the beginning. That's as far as I've gotten with the GCD approach.

Using a separate NSThread

My second attempt involved an NSThread. The setup was very simple: when the GLView was created, I created the new thread, and inside it I started an NSRunLoop. Then, I set up the CADisplayLink to run on this run loop. And it worked. The CADisplayLink fired reliably and the scene was rendered correctly. However, there was a small issue: there seems to be no (reliable) way to terminate the run loop. Hence, once started, I couldn't stop the rendering anymore. That's not really what I needed.

A classical approach

This is as far as I've gotten. The next thing I want to try is to create an NSThread that runs a very simple loop. It sleeps until a semaphore is signaled, then renders one frame, and then sleeps again. Then, on the main thread, I run the display link, and signal the semaphore whenever the display link fires. This is very old school, but at least I can turn it off at any time, and it appears to meet all the requirements I have.
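To sketch what I mean (purely illustrative names, and no error handling):

#import <QuartzCore/QuartzCore.h>

@interface ClassicRenderer : NSObject {
    dispatch_semaphore_t frameSignal;
    volatile BOOL running;
}
- (void)start;
- (void)stop;
@end

@implementation ClassicRenderer

- (void)start {
    frameSignal = dispatch_semaphore_create(0);
    running = YES;
    [NSThread detachNewThreadSelector:@selector(renderThreadMain:)
                             toTarget:self
                           withObject:nil];
    // The display link stays on the main run loop and only signals the semaphore.
    CADisplayLink *link = [CADisplayLink displayLinkWithTarget:self
                                                      selector:@selector(displayLinkFired:)];
    [link addToRunLoop:[NSRunLoop mainRunLoop] forMode:NSDefaultRunLoopMode];
}

- (void)displayLinkFired:(CADisplayLink *)link {
    dispatch_semaphore_signal(frameSignal);
}

- (void)renderThreadMain:(id)unused {
    while (running) {
        dispatch_semaphore_wait(frameSignal, DISPATCH_TIME_FOREVER);
        if (!running) break;
        // [self renderOneFrame];  // bind context, draw, present
    }
}

- (void)stop {
    running = NO;
    dispatch_semaphore_signal(frameSignal);  // wake the thread so it can exit
}

@end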

Summary

What seemed like a nice little afternoon project is starting to cost a lot of time. However, considering that moving the rendering to a separate thread seems like the best solution for the GameCenter login stuttering problem, and for any other stutter caused by high run loop latency, it seems worth the effort.

Once I've found a good solution, I plan to make it open-source, and also include my performance monitor thingie that I described in my last post.

To close this post, I'd like to ask everyone out there: Have you written a threaded renderer on iOS? How did you make it work? Did it work reliably?

Cheers,
Volker



Monday, May 30, 2011

Nuts performance levels

Hey iDevBlogADay,

Since my last blog post I've arrived in the Limbic HQ, Palo Alto, CA. We've also launched our latest game, Nuts!


Today I'm going to write about how we managed to make Nuts! a beautiful 3D game running at 60hz even on the 3GS. I'll go into the details of many common optimizations and try to analyze how much they actually gained us.

Measuring Performance

The most important thing for optimizing performance is a way to measure the performance, and its change as you modify the application.

The tool of choice for us is a little plug-in module we call the performance monitor. It records wall-clock render times, game update times, and idle times. Idle time is the period spent between update and the next render, usually sleeping, updating Cocoa, etc. You can see an annotated example here.

It's a very valuable tool: because it plots the performance of individual parts of the app over time, it helps you correlate performance events with potential causes. In our games, it's only compiled into the developer version, and it can be activated by double-tapping the top left corner of the screen.
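The core of it is just careful wall-clock timestamping. Here is a minimal sketch of the idea, with illustrative names and a plain ring buffer standing in for the actual plotting:

#import <QuartzCore/QuartzCore.h>  // CACurrentMediaTime()

#define kMaxSamples 256

typedef struct {
    double update;  // seconds spent in the game update
    double render;  // seconds spent rendering
    double idle;    // everything else until the next frame starts
} FrameSample;

static FrameSample samples[kMaxSamples];
static int sampleIndex = 0;
static double updateStart, renderStart, renderEnd;

void PerfFrameBegin(void) {
    // Close the idle slot of the previous frame and advance the ring buffer.
    if (renderEnd > 0.0) {
        samples[sampleIndex].idle = CACurrentMediaTime() - renderEnd;
        sampleIndex = (sampleIndex + 1) % kMaxSamples;
    }
    updateStart = CACurrentMediaTime();
}

void PerfUpdateDone(void) {
    renderStart = CACurrentMediaTime();
    samples[sampleIndex].update = renderStart - updateStart;
}

void PerfRenderDone(void) {
    renderEnd = CACurrentMediaTime();
    samples[sampleIndex].render = renderEnd - renderStart;
}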

30hz vs 60hz?

This is a question I'm very passionate about. Recently at the Santa Cruz iOS Game Dev meeting, the great Graeme Devine mentioned this as well. It's very important that your game runs smoothly. Although most users will not admit it, sluggishness, unresponsiveness, and stuttering in a game are huge factors for instantly putting it away.

Keeping this in mind, when you're at the early stage of a new project, and you have a working prototype, you need to make an important decision: Do you want to go for 30hz or 60hz?

Let's think about this for a second. If you decide to go for 30hz, it means you can have twice as much stuff in one frame, compared to a 60hz game.

Many people will also argue that no one will notice the difference between 30hz and 60hz. They could not be more wrong. It really depends on the game you're making. For Nuts!, we experimented with both 30hz and 60hz, and 30hz, although it had a very smooth and stable framerate, just didn't feel right. It was less responsive, and it didn't play well. Plus, people were more likely to get motion sickness from it, which is a big factor as the game involves a camera that is constantly rotating around a tree. Hence, we knew the game had to be 60hz, and we took this into consideration for all the art and further engineering.

For another game that we're currently working on, it's a completely different story. It is a very different kind of game, and 30hz is completely fine. And because we're at 30hz, we can show more stuff, and at higher quality.

General Engine Design

To start out, I'd like to give you a small overview of how our game and engine are structured. Our OpenGL ES 2.0 engine is really simple and "dumb"; it doesn't do any kind of automatic batch sorting. All we do is load models, which are objs plus a set of OpenGL states. There is only one shader, which is quite simple and highly optimized to do everything we need.

The problems

In the week before launch, the game actually ran pretty well, mostly holding 60hz on both the iPhone 4 and the 3GS. However, we had random stutters here and there that were really distracting and could even cause you to crash into a branch and lose.

Optimization 1: Vertex Array Objects

At WWDC last year, the Apple engineers recommended having a look at VAOs, as they can lead to a significantly reduced overhead when drawing a lot of batches. Hence, I went ahead and updated our engine. In principle, this is very easy, but there are some pitfalls and the implementation is very unforgiving. If you make a mistake, the code is very likely to crash, often by some form of memory corruption, deep inside the OpenGL code. After it all worked, we even saw a moderate performance gain, but it wasn't anything significant.

However, considering how simple this extension is, and how easily it can be built into an engine, I strongly recommend that everyone use it. There is nothing to lose here. UPDATE: Actually, there is something to lose. Every single VAO takes up 28 KiB of memory. For Nuts!, that's 2.5 MiB just for the VAOs. It heavily penalizes VBO animation. It seems to be a good combination with skeletal animation, though.
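For reference, here is a rough sketch of what using the extension looks like on iOS. The vertex layout (interleaved position + UV) and the attribute indices are placeholders for whatever your engine uses:

#import <OpenGLES/ES2/gl.h>
#import <OpenGLES/ES2/glext.h>

GLuint CreateMeshVAO(GLuint vbo, GLuint ibo) {
    GLuint vao = 0;
    glGenVertexArraysOES(1, &vao);
    glBindVertexArrayOES(vao);

    // All of this state gets captured by the VAO...
    glBindBuffer(GL_ARRAY_BUFFER, vbo);
    glBindBuffer(GL_ELEMENT_ARRAY_BUFFER, ibo);
    glEnableVertexAttribArray(0);  // position (attribute indices depend on your shader setup)
    glVertexAttribPointer(0, 3, GL_FLOAT, GL_FALSE, 20, (const void *)0);
    glEnableVertexAttribArray(1);  // uv
    glVertexAttribPointer(1, 2, GL_FLOAT, GL_FALSE, 20, (const void *)12);

    glBindVertexArrayOES(0);
    return vao;
}

// Drawing later is then just bind-and-draw:
//   glBindVertexArrayOES(vao);
//   glDrawElements(GL_TRIANGLES, indexCount, GL_UNSIGNED_SHORT, 0);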

Optimization 2: State Caching

Before my final optimization pass, we were already caching many states, so I can't really give any feedback on that. But we basically didn't cache any of the OpenGL ES 2.0 states: shaders, uniform bindings, uniforms, etc. In every draw call, we were re-enabling the same shader, loading all uniform locations for that shader, and setting them to the right values. That sounded very much like an opportunity to optimize.

However, after I implemented this, I did not notice any improvement in performance. I don't know if the driver is now "smart enough" to do the state caching itself, but it seems to not have much effect on the overall performance. As such, I would still recommend caching for any of the easy stuff (glEnable states for example), but caching each individual uniform value seems to be overkill.
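To be clear about what I mean by caching the easy stuff, here is a tiny sketch; the cached globals are illustrative, and a real engine would wrap this in its renderer state:

#import <OpenGLES/ES2/gl.h>

static GLuint currentProgram = 0;
static GLboolean blendEnabled = GL_FALSE;

void UseProgramCached(GLuint program) {
    if (program == currentProgram) return;  // skip the redundant GL call
    glUseProgram(program);
    currentProgram = program;
}

void SetBlendEnabledCached(GLboolean enabled) {
    if (enabled == blendEnabled) return;
    if (enabled) glEnable(GL_BLEND);
    else glDisable(GL_BLEND);
    blendEnabled = enabled;
}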

Optimization 3: Instruments

Instruments is a double-edged sword. On the one hand, I love the leak checking and the new driver analysis. On the other hand, I think the CPU and GPU performance monitors, and the driver analysis, are mostly useless. You may have noticed that I mentioned the driver analysis on both sides; that's because while it gives you a lot of cool insights and may catch a few bugs, it didn't offer much that actually made the rendering faster. For the most part, the things it was very obsessed about didn't have any effect at all. But that may also be because I've been doing this for too long.

Optimization 4: Alpha Sorting

Initially, we rendered the scene kind of arbitrarily. We would render the tree, the squirrel, then some transparent effect, then the branches. At that point, we were more concerned about depth-correct rendering than about performance. However, given the way the iPhone GPU works, it's actually more beneficial to completely separate the solid from the transparent rendering.

To help implement this, I added a two-pass mode to the engine. The first pass would only allow solid objects to be rendered, and it would complain if any rendering call tries to enable alpha blending. For the second pass, it was the other way around.

This actually helped the performance, especially in the peaks, which were sometimes caused by displaying a lot of transparent effects that would be alternated with solid render calls, like the fireball nuts and their particle effects.

I strongly recommend designing the whole renderer in this way. First, render all solid objects, then come back to render all non-solid objects. And enforce it, in case the artists try to be smart and fancy about something.
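A rough sketch of what that enforcement can look like (the Model struct and the assertion are illustrative, not our engine's actual interface):

#include <assert.h>
#import <OpenGLES/ES2/gl.h>

typedef enum { kPassSolid, kPassBlended } RenderPass;

typedef struct {
    GLboolean usesAlphaBlending;
    // ... vertex buffers, textures, shader state, etc.
} Model;

static RenderPass currentPass = kPassSolid;

void BeginPass(RenderPass pass) {
    currentPass = pass;
    if (pass == kPassBlended) glEnable(GL_BLEND);
    else glDisable(GL_BLEND);
}

void DrawModel(const Model *model) {
    // The solid pass must never enable blending; catch it loudly in debug builds.
    assert(!(currentPass == kPassSolid && model->usesAlphaBlending));
    // ... actual draw call goes here ...
}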

Optimization 5: High-level optimizations

By far the most significant improvement came from the higher-level optimizations. Usually, the performance issues came down to rendering too many things of one kind, or a model that was weirdly engineered to thrash the texture cache with every single one of its hundreds of triangles.

The performance monitor and A/B testing really helped a lot in pinpointing the causes and fixing them.

Also, often when you're getting stuttering, the performance monitor will tell you that it's because the frame time is just a little bit too long every other frame, so the system keeps missing one out of every few draw events.

One other important thing to note is that once you know what performance target and visual quality level you're aiming for, you should figure out the limits of what you can display, and enforce them. If you don't, players will most definitely take your game, clump up all the enemies in one spot, and blow them up in some crazy, unanticipated way that will completely destroy the performance. And it will become the norm. We learned that the hard way in TowerMadness.

Hence, if you implement an effect system that keeps track of and animates effects, also make sure that it has a cap on how many effects it will show, and that it gracefully handles a situation where too many effects are present.

Also, if you were to make a zombie game, don't just allow unlimited zombies to spawn. Make sure the numbers are limited, and design the game to work with that number. If the game is only fun through excess that can't be sustained, you should go back to the drawing board. That's also good lifestyle advice, now that I think about it.

Summary

As you may have noticed, none of the optimizations really did the job alone. It was the mix that made our game run at 60hz no matter what the player does, even on the 3GS.

There are also many things left to optimize. The math library, for example, is completely unoptimized. But there is no need to touch it, as it's not the bottleneck of the game. Optimizing it would probably take a long time and only reduce the total CPU usage by 2-5% (that's what we estimated for Nuts!). Having a good profiler helps a lot.

I hope summarizing my notes on the Nuts! performance tuning process gave you some ideas about what to optimize, and what is probably not worth it, and I hope it makes your life easier in the future. And hopefully mine too, since I thought about this a lot while writing the article.

In case you're there, see you at WWDC! We'll be wearing Limbic shirts most of the time and my MacBook Air has a Yoshi on it, so we're easy to see. Don't hesitate to come over and say hi!

Monday, May 16, 2011

Guest Post: Virtual Game Development

Hey iDevBlogADay,


Since I have very little time, because I'm leaving for Palo Alto in a few hours for the start of this year's WWDC trip, I have asked my fellow Limbic co-founders Iman and Arash to write a little guest post. They're writing about the problems we face as a company working in two time zones with 9 hours in between, and being almost "purely virtual". Here it goes:


Unlike many startups, Limbic operates as a virtual company. In our case, we collaborate with team members in seven locations across the globe (Palo Alto, Davis, San Diego, Burbank, Germany, the Netherlands, and New Zealand). As one can imagine, operating in this fashion brings many challenges, but in our experience it comes with substantial benefits as well.


In order to support this arrangement, some degree of planning is essential, as meetings across multiple time zones must be coordinated. For our projects, we use a slew of tools for communication and task management:


* Skype, IRC, and iChat for voice and video conferencing

* Skitch and Dropbox for sharing images and videos

* BananaScrum and Lighthouse for project planning and task management

* GitHub for source code hosting and collaborative development

* Doodle.com for scheduling


The most common problem with working across multiple time zones is finding overlap in the availability of US and European team members to meet. This leads to inevitable late-night or extremely early morning meetings. When working on dependent project tasks, we have found it is important to sync up daily and hand off to other team members to ensure smooth and continuous development. If a voice or video meeting cannot be attended on a particular day, individual members communicate their progress to the team via email. Also, because team members aren't able to casually communicate throughout the day and all discussion happens during meetings, the meetings tend to run quite long in order to cover all issues.


One of the difficulties with virtual collaboration is that it can be slower than face-to-face communication for rapid iteration. We minimize this by using screen sharing, chat, and video conferencing whenever necessary. A tremendous advantage to working virtually is that it allows everyone to work from their own favorite environment (coffee shop, home, etc.). In addition to the environmental benefits, commute time is reduced or eliminated in many cases, allowing more productive time for work. Finally, with no office expenses to pay, the operational overhead of the company can be reduced.


Recently at Limbic, we have moved towards capturing the benefits of a shared workspace by establishing a small studio in Palo Alto as a hub for physical collaboration, while maintaining the flexibility provided by continuing to operate primarily virtually. We also like to bridge the gap between all team members by periodically planning retreats where we all meet up face to face to have fun, brainstorm, and help kick-off new projects.




That's it! I'm rather excited for the next post, as it's going to be one week after we launch our new game, Nuts!