Monday, May 30, 2011

Nuts performance levels

Hey iDevBlogADay,

Since my last blog post I've arrived at the Limbic HQ in Palo Alto, CA. We've also launched our latest game, Nuts!


Today I'm going to write about how we managed to make Nuts! a beautiful 3D game running at 60hz even on the 3GS. I'll go into the details of many common optimizations and try to analyze how much they actually gained us.

Measuring Performance

The most important thing for optimizing performance is a way to measure it, and to see how it changes as you modify the application.

The tool of choice for us is a little plug-in module we call the performance monitor. It records wall-clock render times, game update times and idle times. Idle time is the period spent between update and the next render, usually sleeping, updating Cocoa, etc. You can see an annotated example here.

It's a very valuable tool: because it plots the performance of individual parts of the app over time, it helps you correlate performance events with potential causes. In our games, it's only compiled into the developer version, and it can be activated by double-tapping the top left corner of the screen.
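To make the idea concrete, here is a minimal sketch of such a monitor in Python. It is not the actual Limbic code; the class name, the phase names and the injectable clock are all assumptions made for illustration. It only records per-phase wall-clock times for the last few frames, and leaves the plotting out.

```python
import time
from collections import deque


class PerformanceMonitor:
    """Hypothetical monitor: keeps per-phase wall-clock times for recent frames."""

    def __init__(self, history=120, clock=time.perf_counter):
        self.clock = clock                   # injectable so it can be tested
        self.frames = deque(maxlen=history)  # one dict of seconds per frame
        self._current = {}
        self._phase = None
        self._phase_start = 0.0

    def begin(self, phase):
        """Close the running phase (if any) and start timing `phase`."""
        now = self.clock()
        if self._phase is not None:
            elapsed = now - self._phase_start
            self._current[self._phase] = self._current.get(self._phase, 0.0) + elapsed
        self._phase = phase
        self._phase_start = now

    def end_frame(self):
        """Close the running phase and store the finished frame."""
        self.begin(None)
        self.frames.append(self._current)
        self._current = {}

    def worst(self, phase):
        """Longest recorded time for `phase`, in seconds."""
        return max((f.get(phase, 0.0) for f in self.frames), default=0.0)
```

In a game loop you would call begin("update"), begin("render") and begin("idle") at the start of each phase, and end_frame() just before the next update. Plotting the stored frames over time is what lets you line up a spike with whatever happened in the game at that moment.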

30hz vs 60hz?

This is a question I'm very passionate about. Recently at the Santa Cruz iOS Game Dev meeting, the great Graeme Devine mentioned this as well. It's very important that your game runs smoothly. Although most users will not admit it, sluggishness, unresponsiveness and stuttering in a game are huge factors for instantly putting it away.

Keeping this in mind, when you're at the early stage of a new project, and you have a working prototype, you need to make an important decision: Do you want to go for 30hz or 60hz?

Let's think about this for a second. If you decide to go for 30hz, you get about 33 ms per frame instead of about 16.7 ms, which means you can have twice as much stuff in one frame, compared to a 60hz game.
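The arithmetic behind that trade-off is just the refresh interval. A tiny sketch (the function name is made up for this example):

```python
def frame_budget_ms(refresh_hz):
    """Time available for update + render in one frame, in milliseconds."""
    return 1000.0 / refresh_hz


for hz in (60, 30):
    print(f"{hz}hz leaves {frame_budget_ms(hz):.1f} ms per frame")
```

Whatever you put on screen, update and render together have to fit into that budget on the slowest device you care about.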

Many people will also argue that no one will notice the difference between 30hz and 60hz. They could not be more wrong. It really depends on the game you're making. For Nuts!, we experimented with both 30hz and 60hz, and 30hz, although it had a very smooth and stable framerate, just didn't feel right. It was less responsive, and it didn't play well. Plus people were more likely to get motion sickness from it, which is a big factor as the game involves a camera that is constantly rotating around a tree. Hence, we knew the game had to be 60hz, and we took this into consideration for all the art and further engineering.

For another game that we're currently working on, it's a completely different story. It is a very different kind of game and 30hz is completely fine. And because we're 30hz, it means we can show more stuff, and at higher quality.

General Engine Design

To start out, I'd like to give you a small overview of how our game and engine are structured. Our OpenGL ES 2.0 engine is really simple and "dumb": it doesn't do any kind of automatic batch sorting. All we do is load models, which are OBJs plus a set of OpenGL states. There is only one shader, which is quite simple and highly optimized to do everything we need.

The problems

In the week before launch, the game actually ran pretty well, mostly exceeding 60hz on both the iPhone 4 and the 3GS. However, we had random stutters here and there that were really distracting and could even cause you to crash into a branch and lose.

Optimization 1: Vertex Array Objects

At WWDC last year, the Apple engineers recommended having a look at VAOs, as they can lead to a significantly reduced overhead when drawing a lot of batches. Hence, I went ahead and updated our engine. In principle, this is very easy, but there are some pitfalls and the implementation is very unforgiving. If you make a mistake, the code is very likely to crash, often by some form of memory corruption, deep inside the OpenGL code. After it all worked, we even saw a moderate performance gain, but it wasn't anything significant.

However, considering how simple this extension is, and how easily it can be built into an engine, I strongly recommend everyone to use it. There is nothing to lose here. UPDATE: Actually, there is something to lose. Every single VAO takes up 28 KiB of memory. For Nuts!, that's 2.5 MiB just for the VAOs. This heavily penalizes VBO animation. It seems to be a good combination with skeletal animation, though.

Optimization 2: State Caching

Before my final optimization pass, we were already caching many states, so I can't really give any feedback on that. But we basically didn't cache any of the OpenGL ES 2.0 states: shaders, uniform bindings, uniforms, etc. In every draw call, we were re-enabling the same shader, looking up all uniform locations for that shader, and setting them to the right values. That sounded very much like an opportunity to optimize.

However, after I implemented this, I did not notice any improvement in performance. I don't know if the driver is now "smart enough" to do the state caching itself, but it seems to not have much effect on the overall performance. As such, I would still recommend caching for any of the easy stuff (glEnable states for example), but caching each individual uniform value seems to be overkill.
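For readers who haven't built one, the "easy" kind of state caching looks roughly like the sketch below. It is an illustration, not our engine code: StateCache and RecordingBackend are hypothetical names, and the backend is a stand-in that just records calls. In a real engine those two methods would forward to glUseProgram and glEnable/glDisable.

```python
class RecordingBackend:
    """Stand-in for the GL driver: records every call that reaches it."""

    def __init__(self):
        self.calls = []

    def use_program(self, program):
        self.calls.append(("use_program", program))

    def set_capability(self, cap, enabled):
        self.calls.append(("set_capability", cap, enabled))


class StateCache:
    """Forwards a state change only if it differs from the last known value."""

    def __init__(self, backend):
        self.backend = backend
        self.program = None  # unknown until the first call
        self.caps = {}       # capability name -> last value we set

    def use_program(self, program):
        if program != self.program:
            self.backend.use_program(program)
            self.program = program

    def set_capability(self, cap, enabled):
        # caps.get() returns None for unseen capabilities, so the
        # first request for each one is always forwarded.
        if self.caps.get(cap) != enabled:
            self.backend.set_capability(cap, enabled)
            self.caps[cap] = enabled
```

The cache is only valid as long as every state change goes through it. Anything that touches the GL state behind its back leaves the cached values stale, which is one more reason to keep it limited to the simple cases.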

Optimization 3: Instruments

Instruments is a double-edged sword. On the one hand, I love the leak checking and the new driver analysis. On the other hand, I think the CPU and GPU performance monitors and the driver analysis are mostly useless. You may have noticed that I mentioned the driver analysis on both sides. That's because while it gives you a lot of cool insights, and it may catch a few bugs, it didn't offer much that helped make the rendering faster. For the most part, the things it was very obsessed about didn't have any effect at all. But that may also be because I've been doing this for too long.

Optimization 4: Alpha Sorting

Initially, we rendered the scene kind of arbitrarily. We would render the tree, the squirrel, then render some transparent effect, then the branches. We were more concerned about depth-correct rendering than about performance at that point. However, the way the iPhone GPU works, it's actually more beneficial to completely separate the solid from the transparent rendering.

To help implement this, I added a two-pass mode to the engine. The first pass would only allow solid objects to be rendered, and it would complain if any rendering call tries to enable alpha blending. For the second pass, it was the other way around.

This actually helped the performance, especially in the peaks, which were sometimes caused by displaying a lot of transparent effects that would be alternated with solid render calls, like the fireball nuts and their particle effects.

I strongly recommend designing the whole renderer in this way. First, render all solid objects, then come back to render all non-solid objects. And enforce it, in case the artists try to be smart and fancy about something.
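A minimal sketch of what such an enforced two-pass mode could look like is below. The names (TwoPassRenderer, render_scene) are invented for this example, and the draw call only records what it was asked to do instead of talking to OpenGL.

```python
class TwoPassRenderer:
    """Hypothetical renderer that rejects draws that don't fit the current pass."""

    SOLID = "solid"
    TRANSPARENT = "transparent"

    def __init__(self):
        self.current_pass = None
        self.submitted = []  # (pass, object name) in submission order

    def begin_pass(self, render_pass):
        if render_pass not in (self.SOLID, self.TRANSPARENT):
            raise ValueError(f"unknown pass: {render_pass!r}")
        self.current_pass = render_pass

    def draw(self, name, blended):
        if self.current_pass is None:
            raise RuntimeError("draw() called outside of a pass")
        expects_blending = self.current_pass == self.TRANSPARENT
        if blended != expects_blending:
            # Complain loudly instead of silently mixing the two kinds.
            raise RuntimeError(f"{name!r} does not belong in the {self.current_pass} pass")
        self.submitted.append((self.current_pass, name))


def render_scene(renderer, objects):
    """objects: (name, blended) pairs in whatever order the game produced them."""
    objects = list(objects)
    renderer.begin_pass(TwoPassRenderer.SOLID)
    for name, blended in objects:
        if not blended:
            renderer.draw(name, blended=False)
    renderer.begin_pass(TwoPassRenderer.TRANSPARENT)
    for name, blended in objects:
        if blended:
            renderer.draw(name, blended=True)
```

The important part is the check in draw(): a blended draw call that sneaks into the solid pass fails immediately during development, instead of showing up later as a performance spike.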

Optimization 5: High-level optimizations

By far the most significant improvement came from the higher-level optimizations. Usually, the performance issues came down to rendering too many things of one kind, or a model that was weirdly engineered to thrash the texture cache with every single one of its hundreds of triangles.

The performance monitor and A/B testing really helped a lot in pinpointing the causes and fixing them.

Often when you're getting stuttering, the performance monitor will tell you that it's because the frame time is just a little bit too long every few frames, so the system keeps missing one out of every few draw events.
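Here is a small sketch of why "a little bit too long" hurts so much. It assumes rendering is locked to the display refresh, so a frame that overruns its interval has to wait for the next one. The function names and the sample frame times are made up for illustration.

```python
import math


def vsync_intervals(frame_ms, refresh_hz=60.0):
    """Number of refresh intervals a frame occupies when locked to vsync."""
    interval_ms = 1000.0 / refresh_hz
    # The small epsilon keeps a frame that exactly fills the interval at 1.
    return max(1, math.ceil(frame_ms / interval_ms - 1e-9))


def missed_frames(frame_times_ms, refresh_hz=60.0):
    """Indices of frames that took longer than one refresh interval."""
    return [i for i, t in enumerate(frame_times_ms)
            if vsync_intervals(t, refresh_hz) > 1]


# Frame times in milliseconds, as a performance monitor might record them.
samples = [15.0, 16.0, 17.5, 15.5, 17.0]
print(missed_frames(samples))
```

At 60hz the budget is about 16.7 ms, so the 17.5 ms and 17.0 ms frames each take up two refresh intervals. Shaving a single millisecond off those frames removes the stutter entirely, which is why it pays to find out which part of the frame is responsible.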

One other important thing to note is that once you know what performance target and visual quality level you're aiming for, you should figure out the limits of what you can display, and enforce them. If you don't, players will most definitely take your game, clump up all enemies in one spot, and blow them up in some crazy, unanticipated way that will completely destroy the performance. And it will become the norm. We learned that the hard way in TowerMadness.

Hence, if you implement an effect system that keeps track of and animates effects, also make sure that it has a cap on how many effects it will show, and that it gracefully handles a situation where too many effects are present.
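As an illustration of such a cap, here is a minimal sketch. The EffectSystem name, the default limit of 64 and the policy of dropping the oldest effect when the cap is hit are all assumptions for this example. Refusing new effects or dropping the least visible one would be equally valid choices.

```python
class EffectSystem:
    """Hypothetical effect manager with a hard cap on live effects."""

    def __init__(self, max_effects=64):
        if max_effects < 1:
            raise ValueError("max_effects must be at least 1")
        self.max_effects = max_effects
        self.active = []  # oldest effect first

    def spawn(self, name, lifetime):
        """Add an effect; if the cap is reached, the oldest one is dropped."""
        if len(self.active) >= self.max_effects:
            self.active.pop(0)
        self.active.append({"name": name, "remaining": lifetime})

    def update(self, dt):
        """Advance all effects by dt seconds and remove the expired ones."""
        for effect in self.active:
            effect["remaining"] -= dt
        self.active = [e for e in self.active if e["remaining"] > 0]
```

With a cap like this, the worst case for the renderer is known up front, no matter what the player manages to trigger in a single frame.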

Also, if you were to make a zombie game, don't just allow unlimited zombies to spawn. Make sure the numbers are limited, and design the game to work with that number. If the game is only fun through excess that can't be sustained, you should go back to the drawing board. That's also good lifestyle advice, now that I think about it.

Summary

As you may have noticed, none of the optimizations did the job by itself. It was the mix that made our game run at 60hz no matter what the player does, even on the 3GS.

There are also many things left to optimize. For example, the math library is completely unoptimized. But there is no need to change that, as it's not the bottleneck of the game. Optimizing it would probably take a long time, and only reduce the total CPU usage by 2-5% (that's what we estimated for Nuts!). Having a good profiler helps a lot.

I hope summarizing my notes on the Nuts! performance tuning process gave you some ideas about what to optimize, and what is probably not worth it, and I hope it makes your life easier in the future. And hopefully mine too, since I thought about this a lot while writing the article.

In case you're there, see you at WWDC! We'll be wearing Limbic shirts most of the time and my MacBook Air has a Yoshi on it, so we're easy to see. Don't hesitate to come over and say hi!
