Tracking Attitudes and Behaviors to Improve Games Ramon Romero Game Developers Conference 2008
What is it? The website is filled with useful information from our previous talks on the subject of User Research
Creating a Feedback Loop For Game Designers Lend audience insights Detect problems Create opportunities to fix those problems Prior to release User Advocate / Data Champion Research expert We have a lot of experience working with game developers and adapting our approaches to the challenges of the development schedule. In our setup we have a person who is entirely devoted to the problem of representing and working with user data
Using Formal Research Methods From a Variety of Research Disciplines Industry Researchers Usability Engineer Human Factors Engineer Academic Researchers Cultural Anthropologist, Ethnographer Experimental Psychologists Cognitive, Social, Developmental, Behavioral There are multiple research types that can provide value to game developers at different points in the development cycle
What is it? Some people use the terms logging or automation to refer to the same thing. But what do we mean…
Tracking Real-time User Experience (TRUE) This refers to logging the things that matter most to the play experience. Where did people die, what killed them, what were their opponents carrying… We will re-present the key points from this diagram through the remainder of this presentation. Note how our report viewer is active – the arrow is pointing towards that report – the viewer is actively trying to obtain information
TRUE System Critical events Surrounding context On the TRUE System slide(s) we will be rebuilding that diagram. Critical events and Surrounding context are things we measure. Critical events are relatively easy to understand. Think of major progress (beating a boss) and major setbacks (dying, losing all cash money) as Critical. Surrounding context refers to related information (what were they holding, what level were they).
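As a minimal sketch, a TRUE-style log entry might pair each critical event with its surrounding context in a single record. Every name here (`GameEvent`, `killer`, `player_level`, and so on) is an illustrative assumption, not the actual TRUE schema:

```python
# Sketch: one record per critical event, carrying surrounding context.
from dataclasses import dataclass, field

@dataclass
class GameEvent:
    event: str            # critical event, e.g. "player_death"
    mission: int          # where in the game it happened
    timestamp: float      # seconds since session start
    context: dict = field(default_factory=dict)  # surrounding context

log = []

def record(event, mission, timestamp, **context):
    """Append one critical event with whatever context is relevant."""
    log.append(GameEvent(event, mission, timestamp, context))

# Major setback with context, then major progress.
record("player_death", 6, 4021.5, killer="Flood Human", player_level=3)
record("boss_defeated", 6, 4388.0, boss="Scarab")
```

Keeping context on the event itself is what makes the later drill-downs possible: you can slice deaths by killer, by level, or by anything else you thought to record.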
Our first example. Imagine it is the Summer of 2004 and that we just asked a number of real consumers to play through all of Halo 2 over the course of a weekend of testing. That test has just completed and now we are going to look over the results of that test, just like the active viewer from the diagram. Time is pressing, the game will be out this Fall and we need to turn feedback around fast…
…average deaths per mission… …for all missions in the game… Mission 6 8+ hours in… Although in the course of regular analysis we would examine everything, our eyes are drawn to the spike for Mission 6. And so we click on the bar in the chart
Mission 6 8+ hours in… And now we see the details of death counts for individual encounters across the entire mission. An encounter is the smallest meaningful chunk of a Mission to a designer at Bungie. It can be a conversation or a cutscene but most often it will be a firefight. Results from firefights are below…
interesting results 8+ hours in… In regular analysis we would examine all 5 spikes… but for this talk we will focus in on one example with interesting results. Now let's say you want to get even more information, so you click on this bar.
Cause of Player Death And now we see who is causing the various deaths. The Flood Humans are the greatest source of problems, so perhaps we could retune them in some fashion. But our interesting result was the high number of Unknowns. Something occurred here which we did not anticipate, and hence had no label for. 16% was very high. The highest we saw in other parts of the game was 1-2%. And so, wanting to learn more, we click on the Unknown portion of the chart…
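The drill-down above can be sketched as a simple aggregation that buckets unlabeled deaths as "Unknown"; the data and the 5% flag threshold below are invented for illustration:

```python
# Sketch: aggregate deaths by cause and flag a suspicious Unknown share.
from collections import Counter

deaths = [
    {"cause": "Flood Human"}, {"cause": "Flood Human"}, {"cause": "Flood Human"},
    {"cause": "Elite"}, {"cause": None}, {"cause": None},
]

counts = Counter(d["cause"] or "Unknown" for d in deaths)
total = sum(counts.values())
shares = {cause: n / total for cause, n in counts.items()}

# In the talk, 16% Unknown stood out against a 1-2% baseline elsewhere.
suspicious = [c for c, s in shares.items() if c == "Unknown" and s > 0.05]
```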
Video plays… multiple players are drawn to their own doom… to a pit that is too attractive and looks too much like the correct way to go… …nearly everyone is fooled by the pit once… they fall straight in…
And this is how we fixed that pit
TRUE System Critical events Surrounding context Supporting information Video Deaths, both averages and raw counts What killed them In this example video acts as the back-up plan. Something occurred that we could not anticipate. Rather than plowing through hours and hours of video to discover the source of the problem, the TRUE system points us right to it, so we can use those hours to think about solutions.
TRUE Principles Instrument to answer why When building a TRUE system it is important that the data can tell a story once you have collected it. Otherwise it has a tendency to sit, or the findings remain mysterious and unhelpful. Another example…
In Forza 2 one of the modes of play is a Time Trial. You beat the time on a given track and you earn a car. We tested every one of them and will focus in on the results from the Tsukuba Short Circuit…
Time Trial Summary Tsukuba Short % of participants passed 0% Target Time to Beat the Trial (seconds) 45.7 Average Time (seconds) 84.9 …where things did not go well. On first appearance things look really bad. People are averaging runs that are nearly 40 seconds off. But that is a little misleading…
Arcade | Time Trial | Tsukuba Short You see, we actually had people run the race 10 times, so it is better to break out results per run. Averages are not always your friend. Click ahead to look through individual results…
Tsukuba – P1 This participant improved over time, nearly beating it but not quite
Tsukuba – P2 She nearly beat it on 4 separate occasions
Tsukuba – P3 Wow – dramatic changes… and was able to put together one decent lap
Tsukuba – P4 Lots of good runs but even that last one was not good enough
Tsukuba – P5 Another case of dramatic change. Data loss can occur. Sometimes our logging system fails, or the game crashes or a participant needs to use the restroom. All of this means that sometimes we do not learn the full story for all participants
Tsukuba – P6 Several close runs but no cigars
Time Trial Summary Tsukuba Short % of participants passed 0% Target Time to Beat the Trial (seconds) 45.7 Average Time (seconds) 84.9 UR suggested 50.2 seconds Design decided 48.8 seconds So 84.9 was the wrong number to pay attention to. Instead we focused on each individual's best run and how they progressed over time. And then we made a suggestion… It has been suggested that in the face of all this data a development team will lose control of its vision. This is not the case. The Game Development team makes all the final decisions about how to adjust their game. The Game Designers looked over the same data, but they knew that the cars were not done being tuned and decided a different number would work well. The next few slides show you how well their number worked…
Adjusted Target Time
Tsukuba – P1, Adjusted
Tsukuba – P2, Adjusted
Tsukuba – P3, Adjusted
Tsukuba – P4, Adjusted
Tsukuba – P5, Adjusted
Tsukuba – P6, Adjusted
TRUE System Critical events Surrounding context Supporting information Video Data visualization Needs iteration. LOTS!! Data visualizations are a key aspect of the TRUE system. You have seen a bunch already, and they are not always so straightforward to create
TRUE Principles Instrument to answer why Make the key findings pop Intent Designers must declare intent This is the goal of those visualizations. No theoretical discussions. No debates. Clear, clear, clear findings. If not, then the Data Visualizations actually work against you, no matter how clear they are to you. Luckily there are a few Game Designers around who can help you out with this. Once they declare intent, then working together you can create those visualizations. Examples of what we mean by intent…
Time Trial Summary Tsukuba Short % of participants passed 0% Target Time to Beat the Trial (seconds) 45.7 Average Time (seconds) 84.9 Beatable within 10 tries We already saw the intent statement from the Forza 2 designers… but there was more to it. This is an excellent statement. The statement not only helps determine the nature of the visualization, it also helped us determine precisely how to test it… i.e., let's give them 10 laps and see what happens. It's also a really easy example of design intent, as was our Halo 2 example earlier… people need to die at the intended rate… But there are much harder cases…
Crackdown is a successful title that was released in 2007… It is a non-linear game… This creates difficulties when attempting to map out intent. If people die too much then they are supposed to find another way around. At any one time players can be doing anything…
The many ways to play Crackdown…
Users must find their own fun… The intent statement is very broad… and so we found that certain aspects of the game's intent were not really declared but were understood only in the context of the play experience… The experience players had with Agency Nodes, also referred to as Supply Points, is an interesting example… In the game these points are places where you go to re-supply your weapons… they also double as re-spawn points.
Video plays… opening play sequence in Crackdown (starting after opening cutscene completes)… run around a little… find car… drive out of Agency starting point… takes a minute or two…
Video plays… jump ahead… we found some fighting… run around… eventually die… and respawn… back at the Agency… where we started… now we have to go through the same tired sequence of finding a car and driving out of there before we can get back to the action…
People were not finding the orange supply points which, again, are respawn points that players could use to get back to the action sooner. So this meant death was more punishing than the Designers intended. Using TRUE we started tracking how long it took players to find the orange beacons…
Users must find their own fun… Average Time (mins) to First Agency Node Test Test Test Test … but first they should find … How many times do you think people died in 31 minutes of play… quite a few, it turns out… more than 50 times for one poor individual. So we made a few adjustments to make sure that players would notice these beacons, and things improved over time. And returning to our key point… the intent statement needed to evolve, and did so iteratively.
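A "time till…" metric like the one above might be computed as follows; the event streams, timestamps, and field names are invented for illustration:

```python
# Sketch: per participant, minutes of play before the first supply-point
# visit, with None for anyone who never found one.
events = {
    "P1": [("death", 3.0), ("supply_point", 12.5), ("death", 14.0)],
    "P2": [("death", 5.0), ("death", 9.0)],          # never found one
    "P3": [("supply_point", 2.0)],
}

def time_to_first(stream, kind):
    """Earliest timestamp of an event of the given kind, or None."""
    times = [t for e, t in stream if e == kind]
    return min(times) if times else None

first_node = {p: time_to_first(s, "supply_point") for p, s in events.items()}
found = [t for t in first_node.values() if t is not None]
avg_time = sum(found) / len(found)
```

Keeping the never-found participants visible (as `None`) matters: averaging only over finders understates how punishing the problem is.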
TRUE Principles Instrument to answer why Make the key findings pop Intent Designers must declare intent UR must find a way to measure it Once the intent is declared (or discovered) the act of measuring it and analyzing it can be really straightforward as in all examples so far. But sometimes the measurements can be misleading or unclear...
Valhalla is a multiplayer map in Halo 3. It was available as early as the Alpha test period. In the distance is one of the towers on this map. Towers are places where players will spawn into the game, vehicles are usually nearby, and there are a pair of transport devices that will also shoot players out into the environment. Players are expected to fight for control over them. On the next slide we will look at another picture of the same tower. This time it will be a small red blob on the right side of the picture.
H3 Alpha Here it is… Everywhere you see red is a spot where relatively MORE deaths occurred than in the black and grey areas. We call this a heat map. The deeper the red, the hotter the spot, the more violence people are committing. Anyway, the neat thing is looking at the huge problem we can see here. You see it, right? No, not this… Not the other tower… And not here either… It's here… and easiest to understand when we look at the beta results for comparison. Do not feel bad if you missed it. The User Researcher working on the product almost missed it too.
H3 Alpha H3 Beta Users must use the entire map… People were not using this part of the map… So an assumption in the design intent was found and declared, and then the adjustment was relatively straightforward. They changed the direction that the transport devices would shoot players so that they could experience all parts of the map. And the beta results showed that the adjustment worked out as hoped. But let's not gloss over the key point here. The User Researcher working by him or herself might have missed this, because all aspects of design intent will not be clearly declared every time.
TRUE Principles Instrument to answer why Make the key findings pop Intent Designers must declare intent UR must find a way to measure it Design and UR analyze together Designers are expert at the experience they are trying to create, so naturally they should help with the analysis. In the example you just saw, the Designers at Bungie saw the problem instantly where the User Researcher nearly missed it. All the more reason to concentrate on the visualizations and ensure the findings are instantly understandable
TRUE System Critical events Surrounding context Supporting information Video Data visualization Needs iteration. LOTS!! The initial version of this chart was a giant indistinct red blob, informing us that our alpha participants died just about everywhere. It took time and repeated revision to get the heat map meaningful.
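A death heat map can be sketched as a coarse 2D histogram over death coordinates. The grid size, cell size, and coordinates below are invented; as the slide notes, the real version took heavy iteration on exactly these choices before it was readable:

```python
# Sketch: bin death positions into a grid and count deaths per cell.
def heat_map(positions, cell=10, width=50, height=50):
    cols, rows = width // cell, height // cell
    grid = [[0] * cols for _ in range(rows)]
    for x, y in positions:
        grid[int(y) // cell][int(x) // cell] += 1
    return grid

deaths = [(12, 7), (14, 9), (13, 8), (44, 41), (3, 3)]
grid = heat_map(deaths)

# The hottest cell: (count, row, col) of the cell with the most deaths.
hottest = max((grid[r][c], r, c)
              for r in range(len(grid)) for c in range(len(grid[0])))
```

Too-small cells reproduce the "giant indistinct red blob" problem in reverse (all noise); too-large cells merge distinct hot spots. Bin size is the thing to iterate on.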
The next few slides are visualizations noting where players are standing (measurements pulled every 15 seconds) in a level broken down by encounter. Encounters are the smallest meaningful chunk of an individual mission to a Bungie Designer. An encounter can be a firefight as in the Halo 2 example earlier. In this case we are looking at encounters of the other variety (NPCs are speaking) where there is no action.
Encounter 1 Here we see all the places where the participants were standing during Encounter 1. We are not distinguishing among participants – all points where a participant ever stood are noted.
Encounter 2 Now we see that in Encounter 2 (encounters proceed based on a combination of player movement and timing) players are beginning to move forward and also appear to be milling around in the starting area of the mission.
Encounter 3 Note that there are more dots visible now. The number of participants has not changed, we are seeing greater indication of people spreading out. The overall pattern continues – they move forward… and then they move back. Remember this entire sequence contains no action.
Encounter 4 And now we see even more spreading out… Yet some people are still wandering back here.
There are numerous steps we could take to attempt to learn what the problem is. In the TRUE system we could click on any individual dot and watch video to observe what transpired. But let's be honest – if you are planning to emulate this system, the video will be the last thing you get functioning. What else could inform you about what is going on?
Maybe the participants themselves can be of service… Whenever testing Halo 2 or Halo 3, a survey was integrated into the play experience. Every 3 minutes, participants are given a chance to tell us how they feel about the experience. I know what you are thinking. Every 3 minutes… People get used to the survey. Every 3 minutes… they get used to it. How do we know? Well, we test our own methods and ask people how they feel about the surveys. And they get used to the surveys… Lots of Brown. What was that again? Here we are at the beginning of the first mission (did I mention that?), people have not even fired a weapon and they are so frustrated with the game that they are done with it… they are ready to quit. People's attitudes seem like an important thing to know… The purple and green dots clue us in. People are feeling lost. The brown dots tell us how extreme the problem is. We fixed the problem by providing more guidance and the problem dissipated.
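The dots on that visualization might be assembled by pairing each timed attitude response with the player's position at the moment it was given. The 3-minute interval comes from the talk; the position samples, answers, and structure below are invented:

```python
# Sketch: join timed survey answers to player positions so attitudes can
# be plotted as colored dots on the map.
SURVEY_INTERVAL = 180  # seconds; the talk mentions a prompt every 3 minutes

positions = {0: (5, 5), 180: (6, 7), 360: (6, 6)}   # timestamp -> (x, y)
answers = {180: "frustrated", 360: "lost"}           # timestamp -> feeling

dots = [(positions[t], feeling) for t, feeling in sorted(answers.items())]
```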
TRUE System Critical events Surrounding context Supporting information Video Data visualization Measure attitude The 4th and final pillar of the TRUE system. There were several things going on all at once in that example, so we will break out the benefits of measuring attitude into separate examples.
On the next slide we will again be examining a mission in Halo 3. This time dots represent spots where the player was killed and color indicates what killed them. The level is the first one where players face a Scarab (a bossfight). Purple dots tell you where players were killed by the Scarab.
The boss fight… We tested again one week later… Weren't purple dots the boss? Where did all the tan dots come from? Oh… the Grunts… Hold it… the Grunts??!?!… It turns out we are talking about a bug. This testing took place at the Normal difficulty level. But there was a bug introduced into the code somewhere such that whenever a Grunt was behind the controls of a vehicle, their AI was suddenly reset to maximum difficulty, turning them into, in the words of one participant…
…grunts of death… At the end of every mission we asked participants how they felt about the play experience. We also include several open-ended questions and this was one of the responses. The bug was identified, fixed and dismissed. Fine… but we were discussing measures of attitude, how does this matter?
…grunts of death… When preparing this presentation the User Researcher working on the product (John Hopson) was struggling to remember good examples (we ran a multitude of studies, generating hundreds if not thousands of individual findings). Yet within a matter of moments he pulled this single individual finding out of his head with little effort. It was memorable…
TRUE System Critical events Surrounding context Supporting information Video Data visualization Measure attitude Memorable quotes When dealing with large sets of data that we are attempting to interpret quickly, it is important to discover certain hooks or anchors around which to tell the story of what happened. Without those anchors the dataset, no matter how powerful or complete, is much more likely to be ignored. Different people are convinced by different kinds of information. Some need clear breakdowns and percentages of who did what and where, and others need to hear the story of what transpired. These memorable quotes are critical to helping your audience (and yourself) understand what is going on.
Time Trial Summary Tsukuba Short % of participants passed 0% Target Time to Beat the Trial (seconds) 45.7 Average Time (seconds) 84.9 Beatable within 10 tries... …and the experience should feel appropriately challenging… It turns out the Design intent statement had multiple clauses…
How challenging was this race? After each set of 10 races we asked them how they felt… Those who dislike this so-called subjective data tend to criticize the fact that players only complain in one direction. As you can see looking at the data from the Sunset Peninsula track, this is not always the case. In the games industry we are always attempting to build new and novel experiences. This means that we could be wrong in either direction, and the behavioral data by itself does not necessarily communicate that. Perhaps people could have completed their 10 laps of the Tsukuba Short circuit and indicated that it represented a reasonable challenge, as with the Suzuka Circuit. Even with a well understood mode of gameplay such as a Time Trial, we can be wrong in unpredictable fashion.
How challenging was this race? Attitude indicates valence Attitude indicates magnitude of dissatisfaction So we need some outside opinions to help us understand the polarity or valence of the experience, and to inform us whether or not there is a problem. We also get a statement of priority. Nearly 80% of the participants felt the Tsukuba Short circuit was harder than it needed to be, making it the worst circuit we tested that day. If we were in a crunch and could only fix a few things in the game, then this data would help us make the right decision about where to expend resources.
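Valence and priority might be derived from the challenge question like this. The response data below is invented, apart from the roughly 80% "too hard" share for Tsukuba mentioned above; the answer categories are illustrative:

```python
# Sketch: rank tracks by the share of "too hard" responses so a team in
# crunch knows where tuning effort pays off most.
responses = {
    "Tsukuba Short":    ["too hard"] * 8 + ["about right"] * 2,
    "Suzuka Circuit":   ["about right"] * 7 + ["too hard"] * 2 + ["too easy"],
    "Sunset Peninsula": ["too easy"] * 5 + ["about right"] * 5,
}

def too_hard_share(answers):
    return answers.count("too hard") / len(answers)

# Priority list: worst offender first.
priority = sorted(responses, key=lambda t: too_hard_share(responses[t]),
                  reverse=True)
```

Note that Sunset Peninsula's complaints run the other way ("too easy"), which is the valence point: the same question catches problems in either direction.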
Tsukuba – All, Adjusted Attitude says nothing about the magnitude of the fix Those who hate this form of data tend to point out this issue. And they are absolutely right to do so. The behavioral data gives you precision information on how to fix an issue. But don't let that downplay the significance of acquiring attitudinal data. Without it you may be focusing on the wrong things.
TRUE System Critical events Surrounding context Supporting information Video Data visualization Measure attitude Memorable quotes Valence Priority One final point is that these measures are less susceptible to the so-called subjectivity problem. If we conducted that same Forza 2 test 100 times then the memorable quotes would differ based on the individuals but the Valence and Priority findings will not differ substantially. These metrics are statistically reliable when used appropriately.
TRUE Principles Instrument to answer why Make the key findings pop Intent Designers declare intent UR must find a way to measure it Design and UR analyze together Behavior and Attitude receive equal weight Another way to state this point is to say that each type of metric holds primacy in its arena of mastery. But neither type of data should leave its arena
Final example in the presentation…
Test 12 Imagine another one of the all-weekend tests completed. This is Test 12, so the Agency Node/Supply Point problem is basically behind us. People played a total of 12 hours of the game. There are tons of data available to us. We could examine: Deaths via heatmap or by enemy type Where people go How far people progressed in the RPG component of the game Which boss fights they discovered and/or completed What skill types players are using to defeat enemies We have to turn the data around tomorrow so the team can make adjustments as needed. Remember, it's a non-linear game, and so the intent statement is not incredibly good at leading your queries…
Users must find their own fun… Actually it does kind of point towards one question…
How fun was this game? Naturally, at the end of that 12 hours of play we asked people how they felt about the experience. Let's take a look…
3.8 We use a 5-point scale, so that's pretty good but not fantastic. You want to be up and over 4.1 or 4.2 if possible. Looking at the average response is a little weird, so…
How fun was this game? It's really just another valence question… And things look pretty good. Usually a great deal of our efforts are spent on reducing the negatives, but we appear to have removed most of the primary frustrations from the experience… So if we want to amp the positive… where should we go…?
What was fun…? I really liked the overall feeling of mobility the game possesses. Especially once my agility greatly improved… the super hero like abilities I like leveling up the most. Seeing my guy get stronger and jump higher made the game good. being able to upgrade my character Let's check the open-ended responses. Maybe there is something memorable to think on… It seems like there is a trend… what is it in Crackdown that makes people feel like a Super Hero…?
THEOREM… GIVEN ; AGILITY = FUN? They are mostly talking about jumping and running fast, both of which are tied to the Agility statistic. A player statistic that they can increase by collecting agility orbs within the game. How would we verify that…? There was some data on this… how far did people get on the RPG skillset…
What level were they…? By the end of the study (12 hours of play)… Just focusing on agility… Once leveled up by as much as two stars, players can jump onto rooftops with ease and they should feel substantially faster. Let's take just those folks who got to 2 stars... and see how they responded…
How fun was this game? …to this question… And they all fell in this bucket… Interesting, but it's not absolute proof of a relationship. We need far more evidence to prove this. We discussed the possibility with the developers…
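The cross-check might look like this: restrict to participants who reached two agility stars and compare their fun ratings against everyone else's. Ratings and levels below are invented, and as the slide says, a gap like this shows association, not proof:

```python
# Sketch: cross-tab agility level against the 5-point fun rating.
participants = [
    {"agility": 2, "fun": 5}, {"agility": 3, "fun": 5}, {"agility": 2, "fun": 4},
    {"agility": 0, "fun": 3}, {"agility": 1, "fun": 4}, {"agility": 1, "fun": 3},
]

def mean_fun(group):
    return sum(p["fun"] for p in group) / len(group)

high = [p for p in participants if p["agility"] >= 2]  # reached 2+ stars
low  = [p for p in participants if p["agility"] < 2]
gap = mean_fun(high) - mean_fun(low)  # positive gap supports the theorem
```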
TRUE System Critical events Surrounding context Supporting information Video Data visualization Measure attitude Memorable quotes Valence Priority Can direct data mining… Quick aside: Not shall. Not will. Those memorable quotes can be lighthouses in the mist.
Anyway, we discussed the possibility with the developers. Meaning that while all of this was going on, players needed to treat this as a goal… And these ones too… Several adjustments were made with the intent of drawing users' attention to the agility orbs, in some cases changing their locations to make them more accessible
So there would be a lot more of this going on…
Test 13 And then we retested…
How fun was this game? Checked our results…
4.5 And felt pretty good about where the game was going…
Tracking Real-time User Experience (TRUE) – System Critical events Surrounding context Supporting information Video Data visualization Measure attitude Memorable quotes Valence Priority There is lots of cool stuff in the TRUE system. We are stressing the efficacy of attitudinal measures because when the behavioral data comes rolling in there is a tendency to downplay attitude and this is not an advisable practice.
Tracking Real-time User Experience (TRUE) – Principles Instrument to answer why Make the key findings pop Intent Designers declare intent UR must find a way to measure it Design and UR analyze together Behavior and Attitude receive equal weight Behavior measures Design intent Attitude validates Design intent In truth, multiple clauses or not, the design intent statement will always be a behavioral statement… we want/expect players to do x… The second part… and we want players to feel good about x… i.e., acquiring opinions about the intended experience, is the determining factor in your overall success. You need to measure attitude and ensure the metrics receive appropriate primacy in the face of overwhelming (and cool) behavioral data.
Tracking Real-time User Experience (TRUE) – Do it yourself… General Measures: Overall Status (location, timestamp, minor progress, items); Attitudes (linked to general status; forced choice; open-ended); Critical Events. Critical Events by Game Type: Event Games (measure outcomes); Linear Games (major progress, major setbacks); Non-linear Games (time till… x; Attitude!). Measuring EVERYTHING impedes analysis. We may expand on these points in future talks…
Thanks to… Bungie Design and Bungie Engineering: Halo 2 and Halo 3 User Research: John Hopson, Kris Moreno, Randy Pagulayan Turn 10: Forza 2 User Research: Daniel Gunn, Tracey Sellar Real Time Worlds: Crackdown User Research: Jerome Hagen, Eric Schuh Microsoft Game Studios User Research Manager: Dennis Wixon Game Essentials Director: David Holmes Corporate Vice President: Shane Kim