Turning a data set into a game

12 Apr 2017

What are some examples of turning a data set into a game? Pokémon Go, where the whole world is the data set? On a smaller scale, how about baseball videos on YouTube? A while back I worked on a little side project - passedballorwildpitch.com - that turned some of those into a game. Well, for some value of "game". Onwards!

The idea is that there's a collection of videos, each showing a play where the pitcher throws the ball and it gets by the catcher. Then the video is paused and the player needs to make the call - was this a passed ball (the catcher's fault) or a wild pitch (the pitcher's fault)? After making the call, the player is told what the official call was and on we go to the next video. The player is shown three videos and congratulated or insulted based on call accuracy. Simple!

More generally, I was trying to think of interesting ways to reuse large public data sets. YouTube videos certainly qualify for that; there are a zillion of them and they're more or less all public. And a game is a fine way to employ any data set. Additionally, YouTube videos can be embedded in a site via an iframe, so it was easy to build a site around that content; the player didn't have to switch tabs and I didn't even have to sign up for an API key and make server to server calls or whatever. Also I'm a baseball fan (go Yankees!) and so I wanted to do a project where I was interested in the content.

First off, after starting the project I realized that there was a more general pattern here, and so I added hitorerror.com which is more or less the same thing except with a fielding call. And still later I added dubbedordrained.com where the player would predict whether a golf shot was going to go in the hole or not. This was slightly different than the others because the player would be shown the first half of the video and then would be prompted to predict whether the shot would be made. Then it showed the rest of the video. But there was the same general idea around evaluating a particular aspect of video content. The downside was $X per year for the domain names, but, hey.

One initial bump in the road was that it was hard to find appropriate videos. A good candidate video for passedballorwildpitch needed to be a play where the ball didn't go 10 feet over the catcher's head, otherwise it was too obvious. I spent a while poking around with various search phrases and discarded a bunch of obvious videos before finding enough borderline videos to populate the site. This was even more of a problem for hitorerror; I could find lots of plays where a shortstop obviously booted the ball but it was hard to find plays where it wasn't clear what the call would be. In both cases I needed to get a critical mass of videos - at least 10 - to avoid a player whipping through all the inventory in two minutes. But it was definitely a challenge.

Speaking of inventory, a huge source of videos would have been mlb.com. But I couldn't make that happen. The main problem was that although mlb.com provides a way to embed videos, you can't embed a video and only play a segment - i.e., you can't show just the timeslice from 00:15 to 00:25. Without that capability, the whole video just plays and the experience is ruined since the player hears the announcer say "whoa, that's a wild pitch" or whatever. Also, those videos are frequently of a whole game, or of a 10 minute condensed version of a game, and of course the point of the site was to just show one play.

Another issue was determining what the actual call was; that is, was this play officially scored as a passed ball or a wild pitch? I found a few sites that had play by play stats for each game and that kind of solved the problem. But I never automated per-game scoring scraping, so it was always a hassle to look up and record that determination. Maybe RestClient or Mechanize plus some judicious regexes could have solved that...
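As a rough idea of what the "judicious regexes" part might look like, here is a minimal Ruby sketch that classifies one line of play-by-play text. The method name official_call and the sample lines are made up for illustration; real play-by-play pages are worded differently from site to site, and fetching the page (with RestClient, Mechanize, or anything else) is left out entirely.

```ruby
# Returns :passed_ball, :wild_pitch, or nil when the text mentions neither.
# official_call is a hypothetical helper, not something from the actual site.
def official_call(description)
  case description
  when /passed ball/i then :passed_ball
  when /wild pitch/i  then :wild_pitch
  end
end

# Invented sample lines; a real stats page would need its own patterns.
plays = [
  "Runner advances to 2nd on a wild pitch.",
  "Runner scores on a passed ball.",
  "Batter grounds out to shortstop."
]

plays.each { |line| puts "#{official_call(line).inspect}: #{line}" }
```

The hard part in practice would be matching a given video to the right game and the right play, which a regex alone doesn't solve.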

Another thing that came up was figuring out the correct embed params to hide the player chrome; otherwise the video title would give away the answer. Fortunately the YouTube player embed docs are solid and this was straightforward.
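For a sense of what those embed params look like, here is a minimal Ruby sketch that builds an iframe URL playing only one slice of a video. The start, end, controls, and rel query parameters are documented YouTube player parameters; the embed_url helper, the VIDEO_ID placeholder, and the 15 to 25 second slice are illustrative assumptions. The exact set of parameters the site used is not recorded here, and the ones the player honors for hiding the title have changed over time, so check the current docs.

```ruby
require "uri"

# Builds a YouTube iframe embed URL that plays only one slice of a video.
# embed_url and its arguments are hypothetical, for illustration only.
def embed_url(video_id, start_sec:, end_sec:)
  raise ArgumentError, "end must be after start" unless end_sec > start_sec

  params = {
    "start"    => start_sec, # begin playback at this offset, in whole seconds
    "end"      => end_sec,   # stop playback at this offset, in whole seconds
    "controls" => 0,         # hide the player controls
    "rel"      => 0          # cut down on related-video suggestions at the end
  }
  "https://www.youtube.com/embed/#{video_id}?#{URI.encode_www_form(params)}"
end

puts embed_url("VIDEO_ID", start_sec: 15, end_sec: 25)
```

The resulting URL goes in the src attribute of the iframe, so no API key or server to server call is involved.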

I was never really concerned about site performance. A single unicorn instance running on a small VPS was plenty; site usage never got so high that there were any problems. The only thing that I had to tune was the notifications. I had set it to send me an email when a player made a call on a video and that got out of hand when a bunch of Reddit traffic came in, so I turned those off. But generally the whole thing was simple enough that there weren't any scalability issues.

Switching to the technical side, I should have written integration tests with Capybara. I did a bunch of controller tests, but those were tedious to write since I had to do a bunch of setup logic. For example, after a player makes calls on three plays I show a message about how well or poorly the player has done; to test that I did a bunch of factory calls and such. But really I should have just done an end to end integration test that walked through the process. That would have been more readable, less hassle, and easier to add small assertions for - i.e., 'page should have content "try again"' - and whatnot.

Once I generalized things so that several domains were being served by the same application I had to make a couple of changes. I only had 512M of RAM, so I only wanted to run one copy of the application. To do that, I had to determine the desired "web property" the browser was interacting with, so I had to get the HTTP_X_FORWARDED_HOST header and allow that to be overridden by a parameter for testing. Similarly, each property had session entries which would have overlapped, so I had a Property class with a session_root instance method to keep them separate. There were a bunch of other similar tweaks - per-property JSON-LD/twitter metadata/copy/"companion site" text and links - but that was all pretty straightforward. Solid integration tests would have helped a lot with my confidence levels, though.
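The general shape of that multi-domain setup can be sketched in plain Ruby. This is a reconstruction under assumptions, not the site's actual code: only the Property class name, the session_root method name, and the HTTP_X_FORWARDED_HOST header come from the description above. The from_request method, the "property" override parameter, the HTTP_HOST fallback, and the idea of nesting each property's data under its own session key are all hypothetical.

```ruby
# Hypothetical sketch of one application instance serving several domains.
class Property
  KNOWN = %w[passedballorwildpitch hitorerror dubbedordrained].freeze

  attr_reader :name

  def initialize(name)
    @name = name
  end

  # env is a Rack-style environment hash, params the request parameters.
  # An explicit "property" parameter (an assumed name) wins, which makes
  # it easy to pick a property in tests without faking headers.
  def self.from_request(env, params = {})
    host = params["property"] || env["HTTP_X_FORWARDED_HOST"] || env["HTTP_HOST"] || ""
    # Take the first host in a proxy chain, then drop port, "www." and ".com".
    name = host.split(",").first.to_s.strip.downcase
               .sub(/:\d+\z/, "").sub(/\Awww\./, "").sub(/\.com\z/, "")
    new(KNOWN.include?(name) ? name : KNOWN.first)
  end

  # Returns this property's own slice of the session, creating it on first
  # use, so entries from different sites never overlap.
  def session_root(session)
    session[name] ||= {}
  end
end

session = {}
property = Property.from_request({ "HTTP_X_FORWARDED_HOST" => "www.hitorerror.com" })
property.session_root(session)["calls_made"] = 2
puts session.inspect
```

Falling back to the first known property for an unrecognized host is one possible choice; returning a 404 would be another.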

One thing that Chad Fowler suggested to me way back when was to do a separate "admin" app even for small-ish efforts. That way you don't clutter up your main application with all sorts of admin CRUD'ing functionality and layouts and so forth. It's a little more hassle to set up but it sure is nice to be able to iterate on an admin app without worrying about accidentally breaking the main app. And it lets you do things like mark models as read-only on the main app, so console work there is a little safer. There are also the usual shared database concerns, but it's worth doing.

So to sum up: 1) programming is fun 2) turning data sets into games is interesting 3) even small projects are fun to watch wax and wane 4) separate out admin code into a separate app and 5) integration tests for the win.