Sep 28, 2012

DCnU Sales: An Econometric Regression

For a long while, Matt and I had been thinking of running studies to gauge the various factors that go into comic book sales: creators, characters, price, etc. Since both of us have economics backgrounds, we decided that the way to do it would be by means of an econometric regression.

An econometric regression is a method in which you specify a dependent variable and a set of independent variables. The regression then assigns a coefficient to each of the independent variables, and each coefficient indicates the approximate impact of its variable. In essence, it looks something like this:
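
Written out in a standard linear form (a sketch that matches the description that follows; the subscripts and the count n of independent variables are notation assumed here):

```latex
Y = B_0 + B_1 X_1 + B_2 X_2 + \dots + B_n X_n + \varepsilon
```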

Y is the dependent variable — in this case, sales — and the Xs are the independent variables, meaning various creators and characters. The Bs are the coefficients, except for that first one, which is the constant (essentially, if none of the independent variables are met, that's the one that sticks). That last term is the error term, because obviously, it's not gonna be exact.

But Matt and I had some problems with these studies. For one thing, ICv2 gives the top 300 comics each month. That's great, because regressions get more reliable as the sample gets bigger, but that's 300 titles to code just to analyze sales for one month. In the time since, as well, DC relaunched, and the New 52 happened. The New 52 came with so many factors that didn't apply to other titles and couldn't be accounted for, such as the amount of publicity it got, that analyzing comic book sales as a whole with the New 52 in the mix didn't seem to make much sense to us. After all, if being a DCnU book helps, that doesn't give much predictive power or help anyone, because it's not like Archaia, for example, will ever produce DCnU material.

So we thought, how about just DCnU sales? When we started, it was the start of September, and there were 11 months of sales data available, with 52 titles each (we decided to just stick to the ongoings, figuring that miniseries were different beasts). That's 571 data points (BATMAN INC came later than the other second wave launches). So we decided to do it.

How did things turn out? Let's see.

Methodology and Critical Terms

So bear with me here, because this is difficult to do without being technical, but there's no other way around it. The key to an econometric regression is a bell-curve type of frequency distribution. For that, we had to use a logarithmic model, because raw sales don't give that kind of distribution. Log(sales), on the other hand, gives an almost-perfect one. Check it out.

We used a binary model, meaning that most of our variables were creators and characters, and we'd put a "1" into the data point if it fulfilled a condition, and a "0" if it didn't. For example, for BATMAN #1, it would read "1" for Snyder, "1" for Batman, and "0" for Superman. We didn't name every character or every creator; we just used the ones that we thought would have an impact. We also had variables for guest appearances by various characters, for the price, and for the month. The month matters because if BATMAN, for example, went down by a certain amount, you need to account for how much comics sales as a whole were affected that month. That's essentially what a regression analysis does; it lets you say "Okay, let's hold all of these other factors still and change just this one around. What does it do?"
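
To make the 1/0 coding concrete, here is a minimal sketch in Python. The variable names and the `encode_issue` helper are made up for illustration; they are not the names we used in Stata, and the real data set had many more columns.

```python
# Hypothetical variable list; the study's real variable names are not shown here.
VARIABLES = ["snyder", "batman", "superman"]


def encode_issue(advertised, variables=VARIABLES):
    """Return 0/1 dummies: 1 if the creator or character was advertised on the issue."""
    return {name: int(name in advertised) for name in variables}


# BATMAN #1: written by Snyder, stars Batman, no Superman.
batman_1 = encode_issue({"snyder", "batman"})
print(batman_1)  # {'snyder': 1, 'batman': 1, 'superman': 0}
```

Each issue becomes one row like this, and the regression estimates one coefficient per column.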

For the characters, we decided to count them as long as they were in any way, shape, or form advertised to appear in the comic. It doesn't matter that Aquaman didn't show up in JUSTICE LEAGUE #1; he's on the cover, so he counts. Similarly, if a creator was advertised to be working on the book, and there was a last-minute unadvertised artistic change, such as in one issue of AQUAMAN (where Reis did breakdowns anyway), we counted that creator as a factor on the book. If the fans were warned well ahead of time that an artist would not be working on a book, such as JH Williams III on BATWOMAN #6, we didn't count him.

The logarithmic model makes things a percentage change instead of a unit change. For example, for a non-logarithmic model, if Scott Snyder's coefficient was a 1, then that means that him being the writer would increase sales by 1 unit. However, for the logarithmic model, it means that his name would increase sales by roughly 100 percent. Reading a coefficient directly as a percentage is an approximation, and it is tightest when the coefficient is small.
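
For anyone who wants to see how the percentage reading works, here is a small sketch. It assumes a log(sales) model with a 0/1 variable, where the exact effect of a coefficient b is exp(b) - 1 and the shortcut is simply b itself. The function names are illustrative only, and the percentages quoted in this article are the shortcut readings.

```python
import math


def approx_pct_change(coef):
    """Shortcut reading: the coefficient itself, expressed as a percentage."""
    return 100 * coef


def exact_pct_change(coef):
    """Exact change in sales implied by a 0/1 variable in a log(sales) model."""
    return 100 * (math.exp(coef) - 1)


for coef in (0.05, 0.30, 1.00):
    print(
        f"coefficient {coef:.2f}: "
        f"shortcut {approx_pct_change(coef):.1f}%, "
        f"exact {exact_pct_change(coef):.1f}%"
    )
```

The two readings are close for small coefficients and drift apart for big ones: a coefficient of 1 works out to about 172% on the exact formula, not 100%.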

Got it?

Okay. Then Matt ran the data through a program called Stata (Matt adds: "Which is awesome and fun and nerdy. None of this needs to be in there, I just thought I’d point it out."), and we did diagnostics and everything, adjusting variables as needed.

We ran into a bunch of problems. We'll talk about that later. Before that though, let's define a couple of terms we're going to keep using throughout this article:
  • R squared: The R squared measures how much of the variance in the data is explained by the model. For similar studies, an R squared of somewhere around 0.5 to 0.6 is expected. Barry Litman's study on movie sales had an R squared of 0.485. A similar study I did on hardcover novels in college had one of around 0.525. In both cases, the models explained about half the variance. Not bad, but obviously, the rules aren't hard and fast. R squared can also increase just by adding more variables, so it’s not the be-all and end-all of model statistics.
  • P value: This value gives the significance of a particular independent variable. We pay attention to variables with P values under 0.05 and under 0.01 (noting which threshold each one meets). If it's greater than 0.05, we treat the variable as not significant, and it can actually be removed. Luckily for us, Stata automatically indicates which variables are significant at which levels.
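
As a quick illustration of what R squared measures, here is a sketch that computes it by hand from the textbook definition (1 minus unexplained variation over total variation). The numbers are made up and this is not Stata's output.

```python
def r_squared(actual, predicted):
    """Share of the variance in `actual` that the predictions account for."""
    mean = sum(actual) / len(actual)
    ss_total = sum((y - mean) ** 2 for y in actual)
    ss_resid = sum((y - p) ** 2 for y, p in zip(actual, predicted))
    return 1 - ss_resid / ss_total


# Made-up log(sales) values and model predictions, for illustration only.
actual = [10.0, 11.0, 12.0, 13.0]
predicted = [10.5, 10.5, 12.5, 12.5]

print(round(r_squared(actual, predicted), 3))  # 0.8
```

A perfect model would score 1, and a model that just predicts the average every time would score 0.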

Initial Results

For our first regression, we just decided to run every single variable we had. That's the ideal after all, right? The R-squared was 0.861. Great! What a high R-squared, right?

And then we looked at the data, and according to the data, JH Williams didn't affect sales, James Robinson decreased sales by 37% (significant at a P value of less than 0.01), Grant Morrison increased sales by 22.5% (P less than 0.05), and Geoff Johns increased sales by 315.6% (P less than 0.01).

That seems kind of weird, huh? That brings us to our next point, the bane of all econometric analysts: multicollinearity.


The easiest way to describe multicollinearity is this: the model sees your variables in terms of numbers, so if you have two variables that move together closely, the model can't tell them apart. For example, one of our variables is "perezartist." It's every book where George Perez is the penciller. That's basically the WORLDS' FINEST series. It would therefore be of no use to have a "powergirl" variable, which indicates if Power Girl is the star of the book, because the model would see them as exactly the same thing.

And that's an exact correlation. Some of our variables are highly correlated with each other without being identical, and that still makes it impossible to run them together. Brian Azzarello and Cliff Chiang, for example, have worked together for most of the year on Wonder Woman and nothing else; it's impossible to run them in the same study. You run the risk of someone getting a wrong sign. (Would anyone believe me if I said Cliff Chiang lowered sales? No? I didn't think so. Multicollinearity, everyone!)
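
Here is a rough sketch of how you can spot the problem before running anything: compute the correlation between two 0/1 columns. The six-issue sample below is invented for illustration and is not our data; `pearson` is just the standard correlation formula written out by hand.

```python
import math


def pearson(xs, ys):
    """Pearson correlation between two equal-length numeric columns."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


# Hypothetical six-issue sample: one entry per issue, 1 = variable applies.
perezartist = [1, 1, 0, 0, 0, 0]
powergirl = [1, 1, 0, 0, 0, 0]  # identical column: a perfect match
azzarello = [0, 0, 1, 1, 1, 0]
chiang = [0, 0, 1, 1, 0, 0]     # overlaps heavily but not completely

print(round(pearson(perezartist, powergirl), 3))
print(round(pearson(azzarello, chiang), 3))
```

The first pair comes out at 1, which is the "exactly the same thing" case. The second comes out high but below 1, which is the harder case where the model runs but splits the credit between the two variables unreliably.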

The biggest problem was Jim Lee. Since there hasn't been much in the way of crossovers, Jim Lee is highly correlative with every single member of the Justice League aside from Batman, because Batman has appeared in so many books this past year that his variable has enough variance of its own. In fact, every character is highly correlated with his writer and with his artist. The exceptions are Batman and, aside from Jim Lee, Superman, because Superman's gone through so many creators in the last year. What's more, every member of the Justice League (again, except for Batman) is highly correlated with the others.

Regarding multicollinearity, there's really nothing we can do other than omit one of the variables, and then explain afterwards that the impact of the variable left behind may include the impact of the variable omitted.

So, for example, in our case, we decided to drop Cliff Chiang. Be aware that any impact Brian Azzarello has may also be because he worked on most of his stuff this year with Cliff Chiang.

Stata is also kind enough to drop certain variables when there's a perfect match, but it leaves the others in when there isn't.

This is a problem because Geoff Johns is, by nature, highly correlative with Jim Lee and Ivan Reis (as both artists haven't done anything all year but work with Johns). We want to see how Jim Lee and Ivan Reis impact sales, but we want to see how Johns does as well. Since Johns affects their data, we had to come up with alternative solutions.

To that end, we decided to run several regressions: one involving just the characters, one involving just the creators, one involving the creators and Batman (since he's not collinear with any creator), and one involving the creators and Batman and Superman (because Superman is just collinear with Jim Lee, and at a 60% level). We also ran multiple variations where we'd drop Johns and leave in Lee and Reis, and drop Lee and Reis and add Johns. Fair enough.

So, because of the high level of multicollinearity, the only thing we can say with any level of confidence is the direction of the impact (whether someone increases or decreases sales), or whether they have any effect. The actual level of impact, on the other hand, we can't be sure of.

The way regressions work is, the more variables you add, the more it explains the variance. So, of course, it's problematic if we have multicollinear variables, because it means we are taking out possible explanatory factors.

The final thing to remember before we move on is this: correlation does not equal causality. Just because the model says that being the last issue is correlated with sales does not mean that being the last issue causes them. In that case, there may be a reverse causality, in which it's the last issue specifically because the sales are low.

More Results

We ran a total of 23 regressions, and our R-squared values ranged from 0.366 to 0.722. Removing anything with an R-squared of less than 0.5, we noticed some consistently performing variables, some variables that always had an impact, some variables that never did, and some variables whose performance changed per regression, mostly, again, because of multicollinearity issues. The pure character regressions had a slightly higher R-squared than the pure creator ones, but that may again be because of multicollinearity, since most characters, especially those in the JLA, are correlative with each other.

I know what you guys want to hear, in light of their Twitter war last month. "How do Scott Snyder, Rob Liefeld, and Batman affect things?"

Well, Scott Snyder and Batman are actually not correlative with anything, so their instances stand alone. (Strictly speaking, Snyder is correlative with Capullo, and we had to take Capullo out of the model, so keep in mind that Snyder's effects here may be influenced by Capullo's.) In other words, Batman's impact is estimated separately from Snyder's, and vice versa. Liefeld, on the other hand, was closer to the characters he worked on, meaning the impact of those characters is tied into his own impact, so we'll get to Liefeld first.

The Rob Liefeld variable consistently had a negative correlation to sales, significant at a P-value of less than 0.01. He was consistently in the -33% range.

Now, before you all go piling on Liefeld, keep in mind the following things: he was given low-selling titles to revitalize, and the characters he worked on included Hawk and Dove, Hawkman, Deathstroke the Terminator, and Grifter, not exactly the upper echelon. There may be a reverse causality in play as well: Liefeld was given those titles to bring back up, and it just didn't work, hence the negative correlation.

Matt and I were curious about the effects of Liefeld as a writer and Liefeld as an artist. Liefeld's impact as a writer is consistent with the aforementioned -33%, and as an artist, he came up as insignificant to the model. Again, this may be because of the characters he worked on, so the most you can say is that Liefeld as an artist didn't have enough pull to overcome whatever obstacles there were to count as a significant factor.

Other creators that consistently popped up as insignificant were James Robinson and Paul Cornell. Robinson, it should be said, only wrote three books that were out in this sample, so there isn't much variance with which to work for him.

Before we move on to Batman and Scott Snyder, let's look at David Finch. In the pure creator regressions (i.e., not accounting for Batman), Finch accounted for a 98% increase in sales. Once we took Batman into account, it went down, as you might expect. In all the regressions that included Batman, Finch's coefficient came in at a 15%–30% increase in sales (given the error term, the differences within that range aren't really significant). It was significant in every regression, but at different P levels.

Batman, on the other hand, in the pure character regressions, had coefficients of 98% as well. And when we took away the other characters and introduced the creators, Batman's coefficients ranged from 78.3% to 92.4%. It's safe to say, therefore, that Batman is significant.

In fact, Liefeld made a comment on Twitter that his grandma could break 50,000 on a Batman book, and this data would support that. Most of the regressions would, in fact, support that in July, a Batman book that didn't have a big name creator on it, or guest stars, or anything else — just a plain book starring Batman — would have probably sold something in the neighborhood of 60,000 copies.

But the BATMAN series isn't selling 60,000 copies. It is DC's highest-selling title, regularly at over 120,000 copies. And as the data would suggest, much of this is because of Snyder. Like David Finch, Snyder's coefficients were high in the pure creator regressions — 96%. But once Batman was introduced into the equation, Snyder's coefficients went down... but not to 15%–30% as it did for Finch; no, Snyder's went down to 60%–70%, significant, every single time, at a P value of less than 0.01. Remember, Snyder isn't correlative with Batman here; their variables aren't eating into each other. They're working together.

So yes, Batman sells himself. But Snyder makes him sell more. And in fact, there's more at play here, because we're still off by around 30,000 sales, but it's clear that Snyder has a positive impact on sales.

In fact, there are only two other creators who consistently were significant at P values of less than 0.01 and had coefficients that were as high or higher, and they're exactly who you think they'd be: Grant Morrison and Geoff Johns.

Morrison regularly had a coefficient of 100% or more, that is, doubling sales, except for two regressions where it read 55%. We haven't been able to figure out what's causing the disparity in those two particular regressions, but suffice it to say that Morrison has a positive correlation and it's a huge one, and we all pretty much figured he would, right? He's also not collinear with Superman, and when we plugged Superman into the regressions, his impact would only go down to just over 90%. Not as big a drop as Snyder's. Johns, for his part, regularly gave values of 90%–130%.

But again, Johns was highly correlative with Jim Lee and Ivan Reis, so we did some regressions where we took them out and some where we took Johns out. Generally, when we took out Reis and Lee, Johns' coefficients were 90% to 100% – still very high. When we added them, Johns' impact would be above 100%, but Reis' and Lee's coefficients would be negative — a true symptom of multicollinearity.

When we took out Johns, Reis would have an 83%–87% coefficient. Again, we can assume a lot of that is Johns as well, but at least we can say that Reis has a positive correlation to sales.

And Jim Lee? It's close to impossible to tell. Lee's highest coefficient is 37%, which is difficult to explain, since a lot of the people I know who aren't participating on the Internet are collecting the Justice League purely because of Jim Lee. Jim Lee is the single most problematic variable in the entire model, because not only is he highly correlative to Geoff Johns, but he's also highly correlative to just about every member of the Justice League but Batman. He's also the only creator who's highly correlative with price. Alone among all his peers, he's the only one who worked on nothing but $3.99 books all year. I would love to give you guys a value for Jim Lee's impact. But by the very nature of this model, I can't do it. He's the only one whose coefficients go from hugely negative to positive, and there's just too many factors in the way.

Here's a few more bits and pieces for you guys. Francis Manapul was regularly significant with coefficients of around 100%, but since he's highly collinear with the Flash, we don't know how much of that is because of the character and how much of that is Manapul. The Flash himself in the pure character regression tested as insignificant, but again, so did Wonder Woman, Aquaman, and Hal Jordan (but not the Green Lanterns in general) — again, the Justice League multicollinearity screws up our findings.

Aside from those highly collinear guys, just about every character we tested was significant. Swamp Thing, Nightwing, Red Robin (but not Robin), Supergirl, Superboy, Batgirl, Batwoman, and Catwoman all had positive coefficients, with Nightwing having the highest at 68%. The Legion of Super Heroes tested negatively, which makes some sort of sense because their adventures do not take place in modern times, and therefore have no bearing on modern continuity. The Justice League name, applied to all titles with the term "Justice League," was also negative, probably because you could call a team with the Big Six on it anything you want and it would sell, and probably because you can't just put the name on a team with a seemingly random slathering of characters on it and expect it to sell.

George Perez was regularly positive at a P value of less than 0.01, but with a wide range. We narrowed his regressions down to the work he did as a penciller (so on WORLDS' FINEST), and his impact was 61%. We then introduced a "female lead" variable, which would account for if a woman were in the title role, and Perez's role as an artist was reduced to 24%–33%.

In one of the most inexplicable things in our model, the "female lead" variable was regularly significant, and always positive, at a range of 33%–42%. This is counterintuitive because we often think of the superhero genre as a boys' medium, and certainly one of the charges that's always been leveled against Wonder Woman as to why she doesn't sell is that she's a woman. There could be any number of explanations. It's possible that there's not as much variance among female-led books. It's possible that DC thinks so conservatively about female-led books that it only puts out the ones it thinks will succeed. There's no female equivalent of I, VAMPIRE, for example. Or maybe conventional wisdom is just wrong and female protagonists do increase sales.

Taking the female lead into account, JH Williams III as an artist regularly had coefficients ranging from 20%–35%, and Brian Azzarello (see the previous note concerning Cliff Chiang) 30%–42.8%. Taking female lead as a variable out of it, JH's coefficients go to 65% and Azzarello's to 72%.

And I know what you're thinking — what about Judd Winick, writer of CATWOMAN, and Scott Lobdell, writer of RED HOOD AND THE OUTLAWS, both of whom wrote those controversial sex issues sexing up Catwoman and Starfire? Well, Judd Winick rarely ever tested significantly, and when he did it was at a P level of greater than 0.05, with coefficients of around 12%, which is still pretty low when you consider the numbers given. Lobdell, however, regularly tested significantly at P values of less than 0.01, and consistently between 44% and 48%. We didn't, however, consider RED HOOD a female-led book. (Should we have? Tell me so in the comments!) Regardless, the data suggests Lobdell is good for sales, so that should be good news for Superman fans.

Like Winick, Jeff Lemire tested insignificantly for some regressions and significantly for others, but only with high P values and low coefficients.

The only other creator who tested significantly was Pete Milligan, who regularly had coefficients from 35%–40%.

The "first issue" variable tested significantly only once, and only in the pure character regression, with a coefficient of 33%. It tested insignificantly all the other times. It's possible that DC puts out so many titles that enough series — and their first issues — don't matter enough, and that's reflected in the data.

Pretty much every guest appearance was insignificant to sales. Maybe it did boost sales, but if it did, not in a way that was statistically significant.

Price has no significant effect. As it turns out, comics are an inelastic good. Comic book fans will buy what they buy, regardless of the price, or so the data suggests. (This will not stop everyone, however, from complaining about the price increase.)

And Superman, DC's flagship superhero? Without taking creators into account, he had a coefficient of 82%, second only to Batman's 98%. And taking creators into account — and again, this is tough, because he's somewhat correlative to Jim Lee — he would vacillate from 35% to 80%. It's hard to tell, and the only thing you can say for sure is that Superman still sells, even if we're not sure just by how much.

One last thing: if you're thinking about what sells the most, all signs point to, all things being equal, characters. If you have Scott Snyder and Ivan Reis on a Batman book, it will sell more than Scott Snyder and Ivan Reis on an I, Vampire book. That these are percentage changes is the likely answer to why DC puts big-name creators on big characters — a 50% increase on a book that would otherwise sell 50,000 is bigger than a 50% increase on a book that would otherwise sell 25,000. It also seems, especially if you look at Snyder and Finch, who are both working on the same character and are presumably the draws of those books aside from Batman, that writers have more of a draw than artists.
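
The arithmetic behind that point is quick to check. This is just a sketch using the round numbers from the paragraph above; `extra_copies` is a throwaway helper name.

```python
def extra_copies(baseline_sales, pct_increase):
    """Additional copies sold when a baseline is raised by a given percentage."""
    return baseline_sales * pct_increase / 100


print(extra_copies(50_000, 50))  # 25000.0
print(extra_copies(25_000, 50))  # 12500.0
```

Same creator, same percentage, twice the extra copies on the bigger book.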

The importance of characters either makes Snyder's BATMAN beating JUSTICE LEAGUE more impressive, or it shows that more people really prefer Batman solo instead of the entire Justice League.


Matt and I did this entire thing in our spare time for a month. These studies take a long time to really be comprehensive, and it's not like we did this professionally. We worked with what data we could easily gather and we did it as more of an exercise than anything else. We'd happily do more and not just with DC, but it is quite time-consuming and as it turns out, not that easy to do. I guess we just both miss working with numbers.

This study only takes into account DCnU data. It didn't take into account the past sales, or sales of other comics in the same time period. This is the DCnU in a vacuum. It is helpful only so far as it relates to the DCnU. Putting Scott Snyder on a title may be good in the DCnU, but it doesn't necessarily mean that he'd boost sales on the Punisher, nor does it necessarily mean otherwise either.

So remember, if you're going to use this study, be aware that it's preliminary, and there'd still be a lot of bugs to work out. There's not enough variance in the sample, and ideally we'd be working with data where all the characters are not multicollinear with their creators.

If any of you statisticians spot any errors, blame Matt. If you think this is brilliant, however, I'll gladly take all the credit.


The Professor said...

This is so delightfully geeky. I had a blast reading through this. Thanks!

Duy Tano said...

Thanks, Prof! I had to get all refreshed on how stats work and everything; should've consulted you!
