This paper has an interesting, almost scary fact in it. Among other things, the authors looked at the 2.5% of commits in the projects they studied that did not change any code, but only added comments. Those commits added an average of 47 lines of comments. 47 lines‽ That's two full pages of comments, either added in multiple places or as one big block.

The only sensible explanation I can think of for a 47 line comment block is if you're:

  • Talking about some design-level type thing in detail.
  • Explaining a complex and horrible hack or bug.
  • Writing a literate or self-documenting program, such as a perl program with a big POD block. (But perl was the least commented language they studied, with only 1/3 the comments of Java.)
  • Adding a license block. But the typical license block is less than 47 lines.
  • Trying to explain why this picture appears in the middle of your blog post and makes people happy.

I suspect that what's mostly going on is none of the above, but instead commits that add lots of scattered little comments, or horrible comment boilerplate, or horribly, excessively verbose comments.

Of course, these days I write commit messages first, user documentation second, and comments dead last. And if I want to understand why code is the way it is, or even what it does, the first thing I reach for is git annotate. Looking at how use of a revision control system influences comment density would be an interesting followup.


PS: They also found a single commit that contained 39 lines of code and 364,438 lines of comments. I'm curious where that lurks; it must be some interesting code.

PPS: According to the paper, "the Debian distribution of Linux is mostly generated code, repeating the same patterns over and over." Grains of salt fly everywhere.

PPPS: Why did this paper leave out the median and mode? Meh.
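
To show why the median and mode matter here, this is a minimal sketch in Python. The commit sizes are invented for illustration and are not data from the paper; they are picked so that a single big commit drags the mean up to 47 while the typical commit stays tiny.

```python
from statistics import mean, median, mode

# Hypothetical comment-only commits: lines of comments added by each one.
# These numbers are made up for illustration, not taken from the paper.
typical = [1, 2, 2, 2, 3, 4, 5, 8, 12]

# The same commits plus one large, equally hypothetical, comment dump.
with_outlier = typical + [431]


def summarize(sizes):
    """Return the three common averages for a list of commit sizes."""
    return {"mean": mean(sizes), "median": median(sizes), "mode": mode(sizes)}


print("typical:     ", summarize(typical))
print("with outlier:", summarize(with_outlier))
```

With the one big commit included, the mean is 47, but the median is 3.5 and the mode is 2. A mean on its own can't tell you which kind of distribution you are looking at.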
