We have a challenge for you: think of a set of data. A really big one, for preference. It doesn’t have to be random – it could be “the populations of all US cities,” for example, or “the lengths of the world’s rivers.” But it does need to span many orders of magnitude: something like “human height” or “birthday month” won’t do, because all possible answers are going to be quite close to each other.
Got one? Great. Now: what do you think the most frequent leading digit is in that set?
Intuitively, the question doesn’t seem to make that much sense, does it? It’s a huge and fairly unpredictable set of numbers, so it makes sense that the leading digits – that is, the first digit of each entry, so for instance the leading digit of six hundred thirty-three is six – would be spread evenly. One ninth of the data would start with the number one; one ninth would start with two; one ninth with three; and so on.
But what if we told you that wasn’t the case? In fact, the most frequent leading digit is almost certainly one – by quite a lot, too. In practice, you’ll generally find that about 30 percent of your data points start with the number one. What’s going on?
What is Benford’s law?
This lopsided frequency is the mathematical phenomenon called Benford’s Law. Despite the name, it was discovered by the astronomer Simon Newcomb, and completely by accident: he happened to be looking up logarithmic tables back in 1881 when he noticed that the pages beginning with one were much more worn than any of the others. He dashed off a note to the American Journal of Mathematics, and a phenomenon was quietly born.
Nobody paid the discovery much notice until 1938, when a physicist named Frank Benford published his own test of it. There’s a reason we call it Benford’s Law and not Newcomb’s Law – see, Benford put the work in. He tested the phenomenon on over 20,000 data points from wildly different sources – death rates, molecular weights, population numbers, addresses, rivers, numbers from the Reader’s Digest, you name it – and the first-digit law held up across all of them.
Sounds unbelievable, right? So let’s see it in action – all we need is a large, naturally-occurring dataset. How about… the area, in square kilometers, of every country in the world.
Counting up the frequencies of each of the leading digits – and getting rid of Vatican City on account of it being too small for our purposes – gives us this:
The bars are the actual numbers of… uh, numbers. The line is what we would expect from Benford’s Law. Spooky!
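For the curious, the line in a chart like that comes from a short formula: a leading digit d is expected to turn up with probability log10(1 + 1/d). Here is a minimal Python sketch of that formula next to a leading-digit counter. The function names are ours, made up for illustration, and no dataset is bundled, so you would need to feed in your own list of numbers.

```python
import math
from collections import Counter


def benford_expected(digit: int) -> float:
    """Expected share of values whose leading digit is `digit` (1 to 9)."""
    return math.log10(1 + 1 / digit)


def leading_digit(value) -> int:
    """First significant decimal digit of a non-zero int or float."""
    # Strip any leading zeros and the decimal point, e.g. 0.0045 -> "45".
    return int(str(abs(value)).lstrip("0.")[0])


def leading_digit_shares(values):
    """Observed share of each leading digit 1 to 9 (zeros are skipped)."""
    digits = [leading_digit(v) for v in values if v != 0]
    counts = Counter(digits)
    return {d: counts[d] / len(digits) for d in range(1, 10)}


# Print the expected Benford share for each leading digit.
for d in range(1, 10):
    print(d, f"{benford_expected(d):.1%}")
```

Run as written, the loop prints roughly 30.1 percent for one and 17.6 percent for two, sliding down to about 4.6 percent for nine. Passing your own data to `leading_digit_shares` gives the observed shares to set against those figures.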
What causes Benford’s law?
Looking at that example, you might think, okay, maybe it’s a human phenomenon – maybe we just like lower numbers, so we stop expanding our kingdoms or whatever when we get to one million square kilometers. Well, look at this:
See that? It’s the same pattern, right? Except this one is measuring the leading digits of 2^n – hardly something physically set out by human hands.
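You don’t have to take a chart’s word for it, either. The sketch below counts the leading digits of the first 1,000 powers of two and prints them beside the Benford prediction log10(1 + 1/d). The cut-off of 1,000 is our own arbitrary choice, as is the function name.

```python
import math
from collections import Counter


def power_leading_digit_shares(base: int, count: int):
    """Share of each leading digit among base**1 ... base**count."""
    # Python integers are exact, so str() gives the true first digit.
    counts = Counter(int(str(base ** n)[0]) for n in range(1, count + 1))
    return {d: counts[d] / count for d in range(1, 10)}


shares = power_leading_digit_shares(2, 1000)
for d in range(1, 10):
    predicted = math.log10(1 + 1 / d)
    print(d, f"observed {shares[d]:.3f}", f"predicted {predicted:.3f}")
```

The observed and predicted columns should land close together for every digit, and the match tends to tighten as `count` grows.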
Now, no doubt some of the more mathematically savvy of you out there are already heading towards the comment section to say something about how this effect is most likely dependent on what base you choose. We happen to work in base ten, so when we say most leading digits are ones in a given data set, what we’re really saying is that most entries are one, or something-teen, or one hundred and something, and so on.
If we switch to, say, base five, or hexadecimal, those same values will have a different representation, not necessarily starting with a one, so surely the frequency of leading digits will be different too.
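Here’s the cool thing: the pattern isn’t an artifact of base ten. The exact percentages change with the base, but the lopsided shape survives, with one still the most common leading digit. Let’s take our country sizes dataset and convert it all into base… oh, let’s choose base eight: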
And here’s the same for the dataset in hexadecimal, or base sixteen:
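If you want to try a change of base yourself, here is a rough sketch. In base b, the expected share of leading digit d becomes log_b(1 + 1/d), which works out to exactly one third for a leading one in base eight and one quarter in hexadecimal. One loud assumption: since the country-area figures aren’t reproduced here, the code uses the powers of three as stand-in data. (Powers of two would be a bad pick, because in base eight or sixteen their leading digits simply cycle.) All names are hypothetical.

```python
import math
from collections import Counter


def leading_digit_in_base(value: int, base: int) -> int:
    """Most significant digit of a positive integer written in `base`."""
    while value >= base:
        value //= base  # drop the last digit until one digit is left
    return value


def expected_share(digit: int, base: int) -> float:
    """Benford's prediction generalised to an arbitrary base."""
    return math.log(1 + 1 / digit, base)


def shares_of_powers(root: int, count: int, base: int):
    """Leading-digit shares, in `base`, of root**1 ... root**count."""
    counts = Counter(
        leading_digit_in_base(root ** n, base) for n in range(1, count + 1)
    )
    return {d: counts[d] / count for d in range(1, base)}


# Stand-in data (an assumption): the first 1,000 powers of three.
for base in (8, 16):
    observed = shares_of_powers(3, 1000, base)
    print(f"base {base}")
    for d in range(1, base):
        print(" ", d, f"{observed[d]:.3f}", f"{expected_share(d, base):.3f}")
```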
That doesn’t really answer the question …
That’s fair. But, well, here’s the thing: nobody really knows the mathematical explanation for Benford’s law. “Benford’s Law continues to defy attempts at an easy derivation,” wrote probabilists Arno Berger and Theodore Hill in their 2011 paper Benford’s Law Strikes Back.
“Even though it would be highly desirable to have both a rigorous formal proof and a reasonably sound heuristic explanation, it seems unlikely that any quick derivation has much hope of explaining BL mathematically.”
That isn’t to say people haven’t tried, though. For a while, the leading hypothesis was that it had something to do with scale invariance: if the leading digits of some dataset obey some universal law, the argument ran, then it must not depend on any particular units, since “God is not known to favor either the metric system or the English system,” mathematician Ralph Raimi wrote in 1976.
Using a bit of mathematical logic, you can indeed get from there to Benford’s Law – but there’s a problem. Remember how we said, “if the leading digits obey some law”? The proof only works if we assume that’s true – and it didn’t take long for people to notice that no such law existed.
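The unit-independence at the heart of that argument is at least easy to check numerically. The sketch below takes a set of numbers that already follows the first-digit law, converts it from square kilometers to square miles, and compares the leading-digit shares before and after. Two assumptions to flag: the “measurements” are just the first 1,000 powers of two standing in for real data, and the conversion factor is an approximate one. This is a demonstration of the idea, not a proof of anything.

```python
import math
from collections import Counter


def first_digit(value) -> int:
    """First significant decimal digit of a positive int or float."""
    return int(str(value).lstrip("0.")[0])


def digit_shares(values):
    """Share of each leading digit 1 to 9 in `values`."""
    values = list(values)
    counts = Counter(first_digit(v) for v in values)
    return {d: counts[d] / len(values) for d in range(1, 10)}


# Stand-in data (an assumption): pretend these are areas in square km.
measurements = [2 ** n for n in range(1, 1001)]
KM2_TO_SQ_MILES = 0.386102  # approximate conversion factor

original = digit_shares(measurements)
rescaled = digit_shares(v * KM2_TO_SQ_MILES for v in measurements)

for d in range(1, 10):
    print(d, f"{original[d]:.3f}", f"{rescaled[d]:.3f}",
          f"{math.log10(1 + 1 / d):.3f}")
```

Both columns of observed shares should sit close to the predicted one, whichever unit you pick.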
Perhaps the answer is, as Hill suggested in 1998, that data sets are rarely as simple as they look. “For example,” he wrote, “suppose you are collecting data from a newspaper, and the first article concerns lottery numbers (which are generally uniformly distributed), the second article concerns a particular population with a standard bell-curve distribution and the third is an update of the latest calculations of atomic weights.
“None of these calculations has significant-digit frequencies close to Benford’s law, but their average does,” Hill explained, “and sampling randomly from all three will yield digital frequencies close to Benford’s law.”
Of course, neither of those can explain why purely mathematical sets, like our previous example of the leading digits of 2^n, follow Benford’s law so closely. If you want to know what happens when mathematicians completely give up, look no further: Benford’s law is “a built-in characteristic of our number system,” wrote Weaver, “merely the result of our way of writing numbers,” per Goudsmit and Furry. Sorry, kids – Benford’s law just is. Stop asking questions.
Well, then what’s the point of Benford’s law?
We may not know why Benford’s law exists, but that doesn’t mean it’s useless. Think about it: if we know that large datasets often have this property, then any data which doesn’t follow Benford’s law – well, that’s a bit suspicious.
“The IRS has been using it for decades to ferret out fraudsters,” Hill told Reuters as false conspiracies flew in the aftermath of the 2020 Presidential election. The law helps the agency in “identifying suspicious entries,” he explained, “at which time they put the auditors to work on the hard evidence.”
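To give a flavor of how such a screen might work, here is a minimal sketch of one common approach: a Pearson chi-square comparison of observed leading digits against Benford’s prediction. To be clear, this is our own illustration and not the IRS’s actual method, which isn’t described here; the function name and the 5 percent threshold are assumptions, and a high score is a reason to look closer, not evidence of fraud.

```python
import math
from collections import Counter

# Chi-square critical value for 8 degrees of freedom at the 5% level.
CRITICAL_5_PERCENT = 15.507


def benford_chi_square(values) -> float:
    """Pearson chi-square statistic of leading digits vs Benford's law."""
    digits = [int(str(abs(v)).lstrip("0.")[0]) for v in values if v != 0]
    counts = Counter(digits)
    total = len(digits)
    statistic = 0.0
    for d in range(1, 10):
        expected = total * math.log10(1 + 1 / d)
        statistic += (counts[d] - expected) ** 2 / expected
    return statistic


# A Benford-like set (powers of two) versus evenly spread leading digits.
natural = [2 ** n for n in range(1, 1001)]
evenly_spread = list(range(1, 1000))

for name, data in (("powers of two", natural), ("1 to 999", evenly_spread)):
    score = benford_chi_square(data)
    verdict = "worth a closer look" if score > CRITICAL_5_PERCENT else "unremarkable"
    print(f"{name}: {score:.1f} ({verdict})")
```

The powers of two come out with a small score, while the numbers 1 to 999, whose leading digits are spread perfectly evenly, blow well past the threshold.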
In the era of big data and social media, Benford’s law is more important than ever. “It implies that if the distribution of first digits deviate from the expected distribution, it is indicative of fraud,” explained Madahali and Hall in 2020.
“We investigate[d] whether social media bots and Information Operations activities are conformant to the Benford's law. Our results showed that bots’ behavior adhere to Benford's law […] however, activities related to Information Operations did not.”
We may not understand Benford’s law, but it seems Benford’s law understands us – the human brain is no better at coming up with convincing fake data than it is at inventing convincingly random numbers. So whatever the reason behind Benford’s law, two things are for sure: it’s not going away, and it doesn’t look like we’re going to understand it any time soon.
Maybe that’s okay. “A broad and often ill-understood phenomenon need not always be reduced to a few theorems,” wrote Berger and Hill, and “there is currently no unified approach that simultaneously explains its appearance in dynamical systems, number theory, statistics, and real-world data.”
“In that sense, most experts seem to agree,” they conclude, “that the ubiquity of Benford’s law, especially in real-life data, remains mysterious.”