Statistics Every Writer Should Know

A simple guide to understanding basic statistics, for journalists and other writers who might not know math.

Numbers can't "talk," but they can tell you as much as your human sources can. But just like with human sources, you have to ask!

So what should you ask a number? Well, mathematicians have developed an entire field - statistics - dedicated to getting answers out of numbers. Now, you don't have to have a degree in statistics in order to conduct an effective "interview" with your data. But you do need to know a few basics.

Here, described in plain English, are some basic concepts in statistics that every writer should know...

So, you’re a Beginner?

Mean
Let's get started...

Median
How to find out how the "average Joe" is doing

Percent
Ch-ch-ch-changes...

The Next Step: Not Getting Duped

Per capita and Rates
When an increase is really a decrease and other ways people can use numbers to trick you

Standard Deviation and Normal Distribution
A quick look at the King of Stats

Margin of Error and Confidence Interval
How not to get suckered by polls and other research

Data Analysis
How to tell if these numbers are really worth writing about anyway

Frequently Asked Questions

Sample Sizes
"So how come a survey of 1,600 people can tell me what 250 million are thinking?"

Statistical Tests
"How do I pick the correct statistical test for me?"

Moving On

Student's T
Is your sample relevant to the larger population it is supposed to represent? Use the t-test to find out.


Mean

This is one of the more common statistics you will see. And it's easy to compute. All you have to do is add up all the values in a set of data and then divide that sum by the number of values in the dataset. Here's an example:

Let's say you are writing about the World Wide Widget Co. and the salaries of its nine employees.

The CEO makes $100,000 per year,
Two managers make $50,000 per year,
Four factory workers make $15,000 each, and
Two trainees make $9,000 per year.

So you add $100,000 + $50,000 + $50,000 + $15,000 + $15,000 + $15,000 + $15,000 + $9,000 + $9,000 (all the values in the set of data), which gives you $278,000. Then divide that total by 9 (the number of values in the set of data).

That gives you the mean, which works out to $30,889 (rounded to the nearest dollar).
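
If you'd rather let a computer handle the arithmetic, here's a quick sketch of that same calculation in Python (any spreadsheet or stats program will do the same job):

    # The nine salaries at World Wide Widget Co.
    salaries = [100_000, 50_000, 50_000, 15_000, 15_000, 15_000, 15_000, 9_000, 9_000]

    # Add up all the values, then divide by how many there are.
    mean = sum(salaries) / len(salaries)
    print(f"${mean:,.0f}")  # $30,889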

Not a bad average salary. But be careful when using this number. After all, only three of the nine workers at WWW Co. make that much money. And the other six workers don't even make half the average salary.

So what statistic should you use when you want to give some idea of what the average worker at WWW Co. is earning? It's time to learn about the median.

Median

Whenever you find yourself writing the words, "the average worker" this, or "the average household" that, you don't want to use the mean to describe those situations. You want a statistic that tells you something about the worker or the household in the middle. That's the median.

Again, this statistic is easy to determine because the median literally is the value in the middle. Just line up the values in your set of data, from largest to smallest. The one in the dead-center is your median.

For the World Wide Widget Co., here are the workers' salaries:

$100,000
$50,000
$50,000
$15,000
$15,000
$15,000
$15,000
$9,000
$9,000

That's nine employees. So the one halfway down the list, the fifth value, is $15,000. That's the median. (If you have an even number of values, the midpoint falls between two numbers; average those two to get the median.)
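
Here's the same lookup as a Python sketch, using the standard library's statistics module (which also handles the even-count case for you):

    import statistics

    salaries = [100_000, 50_000, 50_000, 15_000, 15_000, 15_000, 15_000, 9_000, 9_000]

    # median() sorts the values and picks the middle one;
    # with an even count, it averages the two middle values.
    print(statistics.median(salaries))  # 15000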

Comparing the mean to the median for a set of data can give you an idea how widely the values in your dataset are spread apart. In this case, there's a substantial gap between the CEO at WWW Co. and the rank and file. (Of course, in the real world, a set of just nine numbers won't be enough to tell you very much about anything. But we're using a small dataset here to help keep these concepts clear.)

Here's another illustration of this: Ten people are riding on a bus in Redmond, Washington. The mean income of those riders is $50,000 a year. The median income of those riders is also $50,000 a year.

Joe Blow gets off the bus. Bill Gates gets on. The median income of those riders remains $50,000 a year. But the mean income is now somewhere in the neighborhood of $50 million. A source could now say that the average income of those bus riders is 50 million bucks. But those other nine riders didn't become millionaires just because Bill Gates got on their bus. A reporter who writes that the "average rider" on that bus earns $50,000 a year, using the median, provides a far more accurate picture of those bus riders' place in the economy.

(Statisticians have a value, called a standard deviation, that tells them how widely the values in a set are spread apart. A large SD tells you that the data are fairly diverse, while a small SD tells you the data are pretty tightly bunched together. If you'll be doing a lot of work with numbers or scientific research, it will be worth your time to learn a bit about the standard deviation.)

Percent Change

Percent changes are useful to help people understand changes in a value over time. Again, figuring this one requires nothing more than third-grade math.

Simply subtract the old value from the new value, then divide by the old value.

Multiply the result by 100 and slap a % sign on it. That's your percent change.

Let's say Springfield had 50 murders last year, as did Capital City. So there's no difference in crime between these cities, right? Maybe, maybe not. Let's go back and look at the number of murders in those towns in previous years, so we can determine a percent change.

Five years ago, Capital City had 42 murders while Springfield had just 29.

Subtract the old value from the new one for each city, then divide by the old value. That will show you that, over a five-year period, Capital City had a 19 percent increase in murders, while Springfield's increase was more than 72 percent.
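
In Python, that recipe is a one-liner. Here's a sketch using the numbers above:

    def percent_change(old, new):
        """Subtract the old value from the new, divide by the old, times 100."""
        return (new - old) / old * 100

    print(f"{percent_change(42, 50):.0f}%")  # Capital City: 19%
    print(f"{percent_change(29, 50):.0f}%")  # Springfield: 72%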

That's your lead.

Or is it? There's something else to consider when computing percent change. Take a look at per capita to find out.

Per capita, Rates and Comparisons

Percent change in value tells you only part of the story when you are comparing values for several communities or groups. Another important statistic is each group's per capita value. This figure helps you compare values among groups of different size.

Let's look at Springfield and Capital City again. This year, 800,000 people live in Springfield while 600,000 live in Capital City. Five years ago, however, just 450,000 people lived in Springfield while 550,000 lived in Capital City.

Why is this important? The fact that Springfield grew so much more than Capital City over the past five years could help explain why the number of murders in Springfield increased by so much over the same period. After all, if there are more people in a city, one might expect there to be more murders.

To find out if one city really is more dangerous than another, you need to determine a per capita murder rate. That is, the number of murders for each person in town.

To find that rate, simply divide the number of murders by the total population of the city. To keep from using a tiny little decimal, statisticians usually multiply the result by 100,000 and give the result as the number of murders per 100,000 people.

In Springfield's case, 50 murders divided by 800,000 people equals a murder rate of 6.25 per 100,000 people. Capital City's 50 murders divided by 600,000 people equals a murder rate of 8.33 per 100,000 people.

Five years ago, Springfield's 29 murders divided by 450,000 people equaled a murder rate of 6.44 per 100,000 people. And Capital City's 42 murders divided by 550,000 equaled a murder rate of 7.64 per 100,000 people.
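 
Here's a sketch of those rate calculations in Python, using the figures above:

    def per_100k(murders, population):
        """Divide events by population, then scale to per-100,000 people."""
        return murders / population * 100_000

    print(round(per_100k(50, 800_000), 2))  # Springfield now: 6.25
    print(round(per_100k(50, 600_000), 2))  # Capital City now: 8.33
    print(round(per_100k(29, 450_000), 2))  # Springfield then: 6.44
    print(round(per_100k(42, 550_000), 2))  # Capital City then: 7.64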

In Percent, we found that the number of murders in Springfield increased 72 percent over five years, while the number of murders in Capital City grew by just 19 percent. But when we now compare per capita murders, Springfield's murder rate decreased by almost 3 percent, while Capital City's per capita murder rate increased by more than 9 percent.

Standard Deviation

I'll be honest. Standard deviation is a more difficult concept than the others we've covered. And unless you are writing for a specialized, professional audience, you'll probably never use the words "standard deviation" in a story. But that doesn't mean you should ignore this concept.

The standard deviation is kind of the "mean of the mean," and often can help you find the story behind the data. To understand this concept, it can help to learn about what statisticians call normal distribution of data.

A normal distribution of data means that most of the examples in a set of data are close to the "average," while relatively few examples tend to one extreme or the other.

Let's say you are writing a story about nutrition. You need to look at people's typical daily calorie consumption. Like most data, the numbers for people's typical consumption probably will turn out to be normally distributed. That is, for most people, their consumption will be close to the mean, while fewer people eat a lot more or a lot less than the mean.

When you think about it, that's just common sense. Not that many people are getting by on a single serving of kelp and rice. Or on eight meals of steak and milkshakes. Most people lie somewhere in between.

If you looked at normally distributed data on a graph, you'd see the classic "bell curve": a single hump centered on the mean, tapering off symmetrically on either side.

The x-axis (the horizontal one) is the value in question... calories consumed, dollars earned or crimes committed, for example. And the y-axis (the vertical one) is the number of datapoints for each value on the x-axis... in other words, the number of people who eat x calories, the number of households that earn x dollars, or the number of cities with x crimes committed.

Now, not all sets of data will have graphs that look this perfect. Some will have relatively flat curves, others will be pretty steep. Sometimes the mean will lean a little bit to one side or the other. But all normally distributed data will have something like this same "bell curve" shape.

The standard deviation is a statistic that tells you how tightly all the various examples are clustered around the mean in a set of data. When the examples are pretty tightly bunched together and the bell-shaped curve is steep, the standard deviation is small. When the examples are spread apart and the bell curve is relatively flat, that tells you you have a relatively large standard deviation.

Computing the value of a standard deviation is complicated. But here is what a standard deviation represents...

One standard deviation away from the mean in either direction on the horizontal axis accounts for somewhere around 68 percent of the people in this group. Two standard deviations away from the mean account for roughly 95 percent of the people. And three standard deviations account for about 99 percent of the people.

If this curve were flatter and more spread out, the standard deviation would have to be larger in order to account for those 68 percent or so of the people. So that's why the standard deviation can tell you how spread out the examples in a set are from the mean.
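
You don't have to take those percentages on faith. Here's a little simulation sketch in Python that draws 100,000 values from a normal distribution and counts how many land within one, two and three standard deviations of the mean (your exact results will wobble a bit from run to run):

    import random
    import statistics

    data = [random.gauss(100, 15) for _ in range(100_000)]

    mean = statistics.mean(data)
    sd = statistics.stdev(data)

    for k in (1, 2, 3):
        share = sum(abs(x - mean) <= k * sd for x in data) / len(data)
        print(f"within {k} SD: {share:.1%}")  # roughly 68%, 95%, 99.7%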

Why is this useful? Here's an example: If you are comparing test scores for different schools, the standard deviation will tell you how diverse the test scores are for each school.

Let's say Springfield Elementary has a higher mean test score than Shelbyville Elementary. Your first reaction might be to say that the kids at Springfield are smarter.

But a bigger standard deviation for one school tells you that there are relatively more kids at that school scoring toward one extreme or the other. By asking a few follow-up questions you might find that, say, Springfield's mean was skewed up because the school district sends all of the gifted education kids to Springfield. Or that Shelbyville's scores were dragged down because students who recently have been "mainstreamed" from special education classes have all been sent to Shelbyville.

In this way, looking at the standard deviation can help point you in the right direction when asking why information is the way it is.

The standard deviation can also help you evaluate the worth of all those so-called "studies" that seem to be released to the press every day. A large standard deviation in a study that claims to show a relationship between eating Twinkies and killing politicians, for example, might tip you off that the study's claims aren't all that trustworthy.

Of course, you'll want to seek the advice of a trained statistician whenever you try to evaluate the worth of any scientific research. But if you know at least a little about standard deviation going in, that will make your interview much more productive.

Okay, because so many of you asked nicely...
Here is one formula for computing the standard deviation. A warning: this is for math geeks only! Writers and others seeking only a basic understanding of stats don't need to read any more in this chapter. Remember, a decent calculator or stats program will calculate this for you...

Terms you'll need to know
x = one value in your set of data
avg (x) = the mean (average) of all values x in your set of data
n = the number of values x in your set of data

For each value x, subtract the overall avg (x) from x, then multiply that result by itself (otherwise known as determining the square of that value). Sum up all those squared values. Then divide that result by (n-1). Got it? Then, there's one more step... find the square root of that last number. That's the standard deviation of your set of data.

Now, remember how I told you this was one way of computing this? Sometimes you divide by (n) instead of (n-1). The short version: dividing by (n-1) corrects for the fact that a small sample tends to understate the spread of the larger population it was drawn from, while dividing by (n) is appropriate when your data cover the entire population. So don't try to go figuring out a standard deviation if you just learned about it on this page. Just be satisfied that you've now got a grasp on the basic concept.
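
If you're curious what that recipe looks like spelled out in code, here's a sketch in Python that follows it step by step, with the standard library's version alongside as a check:

    import math
    import statistics

    data = [100_000, 50_000, 50_000, 15_000, 15_000, 15_000, 15_000, 9_000, 9_000]

    n = len(data)
    avg = sum(data) / n
    sum_of_squares = sum((x - avg) ** 2 for x in data)  # square each deviation, add them up
    sd = math.sqrt(sum_of_squares / (n - 1))            # divide by n-1, take the square root

    print(round(sd))                       # about 30567 for the WWW Co. salaries
    print(round(statistics.stdev(data)))   # the library agrees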

The more practical way to compute it...
In Microsoft Excel, type the following code into the cell where you want the Standard Deviation result, using the "unbiased," or "n-1" method:

=STDEV(A1:Z99) (substitute the cell name of the first value in your dataset for A1, and the cell name of the last value for Z99.)

Or, use...

=STDEVP(A1:Z99) if you want to use the "biased" or "n" method.

Margin of Error

Margin of error deserves better than the throw-away line it gets at the bottom of stories about polling data. Writers who don't understand margin of error, and its importance in interpreting scientific research, can easily embarrass themselves and their news organizations.

Check out the following story that moved in the summer of 1996 on a major news wire:

WASHINGTON (Reuter) - President Clinton, hit by bad publicity recently over FBI files and a derogatory book, has slipped against Bob Dole in a new poll released Monday but still maintains a 15 percentage point lead.

The CNN/USA Today/Gallup poll taken June 27-30 of 818 registered voters showed Clinton would beat his Republican challenger if the election were held now, 54 to 39 percent, with seven percent undecided. The poll had a margin of error of plus or minus four percentage points.

A similar poll June 18-19 had Clinton 57 to 38 percent over Dole.

Unfortunately for the readers of this story, it is wrong. There is no reasonable statistical basis for claiming that Clinton's lead over Dole has slipped.

Why? The margin of error. In this case, the CNN et al. poll had a margin of error of plus or minus four percentage points. That means that if you repeated this poll 100 times, 95 of those times the percentage of people giving a particular answer would be within four points of the percentage who gave that same answer in this poll.

(WARNING: Math Geek Stuff!)
Why 95 times out of 100? In reality, the margin of error is what statisticians call a confidence interval. The math behind it is much like the math behind the standard deviation. So you can think of the margin of error at the 95 percent confidence level as covering about two standard deviations in your polling sample. Occasionally you will see surveys reported at a 99 percent confidence level, which corresponds to roughly three standard deviations and a larger margin of error.
(End of Math Geek Stuff!)

So let's look at this particular week's poll as a repeat of the previous week's (which it was). The percentage of people who say they support Clinton is within 4 points of the percentage who said they supported Clinton the previous week (54 percent this week to 57 last week). Same goes for Dole. So statistically, there is no change from the previous week's poll. Dole has made up no measurable ground on Clinton.

And reporting anything different is misleading.
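
Here's that screening logic as a Python sketch. (Note this is the rough rule of thumb described above, not the formal test statisticians would use to compare two polls, but it will keep you from writing a bogus "candidate slips" lead):

    def measurable_change(old_pct, new_pct, margin):
        """Rule of thumb: a shift smaller than the margin of error
        is not a measurable change."""
        return abs(new_pct - old_pct) > margin

    print(measurable_change(57, 54, margin=4))  # Clinton: False (no real change)
    print(measurable_change(38, 39, margin=4))  # Dole: False (no real change)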

Don't overlook the fact that the margin of error is a 95 percent confidence interval, either. That means that for every 20 times you repeat this poll, statistics say that one time you'll get an answer that is completely off the wall.

You might remember that just after Dole resigned from the U.S. Senate, the CNN et al. poll had Clinton's lead down to six points. Reports attributed this surge by Dole to positive public reaction to his resignation. But the next week, Dole's surge was gone.

Perhaps there never was a surge. It very well could be that that week's poll was the one in 20 where the results lie outside the margin of error. Who knows? Just remember to never place too much faith in one week's poll or survey. No matter what you are writing about, only by looking at many surveys can you get an accurate look at what is going on.

Data Analysis

You wouldn't buy a car or a house without asking some questions about it first. So don't go buying into someone else's data without asking questions, either.

Okay, you're saying... but with data there are no tires to kick, no doors to slam, no basement walls to check for water damage. Just numbers, graphs and other scary statistical things that are causing you to have bad flashbacks to your last income tax return. What the heck can you ask about data?

Plenty. Here are a few standard questions you should ask any human being who slaps a pile of data in front of you and asks you to write about it.

1. Where did the data come from? Always ask this one first. You always want to know who did the research that created the data you're going to write about.

You'd be surprised - sometimes it turns out that the person who is feeding you a bunch of numbers can't tell you where they came from. That should be your first hint that you need to be very skeptical about what you are being told.

Even if your data have an identifiable source, you still want to know what it is. You might have some extra questions to ask about a medical study on the effects of secondhand smoke if you knew it came from a bunch of researchers employed by a tobacco company instead of from, say, a team of research physicians at a major medical school. Or if you knew a study about water safety came from a political interest group that had been lobbying Congress for a ban on pesticides.

Just because a report comes from a group with a vested interest in its results doesn't guarantee the report is a sham. But you should always be extra skeptical when looking at research generated by people with a political agenda. At the least, they have plenty of incentive NOT to tell you about data they found that contradict their organization's position.

Which brings us to the next question:

2. Have the data been peer-reviewed? Major studies that appear in journals like the New England Journal of Medicine undergo a process called "peer review" before they are published. That means that professionals - doctors, statisticians, etc. - have looked at the study before it was published and concluded that the study's authors pretty much followed the rules of good scientific research and didn't torture their data like a Middle Ages infidel to make the numbers conform to their conclusions.

Always ask if research was formally peer reviewed. If it was, you know that the data you'll be looking at are at least minimally reliable.

And if it wasn't peer-reviewed, ask why. It may be that the research just wasn't interesting to enough people to warrant peer review. Or it could mean that the research had as much chance of standing up to professional scrutiny as a $500 mobile home has of standing up in a tornado.

3. How were the data collected? This one is really important to ask, especially if the data were not peer-reviewed. If the data come from a survey, for example, you want to know that the people who responded to the survey were selected at random.

In 1996, the Orlando Sentinel released the results of a poll in which more than 90 percent of those who responded said that Orlando's National Basketball Association team, the Orlando Magic, shouldn't re-sign its center, Shaquille O'Neal, for the amount of money he was asking. The results of that poll were widely reported as evidence that Shaq wasn't wanted in Orlando, and in fact, O'Neal signed with the Los Angeles Lakers a few days later.

Unfortunately for Magic fans, that poll was about as trustworthy as one of those cheesy old "Magic 8 Balls." The survey was a call-in poll where anyone who wanted could call a telephone number at the paper and register his or her vote.

This is what statisticians call a "self-selected sample." For all we know, two or three people who got laid off that morning and were ticked off at the idea of someone earning $100 million to play basketball could have flooded the Sentinel's phone lines, making it appear as though the people of Orlando despised Shaq.

Another problem with data is "cherry-picking." This is the social-science equivalent of gerrymandering, where you draw up a legislative district so that all the people who are going to vote for your candidate are included in your district and everyone else is scattered among a bunch of other districts.

Be on the lookout for cherry-picking in, for example, epidemiological studies of illnesses in areas surrounding toxic-waste dumps, power lines, high school cafeterias, etc. ("Epidemiological" is a fancy word for the study of disease that sometimes means: "We didn't go out and collect any data ourselves. We just used someone else's data and played 'connect the dots' with them in an attempt to find something interesting.") It is all too easy for a lazy researcher to draw the boundaries of the area he or she is looking at to include several extra cases of the illness in question and exclude many healthy individuals in the same area.

When in doubt, plot the subjects of a study on a map and look for yourself to see if the boundaries make sense.

4. Be skeptical when dealing with comparisons. Researchers like to do something called a "regression," a process that compares one thing to another to see if they are statistically related. They will call such a relationship a "correlation." Always remember that a correlation DOES NOT mean causation.

A study might find that an increase in the local birth rate was correlated with the annual migration of storks over the town. This does not mean that the storks brought the babies. Or that the babies brought the storks.

Statisticians call this sort of thing a "spurious correlation," which is a fancy term for "total coincidence."

People who want something from others often use regression studies to try to support their cause. They'll say something along the lines of "a study shows that a new police policy that we want led to a 20 percent drop in crime over a 10-year period in (some city)."

That might be true, but the drop in crime could be due to something other than that new policy. What if, say, the average age of that city's residents increased significantly over that 10-year period? Since crime is believed to be age-dependent (meaning the more young men you have in an area, the more crime you have), the aging of the population could potentially be the cause of the drop in crime.

The policy change and the drop in crime might have been correlated. But that does not mean that one caused the other.

5. Finally, be aware of numbers taken out of context. Again, data that are "cherry-picked" to look interesting might mean something else entirely once they are placed in a different context.

Consider the following example from Eric Meyer, a professional reporter now working at the University of Illinois:

My personal favorite was a habit we used to have years ago, when I was working in Milwaukee. Whenever it snowed heavily, we'd call the sheriff's office, which was responsible for patrolling the freeways, and ask how many fender-benders had been reported that day. Inevitably, we'd have a lede that said something like, "A fierce winter storm dumped 8 inches of snow on Milwaukee, snarled rush-hour traffic and caused 28 fender-benders on county freeways" -- until one day I dared to ask the sheriff's department how many fender-benders were reported on clear, sunny days. The answer -- 48 -- made me wonder whether in the future we'd run stories saying, "A fierce winter snowstorm prevented 20 fender-benders on county freeways today." There may or may not have been more accidents per mile traveled in the snow, but clearly there were fewer accidents when it snowed than when it did not.

It is easy for people to go into brain-lock when they see a stack of papers loaded with numbers, spreadsheets and graphs. (And some sleazy sources are counting on it.) But your readers are depending upon you to make sense of that data for them.

Use what you've learned on this page to look at data with a more critical attitude. (That's critical, not cynical. There is a great deal of excellent data out there.) The worst thing you can do as a writer is to pass along someone else's word about data without any idea whether that person's worth believing or not.

Survey Sample Sizes

The best way to figure this one out is to think about it backwards. Let's say you picked a specific number of people in the United States at random. What then is the chance that the people you picked do not accurately represent the U.S. population as a whole? For example, what is the chance that the percentage of those people you picked who said their favorite color was blue does not match the percentage of people in the entire U.S. who like blue best?

(Of course, our little mental exercise here assumes you didn't do anything sneaky like phrase your question in a way to make people more or less likely to pick blue as their favorite color. Like, say, telling people "You know, the color blue has been linked to cancer. Now that I've told you that, what is your favorite color?" That's called a leading question, and it's a big no-no in surveying.)

Common sense will tell you (if you listen...) that the chance that your sample is off the mark will decrease as you add more people to your sample. In other words, the more people you ask, the more likely you are to get a representative sample. This is easy so far, right?

Okay, enough with the common sense. It's time for some math. (insert smirk here) The formula that describes the relationship I just mentioned is basically this:

The margin of error in a sample = 1 divided by the square root of the number of people in the sample

How did someone come up with that formula, you ask? Like most formulas in statistics, this one can trace its roots back to pathetic gamblers who were so desperate to hit the jackpot that they'd even stoop to mathematics for an "edge." If you really want to know the gory details, the formula is derived from the standard deviation of the proportion of times that a researcher gets a sample "right," given a whole bunch of samples.

Which is mathematical jargon for..."Trust me. It works, okay?"

So a sample of 1,600 people gives you a margin of error of 2.5 percent, which is pretty darn good for a poll. (See Margin of Error for more details on that term, and on polls in general.) Now, remember that the size of the entire population doesn't matter here. You could have a nation of 250,000 people or 250 million and that won't affect how big your sample needs to be to come within your desired margin of error. The Math Gods just don't care.
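
Here's that formula, and its inverse, as a Python sketch (real pollsters layer weighting and other adjustments on top of this, but the basic relationship is what you see here):

    import math

    def margin_of_error(sample_size):
        """1 divided by the square root of n, expressed as a percentage."""
        return 1 / math.sqrt(sample_size) * 100

    def sample_needed(margin_pct):
        """Flip the formula around: n = (1 / margin)^2."""
        return math.ceil((100 / margin_pct) ** 2)

    print(margin_of_error(1600))  # 2.5 (percent)
    print(sample_needed(5.0))     # 400 people
    print(sample_needed(2.5))     # 1600 people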

Of course, sometimes you'll see polls with anywhere from 600 to 1,800 people, all promising the same margin of error. That's because often pollsters want to break down their poll results by the gender, age, race or income of the people in the sample. To do that, the pollster needs to have enough women, for example, in the overall sample to ensure a reasonable margin of error among just the women. And the same goes for young adults, retirees, rich people, poor people, etc. That means that in order to have a poll with a margin of error of five percent among many different subgroups, a survey will need to include many more than the minimum 400 people in the overall sample.

Picking the Right Statistical Test

Congratulations! Just by asking this question, you're ahead of the pack: most journalists have no clue that there are different tests for different situations and different types of data.

Here's the best advice I can give you: Talk to a pro. You know how great copy editors can catch errors in syntax, usage and vocabulary that even experienced writers rarely notice? Professional statisticians do the same thing with numbers.

Smart reporters run their words by a copy editor before they hit print. Why not run your data past a statistician before publishing them?

Unfortunately, I don't know of many papers that have people with statistics degrees on their editorial staff. While our managers may not feel that correct numbers and proper analysis are important, our readers do. And the screwups that our collective lack of attention to things like statistics has caused might be part of the reason why readership's dropping at so many U.S. newspapers.

Call the press relations department of your local college or university and ask for a contact in the statistics department. Then talk with that source about what you have and what you want to do. As with any source, it's best to establish the relationship off deadline, when you have time to ask questions and wait for thoughtful answers.

"Okay, that's nice, Robert," you say. "Um, I'm under deadline now for this story/article/paper/homework assignment, and really need to know what to do...." Well, then, here are some tips:

The best resource I've found for figuring out the right test to run is Selecting Statistics, from Bill Trochim at Cornell University. To use this site, you'll need to know a little bit about your data. The site will ask you a series of questions about your data, and pick the right test for you, based on your answers.

If you want to understand why a specific test is the right choice, try Intuitive Biostatistics: Choosing a statistical test, an online chapter to a stats textbook.

When you're ready to conduct your test, you'll find links to several nifty web pages that perform stats calculations at http://statpages.org/. And additional information on testing can be found at David Lane's HyperStat Online.

T tests

If you haven't read them yet, please take a moment to read my pages on standard deviation and margin of error. They lay out a few concepts you need to understand before thinking about t-tests.

You're back? Good. Now let's start.

Often, you haven't the time or money to measure every single item in a collection of stuff. Sometimes it's just not practical, either. Let's say you want to see how much force it takes to break a new model of laptop computer. If you break them all, you won't have any left to sell. Not a good idea. Or a particularly smart business plan.

That's why you measure a smaller sample. But the standard deviation of a small sample of data doesn't necessarily tell you anything useful about how wildly the larger group's values vary around their average. And that distribution's important. Because sometimes the average of a small sample comes in where you want it to, but the sample's values are so widely spread around that you can't be sure the larger group's average will come in about the same place as the sample's.

That's where you use the t-test. It's a statistic that uses the standard deviation of the sample to help you determine interesting stuff about the larger group the sample represents.

The t-score factors in a bunch of related values. I'll list them here for reference, but please skip the four lines below if you fear that too much detail will cause you to freak out...

  • the average of the values in your sample
  • the supposed average of the larger population your sample is drawn from
  • the standard deviation of your sample's values
  • the number of values in your sample.
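
And for those who stuck around, here's how those four ingredients combine into a one-sample t-score, as a Python sketch (the laptop numbers below are made up purely for illustration):

    import math
    import statistics

    def t_score(sample, supposed_mean):
        """(sample mean - supposed population mean), divided by the
        estimated standard error of the sample mean."""
        n = len(sample)
        sd = statistics.stdev(sample)  # sample standard deviation (n-1 method)
        se = sd / math.sqrt(n)         # standard error of the mean
        return (statistics.mean(sample) - supposed_mean) / se

    # Hypothetical data: force (in newtons) needed to crack 8 test laptops,
    # checked against a supposed population average of 230 newtons.
    forces = [228, 235, 232, 241, 226, 237, 230, 233]
    print(round(t_score(forces, 230), 2))  # about 1.59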
