Why does polling work?
The basic tenet of polling is that if you ask a sufficiently large number of randomly selected people-- something in the range of 500 to 5,000-- you can approximate the opinion of the entire population fairly accurately.
Why does this work? In the end, a poll is a series of loaded coin tosses. Does your respondent support Obama or Romney? It depends on a bunch of characteristics-- age, gender, party, race, education, income and some other stuff.
If you have a perfect random sample, your respondents are just as likely to be black, white, female, young, etc. as the general population, so by extension they're also as likely to support or oppose the President. But clearly, when you just ask one or two people, that's not enough.
What makes polling work is the Central Limit Theorem-- if you take a long series of coin tosses, the distribution of the share that come up heads can be mathematically approximated by the normal distribution (or in layman's terms, the bell curve).
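If you'd rather see that than take it on faith, here is a minimal simulation sketch. It assumes a made-up "true" support level of 60% and polls of 500 people; the function name `simulate_polls`, the seed and the number of trials are all just illustrative choices, not anything a real pollster uses.

```python
import random
import statistics

def simulate_polls(true_p, n, trials, seed=0):
    """Run `trials` imaginary polls of `n` respondents; return each poll's observed share."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    return [
        sum(rng.random() < true_p for _ in range(n)) / n  # one loaded coin toss per respondent
        for _ in range(trials)
    ]

# Assumed values, purely for illustration
TRUE_P, N, TRIALS = 0.6, 500, 2000

shares = simulate_polls(TRUE_P, N, TRIALS)
mean_share = statistics.mean(shares)
spread = statistics.pstdev(shares)

# What the normal approximation predicts for the spread of the poll results
theory = (TRUE_P * (1 - TRUE_P) / N) ** 0.5

# Fraction of simulated polls landing within one predicted standard deviation
within_1sd = sum(abs(s - TRUE_P) <= theory for s in shares) / len(shares)

print(f"mean of polls: {mean_share:.3f}")
print(f"observed spread: {spread:.4f}, predicted: {theory:.4f}")
print(f"share of polls within one predicted sd: {within_1sd:.2f}")
```

The observed spread should land close to the predicted one, and roughly two thirds of the simulated polls should fall within one standard deviation of the true value, which is the bell curve at work.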
This allows us to calculate the much-misused Margin of Error of a poll.
The formula for that is simple. If you have n respondents, and they have probability p of giving a certain response, then the margin of error around that response will be
p +/- [sqrt(p*(1-p)/n)]*1.96
In other words, if 60% of your 500 respondents support Obama, your margin of error would be 60% +/- [sqrt(.6*(1-.6)/500)]*1.96, or 60% +/- about 4.3%.
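For anyone who wants to plug in their own numbers, the formula is a one-liner. This is a small sketch; the function name `margin_of_error` and the default `z=1.96` are my own choices for illustration.

```python
import math

def margin_of_error(p, n, z=1.96):
    """Margin of error for a share p (between 0 and 1) among n respondents."""
    return z * math.sqrt(p * (1 - p) / n)

# The example from the text: 60% support among 500 respondents
moe = margin_of_error(0.6, 500)
print(f"60% +/- {moe:.1%}")
```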
Pollsters in press releases are usually kind of lazy and assume a p of 50/50 regardless of what the poll result was (the margin of error is greatest there), but if you're looking at, for instance, African-American subgroup crosstabs that actually makes a very large difference.
If you have 100 respondents and they give you a 50/50 response, your MoE would be 9.8%. If they give you a 95/5 response, it would be just 4.3%.
This is fairly intuitive, I think-- what's easier to poll, after all, Ohio or North Korean parliamentary elections?
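To see how much the assumed p matters, you can tabulate the margin of error for a fixed subgroup size. This is only a sketch: the sample size of 100 matches the example above, and the list of p values is picked arbitrarily.

```python
import math

def margin_of_error(p, n, z=1.96):
    """Margin of error for a share p among n respondents."""
    return z * math.sqrt(p * (1 - p) / n)

N = 100  # subgroup size from the example above
table = {p: margin_of_error(p, N) for p in (0.50, 0.60, 0.70, 0.80, 0.90, 0.95)}

for p, moe in table.items():
    print(f"p = {p:.2f}  ->  +/- {moe:.1%}")

# The margin of error is widest at 50/50 and shrinks as the result gets lopsided
widest_p = max(table, key=table.get)
```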
Some things people usually do wrong/Interesting tidbits
1) It should be clear from the above that the margin of error applies to someone's support, NOT to the margin between the candidates.
If the margin of error of a poll is +/- 4, and the poll result is 48-42, then no, the lead is not outside of the margin of error. The poll is saying that the range for Candidate A spans 44 to 52 (plus or minus a few hundredths of a percent, because the p here is .48 and not .5), and the range for Candidate B spans 38 to 46. Those two ranges overlap.
Newspaper write-ups of polls frequently get this wrong. So do a lot of you guys.
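The check from point 1 can be spelled out in a few lines. This is a sketch with hypothetical helper names (`interval`, `overlaps`); it applies the reported +/- 4 to each candidate separately and ignores the few-hundredths correction for p not being exactly .5.

```python
def interval(share, moe):
    """Range implied by a candidate's share and the poll's margin of error (in points)."""
    return (share - moe, share + moe)

def overlaps(a, b):
    """True if two (low, high) ranges share at least one point."""
    return a[0] <= b[1] and b[0] <= a[1]

candidate_a = interval(48, 4)
candidate_b = interval(42, 4)

print("Candidate A:", candidate_a)
print("Candidate B:", candidate_b)
print("Ranges overlap:", overlaps(candidate_a, candidate_b))
```

A six-point lead with a +/- 4 margin of error still leaves the two ranges overlapping, which is exactly the case the write-ups tend to get wrong.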
2) Not every result in the margin of error is equally likely.
This should go without saying, but it doesn't. How often do you see a poll that has someone leading by three points, and then a bunch of people say "Oh, it's within the margin of error. It could go either way!"-- or someone cites the infamous "statistical tie"?
No. If someone has a three-point lead, he's more likely to be ahead than the other candidate-- even by quite a bit. It's just not 95% certain.
You can actually calculate exactly how likely it is that the candidate is ahead.
Remember how the formula for the margin of error was the following:
p +/- [sqrt(p*(1-p)/n)]*1.96
This has two major parts-- sqrt(p*(1-p)/n) and then the 1.96. The first part calculates the standard deviation of the poll result (statisticians call this the standard error). It tells you how far the observed value typically lands from the true result, and that lets you work out how likely it is that the poll is within a certain distance of the truth.
For one standard deviation you can be about 68% certain that the true opinion will be within that range of the poll result. But people want to be more certain, so they use a 95% certainty cutoff (or "confidence interval")-- and that just so happens to correspond to 1.96 standard deviations.
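If you don't feel like reading a table, Python's standard library can do the lookup. A small sketch, assuming Python 3.8 or later (that's when `statistics.NormalDist` was added); the helper names `coverage` and `z_for_confidence` are mine.

```python
from statistics import NormalDist

std_normal = NormalDist()  # mean 0, standard deviation 1

def coverage(z):
    """Probability that a normal value lands within z standard deviations of the mean."""
    return std_normal.cdf(z) - std_normal.cdf(-z)

def z_for_confidence(level):
    """Number of standard deviations needed for a two-sided confidence level."""
    return std_normal.inv_cdf(0.5 + level / 2)

print(f"within 1 sd: {coverage(1):.1%}")
print(f"z for 95%:   {z_for_confidence(0.95):.2f}")
```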
You can use this table to look that up-- just go to the second table and multiply all values in it by 2.
Now let's assume that we have a poll of 500 people, and Obama leads Romney 52-48. What's the chance that he's actually ahead?
The standard deviation of Obama's share is sqrt(.52*(1-.52)/500), or about 2.2 points. We're interested in figuring out whether he's actually over 50%, and we've measured him at 52%, with a standard deviation of 2.2%. So him being actually under 50% would be a 2/2.2 (or 0.91) standard deviation event.
We can look that up in the first of the two tables in the Wikipedia article and find that a value of 0.91 corresponds to a chance of 81.86% that Obama is actually ahead-- far from 50-50.
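The same lookup can be done in code. This is a sketch under the same simplification as the text (it asks whether one candidate's true share is above 50%, treating the poll as a perfect random sample); `prob_above_half` is a hypothetical helper name. Note that it doesn't round the standard deviation to 2.2 first, so it comes out slightly lower than the table figure, at roughly 81.5%.

```python
import math
from statistics import NormalDist

def prob_above_half(share, n):
    """Chance that the true share exceeds 50%, using the normal approximation."""
    se = math.sqrt(share * (1 - share) / n)  # standard deviation of the poll result
    z = (share - 0.5) / se                   # how many standard deviations above 50%
    return NormalDist().cdf(z)

p_ahead = prob_above_half(0.52, 500)
print(f"chance Obama is really above 50%: {p_ahead:.1%}")

# The table lookup from the text, using the rounded 0.91 standard deviations
table_value = NormalDist().cdf(0.91)
print(f"table value for 0.91: {table_value:.2%}")
```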
Does that mean that Obama actually has an 81.86% chance to win, given just that one poll?
No. The poll has a house effect (i.e., it is biased towards one candidate), and it also has a design effect, which makes it less accurate (in either direction) than it should be.
That's not nearly as easy to figure out, and this is why we need people like Nate. But just as a rule of thumb, you guys have a fairly good intuition about house effects, and for design effects it generally makes sense to add 30% to the reported margin of error of the poll.
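Here is a rough sketch of what that rule of thumb does to the numbers. The 30% figure is the one from the paragraph above; applying it to the standard deviation as well as to the margin of error is my assumption (the margin of error is just 1.96 standard deviations, so scaling one scales the other). House effects are not modeled at all, and the function names are made up for illustration.

```python
import math
from statistics import NormalDist

def adjusted_moe(reported_moe, inflation=0.30):
    """Reported margin of error, widened by the rule-of-thumb design effect."""
    return reported_moe * (1 + inflation)

def prob_ahead(share, n, inflation=0.30):
    """Chance the true share is above 50%, with the standard deviation widened too."""
    se = math.sqrt(share * (1 - share) / n) * (1 + inflation)
    return NormalDist().cdf((share - 0.5) / se)

print(f"reported +/- 4.0 becomes +/- {adjusted_moe(4.0):.1f}")
print(f"52% of 500, no adjustment:  {prob_ahead(0.52, 500, inflation=0.0):.1%}")
print(f"52% of 500, 30% adjustment: {prob_ahead(0.52, 500):.1%}")
```

The adjustment pulls the probability back towards 50-50, which is the direction you'd expect from a noisier poll.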