It's not a naive question; it's an important one. Sampling is almost always the first thing I think of when I have a "big" dataset in front of me, and it requires some careful thought.
It really depends on what you want to do. If you want to compute a statistic over a large amount of data or build a predictive model, some sort of sampling may create virtually identical results with orders of magnitude less time/memory. In modeling, sampling away certain classes may actually be essential for algorithmic stability.
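To make the "virtually identical results" point concrete, here's a minimal sketch (with made-up data and illustrative sizes): a 1% random sample of a million-row dataset typically estimates the mean to within a fraction of a unit, at a hundredth of the cost.

```python
import random
import statistics

rng = random.Random(42)

# Stand-in for a "big" dataset: one million hypothetical measurements.
population = [rng.gauss(100, 15) for _ in range(1_000_000)]

full_mean = statistics.fmean(population)

# A 1% simple random sample is usually enough to estimate the mean.
sample_mean = statistics.fmean(rng.sample(population, 10_000))

# The two estimates typically differ by well under 1 (std. error ~ 0.15 here).
print(abs(full_mean - sample_mean))
```

The standard error of the sample mean here is roughly 15 / sqrt(10,000) = 0.15, which is why the 1% sample lands so close.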
Sometimes sampling is wrong. If you have to create statistics over your entire dataset (say, number of transactions aggregated by user), sampling may not be worth the loss of precision, and it won't really save you much time either. In general, you never want to sample away data to the point that your confidence in the statistic you are calculating becomes low.
Sampling lowers your confidence resolution, period. When you're testing hypotheses the biggest constraint can be that the effect you're looking for is too small to be within the resolution that your confidence intervals give you. Improving this resolution, even by a little bit, can be worth a lot.
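A rough sketch of the trade-off, using the standard normal-approximation formula for a confidence interval on a mean (the numbers are illustrative): sampling down to 10% of the data widens the interval by a factor of sqrt(10), about 3.16x, which can easily swallow a small effect.

```python
import math

def ci_halfwidth(std_dev, n, z=1.96):
    """Approximate 95% CI half-width for a sample mean: z * s / sqrt(n)."""
    return z * std_dev / math.sqrt(n)

full    = ci_halfwidth(std_dev=1.0, n=1_000_000)
sampled = ci_halfwidth(std_dev=1.0, n=100_000)  # a 10% sample

# The interval widens by sqrt(10) ~ 3.16 when you keep only 10% of the data.
print(round(sampled / full, 2))
```

So if the effect you're hunting for is only a little larger than the full-data interval, any sampling at all puts it out of reach.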
Because you stated something absolutely, I feel the need to round off the edge. Sampling can increase your confidence resolution if it allows you to integrate signals from more data sources together using a larger model that is infeasible without sampling.
I think the difference is in the aims. With traditional statistics, you're trying to estimate some quantity of interest in the population, while with "big data", you're typically trying to make predictions for individual users. While this can be done with traditional statistics (in fact, the predict method in R does exactly that), it becomes easier to match participants on what books they might like if you have data for what books everyone in your population likes, rather than just a sample.
Now, whether or not the inferential premises of statistics hold up on website data (and population data) that typically is neither random nor representative, that's another story.
Headaches? That's glib, and it can certainly be worthwhile, but it's amazing how fast things get ugly when the data cannot be stored on one machine and held in memory. I'm still sad that Incanter didn't really turn out to be the tool I hoped it would be.
It's still a young project. I agree that it's currently suboptimal, but the source is on GitHub (along with enfer, which looked really nice but appears to have been abandoned).
I definitely think Clojure has a future in big data - the REPL, immutability and the strong links to the Java ecosystem make it an obvious choice for developing interactive (in so far as this is possible) big data applications.
Also, it's fun to code, which should never be discounted as a reason for a language to succeed.
You also often need to scan the whole data set in order to sample properly.
Suppose you are interested in user behavior. You want sample(group_by(user_ident, all_user_events)), not group_by(user_ident, sample(all_user_events)). This involves running a group_by on the full data set.
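A small sketch of that ordering, in Python with a toy event list (the function name and data are hypothetical): group the full event stream by user first, then sample whole users, so every sampled user keeps their complete history.

```python
import random
from collections import defaultdict

def sample_users_not_events(all_user_events, rate, seed=0):
    """sample(group_by(user_ident, events)): sample whole users, not events.

    all_user_events: iterable of (user_ident, event) pairs.
    Returns {user_ident: [events]} for roughly `rate` of the users.
    Note the group_by must run over the FULL data set before sampling.
    """
    by_user = defaultdict(list)               # group_by(user_ident, ...)
    for user_ident, event in all_user_events:
        by_user[user_ident].append(event)

    rng = random.Random(seed)
    return {u: evs for u, evs in by_user.items() if rng.random() < rate}

events = [("alice", "login"), ("bob", "login"), ("alice", "buy"),
          ("carol", "login"), ("bob", "logout")]
kept = sample_users_not_events(events, rate=0.5)
# Every kept user retains a complete, un-sampled event history.
```

Doing it the other way (sampling events first) would leave you with fragments of each user's behavior, which is exactly what you don't want when the unit of analysis is the user.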
In addition to what others said, when you want to analyze multiple datasets the biggest problem is to store all that on a single machine or even a machine with network-attached storage. Eventually you will start hitting the bounds of I/O which is what big data tries to solve by moving processing units closer to the data storage (in theory).
Samples can miss small pockets of behavior that are out of the norm but very interesting. As a business, people using your product or service in unexpected ways can lead to great insights, new markets, etc.
DRM-hating readers of military science fiction might be a small niche, but it's one that the publisher Baen owns.