why I hate Big Data

June 15, 2015 by Casey Cole

There’s a lot of noise about Big Data, with much of it in the energy space. The rise of the Internet of Things, self-learning thermostats, the success of Opower – all are linked to the generation and automated processing of massive datasets.

The hype machine promises that Big Data will solve all sorts of problems and make the world a better place.

But hype or no hype, personally I hate Big Data.

I can hear the sharp intake of breath at Guru’s PR company, who’ve done a great job getting press coverage for “Big Data” work that we’ve done for the Department of Energy and Climate Change. Bear with me, guys, I’ll explain in this post.

First, a clarification of terms: there’s a tendency to refer to any large dataset as Big Data, even when it isn’t. If you’re not sure what Big Data is, here’s a nice background piece by Tim Harford, the FT’s undercover economist.

So what’s wrong with Big Data? Here are some examples:

1. Overfitting

It’s not enough to simply have a massive data set. To be of value, Big Data has to be mined. And this is where machine learning algorithms come in: automatically searching for patterns in the data, recognising our preferences and online behaviour, grouping us into clusters with other people like us.

I’m fascinated by machine learning. But unless you’re very careful, these algorithms can be too clever and too complex. They might do a brilliant job of spotting patterns in the datasets they’re trained on, but subsequently get it wrong when presented with new data. This is called overfitting, or high variance, and it’s a big part of the reason Google Flu Trends failed.
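To make that concrete, here’s a minimal sketch of overfitting in Python. Everything in it is made up for illustration: the sine-wave "truth", the noise level, the 15 training points and the two polynomial degrees are my assumptions, not anything taken from a real dataset. It fits a simple model and a far more flexible one to the same small sample, then scores both on the data they were trained on and on fresh data they have never seen.

```python
import numpy as np
from numpy.polynomial import Polynomial

rng = np.random.default_rng(0)  # fixed seed so the run is repeatable


def make_data(n):
    """Noisy samples of an assumed 'true' pattern, y = sin(pi * x)."""
    x = rng.uniform(-1, 1, n)
    y = np.sin(np.pi * x) + rng.normal(0, 0.3, n)
    return x, y


def mse(model, x, y):
    """Mean squared error of a fitted polynomial on (x, y)."""
    return float(np.mean((model(x) - y) ** 2))


x_train, y_train = make_data(15)   # the small dataset the model is trained on
x_test, y_test = make_data(200)    # new data the model has never seen

results = {}
for degree in (3, 12):
    # Least-squares polynomial fit; a higher degree means a more complex model.
    model = Polynomial.fit(x_train, y_train, degree)
    results[degree] = (mse(model, x_train, y_train), mse(model, x_test, y_test))
    train_err, test_err = results[degree]
    print(f"degree {degree:2d}: train error {train_err:.3f}, test error {test_err:.3f}")
```

The complex model always looks at least as good on the training data, because it has enough freedom to chase the noise. The number that matters is the error on the new data, which is where an overfitted model gives itself away.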

As a result, data mining can unearth unexpected results. Like strange product recommendations: Bought uranium ore? Why not try an anal douche? And how can an otherwise brilliant neural network be 100% certain there’s a cheetah hiding in TV static?

Clearly the algorithms are not infallible. Maybe not a big deal when the stakes are low – but a much bigger deal when your liberty or safety is at stake.

2. Buy! Buy! Buy!

The biggest civilian application of Big Data is making advertising more effective. Through understanding our preferences, advertisers hope to target the people who are most likely to buy their products.

By allowing sellers to spend their advertising budget on the equivalent of snipers rather than carpet bombs, companies like Google make advertising much more efficient, and get rich in the process.

The links you click, the time you spend on each page, the stuff you type in a form (even before you press submit) – all of it is routinely collected and fed into the Big Data machine so that it can guess what you want, even before you know it yourself.

Whether you know it or not, the apps on your smartphone are also working hard on behalf of advertisers. While you’re crushing candy, the apps are hoovering up your data and selling it to mobile advertising companies like Flurry, so they can better profile you and flog you stuff. Your Sky box is doing the same.

If you value privacy or have sympathy with Bill Hicks’s view on the value of advertising, this dismal application of human ingenuity is bound to make you angry.

3. You are more than a vector of thetas

I want access to the Web, not My Web. I hate the idea of an algorithm deciding what content is served up to me. I don’t want the news headlines I see to be filtered by my previous reading habits or who my friends are. My online media consumption is enough of an echo chamber already, and I’m reasonably benign. Imagine what happens when some of the wackjobs out there start thinking that the whole world shares their views. Keeping an open mind is hard enough without the internet subtly, insidiously trying to close it for me.

4. Linkability and durability

A key aim of Big Data is to make connections between individual pieces of data. The most meaningful connection is where disparate data can be linked to an individual. The common thread may be an email address, location data, a photo posted to Facebook or just patterns in online behaviour. The more complete the profile, the easier it is to predict your preferences, and the higher the value of that profile to the right buyer.
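Here’s a toy sketch in Python of how little it takes to do that linking. The three sources, the records and the function names are all hypothetical, invented for this example, and real profiling systems are far more elaborate. The point is only that once there’s a common thread (an email address here), stitching separate datasets into one profile is a few lines of code.

```python
from collections import defaultdict

# Made-up records from three unrelated (hypothetical) sources.
shop_orders = [
    {"email": "Alex@Example.com", "item": "smart thermostat"},
    {"email": "sam@example.com", "item": "radiator valves"},
]
app_events = [
    {"email": " alex@example.com", "location": "Leeds"},
]
forum_posts = [
    {"email": "sam@example.com", "topic": "heat pumps"},
]


def normalise(email):
    """Tidy up the common thread so trivial differences don't hide a match."""
    return email.strip().lower()


def link_profiles(**sources):
    """Merge records from every source into one profile per email address."""
    profiles = defaultdict(dict)
    for source_name, records in sources.items():
        for record in records:
            key = normalise(record["email"])
            details = {k: v for k, v in record.items() if k != "email"}
            profiles[key].setdefault(source_name, []).append(details)
    return dict(profiles)


profiles = link_profiles(shop=shop_orders, app=app_events, forum=forum_posts)
for email, profile in profiles.items():
    print(email, profile)
```

Every extra source that shares the same thread drops straight into the same profile, which is exactly why the profile gets more valuable the longer you leave a trail.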

Like Chris Rock said, life is long. As you move through it, you continuously slough off data. Your tweet, whether impulsive or carefully considered, quickly drops off the bottom of your Twitter timeline and, no longer relevant to any human interaction, settles into the ooze at the bottom of the internet. There it’s squeezed under the weight of accumulating data, preserved forever in a fossilised seam, waiting to be mined by the algorithms and someday linked with an electronic model of you.

It makes me queasy.

That’s just a sample of the reasons why I hate Big Data.

But thankfully Big Data has a younger cousin that’s more human, more pleasant and much more promising: medium data. I know it sounds like just another buzzword, but hear me out and I’ll try to make some important distinctions in the next post.

Posted in machine learning | Tagged Big Data, DECC | 2 Comments

2 Responses

on June 15, 2015 at 10:42 am | Marko

“…all are linked to the generation and automated processing of massive datasets…”

Not really. Their *marketing* is linked to machine learning claims. The actual product is based on far simpler (constrained) searches through large datasets. Humans fitting the curves that physics/experience expect to see, not machines gleaning insights from the noise of consumer behaviour.

Search and advertising is a different matter. Nest doesn’t use machine learning to keep your house warm/cool though, and nor do OPower in my view.

Heat networks need “data that isn’t complete rubbish” and an experienced human on the other end who knows which questions to ask.
- on June 16, 2015 at 9:41 am | Casey Cole
  
  We should distinguish between machine learning and Big Data.
  
  Opower and Nest are using machine learning, even beyond the marketing hype. Opower does things like k-means clustering to identify customer types and an algorithm to try and work out what makes up an individual house’s load curve. Nest uses ML to guess when you want the heating on. The former might be considered a Big Data application. The latter, not so much. I think where Nest comes into the Big Data picture is by throwing your consumption profile into a massive data pot, with the aim of identifying household types and preferences.


carbon limited

low carbon energy and engineering