In the previous post, I gave some of the reasons why I hate Big Data. In this post, I’d like to talk about Big Data’s modest and more human cousin: medium data.
I’m sure someone out there has already coined the term and it’s acquired a particular definition in the argot of IT consultants. But at the time of writing it’s not yet in general use, so I’ll take the opportunity to appropriate the term “medium data” and tell you what it means to me.
First, medium data should be big enough to contain insights but small enough to be human. Medium data still involves mining data using machine learning algorithms, so the datasets can’t be small or there won’t be enough data to train your algorithms. However, the data and associated analytics mustn’t be too big.
I don’t mean too big in gigabytes. I mean the outputs should be sensible (or at least intelligible) and generally in agreement with the conclusions of relevant experts given the same data. The algorithms should result in increased human understanding, not create impenetrable mystery.
Second, medium data must not be used for advertising. It should be aimed at solving real-world problems and improving people’s quality of life.
Third, the data should be used for the benefit of the users of the systems that generated them. Said another way, medium data should be generated and used in the same (or a closely related) domain. If the data relates to bicycle traffic patterns, then it should be used for something like making cycle traffic safer, not for flogging gluten-free tortillas. Closer to my heart, if it’s data from a district heating network, it should be used to improve heat networks.
Limited by this domain specificity, medium data can’t build a full profile of you and your preferences. In the cycling traffic example, the algorithms shouldn’t try to work out how many kids you’ve got and whether you voted UKIP in the last election. Preferably, unless it’s relevant, they shouldn’t even know who you are.
Because of Big Data, it’s impossible to control your personal information on the internet. It’s on thousands of different servers; it’s been chopped, filtered and resold dozens of times. In contrast, medium data should be contained and deletable, actually deletable, if required by the data owner.
These are some of the key characteristics of medium data (or at least what I’m calling medium data). And in distinguishing medium from Big, I’m also laying out the principles that I hope built environment and energy professionals will stick to as our work increasingly involves collection and analysis of large datasets.
That usage policy all sounds a lot like German data, Casey. 😉
Check out Tado’s privacy policy:
https://www.tado.com/gb/privacy-policies
And compare with Nest:
https://nest.com/legal/privacy-statement-for-nest-products-and-services/
(not what the misleading preamble says, but the actual usage rights they grant themselves)