Big data is perhaps the biggest hype of recent years. People from every sector, be it industry, sports, health care, or national policymaking, are now obsessed with big data and aspire to use it in every sphere of life.
Big data comprises lots of variables, each spooled with loads of observations; collecting data from every possible source is fashionable nowadays, often without any idea of what to do, or what can be done, with them. And quite often we do not know how to analyse data with so many variables and possibly complicated, unknown relationships among them. The number of possible pairs of variables grows in the order of the square of the number of variables. Even 'independent' pairs of variables may exhibit high correlation; for example, the divorce rate in Maine, US, during 2000-2009 correlates nicely with the per capita consumption of margarine in those years. The number of such 'spurious' or 'nonsense' correlations also grows in the order of the square of the number of variables. More than five years ago, Nassim Nicholas Taleb, author of the bestselling book The Black Swan: The Impact of the Highly Improbable, illustrated through a simulation exercise that with 500 'independent' variables, the number of 'significant' spurious correlations is nearly 6,000, and it grows to about 140,000 for 2,500 'independent' variables. Certainly, correlation does not imply causation, but in real life it is almost impossible to identify the spurious ones among millions of correlations involving thousands of variables.
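The effect is easy to reproduce. The following is a minimal sketch, not Taleb's original exercise, assuming 500 independent standard-normal variables, 100 observations each, and a two-sided 5 per cent significance threshold; with those assumptions roughly five per cent of the 124,750 possible pairs, about 6,000, come out 'significant' despite there being no real relationship at all.

```python
# Minimal simulation sketch: counting 'significant' spurious correlations
# among purely independent variables (illustrative assumptions, not Taleb's code).
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n_obs, n_vars = 100, 500          # assumed sample size and number of variables

# Independent data: no true correlation exists between any pair of columns.
X = rng.standard_normal((n_obs, n_vars))

# Pairwise Pearson correlations between all columns.
corr = np.corrcoef(X, rowvar=False)

# Test each pair once (upper triangle) at the 5 per cent level.
iu = np.triu_indices(n_vars, k=1)
r = corr[iu]
t = r * np.sqrt((n_obs - 2) / (1 - r**2))
p = 2 * stats.t.sf(np.abs(t), df=n_obs - 2)

n_pairs = len(r)
n_spurious = int((p < 0.05).sum())
print(f"{n_pairs} pairs, {n_spurious} 'significant' at the 5% level "
      f"(about {0.05 * n_pairs:.0f} expected by chance alone)")
```

Rerunning the same sketch with n_vars = 2500 gives over three million pairs and, by the same arithmetic, on the order of 150,000 chance 'discoveries', in line with the figures quoted above.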
With the ever-expanding horizon of the Internet of Things (IoT), big data is continuously becoming bigger. The growth of data is exponential: the size of the digital universe is expected to double every two years beyond 2020. And we do not know how to leverage that volume of data, for we have neither the statistical expertise to handle thousands of variables and eliminate 'spurious' correlations, nor suitable computational algorithms and equipment to handle billions of data points. Even where algorithms are available, standard computers are inadequate for this gigantic volume of data.
However, the ocean of big data contains limitless possibilities, and the aspiration to extract knowledge from the heartbeats of big data is also huge. The problem is that the present technology and expertise are still primitive; let's be honest and admit that. Our adventure might still succeed in particular cases, with special prior knowledge and expertise in the topic and, of course, effective use of 'instinct', but certainly not in general. That is why I am very sceptical about running routine software packages to analyse big data; instead, the required tools need to be developed very carefully, case by case. And that is a time-consuming research exercise which can only be performed by top statisticians and computer scientists working together.
There are, of course, some big data success stories. In their 2003 book, Scoring Points: How Tesco Continues to Win Customer Loyalty, Clive Humby, Terry Hunt and Tim Phillips discussed how the UK-based grocer Tesco fuelled rapid growth by analysing data on customer purchase behaviour. Today we have an unprecedented ability to collect and store data. But we should always be very careful with the monitoring infrastructure built to understand individuals' life pathways from such loads of data.
In May 2017, Cisco reported that only 26 per cent of survey respondents were successful with IoT initiatives, implying a 74 per cent failure rate. In November 2017, Gartner analyst Nick Heudecker estimated that about 85 per cent of big data projects fail. My personal belief is that the actual failure percentage is even higher, as 'success' is not well defined in most situations dealing with big data, making it difficult to gauge the quantum of failures, or even to recognise a 'failure'. When an organisation is happy with the apparent 'success' of a strategy framed by big data analytics, it fails to understand what more could have been done, unless the endeavour collapses like the Google Flu Trends experiment. There is also serious doubt about data quality in most cases: according to a Harvard Business Review article of September 2017, only 3 per cent of companies' data meets basic quality standards.
In the 1958 Hollywood movie The Blob, a meteorite landed in a small Pennsylvania town carrying an alien amoeba, which expanded and swallowed up people and structures, threatening to envelop the whole town. Today's big data sometimes resembles that amoeba, which devours everything. In the process, big data is getting bigger, and so are our aspirations; our capacity to handle the data, however, has not grown proportionately. In the six-decade-old movie, the air force finally had to swoop in and airlift the amoeba to the Arctic. Well, is that the appropriate way to stop the 'Blob' until one gets equipped to handle it?

The writer is professor of statistics at the Indian Statistical Institute, Kolkata