Why Big Data must be used with caution
May 12 2014
Suddenly, social scientists are talking about strategic decision-making based on detecting hidden patterns: mining information from large databases collected through regular and irregular means, or using algorithms that detect and monitor people's movements. For example, if a person searches the web for a two-door refrigerator, then for the next several days she will be shown advertisements for that product whenever she opens her mailbox or browser. The algorithm does not know that she may no longer need the product.
Big Data seeks to render decision-making a science, with the art thrown out. It appears that even the job of CEOs can now be taken over by automated expert systems! Leadership is about dealing with choices, alternatives, and perspectives. If decision-making were based entirely on data, experience and judgment would become redundant. Can Big Data-driven decision-making encourage out-of-the-box thinking? Unlikely. It would encourage more linear thinking instead of prompting the manager to consider unintended consequences.
There are several pitfalls in data-driven approaches. Foremost is that data may or may not be 'a fact', or even factual: data can be spurious, contaminated, or inherently false (and the researcher may never become aware of it). Secondly, data collated from different sources becomes probabilistic; it can predict results only within a range of accuracy that depends on context and subject (what statisticians call 'confidence levels'). For example, a patient undergoing a new but painful treatment may report either 'improvement' or 'no improvement' depending on his inclination to take further doses of the new medicine. A urine sample sent for pathological testing may be contaminated if collected in a non-sterile bottle, or mislabelled with someone else's name. This happens in all walks of life and all disciplines, leading to the phenomena known as 'false positives' (FP) and 'false negatives' (FN).
Even if FPs and FNs each account for only 5 per cent of results, the wrong conclusions could add up to as much as 10 per cent across the entire population. This probability of wrong results can lead to disastrous consequences in critical endeavours such as air traffic control, disease management, and supply chain and logistics management.
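The arithmetic above can be sketched in a few lines (the population size and rates below are purely illustrative, and FP and FN fractions are taken over all results, so the wrong conclusions simply add up):

```python
# Hypothetical illustration: if 5% of all results are false positives
# and another 5% are false negatives, one result in ten is wrong.
total = 10_000                    # results in the database (illustrative)
false_pos = int(0.05 * total)     # wrongly flagged as positive
false_neg = int(0.05 * total)     # wrongly flagged as negative
wrong = false_pos + false_neg     # every FP and FN is a wrong conclusion
print(wrong / total)              # 0.1, i.e. 10 per cent
```

In practice the combined rate depends on how the two error rates are defined and on how common the condition being tested for is, but the simple sum gives the worst-case figure the argument relies on.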
Which is more dangerous or critical, FPs or FNs? It is difficult to say. I posed this question to a number of oncology surgeons, and they were unanimous that FNs, in their field, are far more dangerous: the patient may need timely treatment, and an FN result would delay it. But the same question put to physicians (non-surgeons) brought forth the opposite answer: FPs are more dangerous, since antibiotics would be given when the patient does not need treatment at all!
In law and statistics, FPs and FNs are also known as type 1 and type 2 errors. It is instructive to see how tightening one type of error widens the other. One may remember a sessions judge in Delhi letting off the rapist-murderer in the famous Mattoo case even though the judge said that 'he (the judge) knew the accused had committed the crime, but he had no conclusive proof to pass conviction'. (The higher courts, however, overturned the judgment and sent the murderer to jail.)
If jurisprudence is based on the principle that all accused are innocent unless proven guilty beyond doubt, and that no innocent should be convicted, then it is possible that at least 50 criminals would be let off (essentially, we are trying to avoid type 1 errors, but this results in massive type 2 errors). The reverse is also true: if type 2 errors are to be avoided by not insisting on 100 per cent conclusive proof, then many type 1 errors would occur. How does society strike a balance in this confusion?
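The trade-off described above can be simulated with a simple thought experiment (all numbers and distributions here are invented for illustration): treat the "strength of evidence" in each case as a score, and convict only when the score crosses a threshold. Raising the threshold (demanding stronger proof) reduces type 1 errors, convicting the innocent, but inevitably increases type 2 errors, acquitting the guilty:

```python
import random

random.seed(42)

# Hypothetical evidence scores: guilty cases tend to score higher,
# but the two distributions overlap, so no threshold is error-free.
innocent = [random.gauss(0.0, 1.0) for _ in range(10_000)]
guilty   = [random.gauss(2.0, 1.0) for _ in range(10_000)]

def error_rates(threshold):
    """Convict whenever the evidence score exceeds the threshold."""
    type1 = sum(s > threshold for s in innocent) / len(innocent)  # innocent convicted
    type2 = sum(s <= threshold for s in guilty) / len(guilty)     # guilty acquitted
    return type1, type2

for t in (0.5, 1.5, 2.5, 3.5):
    t1, t2 = error_rates(t)
    print(f"threshold {t}: type 1 = {t1:.3f}, type 2 = {t2:.3f}")
```

Running this shows type 1 errors shrinking and type 2 errors swelling as the bar for conviction rises, which is exactly the dilemma the jurisprudence example poses: the threshold can be moved, but the two errors cannot both be driven to zero.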
Experienced doctors have already shown that they do not rely on a single statistic; they look at various cues and 'other' information about the patient's health before taking a call on treatment. A detected pattern may simply have been interpreted the way we would like to make sense of it (recall the milk-drinking Ganeshas and Shivlingas across the country a few years ago).
Therefore, our human obsession with pattern recognition can become self-defeating: we may see a pattern even in large data that is otherwise totally random (a pattern that may not be present, or detectable, in another large data population, though such comparison data is difficult to generate).
(The writer is a professor of strategy and corporate governance, IIM-Lucknow)