Philadelphia Reflections: Hadoop and Big Data in Medicine

Lord Maynard Keynes

Hadoop, a software program for lashing fifty thousand computers together, is what gave macroeconomics, the study of huge populations, its big push. The aristocratic Maynard Keynes, who invented macroeconomics, would probably not be amused, on looking up Hadoop on any search engine, to find that anyone who asks can download it free of charge. Fifty thousand computers? Anyone can also rent eight hours of time on them from IBM or Amazon, for about ten dollars. Not many great scientific discoveries have become widely available so quickly or so cheaply.

Although the news media will probably concentrate on locating spies in Central Asia, or predicting the outcome of national elections, or telling which dot in the sky is really an approaching asteroid, Hadoop will certainly make it easier to make predictions in health insurance. Creating 300 million individual policies is doable, and projections of the gross domestic product become much easier, more accurate, and able to extend farther into the future. Ideas of preserving privacy in this avalanche are simply swept away by the discovery that much of what we thought was privacy was just a matter of being lost in a forest of data. So let us momentarily feel safe in predicting that a system of individually owned health insurance is entirely practical, cradle to grave, or at least need not be rejected as impractical because of size. If the Federal Reserve can manage a portfolio of $3 trillion, a national piggy bank for health care costs is not beyond our ability to manage. Set aside for a moment whether it is desirable to do such a thing; it is definitely possible to do it. Since small-scale tests seem to show potential savings in American healthcare costs in the range of 5% of annual American GDP, development costs need not stop us. Although the plans of Obamacare could bankrupt the nation, it is also possible that what is truly wrong with them is that the thinking is too small. Bad implementation is expensive, and failure to abort a failing program is worse. But getting the wrong design for the program is fatal.

The general process for getting things right in politics is to do something, and see if something bad happens. If not, do even more of it. But if your monitor shows that something bad is really happening, drop the project. Big Data, the process of monitoring huge amounts of data simultaneously, using Hadoop and fifty thousand computers in the desert, could be a monitor for experimental changes in the health insurance system. The trick is to include automatic monitoring alarms as enormous volumes of data flow past. The incentive for alertness is that this data will be there anyway, and somebody in the role of trial attorney can go back in retrospect and show you missed a trend. Presumably, the outcomes to measure are whether health is improving and costs are going down, compared with past trends and with other nations. Doing localized experiments, by states perhaps, would allow you to compare that state with others. It's rather like politicians giving speeches, and then watching what happens to their popularity polls. But it can be like counting the number of grains of sand on the beach -- who cares?
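One way to picture an automatic monitoring alarm is a rule that compares each new reading with the recent trend and flags anything far outside it. The sketch below is only an illustration under stated assumptions: the function name trend_alarms, the twelve-reading window, the three-standard-deviation threshold, and the monthly cost figures are all hypothetical, and a real monitor would run across many machines rather than over a short list.

```python
from collections import deque
from statistics import mean, stdev


def trend_alarms(values, window=12, threshold=3.0):
    """Yield (index, value, z_score) for readings far from the trailing trend.

    Each reading is compared with the mean and standard deviation of the
    previous `window` readings. Both parameters are illustrative choices.
    """
    history = deque(maxlen=window)
    for i, value in enumerate(values):
        if len(history) == window:
            mu = mean(history)
            sigma = stdev(history)
            if sigma > 0:
                z = (value - mu) / sigma
                if abs(z) >= threshold:
                    yield i, value, z
        # The reading joins the history whether or not it raised an alarm.
        history.append(value)


# Hypothetical monthly cost-per-member figures; the jump to 130 is invented.
monthly_cost = [100, 101, 99, 100, 102, 98, 100, 101, 99, 100,
                101, 99, 100, 130, 101]

alarms = list(trend_alarms(monthly_cost))
for index, value, z in alarms:
    print(f"month {index}: cost {value} is {z:.1f} standard deviations off trend")
```

The same rule could be applied state by state, so that an experiment in one state is compared both with its own past and with its neighbors.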

When any innovation is this new, powerful and cheap, it is almost impossible to slow the stampede to try it out. Almost anything which can be imagined will be tried out, and a few surprising things will be discovered quickly. But then it can be predicted that things will settle down to using this big machine on statistical issues which were formerly just beyond the reach of computing, leaving Hadoop to find its niche. Genomics comes readily to mind in medicine. But already a quite different sort of use has appeared in statistics. Statisticians have built up a whole structure around the estimation of large numbers by careful examination of small samples. The science of such approaches is the science of carefully selecting representative samples of a predetermined size, measuring their contents, and then extrapolating the size and composition of the original. Quite often, more time and expense were devoted to assuring the representativeness of the sample than were spent extrapolating the answer.
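The sample-and-extrapolate approach can be put next to a full count in a few lines. This is a minimal sketch on invented data: the population of 100,000 records, the 30% rate, the sample size of 1,000, and the function name estimate_total_from_sample are all assumptions made for illustration.

```python
import random


def estimate_total_from_sample(population, sample_size, rng):
    """Extrapolate a population total from a simple random sample."""
    sample = rng.sample(population, sample_size)
    proportion = sum(sample) / sample_size
    return proportion * len(population)


rng = random.Random(0)  # fixed seed so the run is repeatable

# Synthetic population: 1 means the person has some condition, 0 means not.
population = [1 if rng.random() < 0.30 else 0 for _ in range(100_000)]

# Counting the whole thing: exact, but it touches every record.
true_total = sum(population)

# Sampling: touches 1,000 records and extrapolates to the rest.
estimate = estimate_total_from_sample(population, 1_000, rng)

relative_error = abs(estimate - true_total) / true_total
print(f"full count: {true_total}")
print(f"sample estimate: {estimate:.0f} (relative error {relative_error:.1%})")
```

The estimate depends on which records happen to be drawn, which is why so much effort went into making the sample representative. The full count has no such dependence.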

Almost overnight, that whole approach has been swept away. With fifty thousand computers, it is easier just to count the whole thing than to bother with samples. The interesting thing for medicine will be the immediate reconsideration of subsets. When a study is conducted, let's say to see if a drug helps high blood pressure, a lot of data is collected. Regardless of whether the drug helped high blood pressure overall, it is possible to see if it helps the blood pressure of Hispanics, or of Chinese, or young women, or old men, or people with diabetes, or, well, you get the idea. In statistics, a result is conventionally accepted as real if a difference that large would turn up by chance less than 5% of the time. But that means that 5% of the time, or one time in twenty, it just happened that way by coincidence. So, if you go on splitting the data into a hundred pieces, it will appear to be true in about five of them when it was really only due to chance, and maybe wasn't true in any of them. That error, which is very common, is reduced by measuring the whole population instead of taking samples and extrapolating from them, because the chance variation introduced by sampling disappears. So, the long and the short of it is that a whole profession of sample analyzers is now out of a job, while the amount of false information is greatly reduced. Now, we can start to see the power of Hadoop emerging, although it is too soon to say what it will be used for.
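The one-in-twenty arithmetic for subsets can be checked directly. The sketch below assumes a hundred independent subsets and a drug with no real effect in any of them, so that each subset has a 5% chance of looking like a success. The Bonferroni threshold is a standard correction added here for comparison and is not part of the discussion above. The function name and trial count are hypothetical.

```python
import random

ALPHA = 0.05        # the conventional one-in-twenty cutoff
N_SUBSETS = 100     # number of pieces the data is split into

# Expected number of subsets that look significant by chance alone.
expected_chance_findings = ALPHA * N_SUBSETS

# Probability that at least one subset looks significant by chance
# (about 0.994 for a hundred independent subsets).
prob_at_least_one = 1 - (1 - ALPHA) ** N_SUBSETS

# Bonferroni-corrected cutoff: divide the cutoff by the number of tests.
bonferroni_alpha = ALPHA / N_SUBSETS


def count_chance_findings(n_subsets, alpha, rng):
    """Count subsets that pass the cutoff when the drug does nothing.

    With no real effect, each subset's p-value is modeled as uniform on
    [0, 1], so it falls below alpha with probability alpha.
    """
    return sum(rng.random() < alpha for _ in range(n_subsets))


rng = random.Random(1)  # fixed seed so the run is repeatable
trials = [count_chance_findings(N_SUBSETS, ALPHA, rng) for _ in range(2000)]
average_chance_findings = sum(trials) / len(trials)

print(f"expected chance findings: {expected_chance_findings:.1f}")
print(f"chance of at least one: {prob_at_least_one:.3f}")
print(f"simulated average: {average_chance_findings:.2f}")
print(f"Bonferroni cutoff: {bonferroni_alpha:.4f}")
```

The simulated average should land close to five, matching the arithmetic of five apparent successes in a hundred subsets when nothing real is going on.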

Originally published: Tuesday, June 11, 2013; most-recently modified: Tuesday, May 21, 2019