With big data comes big responsibility.

One of the battles that Occupy Math fights every week is to convince people of the relevance of math. This week Occupy Math looks at the Pandora’s box of big data analytics. This is not only relevant math, currently in use by Hillary Clinton’s presidential campaign; it is also a recurring threat to civil liberty (no, not because Hillary is using it!). We’re going to look at what big data is, how it can be used, and how it can be abused. Pandora’s box contained many evils but also hope, and big data has substantial hope on offer. This post updates an earlier Occupy Math about the values and dangers of statistics.

Statistics, and its Godzilla form – big data analytics – are a little like the Force. There is a light side, a dark side, and a lot of hype.

An early and easy-to-follow example of the type of thing you can do with big data analytics is the time that Osco Drugs did an analysis of which items sold together and which were not selling well. They cut low-sale items and rearranged their store, putting items purchased together often near one another, with two interesting outcomes:

  • An increase in sales, and
  • An entirely false perception by consumers that they had increased their offerings.

This study led to the urban legend that you can sell more diapers by putting beer next to diapers, with harassed new fathers as the causative mechanism. The important point is that doing computations on what was in their register tapes let Osco make more money with fewer goods. That consumers perceived the opposite of what had actually happened is particularly interesting.
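For readers who want to see what "which items sold together" looks like as a computation, here is a minimal sketch in Python. It simply counts how often each pair of items shows up in the same basket. The baskets and the function name `pair_counts` are made up for illustration; this is not Osco's actual method or data, and a real analysis would also correct for how popular each item is on its own.

```python
from collections import Counter
from itertools import combinations

def pair_counts(baskets):
    """Count how many baskets contain each unordered pair of items."""
    counts = Counter()
    for basket in baskets:
        # Deduplicate and sort so each pair is counted once per basket
        # and always appears in the same order.
        for pair in combinations(sorted(set(basket)), 2):
            counts[pair] += 1
    return counts

# Hypothetical register-tape data: one list of items per purchase.
baskets = [
    ["diapers", "wipes", "beer"],
    ["diapers", "wipes"],
    ["beer", "chips"],
    ["diapers", "wipes", "chips"],
]

counts = pair_counts(baskets)
top_pair, top_count = counts.most_common(1)[0]
print(top_pair, top_count)
```

Pairs with high counts are candidates for shelving near one another; items that appear in almost no baskets are candidates for cutting.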

Big data: not just big but dynamic.

Occupy Math attended a lecture on big data and the presenter made an interesting point. In order to stay consistent, medical tests search for good indicators of a health condition and then, having found them, set them in stone. If better (more accurate, cheaper) indicators are found during the inevitable march of scientific progress, they are often ignored or adopted decades later. The speaker called this the small data approach. Medical test protocols are updated at a glacial rate. Occupy Math sees that consistency might be good – but so is incorporating current information. This brings us to the big data approach.

Big data is usually big in the sense that hundreds, thousands, even billions of data records are involved. This is the least important issue. Big data analytics constantly revisit the data, looking for new patterns and better indicators for patterns already identified as valuable. There are thousands of possible analytical tools – big data uses them all, mixes and matches their results, and uses everything from statistics to evolving artificial intelligence to try to understand the data. The analysis often informs which additional data to gather. This is the big difference:

Big data analytics are dynamic, self-improving, and self-correcting.

So far this sounds like a good thing, but in fact it can be used for any purpose. American law forbids the NSA from spying on American citizens. What does the word “spy” mean, though? The metadata of a call is the information about who called whom, when, from where, and for how long. The NSA (without oversight) chose to interpret spying as listening to what the citizens were saying, classifying the gathering of metadata as “not spying”. Eventually the US Congress removed the NSA’s power to collect and analyze this data, but the whole situation has an Orwellian, big-brother feel to it. It is incredible how much you can deduce from metadata. Your cellphone’s metadata is a 24/7 record of where you were and much of whom you chose to communicate with.

Even really simple big data analytics can work really well.

Since Occupy Math is a professor, he encounters memos and procedures for dealing with student plagiarism all the time. Here’s a simple big-data technique for detecting plagiarism — or locating related documents. Tally how many times each consecutive triple of words occurs in each document you are subjecting to analysis. Throw out the triples that occur at similar rates in most documents. The remaining collections of counts are a digital signature of the document. Using an almost trivial method called cosine similarity, we can measure the degree to which two documents have signatures that point in similar or different directions. This technique has high (though not perfect) accuracy at detecting both common content (plagiarism) and common topics in documents. These techniques are even used for scoring the essay portions of the SATs by comparing to graded essays.
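The recipe above can be sketched in a few lines of Python. This is a toy version under stated assumptions, not a production plagiarism detector: the function names are invented for illustration, words are split on whitespace only, and "occurs at similar rates in most documents" is simplified to "appears in more than a chosen fraction of the documents" (the `max_fraction` parameter is an assumption).

```python
import math
from collections import Counter

def trigram_counts(text):
    """Tally each consecutive triple of words in a document."""
    words = text.lower().split()
    return Counter(zip(words, words[1:], words[2:]))

def drop_common(signatures, max_fraction=0.9):
    """Discard triples found in more than max_fraction of the documents.

    Simplifying assumption: document frequency stands in for
    'occurs at similar rates in most documents'.
    """
    n = len(signatures)
    doc_freq = Counter()
    for sig in signatures:
        doc_freq.update(sig.keys())
    return [
        Counter({t: c for t, c in sig.items() if doc_freq[t] / n <= max_fraction})
        for sig in signatures
    ]

def cosine_similarity(a, b):
    """Cosine of the angle between two count vectors (0.0 if either is empty)."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

# Three made-up documents sharing a boilerplate opening.
docs = [
    "in this essay the quick brown fox jumps over the lazy dog",
    "in this essay the quick brown fox jumps over a sleepy cat",
    "in this essay big data analytics can win elections and save lives",
]

raw = [trigram_counts(d) for d in docs]
sigs = drop_common(raw)

sim_ab = cosine_similarity(sigs[0], sigs[1])      # overlapping content
sim_ac = cosine_similarity(sigs[0], sigs[2])      # unrelated content
raw_sim_ac = cosine_similarity(raw[0], raw[2])    # before filtering
print(round(sim_ab, 3), round(sim_ac, 3), round(raw_sim_ac, 3))
```

In this toy corpus the shared opening "in this essay" gives the unrelated documents a small nonzero similarity until the common triple is thrown out, which is the point of the filtering step. A score near 1 means the two signatures point in nearly the same direction; a score near 0 means they have almost nothing in common.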

Big data can save lives.

With microbial resistance to antibiotics at a crisis level, big data gives us multiple ways forward. Some are obvious — using GPS data to understand where the problem is and thus efficiently target interventions — but dynamic, adaptive big data analytics can also cut the time and cost of drug discovery and can help in locating alternatives based on imprecise data. The document-processing techniques used to detect plagiarism can also sort unstructured physicians’ notes into groups that are on the same subject. How much is it worth to be able to detect an emerging disease from front-line doctors’ notes weeks or months before it becomes obvious there is a problem? These analytics can find the dozens of relevant sets of doctors’ notes among tens of thousands of irrelevant notes in a split second.

Another medical example, one that hit the internet while Occupy Math was writing this blog, concerns drug cross-reaction. Two drugs, certified as safe and effective individually, may be dangerous when taken together. Using big data analytics to cross-index a government archive of drug reactions with a university hospital database of patient records, researchers at Columbia found a previously unknown and deadly drug cross-reaction. In the future this sort of data mining of medical records will become routine and automatic, avoiding many deaths by earlier detection of unsuspected cross-reactions.

Big data can win elections.

Barack Obama’s election was based substantially on innovation, not only in raising funds a few dollars at a time, but also in the use of data. A big data center called The Cave was used to guide both election strategy and policy decisions. Hillary Clinton has continued this practice, keeping many of Obama’s people. An excellent example of an unintuitive policy is not trying to paint her opponent as an example of typical Republican extremism. Big data analytics showed that charging Republicans with their unpopular economic policies and failures to govern was more likely to change voter behavior. Do the math, win the election.

Occupy Math has ranted in the past about people who refuse to do the math. This week’s post shows the opposite: the incredible power of doing the math. Big data analytics are a perfect example of a two-edged sword, a value-neutral tool, an informational construct with power like that of atomic physics in the material world. As a mathematician and computer scientist, Occupy Math wonders if we should implement some analog to the Hippocratic Oath, a core principle of which is:

This before all else: do no harm.

The dangers and opportunities of big data analytics are fairly new, but science fiction was there decades ago. John Brunner’s prescient novel The Shockwave Rider did a good job of showing what might be in store. Even older, Isaac Asimov’s idea of psychohistory embeds the idea that mathematical analysis and sufficient data could be used to predict the future course of human history. While his immensely popular Foundation novels are about failures of and problems with such predictions, these novels are an early example of using big data with at least the flavor of the challenges we now face.

The chilling aspect of big data is that almost everything about you is knowable unless you go to a great deal of trouble to live off the grid. If you know this is the case, then you can defend yourself to some degree by being mindful, another reason that Math is the Right of all Free People. Mostly, the problem will be people trying to sell you things you will be tempted by, at least in the west, but the use of big data must be circumscribed to prevent it becoming a tool of tyranny and oppression. It would be foolish to abandon the hope that understanding big data can bring in order to prevent its abuse — rational, well-reasoned choices are the order of the day. Do you have big data stories on the dark side or the light side? Occupy Math welcomes your comments and tweets.

I hope to see you here again,
Daniel Ashlock,
University of Guelph,
Department of Mathematics and Statistics

