Occupy Math’s employer has launched a Masters Program in Data Science. The pyramid picture at the top of the post shows a major process goal of data science, which is more aspirational than real, as data science often achieves knowledge but seldom gets to wisdom. In graduate school, Occupy Math knew a professor of computer science who was fond of saying “Anything that calls itself a science probably is not one.” Data science is not even close to being a science, though it applies many of the results of science to the problem of understanding and reducing data. The word “reducing” in this definition can mean “concentrating into a useful form”, not “getting rid of”. Occupy Math is simultaneously developing and teaching a course entitled Data Manipulation and Visualization and so there may be several posts that arise from the experience. Another troublesome word: in this case “manipulation” means “transformation into a useful or tractable form”, not “deceptive modification.” Aren’t words fun?
The picture above shows words appearing in documents about data science. It is called a word cloud or a tag cloud. The size of a word, in the picture, is proportional to how often the word showed up in the document(s) used to create the word cloud. It is a good example of data science: it takes many pages of documents and provides a quick, visual summary of what some of the concerns of those documents are and the vocabulary they use. Notice how hard they are trying to be science? If you look at a word cloud on a topic that is new to you, right away you get a list of words whose definitions you should look up, for example.
Occupy Math has been a data scientist for about twenty-five years, since about 1996, which is a little tricky since the term data science was coined in 2008. The job existed well before the name for the job. A data scientist is someone who, given a set (usually a huge pile) of data, tries to figure out what it means. Usually the phrase “figure out what it means” is operationalized as being able to answer factual questions about the domain where the data were gathered.
Examples of data science techniques
A racist photo app made a lot of headlines back in 2015. One of the big successes of data science is deep learning, which is not deep in the usual sense, but is machine learning. The “deep” (words just keep having their meanings bent) means that the technique uses many layers in an artificial neural net. An artificial neural net is a collection of triggering or connecting functions (called neurons) connected to one another that functions as a type of computer. It is programmed by adjusting the connection strengths. In deep learning, there are many collections of neurons in successive layers, allowing the net to be broken into chunks with distinct functions. Deep learning techniques can look at images and extract features from those images that permit them to do a good job of classifying other images. They can learn what are called class labels, supplied by the person using the deep learning system, and then have a remarkable ability to assign reasonable class labels to new images.
The Google app managed to label black folks as gorillas, which sounds pretty racist. The problem was not racism in the app but it might have been neglectful racism over at Google. A deep learning system knows nothing, nada, and nil about the semantics of the data it is working on (unless the class labels are themselves semantic labels). It just learns to bin images accurately relative to its training data. The clever people that built the data set did not include any pictures of black people in the training set and they did include pictures of gorillas. Given that gorillas do look more like human beings than cars, buildings, or sunsets, the identification, in addition to sounding racist and being insulting, was actually pretty close. Google solved the problem by removing the pictures of gorillas from the training set. The big take home lesson is that considerable care is needed in selecting the starting data. Another is that computers do what you tell them to, not what you want them to. It’s also worth noting that leaving different sorts of people out of the initial data mix is closely tied to a risk of insulting those people. Certainly via neglect and, when things go really wrong, by calling them gorillas!
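For readers who like to see things in code, the layered neural net described above can be sketched in a few lines of Python. This is a toy under stated assumptions, not how any real deep learning system is built: the logistic triggering function, the function names, and the connection strengths are all made up for illustration. The sketch only runs a net forward; it does not do the learning step that adjusts the connection strengths.

```python
import math


def neuron(inputs, weights, bias):
    """One neuron: a weighted sum of its inputs pushed through a triggering function.

    The logistic function used here is an assumption; many other choices exist.
    """
    total = sum(x * w for x, w in zip(inputs, weights)) + bias
    return 1.0 / (1.0 + math.exp(-total))


def layer(inputs, weight_rows, biases):
    """A layer is a list of neurons that all see the same inputs."""
    return [neuron(inputs, w, b) for w, b in zip(weight_rows, biases)]


def forward(inputs, layers):
    """Feed the inputs through each layer in turn; 'deep' means many layers."""
    activation = inputs
    for weight_rows, biases in layers:
        activation = layer(activation, weight_rows, biases)
    return activation


# Hypothetical net: 2 inputs -> a layer of 3 neurons -> a layer of 1 neuron.
# The numbers are the connection strengths; training would adjust them.
toy_net = [
    ([[0.5, -1.0], [1.5, 0.25], [-0.75, 0.5]], [0.0, -0.5, 0.1]),
    ([[1.0, -2.0, 0.5]], [0.2]),
]

print(forward([1.0, 0.0], toy_net))
```

Notice that nothing in the code knows what the inputs mean, which is the point made above: the net only turns numbers into other numbers.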
The bag-of-words model for text documents
The simplest version of a vector is an ordered list of numbers, called coordinates. A three-dimensional vector might be (2, 3.5, -1.8), for example. Geometrically a vector points in a direction in n-dimensional space, unless all its coordinates are zero. There is a very simple formula for the angle between the directions that two vectors point in, which leads to an important technique in data science. If you summarize two data items as vectors whose coordinates are extracted from the data item, then the degree to which two vectors point in the same direction is a measure of how similar the objects used to make the vectors are.
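The simple formula mentioned above is the standard one: the cosine of the angle between two vectors is their dot product divided by the product of their lengths. Here is a minimal Python sketch of it. The function names are made up for this post, and the second example vector is chosen only to show two vectors pointing the same way.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors with the same number of coordinates."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same number of coordinates")
    dot = sum(a * b for a, b in zip(u, v))
    length_u = math.sqrt(sum(a * a for a in u))
    length_v = math.sqrt(sum(b * b for b in v))
    if length_u == 0 or length_v == 0:
        raise ValueError("the zero vector does not point in any direction")
    # Clamp to [-1, 1] so floating-point round-off cannot break acos below.
    return max(-1.0, min(1.0, dot / (length_u * length_v)))


def angle_degrees(u, v):
    """Angle between the directions of u and v, in degrees."""
    return math.degrees(math.acos(cosine_similarity(u, v)))


u = (2, 3.5, -1.8)       # the example vector from the text
v = (4, 7, -3.6)         # twice u, so it points in the same direction
print(cosine_similarity(u, v))
print(angle_degrees((1, 0), (0, 1)))
```

A cosine near 1 means the vectors point in nearly the same direction, a cosine near 0 means they are at right angles, and the length of the vectors does not matter.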
Suppose we have several text documents. We create a space of vectors with a coordinate position for each word used in the documents. If one of the documents lacks one of the words, the vector for that document gets a zero in the coordinate for that word. Here comes the data science: two documents whose vectors point in nearly the same direction are usually documents about the same topic. This creates the ability to tell a ‘bot to go search the web for documents on a topic defined by an initial set of documents — which is pretty cool. This vector framework is called the bag-of-words method.
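A minimal Python sketch of the bag-of-words idea follows. It makes some assumptions that go beyond the description above: words are lowercased, a "word" is a run of letters and apostrophes, and each coordinate is a count of how often the word appears. The three tiny documents and all function names are invented for illustration.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Split text into lowercase words (a deliberately crude rule)."""
    return re.findall(r"[a-z']+", text.lower())


def bag_of_words(documents):
    """Return the shared vocabulary and one count vector per document."""
    counts = [Counter(tokenize(d)) for d in documents]
    vocabulary = sorted(set().union(*counts))
    # A Counter gives 0 for a missing word, which is exactly the rule in the text.
    vectors = [[c[word] for word in vocabulary] for c in counts]
    return vocabulary, vectors


def cosine(u, v):
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    lengths = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / lengths if lengths else 0.0


docs = [
    "the cat sat on the mat",
    "the cat lay on the mat",
    "stock prices fell sharply today",
]
vocabulary, vectors = bag_of_words(docs)
print(vocabulary)
print(cosine(vectors[0], vectors[1]))  # two documents about a cat
print(cosine(vectors[0], vectors[2]))  # a cat document versus a stock document
```

The two cat documents share most of their words, so their vectors point in nearly the same direction, while the stock document shares no words with them at all.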
The bag-of-words technique can be used in other ways. If a teacher has a collection of essays on a topic and has given them grades, then we can use the grades as class labels. It turns out that the bag-of-words technique is very good at agreeing with human graders about the grade an essay should get, something that has deeply annoyed people who are employed grading essays, like those on the scholastic aptitude tests. It turns out that the accuracy for grading essays is better if, instead of words, you use groups of three words in the order they appear in sentences. This larger collection of vector coordinates, word triples, captures more information about the quality of the writing.
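To show what using word triples as coordinates looks like, here is a small Python sketch that counts them. It assumes sentences end at a period, question mark, or exclamation point, and that triples never cross a sentence boundary. The function name is made up, and the sketch stops at counting: it does not do any grading.

```python
import re
from collections import Counter


def word_triples(text):
    """Count each run of three consecutive words, sentence by sentence."""
    triples = Counter()
    for sentence in re.split(r"[.!?]+", text.lower()):
        words = re.findall(r"[a-z']+", sentence)
        # zip stops at the shortest list, so short sentences give no triples.
        triples.update(zip(words, words[1:], words[2:]))
    return triples


print(word_triples("The cat sat down. It purred."))
print(word_triples("dog bites man") == word_triples("man bites dog"))
```

With single words, "dog bites man" and "man bites dog" get identical vectors; with triples they do not, which is one way to see why triples carry more information about how something is written.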
Generative adversarial networks
A generative adversarial network is a technique for generating more examples of something (often an image) that you already have examples of. The human operator picks a number of descriptive statistics for the type of images being used (the ability of deep learning to create image classifiers is something that is useful here). The system then operates by competition between two types of software, a set of image generators and a set of image classifiers. The image classifiers say if an image they are presented with is inside of the group of images of interest or outside of the group. The image generators generate images that get the statistics chosen by the human operator correct and which cause the classifiers to say that the image is in the class of images of interest. The two types of software go through multiple rounds of trying to defeat one another, creating better and better fakes of the objects they are trying to imitate. The woman pictured at the top of this section is a deep fake image produced with a generative adversarial network. The person in the picture does not exist.
The image generators are trying to fool the classifiers and the classifiers are learning to identify the images generated as being fake (outside of the class of interest). Both sides of the adversarial network use some form of machine learning; determining which sort of machine learning works well in these roles is an area of active research. The imaginary woman at the top of the section gives an idea of how well the technique can work. The technique has been used in other domains than images. One of Occupy Math’s colleagues uses it to create plausible levels for the Mario Brothers video game, for example. This ability to create very plausible forgeries is potentially remarkably useful and potentially problematic.
The end of privacy, even more so
In many states in the United States it is against the law to ask if a person purchasing car insurance is a woman or to use this information in setting rates even if you know it. The problem with this is that it is almost always possible to tell if a person is a woman from other data that is perfectly reasonable to ask for when selling auto insurance. Because almost all data categories are learnable, and the category man/woman or man/woman/other is actually a very easy category to learn, attempts to prevent the use of potentially discriminatory information are often pointless. This is one of the huge challenges given the existence of data science and machine learning. Your private information, even if not known or revealed, can often be deduced.
There is not much you can do to protect your privacy, short of dropping off the grid. Something you can do is insist on fair and equitable treatment of all people when you can and keep your eyes wide open. The techniques of data science can also be used to spot the abuse of data, so there is room to create countermeasures in a privacy-deprived future. If you are interested in doing a Masters degree in data science in a program where Occupy Math is one of the faculty, drop him a line at email@example.com.
I hope to see you here again,
So remember to get your Covid vaccination!
University of Guelph
Department of Mathematics and Statistics