Big data and data analytics: I do not think it means what you think it means

With credit and apology to William Goldman

With so many players latching on to the idea of big data these days, it is inconceivable that everyone has the same definition in mind. I’ve heard folks describe big data as just being the combination of existing data sets while others don’t consider data to be big until there are hundreds of thousands of observations. I’ve even seen the idea that big data just means the increasing availability of open data. There is a similar challenge with data analytics. On one end of the spectrum, data analytics is just data analysis, but with a cooler name. On the other, data analytics involves big data (really big data) and machine learning. I needed to get a grasp on the various terms and concepts for my work, so I thought I’d share some of what I learned with you. Prepare to learn.

What is data analytics?

At my age, books are called the internet, so that’s where my story begins. I started with data analytics. The main question in my mind was what is the difference between data analytics and data analysis.

Here is one definition that popped up from Techopedia: “Data analytics refers to qualitative and quantitative techniques and processes used to enhance productivity and business gain. Data is [sic] extracted and categorized to identify and analyze behavioral data and patterns, and techniques vary according to organizational requirements.” This definition is representative of many of the top search returns in that it focuses on the analysis of data for organizational or business purposes. Techopedia goes on to explain data analytics as involving collecting, categorizing, storing and analyzing data for the purpose of decision making.

Here’s another definition — one that includes research uses — from TechTarget: “Data analytics is the process of examining data sets in order to draw conclusions about the information they contain, increasingly with the aid of specialized systems and software. Data analytics technologies and techniques are widely used in commercial industries to enable organizations to make more-informed business decisions and by scientists and researchers to verify or disprove scientific models, theories and hypotheses.”

What I take away from these definitions and the others I found is that data analytics concerns the analysis of data that were not generated for the purpose of data analysis. That is, we’re not talking about research survey data; we’re talking about data that exist independently from the questions addressed by the data analytics. In public health, this concept is called real world data, which are simply data from outside of clinical trials.

The examples of data analytics I found use administrative data, such as company records on customers or health records on patients, or external data, such as social media, clickstream (data on internet usage) or mobile phone data. It does not appear that the data need to be “big” for the term data analytics to apply, but it does appear that a computer needs to be involved in the analysis. Data analytics can be exploratory or confirmatory, but consistent with the idea that data analytics uses existing or outside data, the examples of confirmatory data analytics I have seen are based on quasi-experimental methodologies. Data analytics can be applied to qualitative data. As TechTarget explains, “the qualitative approach…focuses on understanding the content of non-numerical data like text, images, audio and video…” Does that explain data analytics, or are you all still critics?

What is big data?
What about big data? Wikipedia says, “big data is a term for data sets that are so large or complex that traditional data processing application software is inadequate to deal with them.” Big is not about absolute size, but rather about what is necessary in order to collect, categorize, store and analyze the data sets. Big data come from sources like the internet of things, mobile devices, social media and satellite imagery.

Many sources talk about big data in terms of volume, velocity and variety. This blog post by Gil Press, which offers 12 definitions of big data (see especially #10), credits someone named Doug Laney with first suggesting that the three Vs are what yield big data. Volume is the amount of data and often reflects that data are captured from everything or everyone within the relevant space, rather than being sampled. A satellite image, for example, covers all points in the visible area. Velocity means that the data are available in real time, but is also used to mean that observations are captured at high frequency, for example, a satellite image taken each minute. Variety means that big data sets often include many kinds of information. Data about a geographic area might include satellite imagery data plus social media data originating from that site. The three Vs combine to produce data sets that are both large and complex. If a typical research survey data set were a grain of sand, big data would be a universe of beaches.

The next question in my mind was what is the intersection between big data and data analytics. I think it is fair to say that people analyzing big data are using data analytics. They are using specialized systems and software to analyze data coming from outside of controlled research settings to explore what can be learned for business or research purposes. But as noted above, not all data analytics uses big data. Would you like to see some examples of big-data analytics? As you wish.

The Journal of Infectious Diseases recently published an entire supplement on recent advances in the use of big data for “strengthening disease surveillance, monitoring medical adverse events, informing transmission models, and tracking patient sentiments and mobility.” (Bansal 2016) (Many of the articles in the supplement are open access!) One example from the supplement is the paper by Marcel Salathé exploring the possibilities of combining digital health records with patient-generated data, such as data from online health forums or internet search strings. These big data could help with the detection of adverse events and could be “mined for information on behavior and sentiments” to help understand vaccine acceptance.

A recent example from economics is the research conducted by Neal Jean and co-authors (2016) [gated] using satellite images and machine learning to predict poverty. The use of nighttime satellite images of lights to measure economic activity or development is fairly well known. The classic picture is the nighttime satellite image of North Korea compared to South Korea. Jean et al. go much further. They apply machine learning to survey data on expenditures and wealth, combined with nightlights satellite data and daytime high-resolution satellite data that can capture features of the landscape such as roads and roofing, to predict poverty in areas where there are no survey data. They conclude that “common determinants of livelihoods…revealed in imagery…can be leveraged to estimate consumption and asset outcomes with reasonable accuracy.”

At the Empirical Studies of Conflict annual meeting I attended recently, I saw several fascinating examples of data analytics on what folks there called “passive high-frequency data”. These papers are not for distribution yet, but include using mobile phone data to understand how private firms react to violent incidents in Afghanistan (Blumenstock et al. 2017), using Twitter data to test whether community engagement activities in the United States led to reductions in pro-ISIS content (Mitts 2017), and using Twitter data to understand how politicians react to local extremist acts in Colombia (Morales 2017). Skip to the end.

My quick review of big data and data analytics (and big-data analytics) has left me excited about the possibilities in store. I agree with many of the authors cited here that there are challenges, even big challenges, that we need to address in the use of big data, not least of which are ethical concerns. But this is not the Pit of Despair. These challenges can be addressed. In fact, passive data mined responsibly should allow us to do research with fewer invasive and time-consuming surveys. In the meantime, rest well and dream of large data.
