With credit and apology to William Goldman
With so many players latching on to the idea of big data these days, it is inconceivable that everyone has the same definition in mind. I’ve heard folks describe big data as just being the combination of existing data sets while others don’t consider data to be big until there are hundreds of thousands of observations. I’ve even seen the idea that big data just means the increasing availability of open data. There is a similar challenge with data analytics. On one end of the spectrum, data analytics is just data analysis, but with a cooler name. On the other, data analytics involves big data (really big data) and machine learning. I needed to get a grasp on the various terms and concepts for my work, so I thought I’d share some of what I learned with you. Prepare to learn.
What is data analytics?
At my age, books are called the internet, so that’s where my story begins. I started with data analytics. The main question in my mind was: What is the difference between data analytics and data analysis?
Here is one definition that popped up from Techopedia: “Data analytics refers to qualitative and quantitative techniques and processes used to enhance productivity and business gain. Data is [sic] extracted and categorized to identify and analyze behavioral data and patterns, and techniques vary according to organizational requirements.” This definition is representative of many of the top search returns in that it focuses on the analysis of data for organizational or business purposes. Techopedia goes on to explain data analytics as involving collecting, categorizing, storing and analyzing data for the purpose of decision making.
Here’s another definition — one that includes research uses — from TechTarget: “Data analytics is the process of examining data sets in order to draw conclusions about the information they contain, increasingly with the aid of specialized systems and software. Data analytics technologies and techniques are widely used in commercial industries to enable organizations to make more-informed business decisions and by scientists and researchers to verify or disprove scientific models, theories and hypotheses.”
The examples of data analytics I found use administrative data, such as company records on customers or health records on patients, or external data, such as social media, clickstream (data on internet usage) or mobile phone data. It does not appear that the data need to be “big” for the term data analytics to apply, but it does appear that a computer needs to be involved in the analysis. Data analytics can be exploratory or confirmatory, but consistent with the idea that data analytics uses existing or outside data, the examples of confirmatory data analytics I have seen are based on quasi-experimental methodologies. Data analytics can be applied to qualitative data. As TechTarget explains, “the qualitative approach…focuses on understanding the content of non-numerical data like text, images, audio and video…” Does that explain data analytics, or are you all still critics?
What is big data?
Many sources talk about big data in terms of volume, velocity and variety. This blog post by Gil Press, which offers 12 definitions of big data (see especially #10), credits someone named Doug Laney with first suggesting that the three Vs are what yield big data. Volume is the amount of data and often reflects that data are captured from everything or everyone within the relevant space, rather than being sampled. A satellite image, for example, covers all points in the visible area. Velocity means that the data are available in real time, but it is also used to mean that observations are captured at high frequency, for example, a satellite image taken each minute. Variety means that big data sets often include many kinds of information. Data about a geographic area might include satellite imagery data plus social media data originating from that site. The three Vs combine to produce data sets that are both large and complex. If a typical research survey data set were a grain of sand, big data would be a universe of beaches.
The next question in my mind was: What is the intersection between big data and data analytics? I think it is fair to say that people analyzing big data are using data analytics. They are using specialized systems and software to analyze data coming from outside of controlled research settings to explore what can be learned for business or research purposes. But as noted above, not all data analytics uses big data. Would you like to see some examples of big-data analytics? As you wish.
The Journal of Infectious Diseases recently published an entire supplement on recent advances in the use of big data for “strengthening disease surveillance, monitoring medical adverse events, informing transmission models, and tracking patient sentiments and mobility.” (Bansal 2016) (Many of the articles in the supplement are open access!) One example from the supplement is the paper by Marcel Salathé exploring the possibilities of combining digital health records with patient-generated data, such as data from online health forums or internet search strings. These big data could help with the detection of adverse events and could be “mined for information on behavior and sentiments” to help understand vaccine acceptance.
A recent example from economics is the research conducted by Neal Jean and co-authors (2016) [gated] using satellite images and machine learning to predict poverty. The use of nighttime satellite images of lights to measure economic activity or development is fairly well known. The classic picture is the nighttime satellite image of North Korea compared to South Korea. Jean et al. go much further. They use machine learning on survey data on expenditures and wealth, combined with high-resolution daytime satellite data that can capture features of the landscape (things like roads and roofing) and nightlights satellite data, to predict poverty in areas where there are no survey data. They conclude that “common determinants of livelihoods…revealed in imagery…can be leveraged to estimate consumption and asset outcomes with reasonable accuracy.”
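To make the mechanics a bit more concrete, here is a minimal sketch of the final step in this kind of exercise: regress a survey-based welfare measure on features extracted from satellite imagery, then apply the fitted model to imagery from places with no survey coverage. Everything below is illustrative rather than the authors’ code; the feature matrices are random stand-ins, and Jean et al. build their actual image features with a neural network trained on nightlights.

```python
# Illustrative sketch only: the "image features" and outcomes below are
# random placeholders standing in for real survey and satellite-derived inputs.
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)

# Hypothetical inputs:
#   X_surveyed   - image-derived features for clusters with survey data
#   y_surveyed   - log per-capita consumption measured by the survey
#   X_unsurveyed - the same features for areas with no survey coverage
X_surveyed = rng.normal(size=(500, 100))
y_surveyed = rng.normal(size=500)
X_unsurveyed = rng.normal(size=(2000, 100))

model = Ridge(alpha=1.0)

# How well do imagery features predict welfare out of sample? (cross-validated R^2)
print("cross-validated R^2:", cross_val_score(model, X_surveyed, y_surveyed, cv=5).mean())

# Fit on all surveyed clusters, then predict consumption where no survey exists
model.fit(X_surveyed, y_surveyed)
predicted_consumption = model.predict(X_unsurveyed)
```

The hard and interesting work, of course, is in building features from raw imagery that actually carry information about livelihoods; the regression at the end is the easy part.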
At the Empirical Studies of Conflict annual meeting I attended recently, I saw several fascinating examples of data analytics on what folks there called “passive high-frequency data”. These papers are not for distribution yet, but include using mobile phone data to understand how private firms react to violent incidents in Afghanistan (Blumenstock et al. 2017), using Twitter data to test whether community engagement activities in the United States led to reductions in pro-ISIS content (Mitts 2017), and using Twitter data to understand how politicians react to local extremist acts in Colombia (Morales 2017). Skip to the end.