What is Big Data?

Credit: DOMO
  • Do you have a fitness watch? Nearly 30 million Americans do, each generating about 2 to 5 gigabytes (GB = 10^9 bytes, or 1,000,000,000 bytes) of mostly numeric data each day.
  • Do you post on Facebook? Billions of people do, producing about 3 billion new postings and 150 million likes, or 100 terabytes (TB = 10^12 bytes, or 1,000,000,000,000 bytes) of non-numeric data, each day.
  • Do you shop at Walmart? Every hour Walmart processes about 1 million customer transactions, which add up to about 40 petabytes (PB = 10^15 bytes, or 1,000,000,000,000,000 bytes) of data per day.
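
To get a feel for how these units relate, the short Python sketch below adds up the fitness-watch figures from the first bullet. It assumes the decimal definitions given above; the function name fitness_watch_total_pb is made up for this illustration.

```python
# Decimal byte units, as defined in the list above.
GB = 10**9
TB = 10**12
PB = 10**15


def fitness_watch_total_pb(users, gb_per_user_per_day):
    """Total data from all watches in one day, expressed in petabytes."""
    return users * gb_per_user_per_day * GB / PB


users = 30_000_000  # "nearly 30 million Americans" from the first bullet
low = fitness_watch_total_pb(users, 2)   # every watch at the low end, 2 GB
high = fitness_watch_total_pb(users, 5)  # every watch at the high end, 5 GB

print(f"Fitness watches: {low:.0f} to {high:.0f} PB per day")
# prints: Fitness watches: 60 to 150 PB per day
```

A few gigabytes per person looks small, but multiplied by 30 million people it reaches tens of petabytes every day.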

These are just a few examples of the huge amount of new data being created, stored, and analyzed every day. The phenomenon is new enough to have its own name: Big Data. For more information, read the poster, which summarizes Big Data creation from multiple sources for each MINUTE of a day.

Of course, any data estimate is dated by the time it is published, because data inputs continue to grow. This graphic is already dated (it is from 2011, when only 2.1 billion people were connected to the Internet), but it can still give you an idea of the immense amount of data being created, stored, and delivered to users. According to Internet World Stats, by December 31, 2017, the number of people connected to the Internet was 4.2 billion, or 55% of the entire world’s population. Interestingly, 49% of all people online live in Asia, and North America accounts for only 8%!

The bar graph below shows the rapid rise in the number of Internet users. You can extrapolate it to estimate that 100% of the world’s population will be connected to the Internet by 2032, as will many cars, refrigerators, toasters, heart monitors, and who knows what else. The idea that many different types of machines will collect data has a name: the Internet of Things (IoT). IoT will be a major source of Big Data, for there are many more things than people.

Internet Users in the World Graph
Credit: Internet Live Stats

All of these things will share data; in fact, that already happens in many places. Consider the example of data sharing within modern airplanes. An aircraft’s onboard computers collect data on wind speed, air temperature, and turbulence from aircraft sensors and from the National Weather Service; location and altitude from its GPS; and locations and paths of other planes in the area from the aircraft’s radar and from air traffic control. The plane’s computers automatically consider all of these data to constantly adjust navigation, speed, and altitude so that the plane arrives safely at its destination on time, with the lowest fuel use. And then the computer automatically lands the plane. The benefit of IoT in this case is that aircraft computers can ingest vastly more data, instantly analyze them, and act on them faster than a human pilot can. The same is happening now with self-driving cars.

Each Internet user involuntarily contributes data about websites visited, terms searched, items purchased, messages, pictures, and emails sent, and videos watched. This often benefits you: the ads that your device displays are for items that you have searched for previously. Your credit card company tracks your location when you buy a Blizzard at Dairy Queen so that if, 30 minutes later, your credit card is presented to make a purchase in Nigeria, the company knows that the purchaser is likely to be an illegitimate user.
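
The credit card example can be sketched as a simple "impossible travel" check: if two purchases are farther apart than anyone could travel in the time between them, the second one is suspicious. The Python sketch below is only an illustration of that idea, not how any real card company works. The function names, the 1,000 km/h speed limit, and the coordinates (roughly Chicago and Lagos, Nigeria) are all assumptions chosen for this example.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean radius of the Earth


def great_circle_km(lat1, lon1, lat2, lon2):
    """Haversine distance in km between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def looks_impossible(distance_km, minutes_apart, max_kmh=1000.0):
    """True if covering the distance would need more than max_kmh.

    max_kmh is an assumed limit, a little faster than a passenger jet.
    """
    if minutes_apart <= 0:
        return distance_km > 0
    return distance_km / (minutes_apart / 60) > max_kmh


# Approximate, illustrative coordinates (assumptions, not from the text).
ice_cream_shop = (41.88, -87.63)  # near Chicago
second_purchase = (6.52, 3.38)    # near Lagos, Nigeria

km = great_circle_km(*ice_cream_shop, *second_purchase)
print(looks_impossible(km, minutes_apart=30))  # prints: True
```

A real fraud system would weigh many more signals than distance and time, but the sketch shows why location data makes this kind of check possible.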

Is or Are?
The word data is plural (datum is singular), so, strictly speaking, one should say, “Data are critical for understanding climate.” But many people use data as if it were singular: “Data is critical for understanding climate.” The phrase Big Data is considered to be singular because it is the name of a concept. So in this document, we use data as a plural noun and Big Data as a singular concept.

Cautions in Using Big Data

It is easy to become mesmerized by the potential of Big Data. Surely with larger data sets than ever before, important new biomedical correlations can be discovered, for example, between the likelihood of catching a specific disease and every measurement ever collected about patients. The National Science Foundation has just published a report on an NSF-NIH workshop on assessing and improving the reliability of such correlations. Many statistical problems exist: How do you integrate diverse data sets with different amounts of data, different means and standard deviations, covering different periods of time, and each with different biases and different amounts of missing data? In addition, the massive data collection may not even include the data needed to answer the question of interest. The workshop speakers warned that analysis of Big Data may result in misleading correlations and false discoveries.
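
One way to see how misleading correlations arise is to compare an outcome against many measurements that are pure random noise. The Python sketch below does this under made-up assumptions: 50 imaginary patients, an outcome that is random, and measurements that are also random, so any correlation it finds is a false discovery. The function names and sample sizes are chosen only for illustration.

```python
import math
import random


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (spread_x * spread_y)


def strongest_chance_correlation(n_patients, n_measurements, seed=0):
    """Largest |correlation| between a random outcome and random noise."""
    rng = random.Random(seed)
    outcome = [rng.gauss(0, 1) for _ in range(n_patients)]
    strongest = 0.0
    for _ in range(n_measurements):
        noise = [rng.gauss(0, 1) for _ in range(n_patients)]
        strongest = max(strongest, abs(pearson(outcome, noise)))
    return strongest


few = strongest_chance_correlation(n_patients=50, n_measurements=10)
many = strongest_chance_correlation(n_patients=50, n_measurements=2000)

print(f"strongest correlation among 10 noise variables:   {few:.2f}")
print(f"strongest correlation among 2000 noise variables: {many:.2f}")
```

None of these measurements has anything to do with the outcome, yet the more of them you test, the stronger the best-looking correlation becomes. That is one reason a large data set can produce impressive but meaningless patterns.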

Big Data is not science; it is analysis. It does not seek specifically to understand why things occur, but detects patterns that some human must analyze in terms of a theoretical understanding. A Big Data expert at the Massachusetts Institute of Technology further stated that Big Data gives probabilities, not answers. And even if probabilities are high (Hillary Clinton was given a 70 to 90% probability of being elected president in 2016), the opposite may happen.
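
A quick simulation shows what "probabilities, not answers" means in practice. The Python sketch below repeats an imaginary contest many times, with the favorite winning 70% or 90% of the time, and counts how often the favorite loses. It is a generic coin-flip style simulation under assumed probabilities, not a model of any real election; the function name share_of_upsets is invented for this example.

```python
import random


def share_of_upsets(p_favorite, n_trials=100_000, seed=1):
    """Fraction of simulated contests in which the favorite loses."""
    rng = random.Random(seed)
    # rng.random() is uniform on [0, 1), so it is >= p_favorite
    # with probability 1 - p_favorite.
    upsets = sum(1 for _ in range(n_trials) if rng.random() >= p_favorite)
    return upsets / n_trials


for p in (0.70, 0.90):
    print(f"favorite at {p:.0%}: loses in about {share_of_upsets(p):.1%} of trials")
```

The results should come out close to 30% and 10%. An outcome with a 10% chance is unlikely, but over many forecasts it is expected to happen about one time in ten.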

The NSF report emphasized that the use of Big Data requires scientists, not just statisticians, to understand statistics better than many now do. It suggested that probability and statistical concepts be introduced and progressively developed through middle and high school so that students gradually develop intuition about their use. That is the goal of Pandem-Data, the material you are reading now. Congratulations on being at the forefront of Big Data education!