Data scientists use a mantra of words beginning with the letter V to characterize Big Data: Volume, Variety, Velocity, Veracity, Visualization, Variability, and Value. Here is what each of these V-words means and why it is important. Note that some data scientists believe that only the first three Vs – volume, variety and velocity – are defining characteristics of Big Data, and that the other Vs are important for data sets of any size.
As indicated above, vast amounts (or volumes) of data are continuously generated every second of every day. In the past, it was harder to acquire data, and studies were often based on limited samples. For example, quality control typically relied on sampling, with every tenth or hundredth or millionth product on an assembly line being tested to see if it was properly made. But now, for some products, automated scanning can test every single item for failures so that immediate corrective action can be taken.
The existence of vast volumes of data encourages explorations of relationships between different variables (such as SAT scores and air pollution) that might not have been possible before. But vast amounts of data also overwhelm traditional analysis tools. How many rows or columns can an Excel file hold? Does your statistics software work with billions of data points instead of thousands? In fact, many new analysis tools have had to be created just to cope with the volumes involved in Big Data projects.
Big Data becomes richer and more likely to reveal important insights if it combines different types or varieties of data. If you have a truly tremendous amount of data about one thing – for example, the street address of every person in the USA – that alone would not yield many insights. But if you combine the addresses with the price of each home, then you could plot home price by location and infer concentrations of wealth or poverty. A government agency could use such maps to identify areas that need help, or corporations could use the maps to select areas to which they mail advertisements about the latest model sports car. If the Big Data file also included family size, then some mailings might be about SUVs rather than two-seater sports cars.
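The idea of combining addresses with home prices can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the records, the area labels and the function name median_price_by_area are all invented here, and a real project would work with far larger files and proper mapping tools.

```python
from collections import defaultdict
from statistics import median

# Hypothetical address records (invented for illustration).
addresses = [
    {"address": "1 Oak St", "area": "north"},
    {"address": "2 Oak St", "area": "north"},
    {"address": "5 Elm Ave", "area": "north"},   # no price on file
    {"address": "9 Hill Rd", "area": "south"},
    {"address": "11 Hill Rd", "area": "south"},
]

# Hypothetical second data set: home price keyed by street address.
home_prices = {
    "1 Oak St": 120_000,
    "2 Oak St": 140_000,
    "9 Hill Rd": 610_000,
    "11 Hill Rd": 590_000,
}

def median_price_by_area(addresses, home_prices):
    """Join the two data sets on address, then summarize price per area."""
    prices = defaultdict(list)
    for record in addresses:
        price = home_prices.get(record["address"])
        if price is not None:          # skip homes with no known price
            prices[record["area"]].append(price)
    return {area: median(values) for area, values in prices.items()}

summary = median_price_by_area(addresses, home_prices)
for area, price in sorted(summary.items()):
    print(area, price)
```

Neither data set says much alone; the per-area summary only appears once the two are joined on a shared key, here the street address.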
Traditionally, data tended to be numeric, or if alphabetic, such as A, B, C grades, they were transformed to numbers, 1, 2, 3. Excel spreadsheets mostly work with numeric data. But in the last few years huge volumes of unstructured data have been harvested from social media sites such as Facebook, Twitter, Instagram and YouTube. These data may include words, graphs, images and videos. Unstructured data can’t be analyzed until they are interpreted to reveal intent. For example, a Facebook message that mentions the Pittsburgh Pirates and includes the words hooray, finally or great can be interpreted to mean that the poster is a Pirates fan who is happy that the team finally won a game. This information could lead to ads for Pirates’ paraphernalia appearing on the fan’s devices.
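A very rough sketch of the Pirates example might look like the following Python. The function name looks_like_happy_fan and the simple word-matching rule are assumptions made for illustration; real social media analysis relies on much more sophisticated language models.

```python
import re

# Words from the example above that suggest a happy poster.
CHEER_WORDS = {"hooray", "finally", "great"}

def looks_like_happy_fan(message, team="pirates", cheer_words=CHEER_WORDS):
    """Return True if the message names the team and uses a cheer word.

    This is a crude keyword rule, not real language understanding.
    """
    words = set(re.findall(r"[a-z']+", message.lower()))
    return team in words and bool(words & cheer_words)

print(looks_like_happy_fan("Hooray, the Pirates finally won!"))
print(looks_like_happy_fan("The Pirates lost again."))
```

A rule this simple is easy to fool: it has no way to tell a sincere "great" from a sarcastic one.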
Another way in which variety becomes a factor concerns the resolution of the data – that is, the smallest feature that is represented. For example, if you wanted to know how much corn is being grown in Iowa, you could use high-altitude spacecraft images that show nothing smaller than an acre, but which would allow discrimination of farmland from swampy land or towns. Spy-satellite images could show each plant; the spy data has a high enough resolution to determine if a particular field is growing corn, rather than sugar beets or other crops. The spy data gives a much more accurate count of corn, but it is also a much larger data file (a larger volume). Often a data scientist doesn’t have a uniform data type for the entire area being studied. The problem then is in determining how to combine the high-resolution data where it exists with the lower-resolution images, which may be the only data available for many areas.
Big Data is often generated and transmitted nearly instantaneously. For example, a heart monitor can measure and transmit heart rhythms and pressure continuously, many times a second. There are 86,400 seconds in a day, so if rhythm and pressure are both sent just once a second, along with a timestamp to record when the measurements were taken (perhaps to correlate with another activity such as sleeping or exercising), then there would be 3 x 86,400 = 259,200 pieces of data each day, each delivered nearly instantaneously. That would fill a pretty big Excel spreadsheet, with 3 columns and 86,400 rows of data! Having data arrive every second requires it to be rapidly analyzed if it is to be of value in triggering a warning of a heart attack. So velocity also affects the rate at which data processing, analysis and display must be done. Big Data often means real-time data, which requires software that can process and analyze it in real time, too. Some businesses have gone from forecasting to nowcasting.
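To make the real-time requirement concrete, here is a minimal Python sketch that examines each reading as it arrives instead of waiting for the whole day's data. It assumes, purely for illustration, that the monitor reports a heart rate in beats per minute; the function name rolling_alerts, the window of 5 readings and the limit of 150 are arbitrary placeholders, not clinical values.

```python
from collections import deque

def rolling_alerts(readings, window=5, limit=150.0):
    """Yield (timestamp, mean) whenever the average of the most recent
    `window` readings is above `limit`.

    `readings` is any iterable of (timestamp, beats_per_minute) pairs,
    so it can be a live stream as easily as a stored list.
    """
    recent = deque(maxlen=window)      # only the newest readings are kept
    for timestamp, bpm in readings:
        recent.append(bpm)
        if len(recent) == window:
            mean = sum(recent) / window
            if mean > limit:
                yield timestamp, mean

# Hypothetical stream: five calm seconds, then five seconds at 200 bpm.
stream = [(t, 70) for t in range(5)] + [(t, 200) for t in range(5, 10)]
alerts = list(rolling_alerts(stream))
print(alerts)
```

Because only the last few readings are held in memory, the work done per reading stays the same no matter how long the monitor runs, which is what allows the analysis to keep pace with the data.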
Veracity, like the words verify and verity, comes from the Latin word meaning truth. To have faith in data you must have knowledge of its veracity. Noisy data must be cleansed. Noise can be an erroneous measurement or misplaced decimal point (a person’s height of 56 feet instead of 5.6 feet) or simply something that is not relevant to the analysis at hand (student height compared to SAT scores). An oft-repeated statistic that one in three business leaders don’t trust the information they use to make decisions indicates that data veracity can be a serious problem.
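A first pass at cleansing the height example could be sketched as below in Python. The function name split_plausible and the bounds of 1 and 9 feet are assumptions chosen for illustration; a real project would set such limits from knowledge of the data.

```python
def split_plausible(heights_ft, low=1.0, high=9.0):
    """Separate heights (in feet) into plausible values and suspect ones.

    Suspect values are kept, not silently dropped, so that a person
    can check whether they are typos such as a misplaced decimal point.
    """
    clean, flagged = [], []
    for height in heights_ft:
        if low <= height <= high:
            clean.append(height)
        else:
            flagged.append(height)
    return clean, flagged

clean, flagged = split_plausible([5.6, 56.0, 6.1, 0.0])
print("clean:", clean)
print("flagged:", flagged)
```

A range check like this catches only the first kind of noise described above, the erroneous measurement. Deciding whether a variable is relevant to the analysis at all still requires human judgment.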
Visualization in Big Data is the use of charts, graphs and images to present large amounts of complex data in diagrams that allow for easier understanding of results by the user. If the product of Big Data analysis is just spreadsheets or numbers or lists of formulas, the user would have a difficult time synthesizing what the analysis showed.
At first glance, you might think that variability and variety are the same. Remember, variety is the different formats that data can be in, such as graphs, images, videos, photos, social media, messaging, recordings, or sensor data. Variability is when the data’s meanings change. If the same data means something different from one day to the next, it presents a challenge for data compilation and analysis. An example is the word “Great”, which can be used in a comment evaluating a service to mean that it is really good, or sarcastically to mean that it was less than expected, as in “Oh, great”. Variability has to be addressed by the algorithms programmed into the Big Data software in order for the analysis to be meaningful.
There is a final V that is a desired outcome of using Big Data: Value. Ideally, Big Data increases value by leading to new insights and understanding of the process being investigated. Value is realized when insights lead to action. For a business, the value could be improved decision-making that enhances profit and performance: can a company increase efficiency or sales or customer satisfaction by tweaking its processes? For a doctor, the value in using Big Data – perhaps genome mapping of an individual patient – could be to provide exactly the right amount of the most effective antibiotic, which could lower medication costs, reduce side effects, and enhance recovery, all very valuable results for the patient.
Big Data is different from traditional science. Science focuses on discovering causation: why things are the way they are. Analyzing Big Data is not science; it is more like prospecting, hoping to find something by searching. The search is not random, however. Through prudent selection of data, it strives to discover meaningful correlations.
Big Data characteristics can be summarized by single words:
- Volume = Size
- Variety = Complexity
- Velocity = Speed
- Veracity = Quality
- Visualization = Increasing understanding
- Variability = Changing value
- Value = Usefulness
Data = Information?
No. Data are facts – for example, July 4 is simply a piece of data, but when we recognize that it is a national holiday, then we have information. When data are interpreted, they can become meaningful, and then they become information. Information is useful, but the ultimate step is to gain insight. For example, based on lots of data concerning the locations of people who get a disease, it may become clear that they live along a river. That is information. The question that leads to insight is: Does living along a river cause this disease? Adding additional data about pollution in the river may suggest an insightful and testable hypothesis that pollution from an upstream meat-processing plant carries pathogens that cause the disease.