BIG DATA, Data processing techniques, DATA Analytics, Data Storage Strategies
You hear and read these terms quite often these days. You are probably in awe of these words but don't quite know what they actually mean. Still, I am sure you find them intriguing: "What do they mean by Big Data?", "What do Big Data technologies do?", "Why are people so worried about data management and analysis?", "Why are people suddenly so concerned with data, which has been around for ages?" and so on.
The answers to such questions are simpler than you think. Today, let me give you the lowdown on all these seemingly big and heavy concepts. It's quite interesting, I promise!
What is Big Data?
In layman's terms, Big Data is a phrase used to describe a massive volume of both structured and unstructured data. It is so voluminous that it becomes difficult to process using traditional database and software techniques. This data has the potential to be mined for valuable information.
When we talk about Big Data, two terms we need to understand alongside it are structured and unstructured data.
Structured Data: Structured data is data that has been organized into a formatted repository, typically a database, so that its elements can be made addressable for more effective processing and analysis.
Unstructured Data: Unstructured data is a generic label for data that is not contained in a database or some other type of data structure. Unstructured data can be textual or non-textual. Textual unstructured data is generated in media like email messages, PowerPoint presentations, Word documents, collaboration software, and instant messages. Non-textual unstructured data is generated in media like JPEG images, MP3 audio files, and Flash video files.
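To make the distinction concrete, here is a small Python sketch (the customer records and feedback text below are invented for illustration) contrasting how the two kinds of data are accessed:

```python
import csv
import io

# Structured data: every element is addressable by a named field.
structured = io.StringIO(
    "customer_id,rating,date\n"
    "101,4,2023-05-01\n"
    "102,2,2023-05-02\n"
)
rows = list(csv.DictReader(structured))
print(rows[0]["rating"])  # a specific element is directly addressable

# Unstructured data: free text with no fixed schema. Extracting the same
# fact (the rating) requires interpretation -- here, a crude keyword check.
unstructured = "Customer 101 emailed on May 1st saying the product was good, maybe 4 stars."
has_rating = "stars" in unstructured
print(has_rating)
```

With the structured rows, any field is one lookup away; with the free text, even a simple fact like the rating must be inferred rather than read.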
Big Data has the potential to help companies improve operations and make faster, more intelligent decisions. This data, when captured, formatted, manipulated, stored, and analyzed can help a company to gain useful insight to increase revenues, get or retain customers, and improve operations.
An example of Big Data might be petabytes (1,024 terabytes) or exabytes (1,024 petabytes) of data consisting of billions to trillions of records of millions of people—all from different sources (e.g. Web, sales, customer contact center, social media, mobile data and so on). The data is typically loosely structured data that is often incomplete and inaccessible.
Big Data: Volume or Technology?
While the term may seem to reference the volume of data, that isn't always the case. The term Big Data, especially when used by vendors, may refer to the technology (including tools and processes) that an organization requires to handle large amounts of data and the associated storage facilities. The term is believed to have originated with Web search companies that needed to query very large distributed aggregations of loosely structured data.
Characteristics of Big Data:
Big Data is often characterized by the 4 Vs: the extreme "Volume" of data, the wide "Variety" of data types, the "Velocity" at which the data must be processed, and the intrinsic "Value" of the data that must be discovered.
Volume: While volume indicates more data, it is the granular nature of the data that is unique. Big Data requires processing high volumes of low-density, unstructured data, that is, data of unknown value, such as Twitter feeds, clickstreams on a web page or mobile app, network traffic, readings from sensor-enabled equipment, and much more. The job of Big Data systems (such as Hadoop) is to convert this raw data into valuable information. For some organizations, this might be tens of terabytes; for others, hundreds of petabytes.
Variety: Data may also exist in a wide variety of file types. Unstructured and semi-structured data types, such as text, audio, and video, require additional processing both to derive meaning and to build the supporting metadata. Once understood, unstructured data has many of the same requirements as structured data, such as summarization, lineage, auditability, and privacy. Further complexity arises when data from a known source changes without notice. Frequent or real-time schema changes are an enormous burden for both transactional and analytical environments.
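As a rough sketch of this burden (the file names and handler descriptions below are hypothetical), each incoming data type must first be routed to its own processing step before the common requirements can apply:

```python
def describe(filename: str) -> str:
    """Route a file to a type-specific processing step based on its extension."""
    handlers = {
        ".csv": "parse rows against a known schema",
        ".txt": "run text extraction and language processing",
        ".jpg": "extract EXIF metadata and run image analysis",
        ".mp3": "read audio tags and transcribe if needed",
    }
    ext = filename[filename.rfind("."):].lower()
    # Unknown types need a manual look -- new sources can appear without notice.
    return handlers.get(ext, "unknown type: inspect manually")

print(describe("sales_2023.csv"))
print(describe("support_call.mp3"))
```

Every new data type added to the mix means another handler to write and maintain, which is exactly why variety drives up processing cost.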
Velocity: Velocity is the fast rate at which data is received and, perhaps, acted upon. The highest-velocity data normally streams directly into memory rather than being written to disk. Some Internet of Things (IoT) applications have health and safety ramifications that require real-time evaluation and action. Other internet-enabled smart products operate in real time or near real time. For example, consumer e-commerce applications seek to combine mobile device location and personal preferences to make time-sensitive marketing offers. Operationally, mobile applications have large user populations, increased network traffic, and the expectation of an immediate response.
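A toy Python sketch of this idea (the sensor names and alert threshold are invented): each event is evaluated the moment it arrives in memory, keeping only a small bounded window rather than the full history on disk:

```python
from collections import deque

ALERT_THRESHOLD = 90.0  # hypothetical safety limit

def process_stream(readings):
    """Evaluate each reading as it arrives, keeping a bounded in-memory window."""
    window = deque(maxlen=5)  # bounded memory, not the full history
    alerts = []
    for sensor, value in readings:
        window.append(value)
        if value > ALERT_THRESHOLD:  # real-time evaluation and action
            alerts.append((sensor, value))
    return alerts

stream = [("temp-1", 72.0), ("temp-2", 95.5), ("temp-1", 88.1), ("temp-3", 91.2)]
print(process_stream(stream))  # → [('temp-2', 95.5), ('temp-3', 91.2)]
```

The point is architectural: a batch job that wrote everything to disk first could reach the same answer, but not in time to act on it.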
Value: A range of quantitative and investigative techniques can derive value from data: discovering a consumer preference or sentiment, making a relevant offer by location, or identifying a piece of equipment that is about to fail. The technological breakthrough is that the cost of data storage and compute has decreased exponentially, providing such an abundance of data that statistical analysis can be run on the entire data set rather than, as previously, on only a sample. This makes much more accurate and precise decisions possible. However, finding value also requires new discovery processes involving clever and insightful analysts, business users, and executives. The real Big Data challenge is a human one: learning to ask the right questions, recognize patterns, make informed assumptions, and predict behavior.
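The shift from sample-based to whole-data-set analysis can be illustrated with a small Python sketch using synthetic numbers (the population and sample sizes here are arbitrary):

```python
import random
import statistics

random.seed(42)

# "All" the data: cheap storage and compute make keeping it all feasible.
population = [random.gauss(100, 15) for _ in range(100_000)]
# The old approach: estimate from a small sample.
sample = random.sample(population, 100)

full_mean = statistics.mean(population)
sample_mean = statistics.mean(sample)
print(round(full_mean, 2), round(sample_mean, 2))
# The full-data mean is exact for this data set; the sample mean only
# approximates it, with an error that shrinks as the sample grows.
```

The statistics themselves are elementary; what changed is that computing them over the entire data set is now affordable.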
In my next post, I'll take you deeper into the world of Big Data and what the future might look like with or without it. Stay tuned!