In my last post, we discussed analyzing the business problem and the basic steps for designing a Big Data architecture. Today, I am going to talk about the different layers in the Big Data architecture and their functionalities.
Logical Layers of Big Data Reference Architecture
The core idea behind big data architecture is to document the right foundation of architecture, infrastructure and applications. This, in turn, allows businesses to use big data more effectively on an everyday basis.
It is created by big data designers/architects before a solution is physically implemented. Creating a big data architecture generally requires understanding the business/organization and its big data needs. Typically, big data architectures outline the hardware and software components that are necessary to fulfil a big data solution. Big data architecture documents may also describe protocols for data sharing, application integration and information security.
It also entails interconnecting and organizing existing resources to serve big data needs.
The logical layers of the reference architecture are as follows:
- Data Source Identification: Knowing where the data is sourced from.
Source profiling is one of the most important steps in deciding the big data architecture. It involves identifying the different source systems and categorizing them based on their nature and type.
Points to be considered while profiling data sources (see the sketch after this list):
- Identify internal and external source systems.
- Make a high-level estimate of the amount of data ingested from each source.
- Identify the mechanism used to get the data – push or pull.
- Determine the type of data source – database, file, web service, stream, etc.
- Determine the type of data – structured, semi-structured or unstructured.
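To make this concrete, here is a minimal source-inventory sketch in Python; the source names, volumes and attributes are illustrative assumptions, not a prescription:

```python
# A minimal source inventory sketch; every source, volume and attribute
# below is a made-up example used only to show the shape of the profile.
sources = [
    {"name": "orders_db",    "origin": "internal", "kind": "database",
     "data": "structured",      "mode": "pull", "daily_volume_gb": 50},
    {"name": "clickstream",  "origin": "internal", "kind": "stream",
     "data": "semi-structured", "mode": "push", "daily_volume_gb": 200},
    {"name": "partner_feed", "origin": "external", "kind": "file",
     "data": "unstructured",    "mode": "pull", "daily_volume_gb": 5},
]

# Segregating sources early makes the later ingestion decision
# (batch vs. real-time) straightforward.
batch_sources  = [s["name"] for s in sources if s["kind"] != "stream"]
stream_sources = [s["name"] for s in sources if s["kind"] == "stream"]
print(batch_sources, stream_sources)
```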
- Data Ingestion Strategy and Acquisition: Process to input data into the system.
Data ingestion is all about extracting data from the above-mentioned sources. The data is first stored, and then transformed for further processing.
Points to be considered (see the sketch after this list):
- Determine the frequency at which data would be ingested from each source
- Is there a need to change data semantics?
- Is there any data validation or transformation required before ingestion (Pre-processing)?
- Segregate the data sources based on the mode of ingestion – batch or real-time
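To illustrate the batch vs. real-time split, here is a minimal PySpark sketch; the landing path, Kafka broker address and topic name are hypothetical placeholders:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("ingestion-sketch").getOrCreate()

# Batch ingestion: pull a daily file extract from a landing zone
# (the path is a made-up example).
orders = spark.read.option("header", True).csv("hdfs:///landing/orders/2024-01-15/")

# Real-time ingestion: subscribe to a Kafka topic that the source system
# pushes events to (broker and topic are assumptions for illustration).
clicks = (spark.readStream
          .format("kafka")
          .option("kafka.bootstrap.servers", "broker:9092")
          .option("subscribe", "clickstream")
          .load())
```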
- Data Storage: The facility where big data will actually be stored.
One should be able to store large amounts of data of any type and should be able to scale on demand. We should also consider the number of IOPS (input/output operations per second) that the storage can provide. The Hadoop Distributed File System (HDFS) is the most commonly used storage framework in the Big Data world; others are the NoSQL data stores – MongoDB, HBase, Cassandra, etc.
Things to consider while planning the storage methodology (see the sketch after this list):
- Type of data (Historical or Incremental)
- Format of data (structured, semi-structured or unstructured)
- Compression requirements
- Frequency of incoming data
- Query pattern on the data
- Consumers of the data
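As one example of these choices in practice, here is a minimal PySpark sketch that lands data in HDFS as compressed, partitioned Parquet; it assumes the hypothetical `orders` DataFrame and paths from the ingestion sketch above:

```python
from pyspark.sql import functions as F

# Stamp each record with its ingest date so the table can be partitioned by it.
orders_dated = orders.withColumn("ingest_date", F.current_date())

# Columnar format + snappy compression + date partitioning keeps storage
# compact and makes date-bounded query patterns cheap.
(orders_dated.write
 .mode("append")
 .partitionBy("ingest_date")
 .option("compression", "snappy")
 .parquet("hdfs:///warehouse/orders/"))
```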
- Data Processing: Tools that provide analysis over big data.
Not only has the amount of data being stored increased manifold, but so has the processing done on it.
Earlier, frequently accessed data was stored in dynamic RAM. Now, due to sheer volume, it is stored on multiple disks across a number of machines connected via a network. Therefore, instead of moving chunks of data to the processing modules, the processing modules are taken to the data, significantly reducing network I/O. The processing methodology is driven by business requirements, and can be categorized as batch, real-time or hybrid based on the SLA (see the sketch after this list).
- Batch processing – Batch processing collects input for a specified interval of time and runs transformations on it in a scheduled way. A historical data load is a typical batch operation.
- Real-time processing – Real-time processing involves running transformations as and when data is acquired.
- Hybrid processing – A combination of both batch and real-time processing needs.
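As a rough sketch of the first two modes, here is the same kind of aggregation expressed as a batch job and as a streaming job in PySpark; the `orders` and `clicks` DataFrames and their columns are the hypothetical ones from the earlier sketches:

```python
from pyspark.sql import functions as F

# Batch: aggregate a bounded, already-landed data set on a schedule.
daily_revenue = (orders
                 .groupBy("order_date")
                 .agg(F.sum("amount").alias("revenue")))
daily_revenue.write.mode("overwrite").parquet("hdfs:///marts/daily_revenue/")

# Real-time: run the transformation continuously as events arrive;
# each micro-batch of updated counts is emitted as it completes.
click_counts = (clicks
                .selectExpr("CAST(value AS STRING) AS page")
                .groupBy("page")
                .count())
query = (click_counts.writeStream
         .outputMode("complete")
         .format("console")
         .start())
```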
- Data consumption/utilization: Users/services utilizing the analyzed data.
This layer consumes the output provided by the processing layer. Different users like administrators, business users, vendors, partners, etc. can consume the data in different formats. The output of the analysis can be consumed by a recommendation engine, or business processes can be triggered based on the analysis.
Different forms of data consumption are:
- Export data sets – There can be requirements for third-party data set generation. Data sets can be generated using a Hive export or directly from HDFS.
- Reporting and visualization – Different reporting and visualization tools can connect to Hadoop using JDBC/ODBC connectivity to Hive.
- Data exploration – Data scientists can build models and perform deep exploration in a sandbox environment. The sandbox can be a separate cluster (the recommended approach) or a separate schema within the same cluster that contains a subset of the actual data.
- Ad-hoc querying – Ad-hoc or interactive querying can be supported by using Hive, Impala or Spark SQL.
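As a small example of the ad-hoc querying path, here is a Spark SQL sketch; the table, columns and path are the hypothetical ones used throughout:

```python
# Register the stored data as a queryable view, then ask ad-hoc questions in SQL.
orders = spark.read.parquet("hdfs:///warehouse/orders/")
orders.createOrReplaceTempView("orders")

spark.sql("""
    SELECT order_date, COUNT(*) AS order_count, SUM(amount) AS revenue
    FROM orders
    GROUP BY order_date
    ORDER BY order_date
""").show()
```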
Functional Layers of the Big Data Architecture:
There is one more way of defining the architecture, i.e. through its functional divisions. However, the functional categories can be grouped into the logical layers of the reference architecture, so the preferred architecture is the one described using logical layers.
The layering based on functionality is as follows:
- Data Sources:
All the sources from which an organization receives data, and which could help the organization in making its future decisions, should be analyzed and listed in this category. The data sources are listed here irrespective of whether the data is structured, unstructured or semi-structured.
- Data Extraction:
Before you can store, analyze or visualize your data, you’ve got to have some. Data extraction is all about taking something that is unstructured, like a webpage, and turning it into a structured table. Once you’ve got it structured, you can manipulate it in all sorts of ways, using the tools described below, to find insights.
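Here is a minimal Python sketch of that idea, turning a page into a structured table with requests, BeautifulSoup and pandas; the URL and CSS selectors are hypothetical and would need to match the real page:

```python
import requests
import pandas as pd
from bs4 import BeautifulSoup

# Fetch and parse a (hypothetical) product listing page.
html = requests.get("https://example.com/products", timeout=10).text
soup = BeautifulSoup(html, "html.parser")

# Pull each unstructured listing into a structured row; the selectors
# below are assumptions about the page's markup.
rows = []
for item in soup.select("div.product"):
    rows.append({
        "name": item.select_one("h2").get_text(strip=True),
        "price": item.select_one("span.price").get_text(strip=True),
    })

table = pd.DataFrame(rows)  # now a structured table you can filter, join, etc.
print(table.head())
```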
- Data Storage:
The basic necessity while working with big data is to think about how to store that data. Part of how Big Data earned the distinction "BIG" is that it became too much for traditional systems to handle. A good data storage provider should offer you an infrastructure on which to run all your other analytics tools, as well as a place to store and query your data.
- Data Cleaning:
A necessary preliminary step before we actually start to mine the data for insights. It is always good practice to create a clean, well-structured data set. Data sets can come in all shapes and sizes, especially when coming from the web. Choose a tool as per your data requirements.
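A typical first cleaning pass in pandas might look like the sketch below; the file name and columns are illustrative assumptions:

```python
import pandas as pd

df = pd.read_csv("raw_customers.csv")  # hypothetical raw export

df = df.drop_duplicates()                                               # remove exact duplicates
df["signup_date"] = pd.to_datetime(df["signup_date"], errors="coerce")  # bad dates -> NaT
df["email"] = df["email"].str.strip().str.lower()                       # normalize text fields
df = df.dropna(subset=["email"])                                        # drop rows missing the key field

df.to_csv("clean_customers.csv", index=False)
```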
- Data Mining:
Data mining is the process of discovering insights within a database. The aim of data mining is to make decisions and predictions based on the data you have at hand. Choose software that gives you the best predictions for all types of data and lets you create your own algorithms for mining the data.
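As a toy illustration of the predictive side of data mining, here is a scikit-learn sketch trained on synthetic data, so it runs without any real data set; the model choice is an arbitrary example:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for a real, cleaned data set.
X, y = make_classification(n_samples=1000, n_features=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# Fit a model and check how well its predictions generalize.
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)
print("accuracy:", accuracy_score(y_test, model.predict(X_test)))
```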
- Data Analytics:
While data mining is all about sifting through your data in search of previously unrecognized patterns, data analysis is about breaking that data down and assessing the impact of those patterns over time. Analytics is about asking specific questions and finding the answers in the data. You can even ask questions about what will happen in the future!
- Data Visualization:
Visualizations are a bright and easy way to convey complex data insights, and the best part is that most of them require no coding. Part of the challenge for any data scientist is conveying the insights from the data to the rest of the company; data visualization tools make your data come to life, helping you create charts, maps and other such graphics out of your data insights.
- Data Integration:
Data integration platforms are the glue between programs. They connect the outputs of your different tools with other software; for example, you could share the results of your visualization tools directly on Facebook through such platforms.
- Data Languages:
There will be times in your data career when a tool simply won't cut it. While today's tools are becoming more powerful and easier to use, sometimes it is just better to code it yourself. Different languages help you with different aspects, like statistical computing and graphics. These languages can work as a supplement to data mining and statistical software.
The key things to remember while designing a Big Data architecture are:
- Dynamics of use: There are a number of scenarios, as illustrated in this article, which need to be considered while designing the architecture – the form and frequency of the data, the type of data, and the type of processing and analytics required.
- Myriad of technologies: The proliferation of tools in the market has led to a lot of confusion around what to use and when; there are multiple technologies offering similar features, each claiming to be better than the others.
I know you must be thinking about which tools to use to build a foolproof Big Data solution. Well, in my upcoming posts on Big Data, I will be covering some of the best tools for achieving different tasks in a big data architecture.