In my earlier post I talked about the basics of Big Data and how it can become a Future Nightmare, followed by Must Know Facts of Big Data. Today, let us talk about a very important and basic step for working with Big Data, i.e. “Big Data Architecture”.
Big data architecture is the logical and/or physical structure of how big data will be stored, accessed and managed within a big data or IT environment. It logically defines how big data solutions will work based on core components (hardware, database, software, storage) used, flow of information, security, and more. Big data architecture primarily serves as the key design reference for big data infrastructures and solutions.
Big Data Types:
Big data can be stored, acquired, processed, and analyzed in many ways. Every big data source has different characteristics, including frequency, volume, velocity, type, and veracity of data. When big data is processed and stored, additional dimensions come into play, such as governance, security, and policies.
Designing a Big Data architecture is already a complex task. Add to that the speed of technological innovation and the number of competing products in the market, and it becomes a mammoth task for any Big Data Architect.
Before designing a big data reference architecture, the most vital step is identifying whether a particular business scenario is a Big Data problem or not. These problems can be further categorized into types. Categorizing big data problems by type makes it easier to determine the individual characteristics of each data type. Big Data types can be categorized as follows:
- Machine-Generated Data
- Web and Social Data
- Transaction Data
- Human-Generated Data
- Biometrics
Classification of Big Data Characteristics using Big Data Type
Data from different sources has different characteristics; for example, social media data can include video, images, and unstructured text such as blog posts. Once data is classified according to its characteristics, it can easily be matched with the appropriate big data pattern. Listed below are some of the common characteristics by which data is assessed and categorized.
- Analysis Type: Real time Analysis or Batched Analysis
Give careful consideration to choosing the analysis type, since it affects several other decisions about products, tools, hardware, data sources, and expected data frequency. A mix of both types may be required by the use case:
- Fraud Detection: Real-Time Analysis required
- Trend Analysis / Business Decisions: Batch Mode Analysis
- Processing Methodology: Type of technique to be applied for processing data
The selected methodology helps in choosing the appropriate tools and techniques for the Big Data solution.
- Data Frequency and Size: Amount of data and the speed at which it will be obtained.
This characteristic of data helps in deciding the storage mechanism, format and pre-processing tools. Size and Frequency vary for different data sources:
- On Demand – Social Media Data
- Continuous Feed / Real Time – Weather Data, Transactional Data
- Time Series – Time Based Data
- Data Type: Type of Data to be processed.
Knowing the data type helps in segregating the data in storage.
- Content Format: Format of Incoming Data
The format tells us how the incoming data needs to be processed and what tools and techniques should be used. The format could be Structured (RDBMS), Unstructured (Audio, Video, Images) or Semi-Structured.
- Data Source: Sources of Data Generation
Identifying the Data Sources is vital in determining the scope from a business perspective. E.g. Web and Social Media, machine generated, human generated etc.
- Data Consumers: List of possible consumers of processed data
- Business Processes
- Business Users
- Enterprise Applications
- Individual people in Various Business Roles
- Part of process flows
- Other data repositories or enterprise applications
- Hardware: Hardware on which the Big Data Solution is to be implemented
Understanding the limitations of the hardware helps inform the choice of Big Data solution.
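To make the classification above concrete, here is a minimal Python sketch that records a few of these characteristics for a data source and uses the analysis type to suggest a processing mode. The class and function names (`DataSourceProfile`, `needs_stream_processing`) and the two example feeds are my own illustrations, not part of any standard tool, and a real classification would cover more characteristics than these.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Tuple


class AnalysisType(Enum):
    REAL_TIME = "real-time"
    BATCH = "batch"


class Frequency(Enum):
    ON_DEMAND = "on demand"
    CONTINUOUS = "continuous feed"
    TIME_SERIES = "time series"


class ContentFormat(Enum):
    STRUCTURED = "structured"
    SEMI_STRUCTURED = "semi-structured"
    UNSTRUCTURED = "unstructured"


@dataclass(frozen=True)
class DataSourceProfile:
    """Hypothetical record of the characteristics discussed above."""
    name: str
    source: str
    analysis_type: AnalysisType
    frequency: Frequency
    content_format: ContentFormat
    consumers: Tuple[str, ...]


def needs_stream_processing(profile: DataSourceProfile) -> bool:
    # Assumption: real-time analysis implies a streaming pipeline,
    # everything else can be handled in batch mode.
    return profile.analysis_type is AnalysisType.REAL_TIME


# Two illustrative profiles based on the examples in the list above.
fraud_feed = DataSourceProfile(
    name="fraud detection",
    source="transaction data",
    analysis_type=AnalysisType.REAL_TIME,
    frequency=Frequency.CONTINUOUS,
    content_format=ContentFormat.STRUCTURED,
    consumers=("enterprise applications",),
)
trend_feed = DataSourceProfile(
    name="trend analysis",
    source="web and social data",
    analysis_type=AnalysisType.BATCH,
    frequency=Frequency.ON_DEMAND,
    content_format=ContentFormat.UNSTRUCTURED,
    consumers=("business users",),
)

for profile in (fraud_feed, trend_feed):
    mode = "stream" if needs_stream_processing(profile) else "batch"
    print(f"{profile.name}: {mode} processing")
```

Writing the profile down like this, even informally, makes it easier to compare sources side by side before matching them to a big data pattern.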
6 Basic Steps for Designing a Big Data Architecture:
Once we have analyzed the big data scenario of the company, characteristics of the Data and the type of Big Data Pattern, we can move to the planning of Big Data Reference Architecture. We could design the Reference architecture just by following the listed 6 Easy Steps:
- Analyze the Problem:
The task to be performed at this step is similar to what has been explained in the earlier sections. We need to analyze whether we need a Big Data solution at all, the characteristics of the data, and the type of Big Data pattern.
- Vendor Selection:
This decision is made solely on the basis of the functionality we need from the tools. There are many vendors in the market offering a very wide range of tools for different tasks. It is up to the organization to decide which kind of tool it would like to opt for.
- Deployment Strategy:
This step determines whether the deployment will be on-premise, cloud-based or a mix of both.
- An on-premise solution tends to be more secure; however, hardware maintenance costs a lot more money, effort and time.
- A cloud-based solution is more cost-effective in terms of scalability, procurement and maintenance.
- A mixed deployment strategy gives us a bit of both worlds, and data storage can be planned according to how the data is used.
- Capacity Planning:
At this step we evaluate hardware and infrastructure sizing considering the below factors:
- Data volume for the one-time historical load
- Daily data ingestion volume
- Retention period of Data
- Data Replication for critical Data
- Time period for which the cluster is sized, after which the cluster is scaled horizontally
- Multi Datacenter deployment
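As a rough illustration of how these factors combine, here is a back-of-the-envelope Python sketch. The formula, the function name `estimate_raw_storage_tb`, the default replication factor of 3, the 25% headroom and the sample figures are all my own simplifying assumptions. It also assumes the historical load is kept for the whole sizing period and that every datacenter holds a full replicated copy. Treat it as a starting point, not a sizing method.

```python
import math


def estimate_raw_storage_tb(
    historical_tb: float,
    daily_ingest_tb: float,
    retention_days: int,
    sizing_period_days: int,
    replication_factor: int = 3,   # assumption: three copies of each block
    datacenters: int = 1,          # assumption: full copy per datacenter
    headroom: float = 0.25,        # assumption: 25% spare capacity
) -> float:
    """Estimate raw disk capacity (in TB) needed at the end of the sizing period."""
    # Daily data older than the retention period is assumed to be purged.
    days_kept = min(retention_days, sizing_period_days)
    logical_tb = historical_tb + daily_ingest_tb * days_kept
    replicated_tb = logical_tb * replication_factor * datacenters
    return replicated_tb * (1 + headroom)


# Illustrative numbers only: 50 TB historical load, 0.5 TB per day,
# one year of retention, cluster sized for two years.
required_tb = estimate_raw_storage_tb(
    historical_tb=50,
    daily_ingest_tb=0.5,
    retention_days=365,
    sizing_period_days=730,
)
print(f"Raw storage to provision: {required_tb:.1f} TB")
```

Even with such a simple model, it becomes obvious how strongly the replication factor and the retention period drive the final number.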
- Infrastructure Sizing:
The inferences from the previous step help in infrastructure planning, such as deciding the type of hardware required. This step also involves deciding the number of environments required. Important factors to be considered:
- Type of processing: memory-intensive or I/O-intensive
- Type of disk
- Number of disks per machine
- Memory size and HDD size
- Number of CPUs and cores
- Data retained and stored in each environment
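For the storage side of this step, a small Python sketch can turn a total storage requirement into a node count. It assumes the total comes out of capacity planning, and the function name `nodes_required`, the 80% usable fraction and the sample disk layout are hypothetical values of mine. It only covers disk capacity; memory and CPU needs would have to be checked separately.

```python
import math


def nodes_required(
    total_storage_tb: float,
    disks_per_node: int,
    disk_size_tb: float,
    usable_fraction: float = 0.8,  # assumption: 20% lost to OS, temp space, etc.
) -> int:
    """Number of machines needed to hold total_storage_tb of raw data."""
    usable_per_node_tb = disks_per_node * disk_size_tb * usable_fraction
    if usable_per_node_tb <= 0:
        raise ValueError("each node must provide some usable storage")
    # Round up: a partially filled node is still a whole machine.
    return math.ceil(total_storage_tb / usable_per_node_tb)


# Illustrative numbers only: 872 TB to store on machines with 12 x 4 TB disks.
node_count = nodes_required(total_storage_tb=872, disks_per_node=12, disk_size_tb=4)
print(f"Nodes required: {node_count}")
```

The same calculation can be repeated per environment (development, test, production) with the data volume retained in each one.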
- Backup and Disaster Recovery Sizing:
Backup and disaster recovery is a very important part of planning, and involves the following considerations:
- The criticality of data stored
- RPO (Recovery Point Objective) and RTO (Recovery Time Objective) requirements
- Active-Active or Active-Passive Disaster recovery
- Multi datacenter deployment
- Backup Interval (can be different for different types of data)
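One check that is easy to automate here is whether the planned backup interval can meet the RPO. With periodic backups, the worst case is losing everything written since the last backup, so the interval should not be longer than the RPO. The Python sketch below shows this; the `BackupPolicy` class, the `violates_rpo` function and the two sample datasets are hypothetical examples of mine, not recommendations.

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass(frozen=True)
class BackupPolicy:
    """Hypothetical backup settings for one type of data."""
    dataset: str
    rpo: timedelta              # maximum acceptable window of data loss
    backup_interval: timedelta  # time between two consecutive backups


def violates_rpo(policy: BackupPolicy) -> bool:
    # Worst case, a failure happens just before the next backup,
    # so up to one full interval of data is lost.
    return policy.backup_interval > policy.rpo


# Illustrative policies only; the intervals differ per type of data.
policies = [
    BackupPolicy("transactions", rpo=timedelta(minutes=15), backup_interval=timedelta(hours=1)),
    BackupPolicy("web logs", rpo=timedelta(hours=24), backup_interval=timedelta(hours=12)),
]

at_risk = [p.dataset for p in policies if violates_rpo(p)]
print("Datasets whose backup interval exceeds the RPO:", at_risk)
```

Datasets that fail this check need either a shorter backup interval or a different approach, such as replication to a second datacenter.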
In my next post I will discuss the different layers of the architecture and the functionality of each one of them. Till then, let me know through the comments below if I have left anything out of the planning steps.