Cycle of Big Data Management

ISHMEET KAUR
3 min read · May 31, 2020


• Capture data: Depending on the problem to be solved, decide on the data sources and the data to be collected.

• Organize: Cleanse, organize, and validate the data. If the data contains sensitive information, implement sufficient levels of security and governance.

• Integrate: Integrate with business rules and other relevant systems such as data warehouses, CRMs, etc.

• Analyze: Real-time analysis, batch analysis, reports, visualizations, and advanced analytics.

• Act: Use the analysis to solve the business problem.
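The five stages above can be sketched end to end as a minimal pipeline. Everything here (the function names, the toy order records, the discount rule) is hypothetical and exists only to show how each stage feeds the next:

```python
# Minimal sketch of the big data management cycle; all names and data are invented.

def capture(source):
    # Capture: read raw records from a chosen data source.
    return [{"user": " Alice ", "amount": "10"}, {"user": "Bob", "amount": None}]

def organize(records):
    # Organize: cleanse and validate; drop records missing required fields.
    cleaned = []
    for r in records:
        if r.get("amount") is not None:
            cleaned.append({"user": r["user"].strip(), "amount": int(r["amount"])})
    return cleaned

def integrate(records, rules):
    # Integrate: apply business rules (here, a per-user discount table).
    return [dict(r, amount=r["amount"] - rules.get(r["user"], 0)) for r in records]

def analyze(records):
    # Analyze: a trivial aggregate standing in for reports and advanced analytics.
    return sum(r["amount"] for r in records)

def act(total):
    # Act: use the analysis result to drive a business decision.
    return "restock" if total > 5 else "hold"

total = analyze(integrate(organize(capture("orders")), {"Alice": 2}))
print(act(total))  # prints "restock"
```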

Extract Transform Load

1. Extract: Read data from the data source.

2. Transform: Convert the format of the extracted data so that it conforms to the requirements of the target database.

3. Load: Write the data to the target database.
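The three ETL steps can be demonstrated with Python's standard library: extract rows from a CSV source, transform them into the types the target table expects, and load them into SQLite. The source data and target schema here are invented for illustration:

```python
import csv
import io
import sqlite3

# Toy ETL pipeline; the CSV data and table schema are invented.
raw_csv = "id,price\n1,9.99\n2,19.50\n"

# 1. Extract: read rows from the data source.
rows = list(csv.DictReader(io.StringIO(raw_csv)))

# 2. Transform: convert fields to the target database's format (price in cents).
transformed = [(int(r["id"]), round(float(r["price"]) * 100)) for r in rows]

# 3. Load: write the data to the target database.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE products (id INTEGER PRIMARY KEY, price_cents INTEGER)")
db.executemany("INSERT INTO products VALUES (?, ?)", transformed)
db.commit()

print(db.execute("SELECT SUM(price_cents) FROM products").fetchone()[0])  # prints 2949
```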

Components of a Big Data Infrastructure

Redundant physical infrastructure: Hardware, storage servers, network, etc.

Security infrastructure: Maintaining security and governance over data is critical to protect big data from misuse.

Data stores: To capture structured, semi-structured, and unstructured data. Data stores need to be fast, scalable, and durable.

Organize and integrate data: Stage, clean, organize, normalize, transform and integrate data.

Analytics: Traditional analytics, including business intelligence, as well as advanced analytics.

Apache Hadoop’s MapReduce and Apache Spark

1. Apache Hadoop’s MapReduce is a popular choice for batch processing large volumes of data in a scalable and resilient style.

2. Apache Spark is more suitable for applying complex analytics using machine learning models in an interactive approach.
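The MapReduce programming model behind both frameworks can be sketched in plain Python, without a cluster, using the canonical batch-processing example of a word count. The map, shuffle, and reduce phases below mirror what Hadoop or Spark would distribute across machines:

```python
from collections import defaultdict
from itertools import chain

# The MapReduce model in plain Python: word count over a toy document set.
documents = ["big data", "big analytics"]

# Map: emit (key, value) pairs from each input record.
def mapper(doc):
    return [(word, 1) for word in doc.split()]

# Shuffle: group all emitted values by key.
groups = defaultdict(list)
for key, value in chain.from_iterable(mapper(d) for d in documents):
    groups[key].append(value)

# Reduce: combine the values for each key into a final result.
counts = {key: sum(values) for key, values in groups.items()}
print(counts)  # prints {'big': 2, 'data': 1, 'analytics': 1}
```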

Data Warehouse

A data warehouse is a relational database that is designed for query and analysis rather than for transaction processing. It usually contains historical data derived from transaction data, but it can include data from other sources. It separates analysis workload from transaction workload and enables an organization to consolidate data from several sources.

Properties of a data warehouse:

1. Organized by subject area; it contains an abstracted view of the business.

2. Highly transformed and structured.

3. Data loaded into the data warehouse is based on a strictly defined use case.
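The separation of analysis workload from transaction workload can be illustrated with SQLite: raw sales transactions are consolidated so the business can run aggregate queries by subject area. The schema and data are invented for the sketch:

```python
import sqlite3

# Warehouse-style sketch: transactional rows consolidated for analysis.
# The sales schema and figures are invented.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE sales (region TEXT, amount REAL)")
db.executemany("INSERT INTO sales VALUES (?, ?)",
               [("north", 100.0), ("north", 50.0), ("south", 75.0)])

# An analysis query typical of a warehouse workload: aggregated, not row-by-row.
for region, total in db.execute(
        "SELECT region, SUM(amount) FROM sales GROUP BY region ORDER BY region"):
    print(region, total)
```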

Data Lake

A data lake is a storage repository that holds a vast amount of raw data in its native format, including structured, semi-structured, and unstructured data. The data structure and requirements are not defined until the data is needed.

Data lakes can be built on Hadoop’s HDFS or in the cloud using Amazon S3, Azure storage, etc.
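This "structure is not defined until the data is needed" property is often called schema-on-read, and it can be sketched in a few lines: raw records of different shapes sit in the same store, and structure is imposed only at query time. The event records below are invented:

```python
import json

# Schema-on-read sketch: raw records stored as-is, in their native format.
# The event data is invented for illustration.
raw_lake = [
    '{"event": "click", "page": "/home"}',   # one shape of record
    '{"event": "purchase", "total": 9.99}',  # a different shape, same store
]

# Structure is applied at read time, per query, not at ingest time.
clicks = [json.loads(line) for line in raw_lake
          if json.loads(line).get("event") == "click"]
print(len(clicks))  # prints 1
```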

Analysis of Big Data

Examples of analytics:

1. Predictive analytics: Using statistical models and machine learning algorithms to predict the outcome of a task.

2. Advanced analytics: Applying deep learning for applications like speech recognition, face recognition, and genomics.

3. Social media analytics: Analyzing social media to determine market trends, forecast demand, etc.

4. Text analytics: Deriving insights from text using Natural Language Processing.

5. Alerts and recommendations: Fraud detection and recommendation engines embedded within e-commerce applications.

6. Reports, dashboards, and visualizations: Descriptive analytics providing answers to what, when, and where questions.
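The descriptive analytics in item 6 can be sketched with simple counts: answering a "what happened, and where" question the way a report or dashboard would. The event data below is invented for illustration:

```python
from collections import Counter

# Descriptive analytics sketch: purchases counted by channel. Data is invented.
events = [
    {"what": "purchase", "where": "web"},
    {"what": "purchase", "where": "mobile"},
    {"what": "refund", "where": "web"},
]

# "What happened, where?" — a group-by count, the core of most dashboards.
by_channel = Counter(e["where"] for e in events if e["what"] == "purchase")
print(by_channel["web"], by_channel["mobile"])  # prints "1 1"
```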
