Apache Spark
Fast and general engine for large-scale data processing
Spark is a cluster computing framework for large-scale data processing and analysis
- Parallel distributed processing on commodity hardware
- Fast and easy to use
- Comprehensive unified framework for Big Data Analytics
- Open source and a top-level Apache project
- Big data use cases such as intrusion detection, product recommendations, financial risk estimation, and detecting genomic associations with disease require analysis of data at large scale
- In-depth analysis requires a combination of tools such as SQL, statistics, and machine learning to gain meaningful insights from data
- Traditional choices such as R and Octave operate only on single-node machines and hence are not suitable for large data sets
- Spark offers rich programming APIs spanning SQL, machine learning, and graph processing that run on clusters of computers to achieve large-scale data processing and analysis
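As a minimal sketch of what this looks like in practice, the spark-shell session below runs a SQL aggregation across a cluster. It assumes a running Spark installation with `spark` as the SparkSession provided by the shell; the file path, view name, and column names are illustrative assumptions, not part of the original text.

```scala
// In spark-shell, a SparkSession is available as `spark`.
// Path and column names below are illustrative.
val events = spark.read.json("hdfs:///logs/events.json")
events.createOrReplaceTempView("events")

// Large-scale analysis expressed as ordinary SQL, executed across the cluster
val topUsers = spark.sql(
  "SELECT user, COUNT(*) AS hits FROM events GROUP BY user ORDER BY hits DESC LIMIT 10")
topUsers.show()
```

The same session could feed `topUsers` straight into an MLlib or GraphX pipeline, which is the point of the unified API.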
Why use Spark?
Challenges of distributed processing:
- Distributed programming is much more complex than single-node programming
- Data must be partitioned across machines, which increases latency whenever data has to be shared over the network
- The chance of failure increases with the number of machines
Spark makes distributed processing easy:
- Provides a distributed, parallel processing framework
- Provides scalability
- Provides fault tolerance
- Provides a programming paradigm that makes it easy to write parallel code
Spark and speed
- Lightning fast speeds due to in-memory caching and DAG-based processing engine.
- Up to 100 times faster than Hadoop MapReduce for in-memory computations and 10 times faster on disk
- Winner of the 2014 Daytona GraySort contest, sorting 100 TB of data 3 times faster than Hadoop MapReduce while using 10 times fewer machines (https://databricks.com/blog/2014/11/05/spark-officially-sets-a-new-record-in-large-scale-sorting.html)
- Well suited for iterative algorithms in machine learning.
- Fast, real-time responses to user queries on large in-memory data sets
- Low-latency data analysis applied to live data streams
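The benefit for iterative algorithms comes from in-memory caching: a data set is read and parsed once, then reused from memory on every pass instead of being re-read from disk. A minimal spark-shell sketch, where the input path and the cost computation are illustrative assumptions:

```scala
// In spark-shell, a SparkContext is available as `sc`.
// The input path is illustrative.
val points = sc.textFile("hdfs:///data/points.txt")
  .map(_.split(",").map(_.toDouble))
  .cache()                          // keep the parsed records in memory

// Each iteration reuses `points` from memory rather than re-reading the file
for (i <- 1 to 10) {
  val cost = points.map(_.sum).reduce(_ + _)
  println(s"iteration $i, cost $cost")
}
```

Without `cache()`, every iteration would trigger a fresh scan and parse of the input, which is exactly the pattern that makes iterative machine learning slow on MapReduce.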
Spark is easy to use:
- General-purpose programming model using expressive languages such as Scala, Python, and Java
- Existing libraries and APIs make it easy to write programs combining batch, streaming, iterative machine learning, and complex queries in a single application
- Interactive shell is available for Python and Scala.
- Built for performance and reliability; written in Scala and runs on top of the JVM
- Operational and debugging tools from the Java stack are available for programmers.
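For instance, a distributed word count takes only a few lines in the interactive Scala shell (spark-shell); the input path below is an illustrative assumption:

```scala
// In spark-shell, a SparkContext is available as `sc`.
val counts = sc.textFile("hdfs:///data/README.md")   // illustrative path
  .flatMap(_.split("\\s+"))        // split each line into words
  .map(word => (word, 1))          // pair each word with a count of 1
  .reduceByKey(_ + _)              // sum the counts per word across the cluster
counts.take(5).foreach(println)
```

The same code runs unchanged whether the file lives on a laptop or across a thousand-node cluster, which is what makes the shell useful for interactive exploration.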
Spark is a comprehensive unified framework for Big Data Analytics
- Collapses the data science pipeline by supporting everything from data pre-processing to model evaluation in a single system
- Provides APIs for data munging, ETL, machine learning, graph processing, streaming, and interactive and batch processing; can replace several separate SQL, streaming, and complex-analytics systems with one unified environment
- Simplifies application development, deployment and maintenance.
- Strong integration with a variety of tools in the Hadoop ecosystem
- Can read and write to different data formats and data sources including HDFS, Cassandra, S3 and HBase.
Spark is NOT a data storage system
- Spark is not a data store but is versatile in reading from and writing to a variety of data sources.
- Traditional BI tools can connect to Spark through a server mode that provides standard JDBC and ODBC connectivity
- The DataFrame API provides a pluggable mechanism to access structured data using Spark SQL.
- The API integrates tightly with Spark's query optimizer, enhancing the speed of Spark jobs that process vast amounts of data
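A minimal sketch of this pluggability, again in a spark-shell session: the DataFrame API reads from one source and format and writes to another through the same interface. The bucket name, paths, and options here are illustrative assumptions.

```scala
// In spark-shell, a SparkSession is available as `spark`.
// Bucket, paths, and options are illustrative.
val users = spark.read
  .format("csv")
  .option("header", "true")
  .load("s3a://my-bucket/users.csv")   // read structured data from S3

users.write
  .format("parquet")
  .save("hdfs:///warehouse/users")     // write the same data to HDFS as Parquet
```

Swapping sources or sinks is a matter of changing the `format` and path, not rewriting the job, which is what "pluggable" means here.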
Summary
- Spark is a lightning-fast, general-purpose engine for Big Data Analytics
- Spark has risen in popularity due to its speed, sophistication, and ease of use
- Spark is not a data store but a data processing engine, versatile in reading from and writing to a variety of data sources
- Spark is open source and a top-level Apache project today