Apache Spark
Fast and general engine for large-scale data processing
Spark is a cluster computing framework for large-scale data processing and analysis
- Parallel distributed processing on commodity hardware
- Fast and easy to use
- Comprehensive unified framework for Big Data Analytics
- Open source and a top-level Apache project
- Big data use cases such as intrusion detection, product recommendations, financial risk estimation, and detecting genomic associations with disease require analysis of data at large scale
- In-depth analysis requires a combination of tools such as SQL, statistics, and machine learning to gain meaningful insights from data
- Traditional choices such as R and Octave operate only on single-node machines and hence are not suitable for large data sets
- Spark offers rich programming APIs spanning SQL, machine learning, and graph processing that run on clusters of computers to achieve large-scale data processing and analysis
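As a minimal sketch of what this looks like in practice, the spark-shell session below runs a SQL aggregation across a cluster. It assumes a running Spark installation with `spark` as the SparkSession provided by the shell; the file path, view name, and column names are illustrative assumptions, not part of the original text.

```scala
// In spark-shell, a SparkSession is available as `spark`.
// Path and column names below are illustrative.
val events = spark.read.json("hdfs:///logs/events.json")
events.createOrReplaceTempView("events")

// Large-scale analysis expressed as ordinary SQL, executed across the cluster
val topUsers = spark.sql(
  "SELECT user, COUNT(*) AS hits FROM events GROUP BY user ORDER BY hits DESC LIMIT 10")
topUsers.show()
```

The same session could feed `topUsers` straight into an MLlib or GraphX pipeline, which is the point of the unified API.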
Why use Spark?
Challenges of distributed processing:
- Distributed programming is much more complex than single-node programming
- Data must be partitioned across machines, which increases latency whenever data has to be shared over the network
- The chance of failure increases with the number of machines
Spark makes distributed processing easy:
- Provides a distributed, parallel processing framework
- Provides scalability
- Provides fault tolerance
- Provides a programming paradigm that makes it easy to write parallel code
Spark and speed
- Lightning fast speeds due to in-memory caching and DAG-based processing engine.
- Up to 100 times faster than Hadoop MapReduce for in-memory computations and 10 times faster on disk
- Winner of the 2014 Daytona GraySort contest, sorting 100 TB of data 3 times faster than Hadoop MapReduce while using 10 times fewer machines (https://databricks.com/blog/2014/11/05/spark-officially-sets-a-new-record-in-large-scale-sorting.html)
- Well suited for iterative algorithms in machine learning.
- Fast, real-time responses to user queries on large in-memory data sets
- Low-latency data analysis applied to live data streams
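The benefit for iterative algorithms comes from in-memory caching: a data set is read and parsed once, then reused from memory on every pass instead of being re-read from disk. A minimal spark-shell sketch, where the input path and the cost computation are illustrative assumptions:

```scala
// In spark-shell, a SparkContext is available as `sc`.
// The input path is illustrative.
val points = sc.textFile("hdfs:///data/points.txt")
  .map(_.split(",").map(_.toDouble))
  .cache()                          // keep the parsed records in memory

// Each iteration reuses `points` from memory rather than re-reading the file
for (i <- 1 to 10) {
  val cost = points.map(_.sum).reduce(_ + _)
  println(s"iteration $i, cost $cost")
}
```

Without `cache()`, every iteration would trigger a fresh scan and parse of the input, which is exactly the pattern that makes iterative machine learning slow on MapReduce.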
Spark is easy to use:
- General-purpose programming model using expressive languages such as Scala, Python, and Java
- Existing libraries and APIs make it easy to write programs combining batch, streaming, iterative machine learning, and complex queries in a single application
- Interactive shell is available for Python and Scala.
- Built for performance and reliability; written in Scala and runs on top of the JVM
- Operational and debugging tools from the Java stack are available for programmers.
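For instance, a distributed word count takes only a few lines in the interactive Scala shell (spark-shell); the input path below is an illustrative assumption:

```scala
// In spark-shell, a SparkContext is available as `sc`.
val counts = sc.textFile("hdfs:///data/README.md")   // illustrative path
  .flatMap(_.split("\\s+"))        // split each line into words
  .map(word => (word, 1))          // pair each word with a count of 1
  .reduceByKey(_ + _)              // sum the counts per word across the cluster
counts.take(5).foreach(println)
```

The same code runs unchanged whether the file lives on a laptop or across a thousand-node cluster, which is what makes the shell useful for interactive exploration.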
Spark is a comprehensive unified framework for Big Data Analytics
- Collapses the data science pipeline by supporting everything from data pre-processing to model evaluation in a single system
- Provides APIs for data munging, ETL, machine learning, graph processing, streaming, and interactive and batch processing; can replace several separate SQL, streaming, and complex-analytics systems with one unified environment
- Simplifies application development, deployment and maintenance.
- Strong integration with a variety of tools in the Hadoop ecosystem
- Can read and write to different data formats and data sources including HDFS, Cassandra, S3 and HBase.
Spark is NOT a data storage system
- Spark is not a data store but is versatile in reading from and writing to a variety of data sources.
- Traditional BI tools can connect to Spark through a server mode that provides standard JDBC and ODBC connectivity
- The DataFrame API provides a pluggable mechanism to access structured data using Spark SQL.
- The API integrates tightly with Spark's query optimizer, enhancing the speed of Spark jobs that process vast amounts of data
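A minimal sketch of this pluggability, again in a spark-shell session: the DataFrame API reads from one source and format and writes to another through the same interface. The bucket name, paths, and options here are illustrative assumptions.

```scala
// In spark-shell, a SparkSession is available as `spark`.
// Bucket, paths, and options are illustrative.
val users = spark.read
  .format("csv")
  .option("header", "true")
  .load("s3a://my-bucket/users.csv")   // read structured data from S3

users.write
  .format("parquet")
  .save("hdfs:///warehouse/users")     // write the same data to HDFS as Parquet
```

Swapping sources or sinks is a matter of changing the `format` and path, not rewriting the job, which is what "pluggable" means here.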
Summary
- Spark is a lightning-fast, general-purpose engine for Big Data Analytics
- Spark has risen in popularity due to its speed, sophistication, and ease of use
- Spark is not a data store but a data processing engine, versatile in reading from and writing to a variety of data sources
- Spark is open source and a top-level Apache project today