What and Why

Apache Spark

Apache Spark is the most active Apache project and has displaced Hadoop as the most popular unified analytics engine for big data. This article introduces the reader to big data science and to this fascinating and powerful tool.

Jean-Georges Perrin is the author of Spark in Action and an Enterprise Architect at Advance Auto Parts. He also writes and hosts the YouTube channel Data Fridays.

Introduction

Why Apache Spark?

Apache Spark is used by numerous companies such as Amazon, eBay, NASA, and Walmart, and many of these operate clusters with thousands of nodes. According to the Spark FAQ, the largest known cluster has more than 8,000 nodes.

Some like to call Spark a unified analytics engine for large-scale data processing. It is indeed an open-source project developed by the free software community, and it is currently the most active of the Apache projects.

Compared to Hadoop MapReduce, another data processing platform, Spark runs programs up to 100 times faster in memory and more than 10 times faster on disk. Code is also quicker to write, because Spark offers many high-level operators. Spark supports Scala, Python, R, and Java. It is well integrated with the Hadoop ecosystem and its data sources, but Hadoop is not a requirement. With Spark 3, GPU acceleration is also available.

Spark Components

All the Pieces

SQL and DataFrames

A Spark component that supports data querying using either SQL or the DataFrame API. Both APIs provide uniform access to a variety of data sources, including CSV, Hive, Avro, Parquet, ORC, JSON, and JDBC. Of course, once ingested, data can be joined across these sources, aggregated, and transformed in many ways.
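
To make this concrete, here is a minimal PySpark sketch (the file name and column names are hypothetical) that ingests a CSV file and runs the same query through the DataFrame API and through SQL:

from pyspark.sql import SparkSession

# Start (or reuse) a local Spark session
spark = SparkSession.builder.appName("csv-demo").getOrCreate()

# Ingest a CSV file; the path and columns are made up for the example
df = spark.read.option("header", True).option("inferSchema", True) \
    .csv("data/authors.csv")

# DataFrame API: filter and project
df.filter(df.country == "France").select("name", "country").show()

# The same query in SQL, after registering a temporary view
df.createOrReplaceTempView("authors")
spark.sql("SELECT name, country FROM authors WHERE country = 'France'").show()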

Spark Streaming

Supports the processing of real-time data streams. Such data can be log files from a production web server (for example, collected by Apache Flume or landed on HDFS/S3), feeds from social networks such as Twitter, and message queues such as Kafka.
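
As a sketch, here is the word-count classic of Structured Streaming; the built-in socket source stands in for a real feed such as Kafka or Flume, and the host and port are assumptions:

from pyspark.sql import SparkSession
from pyspark.sql.functions import explode, split

spark = SparkSession.builder.appName("stream-demo").getOrCreate()

# Read a text stream from a local socket (a stand-in for Kafka, Flume, etc.)
lines = spark.readStream.format("socket") \
    .option("host", "localhost").option("port", 9999).load()

# Maintain a running word count over the stream
words = lines.select(explode(split(lines.value, " ")).alias("word"))
counts = words.groupBy("word").count()

# Print the continuously updated counts to the console
query = counts.writeStream.outputMode("complete").format("console").start()
query.awaitTermination()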

MLlib

A machine learning library that provides various algorithms designed to scale horizontally across a cluster: classification, regression, clustering, collaborative filtering, and more.
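
A minimal sketch with a tiny, made-up dataset: training a logistic regression classifier, which MLlib can scale across a cluster without code changes:

from pyspark.sql import SparkSession
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.linalg import Vectors

spark = SparkSession.builder.appName("mllib-demo").getOrCreate()

# A tiny, fabricated training set: (label, features)
training = spark.createDataFrame([
    (0.0, Vectors.dense(0.0, 1.1)),
    (1.0, Vectors.dense(2.0, 1.0)),
    (0.0, Vectors.dense(0.1, 1.3)),
    (1.0, Vectors.dense(2.2, 0.9)),
], ["label", "features"])

# Train the classifier and check its predictions on the training data
lr = LogisticRegression(maxIter=10, regParam=0.01)
model = lr.fit(training)
model.transform(training).select("label", "prediction").show()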

GraphX

A library for manipulating graphs and performing parallel operations on them. The library provides a universal tool for ETL, exploratory analysis, and iterative graph computation.
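
GraphX itself exposes a Scala API; from Python, the separate GraphFrames package offers a comparable DataFrame-based interface. A minimal sketch, assuming graphframes is installed and using a made-up social graph:

from pyspark.sql import SparkSession
from graphframes import GraphFrame  # separate package, not bundled with Spark

spark = SparkSession.builder.appName("graph-demo").getOrCreate()

# Hypothetical graph: people and who follows whom
vertices = spark.createDataFrame(
    [("a", "Alice"), ("b", "Bob"), ("c", "Carol")], ["id", "name"])
edges = spark.createDataFrame(
    [("a", "b", "follows"), ("b", "c", "follows")],
    ["src", "dst", "relationship"])

g = GraphFrame(vertices, edges)
g.inDegrees.show()  # how many followers each person has

# Iterative graph computation: PageRank
g.pageRank(resetProbability=0.15, maxIter=5).vertices.show()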

Spark Core

The basic engine for large-scale parallel and distributed data processing. The core is responsible for memory management and fault recovery, for scheduling, distributing, and monitoring jobs on a cluster, and for interacting with storage systems.
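
This engine surfaces as the low-level RDD API. A minimal sketch, computing a sum of squares in parallel:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("core-demo").getOrCreate()
sc = spark.sparkContext

# Distribute a collection across the cluster, then map and reduce in parallel
rdd = sc.parallelize(range(1, 1001))
total = rdd.map(lambda x: x * x).reduce(lambda a, b: a + b)
print(total)  # the sum of the squares of 1..1000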

Cluster Managers

Cluster managers allocate resources and coordinate Spark's work across a cluster of servers. Spark ships with its own standalone manager and also runs on Hadoop YARN, Apache Mesos, and Kubernetes.
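
In practice, you select the cluster manager through the master URL when the session is created; the URLs below are illustrative:

from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("cluster-demo")
         .master("local[*]")                  # all cores of this machine, no cluster
         # .master("spark://host:7077")       # Spark's standalone manager
         # .master("yarn")                    # Hadoop YARN
         # .master("k8s://https://host:443")  # Kubernetes
         .getOrCreate())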

Capabilities

What Spark Can Do

The main functions

Spark helps you create reports quickly and perform aggregations over large amounts of data, whether static datasets or live streams.
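
For example, a typical report-style aggregation (the file and columns are hypothetical) takes only a handful of lines:

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("report-demo").getOrCreate()

orders = spark.read.option("header", True).option("inferSchema", True) \
    .csv("data/orders.csv")

# Revenue and order count per country, sorted for the report
(orders.groupBy("country")
       .agg(F.sum("amount").alias("revenue"), F.count("*").alias("orders"))
       .orderBy("revenue", ascending=False)
       .show())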

It also tackles machine learning and distributed data integration, and makes both reasonably easy. By the way, data scientists can use Spark's features through R and Python or connect Spark to their favorite flavor of notebook, like Zeppelin or Jupyter.

It copes with the problem of 'everything with everything' integration: a huge number of connectors exist for Spark. Spark can also be used as a quick filter to reduce the dimension of the input data; for example, reading a stream from Kafka, filtering and aggregating it, and loading the result into MySQL, as sketched below.
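
A sketch of that pipeline, assuming the spark-sql-kafka connector and a MySQL JDBC driver are on the classpath; the servers, topic, table, and credentials are all made up:

from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.appName("kafka-to-mysql").getOrCreate()

# Read a stream from Kafka (requires the spark-sql-kafka connector)
events = (spark.readStream.format("kafka")
          .option("kafka.bootstrap.servers", "localhost:9092")
          .option("subscribe", "events")
          .load())

# Filter early to reduce the volume flowing downstream
errors = events.selectExpr("CAST(value AS STRING) AS value") \
               .filter(col("value").contains("ERROR"))

# Write each micro-batch to MySQL through JDBC
def to_mysql(batch_df, batch_id):
    (batch_df.write.format("jdbc")
     .option("url", "jdbc:mysql://localhost:3306/logs")
     .option("dbtable", "errors")
     .option("user", "spark").option("password", "secret")
     .mode("append").save())

errors.writeStream.foreachBatch(to_mysql).start().awaitTermination()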

The potential applications for Apache Spark are vast and should get your imagination going.

Did I mention that Apache Spark is an operating system?

Any distributed analytical workload can be processed; the limit is your imagination!

How to get started

There are always challenges when tackling problems at scale, such as big data: things that work on a small dataset often behave differently on big data in production. However, Apache Spark shields you from most of those constraints. Although Apache Spark is written in Scala, there is absolutely no need to learn this cumbersome language: extend your Python, Java, or R skills instead. Of course, you can also use Scala if, for some weird reason, you learned it.
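
Getting a first session running takes minutes. One common route, assuming a working Python installation:

# Install Spark's Python bindings, then run this script with python
#   pip install pyspark
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("hello-spark").getOrCreate()
spark.range(5).show()  # a first distributed dataset: the numbers 0 to 4
spark.stop()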

If you’re a SQL aficionado, simply reuse those skills and use Spark SQL to manipulate data.

A good way to get started is, totally shamelessly, Spark in Action, 2nd edition, published by Manning Publications, where, over its 600 pages, you will learn everything from the basics to building complex data pipelines.

Conclusion

Spark helps to simplify non-trivial tasks involving heavy computational loads and the processing of big data, both real-time and archived, both structured and unstructured. Spark provides seamless integration of complex features, such as machine learning and graph algorithms, and brings big data processing to the masses. Personally, I like to think of Spark as an Analytics Operating System, as I describe it in the first chapter of Spark in Action or in this video.

Why not give it a try in your project? You will not regret it!

About the Author

Jean-Georges Perrin “jgp” is an enterprise architect focusing on innovation at Advance Auto Parts and the author of Spark in Action, 2nd edition (Manning). He is passionate about software engineering and all things data, small and big. His latest endeavors bring him more and more toward data engineering, data governance, and, his favorite theme, the industrialization of data science. He is proud to have been the first person in France to be recognized as an IBM Champion, an honor he has now received for 12 consecutive years. Jean-Georges shares his more than 25 years of experience in the IT industry as a presenter and participant at conferences and through articles in print and online media. His blog is at http://jgp.ai. When he is not immersed in IT, which he loves, he enjoys exploring his adopted region of North Carolina with his wife and kids.
