21 Nov 2020 - Jean-Georges Perrin
Apache Spark is the most active Apache project and has displaced Hadoop as the most popular unified analytics engine for big data. This article introduces the reader to big data science and to this fascinating and powerful tool.
Apache Spark is used by numerous companies such as Amazon, eBay, NASA, and Walmart, and many of these operate clusters with thousands of nodes. According to the Spark FAQ, the largest of these clusters have more than 8,000 nodes.
Some like to call Spark a unified analytics engine for large-scale data processing. It is indeed an open-source project developed by the free software community, and it is currently the most active of the Apache projects.
Compared to Hadoop MapReduce, another data-processing platform, Spark runs programs up to 100 times faster in memory and up to 10 times faster on disk. Code is also quicker to write, because Spark offers many high-level operators. Spark supports Scala, Python, R, and Java. It integrates well with the Hadoop ecosystem and its data sources, but Hadoop is not a requirement. With Spark 3, GPU acceleration is also available.
Spark SQL is a Spark component that supports data querying using either SQL or the DataFrame API. Both APIs provide uniform access to a variety of data sources, including CSV, Hive, Avro, Parquet, ORC, JSON, and JDBC. Once ingested, you can join data across these sources, perform aggregations, and apply many other transformations.
Spark Streaming supports real-time stream processing. Such data can come from the log files of a running web server (processed, for example, by Apache Flume or placed on HDFS/S3), from social networks such as Twitter, or from message queues such as Kafka.
MLlib is a machine learning library that provides algorithms designed to scale horizontally on a cluster: classification, regression, clustering, collaborative filtering, and more.
GraphX is a library for manipulating graphs and performing parallel operations on them. It provides a universal tool for ETL, exploratory analysis, and iterative graph-based computation.
Spark Core is the basic engine for large-scale parallel and distributed data processing. The core is responsible for memory management and fault recovery, for scheduling, distributing, and monitoring jobs on a cluster, and for interacting with storage systems.
Cluster managers are used to manage Spark's work across a cluster of servers: Spark can run on its own standalone cluster manager, on Hadoop YARN, on Apache Mesos, or on Kubernetes.
Spark helps to create reports quickly and to perform aggregations over large amounts of data, whether static or streamed.
It handles machine learning and distributed data integration with relative ease. Data scientists can use Spark's features through R and Python, or connect Spark to their favorite flavor of notebook, such as Zeppelin or Jupyter.
It copes with the problem of 'everything with everything' integration: there is a huge number of Spark connectors. Spark can also be used as a quick filter to reduce the dimension of the input data, for example by filtering and aggregating a stream from Kafka and loading the result into MySQL.
The potential applications for Apache Spark are vast. Any distributed analytical workload can be processed; the limit is your imagination!
There are always challenges when tackling problems at scale, such as big data: things that work on a small dataset often behave differently on big data in production. However, Apache Spark shields you from most of those constraints. Although Apache Spark is written in Scala, there is absolutely no need to learn this cumbersome language; extend your Python, Java, or R skills instead. Of course, you can also use Scala if, for some weird reason, you learnt it.
If you’re a SQL aficionado, simply reuse those skills and use Spark SQL to manipulate data.
A good way to get started is, totally unashamedly, Spark in Action, 2nd edition, published by Manning Publications, where, over its 600 pages, you will go from the basics to building complex data pipelines.
Spark helps to simplify non-trivial tasks involving heavy computational loads and the processing of big data, both real-time and archived, both structured and unstructured. It seamlessly integrates complex features such as machine learning and graph algorithms, and it brings big data processing to the masses. Personally, I like to think of Spark as an analytics operating system, as I describe it in the first chapter of Spark in Action or in this video.
Why not give it a try in your project? You will not regret it!