This lecture covers three open-source frameworks for large-scale data processing:
Apache Hadoop,
Apache Spark,
Dask.
All three allow users to take advantage of distributed architectures for data processing.
Learning outcomes
Describe the key features of Hadoop, Spark and Dask.
Recognize the advantages of using distributed file systems for data processing.
Use pyspark to programmatically access Spark’s capabilities.
Explain why we would want to perform stream data processing.
Compare Spark and Dask’s differences and similarities.
Hadoop
Hadoop is a framework with several components.
We will focus on two of them:
The Hadoop Distributed File System (HDFS).
An implementation of MapReduce.
Together, these two elements let users distribute data processing across many nodes without having to worry (too much) about how files are structured or how computation is performed.
Distributed file systems
Distributed file systems store files across many storage nodes.
Rather than keeping a single copy, the system stores each file as multiple copies (replicas) spread across different nodes.
Data locality - processing tasks can locally access data they need.
Consistency ensured through synchronization.
Scalable by adding more nodes.
Resistant to failure.
Transparency - the user should not need to know the details of how the file system works.
Hadoop MapReduce
To write a MapReduce application in Hadoop, a user must:
Specify the mapper (how should a small chunk of data be processed?).
Specify the reducer (how should the mapper results be combined?).
Get the data onto HDFS (the cluster nodes).
Launch the job by pointing to the data, mapper and reducer.
Hadoop is Java-based, but the mapper and reducer can be specified in other languages, such as Python, using the Hadoop Streaming interface.
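As a sketch of what a Hadoop Streaming job involves, the classic word-count example can be written in pure Python. In a real Streaming job the mapper and reducer would each read lines from standard input and print tab-separated key-value pairs to standard output; here the two stages are written as plain functions and chained in memory so the logic is easy to follow.

```python
from itertools import groupby

def mapper(lines):
    """Emit a (word, 1) pair for every word in the input lines."""
    for line in lines:
        for word in line.strip().split():
            yield word, 1

def reducer(pairs):
    """Sum the counts for each word.

    Hadoop sorts the mapper output by key before the reduce stage,
    so pairs with the same word arrive grouped together.
    """
    for word, group in groupby(pairs, key=lambda kv: kv[0]):
        yield word, sum(count for _, count in group)

# Chain the two stages on a small in-memory example; with Hadoop
# Streaming each stage would instead read stdin and write stdout.
pairs = sorted(mapper(["the cat sat", "the cat"]))
counts = dict(reducer(pairs))  # {'cat': 2, 'sat': 1, 'the': 2}
```

The shuffle-and-sort step between the two stages (here a simple `sorted` call) is exactly what Hadoop performs for us across the cluster.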
Once these components are specified and a job is launched, Hadoop takes care of the rest, including the communication between the different stages and determining on which nodes processing occurs.
This makes it relatively simple to handle large volumes of data.
While powerful, Hadoop MapReduce is mainly appropriate for large batch-processing tasks.
Hadoop MapReduce data flow
The MapReduce model enforces a linear data flow structure:
data chunks are read from disk in the HDFS,
the mapper processes the data chunks,
the reducer combines the mapper outputs,
the results are written back to disk in the HDFS.
This restriction makes Hadoop MapReduce less efficient for iterative algorithms and interactive data exploration, due to the high overhead of repeatedly reading and writing to disk.
Spark
Apache Spark is a framework for processing large amounts of data in memory.
It supports more general workflows than Hadoop MapReduce, and by avoiding writing to disk unless needed, allows more efficient iteration and data exploration.
It can run on top of Hadoop and take advantage of HDFS.
Spark features
Basic features of Spark:
Lazy: does not perform computations until required.
In-memory processing: faster than using disk storage.
Resilience: the steps used to produce data are tracked, so lost data can be reconstructed after a failure.
Streams: can handle data in real time as it arrives (“online”).
Spark SQL and datasets
Spark comes with interfaces in a few languages, including Python (pyspark).
These offer programmatic access to its various capabilities.
Spark SQL is a component for structured data processing with support for executing SQL queries.
The key abstractions in Spark SQL are Datasets and DataFrames - pyspark only supports DataFrames.
These both represent collections of items that may be distributed over multiple nodes.
Example: pyspark pandas
DataFrames can be created by specifying the values directly, or by reading from and transforming an existing source (including a Pandas DataFrame).
     x    y
0  1.0  2.0
1  2.1  4.2
2  3.5  7.0
Spark DataFrames can also be handled like pandas DataFrames:
This allows using Spark’s distributed processing functionality without major changes from code that originally used pandas.
The distributed nature of the data is transparent.
MLlib
Spark includes an extensive Machine Learning library (MLlib), with support for
classification,
regression,
clustering,
reading from different file types,
data pipelines.
Using Spark’s MLlib makes it easier to take advantage of distributed processing than with, for example, scikit-learn.
Example: MLlib logistic regression
To allow us to demonstrate fitting a logistic regression model with MLlib, we first download an example data file and save it in a local temporary file:
Dask
Unlike Spark, Dask is written in and primarily used within Python, and is tightly integrated with other packages in the wider Python ecosystem.
Dask does not support SQL, but equivalent operations can be performed using its Python interface.
Dask task scheduling
Dask supports a generic task-scheduling paradigm that can make it more flexible than alternatives such as Spark.
Image credit: Dask documentation
A graph of the tasks corresponding to a computation is built, and can then be executed on a variety of schedulers running on both single machines and clusters.
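This task-graph idea can be sketched with `dask.delayed` (assuming Dask is installed): each decorated call records a task rather than running it, and the whole graph executes only when `compute` is called.

```python
import dask

@dask.delayed
def inc(x):
    # Each call just records a task in the graph; nothing runs yet.
    return x + 1

# Build a small task graph: three inc tasks feeding one sum task.
total = dask.delayed(sum)([inc(i) for i in range(3)])

# Executing the graph on a scheduler produces the result.
result = total.compute()  # 1 + 2 + 3 = 6
```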
Unlike pandas, where operations are evaluated eagerly, operations in Dask are evaluated lazily, with an expression graph being built to represent the applied operations.
We can visualize the expression graph associated with a Dask DataFrame using the visualize method:
results.visualize(rankdir="TD")
We can also optimize the expression graph, for example by fusing operations.
We can instead visualize the constructed task graph:
optimized.visualize(tasks=True, rankdir="LR")
Calling the compute method on a Dask DataFrame will schedule the attached task graph and output a pandas DataFrame:
optimized.compute().head()
                x         y
name
Alice    0.288051  0.577736
Bob      0.289142  0.577752
Charlie  0.288792  0.576825
Dan      0.288814  0.575794
Edith    0.289188  0.577416
Dask defaults to a threaded scheduler to run computations, but distributed schedulers can be used instead.
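The scheduler can be selected per computation via the `scheduler` argument to `compute` (a small sketch, assuming Dask is installed; a distributed scheduler would additionally require the `dask.distributed` package):

```python
import dask

@dask.delayed
def square(x):
    return x * x

total = dask.delayed(sum)([square(i) for i in range(4)])

# The default threaded scheduler, requested explicitly here...
threaded = total.compute(scheduler="threads")

# ...or a single-threaded scheduler, useful for debugging.
serial = total.compute(scheduler="synchronous")
```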
Summary
Using frameworks makes it much easier to build applications with distributed data processing.
Hadoop and Spark are complementary: Hadoop provides more basic functionality, while Spark, which can also run standalone, is better suited to data that fits in memory.
Dask is a modern Python-based alternative to Spark for distributed data processing that offers a very general task-scheduling framework.