Spark on Amazon EMR
I recently had the pleasure of exploring Apache Spark for a client engagement, and what I discovered was a whole lot of awesomeness. You could say it sparked my curiosity (pun absolutely intended). I also got to play around with Apache Parquet, a cool columnar data format built for big data processing. As is customary, I spun up a repository to share my learnings. Both code examples are written in Python and use the Python API for Spark, otherwise known as PySpark. There is much, much more to explore with Spark, so stay tuned for new blog posts. But you already knew that.
Spark on Amazon EMR
This example script can be run against Amazon Elastic MapReduce (EMR) running a Spark cluster via the Apache Zeppelin sandbox environment. Running the script will output the results shown in Figure 1 inside Zeppelin.
So What’s Going on in the Script?
Figure 2 shows the script used to generate the above results. Note: Lines 1-31 have been excluded for brevity; they contain some setup information and a helper function used for the script output.
Figure 3 shows the Spark UI after running our script.
Line(s) 32-42 Purpose: Import the PySpark modules SparkContext and SQLContext, then create a SQLContext object, passing in the SparkContext. When running in the Zeppelin environment, this context is provided by Zeppelin.
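Those setup lines aren't reproduced in the post, but a minimal sketch of what they describe might look like the following. The `get_contexts` helper and the app name are my own illustration, not the script's actual code; inside Zeppelin you would skip this and use the injected `sc` and `sqlContext` instead.

```python
def get_contexts(app_name="spark-emr-demo"):
    """Create a SparkContext and SQLContext for a standalone run.

    Inside Zeppelin, `sc` and `sqlContext` are injected by the notebook,
    so this helper is only needed outside the sandbox. pyspark is imported
    lazily so the sketch can be read without a Spark install.
    """
    from pyspark import SparkContext
    from pyspark.sql import SQLContext

    sc = SparkContext(appName=app_name)
    return sc, SQLContext(sc)
```
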
Line(s) 50-59 Purpose: Run a simple query and use a lambda to output results.
Line(s) 61-71 Purpose: Run a slightly more complex query and use a lambda to output the results.
Running Script Against Local Spark
I have a script configured to run against a local instance of Spark. It is almost identical to the EMR script with a few notable changes. Please refer to the repository README for instructions to run the script locally.
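The local script itself isn't reproduced in the post, but the usual notable change when moving off EMR/Zeppelin is creating your own context against a local master. This is a sketch under that assumption, not the repository's code; see the README for the authoritative instructions.

```python
def get_local_context(app_name="spark-local-demo"):
    """Create a SparkContext against a local master using all cores.

    pyspark is imported lazily; actually running this still requires a
    local Spark/pyspark installation, per the repository README.
    """
    from pyspark import SparkConf, SparkContext

    conf = SparkConf().setMaster("local[*]").setAppName(app_name)
    return SparkContext(conf=conf)
```
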
I feel like I really got a decent grasp of the following pieces of tech through this exploration:
- Apache Spark
- Amazon EMR
- Apache Zeppelin
There is so much more to experiment with, so keep your eyes peeled for new posts. Not actually peeled; that sounds painful. In the meantime, dive into the repository and start playing around yourself. Ignite the awesomeness!