Spark on Amazon EMR

  • June 17, 2016



I recently had the pleasure of exploring Apache Spark for a client engagement, and what I discovered was a whole lot of awesomeness. You could say that it sparked my curiosity (pun absolutely intended). I also got to play around a bit with Apache Parquet, a cool columnar data format made for big data processing. As is customary, I spun up a repository to share my learnings. Both code examples are written in Python and use the Python API for Spark, otherwise known as PySpark. There is much, much more to explore with Spark, so stay tuned for new blog posts. But you already knew that.


This example script can be run against Amazon Elastic MapReduce (EMR) running a Spark cluster via the Apache Zeppelin sandbox environment. Running the script will output the results shown in Figure 1 inside Zeppelin.
    Figure 1: Sample Output


    So What’s Going on in the Script?

    Figure 2 shows the script used to generate the above results. Note: Lines 1–31 have been excluded for brevity; they contain some setup information and a helper function used in the script output.

    Figure 3 shows the Spark UI after running our script.

    Line(s) 32-42            Purpose: Import the PySpark modules SparkContext and SQLContext, and create a SQLContext object, passing in the SparkContext. When running in the Zeppelin environment, this context is provided by Zeppelin.

    Line(s) 45-48            Purpose: Load a Parquet formatted file from Amazon S3 into a DataFrame and create a temporary table. Standard SQL Queries can be run against this in-memory table.

    Line(s) 50-59            Purpose: Run a simple query and use a lambda to output results.

    Line(s) 61-71            Purpose: Run a slightly more complex query and use a lambda to output results.

    Figure 2: The script
    Figure 3: Spark GUI showing our script being run

    Running the Script Against Local Spark

    I have a script configured to run against a local instance of Spark. It is almost identical to the EMR script with a few notable changes. Please refer to the repository README for instructions to run the script locally.

    Figure 4: Spark Configuration

    I feel like I got a decent grasp of the following pieces of tech during my exploration:

    1. Apache Spark
    2. Amazon EMR
    3. Apache Zeppelin
    4. PySpark
    5. SparkSQL

    There is so much more to experiment with, so keep your eyes peeled for new posts. Not actually peeled, though; that sounds painful. In the meantime, dive into the repository and start playing around yourself. Ignite the awesomeness!

