
SPARK ON AMAZON EMR

Jonathan Dawson | June 17, 2016 | Posted In: Resource Posts

OVERVIEW

I recently had the pleasure of exploring Apache Spark for a client engagement, and what I discovered was a whole lot of awesomeness. You could say it sparked my curiosity (pun absolutely intended). I also got to play around a bit with Apache Parquet, a cool columnar data format made for big data processing. As is customary, I spun up a repository to share my learnings. Both code examples are written in Python and use the Python API for Spark, otherwise known as PySpark. There is much, much more to explore with Spark, so stay tuned for new blog posts. But you already knew that.

SPARK ON AMAZON EMR

This example script can be run against Amazon Elastic MapReduce (EMR) running a Spark cluster, via the Apache Zeppelin sandbox environment. Running the script will output the results shown in Figure 1 inside Zeppelin.
Figure 1: Sample Output

So What's Going on in the Script?

Figure 2 shows the script used to generate the above results. Note: Lines 1–31 have been excluded for brevity but basically include some setup information and a helper function used in the script output.

Figure 3 shows the Spark UI after running our script.

Line(s) 32-42           Purpose: Import the PySpark modules SparkContext and SQLContext, then create a SQLContext object, passing in the SparkContext. When running in the Zeppelin environment, this context is provided by Zeppelin.
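In essence, the setup boils down to something like this (a minimal sketch; the app name is illustrative, and in Zeppelin the contexts are injected for you rather than created by hand):

    from pyspark import SparkContext
    from pyspark.sql import SQLContext

    # In Zeppelin, sc is provided by the notebook; creating it manually
    # is only needed when running the script standalone.
    sc = SparkContext(appName="pyspark-s3-parquet-example")
    sqlContext = SQLContext(sc)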

Line(s) 45-48            Purpose: Load a Parquet-formatted file from Amazon S3 into a DataFrame and create a temporary table. Standard SQL queries can be run against this in-memory table.
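Something along these lines (the bucket path and table name below are placeholders, not the ones from the actual script):

    # Read a Parquet file from S3 into a DataFrame. Substitute your own
    # bucket and key for the placeholder URL.
    df = sqlContext.read.parquet("s3n://your-bucket/path/to/data.parquet")

    # Register the DataFrame as a temporary table so standard SQL
    # queries can be run against it.
    df.registerTempTable("my_table")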

Line(s) 50-59            Purpose: Run a simple query and use a lambda to output results.
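For example (the column name is made up for illustration):

    # Select a handful of rows, then format each one with a lambda.
    results = sqlContext.sql("SELECT name FROM my_table LIMIT 10")
    formatted = results.rdd.map(lambda row: "Name: {0}".format(row.name))
    for line in formatted.collect():
        print(line)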

Line(s) 61-71            Purpose: Run a slightly more complex query and use a lambda to output the results.
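Roughly like this, with some grouping and sorting added (again, the schema is invented for illustration):

    # Aggregate and group, then format each result row with a lambda.
    top = sqlContext.sql("""
        SELECT category, COUNT(*) AS total
        FROM my_table
        GROUP BY category
        ORDER BY total DESC
        LIMIT 5
    """)
    for line in top.rdd.map(lambda row: "{0}: {1}".format(row.category, row.total)).collect():
        print(line)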

Figure 2: pyspark-s3-parquet-example.py
Figure 3: Spark GUI showing our script being run

Running the Script Against Local Spark

I have a script configured to run against a local instance of Spark. It is almost identical to the EMR script, with a few notable changes. Please refer to the repository README for instructions on running the script locally.

Figure 4: Spark Configuration
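The gist of the local configuration looks something like this (a sketch; the master setting, app name, and file path all depend on your machine):

    from pyspark import SparkConf, SparkContext
    from pyspark.sql import SQLContext

    # Point Spark at the local machine instead of an EMR cluster.
    conf = (SparkConf()
            .setMaster("local[*]")            # use all available local cores
            .setAppName("pyspark-parquet-local"))
    sc = SparkContext(conf=conf)
    sqlContext = SQLContext(sc)

    # Locally, the Parquet file is read from the filesystem rather than S3.
    df = sqlContext.read.parquet("/path/to/data.parquet")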

Through this exploration, I feel like I got a decent grasp of the following pieces of tech:

  1. Apache Spark
  2. Amazon EMR
  3. Apache Zeppelin
  4. PySpark
  5. SparkSQL

There is so much more to experiment with, so keep your eyes peeled for new posts. Not actually peeled — that sounds painful. In the meantime, dive into the repository and start playing around yourself … ignite the awesomeness!

Tagged in: #Tech Talk #Cloud Native #Engineering Services