Apache Spark is the Taylor Swift of big data software. The open source technology has been around and popular for a few years. But 2015 was the year Spark went from an ascendant technology to a bona fide superstar.
Apache Spark is an open-source cluster-computing framework. Originally developed at the University of California, Berkeley's AMPLab, the Spark codebase was later donated to the Apache Software Foundation, which has maintained it since. Spark provides an interface for programming entire clusters with implicit data parallelism and fault tolerance.
Apache Spark is a fast, easy-to-use, general-purpose engine for large-scale data processing and analysis. It provides parallel, distributed processing on commodity hardware, and it offers a comprehensive, unified framework for Big Data analytics.
Spark Core is the foundation of the overall project. It provides distributed task dispatching, scheduling, and basic I/O functionalities, exposed through an application programming interface (for Java, Python, Scala, and R) centered on the resilient distributed dataset (RDD).
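As a quick illustration of that RDD-centered API, here is a minimal pyspark sketch (run inside a pyspark shell, where the SparkContext `sc` is predefined; the data is made up for illustration):

```python
# Distribute a local collection across the cluster as an RDD,
# then transform and aggregate it in parallel.
lines = sc.parallelize(["spark is fast", "hadoop mapreduce", "spark mllib"])
word_counts = (lines.flatMap(lambda line: line.split())  # split lines into words
                    .map(lambda word: (word, 1))         # pair each word with a count
                    .reduceByKey(lambda a, b: a + b))    # sum counts per word
print(word_counts.collect())
```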
Spark MLlib is a distributed machine learning framework on top of Spark Core. Many common machine learning and statistical algorithms have been implemented and are shipped with MLlib, which simplifies building large-scale machine learning pipelines. This blog post will focus on MLlib.
If your company does not have the resources to devote to machine learning, Redapt is here to help!
The official Apache Spark website provides detailed documentation on how to install Apache Spark. In this blog post, I will assume you already have Spark installed. However, if you wish, I have also created a Docker image with everything you will need to start a Docker container running Apache Spark, along with some other tools and packages used in these blog posts (R, IPython, pandas, NumPy, matplotlib, etc.).
If you want to follow along with the examples provided, you can either use your local install of Apache Spark, or you can pull down my Docker image like so (assuming you already have Docker installed on your local machine):
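The pull command looks like the following (the image name here is a placeholder, not the actual published name):

```bash
# Pull the pre-built Spark image (image name hypothetical).
docker pull <docker-hub-user>/spark-mllib
```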
Note: The above Docker image size is ~2.66 GB, so it might take a while to download, depending on your Internet speed.
Then start a Docker container from the above image with:
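A command along these lines starts an interactive container and mounts the current directory at `/data`, matching the note below (again, the image name is a placeholder):

```bash
# Start an interactive container, mounting the current directory at /data.
docker run -it -v "$(pwd)":/data <docker-hub-user>/spark-mllib bash
```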
Note: The above container will have access to files in the directory the Docker command was run from, mounted at `/data` inside the container. Since it is not a good idea to store your datasets inside Docker containers, this blog post will assume the datasets used in this article are found in the directory where you ran the above Docker command (see below for details on the datasets in question).
Apache Spark will be installed under `/spark` and all of Spark's binaries/scripts will be in the user path. As such, you can start, say, pyspark by simply executing `pyspark` from your current working directory.
A simple example using Apache Spark MLlib
Logistic regression belongs to a class of machine learning problems generally referred to as function approximation, a subset of problems called "supervised learning" problems. Linear regression is the most common (and basic) algorithm in this class. Logistic regression (a classifier cousin of linear regression) is especially useful for text classification. It is fast, simple to use, and performs well on a wide range of predictive analytics problems with categorical targets (i.e., where the target of prediction is a yes/no, true/false, present/absent, etc. outcome). For example, in text classification, we can ask, "Does a document in the dataset contain a given word or phrase?" The following is a very simple example of using logistic regression for text classification.
Let's pretend we have a very simple dataset where we want to do some basic text document classification. Our dataset consists of a comma-separated values (CSV) file with one text document per line in the file. Our example dataset is just lines that either contain the word "spark" or do not. This will be the target of our MLlib training and will be what we use to create our ML model. Our training dataset has three columns: document ID (`id`), the text of the document (`text`), and a `label` value of 1.0 or 0.0 for whether or not the line has our target word "spark".
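For illustration, a few rows of such a training CSV might look like this (the contents are made up, following the `id`, `text`, `label` layout described above):

```
0,a b c d e spark,1.0
1,b d,0.0
2,spark f g h,1.0
3,hadoop mapreduce,0.0
```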
Start up the pyspark REPL (interactive shell):
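```bash
# From inside the container (or wherever Spark is installed):
pyspark
```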
Note: The following commands should be run from within the pyspark REPL. The following code is based on the example scripts provided in the Spark tarball.
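Here is a sketch along the lines of Spark's bundled `ml/pipeline_example.py`, adapted to read the training CSV from `/data` (the file name is hypothetical) and assuming a Spark 2.x pyspark shell, where the SparkSession `spark` is predefined:

```python
from pyspark.ml import Pipeline
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.feature import HashingTF, Tokenizer
from pyspark.sql.types import StructType, StructField, LongType, StringType, DoubleType

# Load the training documents from the mounted /data directory
# (file name hypothetical; schema matches the id/text/label layout above).
schema = StructType([
    StructField("id", LongType()),
    StructField("text", StringType()),
    StructField("label", DoubleType()),
])
training = spark.read.csv("/data/training.csv", schema=schema)

# Configure an ML pipeline with three stages: tokenizer, hashingTF, and lr.
tokenizer = Tokenizer(inputCol="text", outputCol="words")
hashingTF = HashingTF(inputCol=tokenizer.getOutputCol(), outputCol="features")
lr = LogisticRegression(maxIter=10, regParam=0.001)
pipeline = Pipeline(stages=[tokenizer, hashingTF, lr])

# Fit the pipeline to the training documents to produce a model.
model = pipeline.fit(training)

# Prepare unlabeled test documents (id, text) for the model to classify.
test = spark.createDataFrame([
    (4, "spark i j k"),
    (5, "l m n"),
    (6, "spark hadoop spark"),
    (7, "apache hadoop"),
], ["id", "text"])

# Make predictions on the test documents and print the columns of interest.
prediction = model.transform(test)
for row in prediction.select("id", "text", "probability", "prediction").collect():
    print("(%d, %s) --> prob=%s, prediction=%f"
          % (row.id, row.text, str(row.probability), row.prediction))
```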
The above simple script should print Spark MLlib predictions for the test documents along the following lines (the exact probability values depend on the run and are elided here):
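```
(4, spark i j k) --> prob=[...], prediction=1.000000
(5, l m n) --> prob=[...], prediction=0.000000
(6, spark hadoop spark) --> prob=[...], prediction=1.000000
(7, apache hadoop) --> prob=[...], prediction=0.000000
```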
The above output contains the results of using logistic regression with Spark MLlib to make text document classifier predictions. Each line in the output provides the following: the document ID, the document text, the predicted class probabilities, and the predicted label (1.0 if the document is classified as containing "spark", 0.0 otherwise).
So, for this very simple example, Spark MLlib made perfect predictions (i.e., all documents with the word "spark" in them were correctly labeled). Of course, with such a simple dataset, the results are not that interesting (or useful, really). The idea of this first blog post in the series was to provide an introduction to Apache Spark, along with a "Hello, world"-type example.
Part 2 of this blog series on Machine Learning will provide a more complicated (and interesting and useful) example of using Apache Spark for Machine Learning.