
Apache Spark for Machine Learning - Part 2

By Christoph Champ | Posted on May 15, 2017 | Posted in Emerging Tech, Cloud Native


This post continues from where Part 1 of this series on Machine Learning left off (see "Using Apache Spark for Machine Learning – Part 1").

Getting started with Apache Spark

The official Apache Spark website provides detailed documentation on how to install Apache Spark. In this blog post, I will assume you already have Spark installed. However, if you wish, I have also created a Docker image with everything you will need to start a Docker container running Apache Spark, along with some other tools and packages used in these blog posts (e.g., R, IPython, pandas, NumPy, matplotlib, etc.).

If you want to follow along with the examples provided, you can either use your local install of Apache Spark, or you can pull down my Docker image like so (assuming you already have Docker installed on your local machine):
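The pull command looks like the following. Note that the repository name below is a placeholder, as the actual image name is not reproduced here; substitute the name of the image described above:

```shell
# Replace "<user>/apache-spark" with the actual repository name of the image:
docker pull <user>/apache-spark
```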

Note: The above Docker image size is ~2.66 GB, so it might take a while to download, depending on your Internet speed.

Then start a Docker container from the above image with:
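A sketch of the run command (again, the image name is a placeholder). The `-v "$(pwd)":/data` option mounts the current directory at `/data` inside the container, as described in the note below:

```shell
docker run -it -v "$(pwd)":/data <user>/apache-spark
```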

Note: The above container will have access to files in the directory the Docker command was run from, mounted at `/data` inside the container. Since it is not a good idea to store datasets inside Docker containers, this post assumes the datasets used in this article are in the same directory from which you ran the above Docker command (see below for details on the datasets in question).

Apache Spark will be installed under `/spark`, and all of Spark's binaries/scripts will be in the user path. As such, you can start, say, pyspark simply by executing `pyspark` from your current working directory.

A simple example using Apache Spark MLlib

Logistic regression belongs to a class of machine learning problems generally referred to as function approximation, a subset of what are called "supervised learning" problems. Linear regression is the most common (and most basic) algorithm in this class. Logistic regression (a classifier cousin of linear regression) is especially useful for text classification. It is fast, simple to use, and performs well on a wide range of predictive analytics problems with categorical targets (i.e., where the target of prediction is a yes/no, true/false, present/absent, etc. outcome). For example, in text classification, we can ask, "Does a given document contain a certain word or phrase?" The following is a very simple example of using logistic regression for text classification.

Let's pretend we have a very simple dataset where we want to do some basic text document classification. Our dataset consists of a comma-separated values (CSV) file with one text document per line. Our example dataset is just lines that either contain the word "spark" or do not; this will be the target of our MLlib training and what we use to create our ML model. Our training dataset has three columns: a document ID (`id`), the text of the document (`text`), and a `label` value of 1.0 or 0.0 indicating whether or not the line contains our target word "spark".

Our test dataset (i.e., the one we will use to test how accurate our ML model is at predicting which lines contain the word "spark") has the same structure as our training dataset, except that it contains different text documents (aka "lines") and omits the `label` column, since the label is what we are trying to predict.
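For illustration, the rows below mirror the toy data used in the ML pipeline example that ships with Spark; the exact contents of your files may differ. `train.csv`:

```
id,text,label
0,a b c d e spark,1.0
1,b d,0.0
2,spark f g h,1.0
3,hadoop mapreduce,0.0
```

and `test.csv`:

```
id,text
4,spark i j k
5,l m n
6,spark hadoop spark
7,apache hadoop
```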

The text document classification pipeline we will use has the following workflow:

  • Training workflow: Input is a set of text documents, where each document is labelled. Stages while training the ML model are:
      1. Split each text document (i.e., each line in our `train.csv` file) into words;
      2. Convert each document's words into a numerical feature vector; and
      3. Create a prediction model using the feature vectors and labels.
  • Test/prediction workflow: Input is a set of text documents, and the goal is to predict a label for each document. Stages while testing or making predictions with the ML model are:
      1. Split each text document (i.e., each line in our `test.csv` file) into words;
      2. Convert each document's words into a numerical feature vector; and
      3. Use the trained model to make predictions on the feature vectors.

Start up the pyspark REPL (interactive shell):
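From the shell inside the container (or wherever Spark is installed), this is simply:

```shell
pyspark
```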

Note: The following commands should be run from within the pyspark REPL. The code is based on the example scripts provided in the Spark tarball.

The above simple script should return the following for Spark MLlib predictions for the test documents:
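With the toy data from Spark's pipeline example, the output looks roughly like this (the probabilities vary slightly by Spark version; the values below are approximate):

```
(4, spark i j k) --> prob=[0.16, 0.84], prediction=1.000000
(5, l m n) --> prob=[0.84, 0.16], prediction=0.000000
(6, spark hadoop spark) --> prob=[0.07, 0.93], prediction=1.000000
(7, apache hadoop) --> prob=[0.98, 0.02], prediction=0.000000
```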

The above output contains the results of using logistic regression with Spark MLlib to make text document classifier predictions. Each line in the output provides the following:

  • The input line (e.g., `4, spark i j k`);
  • The probability that the line does not and does contain the target word (e.g., ~16% does not, ~84% does); and
  • The prediction made by Spark (1.0 = true; 0.0 = false).

So, for this very simple example, Spark MLlib made perfect predictions (i.e., all documents with the word "spark" in them were correctly labeled). Of course, with such a simple dataset, the results are not that interesting (or useful, really). The goal of this second blog post in the series was to provide an introduction to Apache Spark, along with a "Hello, world"-type example.


Part 3 of this blog series on Machine Learning will provide a more complicated (and interesting and useful) example of using Apache Spark for Machine Learning.