Deploy Tall Arrays to a Spark Enabled Hadoop Cluster
Supported Platform: Linux® only.
This example shows how to deploy a MATLAB® application containing tall arrays to a Spark™ enabled Hadoop® cluster.
Goal: Compute the mean arrival delay and the biggest arrival delays of airlines from the given dataset.
| Dataset: | airlinesmall.csv | 
| Description: | Airline departure and arrival information from 1987-2008. | 
| Location: | To download the  setupExample("matlab/AddKeysValuesExample", pwd)AddKeysValuesExample.mlxlive script file
                                    that is automatically downloaded along with theairlinesmall.csvfile. | 
Note
You can follow the same instructions to deploy tall array Spark applications to CLOUDERA® CDH. To see an example on MATLAB Answers™, click here.
To use CLOUDERA CDH encryption zones, add the JAR file
                        commons-codec-1.9.jar to the static classpath of MATLAB Runtime. Location of the file:
                        $HADOOP_PREFIX/lib/commons-codec-1.9.jar, where
                    $HADOOP_PREFIX is the location where Hadoop is installed.
Note
If you are using Spark version 1.6 or higher, you will need to increase the Java® heap size in MATLAB to at least 512MB. For information on how to increase the Java heap size in MATLAB, see Java Heap Memory Settings.
Prerequisites
- Start this example by creating a new work folder that is visible to the MATLAB search path. 
- Install the MATLAB Runtime in a folder that is accessible by every worker node in the Hadoop cluster. This example uses - /usr/local/MATLAB/MATLAB_Runtime/as the location of the MATLAB Runtime folder.- <ver>- If you don’t have the MATLAB Runtime, you can download it from the website at: - https://www.mathworks.com/products/compiler/mcr.- Note - Replace all references to the MATLAB Runtime version - <ver>
- Copy the file - airlinesmall.csvinto Hadoop Distributed File System (HDFS™) folder- /user/<username>/datasets. Here- <username>refers to your user name in HDFS.- $ ./hadoop fs -copyFromLocal airlinesmall.csv hdfs://host:54310/user/<username>/datasets 
Procedure
- Set up the environment variable, - HADOOP_PREFIXto point at your Hadoop install folder. These properties are necessary for submitting jobs to your Hadoop cluster.- setenv('HADOOP_PREFIX','/usr/lib/hadoop') - The - HADOOP_PREFIXenvironment variable must be set when using the MATLAB- datastorefunction to point to data on HDFS. Setting this environment variable has nothing to do with Spark. See Relationship Between Spark and Hadoop for more information.- If you plan on using a dataset that’s on your local machine as opposed to one on HDFS, then you can skip this step. - Note - This example uses - /usr/lib/hadoopas directory where Hadoop is installed. Your Hadoop installation directory maybe different.
- Specify Spark properties. - Use a - containers.Mapobject to specify Spark properties.- sparkProperties = containers.Map( ... {'spark.executor.cores', ... 'spark.executor.memory', ... 'spark.yarn.executor.memoryOverhead', ... 'spark.dynamicAllocation.enabled', ... 'spark.shuffle.service.enabled', ... 'spark.eventLog.enabled', ... 'spark.eventLog.dir'}, ... {'1', ... '2g', ... '1024', ... 'true', ... 'true', ... 'true', ... 'hdfs://host:54310/user/<username>/log'}); - For more information on Spark properties, expand the - propvalue of the- 'SparkProperties'name-value pair in the Input Arguments section of the- SparkConfclass. The- SparkConfclass is part of the MATLAB API for Spark, which provides an alternate way to deploy MATLAB applications to Spark. For more information, see Deploy Applications Using the MATLAB API for Spark.
- Configure your MATLAB application containing tall arrays with Spark parameters. - Use the class - matlab.mapreduce.DeploySparkMapReducerto configure your MATLAB application containing tall arrays with Spark parameters as key-value pairs.- conf = matlab.mapreduce.DeploySparkMapReducer( ... 'AppName','myTallApp', ... 'Master','yarn-client', ... 'SparkProperties',sparkProperties); - For more information, see - matlab.mapreduce.DeploySparkMapReducer.
- Define the Spark execution environment. - Use the - mapreducerfunction to define the Spark execution environment.- mapreducer(conf) - For more information, see - mapreducer.
- Include your MATLAB application code containing tall arrays. - Use the MATLAB function - datastoreto create a- datastoreobject pointing to the file- airlinesmall.csvin HDFS. Pass the- datastoreobject as an input argument to the- tallfunction. This will create a tall array. You can perform operations on the tall array to compute the mean arrival delay and the biggest arrival delays.- % Create a |datastore| for a collection of tabular text files representing airline data. % Select the variables of interest, specify a categorical data type for the % |Origin| and |Dest| variables. % ds = datastore('airlinesmall.csv') % if using a dataset on your local machine ds = datastore('hdfs:///<username>/datasets/airlinesmall.csv'); ds.TreatAsMissing = 'NA'; ds.SelectedVariableNames = {'Year','Month','ArrDelay','DepDelay','Origin','Dest'}; ds.SelectedFormats(5:6) = {'%C','%C'}; % Create Tall Array % Tall arrays are like normal MATLAB arrays, except that they can have any % number of rows. When a |tall| array is backed by a |datastore|, the underlying class of % the tall array is based on the type of datastore. tt = tall(ds); % Remove Rows with Missing Data or NaN Values idx = any(ismissing(tt),2); tt(idx,:) = []; % Compute Mean Delay meanArrivalDelay = mean(tt.DepDelay,'omitnan'); biggestDelays = topkrows(tt,10,'ArrDelay'); % Gather Results % The |gather| function forces evaluation of all queued operations and % brings the resulting output back into memory. [meanArrivalDelay,biggestDelays] = gather(meanArrivalDelay,biggestDelays) % Delete mapreducer object delete(conf); 
- Create a Spark application. - Use the - mcccommand with the- -vCWoptions to create an application using Spark 3.x.- >> mcc -vCW 'Spark:myTallApp,3' deployTallArrayToSpark.m- The following files are created. - Files - Description - run_myTallApp.sh- Shell script to run application. The script invokes - spark-submitto launch the application on the cluster.- myTallApp.jar- Application JAR. The application JAR contains packaged MATLAB code and other dependencies. - readme.txt- Readme file containing details on how to run the application. - requiredMCRProducts.txt- mccExcludedFiles.log- For more information, see - mcc.
- Run the application from a Linux shell using the following command: - $ ./run_myTallApp.sh /usr/local/MATLAB/MATLAB_Runtime/v## - /usr/local/MATLAB/MATLAB_Runtime/is an argument indicating the location of the MATLAB Runtime.- <ver>
- You will see the following output: - meanArrivalDelay = 7.1201 biggestDelays = 10x5 table Year Month ArrDelay Origin Dest ____ _____ ________ ______ ____ 1995 11 1014 HNL LAX 2007 4 914 JFK DTW 2001 4 887 MCO DTW 2008 7 845 CMH ORD 1988 3 772 ORD LEX 2008 4 710 EWR RDU 1998 10 679 MCI DFW 2006 6 603 ABQ PHX 2008 6 586 PIT LGA 2007 4 568 RNO SLC
Optionally, if you want to analyze or view the results generated by your application in
                    MATLAB, you need to write the results to a file on HDFS using the write function for tall arrays. You
                can then read the file using the datastore function.
To write the results to file on HDFS, add the following line of code to your MATLAB application just before the delete(conf) statement
                and then package your application:
write('hdfs:///user/<username>/results', tall(biggestDelays));Replace <username> with your user name.
You can only save one variable to a file using the write function for tall arrays. Therefore,
you will need to write to multiple files if you want to save multiple
variables.
To view the results in MATLAB after executing the application against a Spark enabled cluster, use the datastore function as
                follows:
>> ds = datastore('hdfs:///user/<username>/results')
>> readall(ds)You may need to set the environment variable HADOOP_PREFIX using
the function setenv in case
you are unable to view the results using the datastore function.
Note
If the tall array application being deployed is a MATLAB function as opposed to a MATLAB script, use the following execution syntax:
$ ./run_<applicationName>.sh \ <MATLAB_Runtime_Location> \ [Spark arguments] \ [Application arguments]
$ ./run_myTallApp.sh \ /usr/local/MATLAB/MATLAB_Runtime/v92 \ yarn-client \ hdfs://host:54310/user/<username>/datasets/airlinesmall.csv \ hdfs://host:54310/user/<username>/result
Code:
