SparkR

SparkR lets you analyze data with familiar R syntax while using the speed and capacity of Spark on our Hadoop cluster. If you already know R, it should be easy to pick up: once a SparkR session is open, type your R program at the prompt.

To open a SparkR session, run:

sparkR --master yarn --queue <your_queue> --num-executors 4 --executor-memory 1g --executor-cores 4
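
If you would rather start the session from an R script than from the sparkR shell, the same settings can be passed to sparkR.session(). The following is a minimal sketch, not a tested recipe for our cluster: it assumes the SparkR package is on your R library path and that SPARK_HOME points at the cluster's Spark installation. The application name is a made-up placeholder.

```r
# Assumption: SparkR is installed where library() can find it
library(SparkR)

# Equivalent of the command-line flags above, expressed as Spark properties
sparkR.session(
  master = "yarn",
  appName = "sparkr-example",              # hypothetical name, choose your own
  sparkConfig = list(
    spark.yarn.queue = "<your_queue>",     # --queue
    spark.executor.instances = "4",        # --num-executors
    spark.executor.memory = "1g",          # --executor-memory
    spark.executor.cores = "4"             # --executor-cores
  )
)

# ... your R program goes here ...

# Release the cluster resources when you are done
sparkR.session.stop()
```

Replace <your_queue> with your own queue name, exactly as in the command above.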

The following example gives a feel for how SparkR works. It fits one generalized linear model per family on the built-in iris data set and uses spark.lapply to distribute the fits. It is taken from the official SparkR documentation, which contains further examples.

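# Model families to fit
families <- c("gaussian", "poisson")

# Fit one GLM on the built-in iris data set and return its summary
train <- function(family) {
  model <- glm(Sepal.Length ~ Sepal.Width + Species, iris, family = family)
  summary(model)
}

# Run train() once per family on the cluster; returns a list of model summaries
model.summaries <- spark.lapply(families, train)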

# Print the summary of each model
print(model.summaries)
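
Because spark.lapply is used here like base R's lapply, it can help to check the function locally before sending it to the cluster. The sketch below is plain R and does not need a Spark session. The name local.summaries is introduced only for illustration.

```r
# Local check in plain R; no Spark session is required for this part
families <- c("gaussian", "poisson")

train <- function(family) {
  model <- glm(Sepal.Length ~ Sepal.Width + Species, iris, family = family)
  summary(model)
}

# lapply runs the fits one after another in the current R process
local.summaries <- lapply(families, train)

# One summary per family, in the same order as `families`
length(local.summaries)
class(local.summaries[[1]])
```

The poisson fit may emit warnings, since Sepal.Length is not a count variable. If train works locally, replacing lapply with spark.lapply should give the same kind of list, with each fit run on the cluster instead.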