SparkR lets users combine the ease of data analysis in R with the speed and capacity of Spark on our Hadoop cluster. Those familiar with R should have no trouble using this feature. After opening a SparkR session, simply begin typing your program in R.
To open a SparkR session, run:
sparkR --master yarn --queue <your_queue> --num-executors 4 --executor-memory 1g --executor-cores 4
The following example gives a feel for how SparkR works. It is taken from the official SparkR documentation, which contains further examples.
families <- c("gaussian", "poisson")
train <- function(family) {
  model <- glm(Sepal.Length ~ Sepal.Width + Species, iris, family = family)
  summary(model)
}

# Return a list of the models' summaries
model.summaries <- spark.lapply(families, train)

# Print the summary of each model
print(model.summaries)
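Note that spark.lapply in the example above distributes an ordinary local R function across the cluster's executors. SparkR can also operate on distributed SparkDataFrames directly. The following is a minimal sketch of that workflow, using the same built-in iris dataset; it assumes you are inside the SparkR session started above, where a Spark session is already active.

```r
# Convert a local R data.frame into a distributed SparkDataFrame
df <- createDataFrame(iris)

# Standard SparkDataFrame operations: filter rows and select columns.
# Computation runs on the cluster, not in your local R process.
setosa <- select(filter(df, df$Species == "setosa"),
                 "Sepal_Length", "Sepal_Width")

# Bring a small preview back to the local R session
head(setosa)
```

Keep in mind that head() and collect() pull results back to your local R session, so only use them on data small enough to fit in local memory.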