SparkSQL lets people query their data with a SQL-like language while taking advantage of the speed of Spark, a fast, general-purpose data processing engine that can run on Hadoop. I wanted to test it on a dataset I found from Walmart containing their stores’ weekly sales numbers. I put the CSV into our cluster’s HDFS (in /var/walmart), making it accessible to all Flux Hadoop users.
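As a rough illustration of what such a query might look like, here is a minimal PySpark sketch. It assumes Spark 2.0 or later, that the CSV has a header row, and that it contains columns named `Store` and `Weekly_Sales`; those column names, the view name `walmart_sales`, and the helper function names are all assumptions made for the example, not details taken from the dataset itself.

```python
def top_stores_query(view_name="walmart_sales", limit=10):
    """Build a SQL query that totals weekly sales per store.

    The column names Store and Weekly_Sales are assumed, not confirmed.
    """
    return (
        f"SELECT Store, SUM(Weekly_Sales) AS total_sales "
        f"FROM {view_name} "
        f"GROUP BY Store "
        f"ORDER BY total_sales DESC "
        f"LIMIT {int(limit)}"
    )


def run_top_stores(csv_path="hdfs:///var/walmart", limit=10):
    """Load the CSV from HDFS, register it as a view, and run the query."""
    # Imported here so the query builder above works without Spark installed.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("walmart-weekly-sales").getOrCreate()

    # Assumes a header row; inferSchema lets Spark guess numeric column types.
    df = spark.read.csv(csv_path, header=True, inferSchema=True)

    # A temporary view makes the DataFrame addressable by name in SQL.
    df.createOrReplaceTempView("walmart_sales")

    rows = spark.sql(top_stores_query("walmart_sales", limit)).collect()
    spark.stop()
    return rows
```

On a cluster, calling `run_top_stores()` from a script submitted with `spark-submit` would return the stores with the highest total sales, provided the assumed columns actually exist in the file.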
XSEDE and the Pittsburgh Supercomputing Center are presenting a one-day Big Data workshop focusing on topics such as Hadoop and Spark. U-M is one of several sites around the country that will host a telecast of the session. Registration is required, as space is limited.
11:25 Intro to Big Data
1:00 Lunch break
4:15 A Big Big Data Platform