If you’re familiar with Spark, you know that a dataframe is essentially a data structure that contains “tabular” data in memory. That is, it consists of rows and columns of data that can, for example, store the results of an SQL-style query. Dataframes can be saved into HDFS as Parquet files. Parquet files not only preserve the schema information of the dataframe, but will also compress the data when it gets written into HDFS. This means that the saved file will take up less space in HDFS and it will load faster if you read the data again later. Therefore, it is a useful storage format for data you may want to analyze multiple times.
The PySpark example below uses Reddit data, which is available to all Flux Hadoop users in HDFS at /var/reddit. This data consists of information about all posts made on the popular website Reddit, including their score, subreddit, text body, and author, all of which can make for interesting data analysis.
#First, launch the pyspark shell
pyspark --master yarn --queue <your_queue> --num-executors 35 --executor-cores 4 --executor-memory 5g

#Load the Reddit data into a dataframe
>>> reddit = sqlContext.read.json("/var/reddit/RS_2016-0*")

#Set the compression type to snappy
>>> sqlContext.setConf("spark.sql.parquet.compression.codec", "snappy")

#Write the data to a Parquet file - this example puts it into your HDFS home directory as "reddit.parquet"
>>> reddit.write.parquet("reddit.parquet")

#Create a new dataframe from the Parquet file
>>> parquetFile = sqlContext.read.parquet("reddit.parquet")

#Register the dataframe as a SQL temporary table
>>> parquetFile.registerTempTable("reddit_table")

#Query the table
#This can be any query; this one finds some of the more highly rated posts
>>> ask = sqlContext.sql("SELECT title FROM reddit_table WHERE score > 1000 AND subreddit = 'AskReddit'")

#Show the first 20 lines of the results (this actually executes the query)
>>> ask.show()

#Since we created the dataframe "ask" with the previous query, we can write it out to HDFS as a Parquet file so it can be accessed again later
>>> ask.write.parquet("ask.parquet")

#Exit the pyspark console - you'll view the contents of your Parquet file afterward
>>> exit()
To view the contents of your Parquet file, use Parquet tools, a command-line utility that helps you inspect Parquet files, for example by viewing a file's contents or its schema.
#View the output
hadoop jar /sw/dsi/dsi/noarch/parquet-tools-1.7.0.jar cat ask.parquet

#View the schema; in this case, just the "title" of the AskReddit thread
hadoop jar /sw/dsi/dsi/noarch/parquet-tools-1.7.0.jar schema ask.parquet

#Get a full list of all of the options when using Parquet tools
hadoop jar /sw/dsi/dsi/noarch/parquet-tools-1.7.0.jar -h