Processing the CoreLogic Data on Great Lakes using PySpark
March 17 @ 10:00 am - 12:00 pm
This workshop provides an introduction to processing CoreLogic data using PySpark on the Great Lakes cluster. The CoreLogic dataset contains aggregated data from individual, parcel-level real estate transactions and financial records. U-M has licensed access to Tax, Deed, and Foreclosure data at the parcel level for every county in the United States. We will demonstrate how to request access to the dataset and how to get started quickly with processing the data and running basic analytics, using the user-friendly, browser-based Open OnDemand tool and PySpark (the Python API for Apache Spark) on the cluster.
Note that a Great Lakes account is required for this workshop. You must have an account before the workshop begins in order to participate in the exercises. See below for account request information.
Research Data Scientist Intermediate
Information and Technology Services – Advanced Research Computing
Armand Burks, Ph.D., is a research data scientist intermediate for Advanced Research Computing (ARC), a division of Information and Technology Services (ITS). Armand helps researchers with establishing data workflows, transforming data between different formats, programming support, optimizing/parallelizing code, cloud computing with Hadoop, and developing custom code (C++, Java, Python). He earned a B.S. in computer science from Alabama State University in 2008, an M.S. in computer science and engineering from Michigan State University in 2010, and a Ph.D. in computer science from Michigan State University in 2017.
Center for Political Studies, Institute for Social Research and Information and Technology Services – Advanced Research Computing
Jule Krüger, Ph.D., is the ISR Program Manager for Big Data and Data Science, based within the Center for Political Studies at the Institute for Social Research, and a member of the Advanced Research Computing Consulting Services. She has more than 10 years of experience in processing, analyzing and interpreting data for social science research, and automating workflows for scalable, auditable and reproducible analysis.
Prerequisites: Participants will need an active Great Lakes account and login in order to complete hands-on exercises. Some familiarity with PySpark is helpful.
For more information on the Great Lakes cluster, see https://arc.umich.edu/greatlakes/
To request an account, fill out the form at https://arc.umich.edu/login-request
Note: Account creation takes 3 business days.
Students should fill in “Workshop” in the “Advisor” section.
The campus VPN is required to reach Great Lakes from off campus; it is not needed on campus. An SSH client and Duo are also required during the workshop in order to use Great Lakes. If you do not already have the VPN software, please follow the instructions at https://its.umich.edu/enterprise/wifi-networks/vpn/getting-started to download and install the Cisco AnyConnect VPN client. You will need the VPN to be able to use the SSH client from off campus, and you will need to select the ‘Campus All traffic’ profile in the Cisco client.
A Zoom link will be provided to the participants the day before the class. Registration is required.
Please note, this session will be recorded.