Jupyter Notebook is available on Cavium ThunderX. With these instructions, you can configure Jupyter Notebook to submit jobs to Spark.
At a high level, the instructions below set up port forwarding between your local workstation and your instance of Jupyter running on Cavium ThunderX, and then launch PySpark with Jupyter Notebook configured as the PySpark driver. These steps allow you to interactively run Spark jobs from a Jupyter Notebook.
Select Port to run Jupyter Webserver
Each user’s Jupyter web server needs to run on a different port on Cavium ThunderX. Randomly select a port between `8889` and `8999` for your Jupyter instance to use. The examples below use port `8889`, but you should replace that value with the port number you’ve chosen for your Jupyter instance.
If you launch `pyspark` and see an error similar to `The port 8889 is already in use, trying another port.`, then you will need to select a different random port and restart these steps from the beginning.
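As a minimal sketch, the random port selection could be done in the shell like this. The variable name `JUPYTER_PORT` is an assumption made for illustration and is not something the cluster requires; `awk` is used only because it is available on most systems.

```shell
# Pick a random port in the documented range 8889-8999 (111 possible values).
# JUPYTER_PORT is a hypothetical variable name used only in this sketch.
JUPYTER_PORT=$(awk 'BEGIN { srand(); print 8889 + int(rand() * 111) }')
echo "Selected port: ${JUPYTER_PORT}"
```

This does not check whether another user already holds the port, so the `already in use` error described above can still occur; in that case, run it again to pick a different value.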
SSH to Cavium ThunderX with Port Forwarding
From your local computer, start an SSH session with port forwarding. The example below forwards requests sent to port 8889 on your local computer to port 8889 on Cavium ThunderX. The two port numbers must match.
user@macbook /home/user $ ssh -l <UNIQNAME> -L localhost:8889:localhost:8889 cavium-thunderx.arc-ts.umich.edu
If you get an error in the terminal similar to `channel N: open failed: connect failed: Connection refused`, you can ignore this error. It is due to the Jupyter Notebook not yet running.
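Because the same port number appears twice in the forwarding option, it can help to keep it in one place. The sketch below wraps the command in a small shell function; the function name `open_jupyter_tunnel` and the variable `JUPYTER_PORT` are hypothetical names introduced here for illustration.

```shell
# Port chosen for your Jupyter instance; replace 8889 with your own value.
JUPYTER_PORT=8889

# Hypothetical helper: opens the SSH session with matching local and remote ports.
# Its single argument is your uniqname.
open_jupyter_tunnel() {
    ssh -l "$1" \
        -L "localhost:${JUPYTER_PORT}:localhost:${JUPYTER_PORT}" \
        cavium-thunderx.arc-ts.umich.edu
}
```

Run `open_jupyter_tunnel` with your uniqname as its only argument on your local computer. Defining the port once makes it harder for the two sides of the forward to drift apart.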
Configure Environment Variables
After authenticating to Cavium ThunderX, set up the environment variables below. Change the port number used with the variable `PYSPARK_DRIVER_PYTHON_OPTS` to the value of the port number you’ve chosen for your Jupyter instance.
user@cavium-thunderx-login01 ~ $ export PYSPARK_PYTHON=/sw/dsi/aarch64/centos7/python/3.7.4/bin/python3
user@cavium-thunderx-login01 ~ $ export PYSPARK_DRIVER_PYTHON=/sw/dsi/aarch64/centos7/python/3.7.4/bin/jupyter
user@cavium-thunderx-login01 ~ $ export PYSPARK_DRIVER_PYTHON_OPTS='notebook --no-browser --MappingKernelManager.kernel_info_timeout=300 --port=8889'
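If you prefer not to edit the port inside the options string by hand, the same settings can be written with the port held in a variable. This is only a sketch: `JUPYTER_PORT` is a hypothetical variable name, and the options string uses double quotes here so that the shell expands the variable.

```shell
# Port chosen for your Jupyter instance; replace 8889 with your own value.
# JUPYTER_PORT is a hypothetical name used only in this sketch.
JUPYTER_PORT=8889

# Python interpreter used by the Spark executors.
export PYSPARK_PYTHON=/sw/dsi/aarch64/centos7/python/3.7.4/bin/python3
# Make Jupyter the program that pyspark starts as its driver.
export PYSPARK_DRIVER_PYTHON=/sw/dsi/aarch64/centos7/python/3.7.4/bin/jupyter
# Double quotes (not single) so ${JUPYTER_PORT} is substituted.
export PYSPARK_DRIVER_PYTHON_OPTS="notebook --no-browser --MappingKernelManager.kernel_info_timeout=300 --port=${JUPYTER_PORT}"

# Print the result to confirm the port was substituted as expected.
echo "$PYSPARK_DRIVER_PYTHON_OPTS"
```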
Launch PySpark configured with your queue name as below. Here you can also specify other options such as `--num-executors`, `--executor-memory`, and `--executor-cores`.
After a few moments (up to 60 seconds is typical), the terminal will print the URL (http://…) for the Jupyter Notebook server. Copy and paste that URL into a web browser on your local computer.
user@cavium-thunderx-login01 ~ $ pyspark --master yarn --queue <YOUR_QUEUE>
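The optional resource flags can be added to the same command. The sketch below shows one way to combine them; the numeric values are illustrative assumptions rather than recommended settings, and the function name `launch_notebook` is hypothetical.

```shell
# Illustrative values only; size these to fit within your queue's limits.
NUM_EXECUTORS=4
EXECUTOR_MEMORY=4g
EXECUTOR_CORES=2

# Hypothetical helper: launches pyspark on YARN with explicit resource requests.
# Its single argument is your queue name.
launch_notebook() {
    pyspark --master yarn --queue "$1" \
        --num-executors "$NUM_EXECUTORS" \
        --executor-memory "$EXECUTOR_MEMORY" \
        --executor-cores "$EXECUTOR_CORES"
}
```

Call `launch_notebook` with your queue name as its only argument. With these example values the job would request 4 executors with 2 cores and 4 GB of memory each, so 8 executor cores and 16 GB of executor memory in total, before any overhead that YARN adds.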
Managing Spark Resources
Note that each Jupyter Notebook you launch starts a new Jupyter kernel, which corresponds to a new Spark job. If you run multiple notebooks at once, you may consume more resources than you intended. If the resources requested by your Jupyter Notebooks exceed your queue limits, the Notebook may fail to launch or you may receive errors such as `Timeout waiting for kernel_info_reply` and `Dead Kernel`. You can view your running jobs in the Cavium ThunderX