Jupyter on Cavium

Jupyter

Jupyter Notebook is available on Cavium ThunderX. With these instructions, you can configure Jupyter Notebook to submit jobs to Spark.

At a high level, the instructions below set up port forwarding between your local workstation and your instance of Jupyter running on Cavium ThunderX, and then launch PySpark with Jupyter Notebook configured as the PySpark driver. These steps allow you to interactively run Spark jobs from a Jupyter Notebook.

Select Port to run Jupyter Webserver

Each user’s Jupyter web server needs to run on a different port on Cavium ThunderX. Randomly select a port between `8889` and `8999` for your Jupyter instance to use. The examples below use port `8889`, but you should replace that value with the port number you’ve chosen for your Jupyter instance.

If you launch `pyspark` and see an error similar to `The port 8889 is already in use, trying another port.`, then you will need to select a different random port and restart these steps from the beginning.
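One way to pick the port is sketched below: draw a random number from the suggested range and retry if something on the same machine is already listening on it. This is a minimal sketch, assuming the `shuf` and `ss` utilities are available where you run it; the function names `pick_port` and `port_in_use` and the variable `JUPYTER_PORT` are hypothetical. The check only sees the machine it runs on, so it does not guarantee the port is free on the other end of the tunnel.

```shell
# Draw one random port from the suggested range (8889-8999 inclusive).
pick_port() {
    shuf -i 8889-8999 -n 1
}

# Succeed if a TCP listener on this machine is already bound to port $1.
# Column 4 of `ss -ltn` output is the local address:port.
port_in_use() {
    ss -ltn 2>/dev/null | awk '{print $4}' | grep -q ":$1\$"
}

JUPYTER_PORT=$(pick_port)
while port_in_use "$JUPYTER_PORT"; do
    JUPYTER_PORT=$(pick_port)
done

echo "Using port $JUPYTER_PORT"
```

Use the printed value wherever the examples in these instructions show `8889`.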

SSH to Cavium ThunderX with Port Forwarding

From your local computer, start an SSH session with port forwarding. The example below forwards requests made to port 8889 on your local computer to port 8889 on Cavium ThunderX. These two port numbers must match.

user@macbook /home/user $ ssh -l <UNIQNAME> -L localhost:8889:localhost:8889 cavium-thunderx.arc-ts.umich.edu

If you get an error in the terminal similar to `channel N: open failed: connect failed: Connection refused`, you can ignore this error. It is due to the Jupyter Notebook not yet running.
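Because the local and remote port numbers must match, it can help to generate the command from a single port value. The helper below is a hypothetical sketch (the name `tunnel_command` is not part of any tool); it only prints the `ssh` command for a given uniqname and port so you can review it before running it from your local computer.

```shell
# Print the ssh command for a uniqname ($1) and port ($2),
# using the same port on both ends of the tunnel.
tunnel_command() {
    uniqname=$1
    port=$2
    echo "ssh -l ${uniqname} -L localhost:${port}:localhost:${port} cavium-thunderx.arc-ts.umich.edu"
}

# Example: uniqname "jdoe" (made up) and port 8901.
tunnel_command jdoe 8901
```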

Configure Environment Variables

After authenticating to Cavium ThunderX, set up the environment variables below. Change the port number used with the variable `PYSPARK_DRIVER_PYTHON_OPTS` to the value of the port number you’ve chosen for your Jupyter instance.

user@cavium-thunderx-login01 ~ $ export PYSPARK_PYTHON=/sw/dsi/aarch64/centos7/python/3.7.4/bin/python3
user@cavium-thunderx-login01 ~ $ export PYSPARK_DRIVER_PYTHON=/sw/dsi/aarch64/centos7/python/3.7.4/bin/jupyter
user@cavium-thunderx-login01 ~ $ export PYSPARK_DRIVER_PYTHON_OPTS='notebook --no-browser --MappingKernelManager.kernel_info_timeout=300 --port=8889'
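The same three variables can be set from a single port value so the number only needs changing in one place. This is a sketch of that idea, not a required form; `JUPYTER_PORT` and `PYTHON_HOME` are hypothetical helper variables introduced here, while the paths and Jupyter options are the ones shown above.

```shell
JUPYTER_PORT=8889   # replace with the port you chose

# Python installation used in the commands above.
PYTHON_HOME=/sw/dsi/aarch64/centos7/python/3.7.4

export PYSPARK_PYTHON="${PYTHON_HOME}/bin/python3"
export PYSPARK_DRIVER_PYTHON="${PYTHON_HOME}/bin/jupyter"
export PYSPARK_DRIVER_PYTHON_OPTS="notebook --no-browser --MappingKernelManager.kernel_info_timeout=300 --port=${JUPYTER_PORT}"

# Show the resulting options so the port can be double-checked.
echo "$PYSPARK_DRIVER_PYTHON_OPTS"
```

Note the double quotes around the last value: single quotes would prevent `${JUPYTER_PORT}` from being expanded.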

Start PySpark

Launch PySpark configured with your queue name as below. Here you can also specify other options such as `--num-executors`, `--executor-memory`, and `--executor-cores`.
After a few moments (up to 60 seconds is typical), the terminal will print the URL (http://…) for the Jupyter Notebook server. Copy and paste that URL into a web browser on your local computer.

user@cavium-thunderx-login01 ~ $ pyspark --master yarn --queue <YOUR_QUEUE>
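As a sketch of how the resource options fit onto the same command line, the snippet below assembles the launch command from a few variables and prints it for review. The flags `--num-executors`, `--executor-memory`, and `--executor-cores` are standard Spark-on-YARN options; the values shown (4 executors, 4g, 2 cores) are placeholders for illustration, not recommendations for Cavium ThunderX, and the variable names are hypothetical.

```shell
QUEUE="<YOUR_QUEUE>"      # replace with your queue name
NUM_EXECUTORS=4           # illustrative value
EXECUTOR_MEMORY=4g        # illustrative value
EXECUTOR_CORES=2          # illustrative value

PYSPARK_CMD="pyspark --master yarn --queue ${QUEUE} --num-executors ${NUM_EXECUTORS} --executor-memory ${EXECUTOR_MEMORY} --executor-cores ${EXECUTOR_CORES}"

# Print the command so it can be checked; paste it at the prompt to launch.
echo "$PYSPARK_CMD"
```

Requesting fewer or smaller executors leaves more room in the queue if you expect to open several notebooks.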

Managing Spark Resources

Note that each Jupyter Notebook you launch will start a new Jupyter kernel, which corresponds to a new Spark job. Be aware that you may be consuming more resources than you intended if you run multiple notebooks at once. If the resources requested by your Jupyter Notebooks exceed your queue limits, the Notebook may fail to launch or you may receive errors such as `Timeout waiting for kernel_info_reply` and `Dead Kernel`. You can view your running jobs in the Cavium ThunderX [Yarn UI](http://cavium-rm01.arc-ts.umich.edu:8088/cluster/scheduler).
