JupyterHub

A Practical Guide

What is JupyterHub?

Jupyter Notebook is an open source application, used by data scientists and machine learning professionals to author and present code, explanatory text, and visualizations. JupyterHub is an open source tool that lets you host a distributed Jupyter Notebook environment.

With JupyterHub, users can log in to the server and write Python code in a web browser, without having to install software on their local machine. The Jupyter Notebook and JupyterLab interfaces provided by JupyterHub are the same as the Jupyter interfaces that run locally. JupyterHub can be accessed from web browsers on desktops, tablets, and smartphones.

JupyterHub should not be confused with cloud-based services for running Jupyter Notebooks—such as Google Colab, Microsoft Azure Notebooks, and Binder. Users or organizations looking for a managed, hosted Jupyter Notebook solution can leverage one of these services. JupyterHub is a do-it-yourself solution that lets you install and manage your own Jupyter Notebook server.

This is part of our series of articles about machine learning engineering.

Do You Need JupyterHub?

What’s the Difference Between Jupyter Notebook, JupyterLab, and JupyterHub?

Jupyter Notebook provides a document specification and a graphical user interface for editing documents. Here are several aspects to know about Jupyter Notebook:

  • A Jupyter Notebook is a document in the .ipynb format, composed of narrative text, code cells, and outputs.
  • Jupyter Notebook comes with a graphical user interface, which enables you to edit .ipynb documents.

Document editing is not exclusive to the Jupyter Notebook interface. You can also use alternatives like JupyterLab, Google Colab, nteract, and Kaggle.

JupyterLab provides a user interface designed for interactive computing. Here are several aspects to know about JupyterLab:

  • JupyterLab is a user interface—designed to provide extensible and flexible interactive computing.
  • JupyterLab provides extensions—some of which are designed for Jupyter Notebooks. There are also extensions designed for specific parts of the data science pipeline.

JupyterHub provides an application designed for the management of Jupyter Notebooks. Here are several aspects to know about JupyterHub:

  • JupyterHub is an application designed to help you manage multi-user sessions of interactive computing.
  • JupyterHub provides connectivity—that enables you to connect users with the infrastructure required for their sessions.
  • JupyterHub enables remote access—to JupyterLab as well as Jupyter Notebooks. You can use this option to let multiple users gain remote access to Jupyter resources.

What Problem Does JupyterHub Solve?

JupyterHub enables collaboration by providing a shared platform for data scientists and relevant stakeholders. You can use JupyterHub to create a data science workflow and deploy it on your infrastructure. This level of flexibility enables you to use the tools of your choice, including Jupyter Notebooks and a Python stack, and control access to resources and the environment.

What are the Use Cases of JupyterHub?

There is a wide range of applications for JupyterHub. It is used by large data centers providing computing resources to data scientists, major research labs, large universities serving data science students and researchers, companies with extensive data science operations, and online communities that promote collaborative data science and machine learning.

JupyterHub is usually used to enable collaboration between small and large teams:

  • Small teams use JupyterHub to share interactive computing resources and analytics. Examples include research labs, data science teams, or any collaborative project.
  • Large teams use JupyterHub to provide multiple users with access to corporate resources like data, hardware, and analytics programs. Examples include any large group of remote users, such as departments and large classes.

JupyterHub Features and Capabilities

JupyterHub provides the following key capabilities:

  • Sets up a Jupyter Notebook or JupyterLab environment for up to tens of thousands of users—supports Kubernetes for large-scale deployments.
  • Supports many different languages, environments, and user interfaces, with a variety of Jupyter kernels developed by the community (see the list of available kernels). You can deliver one or more existing kernels to JupyterHub users, or develop your own.
  • Provides pluggable authentication, enabling flexible authentication for some or all users, using authentication protocols such as OAuth and identity providers such as GitHub.
  • Scales up by sharing the same server with multiple users, or running multiple isolated containers.
  • Can be deployed on any infrastructure, including public cloud providers, virtual machines, or locally on an on-premises laptop or server.

Related content: read our guide to machine learning infrastructure

JupyterHub Architecture

Let’s dive a bit deeper and see how JupyterHub works behind the scenes.

The Subsystems: Hub, Proxy, Single-User Notebook Server

The JupyterHub architecture is designed to supply each user in a group with their own Jupyter Notebook server. To achieve this, the architecture uses the following three main subsystems:

  • Hub—designed to manage user accounts and authentication. The hub uses a Spawner when coordinating single-user notebook servers.
  • Proxy—serves as the public-facing component. This proxy dynamically routes HTTP requests to single-user notebook servers and the Hub.
  • Single-user notebook server—an object called Spawner starts a single-user notebook when a user logs in.
Source: JupyterHub

JupyterHub Tutorial 1: Installing JupyterHub on Local Server

Prerequisites

To install JupyterHub, you need a system with the following requirements:

  • Linux or Unix with Python 3.5 or higher, Node.js, and the npm package manager.
  • A Pluggable Authentication Module (PAM) for the default authenticator (install it if it is not included in your OS distribution).
  • Jupyter Notebook 4 or higher.
  • Domain name
  • SSL/TLS certificate to enable secure communication over HTTPS

How the Subsystems Interact

To access JupyterHub from a web browser, a user can use either the domain name or the IP address of the server.

The Hub and the proxy

The Hub is responsible for handling logins and spawning single-user notebook servers. When JupyterHub starts, the Hub launches the proxy based on the JupyterHub configuration. By default, the proxy forwards all requests to the Hub. Once a user’s server is running, the Hub configures the proxy to forward that user’s URL prefix to their single-user server. Only the proxy is allowed to listen on a public interface.
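
The routing between the proxy, the Hub, and single-user servers can be pictured with a toy routing table. This is a minimal sketch and not the actual proxy implementation: the RoutingTable class, its method names, the example addresses, and the longest-prefix matching rule are all assumptions made for illustration.

```python
class RoutingTable:
    """Toy model of a proxy routing table (illustrative only)."""

    def __init__(self, hub_target: str) -> None:
        # Every request goes to the Hub until a more specific route exists.
        self.routes = {"/": hub_target}

    def add_route(self, prefix: str, target: str) -> None:
        # The Hub would register a route like this after spawning a user's server.
        self.routes[prefix] = target

    def resolve(self, path: str) -> str:
        # Pick the longest registered prefix that matches the request path.
        matches = [p for p in self.routes if path.startswith(p)]
        return self.routes[max(matches, key=len)]


# Hypothetical addresses for the Hub and for one user's server.
table = RoutingTable(hub_target="http://127.0.0.1:8081")
print(table.resolve("/hub/login"))        # handled by the Hub
table.add_route("/user/alice/", "http://127.0.0.1:50001")
print(table.resolve("/user/alice/lab"))   # forwarded to alice's server
print(table.resolve("/user/bob/lab"))     # no server for bob yet, so the Hub
```

The point of the sketch is that the proxy itself holds no login logic: it only forwards requests according to routes that the Hub adds and removes.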

Types of authenticators

There are several authenticators available for controlling access to JupyterHub. PAM is the default authenticator. It uses the user accounts that exist on the server running JupyterHub, so PAM requires creating a system account for each user. Other authenticators can enable users to log in using single sign-on.
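
Authenticators are selected in the JupyterHub configuration file. The fragment below is a minimal sketch of a jupyterhub_config.py, assuming a reasonably recent JupyterHub release in which the allowed_users option is available; the usernames are hypothetical. It is only meaningful when loaded by JupyterHub, which provides get_config.

```python
# jupyterhub_config.py (fragment)
c = get_config()  # noqa: F821 (provided by JupyterHub when it loads this file)

# PAM is the default; naming it explicitly just makes the choice visible.
c.JupyterHub.authenticator_class = "jupyterhub.auth.PAMAuthenticator"

# Hypothetical usernames: only these local accounts may log in.
c.Authenticator.allowed_users = {"alice", "bob"}

# Hypothetical admin account with access to the Hub's admin page.
c.Authenticator.admin_users = {"alice"}
```

Switching to a single sign-on authenticator would typically mean installing a separate package and changing authenticator_class, while the rest of the Hub configuration stays the same.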

Spawners

A spawner creates a notebook server for each user, and defines how that notebook server will be configured. By default, the spawner starts the server as a local process on the machine running JupyterHub, under the user’s system username. Alternatively, you can start the notebook server within a separate container, for example by using Docker (through DockerSpawner) or Kubernetes (through KubeSpawner).
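
Like authenticators, spawners are chosen in the configuration file. The following fragment is a sketch that assumes the separate dockerspawner package is installed and that Docker is available on the host; the image name is an assumed example, not a recommendation from this guide.

```python
# jupyterhub_config.py (fragment)
c = get_config()  # noqa: F821 (provided by JupyterHub when it loads this file)

# Start each user's server in its own Docker container
# instead of as a local process.
c.JupyterHub.spawner_class = "dockerspawner.DockerSpawner"

# Assumed example image; use one that contains jupyterhub-singleuser.
c.DockerSpawner.image = "jupyter/base-notebook"
```

A container-based setup usually needs additional network settings so that containers can reach the Hub; those depend on the deployment and are left out of this sketch.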

Installation

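Install JupyterHub using either pip or conda. With pip, run python3 -m pip install jupyterhub followed by npm install -g configurable-http-proxy. With conda, run conda install -c conda-forge jupyterhub. To test your installation, run jupyterhub -h and configurable-http-proxy -h and check that both print their help text.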
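
A quick way to confirm the installation from a script is to check that the required executables are on the PATH. This is a minimal sketch; the find_executables helper is hypothetical, and the second executable name assumes the default Node.js-based proxy, configurable-http-proxy.

```python
import shutil


def find_executables(names):
    """Map each executable name to its full path, or None if it is not on PATH."""
    return {name: shutil.which(name) for name in names}


# Executables a default JupyterHub install is expected to provide.
required = ["jupyterhub", "configurable-http-proxy"]
for name, path in find_executables(required).items():
    print(f"{name}: {path or 'NOT FOUND'}")
```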

Start the Hub Server

Run the following command to start the Hub Server: jupyterhub

Visit the address http://localhost:8000 in your local browser (use https:// if you have configured an SSL/TLS certificate) and log in with your UNIX credentials.

Note that if you want to allow multiple users to log in to the Hub Server, you need to start JupyterHub with root privileges, as follows: sudo jupyterhub

JupyterHub Tutorial 2: Deploying Using Kubernetes

For larger deployments, you can deploy JupyterHub via Kubernetes, the popular container orchestrator. The instructions and code below are abbreviated from the full Zero to JupyterHub Kubernetes tutorial.

Related content: read our guide to Kubernetes architecture for machine learning

Prepare Configuration File

Start by preparing a configuration file called config.yaml. This includes several values used to configure the JupyterHub Helm chart. You can use this chart to deploy a working version of JupyterHub to Kubernetes.

You can keep the Helm values at their defaults, but one value is mandatory: secretToken, which is used as your security token. Generate a random 32-byte hex string (for example, with openssl rand -hex 32) and set it in the configuration as follows:

proxy:
  secretToken: "<RANDOM_HEX>"
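
If you prefer to generate the token from Python, the standard library can produce a suitable value. A minimal sketch:

```python
import secrets

# 32 random bytes, rendered as 64 hexadecimal characters.
token = secrets.token_hex(32)
print(token)
```

Paste the printed value between the quotes of secretToken, and treat it like a password.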

Update Repo

Add the JupyterHub Helm chart repository to Helm, so you can install the chart without using long URLs. Run helm repo add jupyterhub https://hub.jupyter.org/helm-chart/ and then helm repo update.

Install Helm Chart

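Install the JupyterHub Helm chart, configured by the values in config.yaml. Run the helm upgrade command from the directory containing the configuration file: helm upgrade --cleanup-on-fail --install RELEASE jupyterhub/jupyterhub --namespace NAMESPACE --create-namespace --values config.yaml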

Within the command, specify two values: RELEASE, a Helm release name used to distinguish between chart installations, and NAMESPACE, the Kubernetes namespace used to group the Kubernetes resources associated with JupyterHub. The official tutorial recommends using the value jhub for both.
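
If you script your deployments, the helm upgrade invocation can also be assembled programmatically. The sketch below only builds and prints the command. It assumes Helm 3 and that the chart repository was added under the name jupyterhub; the helm_upgrade_command helper is hypothetical.

```python
import shlex


def helm_upgrade_command(release, namespace, values_file="config.yaml"):
    """Build the argument list for installing or upgrading the JupyterHub chart."""
    return [
        "helm", "upgrade", "--cleanup-on-fail",
        "--install", release, "jupyterhub/jupyterhub",
        "--namespace", namespace,
        "--create-namespace",
        "--values", values_file,
    ]


cmd = helm_upgrade_command(release="jhub", namespace="jhub")
print(shlex.join(cmd))
# To actually run it: import subprocess; subprocess.run(cmd, check=True)
```

Keeping the arguments in a list avoids shell quoting problems if a release or namespace name ever contains unusual characters.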

Wait for JupyterHub to Deploy and Access It

Wait for the hub and proxy pods to reach Running state. You can check their state by running the command kubectl get pod --namespace jhub
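
To automate this check, you could parse the JSON form of the same listing (kubectl get pod --namespace jhub -o json). The sketch below works on a hand-written sample in the shape of that output rather than on real cluster data; the pods_not_running helper and the pod names are hypothetical.

```python
import json


def pods_not_running(pod_list_json: str) -> list:
    """Return the names of pods whose phase is not 'Running'."""
    pods = json.loads(pod_list_json)["items"]
    return [
        pod["metadata"]["name"]
        for pod in pods
        if pod["status"].get("phase") != "Running"
    ]


# Hand-written sample, not real kubectl output.
sample = json.dumps({"items": [
    {"metadata": {"name": "hub-example"}, "status": {"phase": "Running"}},
    {"metadata": {"name": "proxy-example"}, "status": {"phase": "Pending"}},
]})
print(pods_not_running(sample))  # ['proxy-example']
```

An empty list means every pod in the namespace has reached the Running phase.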

Once they are running, run the command kubectl get service --namespace jhub and look for the external IP defined for the proxy-public load balancer.

To use JupyterHub, enter the external IP into your browser. JupyterHub initially runs with a default dummy authenticator, so you can use any username and password to access it.

That’s it! You just ran JupyterHub on Kubernetes using the Zero to JupyterHub Helm chart.

Configuring User Environments in JupyterHub

JupyterHub is used to provision a Jupyter Notebook environment to multiple users. In most cases, you will want to customize the Jupyter Notebook user experience. Here are a few ways to achieve this.

Distributing Additional Packages

In many cases, users will need to use additional packages together with Jupyter Notebook. To make these packages available, you’ll typically install them system-wide or in a shared environment.

Make sure that the installation location of any additional packages is the same as the location of jupyterhub-singleuser. This location should be readable and executable by the users. If you want to enable users to install their own packages, make the location writable as well.
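
One way to check these permissions is to run a small script as the user in question. This is a sketch under assumptions: the access_report helper is hypothetical, and the script falls back to the current interpreter's directory when jupyterhub-singleuser is not found on the PATH.

```python
import os
import shutil
import sys


def access_report(path):
    """Report whether the current user can read, execute, and write a path."""
    return {
        "readable": os.access(path, os.R_OK),
        "executable": os.access(path, os.X_OK),
        "writable": os.access(path, os.W_OK),
    }


# Fall back to the current interpreter if jupyterhub-singleuser is not installed here.
exe = shutil.which("jupyterhub-singleuser") or sys.executable
env_bin = os.path.dirname(exe)
print(env_bin, access_report(env_bin))
```

For a read-only shared environment you would expect readable and executable to be true and writable to be false.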

Configuring Jupyter and IPython for Users

JupyterHub admins generally need to install and configure the same environment for all JupyterHub users.

Both Jupyter and IPython support “system wide” configuration, which lets you define configuration in one place for all users. It is a best practice to only use system-wide configuration, and avoid placing configuration files in user home directories.

In most cases, the system-wide configuration is located in the /etc/{jupyter|ipython} folder. Environment-wide configuration is located in {sys.prefix}/etc/{jupyter|ipython}.
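
The {sys.prefix} placeholder refers to the root of the Python environment in use. The sketch below expands both locations for the current interpreter, assuming a Linux or Unix host; the config_dirs helper is hypothetical, and running jupyter --paths shows the full search path that Jupyter actually uses.

```python
import os
import sys


def config_dirs(app):
    """Return the system-wide and environment-wide config directories for an app."""
    return {
        "system": os.path.join("/etc", app),
        "environment": os.path.join(sys.prefix, "etc", app),
    }


for app in ("jupyter", "ipython"):
    print(app, config_dirs(app))
```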

For example, to enable a specific Jupyter Notebook configuration setting for all users, set it in the system-wide /etc/jupyter/jupyter_notebook_config.py file.
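
One setting that could be applied this way is culling of idle servers and kernels. The fragment below is a sketch that assumes the classic Notebook server (newer Jupyter Server releases use ServerApp in place of NotebookApp); the timeout values are illustrative. It is only meaningful when loaded by Jupyter, which provides get_config.

```python
# /etc/jupyter/jupyter_notebook_config.py (fragment)
c = get_config()  # noqa: F821 (provided by Jupyter when it loads this file)

# Shut down the whole server after an hour without activity.
c.NotebookApp.shutdown_no_activity_timeout = 60 * 60

# Shut down individual kernels after 20 minutes without activity.
c.MappingKernelManager.cull_idle_timeout = 20 * 60

# Check for idle kernels every two minutes.
c.MappingKernelManager.cull_interval = 2 * 60
```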

Note that system-wide configuration can be slightly different depending on how you deploy user environments—a shared system with multi-user hosts, or a container-based system with an isolated environment for each user.

Named Servers

By default, JupyterHub runs one server per user. However, if necessary, you can enable multiple servers per user. This is useful in deployments where users are allowed to start servers by requesting resources in a cloud or HPC environment.

You can let users run multiple Jupyter servers simultaneously by adding the following setting to the JupyterHub configuration file:

c.JupyterHub.allow_named_servers = True

Scaling Machine Learning Infrastructure with Run:ai

When running JupyterHub to serve a large group of data science users, you also need to maintain a machine learning infrastructure, enabling them to run experiments in an efficient and timely manner.

Run:ai automates resource management and orchestration for machine learning infrastructure. With Run:ai, you can automatically run as many compute intensive experiments as needed.

Here are some of the capabilities you gain when using Run:ai:

  • Advanced visibility—create an efficient pipeline of resource sharing by pooling GPU compute resources.
  • No more bottlenecks—you can set up guaranteed quotas of GPU resources, to avoid bottlenecks and optimize billing.
  • A higher level of control—Run:ai enables you to dynamically change resource allocation, ensuring each job gets the resources it needs at any given time.

Run:ai simplifies machine learning infrastructure pipelines, helping data scientists accelerate their productivity and the quality of their models.

Learn more about the Run:ai GPU virtualization platform.