Securing your AI/ML Kubernetes Environment

Background

This guide connects Kubernetes security practices to the task of securing your AI environment.

It is a high-level guide, intended for anyone who wants to expand their Kubernetes security knowledge in the AI world.

In this article, you will learn:

Why is Securing Kubernetes so challenging?
The 4Cs security concept
A day in the life of a data scientist
Risk
Mitigation
Best Practices
Conclusion

Cartoon caption: I'm pretty sure the application is somewhere around here. Sitting on top of Load Balancer, Ingress, Kube-proxy, Service Mesh, Side Car, Application

Why is Securing Kubernetes so challenging?

Artificial Intelligence (AI) has taken the world by storm, revolutionizing industries with its ability to create original and captivating content. As the technology gains popularity, protecting the underlying AI infrastructure becomes increasingly crucial, because AI models are now the main intellectual property (IP) of many companies. Many of these companies are moving to Kubernetes for its flexibility and for cloud-native practices: building and running scalable applications in modern, dynamic environments such as public, private, and hybrid clouds.

There are several reasons why Kubernetes security is complex:

Distributed architecture: A production Kubernetes environment includes many moving parts, such as multiple servers, network switches, storage, and operating systems, all of which increase the attack surface.
Resource abstraction: Kubernetes abstracts the underlying infrastructure. While this flexibility is advantageous, it also introduces security risks. Misconfigurations or vulnerabilities at the resource level can be propagated across the entire cluster, impacting the security of applications running within it.
Container Security: Kubernetes relies heavily on containers, so ensuring the security of container images and runtime environments is crucial.
Third-party Integrations: Kubernetes supports various third-party integrations, such as network plugins, monitoring tools, and storage solutions. Each integration introduces its own security considerations.
Open Source Layers: Choosing among open source projects is complex, from the OS to the container runtime and the Kubernetes version, and once in production, protecting all of those elements is hard.

The 4Cs security concept

Cloud-native security is about implementing security measures specific to cloud-native environments. The "4C" model organizes these measures into the layers described below, which together protect infrastructure, containers, microservices, and data. By securing each layer, organizations can protect their cloud-native applications and infrastructure against unauthorized access and data breaches.

Diagram: Code, Container, Cluster, Cloud and Kubernetes

Code Security: Emphasizes secure software development practices, including secure coding, software supply chain integrity, and integrating security throughout the development lifecycle.

Container Security: Focuses on securing container images, runtime environments, and configurations to prevent vulnerabilities and unauthorized access.

Cluster Security: Involves securing the underlying infrastructure and orchestration platforms, such as Kubernetes, through strong authentication, network policies, node security, and effective logging and monitoring. On-prem systems should also follow operating system security best practices.

Compliance and Governance: Addresses regulatory and organizational requirements by defining and enforcing compliance policies, protecting data, establishing incident response plans, and ensuring continuous compliance monitoring.

A day in the life of a data scientist

The AI development lifecycle starts with data preparation and with building and debugging models in a workspace environment. Once the code and dataset are ready, the model is trained using batch jobs, which are also used to tune hyperparameters. Finally, the trained model moves to production, where it runs inference on new data.

How does it look from the data scientist's perspective?

A day in the life of a data scientist can include launching a workload with configurations like:

IDE tools like Jupyter Notebook, PyCharm, or VS Code
Resource request (CPU/Memory/GPU)
Dataset located on fast storage
Code located in a Git repository or on shared storage
Object storage volumes for results, outputs, and checkpoints
Experiment tracking or visualization tools like Weights & Biases, Comet.ml, or TensorBoard

Diagram: POD, Container, and Image - Image with Jupyter Notebook

How does it translate to Kubernetes?

In this example, a pod runs on the Kubernetes cluster, providing either an interactive workspace environment for the user or a batch training session that runs to completion. In the diagram below, the orange box represents the pod, which is what we want to protect. As you can see, the pod lives inside Kubernetes with many dependencies and connections to storage, networking, runtime, and hardware resources.

How does it look from the Kubernetes perspective?

The data scientist sends a request to Kubernetes (“This is what I want!”), including the configuration and parameters listed above
Kubernetes pulls the image from a registry, such as NVIDIA GPU Cloud (NGC) or Hugging Face
Once the image is pulled and the container is running, a git command can pull the latest code
Fast storage is mounted to allow streaming and processing of the dataset
Port forwarding is optionally set up so the user can connect to the environment with an IDE tool
Results and model checkpoints are written to mounted storage
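
To make the request concrete, here is a minimal sketch of what such a workload could look like as a Kubernetes Pod manifest, built as a plain Python dictionary and printed as JSON (which kubectl accepts alongside YAML). The image name, claim names, mount paths, and port are hypothetical placeholders, and the GPU resource name assumes the NVIDIA device plugin is installed on the cluster.

```python
import json


def build_workspace_pod(name, image, cpu, memory, gpus, dataset_claim, results_claim):
    """Return a Pod manifest (as a dict) for an interactive workspace."""
    # Resource request (CPU/Memory/GPU). The GPU resource name assumes
    # the NVIDIA device plugin is running on the cluster.
    resources = {"cpu": cpu, "memory": memory, "nvidia.com/gpu": str(gpus)}
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": name, "labels": {"app": name}},
        "spec": {
            "restartPolicy": "Never",
            "containers": [
                {
                    "name": "workspace",
                    "image": image,
                    # Limits are set equal to requests in this sketch.
                    "resources": {
                        "requests": dict(resources),
                        "limits": dict(resources),
                    },
                    # Port the IDE tool listens on (8888 is assumed here).
                    "ports": [{"containerPort": 8888}],
                    "volumeMounts": [
                        {"name": "dataset", "mountPath": "/data", "readOnly": True},
                        {"name": "results", "mountPath": "/results"},
                    ],
                }
            ],
            "volumes": [
                {"name": "dataset",
                 "persistentVolumeClaim": {"claimName": dataset_claim}},
                {"name": "results",
                 "persistentVolumeClaim": {"claimName": results_claim}},
            ],
        },
    }


pod = build_workspace_pod(
    name="ds-workspace",
    image="registry.example.com/team/jupyter:1.0",  # hypothetical image
    cpu="4",
    memory="16Gi",
    gpus=1,
    dataset_claim="fast-dataset-pvc",  # hypothetical claim names
    results_claim="results-pvc",
)
print(json.dumps(pod, indent=2))
```

Note that this manifest contains no security settings at all, which is exactly the situation the next section examines.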

Diagram: POD is running on Kubernetes cluster

Risk

Running the example above without any security measures is very risky. The Kubernetes administrator's goal is to control and protect the cluster from risks such as users accessing data they are not allowed to see and AI models leaking out of the company.

More generally, and as illustrated in the diagram below, the following risks can potentially be introduced to an unprotected AI cluster:

Without an authentication method, the system is vulnerable to unauthorized access
Access to unlimited resources can introduce an attack vector
Containers running in privileged mode can increase the risk of system compromise
Downloading unprotected and unscanned containers can introduce malware
Pulling code from unprotected Git repos may expose the system to malicious code
Containers running as root can gain control of the company's resources and data

In general, if security measures are not enforced, users can gain full access to the whole cluster and compromise the company's AI/ML models.
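
Some of these risks can be spotted directly in a Pod manifest. The following is a minimal sketch, not a replacement for a real policy engine or scanner: a hypothetical `audit_pod` helper that flags privileged containers, containers that may run as root, and containers without resource limits. It covers only those three risks and assumes the manifest is available as a Python dictionary.

```python
def audit_pod(pod):
    """Return a list of findings for a Pod manifest given as a dict."""
    findings = []
    spec = pod.get("spec", {})
    pod_ctx = spec.get("securityContext", {})

    for container in spec.get("containers", []):
        name = container.get("name", "<unnamed>")
        ctx = container.get("securityContext", {})

        # Privileged containers get broad access to the host.
        if ctx.get("privileged", False):
            findings.append(f"{name}: runs in privileged mode")

        # Container-level settings take precedence over pod-level ones.
        run_as_user = ctx.get("runAsUser", pod_ctx.get("runAsUser"))
        non_root = ctx.get("runAsNonRoot", pod_ctx.get("runAsNonRoot", False))
        if run_as_user == 0 or (run_as_user is None and not non_root):
            findings.append(f"{name}: may run as root")

        # Without limits, one workload can consume a whole node.
        if not container.get("resources", {}).get("limits"):
            findings.append(f"{name}: no resource limits")

    return findings


risky_pod = {
    "spec": {
        "containers": [
            {
                "name": "train",
                "image": "registry.example.com/team/train:1.0",  # hypothetical
                "securityContext": {"privileged": True},
            }
        ]
    }
}

for finding in audit_pod(risky_pod):
    print(finding)
```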

Mitigation

The following methods are therefore crucial for mitigating risks:

Authentication: Validating that the user actually belongs to the organization, using authentication tools like SSO and Active Directory
Authorization: Controlling what the user can access after login
Monitoring: Logging and auditing all system activities for tracking and analysis

While ensuring the following measures are taken:

Limit access to resources and data
Prevent data scientists from running containers in privileged mode or as root
Limit access to repos and require user/password authentication
Scan Docker images for vulnerabilities
Limit access to Git repos and protect “git pull” with user/password credentials stored as secrets
Verify that the pod can see only specific files and directories, usually through UID/GID permissions
Limit access to S3 buckets by using secrets created in Kubernetes and secured with an IAM role
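
As an illustration of storing Git credentials as a Kubernetes secret, the sketch below builds a Secret manifest of type `kubernetes.io/basic-auth` and the environment entries a container could use to read it. The secret name, namespace, credentials, and the `GIT_USERNAME` / `GIT_PASSWORD` variable names are all hypothetical; Git does not read these variables on its own, so a credential helper or startup script would have to consume them.

```python
import base64
import json


def build_git_secret(name, namespace, username, password):
    """Return a basic-auth Secret manifest holding Git credentials."""

    def b64(value):
        return base64.b64encode(value.encode("utf-8")).decode("ascii")

    return {
        "apiVersion": "v1",
        "kind": "Secret",
        "metadata": {"name": name, "namespace": namespace},
        "type": "kubernetes.io/basic-auth",
        # Values under "data" must be base64-encoded.
        # This is an encoding, not encryption.
        "data": {"username": b64(username), "password": b64(password)},
    }


def git_credentials_env(secret_name):
    """Return container env entries that read the secret's keys."""
    return [
        {"name": "GIT_USERNAME",  # hypothetical variable name
         "valueFrom": {"secretKeyRef": {"name": secret_name, "key": "username"}}},
        {"name": "GIT_PASSWORD",  # hypothetical variable name
         "valueFrom": {"secretKeyRef": {"name": secret_name, "key": "password"}}},
    ]


secret = build_git_secret("git-credentials", "data-science", "alice", "example-token")
print(json.dumps(secret, indent=2))
print(json.dumps(git_credentials_env("git-credentials"), indent=2))
```

Because base64 is only an encoding, access to the secret itself still has to be restricted, for example with the RBAC rules discussed under Best Practices.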

Best Practices

The mitigation strategies above protect the Kubernetes cluster and its associated resources. They include implementing authentication methods, limiting resource access, avoiding privileged mode, protecting Git repositories with authentication and user/password secrets, and enforcing specific file and directory permissions for pods. More advanced practices are listed below.

Implement robust authentication: Enforce strong password policies, employ multi-factor authentication, and regularly update credentials to prevent unauthorized access.

Employ Role-Based Access Control (RBAC): Kubernetes RBAC enables fine-grained access control, granting appropriate privileges to users or service accounts. By defining specific roles and associated permissions, you can restrict access to critical resources, preventing unauthorized modification or exposure of sensitive data. Implementing RBAC ensures that only authorized entities have access to your pods and clusters.
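
A minimal sketch of what this could look like: a namespaced Role that lets a data scientist manage pods, read logs, and open port forwards, plus a RoleBinding that grants it to one user. The namespace, role name, and user name are hypothetical. The `allows` helper is a deliberately simplified illustration (exact matches only, no wildcards or resource names) and is not how the API server evaluates rules.

```python
import json

RBAC_GROUP = "rbac.authorization.k8s.io"


def build_role(namespace, name):
    """Return a Role limited to pod-related actions in one namespace."""
    return {
        "apiVersion": f"{RBAC_GROUP}/v1",
        "kind": "Role",
        "metadata": {"namespace": namespace, "name": name},
        "rules": [
            {"apiGroups": [""], "resources": ["pods"],
             "verbs": ["get", "list", "watch", "create", "delete"]},
            {"apiGroups": [""], "resources": ["pods/log"], "verbs": ["get"]},
            {"apiGroups": [""], "resources": ["pods/portforward"],
             "verbs": ["create"]},
        ],
    }


def build_role_binding(namespace, role_name, user):
    """Return a RoleBinding that grants the Role to a single user."""
    return {
        "apiVersion": f"{RBAC_GROUP}/v1",
        "kind": "RoleBinding",
        "metadata": {"namespace": namespace, "name": f"{role_name}-{user}"},
        "subjects": [{"kind": "User", "name": user, "apiGroup": RBAC_GROUP}],
        "roleRef": {"kind": "Role", "name": role_name, "apiGroup": RBAC_GROUP},
    }


def allows(role, resource, verb):
    """Simplified check: exact resource and verb matches only."""
    return any(
        resource in rule["resources"] and verb in rule["verbs"]
        for rule in role["rules"]
    )


role = build_role("data-science", "workspace-user")
binding = build_role_binding("data-science", "workspace-user", "alice")
print(json.dumps([role, binding], indent=2))
print(allows(role, "pods", "create"))
print(allows(role, "secrets", "get"))
```

The role grants nothing for secrets, nodes, or other namespaces, so anything outside the listed rules is denied by default.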

Enable Pod Security Contexts: Pod Security Contexts allow you to set security-related attributes at the pod level, such as user and group IDs, filesystem permissions, and SELinux policies. By configuring appropriate security contexts, you can ensure that pods operate with the necessary privileges and access restrictions, reducing the attack surface within the cluster.
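
As a sketch of how these attributes fit together, the hypothetical `harden_pod` helper below takes a Pod manifest as a dictionary and returns a copy with a pod-level and container-level security context applied. The UID and GID of 1000 are assumptions; in practice they should match the ownership of the files the pod is allowed to read.

```python
import copy
import json


def harden_pod(pod, uid=1000, gid=1000):
    """Return a copy of the manifest with restrictive security contexts.

    Any security context already present is replaced.
    """
    hardened = copy.deepcopy(pod)
    spec = hardened["spec"]

    # Pod-level settings apply to every container in the pod.
    spec["securityContext"] = {
        "runAsNonRoot": True,  # refuse to start containers as UID 0
        "runAsUser": uid,
        "runAsGroup": gid,
        "fsGroup": gid,  # group ownership applied to mounted volumes
    }

    # Container-level settings remove privilege escalation paths.
    for container in spec["containers"]:
        container["securityContext"] = {
            "privileged": False,
            "allowPrivilegeEscalation": False,
            "capabilities": {"drop": ["ALL"]},
        }
    return hardened


original = {
    "apiVersion": "v1",
    "kind": "Pod",
    "metadata": {"name": "ds-workspace"},
    "spec": {
        "containers": [
            {"name": "workspace",
             "image": "registry.example.com/team/jupyter:1.0",  # hypothetical
             "securityContext": {"privileged": True}}
        ]
    },
}

hardened = harden_pod(original)
print(json.dumps(hardened["spec"], indent=2))
```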

Limit root access: Restrict root access to only those who require it, and utilize tools that enable granular control over administrative privileges. Regularly review and audit privileged accounts to minimize the risk of unauthorized access.

Regularly Update Container Images: Keeping your container images up to date is crucial for maintaining pod security. Regularly check for security patches and updates provided by the image maintainers and promptly incorporate them into your deployment process. By using the latest versions of container images, you mitigate the risk of known vulnerabilities.

Secure Git repositories: Verify the authenticity of Git repositories before pulling code. Use secure and trusted sources, employ access controls, and regularly update repositories to prevent the introduction of malicious code.

Keep systems up to date: Regularly apply security patches, updates, and fixes to the operating system, software, and applications. Vulnerabilities can be exploited by attackers, so timely updates are crucial to maintain system security.

Utilize Network Policies: Kubernetes Network Policies allow you to define rules governing inbound and outbound network traffic for pods. By configuring network policies, you can segment and isolate pods, controlling communication between them and external entities. This helps to mitigate the risk of lateral movement and potential attacks within the cluster.
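
One common pattern, sketched below under assumed names, is to start with a default-deny policy for a namespace and then allow only the traffic each workload needs. The namespace, labels, and port are hypothetical. Network policies are only enforced when the cluster's network plugin supports them, and a default-deny egress rule also blocks DNS unless it is explicitly allowed.

```python
import json

API_VERSION = "networking.k8s.io/v1"


def default_deny(namespace):
    """Deny all ingress and egress for every pod in the namespace."""
    return {
        "apiVersion": API_VERSION,
        "kind": "NetworkPolicy",
        "metadata": {"name": "default-deny-all", "namespace": namespace},
        # An empty podSelector selects all pods in the namespace.
        "spec": {"podSelector": {}, "policyTypes": ["Ingress", "Egress"]},
    }


def allow_ingress(namespace, name, target_labels, source_labels, port):
    """Allow TCP traffic on one port from source pods to target pods."""
    return {
        "apiVersion": API_VERSION,
        "kind": "NetworkPolicy",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            "podSelector": {"matchLabels": target_labels},
            "policyTypes": ["Ingress"],
            "ingress": [
                {
                    "from": [{"podSelector": {"matchLabels": source_labels}}],
                    "ports": [{"protocol": "TCP", "port": port}],
                }
            ],
        },
    }


policies = [
    default_deny("data-science"),
    allow_ingress(
        namespace="data-science",
        name="allow-gateway-to-workspace",
        target_labels={"app": "ds-workspace"},  # hypothetical labels
        source_labels={"role": "gateway"},
        port=8888,
    ),
]
print(json.dumps(policies, indent=2))
```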

Educate users: Conduct regular security awareness training for employees to educate them about common threats, safe computing practices, and the importance of maintaining a secure environment.

Conclusion

In conclusion, protecting AI in Kubernetes is important to safeguard valuable AI models, mitigate security risks associated with the distributed architecture and resource abstraction, secure containers and third-party integrations, and address the complexities of open-source components. By implementing robust security measures, organizations can ensure the confidentiality, integrity, and availability of their AI workloads while leveraging the benefits of Kubernetes for scalable and dynamic environments.