Kubernetes (Helm) Installation
This guide covers installing DataChain Studio on Kubernetes using Helm charts.
Prerequisites
Kubernetes Cluster
- Kubernetes version: 1.19+
- Node requirements:
  - Minimum: 2 nodes with 8GB RAM, 4 vCPUs each
  - Recommended: 3+ nodes with 16GB RAM, 8 vCPUs each
- Storage: 100GB persistent storage
- Networking: Cluster networking with ingress controller
Required Tools
- kubectl configured to access your cluster
- helm 3.0+
- Access to DataChain Studio container images
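Before starting, it can help to confirm that both tools are installed and that kubectl points at the intended cluster. The following is a minimal sketch; the helper name `check_prereqs` is illustrative and not part of DataChain Studio:

```shell
# Illustrative helper: prints tool versions and basic cluster facts.
check_prereqs() {
  kubectl version --client     # kubectl client version
  helm version --short         # should report v3.x
  kubectl get nodes -o wide    # confirms cluster access and node count
  kubectl get storageclass     # storage classes available for persistent volumes
  kubectl get ingressclass     # ingress controllers registered in the cluster
}
```

Run `check_prereqs` from the shell you will use for the rest of the installation.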
Access Requirements
- Container registry access (provided by DataChain team)
- Valid DNS domain for DataChain Studio
- SSL certificates for HTTPS
Installation Steps
1. Add DataChain Helm Repository
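A minimal sketch of this step. The repository URL is not given in this guide, so `<helm-repo-url>` is a placeholder for the address supplied by the DataChain team; the alias `datachain` is chosen to match the `datachain/studio` chart reference used in the install step. The function name is illustrative.

```shell
# Assumption: the real repository URL is provided by the DataChain team.
HELM_REPO_URL="${HELM_REPO_URL:-<helm-repo-url>}"

add_datachain_repo() {
  helm repo add datachain "$HELM_REPO_URL"   # register the repo under the alias "datachain"
  helm repo update                           # refresh the local chart index
  helm search repo datachain/studio          # confirm the chart is visible
}
```

Set `HELM_REPO_URL` to the real address, then call `add_datachain_repo`.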
2. Create Namespace
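A minimal sketch, assuming the `datachain-studio` namespace that the later commands in this guide use. The function name is illustrative.

```shell
NAMESPACE="datachain-studio"

# Create the namespace only if it does not exist yet, so the step is safe to re-run.
create_namespace() {
  kubectl get namespace "$NAMESPACE" >/dev/null 2>&1 ||
    kubectl create namespace "$NAMESPACE"
}
```

Call `create_namespace` once kubectl is configured for the target cluster.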
3. Configure Container Registry Access
Create a secret for accessing DataChain Studio container images:
kubectl create secret docker-registry datachain-registry \
  --namespace datachain-studio \
  --docker-server=registry.datachain.ai \
  --docker-username=<provided-username> \
  --docker-password=<provided-password>
4. Configure SSL Certificates
Create a TLS secret for your domain:
kubectl create secret tls studio-tls \
  --namespace datachain-studio \
  --cert=path/to/tls.crt \
  --key=path/to/tls.key
5. Create Configuration File
Create a values.yaml file with your configuration:
# Basic configuration
global:
  domain: studio.yourcompany.com
  storageClass: gp2 # or your preferred storage class

# Image pull secrets
imagePullSecrets:
  - name: datachain-registry

# SSL/TLS configuration
ingress:
  enabled: true
  className: nginx # or your ingress class
  tls:
    enabled: true
    secretName: studio-tls
  annotations:
    nginx.ingress.kubernetes.io/ssl-redirect: "true"
    nginx.ingress.kubernetes.io/force-ssl-redirect: "true"

# Database configuration
postgresql:
  enabled: true
  auth:
    postgresPassword: "secure-postgres-password"
    database: "datachain_studio"
  primary:
    persistence:
      enabled: true
      size: 50Gi
      storageClass: gp2

# Redis configuration
redis:
  enabled: true
  auth:
    enabled: true
    password: "secure-redis-password"

# Storage configuration
storage:
  type: s3
  s3:
    bucket: your-studio-bucket
    region: us-east-1
    accessKey: your-access-key
    secretKey: your-secret-key

# Git integrations
git:
  github:
    enabled: true
    appId: "your-github-app-id"
    privateKey: |
      -----BEGIN RSA PRIVATE KEY-----
      your-github-private-key-content
      -----END RSA PRIVATE KEY-----
  gitlab:
    enabled: true
    url: "https://gitlab.com"
    clientId: "your-gitlab-client-id"
    clientSecret: "your-gitlab-client-secret"

# Resource limits
resources:
  frontend:
    requests:
      memory: "512Mi"
      cpu: "250m"
    limits:
      memory: "1Gi"
      cpu: "500m"
  backend:
    requests:
      memory: "1Gi"
      cpu: "500m"
    limits:
      memory: "2Gi"
      cpu: "1000m"
  worker:
    requests:
      memory: "1Gi"
      cpu: "500m"
    limits:
      memory: "2Gi"
      cpu: "1000m"

# Autoscaling
autoscaling:
  enabled: true
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
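The keys a chart accepts can differ between chart versions, so it may be worth comparing values.yaml against the chart's defaults and rendering the manifests locally before installing. A minimal sketch, assuming the `datachain` repository alias is already registered; the function name and the output file names `chart-defaults.yaml` and `rendered.yaml` are illustrative:

```shell
# Illustrative helper: checks values.yaml against the chart without touching the cluster.
preview_release() {
  # Write the chart's default values to a file to see which keys it supports.
  helm show values datachain/studio > chart-defaults.yaml
  # Render the manifests locally; this fails if the templates cannot be rendered.
  helm template datachain-studio datachain/studio \
    --namespace datachain-studio \
    --values values.yaml > rendered.yaml
}
```

Review `rendered.yaml` for the expected domain, secrets, and resource settings before running the install.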
6. Install DataChain Studio
helm install datachain-studio datachain/studio \
  --namespace datachain-studio \
  --values values.yaml \
  --wait --timeout=10m
7. Verify Installation
Check pod status:
Check services:
Check ingress:
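The three checks above map to standard kubectl queries. A minimal sketch, assuming the `datachain-studio` namespace used during installation; the function name is illustrative:

```shell
NAMESPACE="datachain-studio"

verify_install() {
  kubectl get pods -n "$NAMESPACE"       # pods should reach Running and Ready
  kubectl get services -n "$NAMESPACE"   # service names, types, and ports
  kubectl get ingress -n "$NAMESPACE"    # ingress host and assigned address
}
```

Call `verify_install` after the install command returns.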
Configuration Options
Database Options
External PostgreSQL
postgresql:
  enabled: false
externalDatabase:
  type: postgresql
  host: your-postgres-host
  port: 5432
  database: datachain_studio
  username: studio_user
  password: your-password
External Redis
Storage Options
AWS S3
storage:
  type: s3
  s3:
    bucket: your-bucket
    region: us-east-1
    accessKey: your-access-key
    secretKey: your-secret-key
Google Cloud Storage
storage:
  type: gcs
  gcs:
    bucket: your-bucket
    projectId: your-project-id
    keyFile: |
      {
        "type": "service_account",
        "project_id": "your-project-id",
        ...
      }
Azure Blob Storage
storage:
  type: azure
  azure:
    accountName: your-account-name
    accountKey: your-account-key
    containerName: your-container
High Availability Configuration
# Multiple replicas
replicaCount:
  frontend: 3
  backend: 3
  worker: 2

# Pod disruption budgets
podDisruptionBudget:
  enabled: true
  minAvailable: 1

# Pod anti-affinity (prefer spreading replicas across nodes)
affinity:
  podAntiAffinity:
    preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 100
        podAffinityTerm:
          labelSelector:
            matchExpressions:
              - key: app.kubernetes.io/name
                operator: In
                values:
                  - datachain-studio
          topologyKey: kubernetes.io/hostname
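After applying a high availability configuration, it can be useful to confirm that replicas actually landed on different nodes and that the disruption budgets exist. A minimal sketch, assuming the `datachain-studio` namespace; the function name is illustrative:

```shell
NAMESPACE="datachain-studio"

check_ha() {
  kubectl get pods -n "$NAMESPACE" -o wide   # NODE column shows how replicas are spread
  kubectl get pdb -n "$NAMESPACE"            # disruption budgets and allowed disruptions
  kubectl get hpa -n "$NAMESPACE"            # autoscaler targets, if autoscaling is enabled
}
```

Because the anti-affinity rule is a preference rather than a requirement, replicas can still share a node when the cluster has too few schedulable nodes.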
Upgrading
Check Current Version
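A minimal sketch using standard Helm commands, assuming the release name `datachain-studio` from the install step; the function name is illustrative:

```shell
RELEASE="datachain-studio"
NAMESPACE="datachain-studio"

check_release() {
  helm list -n "$NAMESPACE"                # chart version, app version, and revision
  helm status "$RELEASE" -n "$NAMESPACE"   # current state of the release
}
```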
Upgrade to Latest Version
helm repo update
helm upgrade datachain-studio datachain/studio \
  --namespace datachain-studio \
  --values values.yaml \
  --wait
Rollback if Needed
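A minimal sketch using standard Helm commands, assuming the release name `datachain-studio`. The function names are illustrative, and the revision number must be taken from the history output for your own release.

```shell
RELEASE="datachain-studio"
NAMESPACE="datachain-studio"

# List past revisions with their status and chart version.
show_history() {
  helm history "$RELEASE" -n "$NAMESPACE"
}

# Usage: rollback_release <revision>
rollback_release() {
  if [ -z "${1:-}" ]; then
    echo "usage: rollback_release <revision>" >&2
    return 1
  fi
  helm rollback "$RELEASE" "$1" -n "$NAMESPACE" --wait
}
```

Run `show_history` first, then pass the last known-good revision to `rollback_release`.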
Monitoring and Logging
Enable Monitoring
monitoring:
  enabled: true
  serviceMonitor:
    enabled: true
  prometheus:
    enabled: true
  grafana:
    enabled: true
    adminPassword: your-grafana-password
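ServiceMonitor is a custom resource defined by the Prometheus Operator, so enabling it only works when that CRD is present in the cluster. A minimal sketch for checking this, assuming the `datachain-studio` namespace; the function name is illustrative:

```shell
NAMESPACE="datachain-studio"

check_monitoring() {
  # Fails with NotFound if the Prometheus Operator CRDs are not installed.
  kubectl get crd servicemonitors.monitoring.coreos.com
  # Lists ServiceMonitor objects created for the release.
  kubectl get servicemonitor -n "$NAMESPACE"
}
```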
Log Configuration
logging:
  level: INFO
  format: json
  # External log aggregation
  fluentd:
    enabled: true
    host: your-log-aggregator
    port: 24224
Security Considerations
Network Policies
networkPolicy:
  enabled: true
  ingress:
    - from:
        - namespaceSelector:
            matchLabels:
              name: ingress-nginx
  egress:
    - to:
        - namespaceSelector: {}
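The ingress rule above only matches a namespace that carries the label `name: ingress-nginx`. Kubernetes does not add a `name` label automatically, so it is worth checking the labels on the ingress controller's namespace. A minimal sketch, assuming the controller runs in a namespace called `ingress-nginx`; the function names are illustrative:

```shell
INGRESS_NAMESPACE="ingress-nginx"   # assumption: namespace of your ingress controller

check_network_policy() {
  kubectl get namespace "$INGRESS_NAMESPACE" --show-labels   # look for name=ingress-nginx
  kubectl get networkpolicy -n datachain-studio              # policies created for the release
}

# Adds the label the selector expects; run only if the check shows it is missing.
label_ingress_namespace() {
  kubectl label namespace "$INGRESS_NAMESPACE" name="$INGRESS_NAMESPACE"
}
```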
Security Context
Pod Security Standards
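One general way to apply Pod Security Standards is through Kubernetes Pod Security Admission namespace labels (on by default from Kubernetes 1.23). The sketch below uses that built-in mechanism rather than any chart-specific setting. The `baseline` default is an assumption, and whether the Studio pods satisfy a given level has to be checked in your cluster. The function names are illustrative.

```shell
NAMESPACE="datachain-studio"
PSS_LEVEL="${PSS_LEVEL:-baseline}"   # assumption: pick baseline or restricted per your policy

# Server-side dry run: warns about existing pods that would violate the level, changes nothing.
preview_pod_security() {
  kubectl label --dry-run=server --overwrite namespace "$NAMESPACE" \
    "pod-security.kubernetes.io/enforce=$PSS_LEVEL"
}

# Applies the label so new pods violating the level are rejected.
enforce_pod_security() {
  kubectl label --overwrite namespace "$NAMESPACE" \
    "pod-security.kubernetes.io/enforce=$PSS_LEVEL"
}
```

Run `preview_pod_security` first and enforce only when it reports no warnings.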
Troubleshooting
Common Issues
Pods stuck in Pending:
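Pending pods usually mean the scheduler could not place them, often because of insufficient node resources or an unbound persistent volume claim. A minimal diagnostic sketch, assuming the `datachain-studio` namespace; the function name is illustrative:

```shell
NAMESPACE="datachain-studio"

# Usage: diagnose_pending [pod-name]
diagnose_pending() {
  kubectl get pods -n "$NAMESPACE" --field-selector=status.phase=Pending
  if [ -n "${1:-}" ]; then
    # The Events section at the end explains why scheduling failed.
    kubectl describe pod "$1" -n "$NAMESPACE"
  fi
  kubectl get pvc -n "$NAMESPACE"   # a Pending PVC often points to a wrong storage class
  kubectl describe nodes | grep -A 8 "Allocated resources"   # CPU and memory already requested per node
}
```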
Database connection issues:
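A minimal diagnostic sketch. The backend deployment name is the one used in the debug commands of this guide. The PostgreSQL service name `datachain-studio-postgresql` is an assumption to confirm with `kubectl get services`, and the throwaway pod assumes the cluster can pull the public `postgres:16` image. The function name is illustrative.

```shell
NAMESPACE="datachain-studio"
# Assumption: service name of the bundled PostgreSQL; override for an external database.
DB_HOST="${DB_HOST:-datachain-studio-postgresql}"

diagnose_database() {
  kubectl get pods,services -n "$NAMESPACE" | grep -i postgres
  # Recent backend log lines that mention the database.
  kubectl logs deployment/datachain-studio-backend -n "$NAMESPACE" --tail=100 |
    grep -iE "database|postgres|connection"
  # Temporary pod that runs pg_isready against the database host and is removed afterwards.
  kubectl run pg-check --rm -i --restart=Never -n "$NAMESPACE" \
    --image=postgres:16 -- pg_isready -h "$DB_HOST" -p 5432
}
```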
SSL certificate problems:
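A minimal diagnostic sketch, assuming the `studio-tls` secret and the example domain from the configuration steps, and that `openssl` is installed locally. The function name is illustrative.

```shell
NAMESPACE="datachain-studio"
TLS_SECRET="studio-tls"
DOMAIN="${DOMAIN:-studio.yourcompany.com}"   # replace with your real domain

diagnose_tls() {
  # Decode the certificate stored in the secret and print subject, issuer, and validity dates.
  kubectl get secret "$TLS_SECRET" -n "$NAMESPACE" -o jsonpath='{.data.tls\.crt}' |
    base64 -d | openssl x509 -noout -subject -issuer -dates
  # Show the hosts and TLS secret the ingress actually references.
  kubectl describe ingress -n "$NAMESPACE"
  # Print the validity dates of the certificate served on port 443.
  openssl s_client -connect "$DOMAIN:443" -servername "$DOMAIN" </dev/null 2>/dev/null |
    openssl x509 -noout -dates
}
```

If the certificate in the secret and the served certificate differ, the ingress is likely not using the intended secret.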
Debug Commands
# Check all resources
kubectl get all -n datachain-studio
# Check events
kubectl get events -n datachain-studio --sort-by='.lastTimestamp'
# Check logs
kubectl logs -f deployment/datachain-studio-backend -n datachain-studio
# Port forward for local access
kubectl port-forward service/datachain-studio-frontend 8080:80 -n datachain-studio