502 Bad Gateway Errors
An HTTP 502 Bad Gateway error when accessing DataChain Studio means the web server cannot reach the backend application services. This guide covers how to diagnose and resolve these errors.
Overview
502 Bad Gateway errors occur when:
- Backend services are not running or accessible
- Network connectivity issues exist between components
- Resource constraints prevent services from starting
- Load balancers or ingress are misconfigured
Initial Diagnosis
Check Service Status
Kubernetes Deployments
# Check pod status
kubectl get pods -n datachain-studio
# Check service status
kubectl get services -n datachain-studio
# Check ingress status
kubectl get ingress -n datachain-studio
# Look for events
kubectl get events -n datachain-studio --sort-by='.lastTimestamp'
AMI Deployments
# SSH to the instance first
ssh -i your-key.pem ubuntu@your-instance-ip
# Check system service status
sudo systemctl status datachain-studio
# Check container status
sudo docker ps -a
# Check logs
sudo journalctl -u datachain-studio -f
Identify the Problem
Common pod statuses indicating issues:
- ImagePullBackOff / ErrImagePull - Container image issues
- CrashLoopBackOff - Application startup failures
- Pending - Resource or scheduling issues
- CreateContainerConfigError - Configuration problems
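When a pod shows one of these states, the pod's events usually name the exact cause. A quick way to pull that detail (POD_NAME stands in for the failing pod):
# Show recent events for a specific failing pod
kubectl describe pod POD_NAME -n datachain-studio | grep -A 15 -i events
# List only pods that are not in the Running phase (Pending, Failed, etc.)
kubectl get pods -n datachain-studio --field-selector=status.phase!=Running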
Container Image Issues
Image Pull Problems
If pods show ImagePullBackOff or ErrImagePull:
For Cloud Deployments
# Check image pull secrets
kubectl get secrets -n datachain-studio | grep registry
# Recreate registry secret if needed
kubectl delete secret datachain-registry -n datachain-studio
kubectl create secret docker-registry datachain-registry \
--namespace datachain-studio \
--docker-server=registry.datachain.ai \
--docker-username=your-username \
--docker-password=your-password
# Restart deployments
kubectl rollout restart deployment/datachain-studio-backend -n datachain-studio
kubectl rollout restart deployment/datachain-studio-frontend -n datachain-studio
kubectl rollout restart deployment/datachain-studio-worker -n datachain-studio
For Air-gapped Deployments
# Check if images exist in internal registry
kubectl describe pod POD_NAME -n datachain-studio | grep -i image
# Verify internal registry connectivity
kubectl exec -it deployment/datachain-studio-backend -n datachain-studio -- \
curl -I https://registry.internal.company.com
# Re-tag and push images if needed
docker tag datachain/studio-backend:VERSION registry.internal.company.com/datachain/studio-backend:VERSION
docker push registry.internal.company.com/datachain/studio-backend:VERSION
Image Version Mismatches
# Check configured image versions
kubectl get deployment -n datachain-studio -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.template.spec.containers[0].image}{"\n"}{end}'
# Update to correct versions if needed
kubectl set image deployment/datachain-studio-backend \
datachain-studio-backend=registry.datachain.ai/studio-backend:CORRECT_VERSION \
-n datachain-studio
Application Startup Issues
Configuration Problems
Check Configuration
# Review configuration
kubectl get configmap datachain-studio-config -n datachain-studio -o yaml
# Check for missing environment variables
kubectl exec -it deployment/datachain-studio-backend -n datachain-studio -- env | grep -i studio
# Validate secrets
kubectl get secrets -n datachain-studio
kubectl describe secret datachain-studio-secrets -n datachain-studio
Database Connection Issues
# Test database connectivity
kubectl exec -it deployment/datachain-studio-backend -n datachain-studio -- \
python -c "
import os
import psycopg2
try:
    conn = psycopg2.connect(os.environ['DATABASE_URL'])
    print('Database connection: OK')
except Exception as e:
    print(f'Database connection failed: {e}')
"
# Check database pod status
kubectl get pods -l app=postgres -n datachain-studio
# Check database logs
kubectl logs -f deployment/datachain-studio-postgres -n datachain-studio
Redis Connection Issues
# Test Redis connectivity
kubectl exec -it deployment/datachain-studio-backend -n datachain-studio -- \
python -c "
import redis
import os
try:
    r = redis.from_url(os.environ.get('REDIS_URL', 'redis://localhost:6379'))
    r.ping()
    print('Redis connection: OK')
except Exception as e:
    print(f'Redis connection failed: {e}')
"
# Check Redis pod status
kubectl get pods -l app=redis -n datachain-studio
# Check Redis logs
kubectl logs -f deployment/datachain-studio-redis -n datachain-studio
Resource Constraints
Check Resource Usage
# Check node resources
kubectl describe nodes | grep -A 5 "Allocated resources"
# Check pod resource requests and limits
kubectl describe pod POD_NAME -n datachain-studio | grep -A 10 -i resources
# Check actual resource usage
kubectl top nodes
kubectl top pods -n datachain-studio
Resolve Resource Issues
# Scale down other workloads temporarily
kubectl scale deployment other-deployment --replicas=0 -n other-namespace
# Increase resource limits in Helm values
# values.yaml
resources:
  backend:
    requests:
      memory: "1Gi"
      cpu: "500m"
    limits:
      memory: "2Gi"
      cpu: "1000m"
# Apply changes
helm upgrade datachain-studio datachain/studio \
--namespace datachain-studio \
--values values.yaml
Network Connectivity Issues
Service Discovery Problems
# Check service endpoints
kubectl get endpoints -n datachain-studio
# Test internal service connectivity
kubectl exec -it deployment/datachain-studio-frontend -n datachain-studio -- \
curl -I http://datachain-studio-backend:8000/health
# Check DNS resolution
kubectl exec -it deployment/datachain-studio-frontend -n datachain-studio -- \
nslookup datachain-studio-backend.datachain-studio.svc.cluster.local
Ingress Configuration Issues
# Check ingress configuration
kubectl describe ingress datachain-studio-ingress -n datachain-studio
# Check ingress controller logs
kubectl logs -f deployment/nginx-ingress-controller -n ingress-nginx
# Test ingress rules
curl -H "Host: studio.yourcompany.com" http://INGRESS_IP/health
Load Balancer Issues
# Check load balancer status
kubectl get service datachain-studio-lb -n datachain-studio
# Check load balancer endpoints
kubectl describe service datachain-studio-lb -n datachain-studio
# Test load balancer connectivity
curl -I http://LOAD_BALANCER_IP:80/health
SSL/TLS Related Issues
Certificate Problems
# Check TLS secret
kubectl describe secret datachain-studio-tls -n datachain-studio
# Verify certificate validity
kubectl get secret datachain-studio-tls -n datachain-studio -o jsonpath='{.data.tls\.crt}' | \
base64 -d | openssl x509 -dates -noout
# Test SSL connectivity
openssl s_client -connect studio.yourcompany.com:443 -servername studio.yourcompany.com
SSL Termination Issues
# Check if SSL is terminated at ingress
kubectl describe ingress datachain-studio-ingress -n datachain-studio | grep -i tls
# Test without SSL (if applicable)
curl -I http://studio.yourcompany.com/health
# Check SSL redirect configuration
curl -I -L http://studio.yourcompany.com/health
Advanced Troubleshooting
Deep Dive Debugging
Application Logs Analysis
# Get detailed application logs
kubectl logs -f deployment/datachain-studio-backend -n datachain-studio --previous
# Search for specific error patterns
kubectl logs deployment/datachain-studio-backend -n datachain-studio | grep -i error
kubectl logs deployment/datachain-studio-backend -n datachain-studio | grep -i "502\|bad gateway"
# Check application startup sequence
kubectl logs deployment/datachain-studio-backend -n datachain-studio | head -50
Network Packet Analysis
# Capture network traffic (requires privileged access)
kubectl exec -it deployment/datachain-studio-frontend -n datachain-studio -- \
tcpdump -i any -n port 8000
# Test specific network paths
kubectl exec -it deployment/datachain-studio-frontend -n datachain-studio -- \
traceroute datachain-studio-backend.datachain-studio.svc.cluster.local
Health Check Validation
# Test health endpoints directly
kubectl exec -it deployment/datachain-studio-backend -n datachain-studio -- \
curl -f http://localhost:8000/health
# Test with verbose output
kubectl exec -it deployment/datachain-studio-backend -n datachain-studio -- \
curl -v http://localhost:8000/health
# Check health endpoint response time
kubectl exec -it deployment/datachain-studio-backend -n datachain-studio -- \
time curl -f http://localhost:8000/health
AMI-Specific Troubleshooting
Docker Container Issues
# Check container status
sudo docker ps -a
# Check container logs
sudo docker logs datachain-studio-backend
sudo docker logs datachain-studio-frontend
# Restart containers
sudo docker restart datachain-studio-backend
sudo docker restart datachain-studio-frontend
# Check container health
sudo docker exec datachain-studio-backend curl -f http://localhost:8000/health
System Service Issues
# Check systemd service status
sudo systemctl status datachain-studio
sudo systemctl status docker
# Restart services
sudo systemctl restart datachain-studio
sudo systemctl restart docker
# Check service logs
sudo journalctl -u datachain-studio -f
sudo journalctl -u docker -f
# Check service configuration
sudo systemctl cat datachain-studio
Nginx Configuration
# Check nginx configuration
sudo nginx -t
# Check nginx logs
sudo tail -f /var/log/nginx/error.log
sudo tail -f /var/log/nginx/access.log
# Restart nginx
sudo systemctl restart nginx
# Test nginx upstream
curl -I http://localhost:8000/health # Direct backend test
Recovery Procedures
Quick Recovery Steps
Work through these steps in order (a consolidated sketch follows below):
- Restart all services
- Check and fix resource constraints
- Verify configuration
- Test connectivity
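A minimal sketch of these steps for a Kubernetes deployment, assuming the deployment and resource names used elsewhere in this guide (adjust them to match your release):
# 1. Restart the Studio deployments
for d in backend frontend worker; do
  kubectl rollout restart deployment/datachain-studio-$d -n datachain-studio
done
# 2. Check resource pressure on nodes and pods
kubectl top nodes
kubectl top pods -n datachain-studio
# 3. Verify the configuration objects exist
kubectl get configmap datachain-studio-config -n datachain-studio
kubectl get secret datachain-studio-secrets -n datachain-studio
# 4. Test connectivity to the backend health endpoint
kubectl exec -it deployment/datachain-studio-backend -n datachain-studio -- \
  curl -f http://localhost:8000/health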
Full Recovery Process
- Stop all services
- Check system resources and fix constraints
- Verify configuration files
- Check network connectivity
- Start services in order: Database → Redis → Backend → Frontend → Worker (see the sketch after this list)
- Validate each component before starting the next
- Test full application functionality
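A sketch of that ordered start-up for Kubernetes, assuming each component runs as a deployment named as in the examples above (the database and Redis may be StatefulSets in some installs). kubectl rollout status waits until a component is ready, so each step is validated before the next begins:
# Restart components in dependency order, waiting for each to become ready
for d in postgres redis backend frontend worker; do
  kubectl rollout restart deployment/datachain-studio-$d -n datachain-studio
  kubectl rollout status deployment/datachain-studio-$d -n datachain-studio --timeout=300s
done
# Final end-to-end check through the ingress
curl -f -H "Host: studio.yourcompany.com" http://INGRESS_IP/health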
Prevention
Monitoring and Alerting
Set up monitoring to catch 502 errors early:
# Prometheus alert example
- alert: High502ErrorRate
  expr: rate(nginx_ingress_controller_requests{status="502"}[5m]) > 0.1
  for: 2m
  labels:
    severity: critical
  annotations:
    summary: "High rate of 502 errors"
    description: "502 error rate is {{ $value }} per second"
- alert: BackendServiceDown
  expr: up{job="datachain-studio-backend"} == 0
  for: 1m
  labels:
    severity: critical
  annotations:
    summary: "Backend service is down"
Health Checks
Implement comprehensive health checks:
# Kubernetes readiness probe
readinessProbe:
  httpGet:
    path: /health/ready
    port: 8000
  initialDelaySeconds: 30
  periodSeconds: 10
  timeoutSeconds: 5
  failureThreshold: 3
# Kubernetes liveness probe
livenessProbe:
  httpGet:
    path: /health/live
    port: 8000
  initialDelaySeconds: 60
  periodSeconds: 30
  timeoutSeconds: 10
  failureThreshold: 3
Regular Maintenance
- Monitor resource usage trends (see the example script below)
- Review logs regularly for warnings
- Keep services updated
- Test failover procedures
- Document configuration changes
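As an illustration, a small script like the following can be run from cron to capture resource snapshots and recent warnings for trend review (the script and log paths are illustrative, not part of the Studio distribution):
#!/usr/bin/env bash
# Snapshot pod resource usage and recent backend warnings once per run
DATE=$(date +%F)
kubectl top pods -n datachain-studio >> "/var/log/studio-usage-$DATE.log"
kubectl logs deployment/datachain-studio-backend -n datachain-studio --since=24h \
  | grep -iE "warn|error" >> "/var/log/studio-warnings-$DATE.log"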
Next Steps
If 502 errors persist after trying these solutions:
- Generate a support bundle with diagnostic information (see the sketch below)
- Review recent changes to configuration or infrastructure
- Check the main troubleshooting guide for other common issues
- Contact support with detailed error information and logs
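A sketch of gathering the most useful diagnostics into one archive before contacting support (the output directory name is illustrative):
# Collect resource state, events, and recent logs into a single bundle
OUT=studio-diagnostics-$(date +%F)
mkdir -p "$OUT"
kubectl get pods,services,ingress -n datachain-studio -o wide > "$OUT/resources.txt"
kubectl describe pods -n datachain-studio > "$OUT/pod-details.txt"
kubectl get events -n datachain-studio --sort-by='.lastTimestamp' > "$OUT/events.txt"
kubectl logs deployment/datachain-studio-backend -n datachain-studio --tail=1000 > "$OUT/backend.log"
tar czf "$OUT.tar.gz" "$OUT"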
For other issues, see:
- Configuration problems
- Installation issues
- Upgrade problems