Configuration

This section covers how to configure your self-hosted DataChain Studio instance for optimal performance, security, and integration with your infrastructure.

Overview

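DataChain Studio configuration spans several areas, each covered in a section below: basic settings (environment variables and the configuration file), advanced settings (performance, security, integrations, and backups), monitoring, validation, and troubleshooting.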

Basic Configuration

Environment Variables

DataChain Studio reads its basic settings from environment variables. The example below sets them under the global.envVars key:

global:
  envVars:
    # Basic settings
    DATACHAIN_STUDIO_URL: "https://studio.yourcompany.com"
    DATACHAIN_STUDIO_SECRET_KEY: "your-secret-key"

    # Database settings
    DATABASE_URL: "postgresql://user:pass@host:5432/datachain_studio"
    REDIS_URL: "redis://host:6379"

    # Storage settings
    STORAGE_TYPE: "s3"
    S3_BUCKET: "your-studio-bucket"
    S3_REGION: "us-east-1"

    # Git integration
    GITHUB_APP_ID: "your-github-app-id"
    GITLAB_CLIENT_ID: "your-gitlab-client-id"
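
If these values are filled in from a shell or CI environment before deployment, a small pre-flight check can catch empty settings early. The following is a minimal sketch, not part of DataChain Studio: check_required_env is a hypothetical helper, and which variables are actually mandatory is an assumption based on the example above.

```shell
#!/bin/sh
# Hypothetical pre-flight helper (not shipped with DataChain Studio).
# Returns 0 if every named variable is set and non-empty, 1 otherwise.
check_required_env() {
  missing=0
  for name in "$@"; do
    # Indirect lookup of the variable whose name is stored in $name.
    eval "value=\${$name:-}"
    if [ -z "$value" ]; then
      echo "missing: $name" >&2
      missing=1
    fi
  done
  return "$missing"
}

# Example call (the list of required names is an assumption):
# check_required_env DATACHAIN_STUDIO_URL DATACHAIN_STUDIO_SECRET_KEY DATABASE_URL REDIS_URL
```

The helper only reports which names are empty. It does not validate the values themselves.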

Configuration File

For more complex configurations, use a YAML configuration file:

# values.yaml
global:
  domain: studio.yourcompany.com

  # Security settings
  security:
    secretKey: "your-long-random-secret-key"
    sessionTimeout: 3600
    csrfProtection: true

  # Feature flags
  features:
    webhooks: true
    apiAccess: true
    teamCollaboration: true
    ssoIntegration: true

# Database configuration
database:
  type: postgresql
  host: postgres.yourcompany.com
  port: 5432
  name: datachain_studio
  user: studio_user
  password: secure-password
  sslMode: require

  # Connection pooling
  pool:
    minConnections: 5
    maxConnections: 20

# Cache configuration
cache:
  type: redis
  host: redis.yourcompany.com
  port: 6379
  password: redis-password
  database: 0

  # TTL settings
  ttl:
    sessions: 3600
    apiCache: 300
    dataCache: 1800

# Storage configuration
storage:
  type: s3
  config:
    bucket: datachain-studio-storage
    region: us-east-1
    accessKey: your-access-key
    secretKey: your-secret-key
    endpoint: s3.amazonaws.com

  # Alternative: Google Cloud Storage
  # type: gcs
  # config:
  #   bucket: datachain-studio-storage
  #   projectId: your-project-id
  #   keyFile: /path/to/service-account.json

# Logging configuration
logging:
  level: INFO
  format: json
  outputs:
    - console
    - file

  # Log rotation
  rotation:
    maxSize: 100MB
    maxAge: 30
    maxBackups: 10
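
The secretKey placeholder above should be replaced with a long random value. One way to produce one is sketched below. The helper name generate_secret_key is hypothetical, and the 32-byte hex format is an assumption, since this page does not state what length or encoding Studio expects.

```shell
#!/bin/sh
# Hypothetical helper: print N random bytes from /dev/urandom as lowercase hex.
# Defaults to 32 bytes, which yields a 64-character string.
generate_secret_key() {
  bytes="${1:-32}"
  # od prints the bytes as hex pairs; tr strips the spaces and newlines.
  head -c "$bytes" /dev/urandom | od -An -v -tx1 | tr -d ' \n'
  echo
}

# Example call:
# generate_secret_key
```

Store the generated value in a secrets manager rather than committing it to values.yaml in version control.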

Advanced Configuration

Performance Tuning

# Performance settings
performance:
  # Worker processes
  workers:
    frontend: 4
    backend: 8
    jobProcessor: 2

  # Memory limits
  memory:
    frontend: "1Gi"
    backend: "2Gi"
    jobProcessor: "4Gi"

  # CPU limits
  cpu:
    frontend: "500m"
    backend: "1000m"
    jobProcessor: "2000m"

  # Caching
  cache:
    enabled: true
    size: "512Mi"
    evictionPolicy: "lru"

Security Configuration

# Security settings
security:
  # Authentication
  auth:
    methods:
      - local
      - oauth
      - saml

    # Password policy
    passwordPolicy:
      minLength: 8
      requireUppercase: true
      requireLowercase: true
      requireNumbers: true
      requireSpecialChars: true

    # Session management
    sessions:
      timeout: 3600
      renewalThreshold: 300
      maxConcurrent: 5

  # Network security
  network:
    allowedIPs:
      - "10.0.0.0/8"
      - "192.168.0.0/16"

    rateLimiting:
      enabled: true
      requestsPerMinute: 100
      burstSize: 20

  # Data encryption
  encryption:
    atRest:
      enabled: true
      algorithm: "AES-256-GCM"

    inTransit:
      enabled: true
      minTlsVersion: "1.2"
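
To make the passwordPolicy settings above concrete, the sketch below checks a candidate password against the same five rules. It illustrates what the policy means and is not Studio's own implementation. The function name password_meets_policy is hypothetical, and treating any non-alphanumeric character as a "special character" is an assumption.

```shell
#!/bin/sh
# Hypothetical illustration of the example password policy.
# Returns 0 if the password satisfies every rule, 1 otherwise.
password_meets_policy() {
  pw="$1"
  [ "${#pw}" -ge 8 ] || return 1                        # minLength: 8
  case "$pw" in *[[:upper:]]*) ;; *) return 1 ;; esac   # requireUppercase
  case "$pw" in *[[:lower:]]*) ;; *) return 1 ;; esac   # requireLowercase
  case "$pw" in *[[:digit:]]*) ;; *) return 1 ;; esac   # requireNumbers
  case "$pw" in *[![:alnum:]]*) ;; *) return 1 ;; esac  # requireSpecialChars (assumed: any non-alphanumeric)
  return 0
}

# Example call:
# password_meets_policy 'Str0ng!Pass' && echo "accepted"
```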

Integration Configuration

# External integrations
integrations:
  # Git forges
  git:
    github:
      enabled: true
      appId: "123456"
      privateKeyPath: "/etc/ssl/private/github.pem"
      webhookSecret: "github-webhook-secret"

    gitlab:
      enabled: true
      url: "https://gitlab.yourcompany.com"
      clientId: "gitlab-client-id"
      clientSecret: "gitlab-client-secret"
      webhookSecret: "gitlab-webhook-secret"

    bitbucket:
      enabled: true
      clientId: "bitbucket-client-id"
      clientSecret: "bitbucket-client-secret"

  # Monitoring
  monitoring:
    prometheus:
      enabled: true
      endpoint: "/metrics"
      port: 9090

    grafana:
      enabled: true
      url: "https://grafana.yourcompany.com"

    alerts:
      slack:
        enabled: true
        webhookUrl: "https://hooks.slack.com/..."
        channel: "#datachain-alerts"

      email:
        enabled: true
        smtpHost: "smtp.yourcompany.com"
        smtpPort: 587
        from: "[email protected]"

Backup Configuration

# Backup settings
backup:
  enabled: true

  # Database backups
  database:
    enabled: true
    schedule: "0 2 * * *"  # Daily at 2 AM
    retention: 30  # days
    compression: true

    destination:
      type: s3
      bucket: datachain-studio-backups
      path: database/

  # Storage backups
  storage:
    enabled: true
    schedule: "0 3 * * 0"  # Weekly on Sunday at 3 AM
    retention: 12  # weeks

    destination:
      type: s3
      bucket: datachain-studio-backups
      path: storage/

Monitoring Configuration

Metrics and Alerting

# Monitoring configuration
monitoring:
  # Metrics collection
  metrics:
    enabled: true
    interval: 30s

    collectors:
      - system
      - application
      - database
      - cache
      - storage

  # Health checks
  healthChecks:
    enabled: true
    interval: 10s
    timeout: 5s

    endpoints:
      - /health/live
      - /health/ready
      - /health/database
      - /health/cache

  # Alerting rules
  alerts:
    rules:
      - name: "High CPU Usage"
        condition: "cpu_usage > 80"
        duration: "5m"
        severity: "warning"

      - name: "Database Connection Failed"
        condition: "database_health == 0"
        duration: "1m"
        severity: "critical"

      - name: "Storage Full"
        condition: "storage_usage > 90"
        duration: "5m"
        severity: "critical"

Validation

Configuration Validation

Validate your configuration before deployment:

# For Helm deployments
helm template datachain-studio ./chart \
  --values values.yaml \
  --dry-run

# For direct deployments
datachain-studio validate-config config.yaml

Health Checks

After deployment, confirm that each service is healthy by querying the health endpoints:

# Check service health
curl https://studio.yourcompany.com/health

# Check database connectivity
curl https://studio.yourcompany.com/health/database

# Check storage connectivity
curl https://studio.yourcompany.com/health/storage
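
The three probes above can be wrapped in one function that exits non-zero when any endpoint fails, which is convenient in deployment scripts. This is a minimal sketch: check_health is a hypothetical name, and the 5-second limit is an assumption borrowed from the health-check timeout in the monitoring example.

```shell
#!/bin/sh
# Hypothetical wrapper around the health endpoints listed above.
# Prints one line per endpoint and returns 1 if any probe fails.
check_health() {
  base_url="$1"
  failed=0
  for path in /health /health/database /health/storage; do
    # -f: treat HTTP errors (>= 400) as failures
    # -sS: no progress output, but still print error messages
    # --max-time 5: give up on a probe after five seconds
    if curl -fsS --max-time 5 -o /dev/null "${base_url}${path}"; then
      echo "ok   ${path}"
    else
      echo "FAIL ${path}"
      failed=1
    fi
  done
  return "$failed"
}

# Example call:
# check_health "https://studio.yourcompany.com" || echo "one or more checks failed"
```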

Troubleshooting

Common Configuration Issues

Database connection failures:
- Verify connection string format
- Check network connectivity
- Confirm credentials and permissions

SSL/TLS certificate issues:
- Validate certificate chain
- Check certificate expiration
- Verify domain name matches

Storage access problems:
- Confirm bucket permissions
- Check access key validity
- Verify network connectivity
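
For the first database check, a quick structural test of the connection string can rule out typos before you look at the network or credentials. The sketch below is a hypothetical helper that accepts only the postgresql://user:pass@host:port/dbname shape used in the DATABASE_URL example. It assumes the password contains no unescaped @ character and does not try to connect.

```shell
#!/bin/sh
# Hypothetical format check for a PostgreSQL connection URL.
# Returns 0 if the URL has the shape postgresql://user:pass@host:port/dbname
# with a numeric port, 1 otherwise. It does not attempt a connection.
looks_like_pg_url() {
  case "$1" in
    postgresql://*:*@*:*/*) ;;
    *) return 1 ;;
  esac
  rest="${1#postgresql://}"   # user:pass@host:port/dbname
  hostport="${rest#*@}"       # host:port/dbname
  hostport="${hostport%%/*}"  # host:port
  port="${hostport##*:}"      # port
  case "$port" in
    ''|*[!0-9]*) return 1 ;;  # the port must be non-empty and all digits
  esac
  return 0
}

# Example call:
# looks_like_pg_url "$DATABASE_URL" || echo "DATABASE_URL looks malformed"
```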

Configuration Testing

# Test configuration
test:
  enabled: true

  # Unit tests
  unit:
    database: true
    cache: true
    storage: true
    auth: true

  # Integration tests
  integration:
    gitForges: true
    webhooks: true
    api: true

  # Load tests
  load:
    enabled: false
    users: 100
    duration: "10m"

Next Steps