Skip to content

Architecture

DataChain architecture

A dataset is the unit of work: a named, versioned result of a pipeline step like [email protected]. Every .save() registers one.

DataChain is split into two parts, matching the two containers in the diagram.

Python Library runs and queries data:

  • Python Data Engine runs your Python over heavy files and tables in parallel, with async prefetch and checkpoints.
  • Data Memory is the typed, versioned dataset registry. Every chain deposits its results here.
  • Query Engine filters, joins, and runs similarity search across Data Memory at warehouse speed.

Skill and MCP serves agents:

  • Knowledge Base is a structured reflection of Data Memory enriched by LLMs: markdown files agents read before generating code. Always accurate because it's derived.