Organizing Datasets with Namespace and Project
DataChain lets you organize datasets using namespaces and projects. These provide additional structure for managing data across different workflows, use cases, or organizational structures.
A dataset in DataChain is identified by a fully qualified name of the form:
<namespace>.<project>.<dataset>
For example:
dev.analytics.metrics
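To make the three-part naming scheme concrete, here is a minimal sketch that splits a fully qualified name such as dev.analytics.metrics into its parts. The helper `parse_dataset_name` and the `DatasetName` type are hypothetical names introduced for illustration and are not part of DataChain; the sketch also assumes that none of the three parts contains a dot.

```python
from typing import NamedTuple


class DatasetName(NamedTuple):
    """Illustrative container for the three parts of a dataset name."""
    namespace: str
    project: str
    dataset: str


def parse_dataset_name(full_name: str) -> DatasetName:
    """Split 'namespace.project.dataset' into its parts (hypothetical helper)."""
    parts = full_name.split(".")
    # Assumption: exactly three non-empty, dot-free parts.
    if len(parts) != 3 or not all(parts):
        raise ValueError(
            f"expected <namespace>.<project>.<dataset>, got {full_name!r}"
        )
    return DatasetName(*parts)


name = parse_dataset_name("dev.analytics.metrics")
print(name.namespace, name.project, name.dataset)  # dev analytics metrics
```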
Default Namespace and Project
If no namespace or project is specified, DataChain uses defaults depending on whether you're using Studio or the CLI.
Studio
- Namespace: users
- Project: your username (e.g. jondoe)
- Saving without namespace/project: a dataset saved as metrics by the user jondoe is stored as users.jondoe.metrics
CLI
- Namespace: local
- Project: local
- Saving without namespace/project: a dataset saved as metrics is stored as local.local.metrics
In the CLI, you cannot create or use any namespace or project other than the default local.local.
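The defaults for Studio and the CLI can be summarized in a small model. This is only a sketch of the rules described above, not DataChain's own resolution logic: the function `qualify`, its `environment` and `username` parameters, and the error messages are all hypothetical. Names with exactly two parts are not covered on this page, so the sketch rejects them.

```python
from typing import Optional


def qualify(name: str, *, environment: str, username: Optional[str] = None) -> str:
    """Return the fully qualified name a dataset would get (illustrative model)."""
    if environment not in ("studio", "cli"):
        raise ValueError(f"unknown environment: {environment!r}")

    parts = name.split(".")
    if len(parts) == 1:
        # No namespace or project given: apply the environment's defaults.
        if environment == "studio":
            if not username:
                raise ValueError("a Studio username is required")
            return f"users.{username}.{name}"
        return f"local.local.{name}"

    if len(parts) != 3:
        raise ValueError(f"expected <namespace>.<project>.<dataset>, got {name!r}")

    # The CLI accepts only the default local.local namespace and project.
    if environment == "cli" and parts[:2] != ["local", "local"]:
        raise ValueError("the CLI only supports the local.local project")
    return name


print(qualify("metrics", environment="studio", username="jondoe"))  # users.jondoe.metrics
print(qualify("metrics", environment="cli"))  # local.local.metrics
```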
Creating a Project (Studio only)
In Studio, you can explicitly create a namespace and project with dc.create_project(). For example, dc.create_project("dev", "analytics") creates the dev namespace (if it doesn't exist) and a project called analytics inside it.
Note: Creating custom namespaces and projects is only supported in Studio. In the CLI, only the default local namespace and local project are available.
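As a sketch of explicit project creation, the call below follows the same shape as the dc.create_project("prod", "analytics") call in the Studio example on this page, with the namespace first and the project second. The wrapper function name is hypothetical, and the call is expected to succeed only in Studio.

```python
def create_dev_analytics_project() -> None:
    """Create the dev namespace (if missing) and the analytics project in it."""
    # Imported inside the function so this file loads even where
    # DataChain is not installed.
    import datachain as dc

    # Namespace first, then project. Studio only: the CLI is limited
    # to the default local.local namespace and project.
    dc.create_project("dev", "analytics")
```

Calling create_dev_analytics_project() from a Studio session should leave a dev.analytics project ready for datasets such as dev.analytics.metrics.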
Saving a Dataset Using a Fully Qualified Name
You can implicitly create and use namespaces and projects by saving a dataset under a fully qualified name, such as dev.analytics.metrics.
In Studio, this automatically creates the namespace and project if they don't already exist.
In the CLI, only local.local.<dataset> is supported. Using any other namespace or project results in an error.
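A minimal sketch of saving under a fully qualified name, reusing the dc.read_values() data shown elsewhere on this page. The wrapper `save_scores` and its `target` parameter are hypothetical; only the DataChain calls themselves come from this page.

```python
def save_scores(target: str) -> None:
    """Save a small in-memory dataset under the given dataset name."""
    # Imported lazily so the file loads without DataChain installed.
    import datachain as dc

    # `target` may be fully qualified, e.g. "dev.analytics.metrics".
    dc.read_values(scores=[1.2, 3.4, 2.5]).save(target)
```

In Studio, save_scores("dev.analytics.metrics") would create dev and analytics on demand. In the CLI, the only fully qualified form that should work is save_scores("local.local.metrics").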
Using .settings() to Set Namespace and Project
You can also set the namespace and project using .settings():
import datachain as dc
dc.read_values(scores=[1.2, 3.4, 2.5]) \
    .settings(namespace="dev", project="analytics") \
    .save("metrics")
This is equivalent to saving to dev.analytics.metrics.
In the CLI, .settings() is only supported when both namespace and project are set to "local".
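For the CLI case, a sketch of the one supported .settings() combination might look like this. The wrapper function name is hypothetical; by analogy with the Studio case, the result should be equivalent to saving to local.local.metrics.

```python
def save_metrics_locally() -> None:
    """Save metrics with the namespace and project both set to "local"."""
    # Imported lazily so the file loads without DataChain installed.
    import datachain as dc

    (
        dc.read_values(scores=[1.2, 3.4, 2.5])
        # The only combination the CLI accepts, per the rule above.
        .settings(namespace="local", project="local")
        .save("metrics")
    )
```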
Reading a Dataset from a Project
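To read a dataset from a specific namespace and project, pass its fully qualified name to dc.read_dataset(), for example dc.read_dataset("dev.analytics.metrics").
In the CLI, this only works for datasets saved in the default local.local project.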
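The following sketch shows a CLI round trip under the defaults described on this page: the dataset is saved without a namespace or project, then read back through its fully qualified local.local name. The wrapper function name is hypothetical.

```python
def save_then_read_local():
    """Save metrics with CLI defaults, then read it back by its full name."""
    # Imported lazily so the file loads without DataChain installed.
    import datachain as dc

    # With no namespace or project given, the CLI falls back to local.local.
    dc.read_values(scores=[1.2, 3.4, 2.5]).save("metrics")

    # Read the same dataset back through its fully qualified name.
    return dc.read_dataset("local.local.metrics")
```

The returned chain can then be inspected with .show(), as in the Studio example.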
Example (Studio)
import datachain as dc
dc.create_project("prod", "analytics")
dc.read_csv("gs://bucket/metrics.csv") \
.save("prod.analytics.metrics")
ds = dc.read_dataset("prod.analytics.metrics")
ds.show()