Background

The catalog builder project is a “Python community package ecosystem” that lets you generate data catalogs compatible with intake-esm. It is available as a Conda package.

See our GitHub repository here. The repository documents our contributing guidelines and code of conduct. We welcome your contributions.

Brief overview on data catalogs

Data catalogs enable “data discoverability” regardless of the data format (for example, Zarr or NetCDF). We acknowledge the community collaborations (Pangeo/ESGF Cloud Data working group) that led us to explore this further.

Data catalogs in this project have three components: the catalog specification, the catalog itself, and the Intake-ESM API. The Intake-ESM API makes use of the specifications and catalogs generated by the catalog builder API. Read more about Intake-ESM here.

Catalog Specification

  • Describes what we expect to find inside the catalog and how to open the “datasets” (objects)

  • Provides metadata about the catalog

  • Identifies how multiple files can be aggregated into a single “dataset”

  • Supports extensible metadata

  • Single JSON file
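
As a rough illustration of the points above, the sketch below builds a minimal specification with Python's standard library. The field names follow the ESM collection specification used by intake-esm as we understand it. The catalog id, column names, and file names are hypothetical placeholders, not the actual output of the catalog builder.

```python
import json

# Hypothetical catalog specification; every value below is a placeholder.
spec = {
    "esmcat_version": "0.0.1",
    "id": "example_catalog",
    "description": "Illustrative catalog specification",
    # The CSV catalog that this specification describes
    "catalog_file": "example_catalog.csv",
    # Metadata columns available in the CSV catalog
    "attributes": [
        {"column_name": "experiment_id"},
        {"column_name": "variable_id"},
    ],
    # Which column holds the file paths, and how to open those files
    "assets": {"column_name": "path", "format": "netcdf"},
    # How multiple files are aggregated into a single "dataset"
    "aggregation_control": {
        "variable_column_name": "variable_id",
        "groupby_attrs": ["experiment_id"],
        "aggregations": [
            {"type": "union", "attribute_name": "variable_id"},
        ],
    },
}

# The specification is serialized as a single JSON document.
spec_text = json.dumps(spec, indent=2)
print(spec_text)
```

In this sketch, files that share the same `experiment_id` would be grouped into one dataset, with their variables combined by a union.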

Catalogs

  • Tells us more about the data collection

  • Lists the path to each file (object) and its associated metadata

  • CSV file

  • User-defined granularity
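
To make the catalog layout concrete, here is a minimal sketch that writes a few catalog rows as CSV text using only the standard library. The column names and file paths are hypothetical and chosen for illustration. A real catalog would use whatever columns and granularity suit the data collection.

```python
import csv
import io

# Hypothetical rows: one row per file (object), holding its path and metadata.
columns = ["experiment_id", "variable_id", "path"]
rows = [
    {"experiment_id": "historical", "variable_id": "tas", "path": "/data/historical/tas_1850-1899.nc"},
    {"experiment_id": "historical", "variable_id": "tas", "path": "/data/historical/tas_1900-1949.nc"},
    {"experiment_id": "historical", "variable_id": "pr", "path": "/data/historical/pr_1850-1899.nc"},
]

# Write the header and rows into an in-memory buffer as CSV text.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=columns)
writer.writeheader()
writer.writerows(rows)

catalog_csv = buffer.getvalue()
print(catalog_csv)
```

Here the granularity is one row per file. A coarser or finer catalog would simply change what each row points to and which metadata columns it carries.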

Intake-ESM API

  • Opens up possibilities to query and analyze the data

  • Provides a Pythonic way to “query” for information in the catalogs

  • Loads the results into an xarray Dataset object
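
The query-and-load workflow described above can be sketched roughly as follows, assuming intake-esm is installed and a specification file with its CSV catalog already exists. The helper name `query_catalog` is hypothetical and not part of either project. It only wraps the intake-esm calls `open_esm_datastore`, `search`, and `to_dataset_dict`.

```python
def query_catalog(spec_path, **query):
    """Open a catalog, filter it by column values, and load the matches.

    `spec_path` is the path (or URL) of the JSON catalog specification;
    `query` maps catalog column names to the values to search for.
    """
    # Imported lazily so that defining this helper does not require intake-esm.
    import intake

    # Open the catalog described by the JSON specification.
    catalog = intake.open_esm_datastore(spec_path)

    # Keep only the rows whose columns match the query.
    subset = catalog.search(**query)

    # Load the matching files; the result maps dataset keys to xarray Datasets.
    return subset.to_dataset_dict()
```

A call might look like `datasets = query_catalog("example_catalog.json", variable_id="tas")`, where both the file name and the `variable_id` column are placeholders for whatever your own catalog defines.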