Data Catalog

Purpose: Unified data catalog for tracking and discovering datasets

Status: 🚧 Planned


Overview

The data catalog provides a centralized registry for all datasets across the workspace, enabling discovery, access control, and metadata management.


Features

  • Dataset registration
  • Metadata management
  • Search and discovery
  • Access control
  • Schema tracking
  • Lineage tracking

Schema

See metadata-schema.json for the complete metadata schema.

Key Fields

  • id: Unique dataset identifier
  • name: Human-readable name
  • source: Source system/project
  • storage: Storage location details
  • schema: Data schema definition
  • tags: Categorization tags
  • access: Access control settings
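
To make the field list concrete, here is a minimal Python sketch that reports which key fields a record is missing. It assumes every key field listed above is required, which metadata-schema.json may not actually specify; `KEY_FIELDS` and `missing_key_fields` are hypothetical names used only for illustration.

```python
# Key fields from the list above. Treating all of them as required is an
# assumption; metadata-schema.json is the authority on what is mandatory.
KEY_FIELDS = ("id", "name", "source", "storage", "schema", "tags", "access")


def missing_key_fields(record: dict) -> list[str]:
    """Return the key fields absent from a dataset record, in list order."""
    return [field for field in KEY_FIELDS if field not in record]


if __name__ == "__main__":
    record = {"id": "user-events-2025", "name": "User Events 2025"}
    print(missing_key_fields(record))
```

A check like this could run before registration so that incomplete records are rejected early.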

Implementation Options

Option 1: Custom API

  • Build a custom API using the workspace's shared packages
  • Use PostgreSQL for metadata storage
  • Implement search with PostgreSQL's built-in full-text search
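
As a rough illustration of the full-text search in Option 1, the sketch below builds a parameterized PostgreSQL query using `to_tsvector`, `plainto_tsquery`, and `ts_rank`. The table name `datasets` and the columns `id`, `name`, and `description` are assumptions, not part of the current design, and the `%s` placeholders assume a driver such as psycopg.

```python
# Hypothetical table and column names: datasets(id, name, description).
SEARCH_SQL = """
SELECT id, name,
       ts_rank(
           to_tsvector('english', name || ' ' || coalesce(description, '')),
           plainto_tsquery('english', %s)
       ) AS rank
FROM datasets
WHERE to_tsvector('english', name || ' ' || coalesce(description, ''))
      @@ plainto_tsquery('english', %s)
ORDER BY rank DESC
LIMIT %s
"""


def build_search_query(text: str, limit: int = 20) -> tuple[str, tuple]:
    """Return the SQL and its parameters; the caller passes both to a DB driver."""
    if limit < 1:
        raise ValueError("limit must be positive")
    # The search text appears twice: once for ranking, once for filtering.
    return SEARCH_SQL, (text, text, limit)
```

The returned pair would be passed to the driver, for example `cursor.execute(sql, params)`. A real deployment would probably store the `tsvector` in its own indexed column rather than computing it for every row on each query.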

Option 2: DataHub

  • Deploy DataHub (open-source)
  • Use its existing metadata models
  • Leverage built-in search, lineage, and access policies instead of building them

Option 3: Amundsen

  • Deploy Amundsen (open-source)
  • Use its existing metadata models
  • Leverage built-in search and discovery instead of building them

Usage

Register Dataset

{
  "id": "user-events-2025",
  "name": "User Events 2025",
  "description": "User interaction events for 2025",
  "source": "analytics-service",
  "storage": {
    "type": "minio",
    "bucket": "analytics",
    "path": "events/2025/"
  },
  "format": "parquet",
  "tags": ["events", "analytics", "2025"],
  "owner": "analytics-team",
  "access": {
    "level": "internal",
    "permissions": ["read"]
  }
}
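
Since the registration API is not implemented yet, the following is only a sketch of how a client might submit the record above. It assumes registration will be a `POST` to `/api/catalog/datasets`, mirroring the search paths in the next section; that endpoint, the base URL, and the function name `build_register_request` are all hypothetical.

```python
import json
import urllib.request


def build_register_request(base_url: str, record: dict) -> urllib.request.Request:
    """Build (but do not send) a POST request carrying the dataset record."""
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/api/catalog/datasets",  # assumed endpoint
        data=json.dumps(record).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


if __name__ == "__main__":
    request = build_register_request(
        "http://localhost:8000",  # placeholder base URL
        {"id": "user-events-2025", "name": "User Events 2025"},
    )
    print(request.get_method(), request.full_url)
```

Once a server exists, the request could be sent with `urllib.request.urlopen(request)`.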

Search Datasets

# Search by tag
GET /api/catalog/datasets?tag=analytics

# Search by source
GET /api/catalog/datasets?source=analytics-service

# Full-text search
GET /api/catalog/datasets?q=user+events
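
The query strings above can be generated with Python's standard library, which also handles the encoding of spaces as `+`. This is a small sketch; the base URL and the name `build_search_url` are placeholders, and whether the API will accept several filters in one request is not yet specified.

```python
from urllib.parse import urlencode


def build_search_url(base_url: str, **filters: str) -> str:
    """Return a search URL for the given filters (tag, source, or q)."""
    query = urlencode(filters)  # encodes spaces as "+", e.g. q=user+events
    return f"{base_url.rstrip('/')}/api/catalog/datasets?{query}"


if __name__ == "__main__":
    print(build_search_url("http://localhost:8000", q="user events"))
```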

Next Steps

  1. Choose implementation option
  2. Set up metadata storage
  3. Implement registration API
  4. Implement search functionality
  5. Set up access control
  6. Integrate with projects

Status: 🚧 Planned - Schema and design complete, implementation pending