Data Catalog
Purpose: Unified data catalog for tracking and discovering datasets
Status: 🚧 Planned
Overview
The data catalog provides a centralized registry for all datasets across the workspace, enabling discovery, access control, and metadata management.
Features
- Dataset registration
- Metadata management
- Search and discovery
- Access control
- Schema tracking
- Lineage tracking
Schema
See metadata-schema.json for the complete metadata schema.
Key Fields
- id: Unique dataset identifier
- name: Human-readable name
- source: Source system/project
- storage: Storage location details
- schema: Data schema definition
- tags: Categorization tags
- access: Access control settings
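As a rough illustration of how the key fields might fit together in code, here is a minimal Python sketch. The class name `DatasetRecord`, the field types, and the rule that `id`, `name`, and `source` must be non-empty are assumptions made for this example; metadata-schema.json remains the authoritative definition.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass
class DatasetRecord:
    """Hypothetical in-memory shape of one catalog entry (not the real schema)."""

    id: str                      # unique dataset identifier
    name: str                    # human-readable name
    source: str                  # source system/project
    storage: Dict[str, Any]      # storage location details
    schema: Dict[str, Any] = field(default_factory=dict)  # data schema definition
    tags: List[str] = field(default_factory=list)         # categorization tags
    access: Dict[str, Any] = field(default_factory=dict)  # access control settings

    def __post_init__(self) -> None:
        # Assumed rule for this sketch: identifying fields must be non-empty strings.
        for attr in ("id", "name", "source"):
            value = getattr(self, attr)
            if not isinstance(value, str) or not value.strip():
                raise ValueError(f"{attr} must be a non-empty string")
```

Optional fields default to empty containers, so a record can be created with only the identifying fields and a storage location.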
Implementation Options
Option 1: Custom API
- Build custom API using shared packages
- Use PostgreSQL for metadata storage
- Implement search using PostgreSQL full-text search
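If Option 1 were chosen, the full-text search could be sketched roughly as below. The table name `datasets`, the columns `name` and `description`, and the `%s` placeholder style (as used by psycopg) are assumptions for illustration; `to_tsvector` and `plainto_tsquery` are standard PostgreSQL text-search functions.

```python
from typing import Tuple


def build_search_query(text: str, limit: int = 20) -> Tuple[str, tuple]:
    """Return a parameterized SQL string and its parameters.

    Hypothetical layout: a `datasets` table with `id`, `name`, and
    `description` columns. The user's text is passed as a parameter and
    never interpolated into the SQL string.
    """
    sql = (
        "SELECT id, name "
        "FROM datasets "
        "WHERE to_tsvector('english', name || ' ' || coalesce(description, '')) "
        "@@ plainto_tsquery('english', %s) "
        "ORDER BY name "
        "LIMIT %s"
    )
    return sql, (text, limit)
```

The function only builds the query; executing it would need a database driver and a connection, which are left out of this sketch.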
Option 2: DataHub
- Deploy DataHub (open-source)
- Use existing metadata models
- Leverage built-in features such as search, lineage, and access policies
Option 3: Amundsen
- Deploy Amundsen (open-source)
- Use existing metadata models
- Leverage built-in features such as search and discovery
Usage
Register Dataset
{
  "id": "user-events-2025",
  "name": "User Events 2025",
  "description": "User interaction events for 2025",
  "source": "analytics-service",
  "storage": {
    "type": "minio",
    "bucket": "analytics",
    "path": "events/2025/"
  },
  "format": "parquet",
  "tags": ["events", "analytics", "2025"],
  "owner": "analytics-team",
  "access": {
    "level": "internal",
    "permissions": ["read"]
  }
}
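Since the registration API is not implemented yet, the following is only a sketch of what registering such a payload might involve, using an in-memory store. The class `InMemoryCatalog`, its methods, and the rule that duplicate ids are rejected are all assumptions for this example.

```python
from typing import Any, Dict


class InMemoryCatalog:
    """Hypothetical stand-in for the planned registration API."""

    def __init__(self) -> None:
        self._datasets: Dict[str, Dict[str, Any]] = {}

    def register(self, payload: Dict[str, Any]) -> str:
        """Store a dataset payload under its id and return that id."""
        dataset_id = payload.get("id")
        if not isinstance(dataset_id, str) or not dataset_id:
            raise ValueError("payload needs a non-empty string 'id'")
        # Assumed rule: ids are unique, so re-registering is an error.
        if dataset_id in self._datasets:
            raise KeyError(f"dataset already registered: {dataset_id}")
        self._datasets[dataset_id] = dict(payload)  # shallow copy of the payload
        return dataset_id

    def get(self, dataset_id: str) -> Dict[str, Any]:
        """Return the stored payload for a registered id."""
        return self._datasets[dataset_id]
```

A payload like the one above could be loaded with `json.loads` and passed to `register`; a real implementation would also validate it against metadata-schema.json.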
Search Datasets
# Search by tag
GET /api/catalog/datasets?tag=analytics
# Search by source
GET /api/catalog/datasets?source=analytics-service
# Full-text search
GET /api/catalog/datasets?q=user+events
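A client could build these search URLs with a small helper like the one sketched here. The function name `search_url` is hypothetical, and whether the planned API accepts several filters in one request is an assumption; only the path and the `tag`, `source`, and `q` parameters come from the examples above.

```python
from typing import Dict, Optional
from urllib.parse import urlencode

BASE_PATH = "/api/catalog/datasets"


def search_url(
    tag: Optional[str] = None,
    source: Optional[str] = None,
    q: Optional[str] = None,
) -> str:
    """Build a search URL path with query string from the given filters."""
    params: Dict[str, str] = {}
    if tag is not None:
        params["tag"] = tag
    if source is not None:
        params["source"] = source
    if q is not None:
        params["q"] = q
    if not params:
        return BASE_PATH
    # urlencode escapes values and encodes spaces as '+'.
    return f"{BASE_PATH}?{urlencode(params)}"
```

For instance, `search_url(q="user events")` yields `/api/catalog/datasets?q=user+events`, matching the full-text example.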
Next Steps
- Choose implementation option
- Set up metadata storage
- Implement registration API
- Implement search functionality
- Set up access control
- Integrate with projects
Status: 🚧 Planned - Schema and design complete, implementation pending