# Data Catalog

**Purpose**: Unified data catalog for tracking and discovering datasets

**Status**: 🚧 Planned

---

## Overview

The data catalog provides a centralized registry for all datasets across the workspace, enabling discovery, access control, and metadata management.

---

## Features

- Dataset registration
- Metadata management
- Search and discovery
- Access control
- Schema tracking
- Lineage tracking

---

## Schema

See `metadata-schema.json` for the complete metadata schema.

### Key Fields

- **id**: Unique dataset identifier
- **name**: Human-readable name
- **source**: Source system/project
- **storage**: Storage location details
- **schema**: Data schema definition
- **tags**: Categorization tags
- **access**: Access control settings
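
As a rough illustration of how the key fields above might be checked before a record is accepted, the sketch below reports which of them a record leaves out. The helper name `missing_key_fields` is hypothetical, and whether each field is actually mandatory is an assumption: `metadata-schema.json` remains the authority.

```python
# Key fields as listed in this README. Treating all of them as expected is an
# assumption; metadata-schema.json defines what is really required.
KEY_FIELDS = ("id", "name", "source", "storage", "schema", "tags", "access")


def missing_key_fields(record: dict) -> list[str]:
    """Return the key fields that are absent from a metadata record."""
    return [field for field in KEY_FIELDS if field not in record]


record = {
    "id": "user-events-2025",
    "name": "User Events 2025",
    "source": "analytics-service",
    "storage": {"type": "minio", "bucket": "analytics", "path": "events/2025/"},
    "tags": ["events", "analytics", "2025"],
    "access": {"level": "internal", "permissions": ["read"]},
}
print(missing_key_fields(record))  # ['schema']
```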

---

## Implementation Options

### Option 1: Custom API

- Build custom API using shared packages
- Use PostgreSQL for metadata storage
- Implement search using PostgreSQL full-text search
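
To make the full-text search idea in Option 1 more concrete, here is a minimal sketch of a parameterized query built on PostgreSQL's `to_tsvector` and `plainto_tsquery`. The `datasets` table, its `name` and `description` columns, and the helper `search_params` are assumptions for illustration, as is the psycopg-style `%(name)s` placeholder syntax.

```python
# Hypothetical table and column names; the real layout would follow
# metadata-schema.json. Placeholders use the psycopg named-parameter style.
SEARCH_SQL = """
SELECT id, name
FROM datasets
WHERE to_tsvector('english', name || ' ' || coalesce(description, ''))
      @@ plainto_tsquery('english', %(q)s)
ORDER BY name
LIMIT %(limit)s
"""


def search_params(q: str, limit: int = 20) -> dict:
    """Build bind parameters for SEARCH_SQL; values are never spliced into the SQL text."""
    if limit < 1:
        raise ValueError("limit must be positive")
    return {"q": q.strip(), "limit": limit}
```

With a psycopg cursor this would be run as `cursor.execute(SEARCH_SQL, search_params("user events"))`. A GIN index on the same `to_tsvector` expression is the usual way to keep such queries fast as the catalog grows.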

### Option 2: DataHub

- Deploy DataHub (open-source)
- Use existing metadata models
- Leverage built-in features

### Option 3: Amundsen

- Deploy Amundsen (open-source)
- Use existing metadata models
- Leverage built-in features

---

## Usage

### Register Dataset

```json
{
  "id": "user-events-2025",
  "name": "User Events 2025",
  "description": "User interaction events for 2025",
  "source": "analytics-service",
  "storage": {
    "type": "minio",
    "bucket": "analytics",
    "path": "events/2025/"
  },
  "format": "parquet",
  "tags": ["events", "analytics", "2025"],
  "owner": "analytics-team",
  "access": {
    "level": "internal",
    "permissions": ["read"]
  }
}
```
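
Since the catalog is still planned, no registration endpoint exists yet. As one possible shape, the sketch below builds (without sending) a JSON `POST` request carrying such a record. The base URL, the reuse of `/api/catalog/datasets` for registration, and the helper `build_register_request` are all assumptions.

```python
import json
import urllib.request

# Assumptions: the base URL and the POST endpoint are hypothetical; this README
# only shows GET /api/catalog/datasets, and only for search.
BASE_URL = "http://localhost:8000"


def build_register_request(record: dict) -> urllib.request.Request:
    """Build, but do not send, a JSON POST request that registers a dataset."""
    body = json.dumps(record).encode("utf-8")
    return urllib.request.Request(
        f"{BASE_URL}/api/catalog/datasets",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_register_request({"id": "user-events-2025", "name": "User Events 2025"})
print(request.get_method(), request.full_url)
# POST http://localhost:8000/api/catalog/datasets
```

Passing the request to `urllib.request.urlopen` would send it once a server is available.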

### Search Datasets

```bash
# Search by tag
GET /api/catalog/datasets?tag=analytics

# Search by source
GET /api/catalog/datasets?source=analytics-service

# Full-text search
GET /api/catalog/datasets?q=user+events
```
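
A client would need to URL-encode these query parameters. The sketch below shows one way to build the three search URLs above with the standard library. The helper name `search_url` is hypothetical; the path and parameter names are taken from the examples.

```python
from urllib.parse import urlencode

# Endpoint path taken from the search examples; the helper itself is hypothetical.
SEARCH_PATH = "/api/catalog/datasets"


def search_url(**filters: str) -> str:
    """Build a search URL, dropping empty filters and URL-encoding the rest."""
    params = {key: value for key, value in filters.items() if value}
    return f"{SEARCH_PATH}?{urlencode(params)}" if params else SEARCH_PATH


print(search_url(tag="analytics"))             # /api/catalog/datasets?tag=analytics
print(search_url(source="analytics-service"))  # /api/catalog/datasets?source=analytics-service
print(search_url(q="user events"))             # /api/catalog/datasets?q=user+events
```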

---

## Next Steps

1. Choose implementation option
2. Set up metadata storage
3. Implement registration API
4. Implement search functionality
5. Set up access control
6. Integrate with projects

---

**Status**: 🚧 Planned - Schema and design complete, implementation pending