explorer-monorepo/docs/specs/database/data-lake-schema.md
Data Lake Schema Specification

Overview

This document specifies the data lake schema for long-term storage of blockchain data in S3-compatible object storage using Parquet format for analytics, ML, and compliance purposes.

Storage Structure

Directory Layout

s3://explorer-data-lake/
├── raw/
│   ├── chain_id=138/
│   │   ├── year=2024/
│   │   │   ├── month=01/
│   │   │   │   ├── day=01/
│   │   │   │   │   ├── blocks.parquet
│   │   │   │   │   ├── transactions.parquet
│   │   │   │   │   └── logs.parquet
│   │   │   │   └── ...
│   │   │   └── ...
│   │   └── ...
│   └── ...
├── processed/
│   ├── chain_id=138/
│   │   ├── daily_aggregates/
│   │   │   ├── year=2024/
│   │   │   │   └── month=01/
│   │   │   │       └── day=01.parquet
│   │   └── ...
│   └── ...
└── archived/
    └── ...

Partitioning Strategy

Partition Keys:

  • chain_id: Chain identifier
  • year: Year (YYYY)
  • month: Month (MM)
  • day: Day (DD)

Benefits:

  • Efficient query pruning
  • Parallel processing
  • Easy data management (delete by partition)
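The partition layout above can be sketched as a small helper that builds the S3 prefix for a given chain and date. The bucket name comes from the directory layout; the function itself is illustrative, not part of any library:

```python
from datetime import date

def partition_prefix(chain_id: int, d: date, layer: str = "raw") -> str:
    """Build the Hive-style S3 prefix for one day's data.

    Zero-pads month and day to match the month=01/day=01 layout above.
    """
    return (
        f"s3://explorer-data-lake/{layer}/"
        f"chain_id={chain_id}/"
        f"year={d.year:04d}/month={d.month:02d}/day={d.day:02d}/"
    )

# Prefix for chain 138 on 2024-01-01:
print(partition_prefix(138, date(2024, 1, 1)))
# → s3://explorer-data-lake/raw/chain_id=138/year=2024/month=01/day=01/
```

Keeping the padding consistent matters: mixed month=1 and month=01 prefixes would split the same logical partition in two.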

Parquet Schema

Blocks Parquet Schema

{
  "type": "struct",
  "fields": [
    {"name": "chain_id", "type": "integer", "nullable": false},
    {"name": "number", "type": "long", "nullable": false},
    {"name": "hash", "type": "string", "nullable": false},
    {"name": "parent_hash", "type": "string", "nullable": false},
    {"name": "timestamp", "type": "timestamp", "nullable": false},
    {"name": "miner", "type": "string", "nullable": true},
    {"name": "gas_used", "type": "long", "nullable": true},
    {"name": "gas_limit", "type": "long", "nullable": true},
    {"name": "transaction_count", "type": "integer", "nullable": true},
    {"name": "size", "type": "integer", "nullable": true}
  ]
}

Transactions Parquet Schema

{
  "type": "struct",
  "fields": [
    {"name": "chain_id", "type": "integer", "nullable": false},
    {"name": "hash", "type": "string", "nullable": false},
    {"name": "block_number", "type": "long", "nullable": false},
    {"name": "transaction_index", "type": "integer", "nullable": false},
    {"name": "from_address", "type": "string", "nullable": false},
    {"name": "to_address", "type": "string", "nullable": true},
    {"name": "value", "type": "string", "nullable": false}, // Decimal as string
    {"name": "gas_price", "type": "long", "nullable": true},
    {"name": "gas_used", "type": "long", "nullable": true},
    {"name": "gas_limit", "type": "long", "nullable": false},
    {"name": "status", "type": "integer", "nullable": true},
    {"name": "timestamp", "type": "timestamp", "nullable": false}
  ]
}

Logs Parquet Schema

{
  "type": "struct",
  "fields": [
    {"name": "chain_id", "type": "integer", "nullable": false},
    {"name": "transaction_hash", "type": "string", "nullable": false},
    {"name": "block_number", "type": "long", "nullable": false},
    {"name": "log_index", "type": "integer", "nullable": false},
    {"name": "address", "type": "string", "nullable": false},
    {"name": "topic0", "type": "string", "nullable": true},
    {"name": "topic1", "type": "string", "nullable": true},
    {"name": "topic2", "type": "string", "nullable": true},
    {"name": "topic3", "type": "string", "nullable": true},
    {"name": "data", "type": "string", "nullable": true},
    {"name": "timestamp", "type": "timestamp", "nullable": false}
  ]
}

Token Transfers Parquet Schema

{
  "type": "struct",
  "fields": [
    {"name": "chain_id", "type": "integer", "nullable": false},
    {"name": "transaction_hash", "type": "string", "nullable": false},
    {"name": "block_number", "type": "long", "nullable": false},
    {"name": "token_address", "type": "string", "nullable": false},
    {"name": "token_type", "type": "string", "nullable": false},
    {"name": "from_address", "type": "string", "nullable": false},
    {"name": "to_address", "type": "string", "nullable": false},
    {"name": "amount", "type": "string", "nullable": true},
    {"name": "token_id", "type": "string", "nullable": true},
    {"name": "timestamp", "type": "timestamp", "nullable": false}
  ]
}

Data Ingestion

ETL Pipeline

Process:

  1. Extract: Query PostgreSQL for daily data
  2. Transform: Convert to Parquet format
  3. Load: Upload to S3 with partitioning

Schedule: Daily batch job that runs after each day closes

Tools: Apache Spark, AWS Glue, or custom ETL scripts
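The three steps above can be sketched as a thin orchestrator. The extract, transform, and load callables are placeholders for real PostgreSQL queries, Parquet encoding (e.g. via Spark or pyarrow), and S3 uploads; the function name and signature are illustrative:

```python
from datetime import date
from typing import Callable, Iterable, Mapping

def run_daily_etl(
    chain_id: int,
    day: date,
    extract: Callable[[str, int, date], Iterable[Mapping]],
    transform: Callable[[Iterable[Mapping]], bytes],
    load: Callable[[str, bytes], None],
) -> list[str]:
    """Run Extract -> Transform -> Load for each dataset; return S3 keys."""
    written = []
    for dataset in ("blocks", "transactions", "logs"):
        rows = extract(dataset, chain_id, day)   # 1. Extract from PostgreSQL
        parquet_bytes = transform(rows)          # 2. Transform to Parquet
        key = (
            f"raw/chain_id={chain_id}/year={day.year:04d}/"
            f"month={day.month:02d}/day={day.day:02d}/{dataset}.parquet"
        )
        load(key, parquet_bytes)                 # 3. Load to S3 under the partition
        written.append(key)
    return written
```

Injecting the three steps as callables keeps the orchestration testable with stubs and independent of whichever ETL tool ultimately runs it.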

Compression

Format: Snappy compression (good balance of speed and compression ratio)

Alternative: Gzip (better compression, slower)

File Sizing

Target Size: 100-500 MB per Parquet file

  • Smaller files: Better parallelism
  • Larger files: Better compression

Strategy: Buffer writes until the target size is reached, or split output by time range
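A rough way to pick the number of output files for a day's data, assuming the row count and average row width can be estimated. The helper and its 256 MB default target are illustrative:

```python
import math

def plan_file_count(row_count: int, avg_row_bytes: int,
                    target_file_bytes: int = 256 * 1024 * 1024) -> int:
    """Estimate how many Parquet files to write so each lands near the
    100-500 MB target (default: 256 MB, the middle of the range)."""
    total = row_count * avg_row_bytes
    return max(1, math.ceil(total / target_file_bytes))

# ~1 GB of rows at a 256 MB target → 4 files
print(plan_file_count(row_count=4_000_000, avg_row_bytes=268))
```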

Query Interface

AWS Athena / Presto

Table Definition:

CREATE EXTERNAL TABLE blocks_138 (
  chain_id int,
  number bigint,
  hash string,
  parent_hash string,
  `timestamp` timestamp,
  miner string,
  gas_used bigint,
  gas_limit bigint,
  transaction_count int,
  size int
)
PARTITIONED BY (
  year int,
  month int,
  day int
)
STORED AS PARQUET
LOCATION 's3://explorer-data-lake/raw/chain_id=138/'
TBLPROPERTIES (
  'projection.enabled' = 'true',
  'projection.year.type' = 'integer',
  'projection.year.range' = '2020,2030',
  'projection.month.type' = 'integer',
  'projection.month.range' = '1,12',
  'projection.month.digits' = '2',
  'projection.day.type' = 'integer',
  'projection.day.range' = '1,31',
  'projection.day.digits' = '2'
);

Query Examples

Daily Transaction Count:

SELECT
  DATE("timestamp") AS date,
  COUNT(*) AS transaction_count
FROM transactions_138
WHERE year = 2024 AND month = 1
GROUP BY DATE("timestamp")
ORDER BY date;

Token Transfer Analytics:

SELECT 
  token_address,
  COUNT(*) as transfer_count,
  SUM(CAST(amount AS DECIMAL(78, 0))) as total_volume
FROM token_transfers_138
WHERE year = 2024 AND month = 1
GROUP BY token_address
ORDER BY total_volume DESC
LIMIT 100;
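The CAST to DECIMAL(78, 0) above is why amount (and value) are stored as strings: token amounts are 256-bit unsigned integers, which overflow Parquet's 64-bit long. A quick check in Python:

```python
# The largest uint256 value has 78 decimal digits, far beyond a
# Parquet long (int64, max ~9.2e18) — hence string storage in the schemas.
MAX_UINT256 = 2**256 - 1
MAX_INT64 = 2**63 - 1

assert MAX_UINT256 > MAX_INT64
assert len(str(MAX_UINT256)) == 78  # matches DECIMAL(78, 0) in the query

# Summing string amounts is exact with Python's arbitrary-precision ints:
amounts = ["1000000000000000000", "2500000000000000000"]  # 1.0 + 2.5 tokens at 18 decimals
total = sum(int(a) for a in amounts)
print(total)  # 3500000000000000000
```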

Data Retention

Retention Policies

  • Raw Data: 7 years (compliance requirement)
  • Processed Aggregates: Indefinite
  • Archived Data: Move to Glacier after 1 year

Lifecycle Policies

S3 Lifecycle Rules:

  1. Move to Infrequent Access after 30 days
  2. Move to Glacier after 1 year
  3. Delete after 7 years (raw data)
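As a sketch, the three rules could be expressed as a single S3 lifecycle configuration on the raw/ prefix (2555 days ≈ 7 years; the rule ID is illustrative):

```json
{
  "Rules": [
    {
      "ID": "raw-data-lifecycle",
      "Filter": {"Prefix": "raw/"},
      "Status": "Enabled",
      "Transitions": [
        {"Days": 30, "StorageClass": "STANDARD_IA"},
        {"Days": 365, "StorageClass": "GLACIER"}
      ],
      "Expiration": {"Days": 2555}
    }
  ]
}
```

Processed aggregates, retained indefinitely, would sit outside this rule's prefix filter.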

Data Processing

Aggregation Jobs

Daily Aggregates:

  • Transaction counts by hour
  • Gas usage statistics
  • Token transfer volumes
  • Address activity metrics

Monthly Aggregates:

  • Network growth metrics
  • Token distribution changes
  • Protocol usage statistics
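For example, the hourly transaction-count aggregate could be computed with a query like the following (table and partition names follow the Athena definitions above):

```sql
-- Hourly transaction counts and gas usage for one day
SELECT
  DATE_TRUNC('hour', "timestamp") AS hour,
  COUNT(*) AS transaction_count,
  SUM(gas_used) AS total_gas_used
FROM transactions_138
WHERE year = 2024 AND month = 1 AND day = 1
GROUP BY DATE_TRUNC('hour', "timestamp")
ORDER BY hour;
```

The result would be written back under processed/ as a daily aggregate file.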

ML/Analytics Workflows

Use Cases:

  • Anomaly detection
  • Fraud detection
  • Market analysis
  • Network health monitoring

Tools: Spark, Pandas, Jupyter notebooks

Security and Access Control

Access Control

  • IAM Policies: Restrict access to specific prefixes
  • Encryption: Server-side encryption (SSE-S3 or SSE-KMS)
  • Audit Logging: Enable S3 access logging
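As an illustration, an IAM policy restricting a role to the public raw/ prefix might look like this (statement IDs are hypothetical):

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "ReadRawChainData",
      "Effect": "Allow",
      "Action": ["s3:GetObject"],
      "Resource": "arn:aws:s3:::explorer-data-lake/raw/*"
    },
    {
      "Sid": "ListRawPrefixOnly",
      "Effect": "Allow",
      "Action": ["s3:ListBucket"],
      "Resource": "arn:aws:s3:::explorer-data-lake",
      "Condition": {"StringLike": {"s3:prefix": ["raw/*"]}}
    }
  ]
}
```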

Data Classification

  • Public Data: Blocks, transactions (public blockchain data)
  • Sensitive Data: User addresses, labels (requires authentication)
  • Compliance Data: Banking/transaction data (strict access control)

Cost Optimization

Storage Optimization

Strategies:

  • Use appropriate storage classes (Standard, IA, Glacier)
  • Compress data (Parquet + Snappy)
  • Delete old data per retention policy
  • Use intelligent tiering

Query Optimization

Strategies:

  • Partition pruning (query only relevant partitions)
  • Column pruning (select only needed columns)
  • Predicate pushdown (filter early)
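All three strategies can appear in a single query:

```sql
-- Partition pruning (year/month/day predicates limit the files scanned),
-- column pruning (only two columns are read from the Parquet files),
-- predicate pushdown (the status filter is applied during the scan)
SELECT from_address, gas_used
FROM transactions_138
WHERE year = 2024 AND month = 1 AND day = 15
  AND status = 1;
```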

References

  • Database Schema: See postgres-schema.md
  • Analytics: See ../observability/metrics-monitoring.md