Data Lake Schema Specification
Overview
This document specifies the data lake schema for long-term storage of blockchain data in S3-compatible object storage using Parquet format for analytics, ML, and compliance purposes.
Storage Structure
Directory Layout
s3://explorer-data-lake/
├── raw/
│ ├── chain_id=138/
│ │ ├── year=2024/
│ │ │ ├── month=01/
│ │ │ │ ├── day=01/
│ │ │ │ │ ├── blocks.parquet
│ │ │ │ │ ├── transactions.parquet
│ │ │ │ │ └── logs.parquet
│ │ │ │ └── ...
│ │ │ └── ...
│ │ └── ...
│ └── ...
├── processed/
│ ├── chain_id=138/
│ │ ├── daily_aggregates/
│ │ │ ├── year=2024/
│ │ │ │ └── month=01/
│ │ │ │ └── day=01.parquet
│ │ └── ...
│ └── ...
└── archived/
└── ...
Partitioning Strategy
Partition Keys:
- chain_id: Chain identifier
- year: Year (YYYY)
- month: Month (MM)
- day: Day (DD)
Benefits:
- Efficient query pruning
- Parallel processing
- Easy data management (delete by partition)
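The partition keys above map directly onto Hive-style key=value path segments. A minimal sketch of building a partition prefix (bucket name taken from the layout above; the helper name is illustrative):

```python
from datetime import date

def partition_prefix(chain_id: int, d: date, layer: str = "raw") -> str:
    """Hive-style partition prefix for one chain and day.

    Zero-padded month/day match the month=01 / day=01 convention above.
    """
    return (
        f"s3://explorer-data-lake/{layer}/"
        f"chain_id={chain_id}/year={d.year:04d}/"
        f"month={d.month:02d}/day={d.day:02d}/"
    )

# partition_prefix(138, date(2024, 1, 1))
# -> 's3://explorer-data-lake/raw/chain_id=138/year=2024/month=01/day=01/'
```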
Parquet Schema
Blocks Parquet Schema
{
"type": "struct",
"fields": [
{"name": "chain_id", "type": "integer", "nullable": false},
{"name": "number", "type": "long", "nullable": false},
{"name": "hash", "type": "string", "nullable": false},
{"name": "parent_hash", "type": "string", "nullable": false},
{"name": "timestamp", "type": "timestamp", "nullable": false},
{"name": "miner", "type": "string", "nullable": true},
{"name": "gas_used", "type": "long", "nullable": true},
{"name": "gas_limit", "type": "long", "nullable": true},
{"name": "transaction_count", "type": "integer", "nullable": true},
{"name": "size", "type": "integer", "nullable": true}
]
}
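As an illustration of how the nullability rules above can be enforced before writing, here is a stdlib-only record check. In practice a Parquet writer (e.g. pyarrow) would enforce the schema; `BLOCK_FIELDS` and `validate_block` are hypothetical names:

```python
from datetime import datetime

# field name -> (Python type, nullable); mirrors the blocks schema above
BLOCK_FIELDS = {
    "chain_id": (int, False),
    "number": (int, False),
    "hash": (str, False),
    "parent_hash": (str, False),
    "timestamp": (datetime, False),
    "miner": (str, True),
    "gas_used": (int, True),
    "gas_limit": (int, True),
    "transaction_count": (int, True),
    "size": (int, True),
}

def validate_block(record: dict) -> list:
    """Return a list of violations; an empty list means the record conforms."""
    errors = []
    for name, (py_type, nullable) in BLOCK_FIELDS.items():
        value = record.get(name)
        if value is None:
            if not nullable:
                errors.append(f"{name}: null not allowed")
        elif not isinstance(value, py_type):
            errors.append(f"{name}: expected {py_type.__name__}")
    return errors
```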
Transactions Parquet Schema
{
"type": "struct",
"fields": [
{"name": "chain_id", "type": "integer", "nullable": false},
{"name": "hash", "type": "string", "nullable": false},
{"name": "block_number", "type": "long", "nullable": false},
{"name": "transaction_index", "type": "integer", "nullable": false},
{"name": "from_address", "type": "string", "nullable": false},
{"name": "to_address", "type": "string", "nullable": true},
{"name": "value", "type": "string", "nullable": false}, // Decimal as string
{"name": "gas_price", "type": "long", "nullable": true},
{"name": "gas_used", "type": "long", "nullable": true},
{"name": "gas_limit", "type": "long", "nullable": false},
{"name": "status", "type": "integer", "nullable": true},
{"name": "timestamp", "type": "timestamp", "nullable": false}
]
}
Logs Parquet Schema
{
"type": "struct",
"fields": [
{"name": "chain_id", "type": "integer", "nullable": false},
{"name": "transaction_hash", "type": "string", "nullable": false},
{"name": "block_number", "type": "long", "nullable": false},
{"name": "log_index", "type": "integer", "nullable": false},
{"name": "address", "type": "string", "nullable": false},
{"name": "topic0", "type": "string", "nullable": true},
{"name": "topic1", "type": "string", "nullable": true},
{"name": "topic2", "type": "string", "nullable": true},
{"name": "topic3", "type": "string", "nullable": true},
{"name": "data", "type": "string", "nullable": true},
{"name": "timestamp", "type": "timestamp", "nullable": false}
]
}
Token Transfers Parquet Schema
{
"type": "struct",
"fields": [
{"name": "chain_id", "type": "integer", "nullable": false},
{"name": "transaction_hash", "type": "string", "nullable": false},
{"name": "block_number", "type": "long", "nullable": false},
{"name": "token_address", "type": "string", "nullable": false},
{"name": "token_type", "type": "string", "nullable": false},
{"name": "from_address", "type": "string", "nullable": false},
{"name": "to_address", "type": "string", "nullable": false},
{"name": "amount", "type": "string", "nullable": true},
{"name": "token_id", "type": "string", "nullable": true},
{"name": "timestamp", "type": "timestamp", "nullable": false}
]
}
Data Ingestion
ETL Pipeline
Process:
- Extract: Query PostgreSQL for daily data
- Transform: Convert to Parquet format
- Load: Upload to S3 with partitioning
Schedule: Daily batch job, run after the day closes so each partition is written complete
Tools: Apache Spark, AWS Glue, or custom ETL scripts
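The three stages can be sketched as a small orchestration function. The stage callables are injected so the flow is testable without PostgreSQL or S3; all names here are illustrative, not a prescribed API:

```python
from datetime import date
from typing import Callable, Iterable

def run_daily_etl(
    chain_id: int,
    day: date,
    extract: Callable[[int, date], Iterable],  # e.g. query PostgreSQL for the day
    transform: Callable[[Iterable], bytes],    # e.g. encode rows as Parquet
    load: Callable[[str, bytes], None],        # e.g. upload bytes to the S3 key
) -> str:
    """Run one day's pipeline for one table and return the object key."""
    key = (
        f"raw/chain_id={chain_id}/year={day.year:04d}/"
        f"month={day.month:02d}/day={day.day:02d}/blocks.parquet"
    )
    rows = extract(chain_id, day)
    payload = transform(rows)
    load(key, payload)  # writing the same key on re-runs keeps the job idempotent
    return key
```

Because the output key is derived from the partition date, re-running a failed day overwrites the same object rather than duplicating data.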
Compression
Format: Snappy compression (good balance of speed and compression ratio)
Alternative: Gzip (better compression, slower)
File Sizing
Target Size: 100-500 MB per Parquet file
- Smaller files: Better parallelism
- Larger files: Better compression
Strategy: Write files of target size, or split by time ranges
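One way to implement the sizing strategy is to cut the row stream into chunks near the target size using a per-row size estimate. Actual Parquet file sizes depend on encoding and compression, so this is only a heuristic sketch:

```python
from typing import Iterable, Iterator, List

def chunk_rows(
    rows: Iterable[dict],
    row_size: int = 512,                    # assumed average encoded bytes per row
    target_bytes: int = 250 * 1024 * 1024,  # middle of the 100-500 MB range
) -> Iterator[List[dict]]:
    """Yield row chunks whose estimated encoded size is near target_bytes."""
    chunk, size = [], 0
    for row in rows:
        chunk.append(row)
        size += row_size
        if size >= target_bytes:
            yield chunk
            chunk, size = [], 0
    if chunk:
        yield chunk  # final partial chunk
```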
Query Interface
AWS Athena / Presto
Table Definition:
CREATE EXTERNAL TABLE blocks_138 (
  chain_id int,
  number bigint,
  hash string,
  parent_hash string,
  `timestamp` timestamp,
  miner string,
  gas_used bigint,
  gas_limit bigint,
  transaction_count int,
  size int
)
PARTITIONED BY (
  year int,
  month int,
  day int
)
STORED AS PARQUET
LOCATION 's3://explorer-data-lake/raw/chain_id=138/'
TBLPROPERTIES (
  'projection.enabled' = 'true',
  'projection.year.type' = 'integer',
  'projection.year.range' = '2020,2030',
  'projection.month.type' = 'integer',
  'projection.month.range' = '1,12',
  'projection.month.digits' = '2',
  'projection.day.type' = 'integer',
  'projection.day.range' = '1,31',
  'projection.day.digits' = '2'
);
Note: partition projection requires the year/month/day columns to be declared in PARTITIONED BY, and the digits properties match the zero-padded month=01 / day=01 path convention. Also, because blocks.parquet, transactions.parquet, and logs.parquet share the same day prefix in the layout above, a per-table LOCATION like this one would scan all three files; the tables either need separate prefixes or a layout that isolates each table's files.
Query Examples
Daily Transaction Count:
SELECT
  DATE("timestamp") AS tx_date,
  COUNT(*) AS transaction_count
FROM transactions_138
WHERE year = 2024 AND month = 1
GROUP BY DATE("timestamp")
ORDER BY tx_date;
Token Transfer Analytics:
SELECT
  token_address,
  COUNT(*) AS transfer_count,
  -- Athena/Presto caps DECIMAL precision at 38 digits; full uint256
  -- amounts (up to 78 digits) must be aggregated client-side.
  SUM(CAST(amount AS DECIMAL(38, 0))) AS total_volume
FROM token_transfers_138
WHERE year = 2024 AND month = 1
GROUP BY token_address
ORDER BY total_volume DESC
LIMIT 100;
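Because token amounts are stored as decimal strings (uint256 values overflow 64-bit columns), client-side aggregation with Python's arbitrary-precision integers is a safe complement to the SQL above; the rows here are illustrative:

```python
transfers = [  # illustrative rows; amounts are base-unit decimal strings
    {"token_address": "0xaaa", "amount": "1000000000000000000"},
    {"token_address": "0xaaa", "amount": "2500000000000000000"},
    {"token_address": "0xbbb", "amount": "42"},
]

volume = {}
for t in transfers:
    # Python ints are arbitrary precision, so uint256 amounts cannot overflow
    volume[t["token_address"]] = volume.get(t["token_address"], 0) + int(t["amount"])

# volume == {"0xaaa": 3500000000000000000, "0xbbb": 42}
```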
Data Retention
Retention Policies
Raw Data: 7 years (compliance requirement)
Processed Aggregates: Indefinite
Archived Data: Move to Glacier after 1 year
Lifecycle Policies
S3 Lifecycle Rules:
- Move to Infrequent Access after 30 days
- Move to Glacier after 1 year
- Delete after 7 years (raw data)
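Expressed as an S3 lifecycle configuration, the rules above might look like the following (prefix and rule ID are illustrative; 2555 days approximates the 7-year retention):

```json
{
  "Rules": [
    {
      "ID": "raw-data-tiering",
      "Filter": {"Prefix": "raw/"},
      "Status": "Enabled",
      "Transitions": [
        {"Days": 30, "StorageClass": "STANDARD_IA"},
        {"Days": 365, "StorageClass": "GLACIER"}
      ],
      "Expiration": {"Days": 2555}
    }
  ]
}
```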
Data Processing
Aggregation Jobs
Daily Aggregates:
- Transaction counts by hour
- Gas usage statistics
- Token transfer volumes
- Address activity metrics
Monthly Aggregates:
- Network growth metrics
- Token distribution changes
- Protocol usage statistics
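As a toy illustration of the first daily aggregate, hourly transaction counts can be computed from timestamped rows with the standard library; a production job would run the equivalent in Spark or SQL over the Parquet files:

```python
from collections import Counter
from datetime import datetime

def tx_count_by_hour(transactions):
    """Count transactions per hour-of-day from their timestamps."""
    return Counter(t["timestamp"].hour for t in transactions)

txs = [  # illustrative rows
    {"timestamp": datetime(2024, 1, 1, 0, 5)},
    {"timestamp": datetime(2024, 1, 1, 0, 59)},
    {"timestamp": datetime(2024, 1, 1, 13, 0)},
]
# tx_count_by_hour(txs) -> Counter({0: 2, 13: 1})
```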
ML/Analytics Workflows
Use Cases:
- Anomaly detection
- Fraud detection
- Market analysis
- Network health monitoring
Tools: Spark, Pandas, Jupyter notebooks
Security and Access Control
Access Control
IAM Policies: Restrict access to specific prefixes
Encryption: Server-side encryption (SSE-S3 or SSE-KMS)
Audit Logging: Enable S3 access logging
Data Classification
Public Data: Blocks, transactions (public blockchain data)
Sensitive Data: User addresses, labels (requires authentication)
Compliance Data: Banking/transaction data (strict access control)
Cost Optimization
Storage Optimization
Strategies:
- Use appropriate storage classes (Standard, IA, Glacier)
- Compress data (Parquet + Snappy)
- Delete old data per retention policy
- Use intelligent tiering
Query Optimization
Strategies:
- Partition pruning (query only relevant partitions)
- Column pruning (select only needed columns)
- Predicate pushdown (filter early)
References
- Database Schema: See postgres-schema.md
- Analytics: See ../observability/metrics-monitoring.md