Data Lake Schema Specification
Overview
This document specifies the data lake schema for long-term storage of blockchain data in S3-compatible object storage using Parquet format for analytics, ML, and compliance purposes.
Storage Structure
Directory Layout
s3://explorer-data-lake/
├── raw/
│ ├── chain_id=138/
│ │ ├── year=2024/
│ │ │ ├── month=01/
│ │ │ │ ├── day=01/
│ │ │ │ │ ├── blocks.parquet
│ │ │ │ │ ├── transactions.parquet
│ │ │ │ │ └── logs.parquet
│ │ │ │ └── ...
│ │ │ └── ...
│ │ └── ...
│ └── ...
├── processed/
│ ├── chain_id=138/
│ │ ├── daily_aggregates/
│ │ │ ├── year=2024/
│ │ │ │ └── month=01/
│ │ │ │ └── day=01.parquet
│ │ └── ...
│ └── ...
└── archived/
└── ...
Partitioning Strategy
Partition Keys:
- chain_id: Chain identifier
- year: Year (YYYY)
- month: Month (MM)
- day: Day (DD)
Benefits:
- Efficient query pruning
- Parallel processing
- Easy data management (delete by partition)
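The partition keys above map directly onto Hive-style key=value path segments. A minimal sketch of building a partition prefix (bucket name taken from the layout above; the helper name is illustrative):

```python
from datetime import date

def partition_prefix(chain_id: int, d: date, layer: str = "raw") -> str:
    """Hive-style partition prefix for one chain and day.

    Zero-padded month/day match the month=01 / day=01 convention above.
    """
    return (
        f"s3://explorer-data-lake/{layer}/"
        f"chain_id={chain_id}/year={d.year:04d}/"
        f"month={d.month:02d}/day={d.day:02d}/"
    )

# partition_prefix(138, date(2024, 1, 1))
# -> 's3://explorer-data-lake/raw/chain_id=138/year=2024/month=01/day=01/'
```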
Parquet Schema
Blocks Parquet Schema
{
"type": "struct",
"fields": [
{"name": "chain_id", "type": "integer", "nullable": false},
{"name": "number", "type": "long", "nullable": false},
{"name": "hash", "type": "string", "nullable": false},
{"name": "parent_hash", "type": "string", "nullable": false},
{"name": "timestamp", "type": "timestamp", "nullable": false},
{"name": "miner", "type": "string", "nullable": true},
{"name": "gas_used", "type": "long", "nullable": true},
{"name": "gas_limit", "type": "long", "nullable": true},
{"name": "transaction_count", "type": "integer", "nullable": true},
{"name": "size", "type": "integer", "nullable": true}
]
}
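As an illustration of how the nullability rules above can be enforced before writing, here is a stdlib-only record check. In practice a Parquet writer (e.g. pyarrow) would enforce the schema; `BLOCK_FIELDS` and `validate_block` are hypothetical names:

```python
from datetime import datetime

# field name -> (Python type, nullable); mirrors the blocks schema above
BLOCK_FIELDS = {
    "chain_id": (int, False),
    "number": (int, False),
    "hash": (str, False),
    "parent_hash": (str, False),
    "timestamp": (datetime, False),
    "miner": (str, True),
    "gas_used": (int, True),
    "gas_limit": (int, True),
    "transaction_count": (int, True),
    "size": (int, True),
}

def validate_block(record: dict) -> list:
    """Return a list of violations; an empty list means the record conforms."""
    errors = []
    for name, (py_type, nullable) in BLOCK_FIELDS.items():
        value = record.get(name)
        if value is None:
            if not nullable:
                errors.append(f"{name}: null not allowed")
        elif not isinstance(value, py_type):
            errors.append(f"{name}: expected {py_type.__name__}")
    return errors
```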
Transactions Parquet Schema
{
"type": "struct",
"fields": [
{"name": "chain_id", "type": "integer", "nullable": false},
{"name": "hash", "type": "string", "nullable": false},
{"name": "block_number", "type": "long", "nullable": false},
{"name": "transaction_index", "type": "integer", "nullable": false},
{"name": "from_address", "type": "string", "nullable": false},
{"name": "to_address", "type": "string", "nullable": true},
{"name": "value", "type": "string", "nullable": false}, // Decimal as string
{"name": "gas_price", "type": "long", "nullable": true},
{"name": "gas_used", "type": "long", "nullable": true},
{"name": "gas_limit", "type": "long", "nullable": false},
{"name": "status", "type": "integer", "nullable": true},
{"name": "timestamp", "type": "timestamp", "nullable": false}
]
}
Logs Parquet Schema
{
"type": "struct",
"fields": [
{"name": "chain_id", "type": "integer", "nullable": false},
{"name": "transaction_hash", "type": "string", "nullable": false},
{"name": "block_number", "type": "long", "nullable": false},
{"name": "log_index", "type": "integer", "nullable": false},
{"name": "address", "type": "string", "nullable": false},
{"name": "topic0", "type": "string", "nullable": true},
{"name": "topic1", "type": "string", "nullable": true},
{"name": "topic2", "type": "string", "nullable": true},
{"name": "topic3", "type": "string", "nullable": true},
{"name": "data", "type": "string", "nullable": true},
{"name": "timestamp", "type": "timestamp", "nullable": false}
]
}
Token Transfers Parquet Schema
{
"type": "struct",
"fields": [
{"name": "chain_id", "type": "integer", "nullable": false},
{"name": "transaction_hash", "type": "string", "nullable": false},
{"name": "block_number", "type": "long", "nullable": false},
{"name": "token_address", "type": "string", "nullable": false},
{"name": "token_type", "type": "string", "nullable": false},
{"name": "from_address", "type": "string", "nullable": false},
{"name": "to_address", "type": "string", "nullable": false},
{"name": "amount", "type": "string", "nullable": true},
{"name": "token_id", "type": "string", "nullable": true},
{"name": "timestamp", "type": "timestamp", "nullable": false}
]
}
Data Ingestion
ETL Pipeline
Process:
- Extract: Query PostgreSQL for daily data
- Transform: Convert to Parquet format
- Load: Upload to S3 with partitioning
Schedule: Daily batch job, run after the day closes so each partition is written complete
Tools: Apache Spark, AWS Glue, or custom ETL scripts
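The three stages can be sketched as a small orchestration function. The stage callables are injected so the flow is testable without PostgreSQL or S3; all names here are illustrative, not a prescribed API:

```python
from datetime import date
from typing import Callable, Iterable

def run_daily_etl(
    chain_id: int,
    day: date,
    extract: Callable[[int, date], Iterable],  # e.g. query PostgreSQL for the day
    transform: Callable[[Iterable], bytes],    # e.g. encode rows as Parquet
    load: Callable[[str, bytes], None],        # e.g. upload bytes to the S3 key
) -> str:
    """Run one day's pipeline for one table and return the object key."""
    key = (
        f"raw/chain_id={chain_id}/year={day.year:04d}/"
        f"month={day.month:02d}/day={day.day:02d}/blocks.parquet"
    )
    rows = extract(chain_id, day)
    payload = transform(rows)
    load(key, payload)  # writing the same key on re-runs keeps the job idempotent
    return key
```

Because the output key is derived from the partition date, re-running a failed day overwrites the same object rather than duplicating data.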
Compression
Format: Snappy compression (good balance of speed and compression ratio)
Alternative: Gzip (better compression, slower)
File Sizing
Target Size: 100-500 MB per Parquet file
- Smaller files: Better parallelism
- Larger files: Better compression
Strategy: Write files of target size, or split by time ranges
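One way to implement the sizing strategy is to cut the row stream into chunks near the target size using a per-row size estimate. Actual Parquet file sizes depend on encoding and compression, so this is only a heuristic sketch:

```python
from typing import Iterable, Iterator, List

def chunk_rows(
    rows: Iterable[dict],
    row_size: int = 512,                    # assumed average encoded bytes per row
    target_bytes: int = 250 * 1024 * 1024,  # middle of the 100-500 MB range
) -> Iterator[List[dict]]:
    """Yield row chunks whose estimated encoded size is near target_bytes."""
    chunk, size = [], 0
    for row in rows:
        chunk.append(row)
        size += row_size
        if size >= target_bytes:
            yield chunk
            chunk, size = [], 0
    if chunk:
        yield chunk  # final partial chunk
```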
Query Interface
AWS Athena / Presto
Table Definition:
CREATE EXTERNAL TABLE blocks_138 (
  chain_id int,
  number bigint,
  hash string,
  parent_hash string,
  `timestamp` timestamp,
  miner string,
  gas_used bigint,
  gas_limit bigint,
  transaction_count int,
  size int
)
PARTITIONED BY (
  year int,
  month int,
  day int
)
STORED AS PARQUET
LOCATION 's3://explorer-data-lake/raw/chain_id=138/'
TBLPROPERTIES (
  'projection.enabled' = 'true',
  'projection.year.type' = 'integer',
  'projection.year.range' = '2020,2030',
  'projection.month.type' = 'integer',
  'projection.month.range' = '1,12',
  'projection.month.digits' = '2',
  'projection.day.type' = 'integer',
  'projection.day.range' = '1,31',
  'projection.day.digits' = '2'
);
Note: partition projection requires the year/month/day columns to be declared in PARTITIONED BY, and the digits properties match the zero-padded month=01 / day=01 path convention. Also, because blocks.parquet, transactions.parquet, and logs.parquet share the same day prefix in the layout above, a per-table LOCATION like this one would scan all three files; the tables either need separate prefixes or a layout that isolates each table's files.
Query Examples
Daily Transaction Count:
SELECT
  DATE("timestamp") AS tx_date,
  COUNT(*) AS transaction_count
FROM transactions_138
WHERE year = 2024 AND month = 1
GROUP BY DATE("timestamp")
ORDER BY tx_date;
Token Transfer Analytics:
SELECT
  token_address,
  COUNT(*) AS transfer_count,
  -- Athena/Presto caps DECIMAL precision at 38 digits; full uint256
  -- amounts (up to 78 digits) must be aggregated client-side.
  SUM(CAST(amount AS DECIMAL(38, 0))) AS total_volume
FROM token_transfers_138
WHERE year = 2024 AND month = 1
GROUP BY token_address
ORDER BY total_volume DESC
LIMIT 100;
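Because token amounts are stored as decimal strings (uint256 values overflow 64-bit columns), client-side aggregation with Python's arbitrary-precision integers is a safe complement to the SQL above; the rows here are illustrative:

```python
transfers = [  # illustrative rows; amounts are base-unit decimal strings
    {"token_address": "0xaaa", "amount": "1000000000000000000"},
    {"token_address": "0xaaa", "amount": "2500000000000000000"},
    {"token_address": "0xbbb", "amount": "42"},
]

volume = {}
for t in transfers:
    # Python ints are arbitrary precision, so uint256 amounts cannot overflow
    volume[t["token_address"]] = volume.get(t["token_address"], 0) + int(t["amount"])

# volume == {"0xaaa": 3500000000000000000, "0xbbb": 42}
```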
Data Retention
Retention Policies
Raw Data: 7 years (compliance requirement)
Processed Aggregates: Indefinite
Archived Data: Move to Glacier after 1 year
Lifecycle Policies
S3 Lifecycle Rules:
- Move to Infrequent Access after 30 days
- Move to Glacier after 1 year
- Delete after 7 years (raw data)
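Expressed as an S3 lifecycle configuration, the rules above might look like the following (prefix and rule ID are illustrative; 2555 days approximates the 7-year retention):

```json
{
  "Rules": [
    {
      "ID": "raw-data-tiering",
      "Filter": {"Prefix": "raw/"},
      "Status": "Enabled",
      "Transitions": [
        {"Days": 30, "StorageClass": "STANDARD_IA"},
        {"Days": 365, "StorageClass": "GLACIER"}
      ],
      "Expiration": {"Days": 2555}
    }
  ]
}
```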
Data Processing
Aggregation Jobs
Daily Aggregates:
- Transaction counts by hour
- Gas usage statistics
- Token transfer volumes
- Address activity metrics
Monthly Aggregates:
- Network growth metrics
- Token distribution changes
- Protocol usage statistics
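As a toy illustration of the first daily aggregate, hourly transaction counts can be computed from timestamped rows with the standard library; a production job would run the equivalent in Spark or SQL over the Parquet files:

```python
from collections import Counter
from datetime import datetime

def tx_count_by_hour(transactions):
    """Count transactions per hour-of-day from their timestamps."""
    return Counter(t["timestamp"].hour for t in transactions)

txs = [  # illustrative rows
    {"timestamp": datetime(2024, 1, 1, 0, 5)},
    {"timestamp": datetime(2024, 1, 1, 0, 59)},
    {"timestamp": datetime(2024, 1, 1, 13, 0)},
]
# tx_count_by_hour(txs) -> Counter({0: 2, 13: 1})
```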
ML/Analytics Workflows
Use Cases:
- Anomaly detection
- Fraud detection
- Market analysis
- Network health monitoring
Tools: Spark, Pandas, Jupyter notebooks
Security and Access Control
Access Control
IAM Policies: Restrict access to specific prefixes
Encryption: Server-side encryption (SSE-S3 or SSE-KMS)
Audit Logging: Enable S3 access logging
Data Classification
Public Data: Blocks, transactions (public blockchain data)
Sensitive Data: User addresses, labels (requires authentication)
Compliance Data: Banking/transaction data (strict access control)
Cost Optimization
Storage Optimization
Strategies:
- Use appropriate storage classes (Standard, IA, Glacier)
- Compress data (Parquet + Snappy)
- Delete old data per retention policy
- Use intelligent tiering
Query Optimization
Strategies:
- Partition pruning (query only relevant partitions)
- Column pruning (select only needed columns)
- Predicate pushdown (filter early)
References
- Database Schema: See postgres-schema.md
- Analytics: See ../observability/metrics-monitoring.md