Files
explorer-monorepo/docs/specs/database/search-index-schema.md

9.5 KiB

Search Index Schema Specification

Overview

This document specifies the Elasticsearch/OpenSearch index schema for full-text search and faceted querying across blocks, transactions, addresses, tokens, and contracts.

Architecture

flowchart LR
    PG[(PostgreSQL<br/>Canonical Data)]
    Transform[Data Transformer]
    ES[(Elasticsearch<br/>Search Index)]
    
    PG --> Transform
    Transform --> ES
    
    Query[Search Query]
    Query --> ES
    ES --> Results[Search Results]

Index Structure

Blocks Index

Index Name: blocks-{chain_id} (e.g., blocks-138)

Document Structure:

{
  "block_number": 12345,
  "hash": "0x...",
  "timestamp": "2024-01-01T00:00:00Z",
  "miner": "0x...",
  "transaction_count": 100,
  "gas_used": 15000000,
  "gas_limit": 20000000,
  "chain_id": 138,
  "parent_hash": "0x...",
  "size": 1024
}

Field Mappings:

  • block_number: long (not analyzed, for sorting/filtering)
  • hash: keyword (exact match)
  • timestamp: date
  • miner: keyword (exact match)
  • transaction_count: integer
  • gas_used: long
  • gas_limit: long
  • chain_id: integer
  • parent_hash: keyword

Searchable Fields:

  • Hash (exact match)
  • Miner address (exact match)

Transactions Index

Index Name: transactions-{chain_id}

Document Structure:

{
  "hash": "0x...",
  "block_number": 12345,
  "transaction_index": 5,
  "from_address": "0x...",
  "to_address": "0x...",
  "value": "1000000000000000000",
  "gas_price": "20000000000",
  "gas_used": 21000,
  "status": "success",
  "timestamp": "2024-01-01T00:00:00Z",
  "chain_id": 138,
  "input_data_length": 100,
  "is_contract_creation": false,
  "contract_address": null
}

Field Mappings:

  • hash: keyword
  • block_number: long
  • transaction_index: integer
  • from_address: keyword
  • to_address: keyword
  • value: text (for full-text search on large numbers)
  • value_numeric: long (for range queries)
  • gas_price: long
  • gas_used: long
  • status: keyword
  • timestamp: date
  • chain_id: integer
  • input_data_length: integer
  • is_contract_creation: boolean
  • contract_address: keyword

Searchable Fields:

  • Hash (exact match)
  • From/to addresses (exact match)
  • Value (range queries)

Addresses Index

Index Name: addresses-{chain_id}

Document Structure:

{
  "address": "0x...",
  "chain_id": 138,
  "label": "My Wallet",
  "tags": ["wallet", "exchange"],
  "token_count": 10,
  "transaction_count": 500,
  "first_seen": "2024-01-01T00:00:00Z",
  "last_seen": "2024-01-15T00:00:00Z",
  "is_contract": true,
  "contract_name": "MyToken",
  "balance_eth": "1.5",
  "balance_usd": "3000"
}

Field Mappings:

  • address: keyword
  • chain_id: integer
  • label: text (analyzed) + keyword (exact match)
  • tags: keyword (array)
  • token_count: integer
  • transaction_count: long
  • first_seen: date
  • last_seen: date
  • is_contract: boolean
  • contract_name: text + keyword
  • balance_eth: double
  • balance_usd: double

Searchable Fields:

  • Address (exact match, prefix match)
  • Label (full-text search)
  • Contract name (full-text search)
  • Tags (facet filter)

Tokens Index

Index Name: tokens-{chain_id}

Document Structure:

{
  "address": "0x...",
  "chain_id": 138,
  "name": "My Token",
  "symbol": "MTK",
  "type": "ERC20",
  "decimals": 18,
  "total_supply": "1000000000000000000000000",
  "holder_count": 1000,
  "transfer_count": 50000,
  "logo_url": "https://...",
  "verified": true,
  "description": "A token description"
}

Field Mappings:

  • address: keyword
  • chain_id: integer
  • name: text (analyzed) + keyword (exact match)
  • symbol: keyword (uppercase normalized)
  • type: keyword
  • decimals: integer
  • total_supply: text (for large numbers)
  • total_supply_numeric: double (for sorting)
  • holder_count: integer
  • transfer_count: long
  • logo_url: keyword
  • verified: boolean
  • description: text (analyzed)

Searchable Fields:

  • Name (full-text search)
  • Symbol (exact match, prefix match)
  • Address (exact match)

Contracts Index

Index Name: contracts-{chain_id}

Document Structure:

{
  "address": "0x...",
  "chain_id": 138,
  "name": "MyContract",
  "verification_status": "verified",
  "compiler_version": "0.8.19",
  "source_code": "contract MyContract {...}",
  "abi": [...],
  "verified_at": "2024-01-01T00:00:00Z",
  "transaction_count": 1000,
  "created_at": "2024-01-01T00:00:00Z"
}

Field Mappings:

  • address: keyword
  • chain_id: integer
  • name: text + keyword
  • verification_status: keyword
  • compiler_version: keyword
  • source_code: text (analyzed, indexed but not stored in full for large contracts)
  • abi: object (nested, for structured queries)
  • verified_at: date
  • transaction_count: long
  • created_at: date

Searchable Fields:

  • Name (full-text search)
  • Address (exact match)
  • Source code (full-text search, limited)

Indexing Pipeline

Data Transformation

Purpose: Transform canonical PostgreSQL data into search-optimized documents.

Transformation Steps:

  1. Fetch Data: Query PostgreSQL for entities to index
  2. Enrich Data: Add computed fields (balances, counts, etc.)
  3. Normalize Data: Normalize addresses, format values
  4. Index Document: Send to Elasticsearch/OpenSearch

Indexing Strategy

Initial Indexing:

  • Bulk index existing data
  • Process in batches (1000 documents per batch)
  • Use bulk API for efficiency

Incremental Indexing:

  • Index new entities as they're created
  • Update entities when changed
  • Delete entities when removed

Update Frequency:

  • Real-time: Index immediately after database insert/update
  • Batch: Bulk update every N minutes for efficiency

Index Aliases

Purpose: Enable zero-downtime index updates.

Strategy:

  • Write to new index (e.g., blocks-138-v2)
  • Build index in background
  • Switch alias when ready
  • Delete old index after switch

Alias Names:

  • blocks-{chain_id} → points to latest version
  • transactions-{chain_id} → points to latest version
  • etc.

Query Patterns

Blocks Search:

{
  "query": {
    "match": {
      "hash": "0x123..."
    }
  }
}

Address Search:

{
  "query": {
    "bool": {
      "should": [
        { "match": { "label": "wallet" } },
        { "prefix": { "address": "0x123" } }
      ]
    }
  }
}

Token Search:

{
  "query": {
    "bool": {
      "should": [
        { "match": { "name": "My Token" } },
        { "match": { "symbol": "MTK" } }
      ]
    }
  }
}

Filter by Multiple Criteria:

{
  "query": {
    "bool": {
      "must": [
        { "term": { "chain_id": 138 } },
        { "term": { "type": "ERC20" } },
        { "range": { "holder_count": { "gte": 100 } } }
      ]
    }
  },
  "aggs": {
    "by_type": {
      "terms": { "field": "type" }
    }
  }
}

Cross-Entity Search:

  • Search across blocks, transactions, addresses, tokens
  • Use _index field to filter by entity type
  • Combine results with relevance scoring

Multi-Index Query:

{
  "query": {
    "multi_match": {
      "query": "0x123",
      "fields": ["hash", "address", "from_address", "to_address"],
      "type": "best_fields"
    }
  }
}

Index Configuration

Analysis Settings

Custom Analyzer:

  • Address analyzer: Lowercase, no tokenization
  • Symbol analyzer: Uppercase, no tokenization
  • Text analyzer: Standard analyzer with lowercase

Example Configuration:

{
  "settings": {
    "analysis": {
      "analyzer": {
        "address_analyzer": {
          "type": "custom",
          "tokenizer": "keyword",
          "filter": ["lowercase"]
        }
      }
    }
  }
}

Sharding and Replication

Sharding:

  • Number of shards: Based on index size
  • Large indices (> 50GB): Multiple shards
  • Small indices: Single shard

Replication:

  • Replica count: 1-2 (for high availability)
  • Increase replicas for read-heavy workloads

Performance Optimization

Index Optimization

Refresh Interval:

  • Default: 1 second
  • For bulk indexing: Increase to 30 seconds, then reset

Bulk Indexing:

  • Batch size: 1000-5000 documents
  • Use bulk API
  • Disable refresh during bulk indexing

Query Optimization

Query Caching:

  • Enable query cache for repeated queries
  • Cache filter results

Field Data:

  • Use doc_values for sorting/aggregations
  • Avoid fielddata for text fields

Maintenance

Index Monitoring

Metrics:

  • Index size
  • Document count
  • Query performance (p50, p95, p99)
  • Index lag (time behind database)

Index Cleanup

Strategy:

  • Delete old indices (after alias switch)
  • Archive old indices to cold storage
  • Compress indices for storage efficiency

Integration with PostgreSQL

Data Sync

Sync Strategy:

  • Real-time: Listen to database changes (CDC, triggers, or polling)
  • Batch: Periodic sync jobs
  • Hybrid: Real-time for recent data, batch for historical

Change Detection:

  • Use updated_at timestamp
  • Use database triggers to queue changes
  • Use CDC (Change Data Capture) if available

Consistency

Eventual Consistency:

  • Search index is eventually consistent with database
  • Small lag acceptable (< 1 minute)
  • Critical queries can fall back to database

References

  • Database Schema: See postgres-schema.md
  • Indexer Architecture: See ../indexing/indexer-architecture.md
  • Unified Search: See ../multichain/unified-search.md