# Search Index Schema Specification ## Overview This document specifies the Elasticsearch/OpenSearch index schema for full-text search and faceted querying across blocks, transactions, addresses, tokens, and contracts. ## Architecture ```mermaid flowchart LR PG[(PostgreSQL
Canonical Data)] Transform[Data Transformer] ES[(Elasticsearch
Search Index)] PG --> Transform Transform --> ES Query[Search Query] Query --> ES ES --> Results[Search Results] ``` ## Index Structure ### Blocks Index **Index Name**: `blocks-{chain_id}` (e.g., `blocks-138`) **Document Structure**: ```json { "block_number": 12345, "hash": "0x...", "timestamp": "2024-01-01T00:00:00Z", "miner": "0x...", "transaction_count": 100, "gas_used": 15000000, "gas_limit": 20000000, "chain_id": 138, "parent_hash": "0x...", "size": 1024 } ``` **Field Mappings**: - `block_number`: `long` (not analyzed, for sorting/filtering) - `hash`: `keyword` (exact match) - `timestamp`: `date` - `miner`: `keyword` (exact match) - `transaction_count`: `integer` - `gas_used`: `long` - `gas_limit`: `long` - `chain_id`: `integer` - `parent_hash`: `keyword` **Searchable Fields**: - Hash (exact match) - Miner address (exact match) ### Transactions Index **Index Name**: `transactions-{chain_id}` **Document Structure**: ```json { "hash": "0x...", "block_number": 12345, "transaction_index": 5, "from_address": "0x...", "to_address": "0x...", "value": "1000000000000000000", "gas_price": "20000000000", "gas_used": 21000, "status": "success", "timestamp": "2024-01-01T00:00:00Z", "chain_id": 138, "input_data_length": 100, "is_contract_creation": false, "contract_address": null } ``` **Field Mappings**: - `hash`: `keyword` - `block_number`: `long` - `transaction_index`: `integer` - `from_address`: `keyword` - `to_address`: `keyword` - `value`: `text` (for full-text search on large numbers) - `value_numeric`: `long` (for range queries) - `gas_price`: `long` - `gas_used`: `long` - `status`: `keyword` - `timestamp`: `date` - `chain_id`: `integer` - `input_data_length`: `integer` - `is_contract_creation`: `boolean` - `contract_address`: `keyword` **Searchable Fields**: - Hash (exact match) - From/to addresses (exact match) - Value (range queries) ### Addresses Index **Index Name**: `addresses-{chain_id}` **Document Structure**: ```json { "address": "0x...", "chain_id": 138, "label": "My Wallet", "tags": ["wallet", "exchange"], "token_count": 10, "transaction_count": 500, "first_seen": "2024-01-01T00:00:00Z", "last_seen": "2024-01-15T00:00:00Z", "is_contract": true, "contract_name": "MyToken", "balance_eth": "1.5", "balance_usd": "3000" } ``` **Field Mappings**: - `address`: `keyword` - `chain_id`: `integer` - `label`: `text` (analyzed) + `keyword` (exact match) - `tags`: `keyword` (array) - `token_count`: `integer` - `transaction_count`: `long` - `first_seen`: `date` - `last_seen`: `date` - `is_contract`: `boolean` - `contract_name`: `text` + `keyword` - `balance_eth`: `double` - `balance_usd`: `double` **Searchable Fields**: - Address (exact match, prefix match) - Label (full-text search) - Contract name (full-text search) - Tags (facet filter) ### Tokens Index **Index Name**: `tokens-{chain_id}` **Document Structure**: ```json { "address": "0x...", "chain_id": 138, "name": "My Token", "symbol": "MTK", "type": "ERC20", "decimals": 18, "total_supply": "1000000000000000000000000", "holder_count": 1000, "transfer_count": 50000, "logo_url": "https://...", "verified": true, "description": "A token description" } ``` **Field Mappings**: - `address`: `keyword` - `chain_id`: `integer` - `name`: `text` (analyzed) + `keyword` (exact match) - `symbol`: `keyword` (uppercase normalized) - `type`: `keyword` - `decimals`: `integer` - `total_supply`: `text` (for large numbers) - `total_supply_numeric`: `double` (for sorting) - `holder_count`: `integer` - `transfer_count`: `long` - `logo_url`: `keyword` - `verified`: `boolean` - `description`: `text` (analyzed) **Searchable Fields**: - Name (full-text search) - Symbol (exact match, prefix match) - Address (exact match) ### Contracts Index **Index Name**: `contracts-{chain_id}` **Document Structure**: ```json { "address": "0x...", "chain_id": 138, "name": "MyContract", "verification_status": "verified", "compiler_version": "0.8.19", "source_code": "contract MyContract {...}", "abi": [...], "verified_at": "2024-01-01T00:00:00Z", "transaction_count": 1000, "created_at": "2024-01-01T00:00:00Z" } ``` **Field Mappings**: - `address`: `keyword` - `chain_id`: `integer` - `name`: `text` + `keyword` - `verification_status`: `keyword` - `compiler_version`: `keyword` - `source_code`: `text` (analyzed, indexed but not stored in full for large contracts) - `abi`: `object` (nested, for structured queries) - `verified_at`: `date` - `transaction_count`: `long` - `created_at`: `date` **Searchable Fields**: - Name (full-text search) - Address (exact match) - Source code (full-text search, limited) ## Indexing Pipeline ### Data Transformation **Purpose**: Transform canonical PostgreSQL data into search-optimized documents. **Transformation Steps**: 1. **Fetch Data**: Query PostgreSQL for entities to index 2. **Enrich Data**: Add computed fields (balances, counts, etc.) 3. **Normalize Data**: Normalize addresses, format values 4. **Index Document**: Send to Elasticsearch/OpenSearch ### Indexing Strategy **Initial Indexing**: - Bulk index existing data - Process in batches (1000 documents per batch) - Use bulk API for efficiency **Incremental Indexing**: - Index new entities as they're created - Update entities when changed - Delete entities when removed **Update Frequency**: - Real-time: Index immediately after database insert/update - Batch: Bulk update every N minutes for efficiency ### Index Aliases **Purpose**: Enable zero-downtime index updates. **Strategy**: - Write to new index (e.g., `blocks-138-v2`) - Build index in background - Switch alias when ready - Delete old index after switch **Alias Names**: - `blocks-{chain_id}` → points to latest version - `transactions-{chain_id}` → points to latest version - etc. ## Query Patterns ### Full-Text Search **Blocks Search**: ```json { "query": { "match": { "hash": "0x123..." } } } ``` **Address Search**: ```json { "query": { "bool": { "should": [ { "match": { "label": "wallet" } }, { "prefix": { "address": "0x123" } } ] } } } ``` **Token Search**: ```json { "query": { "bool": { "should": [ { "match": { "name": "My Token" } }, { "match": { "symbol": "MTK" } } ] } } } ``` ### Faceted Search **Filter by Multiple Criteria**: ```json { "query": { "bool": { "must": [ { "term": { "chain_id": 138 } }, { "term": { "type": "ERC20" } }, { "range": { "holder_count": { "gte": 100 } } } ] } }, "aggs": { "by_type": { "terms": { "field": "type" } } } } ``` ### Unified Search **Cross-Entity Search**: - Search across blocks, transactions, addresses, tokens - Use `_index` field to filter by entity type - Combine results with relevance scoring **Multi-Index Query**: ```json { "query": { "multi_match": { "query": "0x123", "fields": ["hash", "address", "from_address", "to_address"], "type": "best_fields" } } } ``` ## Index Configuration ### Analysis Settings **Custom Analyzer**: - Address analyzer: Lowercase, no tokenization - Symbol analyzer: Uppercase, no tokenization - Text analyzer: Standard analyzer with lowercase **Example Configuration**: ```json { "settings": { "analysis": { "analyzer": { "address_analyzer": { "type": "custom", "tokenizer": "keyword", "filter": ["lowercase"] } } } } } ``` ### Sharding and Replication **Sharding**: - Number of shards: Based on index size - Large indices (> 50GB): Multiple shards - Small indices: Single shard **Replication**: - Replica count: 1-2 (for high availability) - Increase replicas for read-heavy workloads ## Performance Optimization ### Index Optimization **Refresh Interval**: - Default: 1 second - For bulk indexing: Increase to 30 seconds, then reset **Bulk Indexing**: - Batch size: 1000-5000 documents - Use bulk API - Disable refresh during bulk indexing ### Query Optimization **Query Caching**: - Enable query cache for repeated queries - Cache filter results **Field Data**: - Use `doc_values` for sorting/aggregations - Avoid `fielddata` for text fields ## Maintenance ### Index Monitoring **Metrics**: - Index size - Document count - Query performance (p50, p95, p99) - Index lag (time behind database) ### Index Cleanup **Strategy**: - Delete old indices (after alias switch) - Archive old indices to cold storage - Compress indices for storage efficiency ## Integration with PostgreSQL ### Data Sync **Sync Strategy**: - Real-time: Listen to database changes (CDC, triggers, or polling) - Batch: Periodic sync jobs - Hybrid: Real-time for recent data, batch for historical **Change Detection**: - Use `updated_at` timestamp - Use database triggers to queue changes - Use CDC (Change Data Capture) if available ### Consistency **Eventual Consistency**: - Search index is eventually consistent with database - Small lag acceptable (< 1 minute) - Critical queries can fall back to database ## References - Database Schema: See `postgres-schema.md` - Indexer Architecture: See `../indexing/indexer-architecture.md` - Unified Search: See `../multichain/unified-search.md`