# Search Index Schema Specification

## Overview

This document specifies the Elasticsearch/OpenSearch index schema for full-text search and faceted querying across blocks, transactions, addresses, tokens, and contracts.

## Architecture

```mermaid
flowchart LR
    PG[(PostgreSQL<br/>Canonical Data)]
    Transform[Data Transformer]
    ES[(Elasticsearch<br/>Search Index)]
    
    PG --> Transform
    Transform --> ES
    
    Query[Search Query]
    Query --> ES
    ES --> Results[Search Results]
```

## Index Structure

### Blocks Index

**Index Name**: `blocks-{chain_id}` (e.g., `blocks-138`)

**Document Structure**:
```json
{
  "block_number": 12345,
  "hash": "0x...",
  "timestamp": "2024-01-01T00:00:00Z",
  "miner": "0x...",
  "transaction_count": 100,
  "gas_used": 15000000,
  "gas_limit": 20000000,
  "chain_id": 138,
  "parent_hash": "0x...",
  "size": 1024
}
```

**Field Mappings**:
- `block_number`: `long` (for sorting and range filtering)
- `hash`: `keyword` (exact match)
- `timestamp`: `date`
- `miner`: `keyword` (exact match)
- `transaction_count`: `integer`
- `gas_used`: `long`
- `gas_limit`: `long`
- `chain_id`: `integer`
- `parent_hash`: `keyword`
- `size`: `integer`

**Searchable Fields**:
- Hash (exact match)
- Miner address (exact match)
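
As an illustration, the field mappings above could be expressed as a create-index request body along these lines. This is a minimal sketch rather than the definitive mapping: the helper name `blocks_index_name` is hypothetical, and mapping `size` as `integer` is an assumption based on the sample document.

```python
import json

# Field types follow the Field Mappings list for the blocks index.
# Assumption: `size` is mapped as `integer`, based on the sample document.
BLOCKS_MAPPING = {
    "mappings": {
        "properties": {
            "block_number": {"type": "long"},
            "hash": {"type": "keyword"},
            "timestamp": {"type": "date"},
            "miner": {"type": "keyword"},
            "transaction_count": {"type": "integer"},
            "gas_used": {"type": "long"},
            "gas_limit": {"type": "long"},
            "chain_id": {"type": "integer"},
            "parent_hash": {"type": "keyword"},
            "size": {"type": "integer"},
        }
    }
}


def blocks_index_name(chain_id: int) -> str:
    """Hypothetical helper: build the per-chain index name."""
    return f"blocks-{chain_id}"


# Serialized body for a request such as `PUT /blocks-138`.
request_body = json.dumps(BLOCKS_MAPPING, indent=2)
```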

### Transactions Index

**Index Name**: `transactions-{chain_id}`

**Document Structure**:
```json
{
  "hash": "0x...",
  "block_number": 12345,
  "transaction_index": 5,
  "from_address": "0x...",
  "to_address": "0x...",
  "value": "1000000000000000000",
  "gas_price": "20000000000",
  "gas_used": 21000,
  "status": "success",
  "timestamp": "2024-01-01T00:00:00Z",
  "chain_id": 138,
  "input_data_length": 100,
  "is_contract_creation": false,
  "contract_address": null
}
```

**Field Mappings**:
- `hash`: `keyword`
- `block_number`: `long`
- `transaction_index`: `integer`
- `from_address`: `keyword`
- `to_address`: `keyword`
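- `value`: `keyword` (exact decimal string; wei amounts can exceed the 64-bit integer range)
- `value_numeric`: `double` (approximate, derived from `value` at index time, for range queries; a `long` overflows above roughly 9.22 ETH expressed in wei)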
- `gas_price`: `long`
- `gas_used`: `long`
- `status`: `keyword`
- `timestamp`: `date`
- `chain_id`: `integer`
- `input_data_length`: `integer`
- `is_contract_creation`: `boolean`
- `contract_address`: `keyword`

**Searchable Fields**:
- Hash (exact match)
- From/to addresses (exact match)
- Value (range queries)
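
The two value fields can be derived from the same wei string at index time. The sketch below is one possible approach, not the indexer's actual code: `value_fields` and `fits_in_long` are hypothetical names, and the floating-point `value_numeric` assumes the field is mapped as `double`. It also shows why a signed 64-bit `long` is too small for wei amounts.

```python
# Largest value a signed 64-bit `long` can hold.
LONG_MAX = 2**63 - 1  # 9223372036854775807, about 9.22 ETH in wei


def fits_in_long(value_wei: str) -> bool:
    """Return True if the wei amount would fit in a `long` field."""
    return int(value_wei) <= LONG_MAX


def value_fields(value_wei: str) -> dict:
    """Derive the exact and approximate representations of a wei amount."""
    return {
        "value": value_wei,                  # exact decimal string
        "value_numeric": float(int(value_wei)),  # approximate, for range queries
    }
```

Range queries on `value_numeric` are approximate for large amounts, so exact comparisons should rely on the string in `value` or on the database.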

### Addresses Index

**Index Name**: `addresses-{chain_id}`

**Document Structure**:
```json
{
  "address": "0x...",
  "chain_id": 138,
  "label": "My Wallet",
  "tags": ["wallet", "exchange"],
  "token_count": 10,
  "transaction_count": 500,
  "first_seen": "2024-01-01T00:00:00Z",
  "last_seen": "2024-01-15T00:00:00Z",
  "is_contract": true,
  "contract_name": "MyToken",
  "balance_eth": "1.5",
  "balance_usd": "3000"
}
```

**Field Mappings**:
- `address`: `keyword`
- `chain_id`: `integer`
- `label`: `text` (analyzed) + `keyword` (exact match)
- `tags`: `keyword` (array)
- `token_count`: `integer`
- `transaction_count`: `long`
- `first_seen`: `date`
- `last_seen`: `date`
- `is_contract`: `boolean`
- `contract_name`: `text` + `keyword`
- `balance_eth`: `double`
- `balance_usd`: `double`

**Searchable Fields**:
- Address (exact match, prefix match)
- Label (full-text search)
- Contract name (full-text search)
- Tags (facet filter)
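
A `text` + `keyword` pairing is usually expressed as a multi-field. The following sketch shows one way to declare it and to combine a full-text match with a tag facet. The sub-field name `keyword`, the `ignore_above` limit, and the function names are assumptions made for illustration.

```python
from typing import Optional


def text_with_keyword() -> dict:
    """Analyzed text field with an exact-match `keyword` sub-field."""
    return {
        "type": "text",
        "fields": {"keyword": {"type": "keyword", "ignore_above": 256}},
    }


ADDRESS_PROPERTIES = {
    "label": text_with_keyword(),
    "contract_name": text_with_keyword(),
    "tags": {"type": "keyword"},
}


def label_search(text: str, tag: Optional[str] = None) -> dict:
    """Full-text search on label/contract name, optionally filtered by tag."""
    bool_query = {
        "must": [
            {"multi_match": {"query": text, "fields": ["label", "contract_name"]}}
        ]
    }
    if tag is not None:
        # Non-scoring facet filter on the keyword array.
        bool_query["filter"] = [{"term": {"tags": tag}}]
    return {
        "query": {"bool": bool_query},
        "aggs": {"by_tag": {"terms": {"field": "tags"}}},
    }
```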

### Tokens Index

**Index Name**: `tokens-{chain_id}`

**Document Structure**:
```json
{
  "address": "0x...",
  "chain_id": 138,
  "name": "My Token",
  "symbol": "MTK",
  "type": "ERC20",
  "decimals": 18,
  "total_supply": "1000000000000000000000000",
  "holder_count": 1000,
  "transfer_count": 50000,
  "logo_url": "https://...",
  "verified": true,
  "description": "A token description"
}
```

**Field Mappings**:
- `address`: `keyword`
- `chain_id`: `integer`
- `name`: `text` (analyzed) + `keyword` (exact match)
- `symbol`: `keyword` (uppercased by a normalizer)
- `type`: `keyword`
- `decimals`: `integer`
- `total_supply`: `keyword` (exact decimal string for large numbers)
- `total_supply_numeric`: `double` (for sorting)
- `holder_count`: `integer`
- `transfer_count`: `long`
- `logo_url`: `keyword`
- `verified`: `boolean`
- `description`: `text` (analyzed)

**Searchable Fields**:
- Name (full-text search)
- Symbol (exact match, prefix match)
- Address (exact match)

### Contracts Index

**Index Name**: `contracts-{chain_id}`

**Document Structure**:
```json
{
  "address": "0x...",
  "chain_id": 138,
  "name": "MyContract",
  "verification_status": "verified",
  "compiler_version": "0.8.19",
  "source_code": "contract MyContract {...}",
  "abi": [...],
  "verified_at": "2024-01-01T00:00:00Z",
  "transaction_count": 1000,
  "created_at": "2024-01-01T00:00:00Z"
}
```

**Field Mappings**:
- `address`: `keyword`
- `chain_id`: `integer`
- `name`: `text` + `keyword`
- `verification_status`: `keyword`
- `compiler_version`: `keyword`
- `source_code`: `text` (analyzed, indexed but not stored in full for large contracts)
- `abi`: `nested` (array of ABI entries; `nested` keeps each entry's fields together for structured queries)
- `verified_at`: `date`
- `transaction_count`: `long`
- `created_at`: `date`

**Searchable Fields**:
- Name (full-text search)
- Address (exact match)
- Source code (full-text search, limited)
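
Structured ABI queries could look like the sketch below, which finds contracts exposing a function with a given name. It assumes `abi` is mapped as `nested` and that `abi.type` and `abi.name` are `keyword` sub-fields; the function name `abi_function_query` is hypothetical.

```python
def abi_function_query(function_name: str) -> dict:
    """Match contracts whose ABI has a function entry with this name.

    Assumes `abi` is a `nested` field, so both terms must match within
    the same ABI entry rather than anywhere in the array.
    """
    return {
        "query": {
            "nested": {
                "path": "abi",
                "query": {
                    "bool": {
                        "filter": [
                            {"term": {"abi.type": "function"}},
                            {"term": {"abi.name": function_name}},
                        ]
                    }
                },
            }
        }
    }
```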

## Indexing Pipeline

### Data Transformation

**Purpose**: Transform canonical PostgreSQL data into search-optimized documents.

**Transformation Steps**:
1. **Fetch Data**: Query PostgreSQL for entities to index
2. **Enrich Data**: Add computed fields (balances, counts, etc.)
3. **Normalize Data**: Normalize addresses, format values
4. **Index Document**: Send to Elasticsearch/OpenSearch
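
The enrich and normalize steps can be sketched as a pure function from a database row to a search document. The column names in `example_row` (`value_wei`, `block_time`, and so on) are hypothetical, since the PostgreSQL schema is defined elsewhere, and only a subset of the transaction fields is shown.

```python
from datetime import datetime, timezone
from typing import Optional


def normalize_address(address: Optional[str]) -> Optional[str]:
    """Lowercase a hex address; a missing address stays None."""
    return address.lower() if address else None


def to_transaction_doc(row: dict) -> dict:
    """Build a search document from one row (column names are hypothetical)."""
    to_address = normalize_address(row["to_address"])
    timestamp = row["block_time"].astimezone(timezone.utc)
    return {
        "hash": row["hash"].lower(),
        "block_number": row["block_number"],
        "from_address": normalize_address(row["from_address"]),
        "to_address": to_address,
        "value": str(row["value_wei"]),  # keep the exact integer as a string
        "timestamp": timestamp.strftime("%Y-%m-%dT%H:%M:%SZ"),
        "chain_id": row["chain_id"],
        "is_contract_creation": to_address is None,  # computed field
    }


example_row = {
    "hash": "0xABCDEF",
    "block_number": 12345,
    "from_address": "0xAbC0000000000000000000000000000000000001",
    "to_address": None,
    "value_wei": 10**18,
    "block_time": datetime(2024, 1, 1, tzinfo=timezone.utc),
    "chain_id": 138,
}
example_doc = to_transaction_doc(example_row)
```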

### Indexing Strategy

**Initial Indexing**:
- Bulk index existing data
- Process in batches (1000 documents per batch)
- Use bulk API for efficiency
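
A minimal sketch of the batching step: documents are grouped into batches of 1000 and each batch is serialized as the newline-delimited body that the `_bulk` endpoint expects (an action line followed by the document source). The function names and the choice of `hash` as the document ID are assumptions.

```python
import json
from typing import Iterable, Iterator, List


def batches(docs: Iterable[dict], size: int = 1000) -> Iterator[List[dict]]:
    """Yield lists of at most `size` documents."""
    batch: List[dict] = []
    for doc in docs:
        batch.append(doc)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:
        yield batch


def bulk_body(index: str, docs: Iterable[dict], id_field: str = "hash") -> str:
    """Serialize documents as NDJSON for the bulk API."""
    lines = []
    for doc in docs:
        lines.append(json.dumps({"index": {"_index": index, "_id": doc[id_field]}}))
        lines.append(json.dumps(doc))
    # The bulk body must end with a newline.
    return "\n".join(lines) + "\n"
```

Each batch would then be sent as one `POST /_bulk` request with the `application/x-ndjson` content type.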

**Incremental Indexing**:
- Index new entities as they're created
- Update entities when changed
- Delete entities when removed

**Update Frequency**:
- Real-time: Index immediately after database insert/update
- Batch: Bulk update every N minutes for efficiency

### Index Aliases

**Purpose**: Enable zero-downtime index updates.

**Strategy**:
- Write to new index (e.g., `blocks-138-v2`)
- Build index in background
- Switch alias when ready
- Delete old index after switch

**Alias Names**:
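- `blocks-{chain_id}` → points to the latest versioned index (e.g., `blocks-138-v2`)
- `transactions-{chain_id}` → points to the latest versioned index
- `addresses-{chain_id}`, `tokens-{chain_id}`, `contracts-{chain_id}` → same pattern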
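
The alias switch can be done in a single `POST /_aliases` request, whose actions are applied atomically so that readers never see a missing alias. The sketch below builds that request body; the function name and the `-v1`/`-v2` suffixes are illustrative assumptions.

```python
def alias_swap_actions(alias: str, old_index: str, new_index: str) -> dict:
    """Build the body for `POST /_aliases` that repoints an alias."""
    return {
        "actions": [
            {"remove": {"index": old_index, "alias": alias}},
            {"add": {"index": new_index, "alias": alias}},
        ]
    }


swap = alias_swap_actions("blocks-138", "blocks-138-v1", "blocks-138-v2")
```

The old index would only be deleted after the swap has succeeded.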

## Query Patterns

### Full-Text Search

**Blocks Search**:
```json
{
  "query": {
    "term": {
      "hash": "0x123..."
    }
  }
}
```

**Address Search**:
```json
{
  "query": {
    "bool": {
      "should": [
        { "match": { "label": "wallet" } },
        { "prefix": { "address": "0x123" } }
      ]
    }
  }
}
```

**Token Search**:
```json
{
  "query": {
    "bool": {
      "should": [
        { "match": { "name": "My Token" } },
        { "match": { "symbol": "MTK" } }
      ]
    }
  }
}
```

### Faceted Search

**Filter by Multiple Criteria**:
```json
{
  "query": {
    "bool": {
      "filter": [
        { "term": { "chain_id": 138 } },
        { "term": { "type": "ERC20" } },
        { "range": { "holder_count": { "gte": 100 } } }
      ]
    }
  },
  "aggs": {
    "by_type": {
      "terms": { "field": "type" }
    }
  }
}
```

### Unified Search

**Cross-Entity Search**:
- Search across blocks, transactions, addresses, tokens
- Use `_index` field to filter by entity type
- Combine results with relevance scoring

**Multi-Index Query**:
```json
{
  "query": {
    "multi_match": {
      "query": "0x123",
      "fields": ["hash", "address", "from_address", "to_address"],
      "type": "best_fields"
    }
  }
}
```
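
One way to combine the multi-index query with an `_index` filter is sketched below. It builds the comma-separated search path and, when entity types are given, restricts hits with a `terms` query on `_index`. The function name is hypothetical, and whether `_index` matches alias names or only concrete index names depends on the engine and version, so that behaviour should be verified before relying on it.

```python
from typing import Optional, Sequence, Tuple

# Index name patterns from the Index Structure section.
ENTITY_INDICES = {
    "blocks": "blocks-{chain_id}",
    "transactions": "transactions-{chain_id}",
    "addresses": "addresses-{chain_id}",
    "tokens": "tokens-{chain_id}",
}


def unified_search(
    term: str, chain_id: int, entity_types: Optional[Sequence[str]] = None
) -> Tuple[str, dict]:
    """Return (path, body) for a search across the per-entity indices."""
    names = {e: p.format(chain_id=chain_id) for e, p in ENTITY_INDICES.items()}
    body = {
        "query": {
            "bool": {
                "must": [
                    {
                        "multi_match": {
                            "query": term,
                            "fields": ["hash", "address", "from_address", "to_address"],
                            "type": "best_fields",
                        }
                    }
                ]
            }
        }
    }
    if entity_types:
        # Restrict hits to the chosen entity types via the `_index` field.
        body["query"]["bool"]["filter"] = [
            {"terms": {"_index": [names[e] for e in entity_types]}}
        ]
    return "/" + ",".join(names.values()) + "/_search", body
```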

## Index Configuration

### Analysis Settings

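**Custom Normalizers and Analyzers**:
- Address normalizer: Lowercase, no tokenization (applied to `keyword` fields)
- Symbol normalizer: Uppercase, no tokenization (applied to `keyword` fields)
- Text analyzer: Standard analyzer with lowercase

`keyword` fields accept a `normalizer`, not an `analyzer`, so the address and symbol rules are defined under `analysis.normalizer`.

**Example Configuration**:
```json
{
  "settings": {
    "analysis": {
      "normalizer": {
        "address_normalizer": {
          "type": "custom",
          "filter": ["lowercase"]
        }
      }
    }
  }
}
```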
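
Because `keyword` fields take a `normalizer` rather than an `analyzer`, the address and symbol rules could be wired into a tokens index as sketched below. The normalizer names `address_normalizer` and `symbol_normalizer` are assumptions, and only the two affected fields are shown.

```python
# Sketch of index settings plus the mappings that reference them.
TOKENS_INDEX_BODY = {
    "settings": {
        "analysis": {
            "normalizer": {
                # Lowercase the whole value without tokenizing it.
                "address_normalizer": {"type": "custom", "filter": ["lowercase"]},
                # Uppercase the whole value without tokenizing it.
                "symbol_normalizer": {"type": "custom", "filter": ["uppercase"]},
            }
        }
    },
    "mappings": {
        "properties": {
            "address": {"type": "keyword", "normalizer": "address_normalizer"},
            "symbol": {"type": "keyword", "normalizer": "symbol_normalizer"},
        }
    },
}


def referenced_normalizers(body: dict) -> set:
    """Collect the normalizer names used by the mapped fields."""
    props = body["mappings"]["properties"]
    return {p["normalizer"] for p in props.values() if "normalizer" in p}
```

With this in place, a `term` query for `mtk` and one for `MTK` would both be normalized to the same indexed value.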

### Sharding and Replication

**Sharding**:
- Number of shards: Based on index size
- Large indices (> 50GB): Multiple shards
- Small indices: Single shard

**Replication**:
- Replica count: 1-2 (for high availability)
- Increase replicas for read-heavy workloads
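
The sizing rule above can be read as simple arithmetic. The sketch below treats 50 GB as a target maximum shard size, which is an assumption layered on the threshold stated above, and the function names are hypothetical.

```python
import math


def primary_shard_count(expected_size_gb: float, max_shard_gb: float = 50.0) -> int:
    """Smallest shard count that keeps each shard at or under the target size."""
    return max(1, math.ceil(expected_size_gb / max_shard_gb))


def index_settings(expected_size_gb: float, replicas: int = 1) -> dict:
    """Build the shard and replica settings for a new index."""
    return {
        "settings": {
            "number_of_shards": primary_shard_count(expected_size_gb),
            "number_of_replicas": replicas,
        }
    }
```

For example, an index expected to reach 120 GB would get three primary shards, while a 10 GB index stays on one.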

## Performance Optimization

### Index Optimization

**Refresh Interval**:
- Default: 1 second
- For bulk indexing: Increase to 30 seconds, then reset

**Bulk Indexing**:
- Batch size: 1000-5000 documents
- Use bulk API
- Relax or disable refresh (`refresh_interval` of `30s` or `-1`) while the bulk load runs, then restore the normal interval
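
Raising the refresh interval and resetting it afterwards fits naturally into a context manager, so the reset happens even if the bulk load fails. In this sketch `put_settings` is a hypothetical callable standing in for whatever client call issues `PUT /{index}/_settings`.

```python
from contextlib import contextmanager
from typing import Callable, Iterator


@contextmanager
def relaxed_refresh(
    put_settings: Callable[[str, dict], None],
    index: str,
    bulk_interval: str = "30s",
    normal_interval: str = "1s",
) -> Iterator[None]:
    """Temporarily raise `refresh_interval` for the duration of a bulk load."""
    put_settings(index, {"index": {"refresh_interval": bulk_interval}})
    try:
        yield
    finally:
        # Always restore the normal interval, even after a failure.
        put_settings(index, {"index": {"refresh_interval": normal_interval}})
```

Passing `bulk_interval="-1"` would disable refresh entirely during the load.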

### Query Optimization

**Query Caching**:
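- Put non-scoring clauses (term, range) in `filter` context so their results can be cached and reused
- Rely on the query cache for filters that repeat across requests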

**Field Data**:
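- Use `doc_values` (enabled by default for `keyword`, numeric, and `date` fields) for sorting/aggregations
- Avoid enabling `fielddata` on `text` fields; sort and aggregate on the `keyword` sub-field instead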

## Maintenance

### Index Monitoring

**Metrics**:
- Index size
- Document count
- Query performance (p50, p95, p99)
- Index lag (time behind database)
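
Index lag can be measured as the gap between the newest timestamp in the database and the newest one in the index. The sketch below assumes both timestamps are already available as timezone-aware datetimes; how they are fetched, and the function names, are left as assumptions.

```python
from datetime import datetime, timedelta, timezone


def index_lag(latest_db_ts: datetime, latest_indexed_ts: datetime) -> timedelta:
    """How far the index is behind the database (never negative)."""
    return max(latest_db_ts - latest_indexed_ts, timedelta(0))


def lag_within(lag: timedelta, threshold: timedelta = timedelta(minutes=1)) -> bool:
    """True if the lag is below the acceptable threshold."""
    return lag < threshold


db_ts = datetime(2024, 1, 1, 0, 0, 45, tzinfo=timezone.utc)
es_ts = datetime(2024, 1, 1, 0, 0, 0, tzinfo=timezone.utc)
current_lag = index_lag(db_ts, es_ts)
```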

### Index Cleanup

**Strategy**:
- Delete old indices (after alias switch)
- Archive old indices to cold storage
- Compress archived indices for storage efficiency (e.g., the `best_compression` index codec)

## Integration with PostgreSQL

### Data Sync

**Sync Strategy**:
- Real-time: Listen to database changes (CDC, triggers, or polling)
- Batch: Periodic sync jobs
- Hybrid: Real-time for recent data, batch for historical

**Change Detection**:
- Use `updated_at` timestamp
- Use database triggers to queue changes
- Use CDC (Change Data Capture) if available
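
Polling on `updated_at` amounts to keeping a cursor and fetching rows changed after it. The sketch below shows the cursor logic over in-memory rows; in practice the filter would be a SQL query along the lines of `WHERE updated_at > :cursor ORDER BY updated_at`. The function name and row shape are hypothetical.

```python
from datetime import datetime, timezone
from typing import Iterable, List, Tuple


def changed_since(rows: Iterable[dict], cursor: datetime) -> Tuple[List[dict], datetime]:
    """Return rows updated after `cursor`, oldest first, and the new cursor."""
    changed = sorted(
        (row for row in rows if row["updated_at"] > cursor),
        key=lambda row: row["updated_at"],
    )
    # Advance the cursor only when something changed.
    new_cursor = changed[-1]["updated_at"] if changed else cursor
    return changed, new_cursor
```

A strict `>` comparison can miss rows that share the cursor's exact timestamp, so a real implementation may need a tie-breaker such as the primary key.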

### Consistency

**Eventual Consistency**:
- Search index is eventually consistent with database
- Small lag acceptable (< 1 minute)
- Critical queries can fall back to database
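
The fallback for critical queries could be sketched as follows: use the search index while its lag is acceptable, and go to the database when the index is stale, unreachable, or has no hit. Both lookups are passed in as hypothetical callables, and the one-minute threshold mirrors the lag bound above.

```python
from datetime import timedelta
from typing import Callable, Optional, Tuple


def lookup(
    key: str,
    search_index: Callable[[str], Optional[dict]],
    database: Callable[[str], Optional[dict]],
    lag: timedelta,
    max_lag: timedelta = timedelta(minutes=1),
) -> Tuple[Optional[dict], str]:
    """Return (result, source), preferring the search index when it is fresh."""
    if lag < max_lag:
        try:
            hit = search_index(key)
            if hit is not None:
                return hit, "index"
        except ConnectionError:
            # Index unreachable: fall through to the database.
            pass
    return database(key), "database"
```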

## References

- Database Schema: See `postgres-schema.md`
- Indexer Architecture: See `../indexing/indexer-architecture.md`
- Unified Search: See `../multichain/unified-search.md`