Monitoring & Observability

Kaizen provides comprehensive observability through Prometheus metrics, Grafana dashboards, Loki log aggregation, and Sentry error tracking.

Docker Monitoring Stack (Recommended)

The easiest way to get full observability is to use the Docker Compose stack:

docker compose up -d

This starts the complete monitoring stack:

| Service | Port | Description | Access |
| --- | --- | --- | --- |
| Prometheus | 9090 | Metrics collection and storage | http://localhost:9090 |
| Grafana | 3003 | Dashboards and visualization | http://localhost:3003 (no login) |
| Loki | 3100 | Log aggregation | (via Grafana) |
| Promtail | - | Log collection from Docker | - |

Pre-configured Dashboards

Grafana comes with a Kaizen Overview dashboard that includes:

  • Overview: Block height, TXs/min, active connections, uptime
  • Block Production: Latency percentiles, TXs per block
  • RPC & Network: Request rates, latency by method, WebSocket activity
  • Sync Status: Node heights comparison, sync latency
  • Trading (RFQ): Thesis rates, amounts, duration
  • Bridge: Deposit/withdrawal activity and amounts
  • Storage: Database size, pruning metrics
  • Logs: Real-time error/warning logs from all services

Log Queries (Loki)

Access logs via Grafana Explore or the Logs panel:

# All errors from write-node
{service="write-node"} |= "error"

# Transaction execution logs
{service="write-node"} | json | msg=~".*executed.*"

# Filter by level
{service=~"write-node|read-node.*"} | json | level="error"

# Search by tx_hash
{service="write-node"} |= "0x1234..."
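
The same LogQL expressions can also be sent to Loki directly over its HTTP API. The sketch below only builds the request URL for the `/loki/api/v1/query_range` endpoint. The helper name `loki_query_url`, the default base URL (port 3100 from the stack table), and the result limit are illustrative assumptions, not part of Kaizen.

```python
from urllib.parse import urlencode


def loki_query_url(logql, start_ns, end_ns, base="http://localhost:3100", limit=100):
    """Build a Loki query_range URL for a LogQL expression.

    start_ns and end_ns are Unix timestamps in nanoseconds.
    base and limit are illustrative defaults, not Kaizen settings.
    """
    params = {
        "query": logql,
        "start": str(start_ns),
        "end": str(end_ns),
        "limit": str(limit),
    }
    # urlencode percent-escapes the braces, quotes, and pipes in the LogQL text.
    return f"{base}/loki/api/v1/query_range?{urlencode(params)}"
```

Passing the resulting URL to any HTTP client should return the matching log lines as JSON, provided Loki's port is reachable from where the script runs.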

Manual Setup (Without Docker)

Enable Metrics

# config.toml
[metrics]
enabled = true
path = "/metrics"
 
[sentry]
enabled = true
dsn = "https://your-dsn@sentry.io/project"
environment = "production"
traces_sample_rate = 0.1

Scrape Metrics

# Fetch metrics
curl http://localhost:8545/metrics
 
# Watch metrics in real-time
while true; do date; curl -s localhost:8545/metrics | grep -Ev '^(#|$)' | sort; sleep 10; done
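
For scripted checks it can help to turn the scraped text into structured samples. The following is a simplified sketch of a parser for the Prometheus text exposition format. It skips comment lines, does not unescape label values, and silently ignores lines it does not understand. The function name `parse_metrics` is an assumption for illustration.

```python
import re

# One sample line: metric name, optional {labels}, value, optional timestamp.
_SAMPLE = re.compile(r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(.*)\})?\s+(\S+)(?:\s+-?\d+)?$')
# One label pair inside the braces: name="value" (escapes are left as-is).
_LABEL = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)="((?:[^"\\]|\\.)*)"')


def parse_metrics(text):
    """Return a list of (name, labels_dict, value) tuples from exposition text."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # blank lines and HELP/TYPE comments carry no samples
        match = _SAMPLE.match(line)
        if match is None:
            continue  # not a form this simplified parser handles
        name, raw_labels, raw_value = match.groups()
        try:
            value = float(raw_value)  # also accepts NaN, +Inf, -Inf
        except ValueError:
            continue
        labels = dict(_LABEL.findall(raw_labels or ""))
        samples.append((name, labels, value))
    return samples
```

Feeding it the output of `curl -s localhost:8545/metrics` yields one tuple per sample, which can then be filtered by metric name or label.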

Install (macOS)

brew install prometheus grafana
brew services start prometheus
brew services start grafana

Prometheus Configuration

# /opt/homebrew/etc/prometheus.yml
scrape_configs:
  - job_name: "kaizen"
    static_configs:
      - targets: ["localhost:8545"]
    metrics_path: "/metrics"
    scrape_interval: 5s

Access Dashboards


Metrics Reference

Process Metrics

| Metric | Type | Description |
| --- | --- | --- |
| kaizen_info | Gauge | Node version and build info (labels: version, mode) |
| kaizen_uptime_seconds | Gauge | Node uptime in seconds |
| kaizen_process_memory_bytes | Gauge | Current memory usage |
| kaizen_process_cpu_seconds_total | Gauge | Total CPU time consumed |

Block Production

| Metric | Type | Labels | Description |
| --- | --- | --- | --- |
| kaizen_blocks_produced_total | Counter | - | Total blocks produced |
| kaizen_empty_blocks_total | Counter | - | Empty blocks (no txs) |
| kaizen_block_errors_total | Counter | - | Block production errors |
| kaizen_block_height | Gauge | - | Current block height |
| kaizen_block_timestamp | Gauge | - | Latest block timestamp |
| kaizen_block_tx_count | Histogram | - | Transactions per block |
| kaizen_block_production_duration_seconds | Histogram | - | Total block production time |
| kaizen_block_validation_duration_seconds | Histogram | - | Block validation time |
| kaizen_block_execution_duration_seconds | Histogram | - | Block execution time |
| kaizen_block_commit_duration_seconds | Histogram | - | State commit time |

Mempool

| Metric | Type | Labels | Description |
| --- | --- | --- | --- |
| kaizen_mempool_pending_txs | Gauge | - | Current pending transactions |
| kaizen_mempool_high_priority_txs | Gauge | - | High priority queue size |
| kaizen_mempool_normal_priority_txs | Gauge | - | Normal priority queue size |
| kaizen_mempool_unique_senders | Gauge | - | Unique senders in mempool |
| kaizen_mempool_txs_added_total | Counter | - | Transactions added |
| kaizen_mempool_txs_rejected_total | Counter | reason | Transactions rejected |
| kaizen_mempool_txs_evicted_total | Counter | reason | Transactions evicted |
| kaizen_mempool_tx_size_bytes | Histogram | - | Transaction sizes |

Rejection reasons: duplicate, mempool_full, too_many_from_sender, invalid_signature, invalid_timestamp

Transaction Execution

| Metric | Type | Labels | Description |
| --- | --- | --- | --- |
| kaizen_txs_executed_total | Counter | type, status | Transactions executed |
| kaizen_tx_execution_duration_seconds | Histogram | type | Execution duration |
| kaizen_tx_validation_duration_seconds | Histogram | type | Validation duration |

Transaction types: Transfer, Withdraw, RfqSubmit, RfqSettle, OracleFeed, Deposit, etc.

RPC

| Metric | Type | Labels | Description |
| --- | --- | --- | --- |
| kaizen_rpc_requests_total | Counter | method, status | Total RPC requests |
| kaizen_rpc_request_duration_seconds | Histogram | method | Request duration |
| kaizen_rpc_active_requests | Gauge | - | Active requests |

Status: success, error
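
Because `kaizen_rpc_requests_total` is a counter, an error ratio has to be derived from the increase between two scrapes rather than from the raw values. The sketch below shows that calculation under the assumption that each scrape has already been reduced to a dict keyed by `(method, status)`. The function name and data shape are illustrative and not part of Kaizen.

```python
def rpc_error_ratio(before, after):
    """Share of RPC requests that failed between two counter snapshots.

    before and after map (method, status) -> cumulative request count.
    Returns None when no requests were observed in the interval.
    """
    total = 0.0
    errors = 0.0
    for key, now in after.items():
        delta = now - before.get(key, 0.0)
        if delta < 0:
            # The counter went backwards, so the node restarted: count from zero.
            delta = now
        total += delta
        if key[1] == "error":
            errors += delta
    return errors / total if total > 0 else None
```

Treating a decrease as a restart is the same idea Prometheus applies to counter resets when it evaluates `rate()`.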

WebSocket

| Metric | Type | Labels | Description |
| --- | --- | --- | --- |
| kaizen_ws_connections_active | Gauge | - | Active connections |
| kaizen_ws_subscriptions_total | Counter | event_type | Subscriptions created |
| kaizen_ws_messages_sent_total | Counter | event_type | Messages sent |
| kaizen_ws_messages_received_total | Counter | - | Messages received |

Sync (Read Node)

| Metric | Type | Description |
| --- | --- | --- |
| kaizen_blocks_synced_total | Counter | Blocks synced from write node |
| kaizen_sync_height | Gauge | Current sync height |
| kaizen_sync_connected | Gauge | Connection status (1=connected) |
| kaizen_sync_errors_total | Counter | Sync errors |
| kaizen_sync_latency_seconds | Histogram | Block sync latency |
| kaizen_sync_block_download_seconds | Histogram | Block download time |
| kaizen_sync_block_apply_seconds | Histogram | Block apply time |

gRPC Sync Server (Write Node)

| Metric | Type | Description |
| --- | --- | --- |
| kaizen_grpc_clients_connected | Gauge | Connected read nodes |
| kaizen_grpc_blocks_served_total | Counter | Blocks served |

State/Storage

| Metric | Type | Description |
| --- | --- | --- |
| kaizen_state_commit_duration_seconds | Histogram | State commit duration |
| kaizen_state_read_duration_seconds | Histogram | State read duration |
| kaizen_state_write_duration_seconds | Histogram | State write duration |
| kaizen_state_reads_total | Counter | Total state reads |
| kaizen_state_writes_total | Counter | Total state writes |
| kaizen_state_cache_size | Gauge | State cache entries |
| kaizen_state_cache_hit_ratio | Gauge | Cache hit ratio (0-1) |

RocksDB

| Metric | Type | Description |
| --- | --- | --- |
| kaizen_rocksdb_compactions_total | Counter | Total compactions |
| kaizen_rocksdb_compaction_duration_seconds | Histogram | Compaction duration |
| kaizen_rocksdb_size_bytes | Gauge | Database size |
| kaizen_rocksdb_reads_total | Counter | Read operations |
| kaizen_rocksdb_writes_total | Counter | Write operations |
| kaizen_rocksdb_read_duration_seconds | Histogram | Read duration |
| kaizen_rocksdb_write_duration_seconds | Histogram | Write duration |

Oracle

| Metric | Type | Labels | Description |
| --- | --- | --- | --- |
| kaizen_oracle_updates_total | Counter | pair | Price updates |
| kaizen_oracle_stale_total | Counter | pair | Staleness detections |
| kaizen_oracle_gaps_total | Counter | pair | Data gaps detected |
| kaizen_oracle_backfills_total | Counter | pair | Backfill operations |
| kaizen_oracle_latest_timestamp | Gauge | - | Latest timestamp |
| kaizen_oracle_price_age_seconds | Histogram | - | Price age when used |

RFQ Trading

| Metric | Type | Labels | Description |
| --- | --- | --- | --- |
| kaizen_rfq_submitted_total | Counter | - | Orders submitted |
| kaizen_rfq_settled_total | Counter | status | Orders settled |
| kaizen_rfq_cancelled_total | Counter | reason | Orders cancelled |
| kaizen_rfq_expired_total | Counter | - | Orders expired |
| kaizen_rfq_active_orders | Gauge | - | Active orders |
| kaizen_rfq_bet_amount | Histogram | - | Bet amounts |
| kaizen_rfq_payout | Histogram | - | Payouts |
| kaizen_rfq_duration_seconds | Histogram | - | Thesis lifecycle |

Bridge

| Metric | Type | Description |
| --- | --- | --- |
| kaizen_deposits_total | Counter | Deposits processed |
| kaizen_deposit_amount | Histogram | Deposit amounts |
| kaizen_withdrawals_total | Counter | Withdrawals requested |
| kaizen_withdrawal_amount | Histogram | Withdrawal amounts |
| kaizen_withdrawals_processed_total | Counter | Withdrawals processed |
| kaizen_withdrawals_pending | Gauge | Pending withdrawals |

Accounts

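| Metric | Type | Description |
| --- | --- | --- |
| kaizen_transfers_total | Counter | Transfers executed |
| kaizen_transfer_amount | Histogram | Transfer amounts |
| kaizen_accounts_active | Gauge | Active accounts |
| kaizen_api_wallets_created_total | Counter | API wallets created |
| kaizen_api_wallets_revoked_total | Counter | API wallets revoked |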
kaizen_api_wallets_revoked_totalCounterAPI wallets revoked

Admin

| Metric | Type | Description |
| --- | --- | --- |
| kaizen_blacklist_additions_total | Counter | Addresses blacklisted |
| kaizen_blacklist_removals_total | Counter | Addresses removed |
| kaizen_system_pauses_total | Counter | System pauses |
| kaizen_system_resumes_total | Counter | System resumes |

Authentication

| Metric | Type | Description |
| --- | --- | --- |
| kaizen_signature_verifications_total | Counter | Signature verifications |
| kaizen_signature_failures_total | Counter | Verification failures |

Sentry Integration

Configuration

[sentry]
enabled = true
dsn = "https://your-key@sentry.io/project-id"
environment = "production"  # or "staging", "development"
traces_sample_rate = 0.1    # 10% of transactions traced
release = "kaizen@1.0.0"    # optional version tag

Error Tracking

Errors are captured automatically and sent to Sentry. You can also capture errors, messages, and breadcrumbs manually:

use kaizen_metrics::{capture_error, capture_message, add_breadcrumb};
 
// Capture an error
capture_error(&some_error);
 
// Capture a message
capture_message("Something happened", sentry::Level::Warning);
 
// Add breadcrumb for debugging context
add_breadcrumb("block_producer", "Started block production");

User Context

use kaizen_metrics::set_user;
 
// Set user context for error tracking
set_user(Some(user_id), Some(address.to_string()));

Grafana Dashboards

Recommended Panels

Node Overview

  • Uptime, memory, CPU usage
  • Block height (gauge)
  • Blocks produced rate

Block Production

  • Block production duration (p50, p95, p99)
  • Transactions per block
  • Empty blocks ratio

Mempool Health

  • Pending transactions
  • Rejection rate by reason
  • Queue distribution (high/normal priority)

RPC Performance

  • Request rate by method
  • Latency percentiles
  • Error rate

Sync Status (Read Nodes)

  • Sync height vs write node
  • Sync latency
  • Connection status

Example PromQL Queries

# Block production rate (per minute)
rate(kaizen_blocks_produced_total[1m]) * 60
 
# Average transactions per block (histogram sum divided by observation count)
rate(kaizen_block_tx_count_sum[5m]) / rate(kaizen_block_tx_count_count[5m])
 
# RPC error rate
sum(rate(kaizen_rpc_requests_total{status="error"}[5m]))
/ sum(rate(kaizen_rpc_requests_total[5m])) * 100
 
# P99 RPC latency by method (aggregate buckets per method before taking the quantile)
histogram_quantile(0.99, sum by (le, method) (rate(kaizen_rpc_request_duration_seconds_bucket[5m])))
 
# Mempool rejection rate
rate(kaizen_mempool_txs_rejected_total[5m])
 
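# Sync lag (write height - read height). The two gauges are typically scraped
# from different nodes, so their labels will not match in a plain subtraction;
# taking the highest block height as a scalar avoids that.
scalar(max(kaizen_block_height)) - kaizen_sync_height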
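
The percentile queries rely on `histogram_quantile`, which estimates a quantile from cumulative bucket counts rather than from raw observations. The sketch below is a simplified Python model of that estimate for classic histograms with positive bounds, such as latencies. It finds the bucket containing the target rank and interpolates linearly inside it. The function name `estimate_quantile` and the bucket layout are illustrative, and Prometheus handles more edge cases than this does.

```python
import math


def estimate_quantile(q, buckets):
    """Estimate quantile q (0 < q <= 1) from cumulative histogram buckets.

    buckets is a list of (upper_bound, cumulative_count) sorted by bound,
    with at least one finite bucket and a final (math.inf, total) entry.
    """
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    # First bucket whose cumulative count reaches the target rank.
    idx = next(i for i, (_, count) in enumerate(buckets) if count >= rank)
    if idx == len(buckets) - 1:
        # The rank falls in the +Inf bucket: report the highest finite bound.
        return buckets[-2][0]
    upper, count = buckets[idx]
    # The first bucket is assumed to start at zero (bounds are positive).
    lower, below = (0.0, 0.0) if idx == 0 else buckets[idx - 1]
    in_bucket = count - below
    if in_bucket == 0:
        return lower
    # Assume observations are spread evenly inside the bucket.
    return lower + (upper - lower) * (rank - below) / in_bucket
```

One consequence of the interpolation is that the result can never be more precise than the bucket boundaries, so a threshold that sits inside a wide bucket is only approximated.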

Alerting Rules

Example Prometheus Alerts

groups:
  - name: kaizen
    rules:
      - alert: HighBlockProductionLatency
        expr: histogram_quantile(0.99, rate(kaizen_block_production_duration_seconds_bucket[5m])) > 0.09
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Block production taking too long"
 
      - alert: MempoolBacklog
        expr: kaizen_mempool_pending_txs > 5000
        for: 2m
        labels:
          severity: warning
        annotations:
          summary: "Mempool backlog growing"
 
      - alert: SyncDisconnected
        expr: kaizen_sync_connected == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Read node disconnected from write node"
 
      - alert: HighRPCErrorRate
        expr: |
          sum(rate(kaizen_rpc_requests_total{status="error"}[5m])) 
          / sum(rate(kaizen_rpc_requests_total[5m])) > 0.05
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "RPC error rate above 5%"
 
      - alert: OracleStale
        expr: increase(kaizen_oracle_stale_total[5m]) > 0
        labels:
          severity: warning
        annotations:
          summary: "Oracle staleness detected"
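
The `for:` field in these rules delays firing until the expression has held continuously for the given duration. A rule without it, such as OracleStale, fires on the first evaluation that returns a result. The sketch below is a simplified model of that behavior, assuming one boolean result per rule evaluation. The function name and the timestamps are illustrative.

```python
def alert_states(evaluations, for_seconds):
    """Map (timestamp_seconds, condition_true) evaluations to alert states.

    Returns "inactive", "pending", or "firing" for each evaluation.
    """
    states = []
    active_since = None
    for ts, condition in evaluations:
        if not condition:
            active_since = None  # any false evaluation resets the timer
            states.append("inactive")
            continue
        if active_since is None:
            active_since = ts  # first evaluation in the current true streak
        held = ts - active_since
        states.append("firing" if held >= for_seconds else "pending")
    return states
```

With a 2-minute `for:` and evaluations one minute apart, a condition that flaps once starts over from pending, which is why short `for:` values suit critical alerts and longer ones suit noisy signals.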