Monitoring & Observability
Kaizen provides comprehensive observability through Prometheus metrics, Grafana dashboards, Loki log aggregation, and Sentry error tracking.
Docker Monitoring Stack (Recommended)
The easiest way to get full observability is to run the Docker Compose stack:
docker compose up -d
This starts the complete monitoring stack:
| Service | Port | Description | Access |
|---|---|---|---|
| Prometheus | 9090 | Metrics collection and storage | http://localhost:9090 |
| Grafana | 3003 | Dashboards and visualization | http://localhost:3003 (no login) |
| Loki | 3100 | Log aggregation | (via Grafana) |
| Promtail | - | Log collection from Docker | - |
Pre-configured Dashboards
Grafana comes with a Kaizen Overview dashboard that includes:
- Overview: Block height, TXs/min, active connections, uptime
- Block Production: Latency percentiles, TXs per block
- RPC & Network: Request rates, latency by method, WebSocket activity
- Sync Status: Node heights comparison, sync latency
- Trading (RFQ): Thesis rates, amounts, duration
- Bridge: Deposit/withdrawal activity and amounts
- Storage: Database size, pruning metrics
- Logs: Real-time error/warning logs from all services
Log Queries (Loki)
Access logs via Grafana Explore or the Logs panel:
# All errors from write-node
{service="write-node"} |= "error"
# Transaction execution logs
{service="write-node"} | json | msg=~".*executed.*"
# Filter by level
{service=~"write-node|read-node.*"} | json | level="error"
# Search by tx_hash
{service="write-node"} |= "0x1234..."
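The queries above share one shape: a stream selector on `service`, optionally followed by a `json` stage and a `level` filter. The Rust sketch below assembles such a query string. `build_logql` is a hypothetical helper and not part of Kaizen; it assumes a non-empty list of services, treats multiple entries as regex alternatives, and does no escaping.

```rust
/// Build a LogQL query that selects one or more services and optionally
/// filters parsed JSON logs by level. Hypothetical helper for illustration.
fn build_logql(services: &[&str], level: Option<&str>) -> String {
    // One service uses an exact match; several are joined into a regex match.
    let selector = if services.len() == 1 {
        format!("{{service=\"{}\"}}", services[0])
    } else {
        format!("{{service=~\"{}\"}}", services.join("|"))
    };
    match level {
        // Parse each line as JSON, then keep only the requested level.
        Some(lvl) => format!("{} | json | level=\"{}\"", selector, lvl),
        None => selector,
    }
}

fn main() {
    println!("{}", build_logql(&["write-node"], None));
    println!("{}", build_logql(&["write-node", "read-node.*"], Some("error")));
}
```

The resulting strings can be pasted into Grafana Explore with the Loki data source selected.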
Manual Setup (Without Docker)
Enable Metrics
# config.toml
[metrics]
enabled = true
path = "/metrics"
[sentry]
enabled = true
dsn = "https://your-dsn@sentry.io/project"
environment = "production"
traces_sample_rate = 0.1
Scrape Metrics
# Fetch metrics
curl http://localhost:8545/metrics
# Watch metrics in real-time
while true; do date; curl -s localhost:8545/metrics | grep -Ev '^(#|$)' | sort; sleep 10; done
Install (macOS)
brew install prometheus grafana
brew services start prometheus
brew services start grafana
Prometheus Configuration
# /opt/homebrew/etc/prometheus.yml
scrape_configs:
  - job_name: "kaizen"
    static_configs:
      - targets: ["localhost:8545"]
    metrics_path: "/metrics"
    scrape_interval: 5s
Access Dashboards
- Prometheus: http://localhost:9090
- Grafana: http://localhost:3000 (default: admin/admin)
Metrics Reference
Process Metrics
| Metric | Type | Description |
|---|---|---|
| kaizen_info | Gauge | Node version and build info (labels: version, mode) |
| kaizen_uptime_seconds | Gauge | Node uptime in seconds |
| kaizen_process_memory_bytes | Gauge | Current memory usage |
| kaizen_process_cpu_seconds_total | Gauge | Total CPU time consumed |
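The `/metrics` endpoint returns these values in the Prometheus text exposition format: one sample per line, with `#` lines carrying help and type metadata. The Rust sketch below parses a single line into a name, labels, and value, skipping comments and blank lines much like the `grep -Ev '^(#|$)'` filter shown earlier. It is a simplified parser that assumes label values contain no commas, braces, or escaped quotes; `Sample` and `parse_line` are hypothetical names, not Kaizen APIs.

```rust
/// One parsed sample from the Prometheus text exposition format.
#[derive(Debug, PartialEq)]
struct Sample {
    name: String,
    labels: Vec<(String, String)>,
    value: f64,
}

/// Parse a single exposition line. Returns None for comments, blank lines,
/// and anything this simplified parser does not understand.
fn parse_line(line: &str) -> Option<Sample> {
    let line = line.trim();
    if line.is_empty() || line.starts_with('#') {
        return None;
    }
    // Split "name{labels}" from the value that follows it.
    let (series, rest) = match line.find('}') {
        Some(end) => (&line[..=end], &line[end + 1..]),
        None => {
            let idx = line.find(char::is_whitespace)?;
            (&line[..idx], &line[idx..])
        }
    };
    // The value is the first whitespace-separated token; a trailing
    // timestamp, if present, is ignored.
    let value: f64 = rest.split_whitespace().next()?.parse().ok()?;

    let (name, labels) = match series.find('{') {
        Some(open) => {
            let body = &series[open + 1..series.len() - 1];
            let mut labels = Vec::new();
            for pair in body.split(',').filter(|p| !p.trim().is_empty()) {
                let (key, val) = pair.split_once('=')?;
                labels.push((
                    key.trim().to_string(),
                    val.trim().trim_matches('"').to_string(),
                ));
            }
            (series[..open].to_string(), labels)
        }
        None => (series.to_string(), Vec::new()),
    };
    Some(Sample { name, labels, value })
}

fn main() {
    // Sample payload; the method label value is made up for illustration.
    let text = "# HELP kaizen_block_height Current block height\n\
                kaizen_block_height 1024\n\
                kaizen_rpc_requests_total{method=\"getBlock\",status=\"error\"} 3\n";
    for sample in text.lines().filter_map(parse_line) {
        println!("{:?}", sample);
    }
}
```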
Block Production
| Metric | Type | Labels | Description |
|---|---|---|---|
| kaizen_blocks_produced_total | Counter | - | Total blocks produced |
| kaizen_empty_blocks_total | Counter | - | Empty blocks (no txs) |
| kaizen_block_errors_total | Counter | - | Block production errors |
| kaizen_block_height | Gauge | - | Current block height |
| kaizen_block_timestamp | Gauge | - | Latest block timestamp |
| kaizen_block_tx_count | Histogram | - | Transactions per block |
| kaizen_block_production_duration_seconds | Histogram | - | Total block production time |
| kaizen_block_validation_duration_seconds | Histogram | - | Block validation time |
| kaizen_block_execution_duration_seconds | Histogram | - | Block execution time |
| kaizen_block_commit_duration_seconds | Histogram | - | State commit time |
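The histogram metrics above are exposed as cumulative `_bucket` series, and percentiles such as p50 or p99 are estimated from those buckets rather than stored directly. The Rust sketch below shows the linear interpolation idea behind PromQL's `histogram_quantile`. It is a simplified illustration, not the Prometheus implementation, and the bucket bounds and counts in `main` are invented sample data.

```rust
/// Estimate a quantile from cumulative histogram buckets.
/// `buckets` holds (upper_bound, cumulative_count), sorted by upper bound,
/// with the last bound being f64::INFINITY.
fn histogram_quantile(q: f64, buckets: &[(f64, f64)]) -> Option<f64> {
    if !(0.0..=1.0).contains(&q) || buckets.len() < 2 {
        return None;
    }
    let total = buckets.last()?.1;
    if total == 0.0 {
        return None;
    }
    let rank = q * total;
    // First bucket whose cumulative count reaches the target rank.
    let idx = buckets.iter().position(|&(_, count)| count >= rank)?;
    if idx == buckets.len() - 1 {
        // The rank falls in the +Inf bucket: report the highest finite bound.
        return Some(buckets[idx - 1].0);
    }
    let (upper, count) = buckets[idx];
    // The first bucket is assumed to start at zero.
    let (lower, prev_count) = if idx == 0 { (0.0, 0.0) } else { buckets[idx - 1] };
    if count == prev_count {
        // Empty bucket (only possible when rank is zero): avoid dividing by zero.
        return Some(upper);
    }
    // Assume observations are spread evenly inside the bucket.
    Some(lower + (upper - lower) * (rank - prev_count) / (count - prev_count))
}

fn main() {
    // Invented sample: block production durations in seconds.
    let buckets = [(0.01, 10.0), (0.05, 60.0), (0.1, 95.0), (f64::INFINITY, 100.0)];
    for q in [0.5, 0.95, 0.99] {
        println!("p{:.0} = {:?}", q * 100.0, histogram_quantile(q, &buckets));
    }
}
```

Because the estimate interpolates within a bucket, its accuracy depends on how closely the bucket boundaries bracket the latencies of interest.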
Mempool
| Metric | Type | Labels | Description |
|---|---|---|---|
| kaizen_mempool_pending_txs | Gauge | - | Current pending transactions |
| kaizen_mempool_high_priority_txs | Gauge | - | High priority queue size |
| kaizen_mempool_normal_priority_txs | Gauge | - | Normal priority queue size |
| kaizen_mempool_unique_senders | Gauge | - | Unique senders in mempool |
| kaizen_mempool_txs_added_total | Counter | - | Transactions added |
| kaizen_mempool_txs_rejected_total | Counter | reason | Transactions rejected |
| kaizen_mempool_txs_evicted_total | Counter | reason | Transactions evicted |
| kaizen_mempool_tx_size_bytes | Histogram | - | Transaction sizes |
Rejection reasons: duplicate, mempool_full, too_many_from_sender, invalid_signature, invalid_timestamp
Transaction Execution
| Metric | Type | Labels | Description |
|---|---|---|---|
| kaizen_txs_executed_total | Counter | type, status | Transactions executed |
| kaizen_tx_execution_duration_seconds | Histogram | type | Execution duration |
| kaizen_tx_validation_duration_seconds | Histogram | type | Validation duration |
Transaction types: Transfer, Withdraw, RfqSubmit, RfqSettle, OracleFeed, Deposit, etc.
RPC
| Metric | Type | Labels | Description |
|---|---|---|---|
| kaizen_rpc_requests_total | Counter | method, status | Total RPC requests |
| kaizen_rpc_request_duration_seconds | Histogram | method | Request duration |
| kaizen_rpc_active_requests | Gauge | - | Active requests |
Status: success, error
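Since `kaizen_rpc_requests_total` is split by `status`, an error percentage over a time window is the increase of the `error` series divided by the increase of both series. A minimal Rust sketch of that arithmetic follows; `error_rate_percent` is a hypothetical helper, and its inputs are assumed to be counter increases over the same window.

```rust
/// Percentage of RPC requests that failed during a window, computed from
/// the increase of the success and error counters over that window.
/// Returns None when no requests were observed.
fn error_rate_percent(success_delta: u64, error_delta: u64) -> Option<f64> {
    let total = success_delta + error_delta;
    if total == 0 {
        return None;
    }
    Some(error_delta as f64 / total as f64 * 100.0)
}

fn main() {
    // 95 successful and 5 failed requests in the window.
    println!("{:?}", error_rate_percent(95, 5));
    // No traffic: the ratio is undefined rather than zero.
    println!("{:?}", error_rate_percent(0, 0));
}
```

Returning `None` for an idle window mirrors how a ratio of two zero rates yields no usable value in a dashboard.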
WebSocket
| Metric | Type | Labels | Description |
|---|---|---|---|
| kaizen_ws_connections_active | Gauge | - | Active connections |
| kaizen_ws_subscriptions_total | Counter | event_type | Subscriptions created |
| kaizen_ws_messages_sent_total | Counter | event_type | Messages sent |
| kaizen_ws_messages_received_total | Counter | - | Messages received |
Sync (Read Node)
| Metric | Type | Description |
|---|---|---|
| kaizen_blocks_synced_total | Counter | Blocks synced from write node |
| kaizen_sync_height | Gauge | Current sync height |
| kaizen_sync_connected | Gauge | Connection status (1=connected) |
| kaizen_sync_errors_total | Counter | Sync errors |
| kaizen_sync_latency_seconds | Histogram | Block sync latency |
| kaizen_sync_block_download_seconds | Histogram | Block download time |
| kaizen_sync_block_apply_seconds | Histogram | Block apply time |
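A read node's health can be judged from two of these gauges together: `kaizen_sync_connected` and the gap between the write node's `kaizen_block_height` and the read node's `kaizen_sync_height`. The Rust sketch below illustrates that check. `ReadNode`, `sync_lag`, `unhealthy_nodes`, the node names, and the lag threshold are all hypothetical and chosen only for the example.

```rust
/// Snapshot of one read node, taken from its kaizen_sync_* gauges.
struct ReadNode {
    name: &'static str,
    connected: bool,  // kaizen_sync_connected == 1
    sync_height: u64, // kaizen_sync_height
}

/// Blocks the read node is behind the write node. saturating_sub avoids
/// underflow if the two gauges were scraped at slightly different times.
fn sync_lag(write_height: u64, node: &ReadNode) -> u64 {
    write_height.saturating_sub(node.sync_height)
}

/// Names of nodes that are disconnected or more than `max_lag` blocks behind.
fn unhealthy_nodes(write_height: u64, nodes: &[ReadNode], max_lag: u64) -> Vec<&'static str> {
    nodes
        .iter()
        .filter(|n| !n.connected || sync_lag(write_height, n) > max_lag)
        .map(|n| n.name)
        .collect()
}

fn main() {
    let nodes = [
        ReadNode { name: "read-node-1", connected: true, sync_height: 998 },
        ReadNode { name: "read-node-2", connected: true, sync_height: 940 },
        ReadNode { name: "read-node-3", connected: false, sync_height: 1000 },
    ];
    // With a write height of 1000 and a tolerance of 10 blocks.
    println!("{:?}", unhealthy_nodes(1000, &nodes, 10));
}
```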
gRPC Sync Server (Write Node)
| Metric | Type | Description |
|---|---|---|
| kaizen_grpc_clients_connected | Gauge | Connected read nodes |
| kaizen_grpc_blocks_served_total | Counter | Blocks served |
State/Storage
| Metric | Type | Description |
|---|---|---|
| kaizen_state_commit_duration_seconds | Histogram | State commit duration |
| kaizen_state_read_duration_seconds | Histogram | State read duration |
| kaizen_state_write_duration_seconds | Histogram | State write duration |
| kaizen_state_reads_total | Counter | Total state reads |
| kaizen_state_writes_total | Counter | Total state writes |
| kaizen_state_cache_size | Gauge | State cache entries |
| kaizen_state_cache_hit_ratio | Gauge | Cache hit ratio (0-1) |
RocksDB
| Metric | Type | Description |
|---|---|---|
| kaizen_rocksdb_compactions_total | Counter | Total compactions |
| kaizen_rocksdb_compaction_duration_seconds | Histogram | Compaction duration |
| kaizen_rocksdb_size_bytes | Gauge | Database size |
| kaizen_rocksdb_reads_total | Counter | Read operations |
| kaizen_rocksdb_writes_total | Counter | Write operations |
| kaizen_rocksdb_read_duration_seconds | Histogram | Read duration |
| kaizen_rocksdb_write_duration_seconds | Histogram | Write duration |
Oracle
| Metric | Type | Labels | Description |
|---|---|---|---|
| kaizen_oracle_updates_total | Counter | pair | Price updates |
| kaizen_oracle_stale_total | Counter | pair | Staleness detections |
| kaizen_oracle_gaps_total | Counter | pair | Data gaps detected |
| kaizen_oracle_backfills_total | Counter | pair | Backfill operations |
| kaizen_oracle_latest_timestamp | Gauge | - | Latest timestamp |
| kaizen_oracle_price_age_seconds | Histogram | - | Price age when used |
RFQ Trading
| Metric | Type | Labels | Description |
|---|---|---|---|
| kaizen_rfq_submitted_total | Counter | - | Orders submitted |
| kaizen_rfq_settled_total | Counter | status | Orders settled |
| kaizen_rfq_cancelled_total | Counter | reason | Orders cancelled |
| kaizen_rfq_expired_total | Counter | - | Orders expired |
| kaizen_rfq_active_orders | Gauge | - | Active orders |
| kaizen_rfq_bet_amount | Histogram | - | Bet amounts |
| kaizen_rfq_payout | Histogram | - | Payouts |
| kaizen_rfq_duration_seconds | Histogram | - | Thesis lifecycle |
Bridge
| Metric | Type | Description |
|---|---|---|
| kaizen_deposits_total | Counter | Deposits processed |
| kaizen_deposit_amount | Histogram | Deposit amounts |
| kaizen_withdrawals_total | Counter | Withdrawals requested |
| kaizen_withdrawal_amount | Histogram | Withdrawal amounts |
| kaizen_withdrawals_processed_total | Counter | Withdrawals processed |
| kaizen_withdrawals_pending | Gauge | Pending withdrawals |
Accounts
| Metric | Type | Description |
|---|---|---|
| kaizen_transfers_total | Counter | Transfers executed |
| kaizen_transfer_amount | Histogram | Transfer amounts |
| kaizen_accounts_active | Gauge | Active accounts |
| kaizen_api_wallets_created_total | Counter | API wallets created |
| kaizen_api_wallets_revoked_total | Counter | API wallets revoked |
Admin
| Metric | Type | Description |
|---|---|---|
| kaizen_blacklist_additions_total | Counter | Addresses blacklisted |
| kaizen_blacklist_removals_total | Counter | Addresses removed |
| kaizen_system_pauses_total | Counter | System pauses |
| kaizen_system_resumes_total | Counter | System resumes |
Authentication
| Metric | Type | Description |
|---|---|---|
| kaizen_signature_verifications_total | Counter | Signature verifications |
| kaizen_signature_failures_total | Counter | Verification failures |
Sentry Integration
Configuration
[sentry]
enabled = true
dsn = "https://your-key@sentry.io/project-id"
environment = "production" # or "staging", "development"
traces_sample_rate = 0.1 # 10% of transactions traced
release = "kaizen@1.0.0" # optional version tag
Error Tracking
Errors are automatically captured and sent to Sentry. You can also capture events manually:
use kaizen_metrics::{capture_error, capture_message, add_breadcrumb};
// Capture an error
capture_error(&some_error);
// Capture a message
capture_message("Something happened", sentry::Level::Warning);
// Add breadcrumb for debugging context
add_breadcrumb("block_producer", "Started block production");
User Context
use kaizen_metrics::set_user;
// Set user context for error tracking
set_user(Some(user_id), Some(address.to_string()));
Grafana Dashboards
Recommended Panels
Node Overview
- Uptime, memory, CPU usage
- Block height (gauge)
- Blocks produced rate
Block Production
- Block production duration (p50, p95, p99)
- Transactions per block
- Empty blocks ratio
Mempool Health
- Pending transactions
- Rejection rate by reason
- Queue distribution (high/normal priority)
RPC Performance
- Request rate by method
- Latency percentiles
- Error rate
Sync Status (Read Nodes)
- Sync height vs write node
- Sync latency
- Connection status
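Several of the panels above plot rates derived from counters such as `kaizen_blocks_produced_total`. A counter only ever increases, except when the process restarts and it returns to zero, so a rate has to account for resets. The Rust sketch below shows the basic idea with two samples. It is a simplification of what Prometheus's `rate()` does (it omits extrapolation to the window edges), and `counter_rate` is a hypothetical helper.

```rust
/// Per-second rate between two counter samples given as (timestamp_secs, value).
/// A drop in value is treated as a counter reset: the counter is assumed to
/// have restarted from zero, so the current value is the whole increase.
fn counter_rate(prev: (f64, f64), curr: (f64, f64)) -> Option<f64> {
    let elapsed = curr.0 - prev.0;
    if elapsed <= 0.0 {
        return None;
    }
    let increase = if curr.1 >= prev.1 {
        curr.1 - prev.1
    } else {
        curr.1
    };
    Some(increase / elapsed)
}

fn main() {
    // 30 blocks produced in 60 seconds.
    let per_second = counter_rate((0.0, 100.0), (60.0, 130.0)).unwrap();
    println!("blocks per minute: {}", per_second * 60.0);
    // The node restarted between scrapes and has produced 12 blocks since.
    println!("{:?}", counter_rate((0.0, 100.0), (60.0, 12.0)));
}
```

Multiplying the per-second result by 60 gives the per-minute figure used by the block production rate panel.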
Example PromQL Queries
# Block production rate (per minute)
rate(kaizen_blocks_produced_total[1m]) * 60
# Median transactions per block (p50 of the histogram)
histogram_quantile(0.5, rate(kaizen_block_tx_count_bucket[5m]))
# RPC error rate
sum(rate(kaizen_rpc_requests_total{status="error"}[5m]))
/ sum(rate(kaizen_rpc_requests_total[5m])) * 100
# P99 RPC latency by method
histogram_quantile(0.99, rate(kaizen_rpc_request_duration_seconds_bucket[5m]))
# Mempool rejection rate
rate(kaizen_mempool_txs_rejected_total[5m])
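# Sync lag (write height - read height)
# scalar() is used because the two series come from different nodes and their labels do not match
scalar(max(kaizen_block_height)) - kaizen_sync_height
Alerting Rules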
Example Prometheus Alerts
groups:
  - name: kaizen
    rules:
      - alert: HighBlockProductionLatency
        expr: histogram_quantile(0.99, rate(kaizen_block_production_duration_seconds_bucket[5m])) > 0.09
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Block production taking too long"
      - alert: MempoolBacklog
        expr: kaizen_mempool_pending_txs > 5000
        for: 2m
        labels:
          severity: warning
        annotations:
          summary: "Mempool backlog growing"
      - alert: SyncDisconnected
        expr: kaizen_sync_connected == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Read node disconnected from write node"
      - alert: HighRPCErrorRate
        expr: |
          sum(rate(kaizen_rpc_requests_total{status="error"}[5m]))
          / sum(rate(kaizen_rpc_requests_total[5m])) > 0.05
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "RPC error rate above 5%"
      - alert: OracleStale
        expr: increase(kaizen_oracle_stale_total[5m]) > 0
        labels:
          severity: warning
        annotations:
          summary: "Oracle staleness detected"
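The `for:` field in the rules above delays an alert until its expression has been true continuously for the given duration; a rule without `for:`, such as OracleStale, fires on the first evaluation where the expression holds. The Rust sketch below models that behaviour as a small state machine. It is an illustration of the semantics, not Prometheus code, and `AlertRule` and `AlertState` are hypothetical names.

```rust
#[derive(Debug, PartialEq, Clone, Copy)]
enum AlertState {
    Inactive,
    Pending,
    Firing,
}

/// Tracks how long a rule's condition has been continuously true.
struct AlertRule {
    for_secs: u64,
    active_since: Option<u64>,
}

impl AlertRule {
    fn new(for_secs: u64) -> Self {
        AlertRule { for_secs, active_since: None }
    }

    /// Evaluate the rule at `now_secs` given whether its expression matched.
    fn evaluate(&mut self, now_secs: u64, condition: bool) -> AlertState {
        if !condition {
            // Any false evaluation resets the timer.
            self.active_since = None;
            return AlertState::Inactive;
        }
        // Remember the first evaluation at which the condition became true.
        let since = *self.active_since.get_or_insert(now_secs);
        if now_secs.saturating_sub(since) >= self.for_secs {
            AlertState::Firing
        } else {
            AlertState::Pending
        }
    }
}

fn main() {
    // Mirrors "for: 2m" on the MempoolBacklog rule, evaluated every 60 seconds.
    let mut rule = AlertRule::new(120);
    for (t, cond) in [(0, true), (60, true), (120, true), (180, false), (240, true)] {
        println!("t={}s -> {:?}", t, rule.evaluate(t, cond));
    }
}
```

A longer `for:` value reduces alerts caused by brief spikes, at the cost of a slower notification when the problem is real.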