Changelog
All notable changes to Kaizen Core are documented here.
[Unreleased] - 2025-12-11
Added
Parallel Execution Pipeline
Added a high-performance parallel execution pipeline for improved transaction throughput.
Key Components:
- Parallel Signature Verification: Uses rayon to parallelize ECDSA signature recovery across all CPU cores.
- Aggregate-Based Scheduling: Groups non-conflicting transactions into parallel batches using the AggregateAccess trait.
- ParallelStateManager: Thread-safe state access via DashMap without requiring &mut self.
- TrueParallelExecutor: Combines all components for true parallel transaction execution.
| Component | Batch Size | Parallel | Sequential | Speedup |
|---|---|---|---|---|
| Signature Verification | 1000 txs | 318K TPS | 40K TPS | 8.0x |
| True Parallel Execution | 1000 txs | 62K TPS | 33K TPS | 1.9x |
| Full Pipeline | 1000 txs | 91K TPS | 33K TPS | 2.8x |
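For illustration, the signature-verification stage amounts to a rayon parallel map over the batch; the SignedTx type and the recovery body below are simplified stand-ins, not the shipped kaizen_engine code.

```rust
use rayon::prelude::*;

// Simplified stand-in for the engine's transaction type.
struct SignedTx {
    payload: Vec<u8>,
    signature: [u8; 65],
}

// Placeholder: the real code recovers the signer via keccak256 + ECDSA recovery.
fn recover_signer(tx: &SignedTx) -> Result<[u8; 20], String> {
    if tx.signature == [0u8; 65] {
        return Err("missing signature".into());
    }
    let _ = &tx.payload;
    Ok([0u8; 20])
}

/// Verify a whole batch across all CPU cores with a parallel iterator.
fn verify_batch(txs: &[SignedTx]) -> Vec<Result<[u8; 20], String>> {
    txs.par_iter().map(recover_signer).collect()
}
```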
- kaizen_engine::parallel::* - Parallel execution infrastructure
- kaizen_state::parallel::ParallelStateManager - Thread-safe state access

Files Changed:
- crates/engine/src/parallel.rs - Parallel execution pipeline
- crates/state/src/parallel.rs - Thread-safe state manager
- crates/engine/benches/tps.rs - Added parallel execution benchmarks
Documentation: See Architecture / Parallel Execution for details.
[Previous] - 2025-12-08
Fixed
Settler Sync Error on Pruned Blocks
Fixed settler failing to sync when write-node has pruned historical blocks.
Root Cause: The EventStreamClient in sidecar expected strictly sequential block heights. When the write-node had pruned blocks (e.g., only keeping last 50K blocks), the settler would request events from block 1 but receive events starting from the earliest available block (e.g., 28771), causing a height mismatch error.
Height mismatch: expected 1, got 28771
Event stream error, reconnecting in 5s...
(repeat forever)

Solution: Changed event stream sync logic to accept monotonically increasing heights with gaps. Instead of requiring strict sequential heights, we now:
- Accept any height that's greater than the last processed height
- Log when blocks are skipped due to pruning
- Continue syncing from wherever the server starts
// Before: Strict sequential check
if batch.height != expected_height {
return Err("Height mismatch");
}
// After: Monotonic increasing check with gap tolerance
if batch.height <= last_height {
return Err("Height not increasing");
}
// Gaps allowed, just log them

crates/app/src/sync/sidecar.rs - Relaxed height check in EventStreamClient::event_loop()
Added
RocksDB Native Prometheus Metrics
Added export of RocksDB internal statistics as Prometheus gauges for better storage observability.
New Metrics:

| Metric | Description |
|---|---|
| kaizen_rocksdb_estimate_num_keys | Estimated number of keys in DB |
| kaizen_rocksdb_live_data_size_bytes | Size of live data |
| kaizen_rocksdb_sst_files_size_bytes | Total SST file size |
| kaizen_rocksdb_memtable_size_bytes | Current memtable size |
| kaizen_rocksdb_block_cache_usage_bytes | Block cache memory usage |
| kaizen_rocksdb_block_cache_pinned_bytes | Pinned block cache memory |
| kaizen_rocksdb_num_running_compactions | Active compaction jobs |
| kaizen_rocksdb_num_running_flushes | Active flush jobs |
| kaizen_rocksdb_pending_compaction_bytes | Bytes pending compaction |
- Added export_metrics() method to RocksDbStorage
- Periodic export via background task in Server (every 5 seconds when metrics enabled)
- Uses RocksDB's property API to fetch internal statistics

Files Changed:
- crates/state/src/rocksdb_storage.rs - Added export_metrics() method
- crates/state/src/manager.rs - Added export_metrics() method
- crates/app/src/server.rs - Added periodic metrics export task
- crates/metrics/src/lib.rs - Added individual metric recording functions (record_rocksdb_num_keys(), record_rocksdb_live_data_size(), etc.)
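As a rough sketch (not the actual export_metrics() implementation), the property-to-gauge mapping could look like the following, assuming the rust-rocksdb property API and a recent metrics crate:

```rust
use std::{sync::Arc, time::Duration};

// Illustrative only: map RocksDB internal properties onto the gauges listed above.
fn export_rocksdb_metrics(db: &rocksdb::DB) {
    let props = [
        ("rocksdb.estimate-num-keys", "kaizen_rocksdb_estimate_num_keys"),
        ("rocksdb.estimate-live-data-size", "kaizen_rocksdb_live_data_size_bytes"),
        ("rocksdb.total-sst-files-size", "kaizen_rocksdb_sst_files_size_bytes"),
        ("rocksdb.cur-size-all-mem-tables", "kaizen_rocksdb_memtable_size_bytes"),
    ];
    for (prop, gauge_name) in props {
        if let Ok(Some(v)) = db.property_int_value(prop) {
            metrics::gauge!(gauge_name).set(v as f64);
        }
    }
}

// Background task: export every 5 seconds, matching the interval described above.
async fn metrics_export_loop(db: Arc<rocksdb::DB>) {
    let mut interval = tokio::time::interval(Duration::from_secs(5));
    loop {
        interval.tick().await;
        export_rocksdb_metrics(&db);
    }
}
```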
JMT Cache Prometheus Metrics
Added metrics for JMT version cache performance monitoring.
New Metrics:

| Metric | Description |
|---|---|
| kaizen_jmt_cache_size | Number of entries in version cache |
| kaizen_jmt_cache_hits_total | Total cache hits |
| kaizen_jmt_cache_misses_total | Total cache misses |
- Cache hits/misses recorded in JmtStorage::get_value_option()
- Cache size recorded after each commit in RocksDbStorage::commit()

Files Changed:
- crates/state/src/jmt_storage.rs - Added hit/miss recording
- crates/state/src/rocksdb_storage.rs - Added cache size recording
- crates/metrics/src/lib.rs - Added cache metric functions
Grafana Dashboard: Storage Performance Panels
Added new panels to the Kaizen Performance dashboard for RocksDB and JMT cache monitoring.
New Sections:
- Storage Performance (RocksDB & JMT Cache)
  - JMT Cache Size (per node)
  - JMT Cache Hit Rate
  - RocksDB Database Size
- RocksDB Internals (Native Stats)
  - Block Cache & Memtable Usage
  - Running Compactions & Flushes
  - Pending Compaction & Live Data

Files Changed:
- docker/monitoring/grafana/provisioning/dashboards/json/kaizen-performance.json - Added new panels
Performance
RocksDB Storage Optimizations
Applied comprehensive storage optimizations for improved throughput, lower latency, and reduced disk usage.
1. Column Family-Specific Options

Each column family now has optimized RocksDB options based on its access patterns:
| Column Family | Optimization |
|---|---|
| jmt_nodes | Bloom filter (10-bit), point lookup optimized |
| jmt_values | Prefix bloom (32-byte key_hash), seek optimized |
| stale_jmt_nodes | Lower memory, version prefix for range scans |
| blocks | ZSTD compression, larger blocks for sequential reads |
| block_hashes | Bloom filter for hash→height lookups |
| meta | Bloom filter for point lookups |
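For context, tuning along these lines might be expressed with rust-rocksdb roughly as follows; values and function names here are illustrative, not the exact options shipped in rocksdb_options.rs.

```rust
use rocksdb::{BlockBasedOptions, DBCompressionType, Options};

// Example: point-lookup-heavy CF (jmt_nodes-style) with a 10-bit bloom filter.
fn point_lookup_cf_options() -> Options {
    let mut block = BlockBasedOptions::default();
    block.set_bloom_filter(10.0, false); // ~1% false-positive rate
    let mut opts = Options::default();
    opts.set_block_based_table_factory(&block);
    opts
}

// Example: cold, sequential-read CF (blocks-style) with ZSTD and larger blocks.
fn sequential_cf_options() -> Options {
    let mut block = BlockBasedOptions::default();
    block.set_block_size(64 * 1024);
    let mut opts = Options::default();
    opts.set_block_based_table_factory(&block);
    opts.set_compression_type(DBCompressionType::Zstd);
    opts
}
```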
Added in-memory cache for latest version per key_hash to avoid disk seeks for current state reads.
// Fast path: cache hit for current state
jmt_version_cache: Arc<DashMap<KeyHash, VersionCacheEntry>>

- Cache auto-populates on slow-path reads
- Pruner invalidates cache entries after pruning
- Zero impact on state consistency (read-only optimization)
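A simplified sketch of the fast-path read; VersionCacheEntry fields and the disk fallback are placeholders for the real wiring in jmt_storage.rs.

```rust
use dashmap::DashMap;
use std::sync::Arc;

// Placeholder key/entry types standing in for the real JMT types.
type KeyHash = [u8; 32];

#[derive(Clone)]
struct VersionCacheEntry {
    version: u64,
    value: Option<Vec<u8>>,
}

struct JmtReader {
    jmt_version_cache: Arc<DashMap<KeyHash, VersionCacheEntry>>,
}

impl JmtReader {
    /// Current-state read: hit the in-memory cache first, fall back to disk.
    fn get_current(&self, key: &KeyHash) -> Option<Vec<u8>> {
        if let Some(entry) = self.jmt_version_cache.get(key) {
            return entry.value.clone(); // fast path, no disk seek
        }
        let entry = self.read_latest_from_disk(key)?; // slow path
        self.jmt_version_cache.insert(*key, entry.clone()); // auto-populate
        entry.value
    }

    fn read_latest_from_disk(&self, _key: &KeyHash) -> Option<VersionCacheEntry> {
        // Placeholder for the RocksDB reverse-seek lookup over versioned keys.
        None
    }
}
```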
Optimized stale JMT node pruning from O(n) individual deletes to O(1) range delete:
// Before: Individual deletes
for key in stale_indices { batch.delete_cf(cf_stale, key); }
// After: Single range delete
self.db.delete_range_cf(cf_stale, &start_key, &end_key)?;

New configuration options for Write-Ahead Log management:
[storage.rocksdb]
max_total_wal_size_mb = 1024 # Prevents unbounded WAL growth
wal_ttl_seconds = 3600 # Auto-delete old WAL files

Optional RocksDB statistics collection for performance analysis:
[storage.rocksdb]
enable_statistics = true # ~5-10% overhead

Access via storage.statistics_string() for block cache hit rates, compaction stats, and bloom filter effectiveness.
[storage.rocksdb]
write_buffer_size_mb = 128 # Write buffer per CF
max_write_buffer_number = 4 # Max write buffers before flush
block_cache_size_mb = 512 # Shared LRU block cache
max_background_jobs = 4 # Compaction/flush parallelism
enable_compression = true # LZ4 for hot, ZSTD for cold
bloom_filter_bits = 10 # Bloom filter bits per key
max_total_wal_size_mb = 1024 # Max WAL size before recycling
wal_ttl_seconds = 0 # WAL file TTL (0 = disabled)
enable_statistics = false # RocksDB internal stats

- RocksDbAppConfig::default() - Balanced for general use
- RocksDbAppConfig::production() - High-performance (256MB buffers, 1GB cache, stats enabled)
Files Changed:
- crates/state/src/rocksdb_options.rs - New RocksDB configuration module
- crates/state/src/rocksdb_storage.rs - CF-specific options, JMT cache wiring
- crates/state/src/jmt_storage.rs - Version cache support, fast-path reads
- crates/state/src/pruner.rs - Range delete optimization, cache invalidation
- crates/state/src/manager.rs - Added with_config() constructor
- crates/state/src/types.rs - Added StorageConfig composite type
- crates/app/src/config.rs - Exposed RocksDB options in app config
- crates/app/src/server.rs - Use new storage config
State Consistency: All optimizations are internal implementation details. State roots remain identical across nodes. Verified with existing test suite (55 state tests, 52 engine tests passed).
Fixed
Tester: WebSocket Race Condition on Wallet Switch
Fixed "WebSocket not connected" error when switching to a different wallet.
Root Cause: When switching wallets, the mainWalletAddress changed and triggered the WebSocket subscription effect. However, the React state isWebSocketConnected was still true from the previous connection (hadn't propagated yet), while the actual client.ws instance was already null or disconnected. The guard check passed but subscribeUserTheses threw.
Solution: Added synchronous check on the client's actual WebSocket state (client.isWebSocketConnected) in addition to the React state check.
// Before: Only React state check
if (!client || !isWebSocketConnected || !mainWalletAddress || isMockMode) {
// After: Also check client's internal state
if (
!client ||
!isWebSocketConnected ||
!client.isWebSocketConnected || // ← Catches race condition
!mainWalletAddress ||
isMockMode
) {

Files Changed:
- apps/tester/src/hooks/use-thesis-sync.ts - Added client.isWebSocketConnected guard
- apps/tester/src/hooks/use-price-stream.ts - Same defensive fix applied
Changed
Documentation Restructure
Reorganized docs for better agent-friendliness and task-oriented navigation.
New Structure:

docs/pages/
├── introduction/        ← What is Kaizen
├── api/                 ← Quick lookup (NEW)
│   ├── rpc.mdx          ← JSON-RPC + WebSocket
│   ├── transactions.mdx
│   └── errors.mdx
├── sdk/                 ← TypeScript SDK
├── deployment/          ← How to run (NEW)
│   ├── docker.mdx
│   ├── configuration.mdx
│   └── monitoring.mdx
├── architecture/        ← How it works (MERGED)
│   ├── overview.mdx
│   ├── stf.mdx
│   ├── block-production.mdx
│   ├── settlement.mdx
│   ├── oracle.mdx
│   └── storage.mdx
├── components/          ← Individual services
└── reference/           ← Misc reference

| Before | After | Why |
|---|---|---|
| operations/ + advanced/ | deployment/ | Task-oriented |
| core-concepts/ + execution/ | architecture/ | Related content merged |
| reference/transactions.mdx | api/transactions.mdx | Better discoverability |
| API buried in operations/ | api/ section | Quick lookup |
- docs/pages/core-concepts/ - Merged into architecture/
- docs/pages/execution/ - Merged into architecture/
- docs/pages/operations/ - Split into api/ and deployment/
- docs/pages/advanced/ - Moved to deployment/ and architecture/
- docs/pages/components/tester.mdx - Removed from sidebar (demo app)

Files Changed:
- docs/vocs.config.ts - New sidebar structure
- docs/pages/api/* - New API reference section
- docs/pages/deployment/* - New deployment section
- docs/pages/architecture/* - Merged architecture section
- Multiple cross-reference fixes across docs
README Cleanup
Simplified README.md to focus on quick start, pointing to docs for details.
Changes:
- Fixed outdated binary name (kaizen-app → kaizen-node)
- Updated project structure (added missing apps)
- Simplified to ~125 lines (was ~400)
- Added link to docs.miyao.ai
Fixed
Prometheus Histogram Metrics Export
Fixed histogram metrics being exported as summaries instead of proper histograms, causing histogram_quantile() queries to fail in Grafana.
Root Cause: The metrics-exporter-prometheus crate defaults to exporting histograms as summaries (with quantile labels). Grafana's histogram_quantile() function requires proper histogram format with _bucket suffix.
Solution: Explicitly configured PrometheusBuilder with histogram buckets:
const LATENCY_BUCKETS: &[f64] = &[
0.0001, 0.0005, 0.001, 0.0025, 0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1.0, 2.5, 5.0, 10.0,
];
PrometheusBuilder::new()
.set_buckets(LATENCY_BUCKETS)
.set_buckets_for_metric(Matcher::Suffix("tx_count"), COUNT_BUCKETS)
.set_buckets_for_metric(Matcher::Suffix("_amount"), AMOUNT_BUCKETS)kaizen_tx_execution_duration_seconds- TX execution latencykaizen_tx_validation_duration_seconds- TX validation latencykaizen_block_production_duration_seconds- Block production timekaizen_block_tx_count- Transactions per blockkaizen_rfq_bet_amount/kaizen_rfq_payout- RFQ amounts
crates/metrics/src/prometheus.rs- Added bucket configuration
RFQ Settlement Status Labels
Fixed inconsistent status labels for kaizen_rfq_settled_total metric between engine and settler.
Previous Behavior: Engine used debug format (SettledUserWin, SettledSolverWin) while settler used snake_case (user_wins, solver_wins).
Solution: Standardized both to use user_wins / solver_wins labels.
crates/engine/src/lib.rs- Changed status label format to match settler
Grafana Dashboard Cleanup
Removed uninstrumented metric panels from dashboards and fixed broken queries.
Removed Panels (not instrumented in code):

| Dashboard | Removed |
|---|---|
| Overview | Active RFQs, Mempool, Sync Clients, Pending W/D, Uptime, Oracle metrics, Memory, DB Size |
| Debug | Mempool Deep Dive section, Storage Debug section |
| Compare | TX Latency comparisons, Memory comparisons, DB Size Growth |
| Business | Renamed "Won/Lost" to "User Wins/Solver Wins" |
- Sync Lag: Changed to scalar(kaizen_block_height) - kaizen_sync_height for correct label matching
- Business dashboard: Updated status labels from won/lost to user_wins/solver_wins
Files Changed:
- docker/monitoring/grafana/provisioning/dashboards/json/kaizen-overview.json
- docker/monitoring/grafana/provisioning/dashboards/json/kaizen-debug.json
- docker/monitoring/grafana/provisioning/dashboards/json/kaizen-compare.json
- docker/monitoring/grafana/provisioning/dashboards/json/kaizen-business.json
Performance
State Layer Optimizations
Applied drop-in performance optimizations to the state layer for improved throughput and lower latency.
1. DashMap for Pending Updates

Replaced RwLock<HashMap> with lock-free DashMap for pending state updates.
// Before: Lock contention on every read/write
pending_updates: Arc<RwLock<HashMap<KeyHash, Option<Vec<u8>>>>>
// After: Lock-free concurrent access
pending_updates: Arc<DashMap<KeyHash, Option<Vec<u8>>>>

Impact: Eliminates lock contention during concurrent state operations within a block.
2. JMT Value Lookup O(1)

Optimized JMT value lookup from O(n) iteration to O(1) using RocksDB's reverse seek.
// Before: Iterate through ALL versions
for item in prefix_iterator { ... } // O(n)
// After: Direct seek to target version
iterator_cf(IteratorMode::From(&seek_key, Direction::Reverse))
iter.next() // O(1)

Impact: State reads now constant-time regardless of version history depth.
3. Hot Path Caching

Added in-memory caching for frequently-read, rarely-written values.
| Cached Value | Read Frequency | Write Frequency |
|---|---|---|
| SystemPaused | Every transaction | Admin only |
| GlobalConfig | Every RFQ submit | Admin only |
Cache is invalidated on begin_block() and cleared on writes.
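A minimal sketch of this hot-path cache pattern, using the parking_lot dependency added in this change; the names are illustrative, not the StateManager code.

```rust
use parking_lot::RwLock;

// Illustrative cache for a frequently-read, rarely-written flag.
#[derive(Default)]
struct HotCache {
    system_paused: RwLock<Option<bool>>,
}

impl HotCache {
    fn get_or_load(&self, load: impl FnOnce() -> bool) -> bool {
        if let Some(v) = *self.system_paused.read() {
            return v; // cached, no state read
        }
        let v = load(); // read from state on a miss
        *self.system_paused.write() = Some(v);
        v
    }

    /// Called from begin_block() and on writes to the underlying key.
    fn invalidate(&self) {
        *self.system_paused.write() = None;
    }
}
```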
| Metric | Before | After |
|---|---|---|
| State commit (read-node) | ~3-5ms | ~1ms |
| JMT value lookup | O(n) | O(1) |
dashmap = "6.1"
parking_lot = "0.12"crates/state/src/rocksdb_storage.rs- DashMap for pending_updatescrates/state/src/jmt_storage.rs- Reverse seek optimizationcrates/state/src/manager.rs- Hot path cachingcrates/state/Cargo.toml- Added dashmap, parking_lotCargo.toml- Workspace dependencies
State Consistency: All changes are internal implementation details. State roots remain identical between nodes. Verified with 189k+ blocks synced with 0 mismatches.
Added
Multi-Chain Withdrawal Support
Withdrawals now support specifying both destination address and destination chain ID, enabling proper multi-chain bridge functionality.
Previous Behavior: WithdrawTx only had a single destination field (Address type), which was incorrectly repurposed as chain ID in the bridge relayer.
// Before
pub struct WithdrawTx {
pub amount: u64,
pub destination: Address,
}
// After
pub struct WithdrawTx {
pub amount: u64,
pub destination_address: Address,
pub destination_chain_id: u64,
}

| Chain ID | Network |
|---|---|
| 42161 | Arbitrum |
| 421614 | Arbitrum Sepolia |
| 8453 | Base |
| 84532 | Base Sepolia |
// Before
withdraw(amount, destination);
// After
withdraw(amount, destinationAddress, destinationChainId);

- Chain selector in Bridge Modal now sets destinationChainId
- Pending Withdrawals tab shows destination address with chain name
- Example: 0x1234...abcd (Base Sepolia)
# Before
kata tx withdraw --destination 0x... --amount 1000000
# After
kata tx withdraw --destination 0x... --chain-id 84532 --amount 1000000

Files Changed:
- crates/tx/src/payload.rs - WithdrawTx with new fields
- crates/types/src/withdrawal.rs - WithdrawalRequest with new fields
- crates/types/src/event.rs - Event::WithdrawRequested with new fields
- crates/engine/src/executors/bridge.rs - Updated executor
- crates/app/src/rpc/types.rs - RPC response with new fields
- crates/app/src/indexer/mod.rs - SQL schema and event handling
- apps/cli/src/commands/tx.rs - Added --chain-id flag
- sdk/src/types.ts - Updated TypeScript types
- sdk/src/schema/tx.ts - Updated Borsh schema
- sdk/src/signer.ts - Updated EIP-712 type hash
- apps/tester/src/components/BridgeModal.tsx - Chain selection
- apps/tester/src/components/ThesisPanel.tsx - Display chain name
- bridge/src/relayer/server.ts - Use correct fields
Breaking Change: Borsh encoding changed for WithdrawTx. Requires coordinated upgrade of all components.
Settings Modal with Theme Customization
Added a Settings modal accessible from the sidebar, providing user-configurable options for URLs and visual themes.
Features:
- URL Configuration: Customize Solver URL, Node RPC URL, and Node WebSocket URL
- Theme Selection: 6 themes each for Normal Mode and Degen Mode
- Persistence: Settings automatically saved to localStorage
- Reset: One-click reset to default settings
Normal Mode Themes:

| Theme | Primary Color | Description |
|---|---|---|
| Emerald | #22c55e | Classic green (default) |
| Cyber | #06b6d4 | Cyan/teal |
| Sunset | #f59e0b | Amber/gold |
| Arctic | #a5b4fc | Lavender/indigo |
| Neon | #e879f9 | Pink/fuchsia |
| Monochrome | #d4d4d8 | Gray/zinc |
Degen Mode Themes:

| Theme | Primary Color | Description |
|---|---|---|
| Inferno | #f97316 | Orange (default) |
| Plasma | #ef4444 | Red |
| Blaze | #fbbf24 | Yellow/amber |
| Volcanic | #fb7185 | Rose/pink |
| Supernova | #c026d3 | Purple/fuchsia |
| Ember | #fdba74 | Peach/orange light |
- Theme colors applied via CSS variables (--theme-primary, --theme-primary-rgb, --theme-bg, etc.)
- useThemeApplier hook updates document root CSS variables on theme change
- All major UI components updated to use theme variables instead of hardcoded colors

Files Changed:
- apps/tester/src/stores/use-settings-store.ts - New store with theme definitions and persistence
- apps/tester/src/components/SettingsModal.tsx - New settings modal component
- apps/tester/src/hooks/use-theme.ts - Theme applier and color hooks
- apps/tester/src/hooks/use-config.ts - Dynamic config based on settings
- apps/tester/src/styles/globals.css - CSS variable definitions
- apps/tester/src/pages/_app.tsx - Theme applier integration
- Multiple components updated to use CSS variables
Pending Margin Reservation for Box Creation
Box creation now reserves margin from available balance, preventing users from creating multiple boxes that exceed their total balance.
Previous Behavior: Users could draw multiple boxes in quick succession. Each box only checked against total balance, not accounting for boxes still in QUOTING status. This allowed creating boxes whose combined margins exceeded the user's actual balance, causing on-chain failures.
New Behavior:
- Available balance = Total balance - Sum of all QUOTING theses' margins
- Box creation blocked if available balance < bet amount
- Error message includes pending amount: "Insufficient balance: X USDC available (Y pending)"
- Double-check in executeDegenThesis prevents race conditions
| Mode | Check Location |
|---|---|
| Degen Mode | handleMouseDown + executeDegenThesis in Chart.tsx |
| Modal Mode | useThesisValidation hook (balance check) |
Files Changed:
- apps/tester/src/components/Chart.tsx - Added pending margin calculation in handleMouseDown and executeDegenThesis
- apps/tester/src/hooks/use-thesis-validation.ts - Balance check now subtracts pending QUOTING margins
Pending Withdrawals Tab in Tester
Added a new "Pending Withdrawals" tab to the tester footer panel, showing user's pending withdrawal requests.
Features:
- Displays withdrawal ID, status, amount, destination address, and request timestamp
- Status indicator with animated pulse for pending items
- Loading state while fetching data
- Empty state with guidance to request withdrawal from bridge modal
- Auto-refreshes every 10 seconds
- Added getWithdrawalsByUser() method to fetch paginated withdrawal IDs for a user
- Added getPendingWithdrawalsForUser() convenience method that filters unprocessed withdrawals by user

Files Changed:
- sdk/src/rpc.ts - Added getWithdrawalsByUser()
- sdk/src/client.ts - Added getWithdrawalsByUser(), getPendingWithdrawalsForUser()
- apps/tester/src/hooks/use-withdrawals-query.ts - New hook for fetching pending withdrawals
- apps/tester/src/lib/query-keys.ts - Added withdrawals.pending query key
- apps/tester/src/components/ThesisPanel.tsx - Added WithdrawalsTab component and tab button
Modal UX Improvements
Improved modal interaction patterns across tester app.
CheatCodePanel:
- Clicking outside the modal (backdrop) now closes it

BridgeModal:
- Clicking outside the modal (backdrop) now closes it
- Auto-closes after 1.5 seconds when transaction is successful (deposit confirmed or withdrawal submitted)

Files Changed:
- apps/tester/src/components/CheatCodePanel.tsx - Added backdrop click handler
- apps/tester/src/components/BridgeModal.tsx - Added backdrop click handler and auto-close effect
RPC Pagination Response Format
Updated thesis-related RPC methods to return proper paginated responses with metadata.
Previous Behavior: Methods like kaizen_getThesesByUser returned a plain array Thesis[], making it impossible to know total count or implement pagination UI.
New Behavior: Returns PaginatedResponse<Thesis> with full pagination metadata:
{
"items": [...],
"total": 150,
"limit": 100,
"offset": 0,
"hasMore": true
}

| Method | Change |
|---|---|
| kaizen_getThesesByUser | Returns PaginatedResponse<Thesis> |
| kaizen_getThesesBySolver | Returns PaginatedResponse<Thesis> |
| kaizen_internal_getPendingTheses | Returns PaginatedResponse<Thesis> |
| kaizen_internal_getThesesByStatus | Returns PaginatedResponse<Thesis> |
| kaizen_internal_getThesesByPair | Returns PaginatedResponse<Thesis> |
Note: kaizen_getWithdrawalsByUser already returned PaginatedResponse<number> (withdrawal IDs).
- Added PaginatedResponse<T> type export
- Updated getThesesByUser(), getThesesBySolver(), getMyTheses() return types
- Added getWithdrawalsByUser() method to SDK client

Files Changed:
- crates/app/src/rpc/methods.rs - Updated RPC handlers
- sdk/src/types.ts - Added PaginatedResponse<T> interface
- sdk/src/rpc.ts - Updated return types, added getWithdrawalsByUser
- sdk/src/client.ts - Updated return types, added withdrawal methods
- sdk/src/index.ts - Exported PaginatedResponse
- apps/tester/src/hooks/use-thesis-sync.ts - Updated to use response.items
- apps/mock-solver/src/rpc.ts - Updated return type
- apps/cli/src/rpc.rs - Added PaginatedResponse<T> type
- apps/cli/src/commands/degen.rs - Updated to use response.items
Breaking Change: SDK methods now return PaginatedResponse<Thesis> instead of Thesis[]. Update callsites to access .items property.
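For the Rust side (the CLI also gained a PaginatedResponse<T> type), a sketch of the response shape and a paging loop might look like this; fetch_page is a hypothetical closure over the RPC call, and the field names follow the JSON example above.

```rust
use serde::Deserialize;

#[derive(Deserialize)]
#[serde(rename_all = "camelCase")]
struct PaginatedResponse<T> {
    items: Vec<T>,
    total: u64,
    limit: u64,
    offset: u64,
    has_more: bool,
}

// Illustrative: page through a method like kaizen_getThesesByUser until has_more is false.
fn fetch_all<T, F>(mut fetch_page: F) -> Vec<T>
where
    F: FnMut(u64) -> PaginatedResponse<T>,
{
    let mut all = Vec::new();
    let mut offset = 0;
    loop {
        let page = fetch_page(offset);
        offset += page.items.len() as u64;
        all.extend(page.items);
        if !page.has_more {
            return all;
        }
    }
}
```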
Grafana Dashboard Suite
Expanded monitoring dashboards from 1 to 5 specialized views for different use cases.
New Dashboards:

| Dashboard | UID | Purpose |
|---|---|---|
| Overview | kaizen-overview | Health check, business metrics (slimmed down) |
| Performance | kaizen-performance | Deep-dive latency, throughput analysis |
| Business | kaizen-business | Trading volume, win rates, bridge flows |
| Debug | kaizen-debug | Error analysis, latency breakdown, logs |
| Compare | kaizen-compare | Period-over-period trend comparison |
Kaizen Performance (/d/kaizen-performance):
- TPS & throughput with success/fail breakdown
- TX execution latency distribution (p50/p90/p95/p99)
- TX latency by type (transfer, deposit, submit_thesis, etc.)
- Block production phase breakdown (validation → execution → commit)
- Storage IOPS and latency (state, RocksDB)
- Sync performance and lag tracking
- Pruning duration breakdown
Kaizen Business (/d/kaizen-business):
- Total trading volume (USDC)
- Trade outcomes (won/lost/cancelled/expired)
- Hourly volume bars and win rate trends
- Bet size distribution over time
- Bridge deposit/withdrawal flows and net flow
- Transaction type mix (pie chart)
Kaizen Debug (/d/kaizen-debug):
- Error counters with color thresholds (failed TXs, rejections, sig failures)
- Failed transactions by type
- Mempool rejection reasons
- Latency breakdown by TX type and RPC method
- Mempool queue sizes and eviction rates
- Storage IOPS and p99 latency
- Integrated error logs from all services
Kaizen Compare (/d/kaizen-compare):
- Today vs Yesterday vs Last Week overlays for TPS, latency, RFQ rate
- Period-over-period % change stat panels
- Block production and storage growth trends
- Uses Prometheus offset for time-shifted queries
Kaizen Overview (/d/kaizen-overview, slimmed down):
- Reduced from ~2600 lines to ~500 lines (80% reduction)
- Removed detailed latency panels (moved to Performance dashboard)
- Added link to Performance dashboard for deep-dive
- Focused on business metrics and high-level health
Files Changed:
- docker/monitoring/grafana/provisioning/dashboards/json/kaizen-overview.json - Slimmed down
- docker/monitoring/grafana/provisioning/dashboards/json/kaizen-performance.json - New
- docker/monitoring/grafana/provisioning/dashboards/json/kaizen-business.json - New
- docker/monitoring/grafana/provisioning/dashboards/json/kaizen-debug.json - New
- docker/monitoring/grafana/provisioning/dashboards/json/kaizen-compare.json - New
JMT Node Pruning
Fixed unbounded disk growth across all node types (write, read-aggressive, read-archive) by implementing JMT (Jellyfish Merkle Tree) node pruning.
Problem: All three node types experienced identical disk growth rates regardless of pruning configuration. The aggressive and custom pruning modes only pruned JMT values, blocks, and snapshots - but not the JMT tree structure nodes themselves.
Root Cause: The TreeUpdateBatch.stale_node_index_batch from the JMT library was completely ignored. This batch tracks which tree nodes become obsolete at each version, enabling safe deletion of old nodes.
Solution:
- New column family - CF_STALE_JMT_NODES stores stale node indices on each commit
- Stale node tracking - Records (stale_since_version, node_key) for each obsolete node
- Pruning implementation - prune_jmt_nodes() deletes nodes where stale_since_version < min_version_to_keep (see the sketch below)
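A sketch of that pruning pass, assuming a recent rust-rocksdb and stale-index keys laid out as (stale_since_version big-endian || node_key) so the column family iterates in version order; this illustrates the idea rather than the exact prune_jmt_nodes() code.

```rust
use rocksdb::{IteratorMode, WriteBatch, DB};

// Illustrative pruning pass over the stale-node index column family.
fn prune_jmt_nodes(db: &DB, min_version_to_keep: u64) -> Result<u64, rocksdb::Error> {
    let cf_stale = db.cf_handle("stale_jmt_nodes").expect("cf exists");
    let cf_nodes = db.cf_handle("jmt_nodes").expect("cf exists");

    let mut batch = WriteBatch::default();
    let mut pruned = 0u64;

    for item in db.iterator_cf(cf_stale, IteratorMode::Start) {
        let (key, _) = item?;
        let version = u64::from_be_bytes(key[..8].try_into().unwrap());
        if version >= min_version_to_keep {
            break; // keys are version-ordered; nothing older remains
        }
        batch.delete_cf(cf_nodes, &key[8..]); // drop the obsolete tree node
        pruned += 1;
    }

    // Remove the processed stale indices with a single range delete.
    let end = min_version_to_keep.to_be_bytes();
    batch.delete_range_cf(cf_stale, [0u8; 8], end);
    db.write(batch)?;
    Ok(pruned)
}
```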
| Node Type | Disk Growth |
|---|---|
| write | Stabilizes at blocks_to_keep × avg_block_size |
| read-aggressive | Stabilizes at ~16 min of history |
| read-archive | Grows indefinitely (pruning disabled) |
New Metrics:
- kaizen_pruning_jmt_nodes_total - Total JMT nodes pruned
- kaizen_pruning_jmt_nodes_duration_seconds - JMT node pruning duration
- kaizen_pruning_blocks_duration_seconds - Block pruning duration
- kaizen_pruning_snapshots_duration_seconds - Snapshot pruning duration
- kaizen_pruning_jmt_values_duration_seconds - JMT values pruning duration

Files Changed:
- crates/state/src/jmt_storage.rs - Added CF_STALE_JMT_NODES, stale index storage
- crates/state/src/rocksdb_storage.rs - Create new CF on DB open
- crates/state/src/pruner.rs - Implemented prune_jmt_nodes()
- crates/state/src/block_storage.rs - Added jmt_nodes_pruned to PruneStats
- crates/metrics/src/lib.rs - Added pruning timing metrics
Migration Note: Existing databases will automatically create the new column family on startup. However, historical stale node data is not available, so previously accumulated JMT nodes won't be pruned. For aggressive pruning nodes, consider wiping data and re-syncing from archive node.
Batch Settlement for Settler Sidecar
Settler now batches multiple settlements into a single transaction for improved efficiency.
Previous Behavior: Each settlement was submitted as a separate transaction, requiring N signatures and N RPC calls for N settlements.
New Behavior:
- Batch collection - Collects settlements for 50ms or until batch_size (default 100) is reached
- Single transaction - All collected settlements are submitted in one SystemSettle transaction
- Atomic execution - Uses validation-first pattern to ensure all-or-nothing semantics
- Unified type - SystemSettleTx now contains Vec<Settlement> (single settlement = batch of 1)
- Reduced transaction count: N settlements → 1 transaction
- Reduced signature overhead: N signatures → 1 signature
- Reduced RPC calls: N calls → 1 call
- Lower latency for burst settlements
// Before: Two separate types
pub struct SystemSettleTx {
pub thesis_id: u64,
pub settlement_type: SystemSettlementType,
}
pub struct SystemBatchSettleTx {
pub settlements: Vec<Settlement>,
}
// After: Unified type
pub struct SystemSettleTx {
pub settlements: Vec<Settlement>,
}
pub struct Settlement {
pub thesis_id: u64,
pub settlement_type: SystemSettlementType,
}

Files Changed:
- crates/tx/src/payload.rs - Unified SystemSettleTx with Vec<Settlement>
- crates/engine/src/executors/rfq.rs - Added execute_settlement() with validation-first pattern
- crates/engine/src/lib.rs - Single handler for settlement
- crates/app/src/settler/service.rs - Batch collection and submission logic
- sdk/src/schema/tx.ts - Added SystemSettleTxSchema, SettlementSchema
Breaking Change: Borsh encoding changed. Requires coordinated upgrade of settler and nodes.
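A sketch of the batch-collection step, assuming a tokio mpsc channel feeding the submitter; the 50 ms window and batch_size cap mirror the description above, but the names and structure are illustrative.

```rust
use std::time::Duration;
use tokio::{sync::mpsc, time::{timeout_at, Instant}};

// Trimmed stand-in for the Settlement payload type defined above.
struct Settlement {
    thesis_id: u64,
}

/// Collect settlements for up to 50 ms (or until `batch_size` is reached),
/// then hand the batch to the submitter as a single SystemSettle transaction.
async fn collect_batch(rx: &mut mpsc::Receiver<Settlement>, batch_size: usize) -> Vec<Settlement> {
    let mut batch = Vec::new();
    // Wait for the first settlement, then keep the window open for 50 ms.
    match rx.recv().await {
        Some(first) => batch.push(first),
        None => return batch, // channel closed
    }
    let deadline = Instant::now() + Duration::from_millis(50);
    while batch.len() < batch_size {
        match timeout_at(deadline, rx.recv()).await {
            Ok(Some(s)) => batch.push(s),
            _ => break, // window elapsed or channel closed
        }
    }
    batch
}
```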
Hybrid Thesis Sync (WebSocket + RPC Polling)
Tester app now uses a hybrid approach for active thesis and thesis history, combining real-time WebSocket events with RPC polling for improved reliability.
Previous Behavior: Thesis data only existed in local memory. If WebSocket disconnected, settlement events were missed. No history persisted across sessions.
New Behavior:
- Initial load from RPC - Fetches thesis history on connect via getThesesByUser()
- Real-time WebSocket - Subscribes to subscribeUserTheses() for immediate settlement notifications
- Fallback RPC polling - When WebSocket disconnects, polls every 5s (only if active theses exist)
- Reconnection sync - On WebSocket reconnect, syncs from RPC to catch missed events
- Thesis history persists across browser refreshes
- Missed settlements are recovered when WebSocket is down
- Efficient - real-time events when available, polling only as fallback
Files Changed:
- apps/tester/src/hooks/use-thesis-sync.ts - New hybrid sync hook
- apps/tester/src/stores/use-thesis-store.ts - Added syncTheses, updateThesisByThesisId, clearAll actions
- apps/tester/src/hooks/use-price-stream.ts - Removed thesis subscription (moved to sync hook)
- apps/tester/src/pages/index.tsx - Integrated useThesisSync hook
Fixed
Tester: Bridge Withdrawal Targeting External Chain
Fixed withdrawal requests incorrectly calling external chain gateway contracts instead of Kaizen Core.
Root Cause: BridgeModal.tsx used wagmi's writeContract to call an external gateway contract for withdrawals, which is incorrect. Withdrawals should be submitted to Kaizen Core, which then gets processed by the relayer to send funds to the external chain.
// Wrong: Calling external chain contract
writeContract({
address: selectedChainConfig.gatewayAddress,
abi: GATEWAY_ABI,
functionName: "withdraw",
args: [amountParsed, wagmiAddress],
});

// Correct: Submit to Kaizen Core
const payload = withdrawPayload(amountParsed, wagmiAddress);
await client.sendTransaction(payload, { waitForConfirmation: true });

- Description updated: "Request withdrawal from Kaizen. Funds will be sent to [chain] by the relayer."
- Shows Kaizen Core transaction hash instead of external chain explorer link
- Info message: "Withdrawals are submitted to Kaizen Core. The relayer will process and send funds to your selected chain."
- Withdraw button no longer requires chain switch (only deposits need external chain interaction)
apps/tester/src/components/BridgeModal.tsx - Use SDK's withdraw() payload builder and client.sendTransaction()
WebSocket Duplicate Event Broadcast
Fixed duplicate WebSocket events being sent to frontend clients, which could cause unnecessary re-renders and subscription loops.
Root Cause: In write-node's executor.rs, events were broadcast twice:
- Immediately during execute_tx() for real-time feedback
- Again during checkpoint() when the block was committed
This meant every thesis settlement, transfer, and oracle price update was delivered to WebSocket subscribers twice.
Solution: Removed immediate event broadcast from execute_tx(). Events are now only broadcast once during checkpoint(), which includes both transaction events and oracle price events from begin_block().
Trade-off: Transaction events now have ~100ms higher latency (wait for next checkpoint) but are guaranteed to be delivered exactly once.
Files Changed:
- crates/app/src/executor.rs - Removed duplicate event broadcast in execute_tx()
WebSocket UserTheses Subscription Missing Settlement Events
Fixed users not receiving RfqSettled events for their own theses when they lost.
Root Cause: The UserTheses WebSocket subscription in subscriptions.rs filtered RfqSettled events by winner == address instead of user == address. This meant users only received settlement notifications when they won, not when they lost.
Solution: Changed the filter condition to check if the event's user field matches the subscription address, ensuring users receive all settlement events for their theses regardless of outcome.
crates/app/src/ws/subscriptions.rs - Fixed RfqSettled event filter to use user instead of winner
Tester: WebSocket Subscription Loop on Thesis Updates
Fixed infinite WebSocket re-subscription loop that could occur when thesis events were received.
Root Cause: In use-thesis-sync.ts, the handleThesisEvent callback had theses array in its dependency list. When a thesis event arrived and updated the store, the callback was recreated, which triggered the useEffect to unsubscribe and re-subscribe to the WebSocket channel, which could cause duplicate events and further re-renders.
Solution: Wrapped handleThesisEvent in a useRef to keep a stable reference. The subscription useEffect now only depends on connection state, not on the callback itself. The ref is updated on each render to always have access to the latest store state.
apps/tester/src/hooks/use-thesis-sync.ts - Stabilized callback reference with useRef
Settler: Invalid Breach Timestamp Outside Thesis Window
Fixed "Invalid breach timestamp: X not in [start_time, end_time]" error when settler submitted SolverWins settlements.
Root Cause: The find_breach function in settler could return a breach timestamp that was slightly before the thesis's start_time. This happened because get_price_at() uses a 100ms tolerance window, so it might return a price from timestamp T-50ms when querying for timestamp T. The breach was valid (price did breach), but the returned timestamp was outside the thesis's valid observation period.
Solution: Added explicit bounds check in find_breach to ensure the actual timestamp (from the price cache) falls within [thesis.start_time, thesis.end_time] before returning it as a valid breach.
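A condensed sketch of the added check; parameter names follow the description above and this is not the literal find_breach code.

```rust
// Only report a breach if the actual cached timestamp lies inside the
// thesis's observation window [start_time, end_time].
fn validate_breach(
    actual_ts: u64,
    price: u64,
    start_time: u64,
    end_time: u64,
) -> Option<(u64, u64)> {
    if actual_ts < start_time || actual_ts > end_time {
        return None; // price came from the 100ms tolerance window, outside the thesis
    }
    Some((actual_ts, price))
}
```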
crates/app/src/settler/service.rs - Added timestamp bounds validation in find_breach()
Settler: State Loss on Restart
Added height persistence to settler so it can resume from the last processed block after restart.
Previous Behavior: Settler always started from block 0 on restart, requiring full event replay which could be slow or fail if historical events were pruned.
New Behavior:
- Settler saves last processed height to {data_dir}/height.txt every 100 blocks
- On startup, reads persisted height and resumes event stream from that point
- Falls back to height 0 if persistence file doesn't exist or is corrupted
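A sketch of what the persistence helpers could look like; the actual functions are read_persisted_height() and write_persisted_height() in the settler service, and this only illustrates the shape.

```rust
use std::{fs, path::Path};

/// Read the last processed height from {data_dir}/height.txt, falling back to 0.
fn read_persisted_height(data_dir: &Path) -> u64 {
    fs::read_to_string(data_dir.join("height.txt"))
        .ok()
        .and_then(|s| s.trim().parse().ok())
        .unwrap_or(0)
}

/// Persist the height (called every 100 blocks in the service).
fn write_persisted_height(data_dir: &Path, height: u64) -> std::io::Result<()> {
    fs::create_dir_all(data_dir)?;
    fs::write(data_dir.join("height.txt"), height.to_string())
}
```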
# CLI
settler --data-dir ./.data/settler
# Environment variable
SETTLER_DATA_DIR=./.data/settler

Files Changed:
- crates/app/src/settler/config.rs - Added data_dir field
- crates/app/src/settler/service.rs - Added read_persisted_height(), write_persisted_height()
- apps/settler/src/main.rs - Added --data-dir CLI argument
Settler: Failed Settlement Retry
Fixed settler not retrying settlements that failed due to RPC or execution errors.
Root Cause: When a settlement transaction failed (either RPC error or execution error like "Invalid breach timestamp"), the thesis remained in pending_settlements indefinitely. The breach detector would skip it, assuming a settlement was already in flight.
Solution: Added feedback channel from settlement_submitter to a new settlement_result_handler task. When a settlement fails, its thesis_id is removed from pending_settlements, allowing the breach detector to pick it up again for retry.
crates/app/src/settler/service.rs - Added SettlementResult, settlement_result_handler(), feedback channel
Settler: Breach Price Mismatch on SolverWins Settlement
Fixed "Breach price mismatch" error when settler submitted SolverWins settlements.
Root Cause: The find_breach function used the iteration timestamp (t) instead of the actual price entry timestamp when reporting breaches. When prices were cached at timestamps slightly different from 100ms intervals, the node's ring buffer lookup would return a different price.
Example:
- Oracle price at timestamp 1050 → P1
- Settler iterates at t=1100, finds P1 (within 100ms tolerance)
- Settler sends: breach_timestamp=1100, breach_price=P1
- Node: slot_for_timestamp(1100) = slot 11, which has P2 (from timestamp 1150)
- Mismatch: P1 ≠ P2

Solution: Changed get_price_at() to return (actual_timestamp, price) tuple instead of just price. The breach detector now uses the actual oracle timestamp, ensuring the node's ring buffer lookup returns the same price.
crates/app/src/settler/service.rs - Fixed get_price_at() and find_breach() to use actual timestamps
Settler: Double Settlement Race Condition
Fixed "Thesis not active" error caused by settler submitting duplicate settlements for the same thesis.
Root Cause: After detecting a breach and sending a settlement decision to the channel, the thesis remained in active_theses until the RfqSettled event arrived. The next breach detection cycle would detect the same breach and send another settlement, which failed because the thesis was already settled.
Solution: Added pending_settlements: HashSet<u64> to track theses with in-flight settlements:
- Before detection, filter out theses already in pending_settlements
- After detection, mark decided theses as pending before sending to channel
- When RfqSettled event arrives, clear both active_theses and pending_settlements
crates/app/src/settler/service.rs- Added pending settlement tracking
Settler: Enhanced Settlement Response Logging
Added detailed logging for settlement transaction responses to aid debugging.
New Log Fields:
- On success: status, block_height, tx_index from receipt
- On execution failure: receipt.error message, individual settlement details
- On RPC error: breach_timestamp, breach_price for each failed settlement

New Metric:
- settler_execution_errors_total - Count of transactions included but failed execution

Files Changed:
- crates/app/src/settler/service.rs - Enhanced submit_batch() logging
Tester: API Wallet Mismatch on Wallet Switch
Fixed "Invalid user signature: signer is not user nor an authorized API wallet" error when switching accounts in external wallet.
Root Cause: When user switched accounts in MetaMask, the localStorage API wallet still belonged to the previous account. The new account would try to sign quotes with the old API wallet, causing signature verification failures.
Solution: Added wallet change detection that automatically clears mismatched API wallets.
Files Changed:
- apps/tester/src/stores/use-wallet-store.ts - Added handleMainWalletChange() to clear API wallet on account switch
- apps/tester/src/hooks/use-kaizen-client.tsx - Calls handleMainWalletChange() when wagmi address changes
Tester: WebSocket Abrupt Disconnect on Wallet Change
Fixed WebSocket connection dropping abruptly when switching wallet accounts, causing poor UX.
Root Cause: React effect cleanup immediately called client.disconnectWebSocket() without any grace period.
Solution: Added graceful disconnection sequence:
- Mark WebSocket as disconnected immediately (prevents new requests)
- Clear core service client
- Wait 100ms before actual WebSocket close
apps/tester/src/hooks/use-kaizen-client.tsx- Added graceful disconnect with timeout
Tester: EnableConnectionModal "Existing Wallet Found" UX Confusion
Fixed confusing "Existing API Wallet Found" message appearing during API wallet setup flow.
Root Cause: The useEffect that reset modal state had apiWallet in dependencies, causing it to re-run and show the message immediately after generating a new wallet.
Solution:
- Changed effect to only trigger on isOpen change, not wallet state changes
- Added isReusingExisting state to track if resuming previous setup
- Changed message from "Existing API Wallet Found" to "Resume Setup" for clarity
apps/tester/src/components/EnableConnectionModal.tsx- Fixed effect dependencies and improved messaging
Tester: Prevent Box Drawing Without Sufficient Balance
Box drawing is now disabled when user balance is below minimum bet amount, regardless of degen mode.
Previous Behavior: Users could draw boxes with 0 balance, only to see error after attempting to execute.
New Behavior:
- BOX tool button is disabled and grayed out when balance < minimum bet
- Clicking disabled button shows toast explaining insufficient balance
- If balance drops while BOX tool is active, tool is auto-deactivated
- Chart also checks balance before allowing drag start (defense in depth)
Files Changed:
- apps/tester/src/components/RightPanel.tsx - Added balance check on tool activation
- apps/tester/src/components/Chart.tsx - Added balance check in mousedown handler
Settler Challenge Deadline Timing Race
Fixed "ChallengeWindowNotOver" error when settler submits UserWins settlement right after deadline passes.
Root Cause: Settler used SystemTime::now() to check deadline, but core uses block timestamp which is aligned down to 100ms intervals via align_timestamp(). This caused a race condition where settler saw the deadline as passed, but core's block timestamp hadn't caught up yet.
Timeline example:
- T=950ms: Checkpoint → block_timestamp = 900ms (aligned down)
- T=1001ms: Settler sees now >= deadline (1000ms) → submits UserWins
- TX executes with block_timestamp = 900ms
- Core: 900 < 1000 → ChallengeWindowNotOver!

Solution: Added deadline_buffer (default 200ms) to settler config. Settler now waits until now >= challenge_deadline + deadline_buffer before submitting UserWins settlement.
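A sketch of the timing relationship, with the 100 ms alignment written out; the names are illustrative, not the exact settler code.

```rust
/// Core aligns block timestamps down to 100 ms boundaries.
fn align_timestamp(ts_ms: u64) -> u64 {
    ts_ms - (ts_ms % 100)
}

/// Settler only submits UserWins once the wall clock is past the deadline
/// plus a safety buffer, so the (aligned) block timestamp has caught up too.
fn can_submit_user_wins(now_ms: u64, challenge_deadline_ms: u64, deadline_buffer_ms: u64) -> bool {
    now_ms >= challenge_deadline_ms + deadline_buffer_ms
}
```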
Files Changed:
- crates/app/src/settler/config.rs - Added deadline_buffer field
- crates/app/src/settler/service.rs - Apply buffer in breach detection
- apps/settler/src/main.rs - Added --deadline-buffer CLI flag
- docker-compose.yml - Explicitly set deadline buffer
# CLI (default 200ms)
settler --write-node 127.0.0.1:9000 --deadline-buffer 200

SDK Quote Signing Hash Mismatch
Fixed "Invalid user signature" error when submitting thesis via tester app.
Root Cause: SDK's buildQuoteSigningHash passed hex-encoded bytes to viem's keccak256, which produced a different hash than Rust's direct byte hashing.
Solution: Pass raw Uint8Array directly to keccak256 instead of converting to hex first.
// Before (incorrect)
return keccak256(bytesToHex(message));
// After (correct)
return keccak256(message);

sdk/src/signer.ts - Fixed buildQuoteSigningHash function
SDK RfqSettledEvent Schema Mismatch
Fixed WebSocket event deserialization failure for thesis settlement events.
Root Cause: SDK's RfqSettledEvent was missing fields that Rust's Event::RfqSettled had.
Solution: Added missing fields to match Rust schema.
Fields Added:
- user: AddressSchema
- solver: AddressSchema
- oraclePair: OraclePairSchema
- betAmount: bigint

sdk/src/schema/event.ts - Updated RfqSettledEvent class
Changed
Oracle Service Rename
Renamed mock-oracle to oracle as it's now the official production service.
- Directory: apps/mock-oracle → apps/oracle
- Package: @kaizen-core/mock-oracle → @kaizen-core/oracle
- Docker service: mock-oracle → oracle
- Container: kaizen-mock-oracle → kaizen-oracle

Files Changed:
- apps/oracle/package.json - Package name
- apps/oracle/src/logger.ts - Logger name
- pnpm-workspace.yaml - Workspace path
- docker-compose.yml - Service config
- docker/Dockerfile.oracle - Build paths
- docker/config/write-node.toml - Oracle URL
- docker/monitoring/prometheus/prometheus.yml - Scrape target
# Docker users: rebuild the oracle image
docker compose build oracle
# Development: reinstall dependencies
pnpm install

Fixed
Read-Node State Root Divergence
Fixed critical state synchronization issues between write-node and read-node that caused WebSocket disconnections after transaction execution.
Root Causes:
- Duplicate Transaction Check During Replay: Read-node was rejecting replayed transactions as duplicates
- Non-deterministic HashMap Iteration: pending_updates HashMap iteration order varied between nodes, causing different JMT state roots
- Timestamp Inconsistency: Thesis.created_at used different timestamps between write-node (system time) and read-node (block time)
Solutions:
- Added execute_tx_replay() method to bypass duplicate checks during block sync
- Sorted pending_updates by KeyHash before JMT commit for deterministic ordering (see the sketch below)
- Introduced read_version/write_version separation in StateManager
- Passed consistent block_timestamp to transaction execution
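A sketch of the deterministic-ordering part of the fix, with simplified types: sort the pending updates by KeyHash before handing them to the JMT commit so every node produces the same state root.

```rust
use std::collections::HashMap;

type KeyHash = [u8; 32];

/// HashMap iteration order is unspecified, so sort by key before committing
/// to guarantee identical JMT state roots on every node.
fn ordered_updates(
    pending: &HashMap<KeyHash, Option<Vec<u8>>>,
) -> Vec<(KeyHash, Option<Vec<u8>>)> {
    let mut updates: Vec<_> = pending.iter().map(|(k, v)| (*k, v.clone())).collect();
    updates.sort_by_key(|(k, _)| *k);
    updates
}
```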
Files Changed:
- crates/engine/src/lib.rs - Added replay mode for transaction execution
- crates/state/src/manager.rs - Read/write version separation
- crates/state/src/rocksdb_storage.rs - Deterministic HashMap ordering
- crates/app/src/sync/client.rs - Snapshot/restore on verification failure
- crates/app/src/executor.rs - Consistent block timestamp handling
CLI Transaction Encoding
Fixed CLI's tx commands (withdraw, transfer, etc.) not working due to incorrect transaction encoding.
Root Cause: CLI was building transactions with custom format instead of using kaizen_tx::Transaction type.
Solution: Refactored CLI to use kaizen-tx crate for proper transaction building and signing.
Files Changed:
- apps/cli/Cargo.toml - Added kaizen-tx dependency
- apps/cli/src/commands/tx.rs - Rewrote using kaizen_tx::Transaction
- apps/cli/src/rpc.rs - Fixed RPC method name and response type
CLI Signature Mismatch
Fixed "Invalid user signature" error when submitting thesis via CLI.
Root Cause: CLI's sign_quote function used different domain separator than SDK/mock-solver.
Solution: Aligned signing logic to use "Kaizen:SolverQuote" domain separator with keccak256 hashing.
apps/cli/src/commands/thesis.rs- Fixed signature generation
Bridge Withdrawal Status Format
Fixed withdrawal status format incompatibility with bridge service.
Root Cause: Core returned human-readable status ("Pending") but bridge expected numeric string ("0").
Solution: Changed status serialization to output enum discriminant as string.
Files Changed:
- crates/app/src/rpc/types.rs - Changed format!("{:?}", status) to (status as u8).to_string()
| Numeric | Status |
|---|---|
"0" | Pending |
"1" | Processing |
"2" | Completed |
"3" | Failed |
API Changes
RPC Methods
kaizen_sendTransaction
- Now properly returns execution result object instead of just transaction hash
{
"hash": "0x...",
"status": "executed",
"success": true,
"error": null,
"blockHeight": 1234,
"txIndex": 0,
"events": [...]
}

kaizen_getUnprocessedWithdrawals
- Status field now returns numeric string for bridge compatibility
Testing
Full lifecycle test verified:
- ✅ Bridge Deposit (faucet mint)
- ✅ Thesis Submit (RFQ)
- ✅ Settlement (UserWin/SolverWin)
- ✅ Bridge Withdraw
- ✅ Read-node Sync (0 state root mismatches)
Stress test results:
- 87 thesis submissions
- 50 rapid-fire parallel submissions
- 0 state root mismatches between write-node and read-node
