Changelog
All notable changes to Kaizen Core are documented here.
[Unreleased] - 2025-12-11
Added
Parallel Execution Pipeline
Added a high-performance parallel execution pipeline for improved transaction throughput.
Key Components:
- Parallel Signature Verification: Uses rayon to parallelize ECDSA signature recovery across all CPU cores.
- Aggregate-Based Scheduling: Groups non-conflicting transactions into parallel batches using the AggregateAccess trait.
- ParallelStateManager: Thread-safe state access via DashMap without requiring &mut self.
- TrueParallelExecutor: Combines all components for true parallel transaction execution.
| Component | Batch Size | Parallel | Sequential | Speedup |
|---|---|---|---|---|
| Signature Verification | 1000 txs | 318K TPS | 40K TPS | 8.0x |
| True Parallel Execution | 1000 txs | 62K TPS | 33K TPS | 1.9x |
| Full Pipeline | 1000 txs | 91K TPS | 33K TPS | 2.8x |
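For illustration, the signature-verification stage amounts to a rayon parallel map over the batch; the SignedTx type and the recovery body below are simplified stand-ins, not the shipped kaizen_engine code.

```rust
use rayon::prelude::*;

// Simplified stand-in for the engine's transaction type.
struct SignedTx {
    payload: Vec<u8>,
    signature: [u8; 65],
}

// Placeholder: the real code recovers the signer via keccak256 + ECDSA recovery.
fn recover_signer(tx: &SignedTx) -> Result<[u8; 20], String> {
    if tx.signature == [0u8; 65] {
        return Err("missing signature".into());
    }
    let _ = &tx.payload;
    Ok([0u8; 20])
}

/// Verify a whole batch across all CPU cores with a parallel iterator.
fn verify_batch(txs: &[SignedTx]) -> Vec<Result<[u8; 20], String>> {
    txs.par_iter().map(recover_signer).collect()
}
```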
- kaizen_engine::parallel::* - Parallel execution infrastructure
- kaizen_state::parallel::ParallelStateManager - Thread-safe state access

Files Changed:
- crates/engine/src/parallel.rs - Parallel execution pipeline
- crates/state/src/parallel.rs - Thread-safe state manager
- crates/engine/benches/tps.rs - Added parallel execution benchmarks
Documentation: See Architecture / Parallel Execution for details.
[Previous] - 2025-12-08
Fixed
Settler Sync Error on Pruned Blocks
Fixed settler failing to sync when write-node has pruned historical blocks.
Root Cause: The EventStreamClient in sidecar expected strictly sequential block heights. When the write-node had pruned blocks (e.g., only keeping last 50K blocks), the settler would request events from block 1 but receive events starting from the earliest available block (e.g., 28771), causing a height mismatch error.
Height mismatch: expected 1, got 28771
Event stream error, reconnecting in 5s...
(repeat forever)

Solution: Changed event stream sync logic to accept monotonically increasing heights with gaps. Instead of requiring strict sequential heights, we now:
- Accept any height that's greater than the last processed height
- Log when blocks are skipped due to pruning
- Continue syncing from wherever the server starts
// Before: Strict sequential check
if batch.height != expected_height {
return Err("Height mismatch");
}
// After: Monotonic increasing check with gap tolerance
if batch.height <= last_height {
return Err("Height not increasing");
}
// Gaps allowed, just log them

crates/app/src/sync/sidecar.rs - Relaxed height check in EventStreamClient::event_loop()
Added
RocksDB Native Prometheus Metrics
Added export of RocksDB internal statistics as Prometheus gauges for better storage observability.
New Metrics:

| Metric | Description |
|---|---|
| kaizen_rocksdb_estimate_num_keys | Estimated number of keys in DB |
| kaizen_rocksdb_live_data_size_bytes | Size of live data |
| kaizen_rocksdb_sst_files_size_bytes | Total SST file size |
| kaizen_rocksdb_memtable_size_bytes | Current memtable size |
| kaizen_rocksdb_block_cache_usage_bytes | Block cache memory usage |
| kaizen_rocksdb_block_cache_pinned_bytes | Pinned block cache memory |
| kaizen_rocksdb_num_running_compactions | Active compaction jobs |
| kaizen_rocksdb_num_running_flushes | Active flush jobs |
| kaizen_rocksdb_pending_compaction_bytes | Bytes pending compaction |
- Added export_metrics() method to RocksDbStorage
- Periodic export via background task in Server (every 5 seconds when metrics enabled)
- Uses RocksDB's property API to fetch internal statistics

Files Changed:
- crates/state/src/rocksdb_storage.rs - Added export_metrics() method
- crates/state/src/manager.rs - Added export_metrics() method
- crates/app/src/server.rs - Added periodic metrics export task
- crates/metrics/src/lib.rs - Added individual metric recording functions (record_rocksdb_num_keys(), record_rocksdb_live_data_size(), etc.)
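As a rough sketch (not the actual export_metrics() implementation), the property-to-gauge mapping could look like the following, assuming the rust-rocksdb property API and a recent metrics crate:

```rust
use std::{sync::Arc, time::Duration};

// Illustrative only: map RocksDB internal properties onto the gauges listed above.
fn export_rocksdb_metrics(db: &rocksdb::DB) {
    let props = [
        ("rocksdb.estimate-num-keys", "kaizen_rocksdb_estimate_num_keys"),
        ("rocksdb.estimate-live-data-size", "kaizen_rocksdb_live_data_size_bytes"),
        ("rocksdb.total-sst-files-size", "kaizen_rocksdb_sst_files_size_bytes"),
        ("rocksdb.cur-size-all-mem-tables", "kaizen_rocksdb_memtable_size_bytes"),
    ];
    for (prop, gauge_name) in props {
        if let Ok(Some(v)) = db.property_int_value(prop) {
            metrics::gauge!(gauge_name).set(v as f64);
        }
    }
}

// Background task: export every 5 seconds, matching the interval described above.
async fn metrics_export_loop(db: Arc<rocksdb::DB>) {
    let mut interval = tokio::time::interval(Duration::from_secs(5));
    loop {
        interval.tick().await;
        export_rocksdb_metrics(&db);
    }
}
```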
JMT Cache Prometheus Metrics
Added metrics for JMT version cache performance monitoring.
New Metrics:

| Metric | Description |
|---|---|
| kaizen_jmt_cache_size | Number of entries in version cache |
| kaizen_jmt_cache_hits_total | Total cache hits |
| kaizen_jmt_cache_misses_total | Total cache misses |
- Cache hits/misses recorded in JmtStorage::get_value_option()
- Cache size recorded after each commit in RocksDbStorage::commit()

Files Changed:
- crates/state/src/jmt_storage.rs - Added hit/miss recording
- crates/state/src/rocksdb_storage.rs - Added cache size recording
- crates/metrics/src/lib.rs - Added cache metric functions
Grafana Dashboard: Storage Performance Panels
Added new panels to the Kaizen Performance dashboard for RocksDB and JMT cache monitoring.
New Sections:
- Storage Performance (RocksDB & JMT Cache)
  - JMT Cache Size (per node)
  - JMT Cache Hit Rate
  - RocksDB Database Size
- RocksDB Internals (Native Stats)
  - Block Cache & Memtable Usage
  - Running Compactions & Flushes
  - Pending Compaction & Live Data

Files Changed:
- docker/monitoring/grafana/provisioning/dashboards/json/kaizen-performance.json - Added new panels
Performance
RocksDB Storage Optimizations
Applied comprehensive storage optimizations for improved throughput, lower latency, and reduced disk usage.
1. Column Family-Specific Options

Each column family now has optimized RocksDB options based on its access patterns:
| Column Family | Optimization |
|---|---|
| jmt_nodes | Bloom filter (10-bit), point lookup optimized |
| jmt_values | Prefix bloom (32-byte key_hash), seek optimized |
| stale_jmt_nodes | Lower memory, version prefix for range scans |
| blocks | ZSTD compression, larger blocks for sequential reads |
| block_hashes | Bloom filter for hash→height lookups |
| meta | Bloom filter for point lookups |
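For context, tuning along these lines might be expressed with rust-rocksdb roughly as follows; values and function names here are illustrative, not the exact options shipped in rocksdb_options.rs.

```rust
use rocksdb::{BlockBasedOptions, DBCompressionType, Options};

// Example: point-lookup-heavy CF (jmt_nodes-style) with a 10-bit bloom filter.
fn point_lookup_cf_options() -> Options {
    let mut block = BlockBasedOptions::default();
    block.set_bloom_filter(10.0, false); // ~1% false-positive rate
    let mut opts = Options::default();
    opts.set_block_based_table_factory(&block);
    opts
}

// Example: cold, sequential-read CF (blocks-style) with ZSTD and larger blocks.
fn sequential_cf_options() -> Options {
    let mut block = BlockBasedOptions::default();
    block.set_block_size(64 * 1024);
    let mut opts = Options::default();
    opts.set_block_based_table_factory(&block);
    opts.set_compression_type(DBCompressionType::Zstd);
    opts
}
```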
Added in-memory cache for latest version per key_hash to avoid disk seeks for current state reads.
// Fast path: cache hit for current state
jmt_version_cache: Arc<DashMap<KeyHash, VersionCacheEntry>>

- Cache auto-populates on slow-path reads
- Pruner invalidates cache entries after pruning
- Zero impact on state consistency (read-only optimization)
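A simplified sketch of the fast-path read; VersionCacheEntry fields and the disk fallback are placeholders for the real wiring in jmt_storage.rs.

```rust
use dashmap::DashMap;
use std::sync::Arc;

// Placeholder key/entry types standing in for the real JMT types.
type KeyHash = [u8; 32];

#[derive(Clone)]
struct VersionCacheEntry {
    version: u64,
    value: Option<Vec<u8>>,
}

struct JmtReader {
    jmt_version_cache: Arc<DashMap<KeyHash, VersionCacheEntry>>,
}

impl JmtReader {
    /// Current-state read: hit the in-memory cache first, fall back to disk.
    fn get_current(&self, key: &KeyHash) -> Option<Vec<u8>> {
        if let Some(entry) = self.jmt_version_cache.get(key) {
            return entry.value.clone(); // fast path, no disk seek
        }
        let entry = self.read_latest_from_disk(key)?; // slow path
        self.jmt_version_cache.insert(*key, entry.clone()); // auto-populate
        entry.value
    }

    fn read_latest_from_disk(&self, _key: &KeyHash) -> Option<VersionCacheEntry> {
        // Placeholder for the RocksDB reverse-seek lookup over versioned keys.
        None
    }
}
```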
Optimized stale JMT node pruning from O(n) individual deletes to O(1) range delete:
// Before: Individual deletes
for key in stale_indices { batch.delete_cf(cf_stale, key); }
// After: Single range delete
self.db.delete_range_cf(cf_stale, &start_key, &end_key)?;

New configuration options for Write-Ahead Log management:
[storage.rocksdb]
max_total_wal_size_mb = 1024 # Prevents unbounded WAL growth
wal_ttl_seconds = 3600 # Auto-delete old WAL files

Optional RocksDB statistics collection for performance analysis:
[storage.rocksdb]
enable_statistics = true # ~5-10% overhead

Access via storage.statistics_string() for block cache hit rates, compaction stats, and bloom filter effectiveness.
[storage.rocksdb]
write_buffer_size_mb = 128 # Write buffer per CF
max_write_buffer_number = 4 # Max write buffers before flush
block_cache_size_mb = 512 # Shared LRU block cache
max_background_jobs = 4 # Compaction/flush parallelism
enable_compression = true # LZ4 for hot, ZSTD for cold
bloom_filter_bits = 10 # Bloom filter bits per key
max_total_wal_size_mb = 1024 # Max WAL size before recycling
wal_ttl_seconds = 0 # WAL file TTL (0 = disabled)
enable_statistics = false # RocksDB internal stats

- RocksDbAppConfig::default() - Balanced for general use
- RocksDbAppConfig::production() - High-performance (256MB buffers, 1GB cache, stats enabled)
Files Changed:
- crates/state/src/rocksdb_options.rs - New RocksDB configuration module
- crates/state/src/rocksdb_storage.rs - CF-specific options, JMT cache wiring
- crates/state/src/jmt_storage.rs - Version cache support, fast-path reads
- crates/state/src/pruner.rs - Range delete optimization, cache invalidation
- crates/state/src/manager.rs - Added with_config() constructor
- crates/state/src/types.rs - Added StorageConfig composite type
- crates/app/src/config.rs - Exposed RocksDB options in app config
- crates/app/src/server.rs - Use new storage config
State Consistency: All optimizations are internal implementation details. State roots remain identical across nodes. Verified with existing test suite (55 state tests, 52 engine tests passed).
Fixed
Tester: WebSocket Race Condition on Wallet Switch
Fixed "WebSocket not connected" error when switching to a different wallet.
Root Cause: When switching wallets, the mainWalletAddress changed and triggered the WebSocket subscription effect. However, the React state isWebSocketConnected was still true from the previous connection (hadn't propagated yet), while the actual client.ws instance was already null or disconnected. The guard check passed but subscribeUserTheses threw.
Solution: Added synchronous check on the client's actual WebSocket state (client.isWebSocketConnected) in addition to the React state check.
// Before: Only React state check
if (!client || !isWebSocketConnected || !mainWalletAddress || isMockMode) {
// After: Also check client's internal state
if (
!client ||
!isWebSocketConnected ||
!client.isWebSocketConnected || // ← Catches race condition
!mainWalletAddress ||
isMockMode
) {

Files Changed:
- apps/tester/src/hooks/use-thesis-sync.ts - Added client.isWebSocketConnected guard
- apps/tester/src/hooks/use-price-stream.ts - Same defensive fix applied
Changed
Documentation Restructure
Reorganized docs for better agent-friendliness and task-oriented navigation.
New Structure:

docs/pages/
├── introduction/        ← What is Kaizen
├── api/                 ← Quick lookup (NEW)
│   ├── rpc.mdx          ← JSON-RPC + WebSocket
│   ├── transactions.mdx
│   └── errors.mdx
├── sdk/                 ← TypeScript SDK
├── deployment/          ← How to run (NEW)
│   ├── docker.mdx
│   ├── configuration.mdx
│   └── monitoring.mdx
├── architecture/        ← How it works (MERGED)
│   ├── overview.mdx
│   ├── stf.mdx
│   ├── block-production.mdx
│   ├── settlement.mdx
│   ├── oracle.mdx
│   └── storage.mdx
├── components/          ← Individual services
└── reference/           ← Misc reference

| Before | After | Why |
|---|---|---|
| operations/ + advanced/ | deployment/ | Task-oriented |
| core-concepts/ + execution/ | architecture/ | Related content merged |
| reference/transactions.mdx | api/transactions.mdx | Better discoverability |
| API buried in operations/ | api/ section | Quick lookup |
- docs/pages/core-concepts/ - Merged into architecture/
- docs/pages/execution/ - Merged into architecture/
- docs/pages/operations/ - Split into api/ and deployment/
- docs/pages/advanced/ - Moved to deployment/ and architecture/
- docs/pages/components/tester.mdx - Removed from sidebar (demo app)

Files Changed:
- docs/vocs.config.ts - New sidebar structure
- docs/pages/api/* - New API reference section
- docs/pages/deployment/* - New deployment section
- docs/pages/architecture/* - Merged architecture section
- Multiple cross-reference fixes across docs
README Cleanup
Simplified README.md to focus on quick start, pointing to docs for details.
Changes:
- Fixed outdated binary name (kaizen-app → kaizen-node)
- Updated project structure (added missing apps)
- Simplified to ~125 lines (was ~400)
- Added link to docs.miyao.ai
Fixed
Prometheus Histogram Metrics Export
Fixed histogram metrics being exported as summaries instead of proper histograms, causing histogram_quantile() queries to fail in Grafana.
Root Cause: The metrics-exporter-prometheus crate defaults to exporting histograms as summaries (with quantile labels). Grafana's histogram_quantile() function requires proper histogram format with _bucket suffix.
Solution: Explicitly configured PrometheusBuilder with histogram buckets:
const LATENCY_BUCKETS: &[f64] = &[
0.0001, 0.0005, 0.001, 0.0025, 0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1.0, 2.5, 5.0, 10.0,
];
PrometheusBuilder::new()
.set_buckets(LATENCY_BUCKETS)
.set_buckets_for_metric(Matcher::Suffix("tx_count"), COUNT_BUCKETS)
.set_buckets_for_metric(Matcher::Suffix("_amount"), AMOUNT_BUCKETS)kaizen_tx_execution_duration_seconds- TX execution latencykaizen_tx_validation_duration_seconds- TX validation latencykaizen_block_production_duration_seconds- Block production timekaizen_block_tx_count- Transactions per blockkaizen_rfq_bet_amount/kaizen_rfq_payout- RFQ amounts
crates/metrics/src/prometheus.rs- Added bucket configuration
RFQ Settlement Status Labels
Fixed inconsistent status labels for kaizen_rfq_settled_total metric between engine and settler.
Previous Behavior: Engine used debug format (SettledUserWin, SettledSolverWin) while settler used snake_case (user_wins, solver_wins).
Solution: Standardized both to use user_wins / solver_wins labels.
crates/engine/src/lib.rs- Changed status label format to match settler
Grafana Dashboard Cleanup
Removed uninstrumented metric panels from dashboards and fixed broken queries.
Removed Panels (not instrumented in code):

| Dashboard | Removed |
|---|---|
| Overview | Active RFQs, Mempool, Sync Clients, Pending W/D, Uptime, Oracle metrics, Memory, DB Size |
| Debug | Mempool Deep Dive section, Storage Debug section |
| Compare | TX Latency comparisons, Memory comparisons, DB Size Growth |
| Business | Renamed "Won/Lost" to "User Wins/Solver Wins" |
- Sync Lag: Changed to scalar(kaizen_block_height) - kaizen_sync_height for correct label matching
- Business dashboard: Updated status labels from won/lost to user_wins/solver_wins
Files Changed:
- docker/monitoring/grafana/provisioning/dashboards/json/kaizen-overview.json
- docker/monitoring/grafana/provisioning/dashboards/json/kaizen-debug.json
- docker/monitoring/grafana/provisioning/dashboards/json/kaizen-compare.json
- docker/monitoring/grafana/provisioning/dashboards/json/kaizen-business.json
Performance
State Layer Optimizations
Applied drop-in performance optimizations to the state layer for improved throughput and lower latency.
1. DashMap for Pending Updates

Replaced RwLock<HashMap> with lock-free DashMap for pending state updates.
// Before: Lock contention on every read/write
pending_updates: Arc<RwLock<HashMap<KeyHash, Option<Vec<u8>>>>>
// After: Lock-free concurrent access
pending_updates: Arc<DashMap<KeyHash, Option<Vec<u8>>>>

Impact: Eliminates lock contention during concurrent state operations within a block.
2. JMT Value Lookup O(1)

Optimized JMT value lookup from O(n) iteration to O(1) using RocksDB's reverse seek.
// Before: Iterate through ALL versions
for item in prefix_iterator { ... } // O(n)
// After: Direct seek to target version
iterator_cf(IteratorMode::From(&seek_key, Direction::Reverse))
iter.next() // O(1)

Impact: State reads now constant-time regardless of version history depth.
3. Hot Path Caching

Added in-memory caching for frequently-read, rarely-written values.
| Cached Value | Read Frequency | Write Frequency |
|---|---|---|
| SystemPaused | Every transaction | Admin only |
| GlobalConfig | Every RFQ submit | Admin only |
Cache is invalidated on begin_block() and cleared on writes.
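A minimal sketch of this hot-path cache pattern, using the parking_lot dependency added in this change; the names are illustrative, not the StateManager code.

```rust
use parking_lot::RwLock;

// Illustrative cache for a frequently-read, rarely-written flag.
#[derive(Default)]
struct HotCache {
    system_paused: RwLock<Option<bool>>,
}

impl HotCache {
    fn get_or_load(&self, load: impl FnOnce() -> bool) -> bool {
        if let Some(v) = *self.system_paused.read() {
            return v; // cached, no state read
        }
        let v = load(); // read from state on a miss
        *self.system_paused.write() = Some(v);
        v
    }

    /// Called from begin_block() and on writes to the underlying key.
    fn invalidate(&self) {
        *self.system_paused.write() = None;
    }
}
```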
| Metric | Before | After |
|---|---|---|
| State commit (read-node) | ~3-5ms | ~1ms |
| JMT value lookup | O(n) | O(1) |
dashmap = "6.1"
parking_lot = "0.12"crates/state/src/rocksdb_storage.rs- DashMap for pending_updatescrates/state/src/jmt_storage.rs- Reverse seek optimizationcrates/state/src/manager.rs- Hot path cachingcrates/state/Cargo.toml- Added dashmap, parking_lotCargo.toml- Workspace dependencies
State Consistency: All changes are internal implementation details. State roots remain identical between nodes. Verified with 189k+ blocks synced with 0 mismatches.
Added
Multi-Chain Withdrawal Support
Withdrawals now support specifying both destination address and destination chain ID, enabling proper multi-chain bridge functionality.
Previous Behavior: WithdrawTx only had a single destination field (Address type), which was incorrectly repurposed as chain ID in the bridge relayer.
// Before
pub struct WithdrawTx {
pub amount: u64,
pub destination: Address,
}
// After
pub struct WithdrawTx {
pub amount: u64,
pub destination_address: Address,
pub destination_chain_id: u64,
}

| Chain ID | Network |
|---|---|
| 42161 | Arbitrum |
| 421614 | Arbitrum Sepolia |
| 8453 | Base |
| 84532 | Base Sepolia |
// Before
withdraw(amount, destination);
// After
withdraw(amount, destinationAddress, destinationChainId);

- Chain selector in Bridge Modal now sets destinationChainId
- Pending Withdrawals tab shows destination address with chain name
- Example: 0x1234...abcd (Base Sepolia)
# Before
kata tx withdraw --destination 0x... --amount 1000000
# After
kata tx withdraw --destination 0x... --chain-id 84532 --amount 1000000

Files Changed:
- crates/tx/src/payload.rs - WithdrawTx with new fields
- crates/types/src/withdrawal.rs - WithdrawalRequest with new fields
- crates/types/src/event.rs - Event::WithdrawRequested with new fields
- crates/engine/src/executors/bridge.rs - Updated executor
- crates/app/src/rpc/types.rs - RPC response with new fields
- crates/app/src/indexer/mod.rs - SQL schema and event handling
- apps/cli/src/commands/tx.rs - Added --chain-id flag
- sdk/src/types.ts - Updated TypeScript types
- sdk/src/schema/tx.ts - Updated Borsh schema
- sdk/src/signer.ts - Updated EIP-712 type hash
- apps/tester/src/components/BridgeModal.tsx - Chain selection
- apps/tester/src/components/ThesisPanel.tsx - Display chain name
- bridge/src/relayer/server.ts - Use correct fields
Breaking Change: Borsh encoding changed for WithdrawTx. Requires coordinated upgrade of all components.
Settings Modal with Theme Customization
Added a Settings modal accessible from the sidebar, providing user-configurable options for URLs and visual themes.
Features:
- URL Configuration: Customize Solver URL, Node RPC URL, and Node WebSocket URL
- Theme Selection: 6 themes each for Normal Mode and Degen Mode
- Persistence: Settings automatically saved to localStorage
- Reset: One-click reset to default settings
Normal Mode Themes:

| Theme | Primary Color | Description |
|---|---|---|
| Emerald | #22c55e | Classic green (default) |
| Cyber | #06b6d4 | Cyan/teal |
| Sunset | #f59e0b | Amber/gold |
| Arctic | #a5b4fc | Lavender/indigo |
| Neon | #e879f9 | Pink/fuchsia |
| Monochrome | #d4d4d8 | Gray/zinc |
Degen Mode Themes:

| Theme | Primary Color | Description |
|---|---|---|
| Inferno | #f97316 | Orange (default) |
| Plasma | #ef4444 | Red |
| Blaze | #fbbf24 | Yellow/amber |
| Volcanic | #fb7185 | Rose/pink |
| Supernova | #c026d3 | Purple/fuchsia |
| Ember | #fdba74 | Peach/orange light |
- Theme colors applied via CSS variables (--theme-primary, --theme-primary-rgb, --theme-bg, etc.)
- useThemeApplier hook updates document root CSS variables on theme change
- All major UI components updated to use theme variables instead of hardcoded colors

Files Changed:
- apps/tester/src/stores/use-settings-store.ts - New store with theme definitions and persistence
- apps/tester/src/components/SettingsModal.tsx - New settings modal component
- apps/tester/src/hooks/use-theme.ts - Theme applier and color hooks
- apps/tester/src/hooks/use-config.ts - Dynamic config based on settings
- apps/tester/src/styles/globals.css - CSS variable definitions
- apps/tester/src/pages/_app.tsx - Theme applier integration
- Multiple components updated to use CSS variables
Pending Margin Reservation for Box Creation
Box creation now reserves margin from available balance, preventing users from creating multiple boxes that exceed their total balance.
Previous Behavior: Users could draw multiple boxes in quick succession. Each box only checked against total balance, not accounting for boxes still in QUOTING status. This allowed creating boxes whose combined margins exceeded the user's actual balance, causing on-chain failures.
New Behavior:
- Available balance = Total balance - Sum of all QUOTING theses' margins
- Box creation blocked if available balance < bet amount
- Error message includes pending amount: "Insufficient balance: X USDC available (Y pending)"
- Double-check in executeDegenThesis prevents race conditions
| Mode | Check Location |
|---|---|
| Degen Mode | handleMouseDown + executeDegenThesis in Chart.tsx |
| Modal Mode | useThesisValidation hook (balance check) |
Files Changed:
- apps/tester/src/components/Chart.tsx - Added pending margin calculation in handleMouseDown and executeDegenThesis
- apps/tester/src/hooks/use-thesis-validation.ts - Balance check now subtracts pending QUOTING margins
Pending Withdrawals Tab in Tester
Added a new "Pending Withdrawals" tab to the tester footer panel, showing user's pending withdrawal requests.
Features:
- Displays withdrawal ID, status, amount, destination address, and request timestamp
- Status indicator with animated pulse for pending items
- Loading state while fetching data
- Empty state with guidance to request withdrawal from bridge modal
- Auto-refreshes every 10 seconds
- Added getWithdrawalsByUser() method to fetch paginated withdrawal IDs for a user
- Added getPendingWithdrawalsForUser() convenience method that filters unprocessed withdrawals by user

Files Changed:
- sdk/src/rpc.ts - Added getWithdrawalsByUser()
- sdk/src/client.ts - Added getWithdrawalsByUser(), getPendingWithdrawalsForUser()
- apps/tester/src/hooks/use-withdrawals-query.ts - New hook for fetching pending withdrawals
- apps/tester/src/lib/query-keys.ts - Added withdrawals.pending query key
- apps/tester/src/components/ThesisPanel.tsx - Added WithdrawalsTab component and tab button
Modal UX Improvements
Improved modal interaction patterns across tester app.
CheatCodePanel:
- Clicking outside the modal (backdrop) now closes it

BridgeModal:
- Clicking outside the modal (backdrop) now closes it
- Auto-closes after 1.5 seconds when transaction is successful (deposit confirmed or withdrawal submitted)

Files Changed:
- apps/tester/src/components/CheatCodePanel.tsx - Added backdrop click handler
- apps/tester/src/components/BridgeModal.tsx - Added backdrop click handler and auto-close effect
RPC Pagination Response Format
Updated thesis-related RPC methods to return proper paginated responses with metadata.
Previous Behavior: Methods like kaizen_getThesesByUser returned a plain array Thesis[], making it impossible to know total count or implement pagination UI.
New Behavior: Returns PaginatedResponse<Thesis> with full pagination metadata:
{
"items": [...],
"total": 150,
"limit": 100,
"offset": 0,
"hasMore": true
}

| Method | Change |
|---|---|
| kaizen_getThesesByUser | Returns PaginatedResponse<Thesis> |
| kaizen_getThesesBySolver | Returns PaginatedResponse<Thesis> |
| kaizen_internal_getPendingTheses | Returns PaginatedResponse<Thesis> |
| kaizen_internal_getThesesByStatus | Returns PaginatedResponse<Thesis> |
| kaizen_internal_getThesesByPair | Returns PaginatedResponse<Thesis> |
Note: kaizen_getWithdrawalsByUser already returned PaginatedResponse<number> (withdrawal IDs).
- Added PaginatedResponse<T> type export
- Updated getThesesByUser(), getThesesBySolver(), getMyTheses() return types
- Added getWithdrawalsByUser() method to SDK client

Files Changed:
- crates/app/src/rpc/methods.rs - Updated RPC handlers
- sdk/src/types.ts - Added PaginatedResponse<T> interface
- sdk/src/rpc.ts - Updated return types, added getWithdrawalsByUser
- sdk/src/client.ts - Updated return types, added withdrawal methods
- sdk/src/index.ts - Exported PaginatedResponse
- apps/tester/src/hooks/use-thesis-sync.ts - Updated to use response.items
- apps/mock-solver/src/rpc.ts - Updated return type
- apps/cli/src/rpc.rs - Added PaginatedResponse<T> type
- apps/cli/src/commands/degen.rs - Updated to use response.items
Breaking Change: SDK methods now return PaginatedResponse<Thesis> instead of Thesis[]. Update callsites to access .items property.
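For the Rust side (the CLI also gained a PaginatedResponse<T> type), a sketch of the response shape and a paging loop might look like this; fetch_page is a hypothetical closure over the RPC call, and the field names follow the JSON example above.

```rust
use serde::Deserialize;

#[derive(Deserialize)]
#[serde(rename_all = "camelCase")]
struct PaginatedResponse<T> {
    items: Vec<T>,
    total: u64,
    limit: u64,
    offset: u64,
    has_more: bool,
}

// Illustrative: page through a method like kaizen_getThesesByUser until has_more is false.
fn fetch_all<T, F>(mut fetch_page: F) -> Vec<T>
where
    F: FnMut(u64) -> PaginatedResponse<T>,
{
    let mut all = Vec::new();
    let mut offset = 0;
    loop {
        let page = fetch_page(offset);
        offset += page.items.len() as u64;
        all.extend(page.items);
        if !page.has_more {
            return all;
        }
    }
}
```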
Grafana Dashboard Suite
Expanded monitoring dashboards from 1 to 5 specialized views for different use cases.
New Dashboards:

| Dashboard | UID | Purpose |
|---|---|---|
| Overview | kaizen-overview | Health check, business metrics (slimmed down) |
| Performance | kaizen-performance | Deep-dive latency, throughput analysis |
| Business | kaizen-business | Trading volume, win rates, bridge flows |
| Debug | kaizen-debug | Error analysis, latency breakdown, logs |
| Compare | kaizen-compare | Period-over-period trend comparison |
Kaizen Performance (/d/kaizen-performance):
- TPS & throughput with success/fail breakdown
- TX execution latency distribution (p50/p90/p95/p99)
- TX latency by type (transfer, deposit, submit_thesis, etc.)
- Block production phase breakdown (validation → execution → commit)
- Storage IOPS and latency (state, RocksDB)
- Sync performance and lag tracking
- Pruning duration breakdown
Kaizen Business (/d/kaizen-business):
- Total trading volume (USDC)
- Trade outcomes (won/lost/cancelled/expired)
- Hourly volume bars and win rate trends
- Bet size distribution over time
- Bridge deposit/withdrawal flows and net flow
- Transaction type mix (pie chart)
Kaizen Debug (/d/kaizen-debug):
- Error counters with color thresholds (failed TXs, rejections, sig failures)
- Failed transactions by type
- Mempool rejection reasons
- Latency breakdown by TX type and RPC method
- Mempool queue sizes and eviction rates
- Storage IOPS and p99 latency
- Integrated error logs from all services
Kaizen Compare (/d/kaizen-compare):
- Today vs Yesterday vs Last Week overlays for TPS, latency, RFQ rate
- Period-over-period % change stat panels
- Block production and storage growth trends
- Uses Prometheus offset for time-shifted queries
Kaizen Overview (/d/kaizen-overview, slimmed down):
- Reduced from ~2600 lines to ~500 lines (80% reduction)
- Removed detailed latency panels (moved to Performance dashboard)
- Added link to Performance dashboard for deep-dive
- Focused on business metrics and high-level health
Files Changed:
- docker/monitoring/grafana/provisioning/dashboards/json/kaizen-overview.json - Slimmed down
- docker/monitoring/grafana/provisioning/dashboards/json/kaizen-performance.json - New
- docker/monitoring/grafana/provisioning/dashboards/json/kaizen-business.json - New
- docker/monitoring/grafana/provisioning/dashboards/json/kaizen-debug.json - New
- docker/monitoring/grafana/provisioning/dashboards/json/kaizen-compare.json - New
JMT Node Pruning
Fixed unbounded disk growth across all node types (write, read-aggressive, read-archive) by implementing JMT (Jellyfish Merkle Tree) node pruning.
Problem: All three node types experienced identical disk growth rates regardless of pruning configuration. The aggressive and custom pruning modes only pruned JMT values, blocks, and snapshots - but not the JMT tree structure nodes themselves.
Root Cause: The TreeUpdateBatch.stale_node_index_batch from the JMT library was completely ignored. This batch tracks which tree nodes become obsolete at each version, enabling safe deletion of old nodes.
Solution:
- New column family - CF_STALE_JMT_NODES stores stale node indices on each commit
- Stale node tracking - Records (stale_since_version, node_key) for each obsolete node
- Pruning implementation - prune_jmt_nodes() deletes nodes where stale_since_version < min_version_to_keep (see the sketch below)
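A sketch of that pruning pass, assuming a recent rust-rocksdb and stale-index keys laid out as (stale_since_version big-endian || node_key) so the column family iterates in version order; this illustrates the idea rather than the exact prune_jmt_nodes() code.

```rust
use rocksdb::{IteratorMode, WriteBatch, DB};

// Illustrative pruning pass over the stale-node index column family.
fn prune_jmt_nodes(db: &DB, min_version_to_keep: u64) -> Result<u64, rocksdb::Error> {
    let cf_stale = db.cf_handle("stale_jmt_nodes").expect("cf exists");
    let cf_nodes = db.cf_handle("jmt_nodes").expect("cf exists");

    let mut batch = WriteBatch::default();
    let mut pruned = 0u64;

    for item in db.iterator_cf(cf_stale, IteratorMode::Start) {
        let (key, _) = item?;
        let version = u64::from_be_bytes(key[..8].try_into().unwrap());
        if version >= min_version_to_keep {
            break; // keys are version-ordered; nothing older remains
        }
        batch.delete_cf(cf_nodes, &key[8..]); // drop the obsolete tree node
        pruned += 1;
    }

    // Remove the processed stale indices with a single range delete.
    let end = min_version_to_keep.to_be_bytes();
    batch.delete_range_cf(cf_stale, [0u8; 8], end);
    db.write(batch)?;
    Ok(pruned)
}
```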
| Node Type | Disk Growth |
|---|---|
| write | Stabilizes at blocks_to_keep × avg_block_size |
| read-aggressive | Stabilizes at ~16 min of history |
| read-archive | Grows indefinitely (pruning disabled) |
New Metrics:
- kaizen_pruning_jmt_nodes_total - Total JMT nodes pruned
- kaizen_pruning_jmt_nodes_duration_seconds - JMT node pruning duration
- kaizen_pruning_blocks_duration_seconds - Block pruning duration
- kaizen_pruning_snapshots_duration_seconds - Snapshot pruning duration
- kaizen_pruning_jmt_values_duration_seconds - JMT values pruning duration

Files Changed:
- crates/state/src/jmt_storage.rs - Added CF_STALE_JMT_NODES, stale index storage
- crates/state/src/rocksdb_storage.rs - Create new CF on DB open
- crates/state/src/pruner.rs - Implemented prune_jmt_nodes()
- crates/state/src/block_storage.rs - Added jmt_nodes_pruned to PruneStats
- crates/metrics/src/lib.rs - Added pruning timing metrics
Migration Note: Existing databases will automatically create the new column family on startup. However, historical stale node data is not available, so previously accumulated JMT nodes won't be pruned. For aggressive pruning nodes, consider wiping data and re-syncing from archive node.
Batch Settlement for Settler Sidecar
Settler now batches multiple settlements into a single transaction for improved efficiency.
Previous Behavior: Each settlement was submitted as a separate transaction, requiring N signatures and N RPC calls for N settlements.
New Behavior:
- Batch collection - Collects settlements for 50ms or until batch_size (default 100) is reached
- Single transaction - All collected settlements are submitted in one SystemSettle transaction
- Atomic execution - Uses validation-first pattern to ensure all-or-nothing semantics
- Unified type - SystemSettleTx now contains Vec<Settlement> (single settlement = batch of 1)
- Reduced transaction count: N settlements → 1 transaction
- Reduced signature overhead: N signatures → 1 signature
- Reduced RPC calls: N calls → 1 call
- Lower latency for burst settlements
// Before: Two separate types
pub struct SystemSettleTx {
pub thesis_id: u64,
pub settlement_type: SystemSettlementType,
}
pub struct SystemBatchSettleTx {
pub settlements: Vec<Settlement>,
}
// After: Unified type
pub struct SystemSettleTx {
pub settlements: Vec<Settlement>,
}
pub struct Settlement {
pub thesis_id: u64,
pub settlement_type: SystemSettlementType,
}

Files Changed:
- crates/tx/src/payload.rs - Unified SystemSettleTx with Vec<Settlement>
- crates/engine/src/executors/rfq.rs - Added execute_settlement() with validation-first pattern
- crates/engine/src/lib.rs - Single handler for settlement
- crates/app/src/settler/service.rs - Batch collection and submission logic
- sdk/src/schema/tx.ts - Added SystemSettleTxSchema, SettlementSchema
Breaking Change: Borsh encoding changed. Requires coordinated upgrade of settler and nodes.
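A sketch of the batch-collection step, assuming a tokio mpsc channel feeding the submitter; the 50 ms window and batch_size cap mirror the description above, but the names and structure are illustrative.

```rust
use std::time::Duration;
use tokio::{sync::mpsc, time::{timeout_at, Instant}};

// Trimmed stand-in for the Settlement payload type defined above.
struct Settlement {
    thesis_id: u64,
}

/// Collect settlements for up to 50 ms (or until `batch_size` is reached),
/// then hand the batch to the submitter as a single SystemSettle transaction.
async fn collect_batch(rx: &mut mpsc::Receiver<Settlement>, batch_size: usize) -> Vec<Settlement> {
    let mut batch = Vec::new();
    // Wait for the first settlement, then keep the window open for 50 ms.
    match rx.recv().await {
        Some(first) => batch.push(first),
        None => return batch, // channel closed
    }
    let deadline = Instant::now() + Duration::from_millis(50);
    while batch.len() < batch_size {
        match timeout_at(deadline, rx.recv()).await {
            Ok(Some(s)) => batch.push(s),
            _ => break, // window elapsed or channel closed
        }
    }
    batch
}
```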
Hybrid Thesis Sync (WebSocket + RPC Polling)
Tester app now uses a hybrid approach for active thesis and thesis history, combining real-time WebSocket events with RPC polling for improved reliability.
Previous Behavior: Thesis data only existed in local memory. If WebSocket disconnected, settlement events were missed. No history persisted across sessions.
New Behavior:
- Initial load from RPC - Fetches thesis history on connect via getThesesByUser()
- Real-time WebSocket - Subscribes to subscribeUserTheses() for immediate settlement notifications
- Fallback RPC polling - When WebSocket disconnects, polls every 5s (only if active theses exist)
- Reconnection sync - On WebSocket reconnect, syncs from RPC to catch missed events
- Thesis history persists across browser refreshes
- Missed settlements are recovered when WebSocket is down
- Efficient - real-time events when available, polling only as fallback
Files Changed:
- apps/tester/src/hooks/use-thesis-sync.ts - New hybrid sync hook
- apps/tester/src/stores/use-thesis-store.ts - Added syncTheses, updateThesisByThesisId, clearAll actions
- apps/tester/src/hooks/use-price-stream.ts - Removed thesis subscription (moved to sync hook)
- apps/tester/src/pages/index.tsx - Integrated useThesisSync hook
Fixed
Tester: Bridge Withdrawal Targeting External Chain
Fixed withdrawal requests incorrectly calling external chain gateway contracts instead of Kaizen Core.
Root Cause: BridgeModal.tsx used wagmi's writeContract to call an external gateway contract for withdrawals, which is incorrect. Withdrawals should be submitted to Kaizen Core, which then gets processed by the relayer to send funds to the external chain.
// Wrong: Calling external chain contract
writeContract({
address: selectedChainConfig.gatewayAddress,
abi: GATEWAY_ABI,
functionName: "withdraw",
args: [amountParsed, wagmiAddress],
});

// Correct: Submit to Kaizen Core
const payload = withdrawPayload(amountParsed, wagmiAddress);
await client.sendTransaction(payload, { waitForConfirmation: true });

- Description updated: "Request withdrawal from Kaizen. Funds will be sent to [chain] by the relayer."
- Shows Kaizen Core transaction hash instead of external chain explorer link
- Info message: "Withdrawals are submitted to Kaizen Core. The relayer will process and send funds to your selected chain."
- Withdraw button no longer requires chain switch (only deposits need external chain interaction)
apps/tester/src/components/BridgeModal.tsx - Use SDK's withdraw() payload builder and client.sendTransaction()
WebSocket Duplicate Event Broadcast
Fixed duplicate WebSocket events being sent to frontend clients, which could cause unnecessary re-renders and subscription loops.
Root Cause: In write-node's executor.rs, events were broadcast twice:
- Immediately during execute_tx() for real-time feedback
- Again during checkpoint() when the block was committed
This meant every thesis settlement, transfer, and oracle price update was delivered to WebSocket subscribers twice.
Solution: Removed immediate event broadcast from execute_tx(). Events are now only broadcast once during checkpoint(), which includes both transaction events and oracle price events from begin_block().
Trade-off: Transaction events now have ~100ms higher latency (wait for next checkpoint) but are guaranteed to be delivered exactly once.
Files Changed:
- crates/app/src/executor.rs - Removed duplicate event broadcast in execute_tx()
WebSocket UserTheses Subscription Missing Settlement Events
Fixed users not receiving RfqSettled events for their own theses when they lost.
Root Cause: The UserTheses WebSocket subscription in subscriptions.rs filtered RfqSettled events by winner == address instead of user == address. This meant users only received settlement notifications when they won, not when they lost.
Solution: Changed the filter condition to check if the event's user field matches the subscription address, ensuring users receive all settlement events for their theses regardless of outcome.
crates/app/src/ws/subscriptions.rs - Fixed RfqSettled event filter to use user instead of winner
Tester: WebSocket Subscription Loop on Thesis Updates
Fixed infinite WebSocket re-subscription loop that could occur when thesis events were received.
Root Cause: In use-thesis-sync.ts, the handleThesisEvent callback had theses array in its dependency list. When a thesis event arrived and updated the store, the callback was recreated, which triggered the useEffect to unsubscribe and re-subscribe to the WebSocket channel, which could cause duplicate events and further re-renders.
Solution: Wrapped handleThesisEvent in a useRef to keep a stable reference. The subscription useEffect now only depends on connection state, not on the callback itself. The ref is updated on each render to always have access to the latest store state.
apps/tester/src/hooks/use-thesis-sync.ts - Stabilized callback reference with useRef
Settler: Invalid Breach Timestamp Outside Thesis Window
Fixed "Invalid breach timestamp: X not in [start_time, end_time]" error when settler submitted SolverWins settlements.
Root Cause: The find_breach function in settler could return a breach timestamp that was slightly before the thesis's start_time. This happened because get_price_at() uses a 100ms tolerance window, so it might return a price from timestamp T-50ms when querying for timestamp T. The breach was valid (price did breach), but the returned timestamp was outside the thesis's valid observation period.
Solution: Added explicit bounds check in find_breach to ensure the actual timestamp (from the price cache) falls within [thesis.start_time, thesis.end_time] before returning it as a valid breach.
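A condensed sketch of the added check; parameter names follow the description above and this is not the literal find_breach code.

```rust
// Only report a breach if the actual cached timestamp lies inside the
// thesis's observation window [start_time, end_time].
fn validate_breach(
    actual_ts: u64,
    price: u64,
    start_time: u64,
    end_time: u64,
) -> Option<(u64, u64)> {
    if actual_ts < start_time || actual_ts > end_time {
        return None; // price came from the 100ms tolerance window, outside the thesis
    }
    Some((actual_ts, price))
}
```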
crates/app/src/settler/service.rs - Added timestamp bounds validation in find_breach()
Settler: State Loss on Restart
Added height persistence to settler so it can resume from the last processed block after restart.
Previous Behavior: Settler always started from block 0 on restart, requiring full event replay which could be slow or fail if historical events were pruned.
New Behavior:
- Settler saves last processed height to {data_dir}/height.txt every 100 blocks
- On startup, reads persisted height and resumes event stream from that point
- Falls back to height 0 if persistence file doesn't exist or is corrupted
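A sketch of what the persistence helpers could look like; the actual functions are read_persisted_height() and write_persisted_height() in the settler service, and this only illustrates the shape.

```rust
use std::{fs, path::Path};

/// Read the last processed height from {data_dir}/height.txt, falling back to 0.
fn read_persisted_height(data_dir: &Path) -> u64 {
    fs::read_to_string(data_dir.join("height.txt"))
        .ok()
        .and_then(|s| s.trim().parse().ok())
        .unwrap_or(0)
}

/// Persist the height (called every 100 blocks in the service).
fn write_persisted_height(data_dir: &Path, height: u64) -> std::io::Result<()> {
    fs::create_dir_all(data_dir)?;
    fs::write(data_dir.join("height.txt"), height.to_string())
}
```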
# CLI
settler --data-dir ./.data/settler
# Environment variable
SETTLER_DATA_DIR=./.data/settler

Files Changed:
- crates/app/src/settler/config.rs - Added data_dir field
- crates/app/src/settler/service.rs - Added read_persisted_height(), write_persisted_height()
- apps/settler/src/main.rs - Added --data-dir CLI argument
Settler: Failed Settlement Retry
Fixed settler not retrying settlements that failed due to RPC or execution errors.
Root Cause: When a settlement transaction failed (either RPC error or execution error like "Invalid breach timestamp"), the thesis remained in pending_settlements indefinitely. The breach detector would skip it, assuming a settlement was already in flight.
Solution: Added feedback channel from settlement_submitter to a new settlement_result_handler task. When a settlement fails, its thesis_id is removed from pending_settlements, allowing the breach detector to pick it up again for retry.
crates/app/src/settler/service.rs - Added SettlementResult, settlement_result_handler(), feedback channel
Settler: Breach Price Mismatch on SolverWins Settlement
Fixed "Breach price mismatch" error when settler submitted SolverWins settlements.
Root Cause: The find_breach function used the iteration timestamp (t) instead of the actual price entry timestamp when reporting breaches. When prices were cached at timestamps slightly different from 100ms intervals, the node's ring buffer lookup would return a different price.
Example:
- Oracle price at timestamp 1050 → P1
- Settler iterates at t=1100, finds P1 (within 100ms tolerance)
- Settler sends: breach_timestamp=1100, breach_price=P1
- Node: slot_for_timestamp(1100) = slot 11, which has P2 (from timestamp 1150)
- Mismatch: P1 ≠ P2

Solution: Changed get_price_at() to return (actual_timestamp, price) tuple instead of just price. The breach detector now uses the actual oracle timestamp, ensuring the node's ring buffer lookup returns the same price.
crates/app/src/settler/service.rs - Fixed get_price_at() and find_breach() to use actual timestamps
Settler: Double Settlement Race Condition
Fixed "Thesis not active" error caused by settler submitting duplicate settlements for the same thesis.
Root Cause: After detecting a breach and sending a settlement decision to the channel, the thesis remained in active_theses until the RfqSettled event arrived. The next breach detection cycle would detect the same breach and send another settlement, which failed because the thesis was already settled.
Solution: Added pending_settlements: HashSet<u64> to track theses with in-flight settlements:
- Before detection, filter out theses already in pending_settlements
- After detection, mark decided theses as pending before sending to channel
- When RfqSettled event arrives, clear both active_theses and pending_settlements
crates/app/src/settler/service.rs- Added pending settlement tracking
Settler: Enhanced Settlement Response Logging
Added detailed logging for settlement transaction responses to aid debugging.
New Log Fields:
- On success: status, block_height, tx_index from receipt
- On execution failure: receipt.error message, individual settlement details
- On RPC error: breach_timestamp, breach_price for each failed settlement

New Metric:
- settler_execution_errors_total - Count of transactions included but failed execution

Files Changed:
- crates/app/src/settler/service.rs - Enhanced submit_batch() logging
Tester: API Wallet Mismatch on Wallet Switch
Fixed "Invalid user signature: signer is not user nor an authorized API wallet" error when switching accounts in external wallet.
Root Cause: When user switched accounts in MetaMask, the localStorage API wallet still belonged to the previous account. The new account would try to sign quotes with the old API wallet, causing signature verification failures.
Solution: Added wallet change detection that automatically clears mismatched API wallets.
Files Changed:
- apps/tester/src/stores/use-wallet-store.ts - Added handleMainWalletChange() to clear API wallet on account switch
- apps/tester/src/hooks/use-kaizen-client.tsx - Calls handleMainWalletChange() when wagmi address changes
Tester: WebSocket Abrupt Disconnect on Wallet Change
Fixed WebSocket connection dropping abruptly when switching wallet accounts, causing poor UX.
Root Cause: React effect cleanup immediately called client.disconnectWebSocket() without any grace period.
Solution: Added graceful disconnection sequence:
- Mark WebSocket as disconnected immediately (prevents new requests)
- Clear core service client
- Wait 100ms before actual WebSocket close
apps/tester/src/hooks/use-kaizen-client.tsx- Added graceful disconnect with timeout
Tester: EnableConnectionModal "Existing Wallet Found" UX Confusion
Fixed confusing "Existing API Wallet Found" message appearing during API wallet setup flow.
Root Cause: The useEffect that reset modal state had apiWallet in dependencies, causing it to re-run and show the message immediately after generating a new wallet.
Solution:
- Changed effect to only trigger on isOpen change, not wallet state changes
- Added isReusingExisting state to track if resuming previous setup
- Changed message from "Existing API Wallet Found" to "Resume Setup" for clarity
apps/tester/src/components/EnableConnectionModal.tsx- Fixed effect dependencies and improved messaging
Tester: Prevent Box Drawing Without Sufficient Balance
Box drawing is now disabled when user balance is below minimum bet amount, regardless of degen mode.
Previous Behavior: Users could draw boxes with 0 balance, only to see error after attempting to execute.
New Behavior:
- BOX tool button is disabled and grayed out when balance < minimum bet
- Clicking disabled button shows toast explaining insufficient balance
- If balance drops while BOX tool is active, tool is auto-deactivated
- Chart also checks balance before allowing drag start (defense in depth)
Files Changed:
- apps/tester/src/components/RightPanel.tsx - Added balance check on tool activation
- apps/tester/src/components/Chart.tsx - Added balance check in mousedown handler
Settler Challenge Deadline Timing Race
Fixed "ChallengeWindowNotOver" error when settler submits UserWins settlement right after deadline passes.
Root Cause: Settler used SystemTime::now() to check deadline, but core uses block timestamp which is aligned down to 100ms intervals via align_timestamp(). This caused a race condition where settler saw the deadline as passed, but core's block timestamp hadn't caught up yet.
Timeline example:
- T=950ms: Checkpoint → block_timestamp = 900ms (aligned down)
- T=1001ms: Settler sees now >= deadline (1000ms) → submits UserWins
- TX executes with block_timestamp = 900ms
- Core: 900 < 1000 → ChallengeWindowNotOver!

Solution: Added deadline_buffer (default 200ms) to settler config. Settler now waits until now >= challenge_deadline + deadline_buffer before submitting UserWins settlement.
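A sketch of the timing relationship, with the 100 ms alignment written out; the names are illustrative, not the exact settler code.

```rust
/// Core aligns block timestamps down to 100 ms boundaries.
fn align_timestamp(ts_ms: u64) -> u64 {
    ts_ms - (ts_ms % 100)
}

/// Settler only submits UserWins once the wall clock is past the deadline
/// plus a safety buffer, so the (aligned) block timestamp has caught up too.
fn can_submit_user_wins(now_ms: u64, challenge_deadline_ms: u64, deadline_buffer_ms: u64) -> bool {
    now_ms >= challenge_deadline_ms + deadline_buffer_ms
}
```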
Files Changed:
- crates/app/src/settler/config.rs - Added deadline_buffer field
- crates/app/src/settler/service.rs - Apply buffer in breach detection
- apps/settler/src/main.rs - Added --deadline-buffer CLI flag
- docker-compose.yml - Explicitly set deadline buffer
# CLI (default 200ms)
settler --write-node 127.0.0.1:9000 --deadline-buffer 200

SDK Quote Signing Hash Mismatch
Fixed "Invalid user signature" error when submitting thesis via tester app.
Root Cause: SDK's buildQuoteSigningHash passed hex-encoded bytes to viem's keccak256, which produced a different hash than Rust's direct byte hashing.
Solution: Pass raw Uint8Array directly to keccak256 instead of converting to hex first.
// Before (incorrect)
return keccak256(bytesToHex(message));
// After (correct)
return keccak256(message);

sdk/src/signer.ts - Fixed buildQuoteSigningHash function
SDK RfqSettledEvent Schema Mismatch
Fixed WebSocket event deserialization failure for thesis settlement events.
Root Cause: SDK's RfqSettledEvent was missing fields that Rust's Event::RfqSettled had.
Solution: Added missing fields to match Rust schema.
Fields Added:
- user: AddressSchema
- solver: AddressSchema
- oraclePair: OraclePairSchema
- betAmount: bigint

sdk/src/schema/event.ts - Updated RfqSettledEvent class
Changed
Oracle Service Rename
Renamed mock-oracle to oracle as it's now the official production service.
- Directory: apps/mock-oracle → apps/oracle
- Package: @kaizen-core/mock-oracle → @kaizen-core/oracle
- Docker service: mock-oracle → oracle
- Container: kaizen-mock-oracle → kaizen-oracle

Files Changed:
- apps/oracle/package.json - Package name
- apps/oracle/src/logger.ts - Logger name
- pnpm-workspace.yaml - Workspace path
- docker-compose.yml - Service config
- docker/Dockerfile.oracle - Build paths
- docker/config/write-node.toml - Oracle URL
- docker/monitoring/prometheus/prometheus.yml - Scrape target
# Docker users: rebuild the oracle image
docker compose build oracle
# Development: reinstall dependencies
pnpm install

Fixed
Read-Node State Root Divergence
Fixed critical state synchronization issues between write-node and read-node that caused WebSocket disconnections after transaction execution.
Root Causes:
- Duplicate Transaction Check During Replay: Read-node was rejecting replayed transactions as duplicates
- Non-deterministic HashMap Iteration: pending_updates HashMap iteration order varied between nodes, causing different JMT state roots
- Timestamp Inconsistency: Thesis.created_at used different timestamps between write-node (system time) and read-node (block time)
Solutions:
- Added execute_tx_replay() method to bypass duplicate checks during block sync
- Sorted pending_updates by KeyHash before JMT commit for deterministic ordering (see the sketch below)
- Introduced read_version/write_version separation in StateManager
- Passed consistent block_timestamp to transaction execution
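A sketch of the deterministic-ordering part of the fix, with simplified types: sort the pending updates by KeyHash before handing them to the JMT commit so every node produces the same state root.

```rust
use std::collections::HashMap;

type KeyHash = [u8; 32];

/// HashMap iteration order is unspecified, so sort by key before committing
/// to guarantee identical JMT state roots on every node.
fn ordered_updates(
    pending: &HashMap<KeyHash, Option<Vec<u8>>>,
) -> Vec<(KeyHash, Option<Vec<u8>>)> {
    let mut updates: Vec<_> = pending.iter().map(|(k, v)| (*k, v.clone())).collect();
    updates.sort_by_key(|(k, _)| *k);
    updates
}
```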
Files Changed:
- crates/engine/src/lib.rs - Added replay mode for transaction execution
- crates/state/src/manager.rs - Read/write version separation
- crates/state/src/rocksdb_storage.rs - Deterministic HashMap ordering
- crates/app/src/sync/client.rs - Snapshot/restore on verification failure
- crates/app/src/executor.rs - Consistent block timestamp handling
CLI Transaction Encoding
Fixed CLI's tx commands (withdraw, transfer, etc.) not working due to incorrect transaction encoding.
Root Cause: CLI was building transactions with custom format instead of using kaizen_tx::Transaction type.
Solution: Refactored CLI to use kaizen-tx crate for proper transaction building and signing.
Files Changed:
- apps/cli/Cargo.toml - Added kaizen-tx dependency
- apps/cli/src/commands/tx.rs - Rewrote using kaizen_tx::Transaction
- apps/cli/src/rpc.rs - Fixed RPC method name and response type
CLI Signature Mismatch
Fixed "Invalid user signature" error when submitting thesis via CLI.
Root Cause: CLI's sign_quote function used different domain separator than SDK/mock-solver.
Solution: Aligned signing logic to use "Kaizen:SolverQuote" domain separator with keccak256 hashing.
apps/cli/src/commands/thesis.rs- Fixed signature generation
Bridge Withdrawal Status Format
Fixed withdrawal status format incompatibility with bridge service.
Root Cause: Core returned human-readable status ("Pending") but bridge expected numeric string ("0").
Solution: Changed status serialization to output enum discriminant as string.
Files Changed:
- crates/app/src/rpc/types.rs - Changed format!("{:?}", status) to (status as u8).to_string()
| Numeric | Status |
|---|---|
"0" | Pending |
"1" | Processing |
"2" | Completed |
"3" | Failed |
API Changes
RPC Methods
kaizen_sendTransaction
- Now properly returns execution result object instead of just transaction hash
{
"hash": "0x...",
"status": "executed",
"success": true,
"error": null,
"blockHeight": 1234,
"txIndex": 0,
"events": [...]
}

kaizen_getUnprocessedWithdrawals
- Status field now returns numeric string for bridge compatibility
Testing
Full lifecycle test verified:
- ✅ Bridge Deposit (faucet mint)
- ✅ Thesis Submit (RFQ)
- ✅ Settlement (UserWin/SolverWin)
- ✅ Bridge Withdraw
- ✅ Read-node Sync (0 state root mismatches)
Stress test results:
- 87 thesis submissions
- 50 rapid-fire parallel submissions
- 0 state root mismatches between write-node and read-node
