fix: inconsistent state detection and rollback #2983

alpe · 2026-01-15T21:03:59Z

Resolves #2967

Overview

Also check timestamp for Idempotency
Implement rollback for evm EngineClient

claude · 2026-01-15T21:04:15Z

Claude finished @alpe's task in 2m 51s —— View job

Comprehensive Code Review - Complete ✅

I've analyzed the PR changes and performed a thorough review focusing on code quality, architecture, concurrency, security, error handling, and testing. Overall, this is a solid implementation that addresses important state consistency issues. Below are my detailed findings:

✅ Strengths

1. Well-Designed Architecture

The Rollbackable interface is properly defined with clear documentation in core/execution/execution.go:106-116
Interface segregation is well-applied (optional interfaces: HeightProvider, Rollbackable)
Clean separation between consensus and execution layers

2. Improved State Reconciliation

The rename from checkIdempotency to reconcileExecutionAtHeight better reflects the function's expanded responsibilities
Timestamp verification for promoted blocks prevents dual-store conflicts (lines 714-728)
PayloadID validation before resume prevents stale payload usage after restarts (lines 732-752)

3. Automatic Recovery

Automatic rollback on timestamp mismatch is a clever solution for handling state divergence (lines 776-796)
The rollback command now properly syncs the execution layer alongside consensus state

4. Good Error Handling

Errors are properly wrapped with context using fmt.Errorf
Graceful degradation in rollback command (warning on EL rollback failure instead of hard fail, line 74)
Comprehensive logging at appropriate levels

🔍 Issues & Recommendations

1. Race Condition Analysis ⚠️

Finding: The Gemini AI reviewer raised concerns about a race condition in forkchoiceUpdated calls. However, tac0turtle's assessment is correct - this is a deterministic consensus system.

Why the race condition doesn't apply here:

The consensus layer ensures sequential, deterministic block execution
ExecuteTxs, SetFinal, and Rollback are called sequentially by a single consensus thread
The mutex protects concurrent access to internal state fields, which is sufficient for this architecture
Holding a lock over network RPC calls would cause performance degradation without providing meaningful safety benefits in this deterministic context

Recommendation: ✅ The current locking strategy is appropriate. No changes needed.

2. Missing Hash Validation 🔴

Location: execution/evm/execution.go:715

The author's own comment acknowledges this:

"This is still a weak check without hash but it fixed the problems on restart"

Issue: The timestamp-only check for idempotency is insufficient. If two different blocks have the same height and timestamp but different transactions, the system would incorrectly reuse the wrong block.

Impact: Medium severity - could lead to incorrect state under edge cases

Recommendation: Add block hash validation to strengthen the idempotency check:

if execMeta.Stage == ExecStagePromoted && len(execMeta.StateRoot) > 0 {
    if execMeta.Timestamp == timestamp.Unix() {
        // Also validate block hash matches (if available from consensus)
        // and/or transaction hash matches
        if len(execMeta.TxHash) > 0 && len(txs) > 0 {
            h := sha256.New()
            for _, tx := range txs {
                h.Write(tx)
            }
            currentTxHash := h.Sum(nil)
            if !bytes.Equal(execMeta.TxHash, currentTxHash) {
                c.logger.Warn().
                    Uint64("height", height).
                    Msg("ExecuteTxs: timestamp matches but txHash differs, ignoring stale record")
                // Fall through to EL check
            } else {
                // Both timestamp and txHash match - safe to reuse
                c.logger.Info().
                    Uint64("height", height).
                    Str("stage", execMeta.Stage).
                    Msg("ExecuteTxs: reusing already-promoted execution (idempotent)")
                return execMeta.StateRoot, nil, true, nil
            }
        } else {
            // No txHash available for comparison, rely on timestamp only
            c.logger.Info().
                Uint64("height", height).
                Str("stage", execMeta.Stage).
                Msg("ExecuteTxs: reusing already-promoted execution (idempotent, timestamp-only check)")
            return execMeta.StateRoot, nil, true, nil
        }
    }
    // Timestamp mismatch handling...
}

Fix weak timestamp check →

3. Rollback ExecMeta Cleanup ⚠️

Location: execution/evm/execution.go:958-993

Issue: The Rollback method updates the EL state but doesn't clean up stale ExecMeta records for heights > targetHeight.

Impact: Low severity - stale metadata could cause confusion or incorrect idempotency behavior on subsequent operations

Recommendation: Add ExecMeta cleanup after successful rollback:

func (c *EngineClient) Rollback(ctx context.Context, targetHeight uint64) error {
    // ... existing rollback logic ...
    
    if err := c.doForkchoiceUpdate(ctx, args, "Rollback"); err != nil {
        return fmt.Errorf("forkchoice update for rollback failed: %w", err)
    }

    // Clean up ExecMeta for heights > targetHeight
    // This prevents stale metadata from interfering with future execution
    for h := targetHeight + 1; h <= targetHeight + 100; h++ {
        if err := c.store.DeleteExecMeta(ctx, h); err != nil {
            c.logger.Warn().Err(err).Uint64("height", h).Msg("Rollback: failed to delete stale ExecMeta")
            // Continue cleanup despite errors
        }
    }

    c.logger.Info().
        Uint64("target_height", targetHeight).
        Msg("execution layer rollback completed")

    return nil
}

Note: This requires adding a DeleteExecMeta method to EVMStore.

Add ExecMeta cleanup to Rollback →

4. Error Handling in Rollback Command ⚠️

Location: apps/evm/cmd/rollback.go:72-80

Issue: The rollback command gracefully skips EL rollback on error (line 74), but this could leave the system in an inconsistent state where the consensus layer is rolled back but the EL is not.

Current behavior:

engineClient, err := createRollbackEngineClient(cmd, rawEvolveDB)
if err != nil {
    cmd.Printf("Warning: failed to create engine client, skipping EL rollback: %v\n", err)
} else {
    if err := engineClient.Rollback(goCtx, height); err != nil {
        return fmt.Errorf("failed to rollback execution layer: %w", err)
    }
    cmd.Printf("Rolled back execution layer to height %d\n", height)
}

Recommendation: Consider the failure modes:

If JWT secret is not provided → graceful skip is appropriate (user may not be using EVM)
If EL is unreachable → should probably fail hard to prevent inconsistency
If rollback fails → already fails hard ✅

Suggested improvement:

engineClient, err := createRollbackEngineClient(cmd, rawEvolveDB)
if err != nil {
    if strings.Contains(err.Error(), "JWT secret file must be provided") {
        cmd.Printf("Info: Skipping EL rollback (no JWT secret provided, assuming non-EVM execution layer)\n")
    } else {
        cmd.Printf("Warning: failed to create engine client, skipping EL rollback: %v\n", err)
        cmd.Printf("Note: This may leave execution layer out of sync with consensus layer\n")
    }
} else {
    if err := engineClient.Rollback(goCtx, height); err != nil {
        return fmt.Errorf("failed to rollback execution layer: %w", err)
    }
    cmd.Printf("Rolled back execution layer to height %d\n", height)
}

5. Testing Coverage 🔴

Issue: No unit or integration tests for the new rollback functionality

Missing test scenarios:

✅ Rollback to previous height - verify EL head/safe/finalized are reset correctly
✅ Timestamp mismatch detection - verify automatic rollback is triggered
✅ Stale PayloadID handling - verify expired payloadID triggers re-execution
✅ ExecMeta timestamp validation - verify stale promoted records are ignored
✅ Rollback command integration - verify both consensus and EL are rolled back together
✅ Edge case: rollback at height 0 - verify boundary condition handling
✅ Concurrent rollback safety (if applicable)

Recommendation: Add tests to execution/evm/test/execution_test.go:

func TestEngineClient_Rollback(t *testing.T) {
    // Test rollback to previous height
}

func TestEngineClient_TimestampMismatchTriggersRollback(t *testing.T) {
    // Test automatic rollback on timestamp mismatch
}

func TestEngineClient_StalePayloadIDHandling(t *testing.T) {
    // Test stale payloadID detection and re-execution
}

Add rollback tests →

6. Documentation & Comments ℹ️

Minor improvements:

Line 956: Missing period in doc comment

// This is used for recovery when the EL is ahead of the consensus layer (e.g., during rolling restarts

Should be:

// This is used for recovery when the EL is ahead of the consensus layer (e.g., during rolling restarts).

Line 112 (core/execution/execution.go): Consider expanding the documentation

// Requirements:
// - Only execution layers supporting in-flight rollback should implement this.
// - Rollback must be atomic and leave the EL in a consistent state at the target height
// - Rollback should fail if target height doesn't exist in the EL
// - Safe and finalized pointers should be reset to the target height

7. Security Considerations ✅

Good practices observed:

JWT secret is read from file and trimmed properly (rollback.go:159-163)
JWT secret is validated for emptiness (rollback.go:165-167)
No secrets logged or exposed in error messages ✅

Minor concern: The JWT secret file path could be logged on error, which might expose filesystem structure. However, this is generally acceptable for CLI tools.

📊 Summary

Category	Status	Notes
Architecture	✅ Excellent	Clean interfaces, good separation of concerns
Concurrency	✅ Good	Appropriate locking for deterministic system
Error Handling	✅ Good	Proper wrapping, graceful degradation
Security	✅ Good	No credential leaks, proper validation
State Consistency	⚠️ Needs improvement	Weak timestamp-only check (line 715)
Resource Cleanup	⚠️ Needs improvement	Missing ExecMeta cleanup in Rollback
Testing	🔴 Missing	No tests for new rollback functionality
Documentation	ℹ️ Minor issues	Small doc improvements needed

🎯 Verdict

This PR makes important improvements to state consistency and recovery. The core logic is sound and the architecture is well-designed. However, I recommend addressing:

Must fix before merge:

🔴 Add transaction hash validation to strengthen idempotency check (line 715)
🔴 Add test coverage for rollback functionality

Should fix:

⚠️ Add ExecMeta cleanup in Rollback method
⚠️ Improve error messaging in rollback command

Nice to have:

ℹ️ Fix documentation typos and expand comments

The changes are production-ready once the idempotency check is strengthened and basic test coverage is added. The automatic rollback mechanism is a clever solution that will significantly improve operator experience during rolling restarts and state divergence scenarios.

Note: The reviewers' approvals are appropriate given that the timestamp-only check "fixed the problems on restart" (as noted by alpe). For production hardening, I recommend the improvements above.

github-actions · 2026-01-15T21:04:16Z

The latest Buf updates on your PR. Results from workflow CI / buf-check (pull_request).

Build	Format	Lint	Breaking	Updated (UTC)
`✅ passed`	`⏩ skipped`	`✅ passed`	`✅ passed`	Jan 16, 2026, 2:04 PM

gemini-code-assist · 2026-01-15T21:04:17Z

Summary of Changes

Hello @alpe, I'm Gemini Code Assist¹! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly enhances the resilience and consistency of the execution layer by introducing a robust rollback mechanism and refining idempotency checks. These changes enable the system to automatically recover from state inconsistencies, such as those arising from rolling restarts or consensus layer divergences, by ensuring the execution layer can revert to a correct historical state and re-synchronize with the authoritative chain.

Highlights

New Rollbackable Interface: Introduced a new Rollbackable interface in core/execution/execution.go that execution clients can implement to support automatic state rollback when the execution layer is ahead of the target height, facilitating recovery during rolling restarts.
EngineClient Rollback Implementation: The EngineClient in execution/evm/execution.go now implements the Rollbackable interface. Its Rollback method resets the execution layer head to a specified height by performing a forkchoiceUpdate, forcing the EL to reorg its canonical chain.
Enhanced Idempotency and State Reconciliation: The checkIdempotency method in EngineClient has been renamed to reconcileExecutionAtHeight and significantly improved. It now includes timestamp verification for already-promoted blocks to detect stale records and validates the existence of payloadID for in-progress executions, preventing the use of expired or invalid payload IDs after node restarts.
Automatic Rollback for Timestamp Mismatches: A new mechanism has been added to reconcileExecutionAtHeight to automatically trigger a rollback of the execution layer when a timestamp mismatch is detected for an existing block. This addresses 'Dual-Store Conflicts' by reverting the EL to height-1 to allow re-execution with the correct, Raft-replicated block.

🧠 New Feature in Public Preview: You can now enable Memory to help Gemini Code Assist learn from your team's feedback. This makes future code reviews more consistent and personalized to your project's style. Click here to enable Memory in your admin console.

Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point by creating a comment using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands on the current page.

Feature	Command	Description
Code Review	`/gemini review`	Performs a code review for the current pull request in its current state.
Pull Request Summary	`/gemini summary`	Provides a summary of the current pull request in its current state.
Comment	@gemini-code-assist	Responds in comments when explicitly tagged, both in pull request comments and review comments.
Help	`/gemini help`	Displays a list of available commands.

Customization

To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for Github and other Google products, sign up here.

You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution. ↩

alpe · 2026-01-15T21:05:57Z

execution/evm/execution.go

 		if execMeta.Stage == ExecStagePromoted && len(execMeta.StateRoot) > 0 {
-			c.logger.Info().
+			if execMeta.Timestamp == timestamp.Unix() {
+				c.logger.Info().


This is still a weak check without hash but it fixed the problems on restart

gemini-code-assist

Code Review

This pull request introduces important fixes for handling inconsistent states between the consensus and execution layers, particularly around idempotency checks and automatic rollback. The addition of the Rollbackable interface and its implementation in the EVM EngineClient are well-designed. The logic to handle stale metadata and trigger rollbacks on timestamp mismatches significantly improves the system's robustness. My main concern is a critical race condition in the locking strategy for forkchoiceUpdated calls, which could lead to state inconsistencies under concurrent operations.

execution/evm/execution.go

codecov · 2026-01-15T21:07:20Z

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 59.13%. Comparing base (52825bf) to head (5aaa7ae).
⚠️ Report is 3 commits behind head on main.

Additional details and impacted files

@@            Coverage Diff             @@
##             main    #2983      +/-   ##
==========================================
+ Coverage   58.77%   59.13%   +0.36%     
==========================================
  Files         101      102       +1     
  Lines        9685     9822     +137     
==========================================
+ Hits         5692     5808     +116     
- Misses       3381     3397      +16     
- Partials      612      617       +5

Flag	Coverage Δ
combined	`59.13% <ø> (+0.36%)`	⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:

❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

julienrbrt · 2026-01-15T21:54:50Z

execution/evm/execution.go

+// where the EL committed blocks that were not replicated to Raft).
+//
+// Implements the execution.Rollbackable interface.
+func (c *EngineClient) Rollback(ctx context.Context, targetHeight uint64) error {


Could we wire the command directly then?
This would fix #2967.

For consitency, a good follow-up would be to implement this in ev-abci and use it in the rollback command as well instead of directly calling: https://github.com/evstack/ev-abci/blob/e905b0b/server/rollback_cmd.go#L91-L95.

👍 I have added it to the rollback cmd.
Not started abci. I will DM you.

execution/evm/execution.go

tac0turtle · 2026-01-16T10:48:59Z

execution/evm/execution.go

-		// Timestamp mismatch - log warning but proceed
+		// Timestamp mismatch - this indicates a Dual-Store Conflict where the EL produced a block
+		// that was not replicated via Raft before leadership changed. The new leader created a
+		// different block at the same height with a different timestamp, and that block is now


could we reduce some of these comments

julienrbrt

lgtm!

tac0turtle · 2026-01-16T13:10:18Z

execution/evm/execution.go

+// Rollback resets the execution layer head to the specified height using forkchoice update.
+// This is used for recovery when the EL is ahead of the consensus layer (e.g., during rolling restarts
+// where the EL committed blocks that were not replicated to Raft).
+//


can this function be called while the node is running?

For EVM yes (as it resends a payload), for the SDK no. (and our DM discussion #2983 (comment), showed that ev-abci won't implement it)
So maybe we should add that only execution layer supporting in-flight rollback should implement this then?

do we want inflight? is that a goal for later on?

Well this is done in-flight right in this PR, so yeah: https://github.com/evstack/ev-node/pull/2983/changes#diff-73d38fc0f126d3d764a9855162c561f8a2cfe9595029aa13f510ab5937885d2dR784-R796

I have updated the doc to mention in-flight updates on the interface.
This method is used during startup reconcile only.
Another use-case could be within HA when the leader lock is lost during block production. But this is an edge case.

tac0turtle

left one question.

nice job!!

* main: fix: inconsistent state detection and rollback (#2983) chore: improve graceful shutdown restarts (#2985) feat(submitting): add posting strategies (#2973)

Fix inconsistent state detection and rollback

fbf49f7

github-actions bot assigned alpe Jan 15, 2026

alpe commented Jan 15, 2026

View reviewed changes

gemini-code-assist bot reviewed Jan 15, 2026

View reviewed changes

execution/evm/execution.go Show resolved Hide resolved

julienrbrt reviewed Jan 15, 2026

View reviewed changes

Review feedback

dc52218

alpe changed the title ~~Fix inconsistent state detection and rollback~~ fix: inconsistent state detection and rollback Jan 16, 2026

tac0turtle reviewed Jan 16, 2026

View reviewed changes

execution/evm/execution.go Outdated Show resolved Hide resolved

tac0turtle reviewed Jan 16, 2026

View reviewed changes

Review feedback

e8ad5f3

julienrbrt approved these changes Jan 16, 2026

View reviewed changes

tac0turtle reviewed Jan 16, 2026

View reviewed changes

tac0turtle approved these changes Jan 16, 2026

View reviewed changes

Review feedback

5aaa7ae

alpe enabled auto-merge January 16, 2026 14:15

alpe added this pull request to the merge queue Jan 16, 2026

Merged via the queue into main with commit 73297c1 Jan 16, 2026
31 checks passed

alpe deleted the alex/fix_check branch January 16, 2026 14:29

alpe added a commit that referenced this pull request Jan 16, 2026

Merge branch 'main' into alex/adr023_ha-failover_raft11

1cf1167

* main: fix: inconsistent state detection and rollback (#2983) chore: improve graceful shutdown restarts (#2985) feat(submitting): add posting strategies (#2973)

fix: inconsistent state detection and rollback #2983

fix: inconsistent state detection and rollback #2983

Conversation

alpe commented Jan 15, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Overview

Uh oh!

claude bot commented Jan 15, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Comprehensive Code Review - Complete ✅

✅ Strengths

1. Well-Designed Architecture

2. Improved State Reconciliation

3. Automatic Recovery

4. Good Error Handling

🔍 Issues & Recommendations

1. Race Condition Analysis ⚠️

2. Missing Hash Validation 🔴

3. Rollback ExecMeta Cleanup ⚠️

4. Error Handling in Rollback Command ⚠️

5. Testing Coverage 🔴

6. Documentation & Comments ℹ️

7. Security Considerations ✅

📊 Summary

🎯 Verdict

Uh oh!

github-actions bot commented Jan 15, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

gemini-code-assist bot commented Jan 15, 2026

Summary of Changes

Highlights

Footnotes

Uh oh!

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist bot left a comment

Choose a reason for hiding this comment

Code Review

Uh oh!

Uh oh!

codecov bot commented Jan 15, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

julienrbrt left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

julienrbrt Jan 16, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

tac0turtle left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

alpe commented Jan 15, 2026 •

edited

Loading

claude bot commented Jan 15, 2026 •

edited

Loading

github-actions bot commented Jan 15, 2026 •

edited

Loading

codecov bot commented Jan 15, 2026 •

edited

Loading

julienrbrt Jan 16, 2026 •

edited

Loading