Skip to content

Conversation

@alpe
Copy link
Contributor

@alpe alpe commented Jan 15, 2026

Resolves #2967

Overview

  • Also check timestamp for Idempotency
  • Implement rollback for evm EngineClient

@claude
Copy link
Contributor

claude bot commented Jan 15, 2026

Claude finished @alpe's task in 2m 51s —— View job


Comprehensive Code Review - Complete ✅

I've analyzed the PR changes and performed a thorough review focusing on code quality, architecture, concurrency, security, error handling, and testing. Overall, this is a solid implementation that addresses important state consistency issues. Below are my detailed findings:


✅ Strengths

1. Well-Designed Architecture

  • The Rollbackable interface is properly defined with clear documentation in core/execution/execution.go:106-116
  • Interface segregation is well-applied (optional interfaces: HeightProvider, Rollbackable)
  • Clean separation between consensus and execution layers

2. Improved State Reconciliation

  • The rename from checkIdempotency to reconcileExecutionAtHeight better reflects the function's expanded responsibilities
  • Timestamp verification for promoted blocks prevents dual-store conflicts (lines 714-728)
  • PayloadID validation before resume prevents stale payload usage after restarts (lines 732-752)

3. Automatic Recovery

  • Automatic rollback on timestamp mismatch is a clever solution for handling state divergence (lines 776-796)
  • The rollback command now properly syncs the execution layer alongside consensus state

4. Good Error Handling

  • Errors are properly wrapped with context using fmt.Errorf
  • Graceful degradation in rollback command (warning on EL rollback failure instead of hard fail, line 74)
  • Comprehensive logging at appropriate levels

🔍 Issues & Recommendations

1. Race Condition Analysis ⚠️

Finding: The Gemini AI reviewer raised concerns about a race condition in forkchoiceUpdated calls. However, tac0turtle's assessment is correct - this is a deterministic consensus system.

Why the race condition doesn't apply here:

  • The consensus layer ensures sequential, deterministic block execution
  • ExecuteTxs, SetFinal, and Rollback are called sequentially by a single consensus thread
  • The mutex protects concurrent access to internal state fields, which is sufficient for this architecture
  • Holding a lock over network RPC calls would cause performance degradation without providing meaningful safety benefits in this deterministic context

Recommendation: ✅ The current locking strategy is appropriate. No changes needed.


2. Missing Hash Validation 🔴

Location: execution/evm/execution.go:715

The author's own comment acknowledges this:

"This is still a weak check without hash but it fixed the problems on restart"

Issue: The timestamp-only check for idempotency is insufficient. If two different blocks have the same height and timestamp but different transactions, the system would incorrectly reuse the wrong block.

Impact: Medium severity - could lead to incorrect state under edge cases

Recommendation: Add block hash validation to strengthen the idempotency check:

if execMeta.Stage == ExecStagePromoted && len(execMeta.StateRoot) > 0 {
    if execMeta.Timestamp == timestamp.Unix() {
        // Also validate block hash matches (if available from consensus)
        // and/or transaction hash matches
        if len(execMeta.TxHash) > 0 && len(txs) > 0 {
            h := sha256.New()
            for _, tx := range txs {
                h.Write(tx)
            }
            currentTxHash := h.Sum(nil)
            if !bytes.Equal(execMeta.TxHash, currentTxHash) {
                c.logger.Warn().
                    Uint64("height", height).
                    Msg("ExecuteTxs: timestamp matches but txHash differs, ignoring stale record")
                // Fall through to EL check
            } else {
                // Both timestamp and txHash match - safe to reuse
                c.logger.Info().
                    Uint64("height", height).
                    Str("stage", execMeta.Stage).
                    Msg("ExecuteTxs: reusing already-promoted execution (idempotent)")
                return execMeta.StateRoot, nil, true, nil
            }
        } else {
            // No txHash available for comparison, rely on timestamp only
            c.logger.Info().
                Uint64("height", height).
                Str("stage", execMeta.Stage).
                Msg("ExecuteTxs: reusing already-promoted execution (idempotent, timestamp-only check)")
            return execMeta.StateRoot, nil, true, nil
        }
    }
    // Timestamp mismatch handling...
}

Fix weak timestamp check →


3. Rollback ExecMeta Cleanup ⚠️

Location: execution/evm/execution.go:958-993

Issue: The Rollback method updates the EL state but doesn't clean up stale ExecMeta records for heights > targetHeight.

Impact: Low severity - stale metadata could cause confusion or incorrect idempotency behavior on subsequent operations

Recommendation: Add ExecMeta cleanup after successful rollback:

func (c *EngineClient) Rollback(ctx context.Context, targetHeight uint64) error {
    // ... existing rollback logic ...
    
    if err := c.doForkchoiceUpdate(ctx, args, "Rollback"); err != nil {
        return fmt.Errorf("forkchoice update for rollback failed: %w", err)
    }

    // Clean up ExecMeta for heights > targetHeight
    // This prevents stale metadata from interfering with future execution
    for h := targetHeight + 1; h <= targetHeight + 100; h++ {
        if err := c.store.DeleteExecMeta(ctx, h); err != nil {
            c.logger.Warn().Err(err).Uint64("height", h).Msg("Rollback: failed to delete stale ExecMeta")
            // Continue cleanup despite errors
        }
    }

    c.logger.Info().
        Uint64("target_height", targetHeight).
        Msg("execution layer rollback completed")

    return nil
}

Note: This requires adding a DeleteExecMeta method to EVMStore.

Add ExecMeta cleanup to Rollback →


4. Error Handling in Rollback Command ⚠️

Location: apps/evm/cmd/rollback.go:72-80

Issue: The rollback command gracefully skips EL rollback on error (line 74), but this could leave the system in an inconsistent state where the consensus layer is rolled back but the EL is not.

Current behavior:

engineClient, err := createRollbackEngineClient(cmd, rawEvolveDB)
if err != nil {
    cmd.Printf("Warning: failed to create engine client, skipping EL rollback: %v\n", err)
} else {
    if err := engineClient.Rollback(goCtx, height); err != nil {
        return fmt.Errorf("failed to rollback execution layer: %w", err)
    }
    cmd.Printf("Rolled back execution layer to height %d\n", height)
}

Recommendation: Consider the failure modes:

  1. If JWT secret is not provided → graceful skip is appropriate (user may not be using EVM)
  2. If EL is unreachable → should probably fail hard to prevent inconsistency
  3. If rollback fails → already fails hard ✅

Suggested improvement:

engineClient, err := createRollbackEngineClient(cmd, rawEvolveDB)
if err != nil {
    if strings.Contains(err.Error(), "JWT secret file must be provided") {
        cmd.Printf("Info: Skipping EL rollback (no JWT secret provided, assuming non-EVM execution layer)\n")
    } else {
        cmd.Printf("Warning: failed to create engine client, skipping EL rollback: %v\n", err)
        cmd.Printf("Note: This may leave execution layer out of sync with consensus layer\n")
    }
} else {
    if err := engineClient.Rollback(goCtx, height); err != nil {
        return fmt.Errorf("failed to rollback execution layer: %w", err)
    }
    cmd.Printf("Rolled back execution layer to height %d\n", height)
}

5. Testing Coverage 🔴

Issue: No unit or integration tests for the new rollback functionality

Missing test scenarios:

  1. Rollback to previous height - verify EL head/safe/finalized are reset correctly
  2. Timestamp mismatch detection - verify automatic rollback is triggered
  3. Stale PayloadID handling - verify expired payloadID triggers re-execution
  4. ExecMeta timestamp validation - verify stale promoted records are ignored
  5. Rollback command integration - verify both consensus and EL are rolled back together
  6. Edge case: rollback at height 0 - verify boundary condition handling
  7. Concurrent rollback safety (if applicable)

Recommendation: Add tests to execution/evm/test/execution_test.go:

func TestEngineClient_Rollback(t *testing.T) {
    // Test rollback to previous height
}

func TestEngineClient_TimestampMismatchTriggersRollback(t *testing.T) {
    // Test automatic rollback on timestamp mismatch
}

func TestEngineClient_StalePayloadIDHandling(t *testing.T) {
    // Test stale payloadID detection and re-execution
}

Add rollback tests →


6. Documentation & Comments ℹ️

Minor improvements:

  1. Line 956: Missing period in doc comment

    // This is used for recovery when the EL is ahead of the consensus layer (e.g., during rolling restarts

    Should be:

    // This is used for recovery when the EL is ahead of the consensus layer (e.g., during rolling restarts).
  2. Line 112 (core/execution/execution.go): Consider expanding the documentation

    // Requirements:
    // - Only execution layers supporting in-flight rollback should implement this.
    // - Rollback must be atomic and leave the EL in a consistent state at the target height
    // - Rollback should fail if target height doesn't exist in the EL
    // - Safe and finalized pointers should be reset to the target height

7. Security Considerations

Good practices observed:

  • JWT secret is read from file and trimmed properly (rollback.go:159-163)
  • JWT secret is validated for emptiness (rollback.go:165-167)
  • No secrets logged or exposed in error messages ✅

Minor concern: The JWT secret file path could be logged on error, which might expose filesystem structure. However, this is generally acceptable for CLI tools.


📊 Summary

Category Status Notes
Architecture ✅ Excellent Clean interfaces, good separation of concerns
Concurrency ✅ Good Appropriate locking for deterministic system
Error Handling ✅ Good Proper wrapping, graceful degradation
Security ✅ Good No credential leaks, proper validation
State Consistency ⚠️ Needs improvement Weak timestamp-only check (line 715)
Resource Cleanup ⚠️ Needs improvement Missing ExecMeta cleanup in Rollback
Testing 🔴 Missing No tests for new rollback functionality
Documentation ℹ️ Minor issues Small doc improvements needed

🎯 Verdict

This PR makes important improvements to state consistency and recovery. The core logic is sound and the architecture is well-designed. However, I recommend addressing:

Must fix before merge:

  • 🔴 Add transaction hash validation to strengthen idempotency check (line 715)
  • 🔴 Add test coverage for rollback functionality

Should fix:

  • ⚠️ Add ExecMeta cleanup in Rollback method
  • ⚠️ Improve error messaging in rollback command

Nice to have:

  • ℹ️ Fix documentation typos and expand comments

The changes are production-ready once the idempotency check is strengthened and basic test coverage is added. The automatic rollback mechanism is a clever solution that will significantly improve operator experience during rolling restarts and state divergence scenarios.


Note: The reviewers' approvals are appropriate given that the timestamp-only check "fixed the problems on restart" (as noted by alpe). For production hardening, I recommend the improvements above.


@github-actions
Copy link
Contributor

github-actions bot commented Jan 15, 2026

The latest Buf updates on your PR. Results from workflow CI / buf-check (pull_request).

BuildFormatLintBreakingUpdated (UTC)
✅ passed⏩ skipped✅ passed✅ passedJan 16, 2026, 2:04 PM

@gemini-code-assist
Copy link
Contributor

Summary of Changes

Hello @alpe, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly enhances the resilience and consistency of the execution layer by introducing a robust rollback mechanism and refining idempotency checks. These changes enable the system to automatically recover from state inconsistencies, such as those arising from rolling restarts or consensus layer divergences, by ensuring the execution layer can revert to a correct historical state and re-synchronize with the authoritative chain.

Highlights

  • New Rollbackable Interface: Introduced a new Rollbackable interface in core/execution/execution.go that execution clients can implement to support automatic state rollback when the execution layer is ahead of the target height, facilitating recovery during rolling restarts.
  • EngineClient Rollback Implementation: The EngineClient in execution/evm/execution.go now implements the Rollbackable interface. Its Rollback method resets the execution layer head to a specified height by performing a forkchoiceUpdate, forcing the EL to reorg its canonical chain.
  • Enhanced Idempotency and State Reconciliation: The checkIdempotency method in EngineClient has been renamed to reconcileExecutionAtHeight and significantly improved. It now includes timestamp verification for already-promoted blocks to detect stale records and validates the existence of payloadID for in-progress executions, preventing the use of expired or invalid payload IDs after node restarts.
  • Automatic Rollback for Timestamp Mismatches: A new mechanism has been added to reconcileExecutionAtHeight to automatically trigger a rollback of the execution layer when a timestamp mismatch is detected for an existing block. This addresses 'Dual-Store Conflicts' by reverting the EL to height-1 to allow re-execution with the correct, Raft-replicated block.

🧠 New Feature in Public Preview: You can now enable Memory to help Gemini Code Assist learn from your team's feedback. This makes future code reviews more consistent and personalized to your project's style. Click here to enable Memory in your admin console.

Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point by creating a comment using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands on the current page.

Feature Command Description
Code Review /gemini review Performs a code review for the current pull request in its current state.
Pull Request Summary /gemini summary Provides a summary of the current pull request in its current state.
Comment @gemini-code-assist Responds in comments when explicitly tagged, both in pull request comments and review comments.
Help /gemini help Displays a list of available commands.

Customization

To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for Github and other Google products, sign up here.

You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

if execMeta.Stage == ExecStagePromoted && len(execMeta.StateRoot) > 0 {
c.logger.Info().
if execMeta.Timestamp == timestamp.Unix() {
c.logger.Info().
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is still a weak check without hash but it fixed the problems on restart

Copy link
Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces important fixes for handling inconsistent states between the consensus and execution layers, particularly around idempotency checks and automatic rollback. The addition of the Rollbackable interface and its implementation in the EVM EngineClient are well-designed. The logic to handle stale metadata and trigger rollbacks on timestamp mismatches significantly improves the system's robustness. My main concern is a critical race condition in the locking strategy for forkchoiceUpdated calls, which could lead to state inconsistencies under concurrent operations.

@codecov
Copy link

codecov bot commented Jan 15, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 59.13%. Comparing base (52825bf) to head (5aaa7ae).
⚠️ Report is 3 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #2983      +/-   ##
==========================================
+ Coverage   58.77%   59.13%   +0.36%     
==========================================
  Files         101      102       +1     
  Lines        9685     9822     +137     
==========================================
+ Hits         5692     5808     +116     
- Misses       3381     3397      +16     
- Partials      612      617       +5     
Flag Coverage Δ
combined 59.13% <ø> (+0.36%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

// where the EL committed blocks that were not replicated to Raft).
//
// Implements the execution.Rollbackable interface.
func (c *EngineClient) Rollback(ctx context.Context, targetHeight uint64) error {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could we wire the command directly then?
This would fix #2967.

For consitency, a good follow-up would be to implement this in ev-abci and use it in the rollback command as well instead of directly calling: https://github.com/evstack/ev-abci/blob/e905b0b/server/rollback_cmd.go#L91-L95.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍 I have added it to the rollback cmd.
Not started abci. I will DM you.

@alpe alpe changed the title Fix inconsistent state detection and rollback fix: inconsistent state detection and rollback Jan 16, 2026
// Timestamp mismatch - log warning but proceed
// Timestamp mismatch - this indicates a Dual-Store Conflict where the EL produced a block
// that was not replicated via Raft before leadership changed. The new leader created a
// different block at the same height with a different timestamp, and that block is now
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

could we reduce some of these comments

Copy link
Member

@julienrbrt julienrbrt left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm!

// Rollback resets the execution layer head to the specified height using forkchoice update.
// This is used for recovery when the EL is ahead of the consensus layer (e.g., during rolling restarts
// where the EL committed blocks that were not replicated to Raft).
//
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can this function be called while the node is running?

Copy link
Member

@julienrbrt julienrbrt Jan 16, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For EVM yes (as it resends a payload), for the SDK no. (and our DM discussion #2983 (comment), showed that ev-abci won't implement it)
So maybe we should add that only execution layer supporting in-flight rollback should implement this then?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do we want inflight? is that a goal for later on?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have updated the doc to mention in-flight updates on the interface.
This method is used during startup reconcile only.
Another use-case could be within HA when the leader lock is lost during block production. But this is an edge case.

Copy link
Contributor

@tac0turtle tac0turtle left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

left one question.

nice job!!

@alpe alpe enabled auto-merge January 16, 2026 14:15
@alpe alpe added this pull request to the merge queue Jan 16, 2026
Merged via the queue into main with commit 73297c1 Jan 16, 2026
31 checks passed
@alpe alpe deleted the alex/fix_check branch January 16, 2026 14:29
alpe added a commit that referenced this pull request Jan 16, 2026
* main:
  fix: inconsistent state detection and rollback (#2983)
  chore: improve graceful shutdown restarts (#2985)
  feat(submitting): add posting strategies (#2973)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Improve EVM rollback

4 participants