
Conversation

@alpe
Contributor

@alpe alpe commented Jan 6, 2026

Replaces #2836

alpe added 28 commits November 12, 2025 15:16
* main:
  fix: remove duplicate error logging in light node shutdown (#2841)
  chore: fix incorrect function name in comment (#2840)
  chore: remove sequencer go.mod (#2837)
* main:
  build(deps): Bump the go_modules group across 2 directories with 3 updates (#2846)
  build(deps): Bump github.com/dvsekhvalnov/jose2go from 1.7.0 to 1.8.0 in /test/e2e (#2851)
  build(deps): Bump github.com/consensys/gnark-crypto from 0.18.0 to 0.18.1 in /test/e2e (#2844)
  build(deps): Bump github.com/cometbft/cometbft from 0.38.17 to 0.38.19 in /test/e2e (#2843)
  build(deps): Bump github.com/dvsekhvalnov/jose2go from 1.6.0 to 1.7.0 in /test/e2e (#2845)
(cherry picked from commit c44cd77e665f6d5d463295c6ed61c59a56d88db3)
* main:
  chore: reduce log noise (#2864)
  fix: sync service for non zero height starts with empty store (#2834)
  build(deps): Bump golang.org/x/crypto from 0.43.0 to 0.45.0 in /execution/evm (#2861)
  chore: minor improvement for docs (#2862)
* main:
  chore: bump da (#2866)
  chore: bump  core (#2865)
* main:
  chore: fix some comments (#2874)
  chore: bump node in evm-single (#2875)
  refactor(syncer,cache): use compare and swap loop and add comments (#2873)
  refactor: use state da height as well (#2872)
  refactor: retrieve highest da height in cache (#2870)
  chore: change from event count to start and end height (#2871)
* main:
  chore: remove extra github action yml file (#2882)
  fix(execution/evm): verify payload status (#2863)
  feat: fetch included da height from store (#2880)
  chore: better output on errors (#2879)
  refactor!: create da client and split cache interface (#2878)
  chore!: rename `evm-single` and `grpc-single` (#2839)
  build(deps): Bump golang.org/x/crypto from 0.42.0 to 0.45.0 in /tools/da-debug in the go_modules group across 1 directory (#2876)
  chore: parallel cache de/serialization (#2868)
  chore: bump blob size (#2877)
* main:
  build(deps): Bump mdast-util-to-hast from 13.2.0 to 13.2.1 in /docs in the npm_and_yarn group across 1 directory (#2900)
  refactor(block): centralize timeout in client (#2903)
  build(deps): Bump the all-go group across 2 directories with 3 updates (#2898)
  chore: bump default timeout (#2902)
  fix: revert default db (#2897)
  refactor: remove obsolete // +build tag (#2899)
  fix:da visualiser namespace  (#2895)
  refactor: omit unnecessary reassignment (#2892)
  build(deps): Bump the all-go group across 5 directories with 6 updates (#2881)
  chore: fix inconsistent method name in retryWithBackoffOnPayloadStatus comment (#2889)
  fix: ensure consistent network ID usage in P2P subscriber (#2884)
  build(deps): Bump golangci/golangci-lint-action from 9.0.0 to 9.1.0 (#2885)
  build(deps): Bump actions/checkout from 5 to 6 (#2886)
* main: (34 commits)
  feat: make reaper poll duration configurable (#2951)
  chore!: move sequencers to pkg (#2931)
  feat: Ensure Header integrity on DA (#2948)
  feat(testda): add header support with GetHeaderByHeight method (#2946)
  chore: improve code comments clarity (#2947)
  chore(sequencers): optimize store check (#2945)
  fix: make evm_execution more robust (#2942)
  fix(sequencers/single): deterministic queue (#2938)
  fix(block): fix init logic sequencer for da epoch fetching (#2926)
  feat: use DA timestamp (#2939)
  chore: improve code comments clarity (#2943)
  build(deps): bump libp2p (#2937)
  build(deps): Bump actions/cache from 4 to 5 (#2934)
  build(deps): Bump actions/download-artifact from 6 to 7 (#2933)
  build(deps): Bump actions/upload-artifact from 5 to 6 (#2932)
  feat: DA Client remove interface part 3, replace types with new code (#2910)
  DA Client remove interface: Part 2.5, create e2e test to validate that a blob is posted in DA layer. (#2920)
  fix(syncing): skip forced txs checks for p2p blocks (#2922)
  build(deps): Bump the all-go group across 5 directories with 5 updates (#2919)
  chore: loosen syncer state check (#2927)
  ...
@github-actions
Contributor

github-actions bot commented Jan 6, 2026

The latest Buf updates on your PR. Results from workflow CI / buf-check (pull_request).

Build      Format      Lint       Breaking   Updated (UTC)
✅ passed   ⏩ skipped   ✅ passed   ✅ passed   Jan 12, 2026, 2:58 PM

@claude
Contributor

claude bot commented Jan 6, 2026

Claude finished @alpe's task in 3m 7s.


Code Review: High Availability via RAFT

I've completed a comprehensive code review of this PR implementing high availability via RAFT consensus. This is an impressive architectural change introducing leader election, state replication, and dynamic role switching. Below is my detailed feedback organized by priority and category.


Critical Issues (Must Fix)

1. Bootstrap Mode Restriction Blocks Non-Bootstrap Nodes

Location: pkg/raft/node.go:108-115

The Start() method fails fast when Bootstrap is false, which prevents any node from joining an existing cluster. This is a blocking issue for cluster expansion.

if !n.config.Bootstrap {
    // it is intended to fail fast here. at this stage only bootstrap mode is supported.
    return fmt.Errorf("raft cluster requires bootstrap mode")
}

Problem: New nodes attempting to join an established cluster will fail immediately. This defeats the purpose of a distributed HA system.

Recommendation: Remove this restriction. Allow non-bootstrap nodes to start and join via peer discovery. The bootstrap logic should only execute when n.config.Bootstrap == true, otherwise return nil to allow normal startup.
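
A minimal sketch of that shape at the same spot in Start(); the rest of the function is assumed to continue with the existing bootstrap logic:

if !n.config.Bootstrap {
    // Non-bootstrap nodes start normally and join via peer discovery / AddVoter;
    // only bootstrap nodes fall through to the bootstrap logic below.
    return nil
}
n.logger.Info().Msg("bootstrap raft cluster")
// ... existing bootstrap logic ...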


2. Error Handling with Naked Returns

Location: node/full.go:113-119

Two critical error paths use naked return statements, which will return (nil, nil) instead of propagating errors:

if leaderElection, err = newSingleRoleElector(leaderFactory); err != nil {
    return  // Line 114 - naked return swallows error
}
// ...
if leaderElection, err = newSingleRoleElector(followerFactory); err != nil {
    return  // Line 118 - naked return swallows error
}

Problem: Failures during initialization will be silently ignored, leading to nil pointer dereferences later.

Recommendation: Change both to return nil, err.


3. Potential Data Race in FSM State Access

Location: pkg/raft/node.go:228

The GetState() method clones the state pointer but doesn't guarantee consistency during concurrent Apply operations:

func (n *Node) GetState() *RaftBlockState {
    return proto.Clone(n.fsm.state.Load()).(*RaftBlockState)
}

Problem: While atomic.Pointer provides atomic load, the state could be modified by another goroutine after the load but before cloning. For critical operations like recovery (line 107 in election.go), this could lead to inconsistent state.

Recommendation: Consider adding a read lock or document that callers should handle potential inconsistency. The current implementation may be acceptable if state updates are always complete and atomic, but this should be explicitly documented.
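
If state updates already follow a copy-on-write discipline, the Load-then-Clone read is consistent by construction. A sketch of that write-side pattern; the setState helper and the fsm receiver type are assumptions, not existing code:

// setState never mutates the stored state in place: it clones, applies the
// update, then atomically swaps the pointer, so readers only ever observe
// complete snapshots.
func (f *fsm) setState(update func(s *RaftBlockState)) {
    next := proto.Clone(f.state.Load()).(*RaftBlockState)
    update(next)
    f.state.Store(next)
}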


High-Priority Issues

4. Leadership Transfer Without Verification

Location: pkg/raft/election.go:119-123

When a node becomes leader but isn't synced, it transfers leadership without verifying a suitable successor exists:

if err := d.node.leadershipTransfer(); err != nil && !errors.Is(err, raft.ErrNotLeader) {
    // the leadership transfer can fail due to no suitable leader. Better stop than double sign on old state
    return err
}
continue

Problem: The comment acknowledges the risk but continues anyway. If no suitable leader exists and transfer fails silently, the cluster could end up without a leader.

Recommendation: Add logging for failed transfers and consider implementing a backoff/retry mechanism or explicit verification that another node is ready.
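
One possible shape, sketched with arbitrary retry parameters (d.logger and ctx are assumed to be available in this scope):

const maxTransferAttempts = 3 // arbitrary; tune for the expected cluster size

for attempt := 1; attempt <= maxTransferAttempts; attempt++ {
    err := d.node.leadershipTransfer()
    if err == nil || errors.Is(err, raft.ErrNotLeader) {
        break
    }
    d.logger.Warn().Err(err).Int("attempt", attempt).Msg("leadership transfer failed")
    select {
    case <-ctx.Done():
        return ctx.Err()
    case <-time.After(time.Duration(attempt) * 500 * time.Millisecond):
        // simple linear backoff before retrying
    }
}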

5. Magic Numbers Without Constants

Location: pkg/raft/election.go:88

ticker := time.NewTicker(300 * time.Millisecond)

Problem: Hardcoded timing values make tuning difficult and obscure intent.

Recommendation: Extract to a named constant like leaderElectionTickInterval with a comment explaining its purpose.
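
For example:

// leaderElectionTickInterval is how often the elector re-checks leadership and
// sync status; lowering it shortens failover time at the cost of more polling.
const leaderElectionTickInterval = 300 * time.Millisecond

ticker := time.NewTicker(leaderElectionTickInterval)
defer ticker.Stop()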

6. Typo in Log Message

Location: pkg/raft/node.go:122

n.logger.Info().Msg("Boostrap raft cluster")  // "Boostrap" should be "Bootstrap"


7. Incomplete Context Cancellation Handling

Location: pkg/raft/node.go:150-169

The waitForMsgsLanded function doesn't check ctx.Done():

for {
    select {
    case <-ticker.C:
        if n.raft.AppliedIndex() >= n.raft.LastIndex() {
            return nil
        }
    case <-timeoutTicker.C:
        return errors.New("max wait time reached")
    }
}

Problem: If the parent context is cancelled, this function will continue waiting until timeout.

Recommendation: Add a case <-ctx.Done(): return ctx.Err() to respect context cancellation.
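
Sketched against the loop above:

for {
    select {
    case <-ctx.Done():
        // stop waiting as soon as the caller cancels
        return ctx.Err()
    case <-ticker.C:
        if n.raft.AppliedIndex() >= n.raft.LastIndex() {
            return nil
        }
    case <-timeoutTicker.C:
        return errors.New("max wait time reached")
    }
}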


Medium-Priority Issues

8. Inconsistent Naming Convention

Location: pkg/rpc/server/http.go:143

NodeId   string `json:"node_id"`  // Should be NodeID per Go conventions

Problem: Go naming conventions dictate that initialisms like ID should be fully capitalized.

Recommendation: Rename to NodeID for consistency.

9. Debug Log at Info Level

Location: block/internal/syncing/syncer.go:184

s.logger.Debug().Uint64("header_height", state.LastSubmittedDAHeaderHeight).Uint64("data_height", state.LastSubmittedDaDataHeight).Msg("received raft block state")

Problem: The message reads like debug output ("received raft block state"), yet the code logs it at Info level, which will clutter production logs.

Recommendation: Change to Debug level if it's truly debug info, or enhance the message if it should remain at Info level.

10. Redundant Type Conversion

Location: block/internal/executing/executor.go:623-625

for i, tx := range batchData.Transactions {
    data.Txs[i] = tx  // types.Tx is an alias for []byte, tx is already []byte
}

Problem: The loop is unnecessary since types.Tx is just a []byte alias.

Recommendation: Replace with data.Txs = append(data.Txs, batchData.Transactions...) for better performance and clarity.

11. Misleading Method Name

Location: block/internal/cache/pending_headers.go:69-71

func (ph *PendingHeaders) GetLastSubmittedDataHeight() uint64 {  // Method is on PendingHeaders but gets data height
    return ph.base.getLastSubmittedHeight()
}

Problem: The method is in PendingHeaders but retrieves data height, which is confusing.

Recommendation: Rename to GetLastSubmittedHeaderHeight() and update call sites.


Security Review

12. ✅ No Credential Leaks Detected

The code properly handles sensitive data:

  • Auth tokens referenced via flags (FlagDAAuthToken) with #nosec annotations
  • No hardcoded credentials
  • Proper use of JWT secrets in tests

13. ✅ Input Validation

  • Peer addresses validated in splitPeerAddr() (line 355-375, node.go)
  • Proposer validation in assertExpectedProposer() (assert.go)
  • Block state validation in assertValid() (types.go)

14. ⚠️ DOS Risk: Unbounded Channel Blocking

Location: pkg/raft/node.go:293-300

if f.applyCh != nil {
    select {
    case f.applyCh <- RaftApplyMsg{Index: log.Index, State: &state}:
    default:
        // on a slow consumer, the raft cluster should not be blocked.
        f.logger.Warn().Msg("apply channel full, dropping message")
    }
}

Analysis: The implementation correctly uses non-blocking send with a default case to prevent raft from stalling. The channel is buffered (size 1 in raft_retriever.go:62), which may be too small under load.

Recommendation: Document the buffer size choice or make it configurable. Consider metrics on dropped messages.
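
A lightweight option, assuming a new droppedApplyMsgs atomic.Uint64 field on the FSM (the field name is hypothetical):

if f.applyCh != nil {
    select {
    case f.applyCh <- RaftApplyMsg{Index: log.Index, State: &state}:
    default:
        // Still non-blocking so raft never stalls on a slow consumer,
        // but the drop is now counted and surfaced in the log line.
        dropped := f.droppedApplyMsgs.Add(1)
        f.logger.Warn().Uint64("dropped_total", dropped).Msg("apply channel full, dropping message")
    }
}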


Performance & Resource Efficiency

15. ✅ Proper Goroutine Management

  • Clean shutdown patterns with WaitGroup and context cancellation
  • Deferred cleanup in failoverState.Run() (lines 176-231)
  • No obvious goroutine leaks

16. ✅ Connection Pooling

  • HTTP servers properly configured with timeouts (line 154-160, failover.go)
  • Raft TCP transport with connection limits (line 88, node.go)

17. ⚠️ Memory Allocation in Hot Paths

Location: pkg/raft/node.go:228

return proto.Clone(n.fsm.state.Load()).(*RaftBlockState)

Analysis: Every state read allocates a new protobuf clone. For high-frequency queries (like the readiness check ticker in election.go:88), this could cause GC pressure.

Recommendation: Consider caching or rate-limiting state queries for read-only operations.
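
One possible mitigation, sketched with hypothetical cacheMu, cachedState, and cachedAt fields; callers must treat the returned value as read-only and accept up to maxAge of staleness:

// GetStateCached re-clones the FSM state at most once per maxAge window.
func (n *Node) GetStateCached(maxAge time.Duration) *RaftBlockState {
    n.cacheMu.Lock()
    defer n.cacheMu.Unlock()
    if n.cachedState == nil || time.Since(n.cachedAt) > maxAge {
        n.cachedState = proto.Clone(n.fsm.state.Load()).(*RaftBlockState)
        n.cachedAt = time.Now()
    }
    return n.cachedState
}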

18. Snapshot Store Configuration

Location: pkg/raft/node.go:78

snapshotStore, err := raft.NewFileSnapshotStore(cfg.RaftDir, int(cfg.SnapCount), os.Stderr)

Problem: Snapshot output goes to os.Stderr, which is not ideal for production.

Recommendation: Use a proper logger or log file for snapshot store output.
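
If the hashicorp/raft version in use exposes NewFileSnapshotStoreWithLogger, a hedged sketch of routing the output through hclog instead of stderr:

snapLogger := hclog.New(&hclog.LoggerOptions{Name: "raft-snapshot", Level: hclog.Info})
snapshotStore, err := raft.NewFileSnapshotStoreWithLogger(cfg.RaftDir, int(cfg.SnapCount), snapLogger)
if err != nil {
    return nil, fmt.Errorf("create snapshot store: %w", err)
}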


Concurrency & State Management

19. ✅ Proper Atomic Operations

  • atomic.Pointer[types.State] used correctly for state management
  • atomic.Bool for running flags
  • No obvious race conditions in the reviewed code

20. ⚠️ Potential Deadlock Risk

Location: pkg/raft/election.go:99-128

Analysis: The leader election logic has complex state transitions with multiple locks and waits:

  1. Wait for raft messages to land (line 101)
  2. Check leadership again (line 102)
  3. Check sync status (line 108)
  4. Potentially recover (line 111)
  5. Wait for workers (line 128)

Problem: If recovery blocks indefinitely or workers don't respond to cancellation, the election can deadlock.

Recommendation: Add overall timeout context for the entire transition process. Document expected transition duration.
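
A sketch of bounding the whole transition; transitionTimeout and becomeLeader are stand-ins for the steps listed above, not existing identifiers:

const transitionTimeout = 2 * time.Minute // hypothetical upper bound for a full role switch

transitionCtx, cancel := context.WithTimeout(ctx, transitionTimeout)
defer cancel()

// becomeLeader would wrap steps 1-5 above so a stuck recovery or unresponsive
// worker cannot hold the elector forever.
if err := d.becomeLeader(transitionCtx); err != nil {
    return fmt.Errorf("leader transition did not complete: %w", err)
}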

21. ✅ Double-Sign Prevention

The code properly prevents double-signing:

  • Line 100-101 in election.go: Waits for in-flight raft messages before becoming leader
  • Line 102-105: Verifies leadership hasn't been lost during sync
  • Line 108: Checks sync status before allowing block production

This is critical for blockchain consensus safety.


Testing & Reliability

22. ✅ Comprehensive Test Coverage

  • Unit tests for helper functions (node_test.go, election_test.go)
  • E2E tests for failover scenarios (failover_e2e_test.go)
  • Mock generation configured properly (.mockery.yaml)

23. ⚠️ Low Patch Coverage (37.70%)

Analysis: Codecov reports 636 lines missing coverage, primarily in:

  • pkg/raft/node.go - 174 missing lines
  • pkg/raft/node_mock.go - 95 missing lines
  • block/internal/syncing/syncer.go - 93 missing lines
  • block/internal/syncing/raft_retriever.go - 63 missing lines (0% coverage!)

Recommendation: Add unit tests for raft_retriever.go and integration tests for syncer raft integration paths. The raft_retriever is critical for follower sync and shouldn't ship with 0% coverage.


24. Test Isolation Issues

Location: test/e2e/failover_e2e_test.go

Observation: E2E tests spawn multiple node processes and raft clusters. Ensure:

  • Port conflicts are handled
  • Cleanup is thorough (temp directories, processes)
  • Tests can run in parallel without interference

Architecture & Design

25. ✅ Clean Interface Design

The separation of concerns is excellent:

  • Runnable interface for pluggable components
  • Recoverable interface for state recovery
  • Factory pattern for leader/follower modes
  • Clear abstraction between raft consensus and block processing

26. ✅ Modular Failover Design

The failoverState struct cleanly encapsulates all components that need to be reset during role transitions:

  • P2P client
  • Sync services
  • RPC server
  • Block components

This makes testing and reasoning about state transitions easier.

27. ⚠️ Configuration Complexity

Location: pkg/config/config.go

Analysis: RAFT adds 9 new configuration flags:

  • node_id, raft_addr, raft_dir, bootstrap, peers
  • snap_count, send_timeout, heartbeat_timeout, leader_lease_timeout

Problem: The configuration surface is growing large. The relationship between these settings and their impact on cluster behavior may not be obvious to operators.

Recommendation:

  1. Add validation for common misconfigurations (e.g., all nodes having the same node_id); see the sketch after this list
  2. Provide configuration examples in documentation
  3. Consider configuration presets (e.g., "development", "production")
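
For the first point, a hedged sketch of a Validate method; the RaftConfig type name, the Peers []string field, and the id@address peer format are assumptions:

// Validate catches common misconfigurations before the raft node starts.
func (c RaftConfig) Validate() error {
    var errs []error
    if c.NodeID == "" {
        errs = append(errs, errors.New("node ID is required"))
    }
    seen := map[string]bool{}
    for _, p := range c.Peers {
        id := strings.SplitN(p, "@", 2)[0]
        if seen[id] {
            errs = append(errs, fmt.Errorf("duplicate peer id %q", id))
        }
        seen[id] = true
    }
    return errors.Join(errs...) // nil when no errors were collected
}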

28. Nil Check Pattern Inconsistency

Location: block/internal/syncing/syncer.go:182

if raftNode != nil && !reflect.ValueOf(raftNode).IsNil() {

Problem: This double nil check (a plain nil comparison plus a reflect-based check) suggests the code is guarding against the typed-nil interface gotcha. Whatever the reason, the pattern should be applied consistently across the codebase.

Recommendation: Document why both checks are needed or refactor to avoid interface nil issues.
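
For reference, a minimal, self-contained illustration of the typed-nil gotcha this pattern works around:

package main

import (
    "fmt"
    "reflect"
)

type nodeIface interface{ IsLeader() bool }

type fakeNode struct{}

func (*fakeNode) IsLeader() bool { return false }

func main() {
    var p *fakeNode     // nil pointer
    var n nodeIface = p // the interface now holds the non-nil pair (*fakeNode, nil)
    fmt.Println(n == nil)                   // false: a plain nil check lets it through
    fmt.Println(reflect.ValueOf(n).IsNil()) // true: the underlying pointer is nil
}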


Documentation

29. ⚠️ Missing ADR Reference

The PR title mentions "ADR023" but there's no visible ADR document in the PR changes.

Recommendation: Ensure the ADR is documented, either in the PR description or in a separate document that explains:

  • Why RAFT was chosen over other consensus algorithms
  • Trade-offs considered
  • Operational runbooks for cluster management

30. Configuration Documentation Needed

Location: New RAFT configuration flags

Recommendation: Add inline documentation for each flag explaining:

  • What it does
  • Reasonable ranges/values
  • Performance/safety trade-offs
  • Examples

Summary

This is a substantial and well-architected feature adding critical high-availability capabilities. The code demonstrates strong understanding of distributed systems challenges like double-signing prevention and safe leader transitions.

Strengths:

  • Clean separation of concerns with interfaces
  • Proper double-sign prevention logic
  • Good test coverage for core raft logic
  • Safe goroutine management and cleanup

Must Address Before Merge:

  1. Critical: Fix bootstrap mode restriction (blocks cluster joins)
  2. Critical: Fix naked return errors (causes nil panics)
  3. High: Add tests for raft_retriever.go (currently 0% coverage)
  4. High: Address leadership transfer without verification

Recommended for Follow-up:

  • Add operational metrics (leadership changes, recovery events, dropped raft messages)
  • Document cluster operational procedures (adding nodes, removing nodes, disaster recovery)
  • Performance testing under load (recovery time, throughput impact)
  • Chaos testing (network partitions, node crashes during transition)

Test Status: CI is currently running. The Docs Deploy Preview failed, but that appears unrelated to this PR's core functionality.


@gemini-code-assist
Contributor

Summary of Changes

Hello @alpe, I'm Gemini Code Assist¹! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly enhances the Rollkit network's resilience and fault tolerance by integrating the Raft consensus algorithm. The changes enable a robust failover mechanism, allowing nodes to seamlessly transition between leader and follower roles. This ensures continuous operation and consistent state across the network, even if individual nodes experience failures. The update touches various core components, from dependency management and block processing to RPC services and node lifecycle management, all aimed at achieving high availability.

Highlights

  • Raft Consensus Integration: Introduced Raft consensus for high availability, enabling dynamic leader election and state replication. This involves new dependencies on github.com/hashicorp/raft and github.com/hashicorp/raft-boltdb.
  • Node Architecture Refactoring: The node's core logic has been refactored to support a failover mechanism, allowing nodes to dynamically switch between aggregator (leader) and sync (follower) modes based on Raft's leadership status. This includes changes to how P2P clients are initialized and passed to node startup commands.
  • Block Processing with Raft: The block executor now checks Raft leadership before producing new blocks and proposes the block state to the Raft cluster after execution, ensuring state consistency across the cluster.
  • Cache and State Management Enhancements: Added methods to retrieve and persist the last submitted DA header and data heights from the cache. Logic was introduced to skip in-flight (temporary) headers and data during cache retrieval to prevent processing incomplete blocks.
  • New RPC Endpoints and Configuration: Added new RPC endpoints for querying Raft node status (e.g., /raft/node) and introduced new configuration flags for Raft-specific settings like node_id, raft_addr, bootstrap, and peers.


Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point by creating a comment using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands on the current page.

Feature               Command              Description
Code Review           /gemini review       Performs a code review for the current pull request in its current state.
Pull Request Summary  /gemini summary      Provides a summary of the current pull request in its current state.
Comment               @gemini-code-assist  Responds in comments when explicitly tagged, both in pull request comments and review comments.
Help                  /gemini help         Displays a list of available commands.

Customization

To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counterproductive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving feedback about your experience with Gemini Code Assist for GitHub and other Google products, sign up here.

You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces high availability via RAFT consensus, which is a significant architectural change. The implementation is extensive, touching many parts of the system from node startup to block processing and syncing. Overall, the changes are well-structured, introducing clear interfaces for RAFT integration and refactoring the node components to support dynamic role switching between leader and follower. I've identified a few critical issues related to error handling and startup logic that need to be addressed, along with some medium-severity issues for improving code clarity and maintainability. Great work on this complex feature.

I am having trouble creating individual review comments, so my feedback is included below.

node/full.go (113-115)

high

The naked return here will not propagate the error from newSingleRoleElector. This will cause the function to return (nil, nil), masking the underlying error. The error should be returned to the caller.

		if leaderElection, err = newSingleRoleElector(leaderFactory); err != nil {
			return nil, err
		}

node/full.go (117-119)

high

Similar to the previous case, the naked return here will swallow the error from newSingleRoleElector. The error should be propagated up to the caller.

		if leaderElection, err = newSingleRoleElector(followerFactory); err != nil {
			return nil, err
		}

pkg/raft/node.go (111-113)

high

This check prevents a node from starting if Bootstrap is false, which is problematic for nodes joining an existing cluster. A new node attempting to join will fail to start. The bootstrap logic should only execute if n.config.Bootstrap is true, and the function should return nil otherwise, allowing non-bootstrap nodes to start and join a cluster.

block/internal/cache/pending_headers.go (69-71)

medium

The method name GetLastSubmittedDataHeight is misleading as it's part of the PendingHeaders struct. For clarity and consistency, it should be renamed to GetLastSubmittedHeaderHeight.

This change will also require updating the call site in block/internal/cache/manager.go.

func (ph *PendingHeaders) GetLastSubmittedHeaderHeight() uint64 {
	return ph.base.getLastSubmittedHeight()
}

block/internal/executing/executor.go (570-572)

medium

The explicit type conversion types.Tx(tx) is redundant since types.Tx is an alias for []byte, and tx is already of type []byte. The change to a direct assignment is good, but it seems this loop could be replaced with a single, more efficient append call.

	data.Txs = append(data.Txs, batchData.Transactions...)

block/internal/syncing/syncer.go (184)

medium

This log message seems to be for debugging purposes, indicated by the +++ prefix. It should be logged at the Debug level instead of Info to avoid cluttering the logs in a production environment.

			s.logger.Debug().Uint64("header_height", state.LastSubmittedDAHeaderHeight).Uint64("data_height", state.LastSubmittedDADataHeight).Msg("received raft block state")

pkg/rpc/server/http.go (143-146)

medium

To adhere to Go's naming conventions for initialisms, the struct field NodeId should be renamed to NodeID.

				NodeID   string `json:"node_id"`
			}{
				IsLeader: raftNode.IsLeader(),
				NodeID:   raftNode.NodeID(),

@codecov

codecov bot commented Jan 7, 2026

Codecov Report

❌ Patch coverage is 37.78214% with 634 lines in your changes missing coverage. Please review.
✅ Project coverage is 55.58%. Comparing base (6d132bf) to head (d7fbf79).

Files with missing lines                   Patch %   Lines
pkg/raft/node.go                           12.12%    174 Missing ⚠️
pkg/raft/node_mock.go                      45.40%    74 Missing and 21 partials ⚠️
block/internal/syncing/syncer.go           13.08%    91 Missing and 2 partials ⚠️
block/internal/syncing/raft_retriever.go   0.00%     63 Missing ⚠️
node/full.go                               32.30%    37 Missing and 7 partials ⚠️
block/internal/executing/executor.go       4.54%     38 Missing and 4 partials ⚠️
node/failover.go                           72.34%    26 Missing and 13 partials ⚠️
pkg/raft/election.go                       78.72%    14 Missing and 6 partials ⚠️
pkg/rpc/server/http.go                     6.66%     13 Missing and 1 partial ⚠️
block/components.go                        27.27%    7 Missing and 1 partial ⚠️
... and 13 more
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #2954      +/-   ##
==========================================
- Coverage   57.87%   55.58%   -2.29%     
==========================================
  Files          97      104       +7     
  Lines        9303    10146     +843     
==========================================
+ Hits         5384     5640     +256     
- Misses       3315     3864     +549     
- Partials      604      642      +38     
Flag       Coverage Δ
combined   55.58% <37.78%> (-2.29%) ⬇️

Flags with carried forward coverage won't be shown.

☔ View full report in Codecov by Sentry.

alpe added 9 commits January 7, 2026 16:11
* main:
  feat(tracing): add tracing to EngineClient (#2959)
  chore: inject W3C headers into engine client and eth client (#2958)
  feat: adding tracing for Executor and added initial configuration (#2957)
* main:
  fix: Fail fast when executor ahead (#2966)
  feat(block): async epoch fetching (#2952)
  perf: tune badger defaults and add db bench (#2950)
@alpe
Contributor Author

alpe commented Jan 13, 2026

@gemini-code-assist review the latest code

Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces high availability through RAFT consensus, which is a significant and complex change. The overall architecture, including the use of a leader election mechanism and the separation of roles into leader (aggregator) and follower (sync) modes, is well-designed. The code is modular, leveraging interfaces and factories which is great for testability and maintenance. I've identified a critical issue related to resource management and a medium-severity issue in configuration validation. My feedback focuses on ensuring correctness and robustness, especially in error handling paths.

Comment on lines +337 to +339
if e.raftNode != nil && !e.raftNode.IsLeader() {
return errors.New("not raft leader")
}

high

This early return, and others in produceBlock, can cause a resource leak. The function uses a defer to discard a database batch on error, but this return statement doesn't set the err variable that the defer closure checks. To ensure the batch is discarded, you should assign the error to the function-scoped err variable before returning. This applies to all early returns in this function, including the new one for raft broadcast failure.

Suggested change
if e.raftNode != nil && !e.raftNode.IsLeader() {
return errors.New("not raft leader")
}
if e.raftNode != nil && !e.raftNode.IsLeader() {
err = errors.New("not raft leader")
return err
}

Comment on lines +298 to +300
if c.NodeID == "" {
multiErr = fmt.Errorf("node ID is required")
}

medium

This validation error is assigned directly to multiErr, while the other checks join errors with errors.Join. A direct assignment silently discards anything already stored in multiErr. For consistency and to ensure all validation errors are collected and reported, use errors.Join for every aggregation.

Suggested change
if c.NodeID == "" {
multiErr = fmt.Errorf("node ID is required")
}
if c.NodeID == "" {
multiErr = errors.Join(multiErr, fmt.Errorf("node ID is required"))
}

