Skip to content

Conversation

@Pijukatel
Copy link
Collaborator

@Pijukatel Pijukatel commented Jan 12, 2026

Description

  • Introduces a new argument crawler_id for BasicCrawler. This argument controls the shared state.

Draft for discussion. The main drawback is that this is another unique way of sharing something between different crawlers. Similar, but different existing approaches:

  • Statistics can be shared by explicitly passing an existing instance to the crawler.
  • Storages in general can be shared by properly setting a combination of configuration and storage_client arguments to the crawler
  • Storages can be shared by relying on default values (will be reused by default)
  • RequestManager can be reused by explicitly passing an existing instance to the request_manager argument

What are the alternatives?

  1. Explicitly passing state_kvs instead of crawler_id, otherwise autoincrement state counter - this is more aligned with the existing approach of how Statistics can be re-used
  2. Bring default_kvs_id Configuration value from SDK level to Crawlee level. This would allow to share or not share state based on what would be in the Configuration(default_kvs_id=...) (if using the same storage client...)
  3. Ignore and just document current behavior
  4. ...

Issues

Closes: #1627

Testing

  • TODO

Checklist

  • CI passed

@github-actions github-actions bot added this to the 132nd sprint - Tooling team milestone Jan 12, 2026
@github-actions github-actions bot added t-tooling Issues with this label are in the ownership of the tooling team. tested Temporary label used only programatically for some analytics. labels Jan 12, 2026
@Pijukatel Pijukatel changed the title Add crawler fix: Do not share state between different crawlers unless requested Jan 12, 2026
@codecov
Copy link

codecov bot commented Jan 12, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 92.38%. Comparing base (0a0995c) to head (31e16d2).
⚠️ Report is 4 commits behind head on master.

Additional details and impacted files
@@            Coverage Diff             @@
##           master    #1669      +/-   ##
==========================================
- Coverage   92.41%   92.38%   -0.03%     
==========================================
  Files         157      157              
  Lines       10478    10486       +8     
==========================================
+ Hits         9683     9688       +5     
- Misses        795      798       +3     
Flag Coverage Δ
unit 92.38% <100.00%> (-0.03%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

@Pijukatel
Copy link
Collaborator Author

Better discuss this. After implementing this draft, I am leaning towards alternative 1 (see description)

@janbuchar , @barjin, @B4nan. Could you please share your point of view?

You can see the usage in code in the updated and new test in this PR.

@barjin
Copy link
Member

barjin commented Jan 12, 2026

Explicitly passing state_kvs instead of crawler_id

The state is just one key in the KVS though, it feels weird to me to make the state this prominent in our API. If it's about the entire KVS (so e.g. get_key_value_store will also return this KVS), then it makes a bit more sense to me.

Maybe it's unclear that the "crawler state" is actually stored in KVS - this we should IMO communicate better in the docs.

Having thought about this a bit more, I see the "state sharing" as a bug again :) Different crawler instances IMO shouldn't influence each other just because they are touching the same storage implementation (if this is intentional, it should be explicit).

Copy link
Member

@B4nan B4nan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I feel like I am getting lost in this, I thought the id is rather internal thing to ensure two crawler instances created in one app context won't share the state. We expose the id so people can opt-in to sharing the state explicitly, but the important bit to me is that those IDs will be unique automatically. I can't think of a use case where one would want to create multiple crawlers and share their stats. Similarly, I don't think sharing the state object is something people would want to, at least not by default.

status_message_logging_interval: timedelta = timedelta(seconds=10),
status_message_callback: Callable[[StatisticsState, StatisticsState | None, str], Awaitable[str | None]]
| None = None,
crawler_id: int | None = None,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the option should be called just id as in the JS version, right?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe Pepa chose that name because id is a bultin function in Python. If that's the case, I think we can safely shadow it.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

t-tooling Issues with this label are in the ownership of the tooling team. tested Temporary label used only programatically for some analytics.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Port fix of multiple crawler instances sharing state from JS version

5 participants