-
Notifications
You must be signed in to change notification settings - Fork 539
fix: Do not share state between different crawlers unless requested #1669
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Conversation
6ff0b3f to
31e16d2
Compare
Codecov Report✅ All modified and coverable lines are covered by tests. Additional details and impacted files@@ Coverage Diff @@
## master #1669 +/- ##
==========================================
- Coverage 92.41% 92.38% -0.03%
==========================================
Files 157 157
Lines 10478 10486 +8
==========================================
+ Hits 9683 9688 +5
- Misses 795 798 +3
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
|
Better discuss this. After implementing this draft, I am leaning towards alternative 1 (see description) @janbuchar , @barjin, @B4nan. Could you please share your point of view? You can see the usage in code in the updated and new test in this PR. |
The Maybe it's unclear that the "crawler state" is actually stored in KVS - this we should IMO communicate better in the docs. Having thought about this a bit more, I see the "state sharing" as a bug again :) Different crawler instances IMO shouldn't influence each other just because they are touching the same storage implementation (if this is intentional, it should be explicit). |
B4nan
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I feel like I am getting lost in this, I thought the id is rather internal thing to ensure two crawler instances created in one app context won't share the state. We expose the id so people can opt-in to sharing the state explicitly, but the important bit to me is that those IDs will be unique automatically. I can't think of a use case where one would want to create multiple crawlers and share their stats. Similarly, I don't think sharing the state object is something people would want to, at least not by default.
| status_message_logging_interval: timedelta = timedelta(seconds=10), | ||
| status_message_callback: Callable[[StatisticsState, StatisticsState | None, str], Awaitable[str | None]] | ||
| | None = None, | ||
| crawler_id: int | None = None, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the option should be called just id as in the JS version, right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe Pepa chose that name because id is a bultin function in Python. If that's the case, I think we can safely shadow it.
Description
crawler_idforBasicCrawler. This argument controls the shared state.Draft for discussion. The main drawback is that this is another unique way of sharing something between different crawlers. Similar, but different existing approaches:
Statisticscan be shared by explicitly passing an existing instance to the crawler.configurationandstorage_clientarguments to the crawlerRequestManagercan be reused by explicitly passing an existing instance to therequest_managerargumentWhat are the alternatives?
state_kvsinstead ofcrawler_id, otherwise autoincrement state counter - this is more aligned with the existing approach of howStatisticscan be re-useddefault_kvs_idConfiguration value from SDK level to Crawlee level. This would allow to share or not share state based on what would be in theConfiguration(default_kvs_id=...)(if using the same storage client...)Issues
Closes: #1627
Testing
Checklist