Description
Minimal reproducible code:

```python
import asyncio

from crawlee.crawlers import BeautifulSoupCrawler, BeautifulSoupCrawlingContext


async def main() -> None:
    crawler = BeautifulSoupCrawler()

    @crawler.router.default_handler
    async def handle_listing(context: BeautifulSoupCrawlingContext) -> None:
        await context.enqueue_links(selector="h4 a", label="DETAIL", limit=1)

    @crawler.router.handler("DETAIL")
    async def handle_detail(context: BeautifulSoupCrawlingContext) -> None:
        print("Detail page URL:", context.request.url)

    await crawler.run(["https://honzajavorek.cz/blog/"])


if __name__ == "__main__":
    asyncio.run(main())
```

Run like this:

```shell
uv run --with crawlee[beautifulsoup] python crawlee_bug.py
```
Expected output:

```
Detail page URL: https://honzajavorek.cz/blog/tydenni-poznamky-vanoce-a-tak/
```

Actual output:

```
Detail page URL: https://honzajavorek.cz/blog/tydenni-poznamky-vanoce-a-tak/
Detail page URL: https://honzajavorek.cz/blog/tydenni-poznamky-vylepsovani-seznamu-kandidatu-nove-bydleni-a-odpocinek-v-mlze/
```

If I change the limit to 5, I get 6 links in the output.
I tried to nail down the problem, but I spent 20 minutes drowning in a sea of type definitions, kwargs passed down as-is, and nested `_create_...` functions. It was impossible for me to follow the flow of the keyword arguments and find the place where the limit is actually applied.
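For what it's worth, the symptom (always exactly one link more than `limit`) looks like a classic off-by-one in how the limit is enforced. This is a hypothetical sketch, not the actual crawlee code; `enqueue_links_buggy` and `enqueue_links_fixed` are made-up names to illustrate the pattern:

```python
from itertools import islice


def enqueue_links_buggy(links: list[str], limit: int) -> list[str]:
    """Hypothetical off-by-one: the check uses `>` and runs before appending,
    so the link that would exceed the limit still gets added."""
    enqueued: list[str] = []
    for link in links:
        if len(enqueued) > limit:  # bug: allows limit + 1 items through
            break
        enqueued.append(link)
    return enqueued


def enqueue_links_fixed(links: list[str], limit: int) -> list[str]:
    """Correct behavior: enqueue at most `limit` links."""
    return list(islice(links, limit))


links = [f"https://example.com/post-{i}" for i in range(10)]
print(len(enqueue_links_buggy(links, 1)))  # 2 -- one more than requested
print(len(enqueue_links_fixed(links, 1)))  # 1
```

With `limit=5` the buggy variant returns 6 links, matching the behavior I see.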