
The enqueue_links function misinterprets the limit keyword argument #1673

@honzajavorek

Description


Minimal reproducible code:

import asyncio

from crawlee.crawlers import BeautifulSoupCrawler, BeautifulSoupCrawlingContext


async def main() -> None:
    crawler = BeautifulSoupCrawler()

    @crawler.router.default_handler
    async def handle_listing(context: BeautifulSoupCrawlingContext) -> None:
        await context.enqueue_links(
            selector="h4 a", label="DETAIL", limit=1  # expect at most one link to be enqueued
        )

    @crawler.router.handler("DETAIL")
    async def handle_detail(context: BeautifulSoupCrawlingContext) -> None:
        print("Detail page URL:", context.request.url)

    await crawler.run(["https://honzajavorek.cz/blog/"])


if __name__ == "__main__":
    asyncio.run(main())

Run like this:

uv run --with crawlee[beautifulsoup] python crawlee_bug.py

Expected output:

Detail page URL: https://honzajavorek.cz/blog/tydenni-poznamky-vanoce-a-tak/

Actual output:

Detail page URL: https://honzajavorek.cz/blog/tydenni-poznamky-vanoce-a-tak/
Detail page URL: https://honzajavorek.cz/blog/tydenni-poznamky-vylepsovani-seznamu-kandidatu-nove-bydleni-a-odpocinek-v-mlze/

If I change the limit to 5, I get 6 links in the output, so limit=N consistently enqueues N + 1 requests.
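For clarity, what I expect limit=N to mean is "enqueue at most N of the matched links". A purely illustrative snippet of that semantics (not Crawlee's implementation; take_limited is just a made-up helper name):

from itertools import islice


def take_limited(links: list[str], limit: int | None) -> list[str]:
    # islice(iterable, None) yields everything, so limit=None means "no cap".
    return list(islice(links, limit))


assert take_limited(["a", "b", "c"], 1) == ["a"]  # limit=1 -> exactly one link
assert take_limited(["a", "b", "c"], None) == ["a", "b", "c"]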

I tried to nail down the problem, but I spent 20 minutes drowning in a sea of type definitions, kwargs passed down as-is, and nested _create_... functions. It was impossible for me to follow the flow of the keyword arguments and find the place where the limit actually gets applied.
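For now I work around it by applying the limit myself before enqueueing. This is only a sketch that replaces the handle_listing handler from the repro above; it assumes context.soup, context.add_requests and Request.from_url behave the way I read them in the docs:

from urllib.parse import urljoin

from crawlee import Request


@crawler.router.default_handler
async def handle_listing(context: BeautifulSoupCrawlingContext) -> None:
    # Enforce the limit explicitly instead of relying on enqueue_links(limit=...).
    anchors = context.soup.select("h4 a")[:1]
    requests = [
        Request.from_url(urljoin(context.request.url, a["href"]), label="DETAIL")
        for a in anchors
        if a.get("href")
    ]
    await context.add_requests(requests)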

Metadata

Labels: t-tooling (issues with this label are in the ownership of the tooling team)