
Conversation

@agrim-git-hub (Contributor) commented Dec 31, 2025

Pull Request

Description

This PR implements the requested --validate-checksum flag for the download command to ensure data integrity.

Key Changes:

CLI: Added the --validate-checksum boolean flag to the download command in cli.py.

Logic: Implemented SHA256 verification in _download_file. It uses the existing compute_sha256_and_length utility from webdav.py to avoid code duplication (see the sketch after this list).

Metadata Extraction: Updated _download_version and _download_artifact to extract expected checksums from the JSON-LD metadata and pass them down to the downloader.

Single File Fix: Added specific logic in download() to fetch the parent Version's JSON-LD when downloading a single file, ensuring checksum validation works for individual file downloads as well.
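
A minimal sketch of the validation step described above, assuming the downloaded path is held in `filename` and the metadata value in `expected_checksum`; the helper name _validate_download is hypothetical, only compute_sha256_and_length is named in this PR:

# Minimal sketch; compute_sha256_and_length is the reused utility named
# above, everything else here is illustrative.
from databusclient.extensions.webdav import compute_sha256_and_length

def _validate_download(filename: str, expected_checksum: str | None) -> None:
    if expected_checksum is None:
        # no checksum in the metadata: warn and skip, never fail
        print(f"WARNING: no expected checksum available for {filename}; skipping validation")
        return
    actual, _ = compute_sha256_and_length(filename)
    # hex digests compare case-insensitively
    if actual.lower() != expected_checksum.lower():
        raise IOError(
            f"Checksum mismatch for {filename}: expected {expected_checksum}, got {actual}"
        )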

Related Issues
Closes #28

Type of change

  • [ ] Bug fix (non-breaking change which fixes an issue)
  • [x] New feature (non-breaking change which adds functionality)
  • [ ] Breaking change (fix or feature that would cause existing functionality to change)
  • [ ] This change requires a documentation update
  • [ ] Housekeeping

Checklist:

  • [x] My code follows the ruff code style of this project.
  • [x] I have performed a self-review of my own code.
  • [x] I have commented my code, particularly in hard-to-understand areas.
  • [ ] I have made corresponding changes to the documentation (if applicable).
  • [ ] I have added tests that prove my fix is effective or that my feature works.
  • [x] New and existing unit tests pass locally with my changes:
    • [x] poetry run pytest: all tests passed
    • [x] poetry run ruff check: no linting errors

Summary by CodeRabbit

  • New Features

    • Optional per-file checksum validation for downloads; expected checksums are resolved from available metadata and applied across single-file, multi-file, artifact and collection flows.
    • CLI flag --validate-checksum to enable verification.
  • Bug Fixes

    • Non-fatal behavior when checksum retrieval fails (warns and skips validation).
    • Downloads now fail when checksum verification detects a mismatch.

@coderabbitai bot commented Dec 31, 2025

📝 Walkthrough

Adds optional SHA-256 checksum validation for downloads: extract 64-char hex checksums from JSON-LD metadata, build per-file expected checksums, propagate a validate flag through the download functions and the CLI, compute the SHA-256 after download, warn if metadata is missing, and fail and delete on mismatch.

Changes

  • databusclient/api/download.py (download API, checksum logic): New checksum extraction helpers (_extract_checksum_from_node, _extract_checksums_from_jsonld, _resolve_checksums_for_urls), per-file checksum resolution, and checksum-aware download functions (_download_file, _download_files, _download_collection, _download_version, _download_artifact, _download_group, download) with validate_checksum propagation, best-effort metadata fetching, warnings on missing checksums, and deletion on mismatch.
  • databusclient/api/utils.py (utilities, sha256): Added compute_sha256_and_length(filepath) (SHA-256 hex digest + byte length) and the hashlib import used by validation.
  • databusclient/cli.py (CLI flag propagation): Added the --validate-checksum Click option and updated the download(...) signature to accept and forward validate_checksum to the API.
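
A hypothetical sketch of that Click wiring, reduced to the new option; the argument name and the command's other parameters are assumptions, not taken from cli.py:

# Sketch of the new flag only; names besides validate_checksum are assumed.
import click

@click.command()
@click.argument("databusuris", nargs=-1, required=True)
@click.option(
    "--validate-checksum",
    is_flag=True,
    default=False,
    help="Verify each downloaded file against the SHA-256 checksum in its metadata.",
)
def download(databusuris, validate_checksum):
    # the real command forwards validate_checksum to the API download function
    ...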

Sequence Diagram

sequenceDiagram
    participant User
    participant CLI as CLI/Download
    participant API as Download API
    participant Meta as Databus Metadata
    participant Web as WebDAV/HTTP
    participant SHA as compute_sha256_and_length

    User->>CLI: run download --validate-checksum ...
    CLI->>API: download(validate_checksum=True, ...)
    alt single-file path
        API->>Meta: fetch parent version JSON-LD
        Meta-->>API: JSON-LD
        API->>API: extract expected checksum for file
    else query/multi-file path
        API->>Meta: fetch per-version JSON-LD for targets
        Meta-->>API: JSON-LD graphs
        API->>API: build map URL -> expected_checksum
    end
    API->>Web: fetch file URL(s)
    Web-->>API: file(s) saved to disk
    API->>SHA: compute_sha256_and_length(file)
    SHA-->>API: actual_checksum
    alt expected checksum present
        API->>API: compare actual vs expected
        alt match
            API-->>CLI: success
        else mismatch
            API-->>CLI: error (file deleted)
        end
    else expected checksum missing
        API-->>CLI: warn and continue
    end

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
  • Title check ✅ Passed: The title 'feature added: --validate checksum flag' accurately summarizes the primary change: adding a new CLI flag for checksum validation during downloads.
  • Description check ✅ Passed: The PR description follows the template structure, includes a clear description of changes, relates to issue #28, specifies the change type as 'New feature', and documents completed checklist items.
  • Linked Issues check ✅ Passed: The implementation addresses all coding requirements from issue #28: a CLI option added for validation, SHA256 verification implemented in the download flow, and graceful handling when no SHA is available in the metadata.
  • Out of Scope Changes check ✅ Passed: All changes are directly related to implementing SHA256 checksum validation: the CLI flag addition, checksum extraction utilities, verification logic, and metadata integration are all within scope.
  • Docstring Coverage ✅ Passed: Docstring coverage is 85.71%, which meets the required threshold of 80.00%.


@coderabbitai bot left a comment

Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
databusclient/api/download.py (1)

791-802: Remove the expected_checksum parameter on line 801.

Line 801 passes expected_checksum=expected, but:

  1. The variable expected is only defined in the elif file is not None: block (lines 764-790), not in this elif version is not None: block, causing a NameError.
  2. The _download_version function does not accept an expected_checksum parameter (see line 473), which would cause a TypeError.

Version downloads extract checksums internally from their JSON-LD metadata (lines 488-502), so there's no need to pass an external expected checksum.

🔎 Proposed fix
             elif version is not None:
                 print(f"Downloading version: {databusURI}")
                 _download_version(
                     databusURI,
                     localDir,
                     vault_token_file=token,
                     databus_key=databus_key,
                     auth_url=auth_url,
                     client_id=client_id,
                     validate_checksum=validate_checksum,
-                    expected_checksum=expected,
                 )
🧹 Nitpick comments (2)
databusclient/cli.py (1)

177-177: Remove extra blank line.

PEP 8 style guidelines recommend no blank line between the function signature and the docstring.

🔎 Proposed fix
 def download(
     databusuris: List[str],
     localdir,
     databus,
     vault_token,
     databus_key,
     all_versions,
     authurl,
     clientid,
     validate_checksum,
 ):
-    
     """
     Download datasets from databus, optionally using vault access if vault options are provided.
     """
databusclient/api/download.py (1)

235-254: Consider moving the import to module level.

The checksum validation logic is correct (skip if no expected checksum, warn on computation failure, raise on mismatch). However, importing compute_sha256_and_length inside the try block is unconventional. Unless there's a specific reason (e.g., optional dependency or avoiding circular imports), consider importing at the module level for clarity.

🔎 Proposed refactor to move import to module level

Add to the top of the file (after line 13):

 from databusclient.api.utils import (
     fetch_databus_jsonld,
     get_databus_id_parts_from_file_url,
 )
+from databusclient.extensions.webdav import compute_sha256_and_length

Then simplify the validation block:

     # --- 6. Optional checksum validation ---
     if validate_checksum:
-        # reuse compute_sha256_and_length from webdav extension
-        try:
-            from databusclient.extensions.webdav import compute_sha256_and_length
-
-            actual, _ = compute_sha256_and_length(filename)
-        except Exception:
-            actual = None
+        try:
+            actual, _ = compute_sha256_and_length(filename)
+        except Exception:
+            actual = None
 
         if expected_checksum is None:
             print(f"WARNING: no expected checksum available for {filename}; skipping validation")
         elif actual is None:
             print(f"WARNING: could not compute checksum for {filename}; skipping validation")
         else:
             if actual.lower() != expected_checksum.lower():
                 raise IOError(
                     f"Checksum mismatch for {filename}: expected {expected_checksum}, got {actual}"
                 )
📜 Review details

📥 Commits

Reviewing files that changed from the base of the PR and between b2c3f1c and 350617c.

📒 Files selected for processing (2)
  • databusclient/api/download.py
  • databusclient/cli.py
🧰 Additional context used
🧬 Code graph analysis (2)
databusclient/cli.py (1)
databusclient/api/download.py (1)
  • download (710-856)
databusclient/api/download.py (2)
databusclient/extensions/webdav.py (1)
  • compute_sha256_and_length (8-18)
databusclient/api/utils.py (1)
  • fetch_databus_jsonld (33-50)
🔇 Additional comments (11)
databusclient/cli.py (2)

161-175: LGTM: CLI option and parameter addition.

The --validate-checksum flag is correctly defined and threaded through to the API call. The parameter follows Click conventions and integrates cleanly with the download command.


191-192: LGTM: Parameter forwarding.

The validate_checksum parameter is correctly passed through to the API download function.

databusclient/api/download.py (9)

16-60: LGTM: Checksum extraction handles diverse JSON-LD structures.

The function correctly extracts SHA-256 checksums from various JSON-LD formats (plain strings, @value objects, nested structures). The 64-character hex validation is appropriate, and the fallback recursive search provides flexibility for unpredictable metadata structures.
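
For illustration, node shapes like these (with a fabricated 64-char hex value) are the kinds of inputs that comment describes; the key names follow the allowed-keys list mentioned later in the thread:

# Illustrative Part-node shapes; the digest value is made up.
node_plain = {"checksum": "ab" * 32}                          # plain string
node_value_obj = {"sha256sum": {"@value": "ab" * 32}}         # @value object
node_nested = {"databus:checksum": [{"@value": "ab" * 32}]}   # nested list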


263-290: LGTM: Per-URL checksum mapping.

The function correctly extracts expected checksums from the provided mapping and forwards them to _download_file for validation.


437-462: LGTM: Checksum flag propagated through collection downloads.

The validate_checksum flag is correctly threaded through. Note that collection downloads won't have per-file expected checksums available (since collections are query-based), so validation warnings will appear for each file. This is acceptable given the architecture.


473-513: LGTM: Version downloads extract per-file checksums from metadata.

The function correctly parses the version's JSON-LD to build a per-file checksum mapping. The fallback to an empty dict on parsing errors ensures downloads continue with warnings rather than failing completely.


524-571: LGTM: Artifact downloads extract checksums per version.

The function correctly iterates through versions and builds checksum mappings for each. The implementation is consistent with the version download path.


646-672: LGTM: Group downloads propagate checksum validation.

The flag is correctly passed through to artifact downloads, which handle checksum extraction.


719-760: LGTM: Checksum flag added to main download dispatcher.

The validate_checksum parameter is correctly added and propagated through collection downloads.


764-790: LGTM: Single-file downloads fetch checksums from parent version metadata.

The logic to construct the version URI and extract the expected checksum from the parent version's JSON-LD is sound. The fallback warning on errors ensures downloads continue gracefully.


803-856: LGTM: Checksum validation threaded through remaining download paths.

The validate_checksum flag is correctly propagated to artifact, group, and SPARQL query download paths.

@coderabbitai bot left a comment

Actionable comments posted: 1

♻️ Duplicate comments (1)
databusclient/api/download.py (1)

545-559: Same duplication as in _download_version.

This checksum extraction logic is identical to lines 487-501 in _download_version. Please apply the same refactoring suggested above to eliminate this duplication.

🧹 Nitpick comments (2)
databusclient/api/download.py (2)

83-84: Checksum validation logic is correct and user-friendly.

The implementation properly computes and compares checksums, provides clear warnings when validation cannot be performed, and raises appropriate errors on mismatches. The case-insensitive comparison at line 249 is correct for hex strings.

💡 Optional: Consider more specific exception handling

At line 241, the broad except Exception catches all errors during checksum computation. While this ensures robustness, consider catching specific exceptions for clarity:

         try:
             actual, _ = compute_sha256_and_length(filename)
-        except Exception:
+        except (OSError, IOError) as e:
+            print(f"WARNING: error computing checksum for {filename}: {e}")
             actual = None

This makes the error handling more explicit while still providing graceful degradation.

Also applies to: 236-252


487-501: Extract duplicated checksum-building logic into a helper function.

The code at lines 487-501 that builds the checksums dictionary from JSON-LD metadata is duplicated in _download_artifact (lines 545-559). This creates a maintenance burden.

🔎 Proposed refactor to eliminate duplication

Extract the common logic into a helper function:

def _extract_checksums_from_jsonld(json_str: str) -> dict:
    """
    Extract file URL -> checksum mapping from artifact version JSON-LD.
    
    Returns:
        Dictionary mapping file URIs to their expected SHA256 checksums.
    """
    checksums: dict = {}
    try:
        json_dict = json.loads(json_str)
        graph = json_dict.get("@graph", [])
        for node in graph:
            if node.get("@type") == "Part":
                file_uri = node.get("file")
                if not isinstance(file_uri, str):
                    continue
                expected = _extract_checksum_from_node(node)
                if expected:
                    checksums[file_uri] = expected
    except Exception:
        return {}
    return checksums

Then use it in both functions:

 def _download_version(...):
     json_str = fetch_databus_jsonld(uri, databus_key=databus_key)
     file_urls = _get_file_download_urls_from_artifact_jsonld(json_str)
-    # build url -> checksum mapping from JSON-LD when available
-    checksums: dict = {}
-    try:
-        json_dict = json.loads(json_str)
-        graph = json_dict.get("@graph", [])
-        for node in graph:
-            if node.get("@type") == "Part":
-                file_uri = node.get("file")
-                if not isinstance(file_uri, str):
-                    continue
-                expected = _extract_checksum_from_node(node)
-                if expected:
-                    checksums[file_uri] = expected
-    except Exception:
-        checksums = {}
+    checksums = _extract_checksums_from_jsonld(json_str)

Apply the same change in _download_artifact.

📜 Review details

📥 Commits

Reviewing files that changed from the base of the PR and between 350617c and 743c623.

📒 Files selected for processing (2)
  • databusclient/api/download.py
  • databusclient/cli.py
🧰 Additional context used
🧬 Code graph analysis (1)
databusclient/api/download.py (2)
databusclient/extensions/webdav.py (1)
  • compute_sha256_and_length (8-18)
databusclient/api/utils.py (1)
  • fetch_databus_jsonld (33-50)
🔇 Additional comments (8)
databusclient/api/download.py (6)

15-15: LGTM! Appropriate import for checksum functionality.

The import of compute_sha256_and_length is correctly used later for computing file checksums during validation.


17-60: Well-structured checksum extraction with comprehensive JSON-LD handling.

The function correctly handles multiple JSON-LD node shapes and uses appropriate validation for 64-character hex strings. The recursive approach with early returns is efficient.


262-263: LGTM! Clean per-file checksum resolution.

The logic correctly extracts expected checksums from the mapping and forwards them to _download_file. The defensive isinstance check at line 278 is good practice.

Also applies to: 277-289


645-645: LGTM! Correct propagation of checksum validation flag.

The parameter is properly threaded through to _download_artifact.

Also applies to: 671-671


763-789: Single-file checksum extraction is well-implemented.

The logic to fetch the parent version's metadata and extract the expected checksum is sound. The try/except block with a clear warning (line 777) provides good user experience when metadata cannot be retrieved.


718-718: LGTM! Checksum validation flag properly propagated throughout download paths.

The validate_checksum parameter is correctly threaded through all download scenarios (collections, versions, artifacts, groups, and custom queries).

Also applies to: 759-759, 799-799, 813-813, 827-827, 853-853

databusclient/cli.py (2)

161-165: LGTM! CLI option properly configured.

The --validate-checksum flag follows Click conventions and has clear help text. The parameter is correctly added to the function signature.

Also applies to: 175-175


190-191: LGTM! Checksum validation flag correctly forwarded to API.

The parameter is properly passed through to api_download.

@coderabbitai bot left a comment

Actionable comments posted: 1

🧹 Nitpick comments (2)
databusclient/api/download.py (2)

15-60: Well-implemented checksum extraction with broad compatibility.

The import and _extract_checksum_from_node function are well-structured. The function correctly validates SHA256 checksums (64 hex characters) and handles diverse JSON-LD node shapes by trying known keys first, then falling back to recursive search.

The recursive fallback (lines 55-59) searches all node values for any 64-char hex string. While this provides excellent compatibility with varied JSON-LD formats, it could theoretically match unrelated data. Consider limiting the fallback search to a specific depth or removing it entirely if all Databus metadata uses the known keys.

Optional: Restrict to known keys only
-    # fallback: search all values recursively for a 64-char hex string
-    for v in node.values():
-        res = find_in_value(v)
-        if res:
-            return res
     return None

476-489: Consider more specific exception handling for checksum extraction.

The broad except Exception clauses provide best-effort behavior for the optional checksum feature, which is appropriate. However, more specific exception handling would improve debuggability while maintaining robustness.

🔎 More specific exception types

Replace broad except Exception: with specific types:

-            except Exception:
+            except (requests.RequestException, json.JSONDecodeError, KeyError, ValueError, TypeError):
                 # Best-effort: if fetching a version fails, skip it
                 continue

Apply similar specificity to the other exception handlers at lines 538, 596, and 814. This catches the expected failure modes (network errors, JSON parsing, missing keys, type mismatches) while allowing truly unexpected exceptions to surface for investigation.

Also applies to: 538-539, 596-597, 814-815

📜 Review details

📥 Commits

Reviewing files that changed from the base of the PR and between 743c623 and e33ab8c.

📒 Files selected for processing (1)
  • databusclient/api/download.py
🧰 Additional context used
🧬 Code graph analysis (1)
databusclient/api/download.py (2)
databusclient/extensions/webdav.py (1)
  • compute_sha256_and_length (8-18)
databusclient/api/utils.py (2)
  • get_databus_id_parts_from_file_url (6-30)
  • fetch_databus_jsonld (33-50)
🔇 Additional comments (7)
databusclient/api/download.py (7)

236-253: Checksum validation logic is correct and well-structured.

The validation implementation correctly:

  • Computes the actual SHA256 checksum using the imported function
  • Handles missing expected checksums and computation errors with warnings (allowing the download to complete)
  • Performs case-insensitive comparison (appropriate for hex strings)
  • Raises a descriptive error on mismatch

This aligns perfectly with the PR objectives to skip validation when checksums are unavailable but fail on mismatches.


256-290: Correct propagation of checksums to individual file downloads.

The function properly extracts expected checksums from the provided mapping and forwards them along with the validation flag to _download_file.


525-550: Version-level checksum extraction is correctly implemented.

The function properly extracts checksums from the version's JSON-LD metadata by scanning Part nodes in the @graph. The best-effort exception handling ensures that failures in checksum extraction don't break the download.


583-608: Artifact-level checksum extraction correctly handles multiple versions.

The implementation properly extracts and propagates checksums for each version within an artifact, maintaining consistency with the _download_version logic.


683-710: Group-level function correctly propagates validation flag.

The function properly passes through the validate_checksum flag to child artifact downloads, where checksum extraction is performed.


801-827: Single-file checksum validation correctly fetches parent metadata.

For single-file downloads with validation enabled, the code appropriately fetches the parent version's JSON-LD metadata to extract the expected checksum, since file URLs don't contain checksum information directly. The matching logic (line 810) checks both file and @id fields to locate the correct Part node.
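
A sketch of that matching rule; the helper name _node_matches is assumed:

# A Part node is the target if either its "file" or "@id" field equals
# the requested file URL (sketch; helper name is illustrative).
def _node_matches(node: dict, file_url: str) -> bool:
    return node.get("file") == file_url or node.get("@id") == file_url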


756-892: Validation flag is consistently propagated through all download paths.

The validate_checksum parameter is correctly added to the main download function signature and properly forwarded to all download path functions (collection, version, artifact, group, and query-based downloads).

@coderabbitai bot left a comment

Actionable comments posted: 0

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
databusclient/api/download.py (1)

886-894: Checksum validation is ineffective for user-provided SPARQL queries.

When a user provides a raw SPARQL query (the else branch), validate_checksum is passed to _download_files but no checksums mapping is provided. This means every file will trigger the "no expected checksum available" warning at line 246, silently degrading to no validation.

Consider either:

  1. Applying the same version-grouping approach used in _download_collection to extract checksums, or
  2. Warning the user upfront that checksum validation is not supported for raw SPARQL queries.
🔎 Option 2: Warn user upfront
         else:
             print("QUERY {}", databusURI.replace("\n", " "))
+            if validate_checksum:
+                print("WARNING: Checksum validation is not supported for user-defined SPARQL queries")
             if uri_endpoint is None:  # endpoint is required for queries (--databus)
                 raise ValueError("No endpoint given for query")
🧹 Nitpick comments (4)
databusclient/api/download.py (4)

236-254: Consider removing the file on checksum mismatch.

When checksum validation fails, the corrupted/incomplete file remains on disk at filename. A user might inadvertently use this file later if they miss the error. Consider deleting the file before raising the IOError, or at least document this behavior.

🔎 Proposed fix to clean up on mismatch
         else:
             if actual.lower() != expected_checksum.lower():
+                try:
+                    os.remove(filename)
+                except OSError:
+                    pass  # best-effort cleanup
                 raise IOError(
                     f"Checksum mismatch for {filename}: expected {expected_checksum}, got {actual}"
                 )

585-600: Consider extracting checksum parsing into a helper function.

The checksum extraction logic (lines 585-600) is nearly identical to lines 527-542 in _download_version. Extracting this into a shared helper like _extract_checksums_from_jsonld(json_str) -> dict would reduce duplication and centralize the parsing logic.


803-818: Add defensive None-check before constructing version_uri.

If the URL parsing yields None for host, account, group, or artifact, line 807 will produce a malformed URI like "https://None/None/...". While the exception handler catches this, adding an explicit check makes the failure mode clearer.

🔎 Proposed defensive check
                 if validate_checksum:
+                    if not all([host, account, group, artifact, version]):
+                        print(f"WARNING: Cannot construct version URI for checksum lookup from {databusURI}")
+                    else:
                         try:
                             version_uri = f"https://{host}/{account}/{group}/{artifact}/{version}"
                             # ... rest of the block
                         except Exception as e:
                             print(f"WARNING: Could not fetch checksum for single file: {e}")

17-61: Fallback search may match non-checksum 64-char hex strings.

The fallback (lines 55-59) searches all node values for any 64-character hex string. While the primary key-based lookup targets specific checksum fields (checksum, sha256sum, sha256, databus:checksum), the fallback could match other hex identifiers unintentionally. If this is a concern, consider restricting the fallback to specific key patterns or adding logging to identify when the fallback is triggered during debugging.
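
For reference, the 64-char hex test itself is narrow; a pattern along these lines (the constant name is an assumption) is what candidate values are checked against:

# Sketch of the SHA-256 hex-digest check; the constant name is illustrative.
import re

SHA256_HEX_RE = re.compile(r"^[0-9a-fA-F]{64}$")

assert SHA256_HEX_RE.match("ab" * 32)       # 64 hex chars: accepted
assert not SHA256_HEX_RE.match("xyz" * 21)  # wrong alphabet/length: rejected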

📜 Review details

📥 Commits

Reviewing files that changed from the base of the PR and between e33ab8c and dc51aa9.

📒 Files selected for processing (1)
  • databusclient/api/download.py
🧰 Additional context used
🧬 Code graph analysis (1)
databusclient/api/download.py (2)
databusclient/extensions/webdav.py (1)
  • compute_sha256_and_length (8-18)
databusclient/api/utils.py (2)
  • get_databus_id_parts_from_file_url (6-30)
  • fetch_databus_jsonld (33-50)
🔇 Additional comments (5)
databusclient/api/download.py (5)

15-16: LGTM!

Import correctly reuses the existing compute_sha256_and_length utility from the webdav extension rather than duplicating the SHA256 computation logic.


277-290: LGTM!

The checksum propagation logic correctly resolves per-URL checksums from the mapping and passes them to _download_file. The defensive type check on line 279 handles edge cases safely.


455-502: LGTM - Previous review feedback addressed.

The defensive None-checks at lines 468-471 properly handle cases where URL parsing fails partially, addressing the previous review comment. The version-grouping approach efficiently minimizes HTTP requests for checksum metadata.


527-552: LGTM!

Efficiently reuses the already-fetched json_str to extract checksums, avoiding an extra HTTP request. The checksum mapping is correctly built and passed to _download_files.


685-712: LGTM!

The validate_checksum parameter is correctly propagated to _download_artifact.

@Integer-Ctrl (Contributor) left a comment

You added checksum logic using _extract_checksum_from_node in four different methods: _download_collection, _download_version, _download_artifact, and download. It might make more sense to centralize this logic, for example in the _download_file function.

If validate_checksum is true, _download_file could try to retrieve the checksum from the databus and, if successful, compare it against the downloaded file. This would reduce duplication and simplify the code.
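
A sketch of that centralized shape, reusing helper names already in this PR; the _fetch_to_disk step and this exact signature are assumptions:

# Sketch: _download_file resolves the checksum itself when asked to validate.
def _download_file(url, local_dir, validate_checksum=False, databus_key=None):
    filename = _fetch_to_disk(url, local_dir)  # hypothetical download step
    if validate_checksum:
        # resolve the expected checksum for just this URL from its parent Version
        expected = _resolve_checksums_for_urls([url], databus_key).get(url)
        if expected is None:
            print(f"WARNING: no expected checksum available for {filename}; skipping validation")
        else:
            actual, _ = compute_sha256_and_length(filename)
            if actual.lower() != expected.lower():
                raise IOError(f"Checksum mismatch for {filename}")
    return filename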

@agrim-git-hub (Contributor, Author) commented:

Hi, I have implemented the requested changes. Please let me know if any more are needed!

@coderabbitai bot left a comment

Actionable comments posted: 3

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
databusclient/api/download.py (1)

218-223: Do not re-enable compression on Vault retry; it can break Content-Length verification.
On the Bearer/Vault path you pop("Accept-Encoding") (Line 251), which can re-enable gzip and make the later size check (Line 293-295) fail.

Proposed fix
         vault_token = __get_vault_access__(url, vault_token_file, auth_url, client_id)
         headers["Authorization"] = f"Bearer {vault_token}"
-        headers.pop("Accept-Encoding", None)
+        headers["Accept-Encoding"] = "identity"

Also applies to: 249-255, 292-318

🤖 Fix all issues with AI agents
In @databusclient/api/download.py:
- Around line 103-135: In _resolve_checksums_for_urls, normalize URL schemes
before storing and looking up checksums so http/https mismatches don't skip
validation: when building versions_map and when matching url -> checksum against
extracted_checksums (from _extract_checksums_from_jsonld), canonicalize both the
input file_url and the keys in extracted_checksums to a consistent form (e.g.,
force https or strip scheme) and use those canonical URLs for dictionary keys
and lookups; ensure this normalization is applied for the initial grouping
(get_databus_id_parts_from_file_url flow) and the final mapping loop that
assigns checksums.
- Around line 75-101: The function _extract_checksums_from_jsonld
assumes json.loads(json_str) returns a dict and calls jd.get("@graph", []),
which fails when the top-level JSON-LD is a list; update
_extract_checksums_from_jsonld to check the type of jd after parsing: if jd is a
dict use jd.get("@graph", []), if jd is a list treat that list as the graph,
otherwise return {}. Keep using _extract_checksum_from_node for Part nodes and
ensure graph is iterated only when it's a list.
- Around line 815-841: The single-file download path has an over-indented try
block and a redundant checksum lookup; fix the indentation so the try/except
aligns properly under the if validate_checksum: block and remove the duplicate
fallback lookup that repeats checks.get(databusURI). Specifically, inside the
branch handling file is not None, adjust the indentation of the lines building
version_uri, calling fetch_databus_jsonld and _extract_checksums_from_jsonld,
and assigning expected = checks.get(databusURI) so they are within the try
block, eliminate the second checks.get(databusURI) fallback, keep the except
printing the warning, and ensure the resulting expected value is passed into
_download_file (function name: _download_file) as before.
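
A sketch of the dict-vs-list handling the second prompt above asks for, with jd as the parsed JSON-LD; the helper name _graph_from_jsonld is assumed:

# Sketch: tolerate both framed (dict) and flat (list) JSON-LD top levels.
import json

def _graph_from_jsonld(json_str: str) -> list:
    jd = json.loads(json_str)
    if isinstance(jd, dict):
        return jd.get("@graph", [])  # usual framed form
    if isinstance(jd, list):
        return jd  # a top-level list already is the graph
    return []
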
🧹 Nitpick comments (2)
databusclient/api/utils.py (1)

2-2: Deduplicate compute_sha256_and_length and add typing to avoid drift.
This function is duplicated (same logic) in databusclient/extensions/webdav.py, which risks divergence.

Proposed tweak (types + docstring + clearer signature)
-from typing import Optional, Tuple
+from __future__ import annotations
+
+from typing import Optional, Tuple
 import hashlib
 import requests
+from os import PathLike
@@
-def compute_sha256_and_length(filepath):
+def compute_sha256_and_length(filepath: str | PathLike[str]) -> tuple[str, int]:
+    """Return (sha256_hex_digest, total_length_bytes) for a local file."""
     sha256 = hashlib.sha256()
     total_length = 0
     with open(filepath, "rb") as f:
         while True:
             chunk = f.read(4096)
             if not chunk:
                 break
             sha256.update(chunk)
             total_length += len(chunk)
     return sha256.hexdigest(), total_length

Also applies to: 52-62

databusclient/api/download.py (1)

16-20: Checksum extraction helper looks solid; consider adding minimal unit tests for node-shape coverage.
_extract_checksum_from_node is intentionally conservative (allowed keys + 64-hex match), which is good.

Also applies to: 21-60

📜 Review details

📥 Commits

Reviewing files that changed from the base of the PR and between dc51aa9 and e253b81.

📒 Files selected for processing (3)
  • databusclient/api/download.py
  • databusclient/api/utils.py
  • file.txt
🧰 Additional context used
🧬 Code graph analysis (2)
databusclient/api/download.py (2)
databusclient/api/utils.py (3)
  • compute_sha256_and_length (52-62)
  • get_databus_id_parts_from_file_url (6-30)
  • fetch_databus_jsonld (33-50)
databusclient/extensions/webdav.py (1)
  • compute_sha256_and_length (8-18)
databusclient/api/utils.py (1)
databusclient/extensions/webdav.py (1)
  • compute_sha256_and_length (8-18)
🔇 Additional comments (1)
databusclient/api/download.py (1)

296-318: Overall checksum propagation is consistent and matches the intended “best-effort” behavior.
Warnings on missing checksums + hard failure on mismatch are aligned with the PR objectives.

Also applies to: 520-534, 559-575, 608-624, 699-726, 899-915

Comment on lines +103 to +135
def _resolve_checksums_for_urls(file_urls: List[str], databus_key: str | None) -> dict:
    """
    Group file URLs by their Version URI, fetch each Version JSON-LD once,
    and return a combined url->checksum mapping for the provided URLs.
    Best-effort: failures to fetch or parse individual versions are skipped.
    """
    versions_map: dict = {}
    for file_url in file_urls:
        try:
            host, accountId, groupId, artifactId, versionId, fileId = get_databus_id_parts_from_file_url(file_url)
        except Exception:
            continue
        if versionId is None:
            continue
        if host is None or accountId is None or groupId is None or artifactId is None:
            continue
        version_uri = f"https://{host}/{accountId}/{groupId}/{artifactId}/{versionId}"
        versions_map.setdefault(version_uri, []).append(file_url)

    checksums: dict = {}
    for version_uri, urls_in_version in versions_map.items():
        try:
            json_str = fetch_databus_jsonld(version_uri, databus_key=databus_key)
            extracted_checksums = _extract_checksums_from_jsonld(json_str)
            for url in urls_in_version:
                if url in extracted_checksums:
                    checksums[url] = extracted_checksums[url]
        except Exception:
            # Best-effort: skip versions we cannot fetch or parse
            continue
    return checksums

⚠️ Potential issue | 🟠 Major

Normalize URL schemes when mapping url -> checksum (http/https mismatch breaks validation).
Right now lookups require exact string match; if metadata uses https://... but the input URL is http://... (or vice-versa), validation will be skipped unexpectedly.

Proposed fix (canonicalize)
+def _canonicalize_url(u: str) -> str:
+    return "https://" + u.removeprefix("http://").removeprefix("https://")
@@
     for version_uri, urls_in_version in versions_map.items():
@@
             extracted_checksums = _extract_checksums_from_jsonld(json_str)
             for url in urls_in_version:
-                if url in extracted_checksums:
-                    checksums[url] = extracted_checksums[url]
+                canon = _canonicalize_url(url)
+                if url in extracted_checksums:
+                    checksums[url] = extracted_checksums[url]
+                elif canon in extracted_checksums:
+                    checksums[url] = extracted_checksums[canon]
@@
     for url in urls:
         expected = None
         if checksums and isinstance(checksums, dict):
-            expected = checksums.get(url)
+            expected = checksums.get(url) or checksums.get(_canonicalize_url(url))

Also applies to: 341-345


@coderabbitai bot left a comment

Actionable comments posted: 2

🤖 Fix all issues with AI agents
In @databusclient/api/download.py:
- Around line 87-92: The conditional logic incorrectly checks isinstance(jd,
list) twice and calls .get() on a list; update the checks so you first detect a
dict and extract the "@graph" key (e.g., if isinstance(jd, dict): graph =
jd.get("@graph", [])), then handle the list case (elif isinstance(jd, list):
graph = jd) and keep the else returning {} — ensure you reference the jd
variable and assign to graph accordingly so .get() is only called when jd is a
dict.
- Around line 831-833: The lookup for expected checksum misaligns indentation
and only tries an https fallback, missing the http->https or https->http
symmetry; fix the indentation around the expression that assigns expected
(variable expected) and replace the single fallback with a normalization helper
(e.g., normalize_databus_uri) that strips the scheme and then checks checks for
both "http://" + base and "https://" + base as well as the raw databusURI, or
alternatively always compare on the scheme-less base by trying
checks.get(databusURI) then checks.get("http://" + base) and
checks.get("https://" + base); update the code that references databusURI and
checks to call this helper so both http and https variants are supported and
indentation is correct.
🧹 Nitpick comments (2)
databusclient/api/download.py (2)

11-16: Consider consolidating imports from the same module.

The compute_sha256_and_length import on line 16 can be combined with the existing import block from lines 11-14.

Suggested consolidation
 from databusclient.api.utils import (
+    compute_sha256_and_length,
     fetch_databus_jsonld,
     get_databus_id_parts_from_file_url,
 )
-
-from databusclient.api.utils import compute_sha256_and_length

302-323: Checksum validation logic looks good; minor stale comment.

The validation logic correctly handles missing expected checksums, computation failures, and mismatches (with file cleanup). The case-insensitive hex comparison is appropriate.

Line 304's comment references "webdav extension" but compute_sha256_and_length is now imported from databusclient.api.utils.

Fix stale comment
     # --- 6. Optional checksum validation ---
     if validate_checksum:
-        # reuse compute_sha256_and_length from webdav extension
+        # compute SHA-256 checksum for downloaded file
         try:
             actual, _ = compute_sha256_and_length(filename)
📜 Review details

📥 Commits

Reviewing files that changed from the base of the PR and between e253b81 and 278ee5e.

📒 Files selected for processing (1)
  • databusclient/api/download.py
🧰 Additional context used
🧬 Code graph analysis (1)
databusclient/api/download.py (2)
databusclient/api/utils.py (3)
  • compute_sha256_and_length (52-62)
  • get_databus_id_parts_from_file_url (6-30)
  • fetch_databus_jsonld (33-50)
databusclient/extensions/webdav.py (1)
  • compute_sha256_and_length (8-18)
🔇 Additional comments (6)
databusclient/api/download.py (6)

21-59: LGTM!

The recursive checksum extraction logic handles common JSON-LD structures well (plain strings, @value objects, nested dicts, and lists) while restricting search to explicitly allowed keys to avoid false positives.


109-140: LGTM!

The batching strategy—grouping URLs by version URI, fetching each version's JSON-LD once, and extracting checksums—is efficient. The best-effort exception handling is appropriate for this use case.


347-360: LGTM!

The per-URL checksum lookup and propagation to _download_file is implemented correctly with appropriate null-safety checks.


565-580: LGTM!

The checksum extraction and propagation is correctly implemented. Minor observation: the checksums = {} reassignment in the except block (line 570) is redundant since it's already initialized to {} on line 566, but this doesn't affect correctness.


614-630: LGTM!

Per-version checksum extraction within the artifact download loop is correctly implemented.


904-921: LGTM!

The SPARQL query path correctly builds the checksum mapping and provides helpful user feedback when no checksums are found.

@coderabbitai bot left a comment

Actionable comments posted: 3

🤖 Fix all issues with AI agents
In @databusclient/api/download.py:
- Around line 75-107: In _extract_checksums_from_jsonld, there's a formatting
typo: change the malformed "return{}" to "return {}" so the function returns an
empty dict correctly when jd isn't a dict or list; locate this in the
_extract_checksums_from_jsonld function and fix that single return statement.
- Line 257: The duplicate headers["Accept-Encoding"] = "identity" assignment
should be removed from the download flow: drop the later redundant assignment
that re-sets Accept-Encoding on the headers dict (leave the original declaration
where Accept-Encoding is first set), ensuring only the initial headers setup in
the download logic is used and no other behavior relies on reassigning
headers["Accept-Encoding"].
- Around line 823-847: Add a short inline comment explaining why we normalize
databusURI to https when looking up checksums: in the block that sets expected
from checks.get(databusURI) or checks.get("https://" +
databusURI.removeprefix(...)), annotate that this is a defensive fallback to
handle callers passing http:// URIs while metadata entries may be stored with
https://, and reference databusURI, checks, expected and
_extract_checksums_from_jsonld to clarify the relationship.
🧹 Nitpick comments (3)
databusclient/api/download.py (3)

347-360: Consider simplifying the checksums check.

Line 349's isinstance(checksums, dict) check is defensive but redundant given the dict | None type hint. Since the function already handles None gracefully, you could simplify to if checksums:.

♻️ Optional simplification
     for url in urls:
         expected = None
-        if checksums and isinstance(checksums, dict):
+        if checksums:
             expected = checksums.get(url)

525-540: Consider simplifying the checksums parameter.

Line 539's conditional checksums if checksums else None is redundant since an empty dict is falsy. You can simply pass checksums=checksums or checksums=checksums or None for clarity.

♻️ Optional simplification
         auth_url=auth_url,
         client_id=client_id,
         validate_checksum=validate_checksum,
-        checksums=checksums if checksums else None,
+        checksums=checksums or None,
     )

905-920: Consider simplifying the checksums parameter (same as earlier).

Line 920 has the same redundant conditional as line 539. Consider simplifying to checksums=checksums or None.

♻️ Optional simplification
             auth_url=auth_url,
             client_id=client_id,
             validate_checksum=validate_checksum,
-            checksums=checksums if checksums else None,
+            checksums=checksums or None,
         )
📜 Review details

📥 Commits

Reviewing files that changed from the base of the PR and between 278ee5e and ab28258.

📒 Files selected for processing (1)
  • databusclient/api/download.py
🧰 Additional context used
🧬 Code graph analysis (1)
databusclient/api/download.py (2)
databusclient/api/utils.py (3)
  • compute_sha256_and_length (52-62)
  • get_databus_id_parts_from_file_url (6-30)
  • fetch_databus_jsonld (33-50)
databusclient/extensions/webdav.py (1)
  • compute_sha256_and_length (8-18)
🔇 Additional comments (8)
databusclient/api/download.py (8)

3-3: LGTM: Import and regex setup is correct.

The regex pattern correctly matches 64-character SHA-256 hex strings, and the import of compute_sha256_and_length aligns with the utility function available in databusclient.api.utils.

Also applies to: 16-19


21-60: LGTM: Robust checksum extraction logic.

The function appropriately handles various JSON-LD structures (plain strings, @value objects, nested values, lists) while limiting inspection to known checksum fields to avoid false positives.


109-141: LGTM: Efficient batch checksum resolution.

The function intelligently groups file URLs by version to minimize metadata fetches, and the best-effort exception handling aligns with the PR's design goal to skip validation when checksums are unavailable.


302-324: LGTM: Solid checksum validation implementation.

The validation logic correctly:

  • Reuses the existing compute_sha256_and_length utility
  • Provides non-fatal warnings when checksums are missing or computation fails
  • Performs case-insensitive comparison
  • Deletes corrupted files on mismatch with appropriate error handling

565-581: LGTM: Proper checksum extraction for version downloads.

The function correctly extracts checksums from the version's JSON-LD metadata and handles failures gracefully with a fallback to an empty dict, maintaining the best-effort validation approach.


614-630: LGTM: Consistent checksum handling for artifact downloads.

The implementation mirrors the pattern in _download_version, correctly extracting checksums per version and maintaining graceful fallback behavior.


705-705: LGTM: Proper flag propagation through group downloads.

The function correctly threads the validate_checksum parameter to downstream artifact downloads.

Also applies to: 731-731


778-778: LGTM: Complete integration of checksum validation into the public API.

The function properly:

  • Adds the validate_checksum parameter as a non-breaking change with a sensible default
  • Handles single-file downloads by fetching parent version metadata
  • Propagates the flag through all download paths (collections, versions, artifacts, groups, queries)
  • Maintains backward compatibility

This completes the checksum validation feature as specified in the PR objectives.

Also applies to: 819-886

Comment on lines +823 to 847
# Try to fetch expected checksum from the parent Version metadata
expected = None
if validate_checksum:
    try:
        if version is not None:
            version_uri = f"https://{host}/{account}/{group}/{artifact}/{version}"
            json_str = fetch_databus_jsonld(version_uri, databus_key=databus_key)
            checks = _extract_checksums_from_jsonld(json_str)
            expected = checks.get(databusURI) or checks.get(
                "https://" + databusURI.removeprefix("http://").removeprefix("https://")
            )
    except Exception as e:
        print(f"WARNING: Could not fetch checksum for single file: {e}")

# Call the worker to download the single file (passes expected checksum)
_download_file(
    databusURI,
    localDir,
    vault_token_file=token,
    databus_key=databus_key,
    auth_url=auth_url,
    client_id=client_id,
    validate_checksum=validate_checksum,
    expected_checksum=expected,
)

⚠️ Potential issue | 🟡 Minor

🧩 Analysis chain

Scripts executed against dbpedia/databus-python-client to verify the URI and checksum handling:

# First, let's explore the repository structure
find . -type f -name "*.py" -o -name "*.json" -o -name "*.jsonld" | head -20
# Look at the download.py file around the lines in question
sed -n '820,850p' databusclient/api/download.py
# Search for how URIs are used and formatted in the codebase
rg "databusURI|@id|file.*http" --type py -A 2 -B 2 | head -50
# Look for test fixtures with JSON-LD examples
find . -type f \( -name "*.json" -o -name "*.jsonld" -o -name "*fixture*" -o -name "*test*data*" \) | grep -E "\.(json|jsonld)$|fixture|test_data" | head -20
# Check for documentation about URI formatting
find . -type f \( -name "*.md" -o -name "*.rst" -o -name "*.txt" \) | xargs grep -l "URI\|http\|checksum" 2>/dev/null | head -10
# Look at _extract_checksums_from_jsonld function in detail
sed -n '780,810p' databusclient/api/download.py
# Check test files for actual JSON-LD examples with checksums
rg "checksum|sha256" tests/ --type py -A 3 -B 3 | head -60
# Search for any code that might pass http:// URIs vs https://
rg "http://" databusclient/ --type py -B 2 -A 2
# Look at the entire Part node extraction logic to understand URI handling
sed -n '760,820p' databusclient/api/download.py
# Check if there are test cases that would catch protocol mismatches
rg "_extract_checksums" tests/ -A 10 -B 5
# Look at the complete _extract_checksums_from_jsonld function
sed -n '750,820p' databusclient/api/download.py
# Check if there are any tests that test URI protocol mismatches
rg "http://" tests/ --type py -B 5 -A 5
# Look for test cases that specifically test checksum validation
rg "validate_checksum|expected_checksum" tests/ --type py -B 5 -A 5
# Check what the actual URIs passed to download look like in practice/tests
rg "download\(" tests/ --type py -B 2 -A 2 | head -40
# Look at line 539 mentioned in the original scratchpad to see if there's a pattern
sed -n '535,545p' databusclient/api/download.py
# Look at the complete _extract_checksums_from_jsonld function definition
sed -n '680,750p' databusclient/api/download.py
# Check if there are any comments explaining the protocol handling
grep -n "http\|protocol" databusclient/api/download.py | head -20
# Look for version of the code to understand context better
sed -n '1,50p' databusclient/api/download.py
# Find the _extract_checksums_from_jsonld function
sed -n '600,680p' databusclient/api/download.py
# Check if there's any documentation about URI normalization needs
grep -rn "removeprefix\|protocol" databusclient/ --type py
# Look at what URIs are passed into the checksum extraction
sed -n '690,750p' databusclient/api/download.py
# Find the complete _extract_checksums_from_jsonld function
rg "_extract_checksums_from_jsonld" databusclient/api/download.py -A 30
# Check utils.py to understand how URIs are handled
cat databusclient/api/utils.py


Add inline documentation explaining the protocol normalization workaround.

Lines 831-833 normalize databusURI by removing protocol prefixes and re-adding https://. This is defensive programming for cases where a caller might pass an http:// URI that matches a metadata entry only when normalized to https://. Since _extract_checksums_from_jsonld already maps both file and @id URIs from Part nodes, the protocol normalization is a separate (reasonable) safeguard. However, adding a brief comment explaining the reason would clarify intent:

expected = checks.get(databusURI) or checks.get(
    # Fallback: normalize to https in case metadata uses different protocol
    "https://" + databusURI.removeprefix("http://").removeprefix("https://")
)

@Integer-Ctrl mentioned this pull request Jan 13, 2026
@Integer-Ctrl (Contributor) left a comment

Looks fine. The only thing you forgot is to update or add information to the README.md file. However, this is OK since I'll tackle it later in a clean-up task (#49).

Thanks!

@Integer-Ctrl merged commit 54c7ff1 into dbpedia:main Jan 13, 2026
2 checks passed