feature added: --validate checksum flag #44
Conversation
📝 Walkthrough

Adds optional SHA-256 checksum validation for downloads: extract 64-char hex checksums from JSON-LD metadata, build per-file expected checksums, propagate a validate flag through download functions and the CLI, compute SHA-256 after download, warn if metadata is missing, and fail and delete on mismatches.
Sequence Diagram

```mermaid
sequenceDiagram
    participant User
    participant CLI as CLI/Download
    participant API as Download API
    participant Meta as Databus Metadata
    participant Web as WebDAV/HTTP
    participant SHA as compute_sha256_and_length
    User->>CLI: run download --validate-checksum ...
    CLI->>API: download(validate_checksum=True, ...)
    alt single-file path
        API->>Meta: fetch parent version JSON-LD
        Meta-->>API: JSON-LD
        API->>API: extract expected checksum for file
    else query/multi-file path
        API->>Meta: fetch per-version JSON-LD for targets
        Meta-->>API: JSON-LD graphs
        API->>API: build map URL -> expected_checksum
    end
    API->>Web: fetch file URL(s)
    Web-->>API: file(s) saved to disk
    API->>SHA: compute_sha256_and_length(file)
    SHA-->>API: actual_checksum
    alt expected checksum present
        API->>API: compare actual vs expected
        alt match
            API-->>CLI: success
        else mismatch
            API-->>CLI: error (file deleted)
        end
    else expected checksum missing
        API-->>CLI: warn and continue
    end
```
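For orientation, here is how the flag is meant to be used. The CLI exposes it as `--validate-checksum` on the download command; the API call below is a hypothetical sketch based on identifiers visible in this PR's diff, not the exact signature:

```python
# Hypothetical usage sketch (not verbatim from the PR): argument names follow
# the identifiers visible in the diff (databusURI, localDir, validate_checksum);
# the real signature of download() in databusclient/api/download.py may differ.
from databusclient.api.download import download

download(
    "https://databus.dbpedia.org/<account>/<group>/<artifact>/<version>/<file>",
    "./downloads",
    validate_checksum=True,  # compute SHA-256 after download and compare to metadata
)
```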
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~25 minutes. Pre-merge checks: ✅ 5 passed.
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
databusclient/api/download.py (1)
791-802: Remove the `expected_checksum` parameter on line 801.

Line 801 passes `expected_checksum=expected`, but:
- The variable `expected` is only defined in the `elif file is not None:` block (lines 764-790), not in this `elif version is not None:` block, causing a `NameError`.
- The `_download_version` function does not accept an `expected_checksum` parameter (see line 473), which would cause a `TypeError`.

Version downloads extract checksums internally from their JSON-LD metadata (lines 488-502), so there's no need to pass an external expected checksum.
🔎 Proposed fix

```diff
     elif version is not None:
         print(f"Downloading version: {databusURI}")
         _download_version(
             databusURI,
             localDir,
             vault_token_file=token,
             databus_key=databus_key,
             auth_url=auth_url,
             client_id=client_id,
             validate_checksum=validate_checksum,
-            expected_checksum=expected,
         )
```
🧹 Nitpick comments (2)
databusclient/cli.py (1)
177-177: Remove extra blank line.

PEP 8 style guidelines recommend no blank line between the function signature and the docstring.

🔎 Proposed fix

```diff
 def download(
     databusuris: List[str],
     localdir,
     databus,
     vault_token,
     databus_key,
     all_versions,
     authurl,
     clientid,
     validate_checksum,
 ):
-
     """
     Download datasets from databus, optionally using vault access if vault options are provided.
     """
```

databusclient/api/download.py (1)
235-254: Consider moving the import to module level.

The checksum validation logic is correct (skip if no expected checksum, warn on computation failure, raise on mismatch). However, importing `compute_sha256_and_length` inside the try block is unconventional. Unless there's a specific reason (e.g., optional dependency or avoiding circular imports), consider importing at the module level for clarity.

🔎 Proposed refactor to move import to module level

Add to the top of the file (after line 13):

```diff
 from databusclient.api.utils import (
     fetch_databus_jsonld,
     get_databus_id_parts_from_file_url,
 )
+from databusclient.extensions.webdav import compute_sha256_and_length
```

Then simplify the validation block:

```diff
     # --- 6. Optional checksum validation ---
     if validate_checksum:
-        # reuse compute_sha256_and_length from webdav extension
-        try:
-            from databusclient.extensions.webdav import compute_sha256_and_length
-
-            actual, _ = compute_sha256_and_length(filename)
-        except Exception:
-            actual = None
+        try:
+            actual, _ = compute_sha256_and_length(filename)
+        except Exception:
+            actual = None
         if expected_checksum is None:
             print(f"WARNING: no expected checksum available for {filename}; skipping validation")
         elif actual is None:
             print(f"WARNING: could not compute checksum for {filename}; skipping validation")
         else:
             if actual.lower() != expected_checksum.lower():
                 raise IOError(
                     f"Checksum mismatch for {filename}: expected {expected_checksum}, got {actual}"
                 )
```
📜 Review details
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
- databusclient/api/download.py
- databusclient/cli.py
🧰 Additional context used
🧬 Code graph analysis (2)
databusclient/cli.py (1)
databusclient/api/download.py (1)
- download (710-856)
databusclient/api/download.py (2)
databusclient/extensions/webdav.py (1)
- compute_sha256_and_length (8-18)

databusclient/api/utils.py (1)
- fetch_databus_jsonld (33-50)
🔇 Additional comments (11)
databusclient/cli.py (2)
161-175: LGTM: CLI option and parameter addition.

The `--validate-checksum` flag is correctly defined and threaded through to the API call. The parameter follows Click conventions and integrates cleanly with the download command.

191-192: LGTM: Parameter forwarding.

The `validate_checksum` parameter is correctly passed through to the API download function.

databusclient/api/download.py (9)

16-60: LGTM: Checksum extraction handles diverse JSON-LD structures.

The function correctly extracts SHA-256 checksums from various JSON-LD formats (plain strings, `@value` objects, nested structures). The 64-character hex validation is appropriate, and the fallback recursive search provides flexibility for unpredictable metadata structures.
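To make those node shapes concrete, here is a simplified sketch of the kind of extraction logic being described; this is illustrative only, not the actual `_extract_checksum_from_node` implementation, with the key names taken from the review comments below:

```python
import re

SHA256_HEX = re.compile(r"^[0-9a-fA-F]{64}$")
# Keys mentioned in the review: checksum, sha256sum, sha256, databus:checksum
CHECKSUM_KEYS = ("checksum", "sha256sum", "sha256", "databus:checksum")

def extract_checksum_sketch(node: dict):
    """Best-effort SHA-256 extraction from a JSON-LD Part node (sketch)."""
    def find(value):
        # Plain string literal
        if isinstance(value, str) and SHA256_HEX.match(value):
            return value
        # JSON-LD literal objects like {"@value": "..."} and other nested dicts
        if isinstance(value, dict):
            for v in value.values():
                res = find(v)
                if res:
                    return res
        # Lists of literals or objects
        if isinstance(value, list):
            for item in value:
                res = find(item)
                if res:
                    return res
        return None

    for key in CHECKSUM_KEYS:
        if key in node:
            res = find(node[key])
            if res:
                return res
    return None

# Example node shapes the real function is said to handle:
assert extract_checksum_sketch({"checksum": "a" * 64}) == "a" * 64
assert extract_checksum_sketch({"sha256sum": {"@value": "b" * 64}}) == "b" * 64
```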
263-290: LGTM: Per-URL checksum mapping.

The function correctly extracts expected checksums from the provided mapping and forwards them to `_download_file` for validation.

437-462: LGTM: Checksum flag propagated through collection downloads.

The `validate_checksum` flag is correctly threaded through. Note that collection downloads won't have per-file expected checksums available (since collections are query-based), so validation warnings will appear for each file. This is acceptable given the architecture.

473-513: LGTM: Version downloads extract per-file checksums from metadata.

The function correctly parses the version's JSON-LD to build a per-file checksum mapping. The fallback to an empty dict on parsing errors ensures downloads continue with warnings rather than failing completely.

524-571: LGTM: Artifact downloads extract checksums per version.

The function correctly iterates through versions and builds checksum mappings for each. The implementation is consistent with the version download path.

646-672: LGTM: Group downloads propagate checksum validation.

The flag is correctly passed through to artifact downloads, which handle checksum extraction.

719-760: LGTM: Checksum flag added to main download dispatcher.

The `validate_checksum` parameter is correctly added and propagated through collection downloads.

764-790: LGTM: Single-file downloads fetch checksums from parent version metadata.

The logic to construct the version URI and extract the expected checksum from the parent version's JSON-LD is sound. The fallback warning on errors ensures downloads continue gracefully.

803-856: LGTM: Checksum validation threaded through remaining download paths.

The `validate_checksum` flag is correctly propagated to artifact, group, and SPARQL query download paths.
Actionable comments posted: 1
♻️ Duplicate comments (1)
databusclient/api/download.py (1)
545-559: Same duplication as in `_download_version`.

This checksum extraction logic is identical to lines 487-501 in `_download_version`. Please apply the same refactoring suggested above to eliminate this duplication.
🧹 Nitpick comments (2)
databusclient/api/download.py (2)
83-84: Checksum validation logic is correct and user-friendly.

The implementation properly computes and compares checksums, provides clear warnings when validation cannot be performed, and raises appropriate errors on mismatches. The case-insensitive comparison at line 249 is correct for hex strings.

💡 Optional: Consider more specific exception handling

At line 241, the broad `except Exception` catches all errors during checksum computation. While this ensures robustness, consider catching specific exceptions for clarity:

```diff
     try:
         actual, _ = compute_sha256_and_length(filename)
-    except Exception:
+    except (OSError, IOError) as e:
+        print(f"WARNING: error computing checksum for {filename}: {e}")
         actual = None
```

This makes the error handling more explicit while still providing graceful degradation.
Also applies to: 236-252
487-501: Extract duplicated checksum-building logic into a helper function.

The code at lines 487-501 that builds the checksums dictionary from JSON-LD metadata is duplicated in `_download_artifact` (lines 545-559). This creates a maintenance burden.

🔎 Proposed refactor to eliminate duplication

Extract the common logic into a helper function:

```python
def _extract_checksums_from_jsonld(json_str: str) -> dict:
    """
    Extract file URL -> checksum mapping from artifact version JSON-LD.

    Returns:
        Dictionary mapping file URIs to their expected SHA256 checksums.
    """
    checksums: dict = {}
    try:
        json_dict = json.loads(json_str)
        graph = json_dict.get("@graph", [])
        for node in graph:
            if node.get("@type") == "Part":
                file_uri = node.get("file")
                if not isinstance(file_uri, str):
                    continue
                expected = _extract_checksum_from_node(node)
                if expected:
                    checksums[file_uri] = expected
    except Exception:
        return {}
    return checksums
```

Then use it in both functions:

```diff
 def _download_version(...):
     json_str = fetch_databus_jsonld(uri, databus_key=databus_key)
     file_urls = _get_file_download_urls_from_artifact_jsonld(json_str)
-    # build url -> checksum mapping from JSON-LD when available
-    checksums: dict = {}
-    try:
-        json_dict = json.loads(json_str)
-        graph = json_dict.get("@graph", [])
-        for node in graph:
-            if node.get("@type") == "Part":
-                file_uri = node.get("file")
-                if not isinstance(file_uri, str):
-                    continue
-                expected = _extract_checksum_from_node(node)
-                if expected:
-                    checksums[file_uri] = expected
-    except Exception:
-        checksums = {}
+    checksums = _extract_checksums_from_jsonld(json_str)
```

Apply the same change in `_download_artifact`.
📜 Review details
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
- databusclient/api/download.py
- databusclient/cli.py
🧰 Additional context used
🧬 Code graph analysis (1)
databusclient/api/download.py (2)
databusclient/extensions/webdav.py (1)
- compute_sha256_and_length (8-18)

databusclient/api/utils.py (1)
- fetch_databus_jsonld (33-50)
🔇 Additional comments (8)
databusclient/api/download.py (6)
15-15: LGTM! Appropriate import for checksum functionality.

The import of `compute_sha256_and_length` is correctly used later for computing file checksums during validation.

17-60: Well-structured checksum extraction with comprehensive JSON-LD handling.

The function correctly handles multiple JSON-LD node shapes and uses appropriate validation for 64-character hex strings. The recursive approach with early returns is efficient.

262-263: LGTM! Clean per-URL checksum resolution.

The logic correctly extracts expected checksums from the mapping and forwards them to `_download_file`. The defensive `isinstance` check at line 278 is good practice.

Also applies to: 277-289

645-645: LGTM! Correct propagation of checksum validation flag.

The parameter is properly threaded through to `_download_artifact`.

Also applies to: 671-671

763-789: Single-file checksum extraction is well-implemented.

The logic to fetch the parent version's metadata and extract the expected checksum is sound. The try/except block with a clear warning (line 777) provides good user experience when metadata cannot be retrieved.

718-718: LGTM! Checksum validation flag properly propagated throughout download paths.

The `validate_checksum` parameter is correctly threaded through all download scenarios (collections, versions, artifacts, groups, and custom queries).

Also applies to: 759-759, 799-799, 813-813, 827-827, 853-853
databusclient/cli.py (2)
161-165: LGTM! CLI option properly configured.

The `--validate-checksum` flag follows Click conventions and has clear help text. The parameter is correctly added to the function signature.

Also applies to: 175-175
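As a sketch of what such a Click declaration typically looks like (the actual decorator in cli.py may differ in help text and option ordering):

```python
import click

@click.command()
@click.option(
    "--validate-checksum",
    is_flag=True,
    default=False,
    help="Verify downloaded files against SHA-256 checksums from Databus metadata.",
)
def download(validate_checksum: bool):
    ...  # forwards validate_checksum to the API download function
```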
190-191: LGTM! Checksum validation flag correctly forwarded to API.

The parameter is properly passed through to `api_download`.
Actionable comments posted: 1
🧹 Nitpick comments (2)
databusclient/api/download.py (2)
15-60: Well-implemented checksum extraction with broad compatibility.

The import and `_extract_checksum_from_node` function are well-structured. The function correctly validates SHA256 checksums (64 hex characters) and handles diverse JSON-LD node shapes by trying known keys first, then falling back to recursive search.

The recursive fallback (lines 55-59) searches all node values for any 64-char hex string. While this provides excellent compatibility with varied JSON-LD formats, it could theoretically match unrelated data. Consider limiting the fallback search to a specific depth or removing it entirely if all Databus metadata uses the known keys.

Optional: Restrict to known keys only

```diff
-    # fallback: search all values recursively for a 64-char hex string
-    for v in node.values():
-        res = find_in_value(v)
-        if res:
-            return res
     return None
```
476-489: Consider more specific exception handling for checksum extraction.

The broad `except Exception` clauses provide best-effort behavior for the optional checksum feature, which is appropriate. However, more specific exception handling would improve debuggability while maintaining robustness.

🔎 More specific exception types

Replace the broad `except Exception:` with specific types:

```diff
-        except Exception:
+        except (requests.RequestException, json.JSONDecodeError, KeyError, ValueError, TypeError):
             # Best-effort: if fetching a version fails, skip it
             continue
```

Apply similar specificity to the other exception handlers at lines 538, 596, and 814. This catches the expected failure modes (network errors, JSON parsing, missing keys, type mismatches) while allowing truly unexpected exceptions to surface for investigation.
Also applies to: 538-539, 596-597, 814-815
📜 Review details
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
databusclient/api/download.py
🧰 Additional context used
🧬 Code graph analysis (1)
databusclient/api/download.py (2)
databusclient/extensions/webdav.py (1)
- compute_sha256_and_length (8-18)

databusclient/api/utils.py (2)
- get_databus_id_parts_from_file_url (6-30)
- fetch_databus_jsonld (33-50)
🔇 Additional comments (7)
databusclient/api/download.py (7)
236-253: Checksum validation logic is correct and well-structured.

The validation implementation correctly:
- Computes the actual SHA256 checksum using the imported function
- Handles missing expected checksums and computation errors with warnings (allowing the download to complete)
- Performs case-insensitive comparison (appropriate for hex strings)
- Raises a descriptive error on mismatch
This aligns perfectly with the PR objectives to skip validation when checksums are unavailable but fail on mismatches.
256-290: Correct propagation of checksums to individual file downloads.

The function properly extracts expected checksums from the provided mapping and forwards them along with the validation flag to `_download_file`.
525-550: Version-level checksum extraction is correctly implemented.

The function properly extracts checksums from the version's JSON-LD metadata by scanning Part nodes in the @graph. The best-effort exception handling ensures that failures in checksum extraction don't break the download.
583-608: Artifact-level checksum extraction correctly handles multiple versions.

The implementation properly extracts and propagates checksums for each version within an artifact, maintaining consistency with the `_download_version` logic.
683-710: Group-level function correctly propagates validation flag.

The function properly passes through the `validate_checksum` flag to child artifact downloads, where checksum extraction is performed.
801-827: Single-file checksum validation correctly fetches parent metadata.

For single-file downloads with validation enabled, the code appropriately fetches the parent version's JSON-LD metadata to extract the expected checksum, since file URLs don't contain checksum information directly. The matching logic (line 810) checks both `file` and `@id` fields to locate the correct Part node.
756-892: Validation flag is consistently propagated through all download paths.

The `validate_checksum` parameter is correctly added to the main `download` function signature and properly forwarded to all download path functions (collection, version, artifact, group, and query-based downloads).
Actionable comments posted: 0
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
databusclient/api/download.py (1)
886-894: Checksum validation is ineffective for user-provided SPARQL queries.

When a user provides a raw SPARQL query (the `else` branch), `validate_checksum` is passed to `_download_files` but no `checksums` mapping is provided. This means every file will trigger the "no expected checksum available" warning at line 246, silently degrading to no validation.

Consider either:
- Applying the same version-grouping approach used in `_download_collection` to extract checksums, or
- Warning the user upfront that checksum validation is not supported for raw SPARQL queries.
🔎 Option 2: Warn user upfront

```diff
     else:
         print("QUERY {}", databusURI.replace("\n", " "))
+        if validate_checksum:
+            print("WARNING: Checksum validation is not supported for user-defined SPARQL queries")
         if uri_endpoint is None:
             # endpoint is required for queries (--databus)
             raise ValueError("No endpoint given for query")
```
🧹 Nitpick comments (4)
databusclient/api/download.py (4)
236-254: Consider removing the file on checksum mismatch.

When checksum validation fails, the corrupted/incomplete file remains on disk at `filename`. A user might inadvertently use this file later if they miss the error. Consider deleting the file before raising the `IOError`, or at least document this behavior.

🔎 Proposed fix to clean up on mismatch

```diff
         else:
             if actual.lower() != expected_checksum.lower():
+                try:
+                    os.remove(filename)
+                except OSError:
+                    pass  # best-effort cleanup
                 raise IOError(
                     f"Checksum mismatch for {filename}: expected {expected_checksum}, got {actual}"
                 )
```
585-600: Consider extracting checksum parsing into a helper function.

The checksum extraction logic (lines 585-600) is nearly identical to lines 527-542 in `_download_version`. Extracting this into a shared helper like `_extract_checksums_from_jsonld(json_str) -> dict` would reduce duplication and centralize the parsing logic.
803-818: Add defensive None-check before constructing version_uri.

If the URL parsing yields `None` for `host`, `account`, `group`, or `artifact`, line 807 will produce a malformed URI like `"https://None/None/..."`. While the exception handler catches this, adding an explicit check makes the failure mode clearer.

🔎 Proposed defensive check

```diff
     if validate_checksum:
+        if not all([host, account, group, artifact, version]):
+            print(f"WARNING: Cannot construct version URI for checksum lookup from {databusURI}")
+        else:
             try:
                 version_uri = f"https://{host}/{account}/{group}/{artifact}/{version}"
                 # ... rest of the block
             except Exception as e:
                 print(f"WARNING: Could not fetch checksum for single file: {e}")
```
17-61: Fallback search may match non-checksum 64-char hex strings.

The fallback (lines 55-59) searches all node values for any 64-character hex string. While the primary key-based lookup targets specific checksum fields (`checksum`, `sha256sum`, `sha256`, `databus:checksum`), the fallback could match other hex identifiers unintentionally. If this is a concern, consider restricting the fallback to specific key patterns or adding logging to identify when the fallback is triggered during debugging.
📜 Review details
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
databusclient/api/download.py
🧰 Additional context used
🧬 Code graph analysis (1)
databusclient/api/download.py (2)
databusclient/extensions/webdav.py (1)
- compute_sha256_and_length (8-18)

databusclient/api/utils.py (2)
- get_databus_id_parts_from_file_url (6-30)
- fetch_databus_jsonld (33-50)
🔇 Additional comments (5)
databusclient/api/download.py (5)
15-16: LGTM!

Import correctly reuses the existing `compute_sha256_and_length` utility from the webdav extension rather than duplicating the SHA256 computation logic.
277-290: LGTM!

The checksum propagation logic correctly resolves per-URL checksums from the mapping and passes them to `_download_file`. The defensive type check on line 279 handles edge cases safely.
455-502: LGTM - Previous review feedback addressed.

The defensive None-checks at lines 468-471 properly handle cases where URL parsing fails partially, addressing the previous review comment. The version-grouping approach efficiently minimizes HTTP requests for checksum metadata.
527-552: LGTM!

Efficiently reuses the already-fetched `json_str` to extract checksums, avoiding an extra HTTP request. The checksum mapping is correctly built and passed to `_download_files`.
685-712: LGTM!

The `validate_checksum` parameter is correctly propagated to `_download_artifact`.
Integer-Ctrl left a comment:
You added checksum logic using `_extract_checksum_from_node` in four different methods: `_download_collection`, `_download_version`, `_download_artifact`, and `download`. It might make more sense to centralize this logic, for example in the `_download_file` function.

If `validate_checksum` is true, `_download_file` could try to retrieve the checksum from the databus and, if successful, compare it against the downloaded file. This would reduce duplication and simplify the code.
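A rough sketch of what that centralization could look like, assuming the helper names visible elsewhere in this PR (`fetch_databus_jsonld`, `get_databus_id_parts_from_file_url`, `compute_sha256_and_length`, and the PR's `_extract_checksums_from_jsonld`); this illustrates the reviewer's idea and is not the merged code:

```python
import os
from databusclient.api.utils import (
    compute_sha256_and_length,
    fetch_databus_jsonld,
    get_databus_id_parts_from_file_url,
)

def _validate_downloaded_file(file_url: str, filename: str, databus_key=None) -> None:
    """Resolve the expected checksum from the Databus and compare it to the
    downloaded file: the single place _download_file would call into."""
    host, account, group, artifact, version, _ = get_databus_id_parts_from_file_url(file_url)
    version_uri = f"https://{host}/{account}/{group}/{artifact}/{version}"
    json_str = fetch_databus_jsonld(version_uri, databus_key=databus_key)
    # _extract_checksums_from_jsonld is the url -> checksum helper added in this PR
    expected = _extract_checksums_from_jsonld(json_str).get(file_url)
    if expected is None:
        print(f"WARNING: no expected checksum available for {filename}; skipping validation")
        return
    actual, _ = compute_sha256_and_length(filename)
    if actual.lower() != expected.lower():
        os.remove(filename)  # don't leave a corrupted file on disk
        raise IOError(f"Checksum mismatch for {filename}: expected {expected}, got {actual}")
```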
Hi, I have implemented the requested changes, please let me know if any more are needed!
Actionable comments posted: 3
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
databusclient/api/download.py (1)
218-223: Do not re-enable compression on Vault retry; it can break Content-Length verification.

On the Bearer/Vault path you `pop("Accept-Encoding")` (Line 251), which can re-enable gzip and make the later size check (Lines 293-295) fail.

Proposed fix

```diff
     vault_token = __get_vault_access__(url, vault_token_file, auth_url, client_id)
     headers["Authorization"] = f"Bearer {vault_token}"
-    headers.pop("Accept-Encoding", None)
+    headers["Accept-Encoding"] = "identity"
```

Also applies to: 249-255, 292-318
🤖 Fix all issues with AI agents
In @databusclient/api/download.py:
- Around line 103-135: In _resolve_checksums_for_urls, normalize URL schemes
before storing and looking up checksums so http/https mismatches don't skip
validation: when building versions_map and when matching url -> checksum against
extracted_checksums (from _extract_checksums_from_jsonld), canonicalize both the
input file_url and the keys in extracted_checksums to a consistent form (e.g.,
force https or strip scheme) and use those canonical URLs for dictionary keys
and lookups; ensure this normalization is applied for the initial grouping
(get_databus_id_parts_from_file_url flow) and the final mapping loop that
assigns checksums.
- Around line 75-101: The function _extract_checksums_from_jsonld
assumes json.loads(json_str) returns a dict and calls jd.get("@graph", []),
which fails when the top-level JSON-LD is a list; update
_extract_checksums_from_jsonld to check the type of jd after parsing: if jd is a
dict use jd.get("@graph", []), if jd is a list treat that list as the graph,
otherwise return {}. Keep using _extract_checksum_from_node for Part nodes and
ensure graph is iterated only when it's a list.
- Around line 815-841: The single-file download path has an over-indented try
block and a redundant checksum lookup; fix the indentation so the try/except
aligns properly under the if validate_checksum: block and remove the duplicate
fallback lookup that repeats checks.get(databusURI). Specifically, inside the
branch handling file is not None, adjust the indentation of the lines building
version_uri, calling fetch_databus_jsonld and _extract_checksums_from_jsonld,
and assigning expected = checks.get(databusURI) so they are within the try
block, eliminate the second checks.get(databusURI) fallback, keep the except
printing the warning, and ensure the resulting expected value is passed into
_download_file (function name: _download_file) as before.
🧹 Nitpick comments (2)
databusclient/api/utils.py (1)
2-2: Deduplicate `compute_sha256_and_length` and add typing to avoid drift.

This function is duplicated (same logic) in `databusclient/extensions/webdav.py`, which risks divergence.

Proposed tweak (types + docstring + clearer signature)

```diff
-from typing import Optional, Tuple
+from __future__ import annotations
+
+from typing import Optional, Tuple
 import hashlib
 import requests
+from os import PathLike
@@
-def compute_sha256_and_length(filepath):
+def compute_sha256_and_length(filepath: str | PathLike[str]) -> tuple[str, int]:
+    """Return (sha256_hex_digest, total_length_bytes) for a local file."""
     sha256 = hashlib.sha256()
     total_length = 0
     with open(filepath, "rb") as f:
         while True:
             chunk = f.read(4096)
             if not chunk:
                 break
             sha256.update(chunk)
             total_length += len(chunk)
     return sha256.hexdigest(), total_length
```

Also applies to: 52-62
databusclient/api/download.py (1)
16-20: Checksum extraction helper looks solid; consider adding minimal unit tests for node-shape coverage.

`_extract_checksum_from_node` is intentionally conservative (allowed keys + 64-hex match), which is good.

Also applies to: 21-60
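For instance, a minimal pytest sketch along these lines could cover the node shapes; this is hypothetical test code, assuming `_extract_checksum_from_node` is importable from databusclient.api.download:

```python
import pytest
from databusclient.api.download import _extract_checksum_from_node

HEX = "a" * 64  # a valid 64-char hex digest

@pytest.mark.parametrize(
    "node",
    [
        {"checksum": HEX},                        # plain string literal
        {"sha256sum": {"@value": HEX}},           # JSON-LD literal object
        {"databus:checksum": [{"@value": HEX}]},  # list-wrapped literal
    ],
)
def test_extracts_sha256_from_known_node_shapes(node):
    assert _extract_checksum_from_node(node) == HEX

def test_returns_none_when_no_checksum_present():
    assert _extract_checksum_from_node({"file": "https://example.org/f.ttl"}) is None
```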
📜 Review details
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (3)
- databusclient/api/download.py
- databusclient/api/utils.py
- file.txt
🧰 Additional context used
🧬 Code graph analysis (2)
databusclient/api/download.py (2)
databusclient/api/utils.py (3)
- compute_sha256_and_length (52-62)
- get_databus_id_parts_from_file_url (6-30)
- fetch_databus_jsonld (33-50)

databusclient/extensions/webdav.py (1)
- compute_sha256_and_length (8-18)
databusclient/api/utils.py (1)
databusclient/extensions/webdav.py (1)
- compute_sha256_and_length (8-18)
🔇 Additional comments (1)
databusclient/api/download.py (1)
296-318: Overall checksum propagation is consistent and matches the intended "best-effort" behavior.

Warnings on missing checksums + hard failure on mismatch are aligned with the PR objectives.

Also applies to: 520-534, 559-575, 608-624, 699-726, 899-915
```python
def _resolve_checksums_for_urls(file_urls: List[str], databus_key: str | None) -> dict:
    """
    Group file URLs by their Version URI, fetch each Version JSON-LD once,
    and return a combined url->checksum mapping for the provided URLs.
    Best-effort: failures to fetch or parse individual versions are skipped.
    """
    versions_map: dict = {}
    for file_url in file_urls:
        try:
            host, accountId, groupId, artifactId, versionId, fileId = get_databus_id_parts_from_file_url(file_url)
        except Exception:
            continue
        if versionId is None:
            continue
        if host is None or accountId is None or groupId is None or artifactId is None:
            continue
        version_uri = f"https://{host}/{accountId}/{groupId}/{artifactId}/{versionId}"
        versions_map.setdefault(version_uri, []).append(file_url)

    checksums: dict = {}
    for version_uri, urls_in_version in versions_map.items():
        try:
            json_str = fetch_databus_jsonld(version_uri, databus_key=databus_key)
            extracted_checksums = _extract_checksums_from_jsonld(json_str)
            for url in urls_in_version:
                if url in extracted_checksums:
                    checksums[url] = extracted_checksums[url]
        except Exception:
            # Best-effort: skip versions we cannot fetch or parse
            continue
    return checksums
```
Normalize URL schemes when mapping url -> checksum (http/https mismatch breaks validation).
Right now lookups require exact string match; if metadata uses https://... but the input URL is http://... (or vice-versa), validation will be skipped unexpectedly.
Proposed fix (canonicalize)
```diff
+def _canonicalize_url(u: str) -> str:
+    return "https://" + u.removeprefix("http://").removeprefix("https://")
@@
     for version_uri, urls_in_version in versions_map.items():
@@
             extracted_checksums = _extract_checksums_from_jsonld(json_str)
             for url in urls_in_version:
-                if url in extracted_checksums:
-                    checksums[url] = extracted_checksums[url]
+                canon = _canonicalize_url(url)
+                if url in extracted_checksums:
+                    checksums[url] = extracted_checksums[url]
+                elif canon in extracted_checksums:
+                    checksums[url] = extracted_checksums[canon]
@@
     for url in urls:
         expected = None
         if checksums and isinstance(checksums, dict):
-            expected = checksums.get(url)
+            expected = checksums.get(url) or checksums.get(_canonicalize_url(url))
```

Also applies to: 341-345
🤖 Prompt for AI Agents
In @databusclient/api/download.py around lines 103 - 135, In
_resolve_checksums_for_urls, normalize URL schemes before storing and looking up
checksums so http/https mismatches don't skip validation: when building
versions_map and when matching url -> checksum against extracted_checksums (from
_extract_checksums_from_jsonld), canonicalize both the input file_url and the
keys in extracted_checksums to a consistent form (e.g., force https or strip
scheme) and use those canonical URLs for dictionary keys and lookups; ensure
this normalization is applied for the initial grouping
(get_databus_id_parts_from_file_url flow) and the final mapping loop that
assigns checksums.
Actionable comments posted: 2
🤖 Fix all issues with AI agents
In @databusclient/api/download.py:
- Around line 87-92: The conditional logic incorrectly checks isinstance(jd,
list) twice and calls .get() on a list; update the checks so you first detect a
dict and extract the "@graph" key (e.g., if isinstance(jd, dict): graph =
jd.get("@graph", [])), then handle the list case (elif isinstance(jd, list):
graph = jd) and keep the else returning {} — ensure you reference the jd
variable and assign to graph accordingly so .get() is only called when jd is a
dict. (A sketch of this shape-tolerant parsing follows after this list.)
- Around line 831-833: The lookup for expected checksum misaligns indentation
and only tries an https fallback, missing the http->https or https->http
symmetry; fix the indentation around the expression that assigns expected
(variable expected) and replace the single fallback with a normalization helper
(e.g., normalize_databus_uri) that strips the scheme and then checks checks for
both "http://" + base and "https://" + base as well as the raw databusURI, or
alternatively always compare on the scheme-less base by trying
checks.get(databusURI) then checks.get("http://" + base) and
checks.get("https://" + base); update the code that references databusURI and
checks to call this helper so both http and https variants are supported and
indentation is correct.
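A small sketch of the shape-tolerant parsing the first item asks for; this is illustrative, the real fix lives in `_extract_checksums_from_jsonld`:

```python
import json

def load_graph(json_str: str) -> list:
    """Tolerate both top-level shapes of Databus JSON-LD: a dict carries the
    nodes under "@graph", while a bare list already *is* the graph."""
    jd = json.loads(json_str)
    if isinstance(jd, dict):
        return jd.get("@graph", [])
    if isinstance(jd, list):
        return jd
    return []

# Both shapes yield the same node list:
assert load_graph('{"@graph": [{"@type": "Part"}]}') == [{"@type": "Part"}]
assert load_graph('[{"@type": "Part"}]') == [{"@type": "Part"}]
```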
🧹 Nitpick comments (2)
databusclient/api/download.py (2)
11-16: Consider consolidating imports from the same module.

The `compute_sha256_and_length` import on line 16 can be combined with the existing import block from lines 11-14.

Suggested consolidation

```diff
 from databusclient.api.utils import (
+    compute_sha256_and_length,
     fetch_databus_jsonld,
     get_databus_id_parts_from_file_url,
 )
-
-from databusclient.api.utils import compute_sha256_and_length
```
302-323: Checksum validation logic looks good; minor stale comment.

The validation logic correctly handles missing expected checksums, computation failures, and mismatches (with file cleanup). The case-insensitive hex comparison is appropriate.

Line 304's comment references "webdav extension" but `compute_sha256_and_length` is now imported from `databusclient.api.utils`.

Fix stale comment

```diff
     # --- 6. Optional checksum validation ---
     if validate_checksum:
-        # reuse compute_sha256_and_length from webdav extension
+        # compute SHA-256 checksum for downloaded file
         try:
             actual, _ = compute_sha256_and_length(filename)
```
📜 Review details
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
databusclient/api/download.py
🧰 Additional context used
🧬 Code graph analysis (1)
databusclient/api/download.py (2)
databusclient/api/utils.py (3)
- compute_sha256_and_length (52-62)
- get_databus_id_parts_from_file_url (6-30)
- fetch_databus_jsonld (33-50)

databusclient/extensions/webdav.py (1)
- compute_sha256_and_length (8-18)
🔇 Additional comments (6)
databusclient/api/download.py (6)
21-59: LGTM!

The recursive checksum extraction logic handles common JSON-LD structures well (plain strings, `@value` objects, nested dicts, and lists) while restricting search to explicitly allowed keys to avoid false positives.
109-140: LGTM!

The batching strategy of grouping URLs by version URI, fetching each version's JSON-LD once, and extracting checksums is efficient. The best-effort exception handling is appropriate for this use case.
347-360: LGTM!

The per-URL checksum lookup and propagation to `_download_file` is implemented correctly with appropriate null-safety checks.
565-580: LGTM!

The checksum extraction and propagation is correctly implemented. Minor observation: the `checksums = {}` reassignment in the except block (line 570) is redundant since it's already initialized to `{}` on line 566, but this doesn't affect correctness.
614-630: LGTM!

Per-version checksum extraction within the artifact download loop is correctly implemented.
904-921: LGTM!

The SPARQL query path correctly builds the checksum mapping and provides helpful user feedback when no checksums are found.
Actionable comments posted: 3
🤖 Fix all issues with AI agents
In @databusclient/api/download.py:
- Around line 75-107: In _extract_checksums_from_jsonld, there's a formatting
typo: change the malformed "return{}" to "return {}" so the function returns an
empty dict correctly when jd isn't a dict or list; locate this in the
_extract_checksums_from_jsonld function and fix that single return statement.
- Line 257: The duplicate headers["Accept-Encoding"] = "identity" assignment
should be removed from the download flow: drop the later redundant assignment
that re-sets Accept-Encoding on the headers dict (leave the original declaration
where Accept-Encoding is first set), ensuring only the initial headers setup in
the download logic is used and no other behavior relies on reassigning
headers["Accept-Encoding"].
- Around line 823-847: Add a short inline comment explaining why we normalize
databusURI to https when looking up checksums: in the block that sets expected
from checks.get(databusURI) or checks.get("https://" +
databusURI.removeprefix(...)), annotate that this is a defensive fallback to
handle callers passing http:// URIs while metadata entries may be stored with
https://, and reference databusURI, checks, expected and
_extract_checksums_from_jsonld to clarify the relationship.
🧹 Nitpick comments (3)
databusclient/api/download.py (3)
347-360: Consider simplifying the checksums check.

Line 349's `isinstance(checksums, dict)` check is defensive but redundant given the `dict | None` type hint. Since the function already handles `None` gracefully, you could simplify to `if checksums:`.

♻️ Optional simplification

```diff
     for url in urls:
         expected = None
-        if checksums and isinstance(checksums, dict):
+        if checksums:
             expected = checksums.get(url)
```
525-540: Consider simplifying the checksums parameter.

Line 539's conditional `checksums if checksums else None` is redundant since an empty dict is falsy. You can simply pass `checksums=checksums` or `checksums=checksums or None` for clarity.

♻️ Optional simplification

```diff
             auth_url=auth_url,
             client_id=client_id,
             validate_checksum=validate_checksum,
-            checksums=checksums if checksums else None,
+            checksums=checksums or None,
         )
```
905-920: Consider simplifying the checksums parameter (same as earlier).

Line 920 has the same redundant conditional as line 539. Consider simplifying to `checksums=checksums or None`.

♻️ Optional simplification

```diff
             auth_url=auth_url,
             client_id=client_id,
             validate_checksum=validate_checksum,
-            checksums=checksums if checksums else None,
+            checksums=checksums or None,
         )
```
📜 Review details
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
databusclient/api/download.py
🧰 Additional context used
🧬 Code graph analysis (1)
databusclient/api/download.py (2)
databusclient/api/utils.py (3)
- compute_sha256_and_length (52-62)
- get_databus_id_parts_from_file_url (6-30)
- fetch_databus_jsonld (33-50)

databusclient/extensions/webdav.py (1)
- compute_sha256_and_length (8-18)
🔇 Additional comments (8)
databusclient/api/download.py (8)
3-3: LGTM: Import and regex setup is correct.

The regex pattern correctly matches 64-character SHA-256 hex strings, and the import of `compute_sha256_and_length` aligns with the utility function available in `databusclient.api.utils`.

Also applies to: 16-19
21-60: LGTM: Robust checksum extraction logic.

The function appropriately handles various JSON-LD structures (plain strings, `@value` objects, nested values, lists) while limiting inspection to known checksum fields to avoid false positives.
109-141: LGTM: Efficient batch checksum resolution.

The function intelligently groups file URLs by version to minimize metadata fetches, and the best-effort exception handling aligns with the PR's design goal to skip validation when checksums are unavailable.
302-324: LGTM: Solid checksum validation implementation.

The validation logic correctly:
- Reuses the existing `compute_sha256_and_length` utility
- Provides non-fatal warnings when checksums are missing or computation fails
- Performs case-insensitive comparison
- Deletes corrupted files on mismatch with appropriate error handling
565-581: LGTM: Proper checksum extraction for version downloads.

The function correctly extracts checksums from the version's JSON-LD metadata and handles failures gracefully with a fallback to an empty dict, maintaining the best-effort validation approach.
614-630: LGTM: Consistent checksum handling for artifact downloads.

The implementation mirrors the pattern in `_download_version`, correctly extracting checksums per version and maintaining graceful fallback behavior.
705-705: LGTM: Proper flag propagation through group downloads.

The function correctly threads the `validate_checksum` parameter to downstream artifact downloads.

Also applies to: 731-731
778-778: LGTM: Complete integration of checksum validation into the public API.

The function properly:
- Adds the `validate_checksum` parameter as a non-breaking change with a sensible default
- Handles single-file downloads by fetching parent version metadata
- Propagates the flag through all download paths (collections, versions, artifacts, groups, queries)
- Maintains backward compatibility

This completes the checksum validation feature as specified in the PR objectives.

Also applies to: 819-886
```python
# Try to fetch expected checksum from the parent Version metadata
expected = None
if validate_checksum:
    try:
        if version is not None:
            version_uri = f"https://{host}/{account}/{group}/{artifact}/{version}"
            json_str = fetch_databus_jsonld(version_uri, databus_key=databus_key)
            checks = _extract_checksums_from_jsonld(json_str)
            expected = checks.get(databusURI) or checks.get(
                "https://" + databusURI.removeprefix("http://").removeprefix("https://")
            )
    except Exception as e:
        print(f"WARNING: Could not fetch checksum for single file: {e}")

# Call the worker to download the single file (passes expected checksum)
_download_file(
    databusURI,
    localDir,
    vault_token_file=token,
    databus_key=databus_key,
    auth_url=auth_url,
    client_id=client_id,
    validate_checksum=validate_checksum,
    expected_checksum=expected,
)
```
🧩 Analysis chain

The following verification scripts were executed against the repository (dbpedia/databus-python-client):

```bash
# Explore the repository structure
find . -type f -name "*.py" -o -name "*.json" -o -name "*.jsonld" | head -20

# Look at download.py around the lines in question
sed -n '820,850p' databusclient/api/download.py

# Search for how URIs are used and formatted in the codebase
rg "databusURI|@id|file.*http" --type py -A 2 -B 2 | head -50

# Look for test fixtures with JSON-LD examples
find . -type f \( -name "*.json" -o -name "*.jsonld" -o -name "*fixture*" -o -name "*test*data*" \) | grep -E "\.(json|jsonld)$|fixture|test_data" | head -20

# Check for documentation about URI formatting
find . -type f \( -name "*.md" -o -name "*.rst" -o -name "*.txt" \) | xargs grep -l "URI\|http\|checksum" 2>/dev/null | head -10

# Look at _extract_checksums_from_jsonld and the Part node extraction logic
sed -n '780,810p' databusclient/api/download.py
sed -n '760,820p' databusclient/api/download.py
sed -n '750,820p' databusclient/api/download.py

# Check test files for actual JSON-LD examples with checksums
rg "checksum|sha256" tests/ --type py -A 3 -B 3 | head -60

# Search for any code that might pass http:// URIs vs https://
rg "http://" databusclient/ --type py -B 2 -A 2
rg "http://" tests/ --type py -B 5 -A 5

# Check whether tests cover checksum validation or protocol mismatches
rg "_extract_checksums" tests/ -A 10 -B 5
rg "validate_checksum|expected_checksum" tests/ --type py -B 5 -A 5
rg "download\(" tests/ --type py -B 2 -A 2 | head -40

# Inspect surrounding code and comments about protocol handling
sed -n '535,545p' databusclient/api/download.py
sed -n '680,750p' databusclient/api/download.py
grep -n "http\|protocol" databusclient/api/download.py | head -20
sed -n '1,50p' databusclient/api/download.py
sed -n '600,680p' databusclient/api/download.py
grep -rn "removeprefix\|protocol" databusclient/ --type py
sed -n '690,750p' databusclient/api/download.py
rg "_extract_checksums_from_jsonld" databusclient/api/download.py -A 30

# Check utils.py to understand how URIs are handled
cat databusclient/api/utils.py
```
Add inline documentation explaining the protocol normalization workaround.
Lines 831-833 normalize databusURI by removing protocol prefixes and re-adding https://. This is defensive programming for cases where a caller might pass an http:// URI that matches a metadata entry only when normalized to https://. Since _extract_checksums_from_jsonld already maps both file and @id URIs from Part nodes, the protocol normalization is a separate (reasonable) safeguard. However, adding a brief comment explaining the reason would clarify intent:
```python
expected = checks.get(databusURI) or checks.get(
    # Fallback: normalize to https in case metadata uses different protocol
    "https://" + databusURI.removeprefix("http://").removeprefix("https://")
)
```

🤖 Prompt for AI Agents
In @databusclient/api/download.py around lines 823 - 847, Add a short inline
comment explaining why we normalize databusURI to https when looking up
checksums: in the block that sets expected from checks.get(databusURI) or
checks.get("https://" + databusURI.removeprefix(...)), annotate that this is a
defensive fallback to handle callers passing http:// URIs while metadata entries
may be stored with https://, and reference databusURI, checks, expected and
_extract_checksums_from_jsonld to clarify the relationship.
Integer-Ctrl left a comment:
Looks fine. The only thing you forgot is to update or add information to the README.md file. However, this is ok since I'll tackle it later in a clean up task (#49)
Thanks!
Pull Request
Description
This PR implements the requested --validate-checksum flag for the download command to ensure data integrity.
Key Changes:
- CLI: Added the `--validate-checksum` boolean flag to the download command in cli.py.
- Logic: Implemented SHA256 verification in `_download_file`. It uses the existing `compute_sha256_and_length` utility from webdav.py to avoid code duplication.
- Metadata Extraction: Updated `_download_version` and `_download_artifact` to extract expected checksums from the JSON-LD metadata and pass them down to the downloader.
- Single File Fix: Added specific logic in `download()` to fetch the parent Version's JSON-LD when downloading a single file, ensuring checksum validation works for individual file downloads as well.
Related Issues
Closes #28
Type of change
Checklist:
- `poetry run pytest` - all tests passed
- `poetry run ruff check` - no linting errors

Summary by CodeRabbit
New Features
Bug Fixes