Add on-the-fly compression conversion during download (Issue #18) #45
Conversation
📝 Walkthrough

Adds on-the-fly compression conversion to downloads, driven by the new CLI options `--convert-to` and `--convert-from`.
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant User as CLI User
    participant CLI as cli.download()
    participant API as api.download()
    participant DL as _download_file()
    participant Conv as Conversion Logic
    participant FS as File System
    User->>CLI: run download with --convert-to gz --convert-from bz2
    CLI->>API: download(..., convert_to='gz', convert_from='bz2')
    API->>DL: _download_file(..., convert_to, convert_from)
    DL->>FS: fetch original file (e.g., file.bz2)
    FS-->>DL: file.bz2
    DL->>Conv: _detect_compression_format(file.bz2)
    Conv-->>DL: 'bz2'
    DL->>Conv: _should_convert_file(..., convert_to='gz', convert_from='bz2')
    Conv-->>DL: true
    DL->>Conv: _convert_compression_format(file.bz2 → file.gz)
    Conv->>Conv: decompress bz2, compress to gz
    Conv-->>DL: file.gz
    DL->>FS: remove original file.bz2, write file.gz
    DL-->>API: download+conversion complete
    API-->>CLI: success
    CLI-->>User: finished
```
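The detect/decide steps in the diagram can be sketched in a few lines. This is a minimal illustration with hypothetical helper names mirroring the diagram, not the PR's actual implementation in `databusclient/api/download.py`:

```python
# Sketch of the detection and decision steps from the sequence diagram.
# Names mirror the diagram's helpers; the real code may differ.
COMPRESSION_EXTENSIONS = {"bz2": ".bz2", "gz": ".gz", "xz": ".xz"}

def detect_compression_format(filename):
    """Return the compression format implied by the extension, or None."""
    lowered = filename.lower()
    for fmt, ext in COMPRESSION_EXTENSIONS.items():
        if lowered.endswith(ext):
            return fmt
    return None

def should_convert_file(filename, convert_to=None, convert_from=None):
    """Decide whether a downloaded file should be converted."""
    source = detect_compression_format(filename)
    if convert_to is None or source is None:
        return False, source   # nothing requested / file not compressed
    if source == convert_to:
        return False, source   # already in the target format
    if convert_from is not None and source != convert_from:
        return False, source   # filtered out by --convert-from
    return True, source

print(should_convert_file("file.bz2", convert_to="gz", convert_from="bz2"))
# → (True, 'bz2')
```

The tuple return (decision plus detected source format) matches the two diagram calls `_detect_compression_format` and `_should_convert_file` collapsed into one step.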
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~45 minutes

Possibly related PRs
🚥 Pre-merge checks: ✅ 5 passed
Actionable comments posted: 3
🧹 Nitpick comments (2)
tests/test_compression_conversion.py (1)
66-135: Comprehensive happy-path coverage. The conversion tests validate all format pairs (bz2↔gz↔xz) with proper data integrity checks and verification that original files are removed after successful conversion. The tests use temporary directories for isolation, which is good practice.
Consider adding tests for error conditions such as corrupted source files, permission errors, or disk space issues to improve test robustness and ensure graceful failure handling.
💡 Optional: Add error condition tests
Consider adding tests like:
```python
def test_convert_with_corrupted_file():
    """Test conversion fails gracefully with corrupted input"""
    with tempfile.TemporaryDirectory() as tmpdir:
        # Create invalid compressed file
        bad_file = os.path.join(tmpdir, "bad.txt.bz2")
        with open(bad_file, 'wb') as f:
            f.write(b"not actually compressed data")

        target_file = os.path.join(tmpdir, "bad.txt.gz")

        # Should raise RuntimeError and not create partial target
        with pytest.raises(RuntimeError):
            _convert_compression_format(bad_file, target_file, "bz2", "gz")

        assert os.path.exists(bad_file)         # Original preserved on failure
        assert not os.path.exists(target_file)  # No partial output
```

This would validate the error handling mentioned in the conversion implementation.
databusclient/api/download.py (1)
134-138: Consider preserving original exception details. Wrapping the exception in `RuntimeError` on line 138 may obscure the root cause. Consider chaining the exception or providing more context.

🔎 Suggested improvement

```diff
 except Exception as e:
     # If conversion fails, ensure the partial target file is removed
     if os.path.exists(target_file):
         os.remove(target_file)
-    raise RuntimeError(f"Compression conversion failed: {e}")
+    raise RuntimeError(f"Compression conversion failed: {e}") from e
```

This preserves the full exception chain for debugging.
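As a quick illustration of why `from e` matters (a toy snippet, not the PR's code): chaining keeps the original error reachable via `__cause__`, so tracebacks show both the wrapper and the root cause.

```python
def convert():
    try:
        raise KeyError("bz2")  # stand-in for a failure inside the conversion
    except Exception as e:
        # `from e` preserves the root cause on the raised exception
        raise RuntimeError(f"Compression conversion failed: {e}") from e

try:
    convert()
except RuntimeError as err:
    print(type(err.__cause__).__name__)  # → KeyError
```

Without `from e`, Python still records an implicit `__context__`, but `raise ... from e` makes the causal link explicit and survives `raise ... from None` suppression patterns elsewhere.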
📜 Review details
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (6)
- .gitignore
- README.md
- databusclient/api/download.py
- databusclient/cli.py
- tests/test_compression_conversion.py
- tests/test_download.py
🧰 Additional context used
🧬 Code graph analysis (1)
tests/test_compression_conversion.py (1)
databusclient/api/download.py (5)
download (700-836), _detect_compression_format (31-45), _should_convert_file (48-79), _get_converted_filename (82-99), _convert_compression_format (102-138)
🔇 Additional comments (18)
.gitignore (1)
3-4: LGTM! The additions are appropriate for keeping test artifacts and sensitive authentication data out of version control.
tests/test_download.py (2)
3-3: LGTM! The pytest import is correctly added to support the skip decorator.
30-30: LGTM! Skipping the long-running, network-dependent collection test is a sensible practice for faster local test execution. The reason is clearly documented.
tests/test_compression_conversion.py (4)
17-24: LGTM! The format detection tests comprehensively cover all supported formats (bz2, gz, xz), uncompressed files, and case insensitivity. This validates the detection logic correctly.
26-57: LGTM! The conversion decision logic tests are thorough, covering all key scenarios: no target specified, uncompressed files, same source/target formats, valid conversions, and the convert_from filter behavior.
59-64: LGTM! The filename conversion tests validate all format pair transformations concisely.
137-138: LGTM! Direct pytest execution support is convenient for development and debugging.
README.md (3)
169-173: LGTM! The documentation for the new compression conversion options is clear and accurately describes the functionality. The supported formats and optional filtering behavior are well explained.
185-211: LGTM! The updated help text and reformatted option descriptions improve readability and accurately reflect the new compression conversion capabilities.
264-274: LGTM! The examples section effectively demonstrates the three primary use cases: converting all compressed files, filtering by source format, and converting entire collections. This will help users understand how to leverage the new feature.
databusclient/cli.py (2)
161-170: LGTM! The new CLI options are well-integrated with appropriate validation (restricted choices: bz2, gz, xz) and case-insensitive matching, which aligns with the format detection logic. The help text clearly describes each option's purpose.
180-197: LGTM! The parameters are correctly added to the function signature, the docstring is updated to reflect the new functionality, and the values are properly propagated to the `api_download` call.
databusclient/api/download.py (6)
3-28: LGTM! Clean compression module setup. The imports and constant mappings are well-structured and use standard library modules for compression handling.
31-45: LGTM! Robust extension detection. The case-insensitive matching ensures reliable format detection across different filename conventions.
48-79: LGTM! Conversion decision logic is sound. The function correctly handles all conversion scenarios including filtering and format matching.
141-263: LGTM! Clean integration of conversion into download flow. The conversion logic is correctly placed after download completion and size verification. The function properly handles filenames and paths when calling conversion helpers.
265-299: LGTM! Consistent parameter propagation across all download functions. All internal download functions properly accept and forward the `convert_to` and `convert_from` parameters through the call chain, with clear documentation. Also applies to: 437-476, 478-512, 515-557, 624-663.
700-836: The CLI already validates compression format parameters using Click's built-in `Choice` validator, which automatically rejects invalid values before they reach the API function. No additional validation is needed. Likely an incorrect or invalid review comment.
```python
def _convert_compression_format(
    source_file: str, target_file: str, source_format: str, target_format: str
) -> None:
    """
    Convert a compressed file from one format to another.

    Parameters:
    - source_file: Path to source compressed file
    - target_file: Path to target compressed file
    - source_format: Source compression format ('bz2', 'gz', 'xz')
    - target_format: Target compression format ('bz2', 'gz', 'xz')
    """
    source_module = COMPRESSION_MODULES[source_format]
    target_module = COMPRESSION_MODULES[target_format]

    print(f"Converting {source_format} → {target_format}: {os.path.basename(source_file)}")

    # Decompress and recompress with progress indication
    chunk_size = 8192

    try:
        with source_module.open(source_file, 'rb') as sf:
            with target_module.open(target_file, 'wb') as tf:
                while True:
                    chunk = sf.read(chunk_size)
                    if not chunk:
                        break
                    tf.write(chunk)

        # Remove the original file after successful conversion
        os.remove(source_file)
        print(f"Conversion complete: {os.path.basename(target_file)}")
    except Exception as e:
        # If conversion fails, ensure the partial target file is removed
        if os.path.exists(target_file):
            os.remove(target_file)
        raise RuntimeError(f"Compression conversion failed: {e}")
```
Add input validation for compression formats.
Lines 114-115 access COMPRESSION_MODULES using source_format and target_format without validation. If invalid format strings are passed (e.g., from external input or typos), this will raise KeyError.
🔎 Proposed fix with validation
```diff
 def _convert_compression_format(
     source_file: str, target_file: str, source_format: str, target_format: str
 ) -> None:
     """
     Convert a compressed file from one format to another.

     Parameters:
     - source_file: Path to source compressed file
     - target_file: Path to target compressed file
     - source_format: Source compression format ('bz2', 'gz', 'xz')
     - target_format: Target compression format ('bz2', 'gz', 'xz')
     """
+    # Validate compression formats
+    if source_format not in COMPRESSION_MODULES:
+        raise ValueError(f"Unsupported source compression format: {source_format}")
+    if target_format not in COMPRESSION_MODULES:
+        raise ValueError(f"Unsupported target compression format: {target_format}")
+
     source_module = COMPRESSION_MODULES[source_format]
     target_module = COMPRESSION_MODULES[target_format]
```

📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```python
def _convert_compression_format(
    source_file: str, target_file: str, source_format: str, target_format: str
) -> None:
    """
    Convert a compressed file from one format to another.

    Parameters:
    - source_file: Path to source compressed file
    - target_file: Path to target compressed file
    - source_format: Source compression format ('bz2', 'gz', 'xz')
    - target_format: Target compression format ('bz2', 'gz', 'xz')
    """
    # Validate compression formats
    if source_format not in COMPRESSION_MODULES:
        raise ValueError(f"Unsupported source compression format: {source_format}")
    if target_format not in COMPRESSION_MODULES:
        raise ValueError(f"Unsupported target compression format: {target_format}")

    source_module = COMPRESSION_MODULES[source_format]
    target_module = COMPRESSION_MODULES[target_format]

    print(f"Converting {source_format} → {target_format}: {os.path.basename(source_file)}")

    # Decompress and recompress with progress indication
    chunk_size = 8192

    try:
        with source_module.open(source_file, 'rb') as sf:
            with target_module.open(target_file, 'wb') as tf:
                while True:
                    chunk = sf.read(chunk_size)
                    if not chunk:
                        break
                    tf.write(chunk)

        # Remove the original file after successful conversion
        os.remove(source_file)
        print(f"Conversion complete: {os.path.basename(target_file)}")
    except Exception as e:
        # If conversion fails, ensure the partial target file is removed
        if os.path.exists(target_file):
            os.remove(target_file)
        raise RuntimeError(f"Compression conversion failed: {e}")
```
🤖 Prompt for AI Agents
In databusclient/api/download.py around lines 102 to 138, the function
_convert_compression_format accesses COMPRESSION_MODULES[source_format] and
COMPRESSION_MODULES[target_format] without validating the inputs; add explicit
validation at the top of the function to check that source_format and
target_format are strings and keys present in COMPRESSION_MODULES, and if not
raise a clear ValueError listing allowed formats; after validation, safely
retrieve the modules (or use dict access knowing keys exist) and proceed as
before; ensure error messages are descriptive (e.g., "Unsupported compression
format: <format>. Supported: <list>") so callers can correct invalid inputs.
Hi @DhanashreePetare! Just a reminder that this PR has conflicts. No rush
Force-pushed: 98e4f35 to 5277f26
Actionable comments posted: 2
🤖 Fix all issues with AI agents
In @databusclient/api/download.py:
- Around line 82-99: The filename comparison is case-sensitive in
_get_converted_filename, causing mismatches with _detect_compression_format
which lowercases names; change the check to use a case-insensitive test (e.g.,
filename.lower().endswith(source_ext)) while still slicing the original filename
by len(source_ext) so the returned name preserves original casing; keep using
COMPRESSION_EXTENSIONS for source_ext/target_ext and only alter the conditional
and slicing logic accordingly.
- Around line 257-262: The bug is that the filename variable `file` is being
shadowed by a file handle in the `with` block so calls to
`_should_convert_file(file, ...)` end up passing a closed file handle instead of
the original filename string; fix by using the original filename string when
calling `_should_convert_file`, `_get_converted_filename`, and
`_convert_compression_format` (e.g., pass the variable that holds the
URL-derived filename) or rename the `with` block file handle (e.g., `fh` or
`fhandle`) so `file` continues to refer to the filename string; update calls to
`_should_convert_file`, `_get_converted_filename`, and
`_convert_compression_format` accordingly to avoid calling `.lower()` on a file
object.
🧹 Nitpick comments (1)
tests/test_compression_conversion.py (1)
59-63: Consider adding test for uppercase filenames. These tests cover the happy path but don't exercise the case-sensitivity edge case (e.g., `DATA.TXT.GZ`). Adding such a test would help catch the bug flagged in `_get_converted_filename`.

Suggested test addition

```python
def test_get_converted_filename():
    """Test filename conversion"""
    assert _get_converted_filename("data.txt.bz2", "bz2", "gz") == "data.txt.gz"
    assert _get_converted_filename("data.txt.gz", "gz", "xz") == "data.txt.xz"
    assert _get_converted_filename("data.txt.xz", "xz", "bz2") == "data.txt.bz2"
    # Uppercase extension handling
    assert _get_converted_filename("DATA.TXT.GZ", "gz", "xz") == "DATA.TXT.xz"
```
📜 Review details
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (6)
- .gitignore
- README.md
- databusclient/api/download.py
- databusclient/cli.py
- tests/test_compression_conversion.py
- tests/test_download.py
🚧 Files skipped from review as they are similar to previous changes (2)
- README.md
- databusclient/cli.py
🧰 Additional context used
🧬 Code graph analysis (1)
tests/test_compression_conversion.py (1)
databusclient/api/download.py (5)
download (700-836), _detect_compression_format (31-45), _should_convert_file (48-79), _get_converted_filename (82-99), _convert_compression_format (102-138)
🔇 Additional comments (14)
.gitignore (1)
3-4: LGTM! Appropriate additions to ignore test artifacts (`test-download/`) and sensitive token files (`vault-token.dat`).
databusclient/api/download.py (6)
17-28: LGTM! Clean dictionary mappings for compression formats. This design allows easy extension to additional formats in the future.
31-45: LGTM! Case-insensitive detection logic is correct and well-documented.
48-79: LGTM! Decision logic correctly handles all edge cases: no target specified, uncompressed files, same source/target formats, and source format filtering.
102-138: LGTM! The chunked decompress → recompress pipeline is correctly implemented with proper cleanup on failure. The 8 KiB chunk size is reasonable for streaming.
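The decompress → recompress pipeline described above can be exercised end-to-end with a small round trip. This is a self-contained sketch using only stdlib modules, independent of the PR's helpers:

```python
import bz2
import gzip
import os
import tempfile

def stream_convert(source_file, target_file, chunk_size=8192):
    """Re-encode a bz2 file as gzip in fixed-size chunks (as in the reviewed code)."""
    with bz2.open(source_file, "rb") as sf, gzip.open(target_file, "wb") as tf:
        while True:
            chunk = sf.read(chunk_size)
            if not chunk:
                break
            tf.write(chunk)

with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, "data.txt.bz2")
    dst = os.path.join(tmp, "data.txt.gz")
    payload = b"hello databus\n" * 1000
    with bz2.open(src, "wb") as f:
        f.write(payload)
    stream_convert(src, dst)
    with gzip.open(dst, "rb") as f:
        assert f.read() == payload  # bytes survive the format change intact
```

Because both sides stream in chunks, peak memory stays near `chunk_size` regardless of file size, which is the point of the reviewed design.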
265-298: LGTM! The `convert_to` and `convert_from` parameters are correctly propagated through `_download_files` to the underlying `_download_file` calls.
700-727: LGTM! The public `download()` API correctly exposes and documents the new compression conversion parameters, with clear docstrings explaining the supported formats.
tests/test_compression_conversion.py (5)
1-14: LGTM! Appropriate imports for testing the compression conversion utilities.
17-23: LGTM! Good test coverage including case-insensitive matching.
26-56: LGTM! Comprehensive test coverage for all decision branches in `_should_convert_file`.
66-134: LGTM! Well-structured conversion tests covering all format pairs with proper data integrity verification. The use of `tempfile.TemporaryDirectory` ensures proper cleanup.
137-138: LGTM! Standard pytest main block for direct execution.
tests/test_download.py (2)
3-4: LGTM! Added pytest import to support the skip decorator.
30-32: LGTM! Skipping the long-running collection download test is appropriate for CI stability. The skip reason clearly documents why this test is disabled.
```python
# --- 6. Convert compression format if requested ---
should_convert, source_format = _should_convert_file(file, convert_to, convert_from)
if should_convert and source_format:
    target_filename = _get_converted_filename(file, source_format, convert_to)
    target_filepath = os.path.join(localDir, target_filename)
    _convert_compression_format(filename, target_filepath, source_format, convert_to)
```
Critical bug: `file` variable shadowed by file handle.
At line 177, `file` is assigned the filename from the URL. At line 247, `file` is reassigned to the open file handle inside the `with` block. At line 258, `_should_convert_file(file, ...)` is called outside the `with` block, but `file` still references the closed file handle object, not the filename string.
This will cause `_detect_compression_format` to call `.lower()` on a file handle, raising an `AttributeError`.
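The shadowing pitfall is easy to reproduce in isolation (toy snippet, not the PR's code): after a `with open(...) as file:` block, the name `file` is bound to the closed handle, so string methods on it fail.

```python
import os
import tempfile

file = "data.txt.bz2"  # filename string, as derived from the URL
with tempfile.TemporaryDirectory() as tmp:
    # reassigning `file` inside the with statement shadows the string
    with open(os.path.join(tmp, "out.bin"), "wb") as file:
        file.write(b"payload")

# After the block, `file` is the closed handle, not the filename,
# so a later call like _should_convert_file(file, ...) would blow up on .lower()
assert not isinstance(file, str)
try:
    file.lower()
except AttributeError:
    print("AttributeError: file handles have no .lower()")
```

Renaming the handle (e.g., to `f`) or keeping a dedicated `original_filename` variable, as the suggested fixes below do, removes the ambiguity.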
Proposed fix

Use the original filename variable instead of `file`:

```diff
 # --- 6. Convert compression format if requested ---
-should_convert, source_format = _should_convert_file(file, convert_to, convert_from)
+original_filename = url.split("/")[-1]
+should_convert, source_format = _should_convert_file(original_filename, convert_to, convert_from)
 if should_convert and source_format:
-    target_filename = _get_converted_filename(file, source_format, convert_to)
+    target_filename = _get_converted_filename(original_filename, source_format, convert_to)
     target_filepath = os.path.join(localDir, target_filename)
     _convert_compression_format(filename, target_filepath, source_format, convert_to)
```

Alternatively, rename the file handle variable in the with statement to avoid shadowing:

```diff
-with open(filename, "wb") as file:
+with open(filename, "wb") as f:
     for data in response.iter_content(block_size):
         progress_bar.update(len(data))
-        file.write(data)
+        f.write(data)
```

🤖 Prompt for AI Agents
In @databusclient/api/download.py around lines 257 - 262, The bug is that the
filename variable `file` is being shadowed by a file handle in the `with` block
so calls to `_should_convert_file(file, ...)` end up passing a closed file
handle instead of the original filename string; fix by using the original
filename string when calling `_should_convert_file`, `_get_converted_filename`,
and `_convert_compression_format` (e.g., pass the variable that holds the
URL-derived filename) or rename the `with` block file handle (e.g., `fh` or
`fhandle`) so `file` continues to refer to the filename string; update calls to
`_should_convert_file`, `_get_converted_filename`, and
`_convert_compression_format` accordingly to avoid calling `.lower()` on a file
object.
@DhanashreePetare this is a critical issue. The first test I did failed immediately:

```shell
databusclient download https://databus.dbpedia.org/dbpedia/dbpedia-wikipedia-kg-dump/labels/2025-12-01/labels_graph=de-dbpedia-org_partition=rdfs-label.ttl.gz --convert-to bz2
```

It is good practice to test your own implementation using real-world examples. This is an effective way of identifying errors quickly.
Additionally, once problems have been found and fixed, it is often a good idea to add a test for them right away 😄
Seems like the humans are having a chat. I'll hop back into my burrow for now. If you need me again, just tag @coderabbitai in a new comment, and I'll come hopping out!
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (11)
databusclient/api/download.py (8)
269-303: Critical syntax error causing pipeline failure. The function `_download_file` has malformed syntax with duplicate function signatures and docstrings. Lines 276-290 contain the new docstring with convert parameters, but lines 291-303 contain remnants of the original function definition mixed in, causing a `SyntaxError: Simple statements must be separated by newlines or semicolons`. The function should have a single, clean signature and docstring.

🐛 Proposed fix - consolidate the function definition

```diff
 def _download_file(
     url,
     localDir,
     vault_token_file=None,
     databus_key=None,
     auth_url=None,
     client_id=None,
     convert_to=None,
     convert_from=None,
-) -> None:
-    """
-    Download a file from the internet with a progress bar using tqdm.
-
-    Parameters:
-    - url: the URL of the file to download
-    - localDir: Local directory to download file to. If None, the databus folder structure is created in the current working directory.
-    - vault_token_file: Path to Vault refresh token file
-    - databus_key: Databus API key for protected downloads
-    - auth_url: Keycloak token endpoint URL
-    - client_id: Client ID for token exchange
-    - convert_to: Target compression format for on-the-fly conversion
-    - convert_from: Optional source compression format filter
     validate_checksum: bool = False,
     expected_checksum: str | None = None,
 ) -> None:
-    """Download a file from the internet with a progress bar using tqdm.
+    """Download a file from the internet with a progress bar using tqdm.

     Args:
         url: The URL of the file to download.
         localDir: Local directory to download file to. If None, the databus folder structure is created in the current working directory.
         vault_token_file: Path to Vault refresh token file.
         databus_key: Databus API key for protected downloads.
         auth_url: Keycloak token endpoint URL.
         client_id: Client ID for token exchange.
+        convert_to: Target compression format for on-the-fly conversion.
+        convert_from: Optional source compression format filter.
+        validate_checksum: Whether to validate the file checksum after download.
+        expected_checksum: Expected SHA256 checksum for validation.
     """
```
443-470: Checksum validation will fail after compression conversion. After conversion, the original file is deleted (in `_convert_compression_format` line 135), but checksum validation at line 453 still attempts to validate `filename`, which no longer exists. Additionally, the expected checksum from Databus metadata corresponds to the original compressed file, not the converted one.

Either:
- Skip checksum validation when conversion is performed, or
- Validate the checksum before conversion

🐛 Proposed fix - validate checksum before conversion

```diff
 # --- 5. Verify download size ---
 if total_size_in_bytes != 0 and progress_bar.n != total_size_in_bytes:
     raise IOError("Downloaded size does not match Content-Length header")

-# --- 6. Convert compression format if requested ---
-should_convert, source_format = _should_convert_file(file, convert_to, convert_from)
-if should_convert and source_format:
-    target_filename = _get_converted_filename(file, source_format, convert_to)
-    target_filepath = os.path.join(localDir, target_filename)
-    _convert_compression_format(filename, target_filepath, source_format, convert_to)
-
-# --- 6. Optional checksum validation ---
+# --- 6. Optional checksum validation (must happen before conversion) ---
 if validate_checksum:
     # reuse compute_sha256_and_length from webdav extension
     try:
         actual, _ = compute_sha256_and_length(filename)
     except (OSError, IOError) as e:
         print(f"WARNING: error computing checksum for (unknown): {e}")
         actual = None
     if expected_checksum is None:
         print(f"WARNING: no expected checksum available for (unknown); skipping validation")
     elif actual is None:
         print(f"WARNING: could not compute checksum for (unknown); skipping validation")
     else:
         if actual.lower() != expected_checksum.lower():
             try:
                 os.remove(filename)  # delete corrupted file
             except OSError:
                 pass
             raise IOError(
                 f"Checksum mismatch for (unknown): expected {expected_checksum}, got {actual}"
             )

+# --- 7. Convert compression format if requested ---
+should_convert, source_format = _should_convert_file(file, convert_to, convert_from)
+if should_convert and source_format:
+    target_filename = _get_converted_filename(file, source_format, convert_to)
+    target_filepath = os.path.join(localDir, target_filename)
+    _convert_compression_format(filename, target_filepath, source_format, convert_to)
```
473-507: Same syntax error pattern as `_download_file`. The `_download_files` function has the same malformed definition with duplicate/overlapping docstrings. This needs to be consolidated similar to the fix for `_download_file`.
659-695: Same syntax error pattern in `_download_collection`. This function also has overlapping/duplicate docstrings that need consolidation.
720-752: Same syntax error pattern in `_download_version`. This function also has overlapping/duplicate docstrings that need consolidation.
777-812: Same syntax error pattern in `_download_artifact`. This function also has overlapping/duplicate docstrings that need consolidation.
906-941: Same syntax error pattern in `_download_group`. This function also has overlapping/duplicate docstrings that need consolidation.
1004-1029: Same syntax error pattern in the `download` function. The main `download` function also has overlapping/duplicate docstrings that need consolidation.
databusclient/cli.py (3)
165-177: Malformed click option definition. The `--convert-from` option definition is incomplete. Line 174 shows `"--validate-checksum",`, which appears to be the start of a separate option that was incorrectly merged into the `--convert-from` definition.

🐛 Proposed fix - separate the options properly

```diff
 @click.option(
     "--convert-from",
     type=click.Choice(["bz2", "gz", "xz"], case_sensitive=False),
     help="Source compression format to convert from (optional filter). Only files with this compression will be converted.",
+)
+@click.option(
     "--validate-checksum", is_flag=True, help="Validate checksums of downloaded files"
 )
```
178-191: Malformed function signature. The function signature has invalid syntax with a closing parenthesis, then a parameter, then another closing parenthesis on separate lines (189-191). This needs to be a single, properly formatted parameter list.

🐛 Proposed fix

```diff
 def download(
     databusuris: List[str],
     localdir,
     databus,
     vault_token,
     databus_key,
     all_versions,
     authurl,
     clientid,
     convert_to,
     convert_from,
-):
-    validate_checksum,
-):
+    validate_checksum,
+):
```
196-221: Duplicate api_download calls - only one should exist. There are two calls to `api_download`: one at lines 196-207 (outside the try block, has conversion params, missing validate_checksum) and another at lines 209-219 (inside the try block, has validate_checksum, missing conversion params). This should be a single call with all parameters inside the try block.

🐛 Proposed fix - consolidate into single call

```diff
-    api_download(
-        localDir=localdir,
-        endpoint=databus,
-        databusURIs=databusuris,
-        token=vault_token,
-        databus_key=databus_key,
-        all_versions=all_versions,
-        auth_url=authurl,
-        client_id=clientid,
-        convert_to=convert_to,
-        convert_from=convert_from,
-    )
     try:
         api_download(
             localDir=localdir,
             endpoint=databus,
             databusURIs=databusuris,
             token=vault_token,
             databus_key=databus_key,
             all_versions=all_versions,
             auth_url=authurl,
             client_id=clientid,
-            validate_checksum=validate_checksum
+            convert_to=convert_to,
+            convert_from=convert_from,
+            validate_checksum=validate_checksum,
         )
     except DownloadAuthError as e:
         raise click.ClickException(str(e))
```
🤖 Fix all issues with AI agents
In @databusclient/api/download.py:
- Around line 6-8: The imports in this module duplicate List from typing; remove
the redundant "from typing import List" (the second occurrence) and consolidate
typing imports so you only have one line like "from typing import List,
Optional, Tuple" while keeping the existing "import re" intact; update the
import block (references: List, Optional, Tuple, re) accordingly.
🧹 Nitpick comments (1)
databusclient/api/download.py (1)
85-102: Consider case-insensitive extension matching. The `_get_converted_filename` function uses `filename.endswith(source_ext)`, which is case-sensitive, but `_detect_compression_format` lowercases the filename before checking. If a file has a mixed-case extension like `.BZ2`, detection will work but filename conversion will append rather than replace the extension.

♻️ Proposed fix for case-insensitive handling

```diff
 def _get_converted_filename(filename: str, source_format: str, target_format: str) -> str:
     source_ext = COMPRESSION_EXTENSIONS[source_format]
     target_ext = COMPRESSION_EXTENSIONS[target_format]
-    if filename.endswith(source_ext):
+    if filename.lower().endswith(source_ext):
         return filename[:-len(source_ext)] + target_ext
     return filename + target_ext
```
📜 Review details
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (4)
- README.md
- databusclient/api/download.py
- databusclient/cli.py
- tests/test_download.py
🚧 Files skipped from review as they are similar to previous changes (1)
- tests/test_download.py
🧰 Additional context used
🧬 Code graph analysis (1)
databusclient/api/download.py (1)
databusclient/cli.py (1)
download(178-190)
🪛 GitHub Actions: Python CI (Lint & pytest)
databusclient/api/download.py
[error] 294-294: ruff check failed: SyntaxError: Simple statements must be separated by newlines or semicolons.
🔇 Additional comments (6)
README.md (2)
177-180: LGTM!

The documentation for the `--convert-to` and `--convert-from` options is clear and provides helpful examples. The descriptions accurately reflect the implementation behavior (decompress → recompress pipeline, format filtering).
272-282: LGTM!

The examples section effectively demonstrates the three key use cases for compression conversion: converting all files to a target format, filtering by source format, and unifying a collection. This aligns well with the PR objectives.
databusclient/api/download.py (4)
20-31: LGTM!

The compression format mappings are clean and provide a single source of truth for supported formats. The module-to-format mapping correctly uses Python's built-in compression libraries.

34-48: LGTM!

The compression detection is case-insensitive, which is a good defensive approach for handling filenames with varied casing.

51-82: LGTM!

The conversion decision logic correctly handles all edge cases: no target format, uncompressed files, same source/target format, and optional source format filtering.
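That decision flow can be sketched as a minimal stand-alone reimplementation (the function names and mapping values here mirror the review's description but are assumptions, not imports from the package):

```python
from typing import Optional, Tuple

# Extension mapping as described in the review summary (assumed values)
COMPRESSION_EXTENSIONS = {"bz2": ".bz2", "gz": ".gz", "xz": ".xz"}


def detect_compression_format(filename: str) -> Optional[str]:
    """Case-insensitive detection of the compression format from the extension."""
    lowered = filename.lower()
    for fmt, ext in COMPRESSION_EXTENSIONS.items():
        if lowered.endswith(ext):
            return fmt
    return None  # treated as uncompressed


def should_convert_file(
    filename: str, convert_to: Optional[str], convert_from: Optional[str]
) -> Tuple[bool, Optional[str]]:
    """Mirror of the decision logic: skip when there is no target format,
    the file is uncompressed, it is already in the target format, or the
    --convert-from filter does not match."""
    if convert_to is None:
        return False, None
    source = detect_compression_format(filename)
    if source is None:
        return False, None        # uncompressed: nothing to convert
    if source == convert_to:
        return False, source      # already in the target format
    if convert_from is not None and source != convert_from:
        return False, source      # filtered out by --convert-from
    return True, source
```

For example, `should_convert_file("data.ttl.BZ2", "gz", None)` returns `(True, "bz2")`, while a `--convert-from bz2` filter leaves `.xz` files untouched.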
105-141: LGTM!

The conversion implementation properly:
- Uses chunked streaming (8KB) to avoid memory issues with large files
- Cleans up partial target files on failure
- Removes the original file only after successful conversion
- Provides user feedback during conversion
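A chunked decompress → recompress pipeline with those properties can be sketched as follows (an illustrative stand-in for `_convert_compression_format`, assuming the stdlib `bz2`/`gzip`/`lzma` modules and the 8 KB chunk size described above):

```python
import bz2
import gzip
import lzma
import os

COMPRESSION_MODULES = {"bz2": bz2, "gz": gzip, "xz": lzma}


def convert_compression_format(
    src_path: str,
    dst_path: str,
    source_format: str,
    target_format: str,
    chunk_size: int = 8192,
) -> None:
    """Stream-decompress src_path and recompress into dst_path in chunks."""
    src_mod = COMPRESSION_MODULES[source_format]
    dst_mod = COMPRESSION_MODULES[target_format]
    try:
        with src_mod.open(src_path, "rb") as src, dst_mod.open(dst_path, "wb") as dst:
            while True:
                chunk = src.read(chunk_size)
                if not chunk:
                    break
                dst.write(chunk)
    except Exception:
        # Clean up a partial target file on failure, then re-raise
        if os.path.exists(dst_path):
            os.remove(dst_path)
        raise
    # Drop the original only after a successful conversion
    os.remove(src_path)
```

Because only one chunk is held in memory at a time, arbitrarily large dump files can be converted without buffering the whole payload.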
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
databusclient/api/download.py (1)
432-459: Critical: Variable shadowing causes runtime crash and checksum validation on deleted file.

Two critical issues:

1. Variable shadowing bug: At line 306, `file = url.split("/")[-1]` stores the filename string. But at line 422, `with open(filename, "wb") as file:` shadows this with a file handle. At line 433, `_should_convert_file(file, ...)` passes the closed file handle, not the filename string. This will crash with `AttributeError: '_io.BufferedWriter' object has no attribute 'lower'`.
2. Checksum on deleted file: If conversion succeeds, `_convert_compression_format` deletes the original file at `filename`, but line 442 tries to compute a checksum on `filename`, which no longer exists.

🐛 Proposed fix
```diff
-    file = url.split("/")[-1]
-    filename = os.path.join(localDir, file)
+    base_filename = url.split("/")[-1]
+    filename = os.path.join(localDir, base_filename)
     print(f"Download file: {url}")
```

And update the download and conversion section:
```diff
 # --- 4. Download with progress bar ---
 total_size_in_bytes = int(response.headers.get("content-length", 0))
 block_size = 1024  # 1 KiB
 progress_bar = tqdm(total=total_size_in_bytes, unit="iB", unit_scale=True)
-with open(filename, "wb") as file:
+with open(filename, "wb") as f:
     for data in response.iter_content(block_size):
         progress_bar.update(len(data))
-        file.write(data)
+        f.write(data)
 progress_bar.close()

 # --- 5. Verify download size ---
 if total_size_in_bytes != 0 and progress_bar.n != total_size_in_bytes:
     raise IOError("Downloaded size does not match Content-Length header")

 # --- 6. Convert compression format if requested ---
-should_convert, source_format = _should_convert_file(file, convert_to, convert_from)
+should_convert, source_format = _should_convert_file(base_filename, convert_to, convert_from)
+checksum_filepath = filename  # Track which file to validate
 if should_convert and source_format:
-    target_filename = _get_converted_filename(file, source_format, convert_to)
+    target_filename = _get_converted_filename(base_filename, source_format, convert_to)
     target_filepath = os.path.join(localDir, target_filename)
     _convert_compression_format(filename, target_filepath, source_format, convert_to)
-# --- 6. Optional checksum validation ---
+    checksum_filepath = target_filepath  # Validate converted file
+# --- 7. Optional checksum validation ---
 if validate_checksum:
     # reuse compute_sha256_and_length from webdav extension
     try:
-        actual, _ = compute_sha256_and_length(filename)
+        actual, _ = compute_sha256_and_length(checksum_filepath)
     except (OSError, IOError) as e:
-        print(f"WARNING: error computing checksum for (unknown): {e}")
+        print(f"WARNING: error computing checksum for {checksum_filepath}: {e}")
         actual = None
```

Note: Checksum validation after conversion will compare against the original expected checksum, which won't match the converted file. Consider whether checksum validation should be skipped or adapted when conversion is performed.
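The note's caveat is easy to demonstrate: recompression produces a different byte stream, so a checksum recorded for the published file cannot match the converted one even though the decompressed payload is identical. A tiny self-contained illustration:

```python
import bz2
import gzip
import hashlib

payload = b"<s> <p> <o> .\n" * 1000  # stand-in for downloaded RDF content

original = bz2.compress(payload)                      # bytes as the server published them
converted = gzip.compress(bz2.decompress(original))   # bytes after on-the-fly conversion

sha_original = hashlib.sha256(original).hexdigest()
sha_converted = hashlib.sha256(converted).hexdigest()

# The published checksum describes the bz2 bytes; the gz bytes differ,
# so validating the converted file against it must fail...
assert sha_original != sha_converted
# ...even though the decompressed content is byte-identical.
assert bz2.decompress(original) == gzip.decompress(converted)
```

This is why checksum validation needs to be skipped, or applied to the pre-conversion bytes, whenever a conversion is performed.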
🤖 Fix all issues with AI agents
In @databusclient/api/download.py:
- Around line 85-102: The function _get_converted_filename fails for uppercase
extensions; change the conditional to perform a case-insensitive suffix check
(e.g., use filename.lower().endswith(source_ext.lower())) so files like
"file.BZ2" are recognized, then slice off the matched extension length from the
original filename (filename[:-len(source_ext)]) and append target_ext as before;
keep using COMPRESSION_EXTENSIONS for source_ext/target_ext and leave the append
behavior when no match unchanged.
🧹 Nitpick comments (1)
databusclient/api/download.py (1)
105-141: LGTM with optional improvement note.

Good error handling with cleanup of partial output on failure. The chunked I/O approach is memory-efficient.

Consider using `logging` instead of `print()` for better control over verbosity, but this is fine for a CLI tool.
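If that suggestion were taken up, a minimal sketch might look like this (illustrative only; the PR currently uses `print()`, and the logger name is a hypothetical choice):

```python
import logging

# Module-level logger; applications control verbosity via logging config
logger = logging.getLogger("databusclient.download")


def report_conversion(src: str, dst: str) -> None:
    # Emitted at INFO level, so a user can silence it with WARNING+ filtering
    logger.info("Converting %s -> %s", src, dst)
```

Deferred `%s` formatting avoids building the message string when the level is filtered out.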
📜 Review details
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
- databusclient/api/download.py
- databusclient/cli.py
🧰 Additional context used
🧬 Code graph analysis (1)
databusclient/api/download.py (2)
databusclient/api/utils.py (1)
compute_sha256_and_length(62-72)databusclient/extensions/webdav.py (1)
compute_sha256_and_length(16-34)
🔇 Additional comments (12)
databusclient/api/download.py (10)
3-6: LGTM!

Standard library imports for compression modules and typing are correctly added.

20-31: LGTM!

Clean and correct mappings for the supported compression formats.

34-48: LGTM!

Case-insensitive extension detection is correctly implemented.

51-82: LGTM!

Conversion decision logic correctly handles all cases: no target format, uncompressed files, same-format skip, and source format filtering.

462-503: LGTM!

Parameters correctly propagated to `_download_files` and forwarded to individual file downloads.

639-686: LGTM!

Collection download correctly propagates conversion parameters.

689-733: LGTM!

Version download correctly propagates conversion parameters.

736-788: LGTM!

Artifact download correctly propagates conversion parameters through version iterations.

852-893: LGTM!

Group download correctly propagates conversion parameters to artifact downloads.

930-1098: LGTM!

The main `download` function correctly propagates conversion parameters to all download paths (collection, file, version, artifact, group, and query-based downloads).

databusclient/cli.py (2)

165-174: LGTM!

CLI options are correctly defined with appropriate choices and help text. Case-insensitive matching is a good UX choice.

180-212: LGTM!

Parameters correctly added to function signature and properly passed to `api_download`. The docstring update accurately describes the new capability.
```python
def _get_converted_filename(filename: str, source_format: str, target_format: str) -> str:
    """
    Generate the new filename after compression format conversion.

    Parameters:
    - filename: Original filename
    - source_format: Source compression format ('bz2', 'gz', 'xz')
    - target_format: Target compression format ('bz2', 'gz', 'xz')

    Returns:
    - New filename with updated extension
    """
    source_ext = COMPRESSION_EXTENSIONS[source_format]
    target_ext = COMPRESSION_EXTENSIONS[target_format]

    if filename.endswith(source_ext):
        return filename[:-len(source_ext)] + target_ext
    return filename + target_ext
```
Case sensitivity mismatch causes incorrect filename generation.

The detection function uses lowercase comparison, but this function checks `filename.endswith(source_ext)` with the lowercase extension. If a file has an uppercase extension (e.g., `file.BZ2`), the replacement won't match and will incorrectly append the target extension (`file.BZ2.gz` instead of `file.gz`).
🔧 Proposed fix

```diff
 def _get_converted_filename(filename: str, source_format: str, target_format: str) -> str:
     source_ext = COMPRESSION_EXTENSIONS[source_format]
     target_ext = COMPRESSION_EXTENSIONS[target_format]
-    if filename.endswith(source_ext):
+    if filename.lower().endswith(source_ext):
         return filename[:-len(source_ext)] + target_ext
     return filename + target_ext
```

📝 Committable suggestion
‼️ IMPORTANT

Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

```python
def _get_converted_filename(filename: str, source_format: str, target_format: str) -> str:
    """
    Generate the new filename after compression format conversion.

    Parameters:
    - filename: Original filename
    - source_format: Source compression format ('bz2', 'gz', 'xz')
    - target_format: Target compression format ('bz2', 'gz', 'xz')

    Returns:
    - New filename with updated extension
    """
    source_ext = COMPRESSION_EXTENSIONS[source_format]
    target_ext = COMPRESSION_EXTENSIONS[target_format]
    if filename.lower().endswith(source_ext):
        return filename[:-len(source_ext)] + target_ext
    return filename + target_ext
```
🤖 Prompt for AI Agents
In @databusclient/api/download.py around lines 85 - 102, The function
_get_converted_filename fails for uppercase extensions; change the conditional
to perform a case-insensitive suffix check (e.g., use
filename.lower().endswith(source_ext.lower())) so files like "file.BZ2" are
recognized, then slice off the matched extension length from the original
filename (filename[:-len(source_ext)]) and append target_ext as before; keep
using COMPRESSION_EXTENSIONS for source_ext/target_ext and leave the append
behavior when no match unchanged.
```python
    """
    Detect compression format from file extension.

    Parameters:
    - filename: Name of the file

    Returns:
    - Compression format string ('bz2', 'gz', 'xz') or None if not compressed
    """
```
Please use Python Google-style docstrings (see the example at https://google.github.io/styleguide/pyguide.html#244-decision)
```python
    """
    Determine if a file should be converted and what the source format is.

    Parameters:
    - filename: Name of the file
    - convert_to: Target compression format
    - convert_from: Optional source compression format filter

    Returns:
    - Tuple of (should_convert: bool, source_format: Optional[str])
    """
```
Please use Python Google-style docstrings (see the example at https://google.github.io/styleguide/pyguide.html#244-decision)
```python
    """
    Generate the new filename after compression format conversion.

    Parameters:
    - filename: Original filename
    - source_format: Source compression format ('bz2', 'gz', 'xz')
    - target_format: Target compression format ('bz2', 'gz', 'xz')

    Returns:
    - New filename with updated extension
    """
```
Please use Python Google-style docstrings (see the example at https://google.github.io/styleguide/pyguide.html#244-decision)
```python
        # --- 6. Convert compression format if requested ---
        should_convert, source_format = _should_convert_file(file, convert_to, convert_from)
        if should_convert and source_format:
            target_filename = _get_converted_filename(file, source_format, convert_to)
            target_filepath = os.path.join(localDir, target_filename)
            _convert_compression_format(filename, target_filepath, source_format, convert_to)
```
@DhanashreePetare this is a critical issue. The first test I did failed immediately:
```shell
databusclient download https://databus.dbpedia.org/dbpedia/dbpedia-wikipedia-kg-dump/labels/2025-12-01/labels_graph=de-dbpedia-org_partition=rdfs-label.ttl.gz --convert-to bz2
```
It is good practice to test your own implementation using real-world examples. This is an effective way of identifying errors quickly.
Additionally, once problems have been found and fixed, it is often a good idea to add a test for them right away 😄
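Following that advice, a regression test could pin the exact filename from the failing command (the helper below is a self-contained reimplementation incorporating the reviewer's case-insensitive fix, not an import from databusclient):

```python
def get_converted_filename(filename: str, source_format: str, target_format: str) -> str:
    """Replace the compression extension, matching case-insensitively."""
    extensions = {"bz2": ".bz2", "gz": ".gz", "xz": ".xz"}
    source_ext = extensions[source_format]
    target_ext = extensions[target_format]
    if filename.lower().endswith(source_ext):
        return filename[:-len(source_ext)] + target_ext
    return filename + target_ext


def test_real_world_gz_to_bz2():
    # Filename taken from the failing download in the review thread
    name = "labels_graph=de-dbpedia-org_partition=rdfs-label.ttl.gz"
    assert get_converted_filename(name, "gz", "bz2") == (
        "labels_graph=de-dbpedia-org_partition=rdfs-label.ttl.bz2"
    )


def test_uppercase_extension():
    # The case-sensitivity bug flagged above: .BZ2 must be replaced, not appended
    assert get_converted_filename("file.BZ2", "bz2", "gz") == "file.gz"
```

Both functions are plain pytest-style tests; dropping them into `tests/test_download.py` (adapted to call the real helpers) would catch regressions of this exact failure.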
Pull Request
Description
This PR implements on-the-fly compression format conversion during download, allowing users to convert downloaded files between
`bz2`, `gz`, and `xz` formats automatically during the download process. This feature makes it easier to unify datasets with consistent compression formats, save disk space, or integrate data into pipelines that expect specific formats.

Key Features:

- `--convert-to` CLI option to specify target compression format (bz2, gz, xz)
- `--convert-from` CLI option to filter which source formats to convert

Related Issues
Resolves #18
Type of change
Checklist:
- `poetry run pytest` - all tests passed
- `poetry run ruff check` - no linting errors

Summary by CodeRabbit
New Features
Documentation
Tests
Chores