Skip to content

Conversation

@DhanashreePetare
Copy link
Contributor

@DhanashreePetare DhanashreePetare commented Jan 2, 2026

Pull Request

Description

This PR implements on-the-fly compression format conversion during download, allowing users to convert downloaded files between bz2, gz, and xz formats automatically during the download process. This feature makes it easier to unify datasets with consistent compression formats, save disk space, or integrate data into pipelines that expect specific formats.

Key Features:

  • New --convert-to CLI option to specify target compression format (bz2, gz, xz)
  • Optional --convert-from CLI option to filter which source formats to convert
  • Automatic compression format detection based on file extensions
  • Smart conversion logic that skips when source equals target format
  • Decompress → Recompress pipeline using Python's built-in compression modules
  • Comprehensive test coverage for all conversion scenarios

Related Issues
Resolves #18

Type of change

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • This change requires a documentation update
  • Housekeeping

Checklist:

  • My code follows the ruff code style of this project.
  • I have performed a self-review of my own code
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation (if applicable)
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes
    • poetry run pytest - all tests passed
    • poetry run ruff check - no linting errors

Summary by CodeRabbit

  • New Features

    • On-the-fly compression conversion for downloads via new CLI options --convert-to and --convert-from (supports bz2, gz, xz); files are auto-decompressed/recompressed during download.
  • Documentation

    • Added "Download with Compression Conversion" section, examples, and clarified download option descriptions.
  • Tests

    • Added comprehensive compression conversion tests; marked a long-running collection download test as skipped.
  • Chores

    • Updated project ignore rules (added test-download/ and vault-token.dat).

✏️ Tip: You can customize this high-level summary in your review settings.

@coderabbitai
Copy link

coderabbitai bot commented Jan 2, 2026

📝 Walkthrough

Walkthrough

Adds on-the-fly compression conversion to downloads: new CLI options --convert-to and --convert-from (bz2, gz, xz), conversion detection and utilities in the download pipeline, propagates parameters through CLI/API, updates docs and .gitignore, and adds tests for conversion behavior.

Changes

Cohort / File(s) Summary
Config & Docs
\.gitignore`, `README.md``
Adds ignore entries test-download/ and vault-token.dat; documents --convert-to / --convert-from, updates download help text and usage examples.
Core Download API
\databusclient/api/download.py``
Adds compression mappings and helpers (_detect_compression_format, _should_convert_file, _get_converted_filename, _convert_compression_format); introduces convert_to / convert_from parameters across download call chain; performs post-download conversion with cleanup and integrates with checksum flow.
CLI Layer
\databusclient/cli.py``
Adds --convert-to and --convert-from click options (choices: bz2, gz, xz), updates command docstring, and forwards new args to the API.
Tests
\tests/test_compression_conversion.py`, `tests/test_download.py``
New tests for format detection, decision logic, filename generation, and end-to-end conversions (bz2↔gz↔xz) with integrity checks; marks long-running collection download test as skipped.

Sequence Diagram(s)

sequenceDiagram
    participant User as CLI User
    participant CLI as cli.download()
    participant API as api.download()
    participant DL as _download_file()
    participant Conv as Conversion Logic
    participant FS as File System

    User->>CLI: run download with --convert-to gz --convert-from bz2
    CLI->>API: download(..., convert_to='gz', convert_from='bz2')
    API->>DL: _download_file(..., convert_to, convert_from)
    DL->>FS: fetch original file (e.g., file.bz2)
    FS-->>DL: file.bz2
    DL->>Conv: _detect_compression_format(file.bz2)
    Conv-->>DL: 'bz2'
    DL->>Conv: _should_convert_file(..., convert_to='gz', convert_from='bz2')
    Conv-->>DL: true
    DL->>Conv: _convert_compression_format(file.bz2 → file.gz)
    Conv->>Conv: decompress bz2, compress to gz
    Conv-->>DL: file.gz
    DL->>FS: remove original file.bz2, write file.gz
    DL-->>API: download+conversion complete
    API-->>CLI: success
    CLI-->>User: finished
Loading

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Possibly related PRs

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
Check name Status Explanation
Title check ✅ Passed The title clearly summarizes the main change: adding on-the-fly compression conversion during download, with reference to the related issue #18.
Description check ✅ Passed The description covers all required template sections: detailed feature description, type of change (new feature and documentation update selected), and comprehensive checklist with all items marked complete.
Linked Issues check ✅ Passed The PR implements all core requirements from issue #18: --convert-to CLI option for bz2/gz/xz formats, --convert-from filter option, compression detection, skip logic for identical formats, and decompress-recompress pipeline.
Out of Scope Changes check ✅ Passed All changes are in scope: compression conversion implementation in download.py/cli.py, documentation updates in README.md, comprehensive test coverage, .gitignore updates for test artifacts, and test fixes for pre-existing flakiness.
Docstring Coverage ✅ Passed Docstring coverage is 90.48% which is sufficient. The required threshold is 80.00%.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing touches
  • 📝 Generate docstrings
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 3

🧹 Nitpick comments (2)
tests/test_compression_conversion.py (1)

66-135: Comprehensive happy-path coverage.

The conversion tests validate all format pairs (bz2↔gz↔xz) with proper data integrity checks and verification that original files are removed after successful conversion. The tests use temporary directories for isolation, which is good practice.

Consider adding tests for error conditions such as corrupted source files, permission errors, or disk space issues to improve test robustness and ensure graceful failure handling.

💡 Optional: Add error condition tests

Consider adding tests like:

def test_convert_with_corrupted_file():
    """Test conversion fails gracefully with corrupted input"""
    with tempfile.TemporaryDirectory() as tmpdir:
        # Create invalid compressed file
        bad_file = os.path.join(tmpdir, "bad.txt.bz2")
        with open(bad_file, 'wb') as f:
            f.write(b"not actually compressed data")
        
        target_file = os.path.join(tmpdir, "bad.txt.gz")
        
        # Should raise RuntimeError and not create partial target
        with pytest.raises(RuntimeError):
            _convert_compression_format(bad_file, target_file, "bz2", "gz")
        
        assert os.path.exists(bad_file)  # Original preserved on failure
        assert not os.path.exists(target_file)  # No partial output

This would validate the error handling mentioned in the conversion implementation.

databusclient/api/download.py (1)

134-138: Consider preserving original exception details.

Wrapping the exception in RuntimeError on Line 138 may obscure the root cause. Consider chaining the exception or providing more context.

🔎 Suggested improvement
     except Exception as e:
         # If conversion fails, ensure the partial target file is removed
         if os.path.exists(target_file):
             os.remove(target_file)
-        raise RuntimeError(f"Compression conversion failed: {e}")
+        raise RuntimeError(f"Compression conversion failed: {e}") from e

This preserves the full exception chain for debugging.

📜 Review details

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between b2c3f1c and 46df914.

📒 Files selected for processing (6)
  • .gitignore
  • README.md
  • databusclient/api/download.py
  • databusclient/cli.py
  • tests/test_compression_conversion.py
  • tests/test_download.py
🧰 Additional context used
🧬 Code graph analysis (1)
tests/test_compression_conversion.py (1)
databusclient/api/download.py (5)
  • download (700-836)
  • _detect_compression_format (31-45)
  • _should_convert_file (48-79)
  • _get_converted_filename (82-99)
  • _convert_compression_format (102-138)
🔇 Additional comments (18)
.gitignore (1)

3-4: LGTM!

The additions are appropriate for keeping test artifacts and sensitive authentication data out of version control.

tests/test_download.py (2)

3-3: LGTM!

The pytest import is correctly added to support the skip decorator.


30-30: LGTM!

Skipping the long-running, network-dependent collection test is a sensible practice for faster local test execution. The reason is clearly documented.

tests/test_compression_conversion.py (4)

17-24: LGTM!

The format detection tests comprehensively cover all supported formats (bz2, gz, xz), uncompressed files, and case insensitivity. This validates the detection logic correctly.


26-57: LGTM!

The conversion decision logic tests are thorough, covering all key scenarios: no target specified, uncompressed files, same source/target formats, valid conversions, and the convert_from filter behavior.


59-64: LGTM!

The filename conversion tests validate all format pair transformations concisely.


137-138: LGTM!

Direct pytest execution support is convenient for development and debugging.

README.md (3)

169-173: LGTM!

The documentation for the new compression conversion options is clear and accurately describes the functionality. The supported formats and optional filtering behavior are well explained.


185-211: LGTM!

The updated help text and reformatted option descriptions improve readability and accurately reflect the new compression conversion capabilities.


264-274: LGTM!

The examples section effectively demonstrates the three primary use cases: converting all compressed files, filtering by source format, and converting entire collections. This will help users understand how to leverage the new feature.

databusclient/cli.py (2)

161-170: LGTM!

The new CLI options are well-integrated with appropriate validation (restricted choices: bz2, gz, xz) and case-insensitive matching, which aligns with the format detection logic. The help text clearly describes each option's purpose.


180-197: LGTM!

The parameters are correctly added to the function signature, the docstring is updated to reflect the new functionality, and the values are properly propagated to the api_download call.

databusclient/api/download.py (6)

3-28: LGTM! Clean compression module setup.

The imports and constant mappings are well-structured and use standard library modules for compression handling.


31-45: LGTM! Robust extension detection.

The case-insensitive matching ensures reliable format detection across different filename conventions.


48-79: LGTM! Conversion decision logic is sound.

The function correctly handles all conversion scenarios including filtering and format matching.


141-263: LGTM! Clean integration of conversion into download flow.

The conversion logic is correctly placed after download completion and size verification. The function properly handles filenames and paths when calling conversion helpers.


265-299: LGTM! Consistent parameter propagation across all download functions.

All internal download functions properly accept and forward convert_to and convert_from parameters through the call chain, with clear documentation.

Also applies to: 437-476, 478-512, 515-557, 624-663


700-836: The CLI already validates compression format parameters using Click's built-in Choice validator, which automatically rejects invalid values before they reach the API function. No additional validation is needed.

Likely an incorrect or invalid review comment.

Comment on lines +102 to +138
def _convert_compression_format(
source_file: str, target_file: str, source_format: str, target_format: str
) -> None:
"""
Convert a compressed file from one format to another.
Parameters:
- source_file: Path to source compressed file
- target_file: Path to target compressed file
- source_format: Source compression format ('bz2', 'gz', 'xz')
- target_format: Target compression format ('bz2', 'gz', 'xz')
"""
source_module = COMPRESSION_MODULES[source_format]
target_module = COMPRESSION_MODULES[target_format]

print(f"Converting {source_format}{target_format}: {os.path.basename(source_file)}")

# Decompress and recompress with progress indication
chunk_size = 8192

try:
with source_module.open(source_file, 'rb') as sf:
with target_module.open(target_file, 'wb') as tf:
while True:
chunk = sf.read(chunk_size)
if not chunk:
break
tf.write(chunk)

# Remove the original file after successful conversion
os.remove(source_file)
print(f"Conversion complete: {os.path.basename(target_file)}")
except Exception as e:
# If conversion fails, ensure the partial target file is removed
if os.path.exists(target_file):
os.remove(target_file)
raise RuntimeError(f"Compression conversion failed: {e}")
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🔴 Critical

Add input validation for compression formats.

Lines 114-115 access COMPRESSION_MODULES using source_format and target_format without validation. If invalid format strings are passed (e.g., from external input or typos), this will raise KeyError.

🔎 Proposed fix with validation
 def _convert_compression_format(
     source_file: str, target_file: str, source_format: str, target_format: str
 ) -> None:
     """
     Convert a compressed file from one format to another.
     
     Parameters:
     - source_file: Path to source compressed file
     - target_file: Path to target compressed file
     - source_format: Source compression format ('bz2', 'gz', 'xz')
     - target_format: Target compression format ('bz2', 'gz', 'xz')
     """
+    # Validate compression formats
+    if source_format not in COMPRESSION_MODULES:
+        raise ValueError(f"Unsupported source compression format: {source_format}")
+    if target_format not in COMPRESSION_MODULES:
+        raise ValueError(f"Unsupported target compression format: {target_format}")
+    
     source_module = COMPRESSION_MODULES[source_format]
     target_module = COMPRESSION_MODULES[target_format]
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
def _convert_compression_format(
source_file: str, target_file: str, source_format: str, target_format: str
) -> None:
"""
Convert a compressed file from one format to another.
Parameters:
- source_file: Path to source compressed file
- target_file: Path to target compressed file
- source_format: Source compression format ('bz2', 'gz', 'xz')
- target_format: Target compression format ('bz2', 'gz', 'xz')
"""
source_module = COMPRESSION_MODULES[source_format]
target_module = COMPRESSION_MODULES[target_format]
print(f"Converting {source_format}{target_format}: {os.path.basename(source_file)}")
# Decompress and recompress with progress indication
chunk_size = 8192
try:
with source_module.open(source_file, 'rb') as sf:
with target_module.open(target_file, 'wb') as tf:
while True:
chunk = sf.read(chunk_size)
if not chunk:
break
tf.write(chunk)
# Remove the original file after successful conversion
os.remove(source_file)
print(f"Conversion complete: {os.path.basename(target_file)}")
except Exception as e:
# If conversion fails, ensure the partial target file is removed
if os.path.exists(target_file):
os.remove(target_file)
raise RuntimeError(f"Compression conversion failed: {e}")
def _convert_compression_format(
source_file: str, target_file: str, source_format: str, target_format: str
) -> None:
"""
Convert a compressed file from one format to another.
Parameters:
- source_file: Path to source compressed file
- target_file: Path to target compressed file
- source_format: Source compression format ('bz2', 'gz', 'xz')
- target_format: Target compression format ('bz2', 'gz', 'xz')
"""
# Validate compression formats
if source_format not in COMPRESSION_MODULES:
raise ValueError(f"Unsupported source compression format: {source_format}")
if target_format not in COMPRESSION_MODULES:
raise ValueError(f"Unsupported target compression format: {target_format}")
source_module = COMPRESSION_MODULES[source_format]
target_module = COMPRESSION_MODULES[target_format]
print(f"Converting {source_format}{target_format}: {os.path.basename(source_file)}")
# Decompress and recompress with progress indication
chunk_size = 8192
try:
with source_module.open(source_file, 'rb') as sf:
with target_module.open(target_file, 'wb') as tf:
while True:
chunk = sf.read(chunk_size)
if not chunk:
break
tf.write(chunk)
# Remove the original file after successful conversion
os.remove(source_file)
print(f"Conversion complete: {os.path.basename(target_file)}")
except Exception as e:
# If conversion fails, ensure the partial target file is removed
if os.path.exists(target_file):
os.remove(target_file)
raise RuntimeError(f"Compression conversion failed: {e}")
🤖 Prompt for AI Agents
In databusclient/api/download.py around lines 102 to 138, the function
_convert_compression_format accesses COMPRESSION_MODULES[source_format] and
COMPRESSION_MODULES[target_format] without validating the inputs; add explicit
validation at the top of the function to check that source_format and
target_format are strings and keys present in COMPRESSION_MODULES, and if not
raise a clear ValueError listing allowed formats; after validation, safely
retrieve the modules (or use dict access knowing keys exist) and proceed as
before; ensure error messages are descriptive (e.g., "Unsupported compression
format: <format>. Supported: <list>") so callers can correct invalid inputs.

@Integer-Ctrl
Copy link
Contributor

Hi @DhanashreePetare! Just a reminder that this PR has conflicts. No rush

@DhanashreePetare DhanashreePetare force-pushed the feature/on-the-fly-conversion-issue-18 branch from 98e4f35 to 5277f26 Compare January 13, 2026 19:14
Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🤖 Fix all issues with AI agents
In @databusclient/api/download.py:
- Around line 82-99: The filename comparison is case-sensitive in
_get_converted_filename, causing mismatches with _detect_compression_format
which lowercases names; change the check to use a case-insensitive test (e.g.,
filename.lower().endswith(source_ext)) while still slicing the original filename
by len(source_ext) so the returned name preserves original casing; keep using
COMPRESSION_EXTENSIONS for source_ext/target_ext and only alter the conditional
and slicing logic accordingly.
- Around line 257-262: The bug is that the filename variable `file` is being
shadowed by a file handle in the `with` block so calls to
`_should_convert_file(file, ...)` end up passing a closed file handle instead of
the original filename string; fix by using the original filename string when
calling `_should_convert_file`, `_get_converted_filename`, and
`_convert_compression_format` (e.g., pass the variable that holds the
URL-derived filename) or rename the `with` block file handle (e.g., `fh` or
`fhandle`) so `file` continues to refer to the filename string; update calls to
`_should_convert_file`, `_get_converted_filename`, and
`_convert_compression_format` accordingly to avoid calling `.lower()` on a file
object.
🧹 Nitpick comments (1)
tests/test_compression_conversion.py (1)

59-63: Consider adding test for uppercase filenames.

These tests cover the happy path but don't exercise the case-sensitivity edge case (e.g., DATA.TXT.GZ). Adding such a test would help catch the bug flagged in _get_converted_filename.

Suggested test addition
def test_get_converted_filename():
    """Test filename conversion"""
    assert _get_converted_filename("data.txt.bz2", "bz2", "gz") == "data.txt.gz"
    assert _get_converted_filename("data.txt.gz", "gz", "xz") == "data.txt.xz"
    assert _get_converted_filename("data.txt.xz", "xz", "bz2") == "data.txt.bz2"
    # Uppercase extension handling
    assert _get_converted_filename("DATA.TXT.GZ", "gz", "xz") == "DATA.TXT.xz"
📜 Review details

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 46df914 and 5277f26.

📒 Files selected for processing (6)
  • .gitignore
  • README.md
  • databusclient/api/download.py
  • databusclient/cli.py
  • tests/test_compression_conversion.py
  • tests/test_download.py
🚧 Files skipped from review as they are similar to previous changes (2)
  • README.md
  • databusclient/cli.py
🧰 Additional context used
🧬 Code graph analysis (1)
tests/test_compression_conversion.py (1)
databusclient/api/download.py (5)
  • download (700-836)
  • _detect_compression_format (31-45)
  • _should_convert_file (48-79)
  • _get_converted_filename (82-99)
  • _convert_compression_format (102-138)
🔇 Additional comments (14)
.gitignore (1)

3-4: LGTM!

Appropriate additions to ignore test artifacts (test-download/) and sensitive token files (vault-token.dat).

databusclient/api/download.py (6)

17-28: LGTM!

Clean dictionary mappings for compression formats. This design allows easy extension to additional formats in the future.


31-45: LGTM!

Case-insensitive detection logic is correct and well-documented.


48-79: LGTM!

Decision logic correctly handles all edge cases: no target specified, uncompressed files, same source/target formats, and source format filtering.


102-138: LGTM!

The chunked decompress → recompress pipeline is correctly implemented with proper cleanup on failure. The 8 KiB chunk size is reasonable for streaming.


265-298: LGTM!

The convert_to and convert_from parameters are correctly propagated through _download_files to the underlying _download_file calls.


700-727: LGTM!

The public download() API correctly exposes and documents the new compression conversion parameters, with clear docstrings explaining the supported formats.

tests/test_compression_conversion.py (5)

1-14: LGTM!

Appropriate imports for testing the compression conversion utilities.


17-23: LGTM!

Good test coverage including case-insensitive matching.


26-56: LGTM!

Comprehensive test coverage for all decision branches in _should_convert_file.


66-134: LGTM!

Well-structured conversion tests covering all format pairs with proper data integrity verification. The use of tempfile.TemporaryDirectory ensures proper cleanup.


137-138: LGTM!

Standard pytest main block for direct execution.

tests/test_download.py (2)

3-4: LGTM!

Added pytest import to support the skip decorator.


30-32: LGTM!

Skipping the long-running collection download test is appropriate for CI stability. The skip reason clearly documents why this test is disabled.

Comment on lines +257 to +262
# --- 6. Convert compression format if requested ---
should_convert, source_format = _should_convert_file(file, convert_to, convert_from)
if should_convert and source_format:
target_filename = _get_converted_filename(file, source_format, convert_to)
target_filepath = os.path.join(localDir, target_filename)
_convert_compression_format(filename, target_filepath, source_format, convert_to)
Copy link

@coderabbitai coderabbitai bot Jan 13, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🔴 Critical

Critical bug: file variable shadowed by file handle.

At line 177, file is assigned the filename from the URL. At line 247, file is reassigned to the open file handle inside the with block. At line 258, _should_convert_file(file, ...) is called outside the with block, but file still references the closed file handle object, not the filename string.

This will cause _detect_compression_format to call .lower() on a file handle, raising an AttributeError.

Proposed fix

Use the original filename variable instead of file:

     # --- 6. Convert compression format if requested ---
-    should_convert, source_format = _should_convert_file(file, convert_to, convert_from)
+    original_filename = url.split("/")[-1]
+    should_convert, source_format = _should_convert_file(original_filename, convert_to, convert_from)
     if should_convert and source_format:
-        target_filename = _get_converted_filename(file, source_format, convert_to)
+        target_filename = _get_converted_filename(original_filename, source_format, convert_to)
         target_filepath = os.path.join(localDir, target_filename)
         _convert_compression_format(filename, target_filepath, source_format, convert_to)

Alternatively, rename the file handle variable in the with statement to avoid shadowing:

-    with open(filename, "wb") as file:
+    with open(filename, "wb") as f:
         for data in response.iter_content(block_size):
             progress_bar.update(len(data))
-            file.write(data)
+            f.write(data)
🤖 Prompt for AI Agents
In @databusclient/api/download.py around lines 257 - 262, The bug is that the
filename variable `file` is being shadowed by a file handle in the `with` block
so calls to `_should_convert_file(file, ...)` end up passing a closed file
handle instead of the original filename string; fix by using the original
filename string when calling `_should_convert_file`, `_get_converted_filename`,
and `_convert_compression_format` (e.g., pass the variable that holds the
URL-derived filename) or rename the `with` block file handle (e.g., `fh` or
`fhandle`) so `file` continues to refer to the filename string; update calls to
`_should_convert_file`, `_get_converted_filename`, and
`_convert_compression_format` accordingly to avoid calling `.lower()` on a file
object.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@DhanashreePetare this is a critical issue. The first test I did failed immediately:
databusclient download https://databus.dbpedia.org/dbpedia/dbpedia-wikipedia-kg-dump/labels/2025-12-01/labels_graph=de-dbpedia-org_partition=rdfs-label.ttl.gz --convert-to bz2

It is good practice to test your own implementation using real-world examples. This is an effective way of identifying errors quickly.

Additionally, once problems have been found and fixed, it is often a good idea to add a test for them right away 😄

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems like the humans are having a chat. I'll hop back into my burrow for now. If you need me again, just tag @coderabbitai in a new comment, and I'll come hopping out!

Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (11)
databusclient/api/download.py (8)

269-303: Critical syntax error causing pipeline failure.

The function _download_file has malformed syntax with duplicate function signatures and docstrings. Lines 276-290 contain the new docstring with convert parameters, but then lines 291-303 contain what appears to be remnants of the original function definition mixed in, causing a SyntaxError: Simple statements must be separated by newlines or semicolons.

The function should have a single, clean signature and docstring.

🐛 Proposed fix - consolidate the function definition
 def _download_file(
     url,
     localDir,
     vault_token_file=None,
     databus_key=None,
     auth_url=None,
     client_id=None,
     convert_to=None,
     convert_from=None,
-) -> None:
-    """
-    Download a file from the internet with a progress bar using tqdm.
-
-    Parameters:
-    - url: the URL of the file to download
-    - localDir: Local directory to download file to. If None, the databus folder structure is created in the current working directory.
-    - vault_token_file: Path to Vault refresh token file
-    - databus_key: Databus API key for protected downloads
-    - auth_url: Keycloak token endpoint URL
-    - client_id: Client ID for token exchange
-    - convert_to: Target compression format for on-the-fly conversion
-    - convert_from: Optional source compression format filter
     validate_checksum: bool = False,
     expected_checksum: str | None = None,
 ) -> None:
-    """Download a file from the internet with a progress bar using tqdm.
+    """Download a file from the internet with a progress bar using tqdm.
 
     Args:
         url: The URL of the file to download.
         localDir: Local directory to download file to. If None, the databus folder structure is created in the current working directory.
         vault_token_file: Path to Vault refresh token file.
         databus_key: Databus API key for protected downloads.
         auth_url: Keycloak token endpoint URL.
         client_id: Client ID for token exchange.
+        convert_to: Target compression format for on-the-fly conversion.
+        convert_from: Optional source compression format filter.
+        validate_checksum: Whether to validate the file checksum after download.
+        expected_checksum: Expected SHA256 checksum for validation.
     """

443-470: Checksum validation will fail after compression conversion.

After conversion, the original file is deleted (in _convert_compression_format line 135), but checksum validation at line 453 still attempts to validate filename which no longer exists. Additionally, the expected checksum from Databus metadata corresponds to the original compressed file, not the converted one.

Either:

  1. Skip checksum validation when conversion is performed, or
  2. Validate the checksum before conversion
🐛 Proposed fix - validate checksum before conversion
     # --- 5. Verify download size ---
     if total_size_in_bytes != 0 and progress_bar.n != total_size_in_bytes:
         raise IOError("Downloaded size does not match Content-Length header")

-    # --- 6. Convert compression format if requested ---
-    should_convert, source_format = _should_convert_file(file, convert_to, convert_from)
-    if should_convert and source_format:
-        target_filename = _get_converted_filename(file, source_format, convert_to)
-        target_filepath = os.path.join(localDir, target_filename)
-        _convert_compression_format(filename, target_filepath, source_format, convert_to)
-    # --- 6. Optional checksum validation ---
+    # --- 6. Optional checksum validation (must happen before conversion) ---
     if validate_checksum:
         # reuse compute_sha256_and_length from webdav extension
         try:
             actual, _ = compute_sha256_and_length(filename)
         except (OSError, IOError) as e:
             print(f"WARNING: error computing checksum for {filename}: {e}")
             actual = None

         if expected_checksum is None:
             print(f"WARNING: no expected checksum available for {filename}; skipping validation")
         elif actual is None:
             print(f"WARNING: could not compute checksum for {filename}; skipping validation")
         else:
             if actual.lower() != expected_checksum.lower():
                 try: 
                     os.remove(filename)  # delete corrupted file
                 except OSError: 
                     pass
                 raise IOError(
                     f"Checksum mismatch for {filename}: expected {expected_checksum}, got {actual}"
                 )

+    # --- 7. Convert compression format if requested ---
+    should_convert, source_format = _should_convert_file(file, convert_to, convert_from)
+    if should_convert and source_format:
+        target_filename = _get_converted_filename(file, source_format, convert_to)
+        target_filepath = os.path.join(localDir, target_filename)
+        _convert_compression_format(filename, target_filepath, source_format, convert_to)

473-507: Same syntax error pattern as _download_file.

The _download_files function has the same malformed definition with duplicate/overlapping docstrings. This needs to be consolidated similar to the fix for _download_file.


659-695: Same syntax error pattern in _download_collection.

This function also has overlapping/duplicate docstrings that need consolidation.


720-752: Same syntax error pattern in _download_version.

This function also has overlapping/duplicate docstrings that need consolidation.


777-812: Same syntax error pattern in _download_artifact.

This function also has overlapping/duplicate docstrings that need consolidation.


906-941: Same syntax error pattern in _download_group.

This function also has overlapping/duplicate docstrings that need consolidation.


1004-1029: Same syntax error pattern in download function.

The main download function also has overlapping/duplicate docstrings that need consolidation.

databusclient/cli.py (3)

165-177: Malformed click option definition.

The --convert-from option definition is incomplete. Line 174 shows "--validate-checksum", which appears to be the start of a separate option that was incorrectly merged into the --convert-from definition.

🐛 Proposed fix - separate the options properly
 @click.option(
     "--convert-from",
     type=click.Choice(["bz2", "gz", "xz"], case_sensitive=False),
     help="Source compression format to convert from (optional filter). Only files with this compression will be converted.",
+)
+@click.option(
     "--validate-checksum",
     is_flag=True,
     help="Validate checksums of downloaded files"
 )

178-191: Malformed function signature.

The function signature has invalid syntax with a closing parenthesis, then a parameter, then another closing parenthesis on separate lines (189-191). This needs to be a single, properly formatted parameter list.

🐛 Proposed fix
 def download(
     databusuris: List[str],
     localdir,
     databus,
     vault_token,
     databus_key,
     all_versions,
     authurl,
     clientid,
     convert_to,
     convert_from,
-):
-    validate_checksum,
-):    
+    validate_checksum,
+):

196-221: Duplicate api_download calls - only one should exist.

There are two calls to api_download: one at lines 196-207 (outside try block, has conversion params, missing validate_checksum) and another at lines 209-219 (inside try block, has validate_checksum, missing conversion params). This should be a single call with all parameters inside the try block.

🐛 Proposed fix - consolidate into single call
-    api_download(
-        localDir=localdir,
-        endpoint=databus,
-        databusURIs=databusuris,
-        token=vault_token,
-        databus_key=databus_key,
-        all_versions=all_versions,
-        auth_url=authurl,
-        client_id=clientid,
-        convert_to=convert_to,
-        convert_from=convert_from,
-    )
     try:
         api_download(
             localDir=localdir,
             endpoint=databus,
             databusURIs=databusuris,
             token=vault_token,
             databus_key=databus_key,
             all_versions=all_versions,
             auth_url=authurl,
             client_id=clientid,
-            validate_checksum=validate_checksum
+            convert_to=convert_to,
+            convert_from=convert_from,
+            validate_checksum=validate_checksum,
         )            
     except DownloadAuthError as e:
         raise click.ClickException(str(e))
🤖 Fix all issues with AI agents
In @databusclient/api/download.py:
- Around line 6-8: The imports in this module duplicate List from typing; remove
the redundant "from typing import List" (the second occurrence) and consolidate
typing imports so you only have one line like "from typing import List,
Optional, Tuple" while keeping the existing "import re" intact; update the
import block (references: List, Optional, Tuple, re) accordingly.
🧹 Nitpick comments (1)
databusclient/api/download.py (1)

85-102: Consider case-insensitive extension matching.

The _get_converted_filename function uses filename.endswith(source_ext) which is case-sensitive, but _detect_compression_format lowercases the filename before checking. If a file has a mixed-case extension like .BZ2, detection will work but filename conversion will append rather than replace the extension.

♻️ Proposed fix for case-insensitive handling
 def _get_converted_filename(filename: str, source_format: str, target_format: str) -> str:
     source_ext = COMPRESSION_EXTENSIONS[source_format]
     target_ext = COMPRESSION_EXTENSIONS[target_format]
     
-    if filename.endswith(source_ext):
+    if filename.lower().endswith(source_ext):
         return filename[:-len(source_ext)] + target_ext
     return filename + target_ext
📜 Review details

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 5277f26 and 390d205.

📒 Files selected for processing (4)
  • README.md
  • databusclient/api/download.py
  • databusclient/cli.py
  • tests/test_download.py
🚧 Files skipped from review as they are similar to previous changes (1)
  • tests/test_download.py
🧰 Additional context used
🧬 Code graph analysis (1)
databusclient/api/download.py (1)
databusclient/cli.py (1)
  • download (178-190)
🪛 GitHub Actions: Python CI (Lint & pytest)
databusclient/api/download.py

[error] 294-294: ruff check failed: SyntaxError: Simple statements must be separated by newlines or semicolons.

🔇 Additional comments (6)
README.md (2)

177-180: LGTM!

The documentation for --convert-to and --convert-from options is clear and provides helpful examples. The descriptions accurately reflect the implementation behavior (decompress → recompress pipeline, format filtering).


272-282: LGTM!

The examples section effectively demonstrates the three key use cases for compression conversion: converting all files to a target format, filtering by source format, and unifying a collection. This aligns well with the PR objectives.

databusclient/api/download.py (4)

20-31: LGTM!

The compression format mappings are clean and provide a single source of truth for supported formats. The module-to-format mapping correctly uses Python's built-in compression libraries.


34-48: LGTM!

The compression detection is case-insensitive which is a good defensive approach for handling filenames with varied casing.


51-82: LGTM!

The conversion decision logic correctly handles all edge cases: no target format, uncompressed files, same source/target format, and optional source format filtering.


105-141: LGTM!

The conversion implementation properly:

  • Uses chunked streaming (8KB) to avoid memory issues with large files
  • Cleans up partial target files on failure
  • Removes the original file only after successful conversion
  • Provides user feedback during conversion

Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
databusclient/api/download.py (1)

432-459: Critical: Variable shadowing causes runtime crash and checksum validation on deleted file.

Two critical issues:

  1. Variable shadowing bug: At line 306, file = url.split("/")[-1] stores the filename string. But at line 422, with open(filename, "wb") as file: shadows this with a file handle. At line 433, _should_convert_file(file, ...) passes the closed file handle, not the filename string. This will crash with AttributeError: '_io.BufferedWriter' object has no attribute 'lower'.

  2. Checksum on deleted file: If conversion succeeds, _convert_compression_format deletes the original file at filename, but line 442 tries to compute checksum on filename which no longer exists.

🐛 Proposed fix
-    file = url.split("/")[-1]
-    filename = os.path.join(localDir, file)
+    base_filename = url.split("/")[-1]
+    filename = os.path.join(localDir, base_filename)
     print(f"Download file: {url}")

And update the download and conversion section:

     # --- 4. Download with progress bar ---
     total_size_in_bytes = int(response.headers.get("content-length", 0))
     block_size = 1024  # 1 KiB
 
     progress_bar = tqdm(total=total_size_in_bytes, unit="iB", unit_scale=True)
-    with open(filename, "wb") as file:
+    with open(filename, "wb") as f:
         for data in response.iter_content(block_size):
             progress_bar.update(len(data))
-            file.write(data)
+            f.write(data)
     progress_bar.close()
 
     # --- 5. Verify download size ---
     if total_size_in_bytes != 0 and progress_bar.n != total_size_in_bytes:
         raise IOError("Downloaded size does not match Content-Length header")
 
     # --- 6. Convert compression format if requested ---
-    should_convert, source_format = _should_convert_file(file, convert_to, convert_from)
+    should_convert, source_format = _should_convert_file(base_filename, convert_to, convert_from)
+    checksum_filepath = filename  # Track which file to validate
     if should_convert and source_format:
-        target_filename = _get_converted_filename(file, source_format, convert_to)
+        target_filename = _get_converted_filename(base_filename, source_format, convert_to)
         target_filepath = os.path.join(localDir, target_filename)
         _convert_compression_format(filename, target_filepath, source_format, convert_to)
-    # --- 6. Optional checksum validation ---
+        checksum_filepath = target_filepath  # Validate converted file
+    # --- 7. Optional checksum validation ---
     if validate_checksum:
         # reuse compute_sha256_and_length from webdav extension
         try:
-            actual, _ = compute_sha256_and_length(filename)
+            actual, _ = compute_sha256_and_length(checksum_filepath)
         except (OSError, IOError) as e:
-            print(f"WARNING: error computing checksum for {filename}: {e}")
+            print(f"WARNING: error computing checksum for {checksum_filepath}: {e}")
             actual = None

Note: Checksum validation after conversion will compare against the original expected checksum, which won't match the converted file. Consider whether checksum validation should be skipped or adapted when conversion is performed.

🤖 Fix all issues with AI agents
In @databusclient/api/download.py:
- Around line 85-102: The function _get_converted_filename fails for uppercase
extensions; change the conditional to perform a case-insensitive suffix check
(e.g., use filename.lower().endswith(source_ext.lower())) so files like
"file.BZ2" are recognized, then slice off the matched extension length from the
original filename (filename[:-len(source_ext)]) and append target_ext as before;
keep using COMPRESSION_EXTENSIONS for source_ext/target_ext and leave the append
behavior when no match unchanged.
🧹 Nitpick comments (1)
databusclient/api/download.py (1)

105-141: LGTM with optional improvement note.

Good error handling with cleanup of partial output on failure. The chunked I/O approach is memory-efficient.

Consider using logging instead of print() for better control over verbosity, but this is fine for a CLI tool.

📜 Review details

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 390d205 and cbd59ac.

📒 Files selected for processing (2)
  • databusclient/api/download.py
  • databusclient/cli.py
🧰 Additional context used
🧬 Code graph analysis (1)
databusclient/api/download.py (2)
databusclient/api/utils.py (1)
  • compute_sha256_and_length (62-72)
databusclient/extensions/webdav.py (1)
  • compute_sha256_and_length (16-34)
🔇 Additional comments (12)
databusclient/api/download.py (10)

3-6: LGTM!

Standard library imports for compression modules and typing are correctly added.


20-31: LGTM!

Clean and correct mappings for the supported compression formats.


34-48: LGTM!

Case-insensitive extension detection is correctly implemented.


51-82: LGTM!

Conversion decision logic correctly handles all cases: no target format, uncompressed files, same-format skip, and source format filtering.


462-503: LGTM!

Parameters correctly propagated to _download_files and forwarded to individual file downloads.


639-686: LGTM!

Collection download correctly propagates conversion parameters.


689-733: LGTM!

Version download correctly propagates conversion parameters.


736-788: LGTM!

Artifact download correctly propagates conversion parameters through version iterations.


852-893: LGTM!

Group download correctly propagates conversion parameters to artifact downloads.


930-1098: LGTM!

Main download function correctly propagates conversion parameters to all download paths (collection, file, version, artifact, group, and query-based downloads).

databusclient/cli.py (2)

165-174: LGTM!

CLI options are correctly defined with appropriate choices and help text. Case-insensitive matching is a good UX choice.


180-212: LGTM!

Parameters correctly added to function signature and properly passed to api_download. The docstring update accurately describes the new capability.

Comment on lines +85 to +102
def _get_converted_filename(filename: str, source_format: str, target_format: str) -> str:
"""
Generate the new filename after compression format conversion.
Parameters:
- filename: Original filename
- source_format: Source compression format ('bz2', 'gz', 'xz')
- target_format: Target compression format ('bz2', 'gz', 'xz')
Returns:
- New filename with updated extension
"""
source_ext = COMPRESSION_EXTENSIONS[source_format]
target_ext = COMPRESSION_EXTENSIONS[target_format]

if filename.endswith(source_ext):
return filename[:-len(source_ext)] + target_ext
return filename + target_ext
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

Case sensitivity mismatch causes incorrect filename generation.

The detection function uses lowercase comparison, but this function checks filename.endswith(source_ext) with the lowercase extension. If a file has an uppercase extension (e.g., file.BZ2), the replacement won't match and will incorrectly append the target extension (e.g., file.BZ2.gz instead of file.gz).

🔧 Proposed fix
 def _get_converted_filename(filename: str, source_format: str, target_format: str) -> str:
     source_ext = COMPRESSION_EXTENSIONS[source_format]
     target_ext = COMPRESSION_EXTENSIONS[target_format]
     
-    if filename.endswith(source_ext):
+    if filename.lower().endswith(source_ext):
         return filename[:-len(source_ext)] + target_ext
     return filename + target_ext
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
def _get_converted_filename(filename: str, source_format: str, target_format: str) -> str:
"""
Generate the new filename after compression format conversion.
Parameters:
- filename: Original filename
- source_format: Source compression format ('bz2', 'gz', 'xz')
- target_format: Target compression format ('bz2', 'gz', 'xz')
Returns:
- New filename with updated extension
"""
source_ext = COMPRESSION_EXTENSIONS[source_format]
target_ext = COMPRESSION_EXTENSIONS[target_format]
if filename.endswith(source_ext):
return filename[:-len(source_ext)] + target_ext
return filename + target_ext
def _get_converted_filename(filename: str, source_format: str, target_format: str) -> str:
"""
Generate the new filename after compression format conversion.
Parameters:
- filename: Original filename
- source_format: Source compression format ('bz2', 'gz', 'xz')
- target_format: Target compression format ('bz2', 'gz', 'xz')
Returns:
- New filename with updated extension
"""
source_ext = COMPRESSION_EXTENSIONS[source_format]
target_ext = COMPRESSION_EXTENSIONS[target_format]
if filename.lower().endswith(source_ext):
return filename[:-len(source_ext)] + target_ext
return filename + target_ext
🤖 Prompt for AI Agents
In @databusclient/api/download.py around lines 85 - 102, The function
_get_converted_filename fails for uppercase extensions; change the conditional
to perform a case-insensitive suffix check (e.g., use
filename.lower().endswith(source_ext.lower())) so files like "file.BZ2" are
recognized, then slice off the matched extension length from the original
filename (filename[:-len(source_ext)]) and append target_ext as before; keep
using COMPRESSION_EXTENSIONS for source_ext/target_ext and leave the append
behavior when no match unchanged.

@Integer-Ctrl Integer-Ctrl self-requested a review January 14, 2026 15:38
Comment on lines +35 to +43
"""
Detect compression format from file extension.
Parameters:
- filename: Name of the file
Returns:
- Compression format string ('bz2', 'gz', 'xz') or None if not compressed
"""
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please use python google style docstring (see example https://google.github.io/styleguide/pyguide.html#244-decision)

Comment on lines +54 to +64
"""
Determine if a file should be converted and what the source format is.
Parameters:
- filename: Name of the file
- convert_to: Target compression format
- convert_from: Optional source compression format filter
Returns:
- Tuple of (should_convert: bool, source_format: Optional[str])
"""
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please use python google style docstring (see example https://google.github.io/styleguide/pyguide.html#244-decision)

Comment on lines +86 to +96
"""
Generate the new filename after compression format conversion.
Parameters:
- filename: Original filename
- source_format: Source compression format ('bz2', 'gz', 'xz')
- target_format: Target compression format ('bz2', 'gz', 'xz')
Returns:
- New filename with updated extension
"""
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please use python google style docstring (see example https://google.github.io/styleguide/pyguide.html#244-decision)

Comment on lines +257 to +262
# --- 6. Convert compression format if requested ---
should_convert, source_format = _should_convert_file(file, convert_to, convert_from)
if should_convert and source_format:
target_filename = _get_converted_filename(file, source_format, convert_to)
target_filepath = os.path.join(localDir, target_filename)
_convert_compression_format(filename, target_filepath, source_format, convert_to)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@DhanashreePetare this is a critical issue. The first test I did failed immediately:
databusclient download https://databus.dbpedia.org/dbpedia/dbpedia-wikipedia-kg-dump/labels/2025-12-01/labels_graph=de-dbpedia-org_partition=rdfs-label.ttl.gz --convert-to bz2

It is good practice to test your own implementation using real-world examples. This is an effective way of identifying errors quickly.

Additionally, once problems have been found and fixed, it is often a good idea to add a test for them right away 😄

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

On-the-fly Conversion During Download

2 participants