feat: added media wiki ETL! #15
Conversation
Walkthrough
This pull request introduces a MediaWiki ETL pipeline. It adds new modules to extract, transform, and load MediaWiki XML dumps into structured data models using Pydantic. Additionally, it implements an XML reader, a crawler for handling dumps via an external package, and corresponding test suites. Minor updates were also made to .gitignore and requirements.txt.
Sequence Diagram(s)

sequenceDiagram
    participant U as User
    participant ETL as MediawikiETL
    participant C as WikiteamCrawler
    participant DG as DumpGenerator
    participant PT as parse_mediawiki_xml
    participant CIP as CustomIngestionPipeline
    U->>ETL: Initiate ETL process
    ETL->>C: Call crawl(api_url, dump_path)
    C->>DG: Generate XML dump
    DG-->>C: Dump file created
    C-->>ETL: Return dump file path
    ETL->>PT: Parse XML dump
    PT-->>ETL: Return list of Document objects
    ETL->>CIP: Ingest documents
    alt Delete dump enabled
        ETL->>ETL: Delete dump file
    end
Actionable comments posted: 7
🧹 Nitpick comments (7)
hivemind_etl/mediawiki/schema.py (1)
6-9: Consider type constraints.
user_id might benefit from explicit numeric or string constraints. For instance, if user_id is strictly numeric, consider validating with constr(regex="^\d+$") or an integer type; a sketch of this follows.
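For illustration only — a minimal sketch of such a constraint, not the PR's actual schema. It assumes Pydantic v1-style constr (which takes regex=; Pydantic v2 uses pattern= instead):

```python
from typing import Optional

from pydantic import BaseModel, constr


class Contributor(BaseModel):
    username: Optional[str] = None
    # Accept only digit strings; an Optional[int] field would be an
    # alternative if user IDs are guaranteed to be numeric.
    user_id: Optional[constr(regex=r"^\d+$")] = None
```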
hivemind_etl/mediawiki/transform_xml.py (1)

56-68: Check for large-scale performance.
Gathering all pages in memory can be expensive if the XML is huge. Consider a streaming approach with iterparse if you anticipate large dumps; a sketch follows.
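As a rough illustration (not the PR's parser), a streaming loop with xml.etree.ElementTree.iterparse could look like the following; the MediaWiki export namespace URI is an assumption for the example:

```python
import xml.etree.ElementTree as ET

# Assumed namespace URI for illustration; real dumps declare their own version.
MW_NS = "{http://www.mediawiki.org/xml/export-0.11/}"


def iter_pages(xml_file: str):
    """Yield <page> elements one at a time instead of loading the whole tree."""
    for _, elem in ET.iterparse(xml_file, events=("end",)):
        if elem.tag == f"{MW_NS}page":
            yield elem
            elem.clear()  # free the element's children to keep memory bounded
```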
hivemind_etl/mediawiki/llama_xml_reader.py (1)

1-4: Fix duplicate module docstrings
The file contains two separate docstring literals at the top of the file, which is unconventional. The second docstring indicates the code was copied from another source.
Combine these into a single comprehensive docstring:
-"""XML Reader.""" - -"""Copied from https://github.com/run-llama/llama_index/blob/main/llama-index-integrations/readers/llama-index-readers-file/llama_index/readers/file/xml/base.py""" +"""XML Reader. + +Adapted from https://github.com/run-llama/llama_index/blob/main/llama-index-integrations/readers/llama-index-readers-file/llama_index/readers/file/xml/base.py +"""tests/unit/test_mediawiki_etl.py (4)
34-44: Fix grammar in comment
Minor grammatical error in the comment.

def test_extract_with_default_path(self):
-    # Create a ETL instance with mocked wikiteam_crawler
+    # Create an ETL instance with mocked wikiteam_crawler
    etl = MediawikiETL(community_id=self.community_id)
    etl.wikiteam_crawler = Mock()
    etl.extract(self.api_url)
    etl.wikiteam_crawler.crawl.assert_called_once_with(
        self.api_url, f"dumps/{self.community_id}.xml"
    )
45-55: Fix grammar in comment
Minor grammatical error in the comment.

def test_extract_with_custom_path(self):
-    # Create a ETL instance with mocked wikiteam_crawler
+    # Create an ETL instance with mocked wikiteam_crawler
    etl = MediawikiETL(community_id=self.community_id)
    etl.wikiteam_crawler = Mock()
    etl.extract(self.api_url, self.custom_path)
    self.assertEqual(etl.dump_path, self.custom_path)
    etl.wikiteam_crawler.crawl.assert_called_once_with(
        self.api_url, self.custom_path
    )
57-90: Expand test coverage for transform method
The test validates several metadata fields but doesn't verify all fields, excluded_embed_metadata_keys, or excluded_llm_metadata_keys. This could miss issues with fields that aren't explicitly tested.
Expand the test to verify more metadata fields and configuration options:
@patch("hivemind_etl.mediawiki.etl.parse_mediawiki_xml")
def test_transform_success(self, mock_parse_mediawiki_xml):
    etl = MediawikiETL(community_id=self.community_id)

    # Mock page data
    mock_page = Mock()
    mock_page.page_id = "123"
    mock_page.title = "Test Page"
    mock_page.namespace = 0
    mock_page.revision = Mock(
        text="Test content",
        revision_id="456",
        parent_revision_id="455",
        timestamp="2024-01-01T00:00:00Z",
        comment="Test edit",
        contributor=Mock(username="testuser", user_id="789"),
        sha1="abc123",
        model="wikitext",
    )
    mock_parse_mediawiki_xml.return_value = [mock_page]

    documents = etl.transform()

    self.assertEqual(len(documents), 1)
    doc = documents[0]
    self.assertIsInstance(doc, Document)
    self.assertEqual(doc.doc_id, "123")
    self.assertEqual(doc.text, "Test content")
    self.assertEqual(doc.metadata["title"], "Test Page")
    self.assertEqual(doc.metadata["namespace"], 0)
    self.assertEqual(doc.metadata["revision_id"], "456")
    self.assertEqual(doc.metadata["contributor_username"], "testuser")
+   # Verify additional metadata fields
+   self.assertEqual(doc.metadata["parent_revision_id"], "455")
+   self.assertEqual(doc.metadata["timestamp"], "2024-01-01T00:00:00Z")
+   self.assertEqual(doc.metadata["comment"], "Test edit")
+   self.assertEqual(doc.metadata["contributor_user_id"], "789")
+   self.assertEqual(doc.metadata["sha1"], "abc123")
+   self.assertEqual(doc.metadata["model"], "wikitext")
+
+   # Verify excluded_embed_metadata_keys
+   self.assertIn("namespace", doc.excluded_embed_metadata_keys)
+   self.assertIn("revision_id", doc.excluded_embed_metadata_keys)
+   self.assertIn("parent_revision_id", doc.excluded_embed_metadata_keys)
+   self.assertIn("sha1", doc.excluded_embed_metadata_keys)
+   self.assertIn("model", doc.excluded_embed_metadata_keys)
+   self.assertIn("contributor_user_id", doc.excluded_embed_metadata_keys)
+   self.assertIn("comment", doc.excluded_embed_metadata_keys)
+   self.assertIn("timestamp", doc.excluded_embed_metadata_keys)
+
+   # Verify excluded_llm_metadata_keys
+   self.assertIn("namespace", doc.excluded_llm_metadata_keys)
+   self.assertIn("revision_id", doc.excluded_llm_metadata_keys)
+   self.assertIn("parent_revision_id", doc.excluded_llm_metadata_keys)
+   self.assertIn("sha1", doc.excluded_llm_metadata_keys)
+   self.assertIn("model", doc.excluded_llm_metadata_keys)
+   self.assertIn("contributor_user_id", doc.excluded_llm_metadata_keys)
9-24: Use a context manager for directory cleanup
The current implementation creates and cleans up directories in setUp and tearDown, but if a test fails or is interrupted, cleanup might not occur.

Consider using a context manager like tempfile.TemporaryDirectory for more reliable cleanup:

import os
import unittest
+import tempfile
from unittest.mock import Mock, patch

from llama_index.core import Document

from hivemind_etl.mediawiki.etl import MediawikiETL


class TestMediawikiETL(unittest.TestCase):
    def setUp(self):
        self.community_id = "test_community"
        self.api_url = "https://example.com/api.php"
        self.custom_path = "custom/path/dump.xml"
+       # Create a temporary directory for dumps
+       self.temp_dir = tempfile.TemporaryDirectory()
+       self.original_dumps_dir = "dumps"
+       # Back up any existing dumps directory, then symlink dumps to the temp dir
+       if os.path.exists(self.original_dumps_dir):
+           self.original_dumps_backup = tempfile.TemporaryDirectory()
+           os.rename(self.original_dumps_dir, os.path.join(self.original_dumps_backup.name, "dumps"))
+       os.symlink(self.temp_dir.name, self.original_dumps_dir)

    def tearDown(self):
+       # Remove the symlink
+       if os.path.islink(self.original_dumps_dir):
+           os.unlink(self.original_dumps_dir)
+       # Restore original dumps directory if it existed
+       if hasattr(self, "original_dumps_backup"):
+           os.rename(os.path.join(self.original_dumps_backup.name, "dumps"), self.original_dumps_dir)
+           self.original_dumps_backup.cleanup()
+       # Clean up the temporary directory
+       self.temp_dir.cleanup()

This approach ensures cleanup happens even if tests fail, and it doesn't interfere with existing files in the current directory.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (8)
- .gitignore (1 hunks)
- hivemind_etl/mediawiki/etl.py (1 hunks)
- hivemind_etl/mediawiki/llama_xml_reader.py (1 hunks)
- hivemind_etl/mediawiki/schema.py (1 hunks)
- hivemind_etl/mediawiki/transform_xml.py (1 hunks)
- hivemind_etl/mediawiki/wikiteam_crawler.py (1 hunks)
- requirements.txt (1 hunks)
- tests/unit/test_mediawiki_etl.py (1 hunks)
🧰 Additional context used
🧬 Code Graph Analysis (3)
hivemind_etl/mediawiki/transform_xml.py (1)
hivemind_etl/mediawiki/schema.py (4)
Contributor (6-8), Page (23-27), Revision (11-20), SiteInfo (30-34)
hivemind_etl/mediawiki/etl.py (2)
hivemind_etl/mediawiki/transform_xml.py (1)
parse_mediawiki_xml (7-122)
hivemind_etl/mediawiki/wikiteam_crawler.py (2)
WikiteamCrawler (7-78), crawl (22-59)
tests/unit/test_mediawiki_etl.py (2)
hivemind_etl/mediawiki/etl.py (4)
MediawikiETL (10-80), extract (18-24), transform (26-71), load (73-80)
hivemind_etl/mediawiki/wikiteam_crawler.py (1)
crawl (22-59)
⏰ Context from checks skipped due to timeout of 90000ms (1)
- GitHub Check: ci / lint / Lint
🔇 Additional comments (9)
requirements.txt (1)
11-11: Details
❓ Verification inconclusive
Ensure package version security.
The newly added wikiteam3==4.4.1 might have known vulnerabilities or require version updates. Consider verifying it with safety or similarly recognized tools.
🏁 Script executed:
#!/bin/bash
# Checking for known vulnerabilities in wikiteam3==4.4.1
pip install safety
safety check --full-report --file requirements.txt

Length of output: 6329
Attention: Manual Verification Needed for wikiteam3==4.4.1 Vulnerability Scan

- The automated check using Safety did not produce any vulnerability report because the safety command wasn't found. It appears Safety was installed in /home/jailuser/.local/bin, which isn't in the current PATH.
- Action Required:
  - Ensure that your PATH includes /home/jailuser/.local/bin or run Safety via its absolute path.
  - Manually verify the package version (e.g., by running /home/jailuser/.local/bin/safety check --full-report --file requirements.txt) to confirm that wikiteam3==4.4.1 is not affected by known vulnerabilities.

.gitignore (1)
164-164: Confirm ignoring XML files is intentional.
Adding *.xml to .gitignore is helpful to avoid large or temporary support files. However, ensure that XML files aren't crucial for version control or partial diffs that might be needed for debugging or data reconstruction.

Also applies to: 166-166
hivemind_etl/mediawiki/schema.py (4)
1-4: Imports look appropriate.
Good use of typing.Optional and pydantic.BaseModel. No immediate concerns here.

11-21: Revise default contributor construction.
contributor: Contributor = Contributor() might look like every Revision instance would share one default Contributor, but Pydantic re-instantiates (deep-copies) mutable defaults, so each Revision gets its own copy. If you actually intended a single shared contributor object, note that explicitly. Most of the time, a separate Contributor() for each Revision is the right approach.

23-28: Model looks consistent.
The Page model aligns well with optional fields. For some data ingestion, you might add validations (e.g., title string length).

30-34: SiteInfo usage.
All fields are optional, which is practical for diverse MediaWiki dumps. Consider clarifying or validating at least sitename or dbname if your downstream logic relies on them.

hivemind_etl/mediawiki/transform_xml.py (3)
7-26: Documentation is thorough.
The docstring provides a clear understanding of the function’s purpose, parameters, and return value. This is very helpful.
47-54: Optional SiteInfo usage.
Your code properly checks if <siteinfo> exists. Ensure the missing site info scenario is indeed valid for your downstream logic. If site details are absolutely required, you may want to log a warning when it's missing.
119-122: Logging usage is good.
You log the total pages processed, which is helpful. Consider structured logging (e.g., JSON format) if you need advanced analytics or container-based orchestration logs.
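If structured logging is adopted later, a minimal stdlib-only sketch (the logger name and JSON fields are illustrative, not part of the PR) could look like:

```python
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each record as a single JSON object per line."""

    def format(self, record: logging.LogRecord) -> str:
        return json.dumps(
            {
                "level": record.levelname,
                "logger": record.name,
                "message": record.getMessage(),
            }
        )


handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logging.getLogger("hivemind_etl.mediawiki").addHandler(handler)
```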
# Extract revision details
revision = page.find("mw:revision", namespaces)
if revision is not None:
    rev_data = Revision()
    rev_id_el = revision.find("mw:id", namespaces)
    rev_data.revision_id = rev_id_el.text if rev_id_el is not None else None

    parentid_el = revision.find("mw:parentid", namespaces)
    rev_data.parent_revision_id = (
        parentid_el.text if parentid_el is not None else None
    )

    timestamp_el = revision.find("mw:timestamp", namespaces)
    rev_data.timestamp = timestamp_el.text if timestamp_el is not None else None

    # Revision comment (present only on some pages)
    comment_el = revision.find("mw:comment", namespaces)
    rev_data.comment = comment_el.text if comment_el is not None else ""

    # Contributor information
    contributor = revision.find("mw:contributor", namespaces)
    if contributor is not None:
        cont_data = Contributor()
        username_el = contributor.find("mw:username", namespaces)
        cont_data.username = (
            username_el.text if username_el is not None else None
        )

        user_id_el = contributor.find("mw:id", namespaces)
        cont_data.user_id = user_id_el.text if user_id_el is not None else None

        rev_data.contributor = cont_data

    # Other revision details like model and format
    model_el = revision.find("mw:model", namespaces)
    rev_data.model = model_el.text if model_el is not None else None

    format_el = revision.find("mw:format", namespaces)
    rev_data.format = format_el.text if format_el is not None else None

    # Extract the full text content; note that XML escapes are retained (e.g., &lt;)
    text_el = revision.find("mw:text", namespaces)
    rev_data.text = text_el.text if text_el is not None else ""

    # Capture sha1 if needed
    sha1_el = revision.find("mw:sha1", namespaces)
    rev_data.sha1 = sha1_el.text if sha1_el is not None else None

    page_data.revision = rev_data
🛠️ Refactor suggestion
Optional deeper contributor logic.
You have robust coverage for revision extraction. If you want to handle multiple revisions, you might extend the logic for multiple <revision> elements per page. Right now, you only grab the first (or last) <revision> block.
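For illustration, a hedged sketch of a multi-revision loop, reusing the PR's namespaces mapping and Revision model; storing the results in a list is hypothetical, since the current schema holds a single revision per Page:

```python
# Iterate over every <revision> element instead of only the first one.
revisions = []
for revision in page.findall("mw:revision", namespaces):
    rev_data = Revision()
    rev_id_el = revision.find("mw:id", namespaces)
    rev_data.revision_id = rev_id_el.text if rev_id_el is not None else None
    # ... extract the remaining fields exactly as in the single-revision branch ...
    revisions.append(rev_data)

# Keeping all of them would need something like a `revisions: list[Revision]`
# field on Page (hypothetical).
```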
Actionable comments posted: 1
🔭 Outside diff range comments (1)
hivemind_etl/activities.py (1)
1-21: 🛠️ Refactor suggestion
Consider adding explicit re-exports if the imported functions should be available through this module.
If the intention is to make the website activity functions available through this module, you should explicitly re-export them to maintain the same public API.
import logging
from typing import Any

from hivemind_etl.website.activities import (
    get_hivemind_website_comminities,
    extract_website,
    transform_website_data,
    load_website_data,
)
from temporalio import activity

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)


@activity.defn
async def say_hello():
    return 7
+
+# Re-export website activities to maintain the public API
+__all__ = [
+    "get_hivemind_website_comminities",
+    "extract_website",
+    "transform_website_data",
+    "load_website_data",
+    "say_hello",
+]

Additionally, update the function name if you fix the typo in the original function:
+# Re-export website activities to maintain the public API
+__all__ = [
+    "get_hivemind_website_communities",  # Fixed typo
+    "extract_website",
+    "transform_website_data",
+    "load_website_data",
+    "say_hello",
+]

🧰 Tools
🪛 Ruff (0.8.2)
2-2: typing.Any imported but unused
Remove unused import: typing.Any
(F401)

5-5: hivemind_etl.website.activities.get_hivemind_website_comminities imported but unused
Remove unused import
(F401)

6-6: hivemind_etl.website.activities.extract_website imported but unused
Remove unused import
(F401)

7-7: hivemind_etl.website.activities.transform_website_data imported but unused
Remove unused import
(F401)

8-8: hivemind_etl.website.activities.load_website_data imported but unused
Remove unused import
(F401)
🧹 Nitpick comments (5)
workflows.py (1)
7-8: Remove unused imports.
CommunityWebsiteWorkflow and WebsiteIngestionSchedulerWorkflow are not referenced in this file. Consider removing them to satisfy lint and reduce clutter.

-from hivemind_etl.website.workflows import (
-    CommunityWebsiteWorkflow,
-    WebsiteIngestionSchedulerWorkflow,
-)

🧰 Tools
🪛 Ruff (0.8.2)
7-7: hivemind_etl.website.workflows.CommunityWebsiteWorkflow imported but unused
Remove unused import
(F401)

8-8: hivemind_etl.website.workflows.WebsiteIngestionSchedulerWorkflow imported but unused
Remove unused import
(F401)
registry.py (1)
3-6: Check spelling for function name.
get_hivemind_website_comminities likely contains a typographical error.

-    get_hivemind_website_comminities
+    get_hivemind_website_communities

hivemind_etl/website/workflows.py (1)
16-63: Consider unifying retry configurations or making them configurable.
Each activity call has individual retry policies. For maintainability, store these in a shared constant or config if you need to adjust them in multiple places later; see the sketch below.
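As an illustration of the shared-constant option (module and constant names are hypothetical; the values mirror the retry example given later in this review):

```python
from datetime import timedelta

from temporalio.common import RetryPolicy

# Defined once, e.g. in a shared config module, and imported by each workflow.
DEFAULT_ACTIVITY_RETRY = RetryPolicy(
    initial_interval=timedelta(seconds=1),
    backoff_coefficient=2.0,
    maximum_interval=timedelta(minutes=5),
    maximum_attempts=5,
)

# Inside a workflow:
# await workflow.execute_activity(
#     extract_website,
#     community_info,
#     start_to_close_timeout=timedelta(hours=1),
#     retry_policy=DEFAULT_ACTIVITY_RETRY,
# )
```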
hivemind_etl/website/activities.py (2)
1-10: Imports look good but consider organizing by category.Your imports are correctly structured, but consider organizing them by standard library, third-party, and local imports with blank lines separating each category for better readability.
import logging from typing import Any from temporalio import activity, workflow + with workflow.unsafe.imports_passed_through(): from hivemind_etl.website.module import ModulesWebsite from hivemind_etl.website.website_etl import WebsiteETL from llama_index.core import Document
13-42: Fix typo in function name "comminities" → "communities".There's a spelling error in the function name that should be corrected for consistency and clarity.
-async def get_hivemind_website_comminities( +async def get_hivemind_website_communities( platform_id: str | None = None, ) -> list[dict[str, Any]]:Also ensure you update all references to this function in other files, particularly in hivemind_etl/activities.py.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (5)
- hivemind_etl/activities.py (1 hunks)
- hivemind_etl/website/activities.py (1 hunks)
- hivemind_etl/website/workflows.py (1 hunks)
- registry.py (2 hunks)
- workflows.py (1 hunks)
🧰 Additional context used
🧬 Code Graph Analysis (5)
registry.py (1)
hivemind_etl/website/activities.py (4)
get_hivemind_website_comminities (13-42), load_website_data (78-87), transform_website_data (62-74), extract_website (46-58)
hivemind_etl/website/activities.py (2)
hivemind_etl/website/module.py (2)
ModulesWebsite (6-74), get_learning_platforms (11-74)
hivemind_etl/website/website_etl.py (1)
WebsiteETL (9-106)
hivemind_etl/activities.py (1)
hivemind_etl/website/activities.py (4)
get_hivemind_website_comminities (13-42), extract_website (46-58), transform_website_data (62-74), load_website_data (78-87)
hivemind_etl/website/workflows.py (1)
hivemind_etl/website/activities.py (4)
extract_website (46-58), get_hivemind_website_comminities (13-42), load_website_data (78-87), transform_website_data (62-74)
workflows.py (2)
hivemind_etl/activities.py (1)
say_hello (19-20)
hivemind_etl/website/workflows.py (2)
CommunityWebsiteWorkflow (18-61), WebsiteIngestionSchedulerWorkflow (66-93)
🪛 Ruff (0.8.2)
hivemind_etl/activities.py
5-5: hivemind_etl.website.activities.get_hivemind_website_comminities imported but unused
Remove unused import
(F401)
6-6: hivemind_etl.website.activities.extract_website imported but unused
Remove unused import
(F401)
7-7: hivemind_etl.website.activities.transform_website_data imported but unused
Remove unused import
(F401)
8-8: hivemind_etl.website.activities.load_website_data imported but unused
Remove unused import
(F401)
workflows.py
7-7: hivemind_etl.website.workflows.CommunityWebsiteWorkflow imported but unused
Remove unused import
(F401)
8-8: hivemind_etl.website.workflows.WebsiteIngestionSchedulerWorkflow imported but unused
Remove unused import
(F401)
⏰ Context from checks skipped due to timeout of 90000ms (2)
- GitHub Check: ci / test / Test
- GitHub Check: ci / lint / Lint
🔇 Additional comments (8)
workflows.py (1)
5-5: Import usage is valid.
The say_hello import is used later in line 25, so no concerns here.

registry.py (1)

20-26: ACTIVITIES list update looks consistent.
These references align with the new website activity modules.

hivemind_etl/website/workflows.py (2)

1-7: Imports and setup look good.
All needed imports (logging, asyncio, Temporal classes, etc.) are correctly specified here.

64-94: Workflow scheduling logic is straightforward.
Child workflows are started in parallel for each community. This sets the stage for concurrent ingestion, which looks correct for your use case.

hivemind_etl/website/activities.py (3)

45-58: Good implementation of extract_website activity with proper logging.
The function correctly initializes the WebsiteETL class and calls its extract method with appropriate error handling and logging.

61-74: Naming consistency between function and module.
The function name transform_website_data follows a good naming pattern that clearly indicates it's for website data. This is more explicit than the previous function name mentioned in the summary (transform_data).

77-87: Consistent ETL pattern implementation.
The load_website_data function follows the same pattern as the extract and transform functions, providing good consistency across the ETL pipeline.
hivemind_etl/activities.py (1)
18-20: Details
✅ Verification successful
Verify if the say_hello function is still needed.
This function appears to be a simple test function that returns a constant value (7). Check if it's still being used and consider removing it if it's no longer needed.
🏁 Script executed:
#!/bin/bash
# Check if say_hello is used anywhere in the codebase
echo "Checking if say_hello is used anywhere:"
rg -A 2 "say_hello" --exclude="activities.py"

Length of output: 544
Below is a new shell script using the correct ripgrep flag. Please re-run it manually to re-verify whether any references to say_hello exist outside of hivemind_etl/activities.py. Once you run this script, if no references to say_hello are found, it is safe to remove the function.
🏁 Script executed:
#!/bin/bash
echo "Checking if say_hello is referenced in files excluding hivemind_etl/activities.py:"
rg -A 2 "say_hello" -g '!hivemind_etl/activities.py'

Length of output: 743
Confirmation on say_hello Usage:
The verification shows that the say_hello function is actively referenced in other parts of the codebase (specifically in workflows.py and registry.py via imports). No removal is needed at this time.
from hivemind_etl.website.activities import (
    get_hivemind_website_comminities,
    extract_website,
    transform_website_data,
    load_website_data,
)
💡 Verification agent
🧩 Analysis chain
Fix unused imports identified by static analysis.
The imports are not being used in this file according to static analysis. If they're being imported for use in other modules that import from this file, consider adding a comment explaining this purpose. Otherwise, they should be removed.
-from hivemind_etl.website.activities import (
-    get_hivemind_website_comminities,
-    extract_website,
-    transform_website_data,
-    load_website_data,
-)

Let's verify if these functions are used elsewhere:
🏁 Script executed:
#!/bin/bash
# Search for imports or uses of these functions in other files
echo "Checking where these functions are imported or used:"
rg -l "from hivemind_etl.activities import" | xargs rg -l "(get_hivemind_website_comminities|extract_website|transform_website_data|load_website_data)"Length of output: 280
Action Required: Clarify Re-Exported Functions Usage
The static analysis warning is a false positive since our grep confirms that these functions are imported in registry.py, meaning they’re intentionally re-exported as part of the module's public API. To prevent future confusion and accidental removal, please add a comment next to the import block in hivemind_etl/activities.py indicating that these imports exist solely for external use (re-export).
- Location: hivemind_etl/activities.py (Lines 4-9)
- Suggestion: Add a comment such as:
# These functions are imported here to expose them as part of the public API for external modules (e.g., used in registry.py).
🧰 Tools
🪛 Ruff (0.8.2)
5-5: hivemind_etl.website.activities.get_hivemind_website_comminities imported but unused
Remove unused import
(F401)
6-6: hivemind_etl.website.activities.extract_website imported but unused
Remove unused import
(F401)
7-7: hivemind_etl.website.activities.transform_website_data imported but unused
Remove unused import
(F401)
8-8: hivemind_etl.website.activities.load_website_data imported but unused
Remove unused import
(F401)
Actionable comments posted: 4
♻️ Duplicate comments (5)
hivemind_etl/mediawiki/wikiteam_crawler.py (1)
24-65: 🛠️ Refactor suggestion
Wrap DumpGenerator calls in try-except.
External calls to DumpGenerator can fail unexpectedly. Placing them in a try-except allows for logging and controlled error handling.

# Directly call the DumpGenerator static __init__ method which will parse these parameters,
# execute the dump generation process, and run through the rest of the workflow.
-DumpGenerator(params)
+try:
+    DumpGenerator(params)
+except Exception as e:
+    logging.error(f"Error occurred while generating dump: {e}")
+    raise

🧰 Tools
🪛 Ruff (0.8.2)
51-51: f-string without any placeholders
Remove extraneous f prefix
(F541)
hivemind_etl/mediawiki/transform_xml.py (2)
56-57: Add error handling for malformed XML.
Wrapping ET.parse(xml_file) in a try-except block helps gracefully handle malformed or unreadable XML files.

Apply a try-except around the parse call:

try:
    tree = ET.parse(xml_file)
    root = tree.getroot()
except ET.ParseError as e:
    logging.error(f"Error parsing XML file {xml_file}: {str(e)}")
    raise
82-83: Consider supporting multiple revisions.
Currently, page.find("mw:revision", ...) retrieves only one revision. If you need full revision history, use findall and process each <revision>.

hivemind_etl/activities.py (1)
4-15: Unused imports flagged by static analysis.
These imported functions appear unused here. If they're intentionally re-exported, add a clarifying comment; otherwise, remove them to avoid confusion and reduce clutter.
🧰 Tools
🪛 Ruff (0.8.2)
5-5: hivemind_etl.website.activities.get_hivemind_website_comminities imported but unused
Remove unused import
(F401)

6-6: hivemind_etl.website.activities.extract_website imported but unused
Remove unused import
(F401)

7-7: hivemind_etl.website.activities.transform_website_data imported but unused
Remove unused import
(F401)

8-8: hivemind_etl.website.activities.load_website_data imported but unused
Remove unused import
(F401)

11-11: hivemind_etl.mediawiki.activities.get_hivemind_mediawiki_platforms imported but unused
Remove unused import
(F401)

12-12: hivemind_etl.mediawiki.activities.extract_mediawiki imported but unused
Remove unused import
(F401)

13-13: hivemind_etl.mediawiki.activities.transform_mediawiki_data imported but unused
Remove unused import
(F401)

14-14: hivemind_etl.mediawiki.activities.load_mediawiki_data imported but unused
Remove unused import
(F401)
hivemind_etl/mediawiki/etl.py (1)
84-85: Add error handling when removing the dump directory.
If shutil.rmtree(self.dump_dir) fails or the directory doesn't exist, an unhandled exception may arise. Consider surrounding it with a try-except block to log or handle errors gracefully; a sketch follows.
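A minimal sketch of such a guard, written as a hypothetical helper (the PR removes the directory inline):

```python
import logging
import shutil


def remove_dump_dir(dump_dir: str) -> None:
    """Delete the dump directory, logging instead of raising on failure."""
    try:
        shutil.rmtree(dump_dir)
    except FileNotFoundError:
        logging.warning("Dump directory not found, nothing to delete: %s", dump_dir)
    except OSError as e:
        logging.error("Failed to delete dump directory %s: %s", dump_dir, e)
```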
🧹 Nitpick comments (8)
workflows.py (2)
6-8: Remove unused imports from hivemind_etl.website.workflows.
Static analysis flags CommunityWebsiteWorkflow and WebsiteIngestionSchedulerWorkflow as unused. Consider removing them to reduce dead code.

from hivemind_etl.website.workflows import (
-    CommunityWebsiteWorkflow,
-    WebsiteIngestionSchedulerWorkflow,
)

🧰 Tools
🪛 Ruff (0.8.2)
7-7: hivemind_etl.website.workflows.CommunityWebsiteWorkflow imported but unused
Remove unused import
(F401)

8-8: hivemind_etl.website.workflows.WebsiteIngestionSchedulerWorkflow imported but unused
Remove unused import
(F401)
10-12: Remove unused import from hivemind_etl.mediawiki.workflows.
MediaWikiETLWorkflow is not used in this file, so you can safely remove it to clean up the imports.

from hivemind_etl.mediawiki.workflows import (
-    MediaWikiETLWorkflow,
)

🧰 Tools
🪛 Ruff (0.8.2)
11-11: hivemind_etl.mediawiki.workflows.MediaWikiETLWorkflow imported but unused
Remove unused import: hivemind_etl.mediawiki.workflows.MediaWikiETLWorkflow
(F401)
tests/integration/test_mediawiki_modules.py (1)
17-20: Consider adding negative test coverage.
Little or no test coverage exists for cases where baseURL or path might be malformed or missing namespaces. Adding negative tests can help ensure error handling is robust.

hivemind_etl/mediawiki/workflows.py (1)
36-72: Consider parallelizing or using child workflows for each platform.
Currently, the ETL processes each platform sequentially in a for loop. If concurrency is desired (or if a single failure shouldn't stall other communities), you can leverage child workflows or asyncio concurrency for improved throughput and resilience; see the sketch below.
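A hedged sketch of the child-workflow variant, assuming the Temporal Python SDK; the scheduler and per-community workflow names here are hypothetical, not the PR's classes:

```python
import asyncio

from temporalio import workflow


@workflow.defn
class MediaWikiETLSchedulerWorkflow:
    @workflow.run
    async def run(self, platforms: list[dict]) -> None:
        # Fan each platform out to its own child workflow instead of a sequential loop.
        await asyncio.gather(
            *[
                workflow.execute_child_workflow(
                    "CommunityMediaWikiWorkflow",  # hypothetical per-community workflow
                    platform,
                    id=f"mediawiki-etl-{platform['community_id']}",
                )
                for platform in platforms
            ],
            return_exceptions=True,  # one failing community does not cancel the rest
        )
```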
registry.py (2)

3-4: Consider correcting the spelling for “comminities”.
The function name get_hivemind_website_comminities seems to have a typographical error; consider renaming it to match the domain concept "communities."

-from hivemind_etl.activities import (
-    extract_website,
-    get_hivemind_website_comminities,
-    ...
+from hivemind_etl.activities import (
+    extract_website,
+    get_hivemind_website_communities,
+    ...
26-36: Check for code duplication in ACTIVITIES.
say_hello is included at the end of ACTIVITIES. Consider whether reordering or grouping activities by domain (Website vs. MediaWiki) might improve readability.

hivemind_etl/mediawiki/wikiteam_crawler.py (1)
51-51: Remove extraneous “f” prefix.
Static analysis indicates f"--namespaces" has no placeholders. Remove the f to avoid confusion.

-params.append(f"--namespaces")
+params.append("--namespaces")

🧰 Tools
🪛 Ruff (0.8.2)
51-51: f-string without any placeholders
Remove extraneous f prefix
(F541)
hivemind_etl/mediawiki/activities.py (1)
86-96: Review data load concurrency.
Simultaneous calls to load_mediawiki_data for the same community_id might overlap. Consider additional safeguards or concurrency checks, e.g., locking or version checks.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (11)
- .gitignore (1 hunks)
- hivemind_etl/activities.py (1 hunks)
- hivemind_etl/mediawiki/activities.py (1 hunks)
- hivemind_etl/mediawiki/etl.py (1 hunks)
- hivemind_etl/mediawiki/module.py (1 hunks)
- hivemind_etl/mediawiki/transform_xml.py (1 hunks)
- hivemind_etl/mediawiki/wikiteam_crawler.py (1 hunks)
- hivemind_etl/mediawiki/workflows.py (1 hunks)
- registry.py (1 hunks)
- tests/integration/test_mediawiki_modules.py (1 hunks)
- workflows.py (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
- .gitignore
🧰 Additional context used
🧬 Code Graph Analysis (8)
workflows.py (2)
hivemind_etl/activities.py (1)
say_hello (25-26)
hivemind_etl/website/workflows.py (2)
CommunityWebsiteWorkflow (18-61), WebsiteIngestionSchedulerWorkflow (66-93)
tests/integration/test_mediawiki_modules.py (1)
hivemind_etl/mediawiki/module.py (2)
ModulesMediaWiki (6-80), get_learning_platforms (11-80)
hivemind_etl/mediawiki/activities.py (2)
hivemind_etl/mediawiki/module.py (2)
ModulesMediaWiki (6-80), get_learning_platforms (11-80)
hivemind_etl/mediawiki/etl.py (4)
MediawikiETL (10-85), extract (23-29), transform (31-76), load (78-85)
hivemind_etl/mediawiki/etl.py (2)
hivemind_etl/mediawiki/transform_xml.py (1)
parse_mediawiki_xml (9-134)
hivemind_etl/mediawiki/wikiteam_crawler.py (2)
WikiteamCrawler (7-83), crawl (24-64)
hivemind_etl/mediawiki/workflows.py (1)
hivemind_etl/mediawiki/activities.py (4)
get_hivemind_mediawiki_platforms (13-49), extract_mediawiki (53-69), transform_mediawiki_data (73-83), load_mediawiki_data (87-96)
hivemind_etl/activities.py (2)
hivemind_etl/website/activities.py (4)
get_hivemind_website_comminities (13-42), extract_website (46-58), transform_website_data (62-74), load_website_data (78-87)
hivemind_etl/mediawiki/activities.py (4)
get_hivemind_mediawiki_platforms (13-49), extract_mediawiki (53-69), transform_mediawiki_data (73-83), load_mediawiki_data (87-96)
registry.py (2)
hivemind_etl/website/activities.py (3)
get_hivemind_website_comminities (13-42), load_website_data (78-87), transform_website_data (62-74)
hivemind_etl/mediawiki/activities.py (3)
get_hivemind_mediawiki_platforms (13-49), transform_mediawiki_data (73-83), load_mediawiki_data (87-96)
hivemind_etl/mediawiki/transform_xml.py (1)
hivemind_etl/mediawiki/schema.py (4)
Contributor (6-8), Page (23-27), Revision (11-20), SiteInfo (30-34)
🪛 Ruff (0.8.2)
workflows.py
7-7: hivemind_etl.website.workflows.CommunityWebsiteWorkflow imported but unused
Remove unused import
(F401)
8-8: hivemind_etl.website.workflows.WebsiteIngestionSchedulerWorkflow imported but unused
Remove unused import
(F401)
11-11: hivemind_etl.mediawiki.workflows.MediaWikiETLWorkflow imported but unused
Remove unused import: hivemind_etl.mediawiki.workflows.MediaWikiETLWorkflow
(F401)
hivemind_etl/mediawiki/wikiteam_crawler.py
14-14: Do not use mutable data structures for argument defaults
Replace with None; initialize within function
(B006)
51-51: f-string without any placeholders
Remove extraneous f prefix
(F541)
hivemind_etl/activities.py
5-5: hivemind_etl.website.activities.get_hivemind_website_comminities imported but unused
Remove unused import
(F401)
6-6: hivemind_etl.website.activities.extract_website imported but unused
Remove unused import
(F401)
7-7: hivemind_etl.website.activities.transform_website_data imported but unused
Remove unused import
(F401)
8-8: hivemind_etl.website.activities.load_website_data imported but unused
Remove unused import
(F401)
11-11: hivemind_etl.mediawiki.activities.get_hivemind_mediawiki_platforms imported but unused
Remove unused import
(F401)
12-12: hivemind_etl.mediawiki.activities.extract_mediawiki imported but unused
Remove unused import
(F401)
13-13: hivemind_etl.mediawiki.activities.transform_mediawiki_data imported but unused
Remove unused import
(F401)
14-14: hivemind_etl.mediawiki.activities.load_mediawiki_data imported but unused
Remove unused import
(F401)
⏰ Context from checks skipped due to timeout of 90000ms (2)
- GitHub Check: ci / lint / Lint
- GitHub Check: ci / test / Test
🔇 Additional comments (8)
workflows.py (1)
5-5: Good use of the imported activity.
The say_hello import is correctly utilized later in the SayHello workflow.

registry.py (3)

7-10: Verify that these newly imported MediaWiki activities are only added where needed.
Ensure that extract_mediawiki, get_hivemind_mediawiki_platforms, transform_mediawiki_data, and load_mediawiki_data are actively used in your workflows or scheduled tasks. If they are not invoked, consider removing them or adding usage and test coverage.

16-16: Workflow inclusion appears consistent.
Adding MediaWikiETLWorkflow to the workflow list makes sense for the new MediaWiki ETL feature.

23-23: Great addition to WORKFLOWS.
No concerns about adding MediaWikiETLWorkflow here. Looks consistent with the existing pattern.

hivemind_etl/mediawiki/wikiteam_crawler.py (1)

66-84: Ensure consistent log levels for file deletion.
The delete_dump method uses logging.info and logging.warning for events; confirm that the chosen log levels agree with your logging strategy (e.g., info vs. debug).

hivemind_etl/mediawiki/activities.py (2)

52-70: Good separation of concerns for extract_mediawiki.
Extraction is clearly delegated to MediawikiETL.extract(). This is an excellent pattern for maintainability.

72-84: Check namespaces parameter usage.
transform_mediawiki_data creates MediawikiETL(community_id=community_id) without specifying namespaces. Confirm if that is correct or if you should pass them here as well.

hivemind_etl/mediawiki/transform_xml.py (1)

46-47: Looks Good for File Existence Check.
Raising a FileNotFoundError if no XML file is located aligns well with robust error handling. This ensures early detection of misconfiguration.
if not isinstance(path, str) and not isinstance(base_url, str):
    raise ValueError("Wrong format for `path` and `base_url`!")
Potential logic error with string validation.
Currently, the check raises a ValueError only if both path and base_url are not strings. If either is not a string, it should likely raise.
- if not isinstance(path, str) and not isinstance(base_url, str):
+ if not isinstance(path, str) or not isinstance(base_url, str):
    raise ValueError("Wrong format for `path` and `base_url`!")

📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
-if not isinstance(path, str) and not isinstance(base_url, str):
-    raise ValueError("Wrong format for `path` and `base_url`!")
+if not isinstance(path, str) or not isinstance(base_url, str):
+    raise ValueError("Wrong format for `path` and `base_url`!")
def __init__(
    self,
    community_id: str,
    xml: bool = True,
    force: bool = True,
    curonly: bool = True,
    namespaces: list[int] = [],
    **kwargs,
) -> None:
    self.community_id = community_id
    self.xml = xml
    self.force = force
    self.curonly = curonly
    self.extra_params = kwargs
    self.namespaces = namespaces
🛠️ Refactor suggestion
Avoid using a mutable default argument for namespaces.
In Python, using a mutable as a default ([]) can lead to unexpected behavior if the list is modified. Initialize it with None and set the default inside the constructor.
-    def __init__(
-        self,
-        community_id: str,
-        xml: bool = True,
-        force: bool = True,
-        curonly: bool = True,
-        namespaces: list[int] = [],
-        **kwargs,
-    ) -> None:
-        self.community_id = community_id
-        self.xml = xml
-        self.force = force
-        self.curonly = curonly
-        self.extra_params = kwargs
-        self.namespaces = namespaces
+    def __init__(
+        self,
+        community_id: str,
+        xml: bool = True,
+        force: bool = True,
+        curonly: bool = True,
+        namespaces: list[int] | None = None,
+        **kwargs,
+    ) -> None:
+        self.community_id = community_id
+        self.xml = xml
+        self.force = force
+        self.curonly = curonly
+        self.extra_params = kwargs
+        self.namespaces = namespaces if namespaces is not None else []

📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
-def __init__(
-    self,
-    community_id: str,
-    xml: bool = True,
-    force: bool = True,
-    curonly: bool = True,
-    namespaces: list[int] = [],
-    **kwargs,
-) -> None:
-    self.community_id = community_id
-    self.xml = xml
-    self.force = force
-    self.curonly = curonly
-    self.extra_params = kwargs
-    self.namespaces = namespaces
+def __init__(
+    self,
+    community_id: str,
+    xml: bool = True,
+    force: bool = True,
+    curonly: bool = True,
+    namespaces: list[int] | None = None,
+    **kwargs,
+) -> None:
+    self.community_id = community_id
+    self.xml = xml
+    self.force = force
+    self.curonly = curonly
+    self.extra_params = kwargs
+    self.namespaces = namespaces if namespaces is not None else []
🧰 Tools
🪛 Ruff (0.8.2)
14-14: Do not use mutable data structures for argument defaults
Replace with None; initialize within function
(B006)
for page in pages:
    try:
        documents.append(
            Document(
                doc_id=page.page_id,
                text=page.revision.text,
                metadata={
                    "title": page.title,
                    "namespace": page.namespace,
                    "revision_id": page.revision.revision_id,
                    "parent_revision_id": page.revision.parent_revision_id,
                    "timestamp": page.revision.timestamp,
                    "comment": page.revision.comment,
                    "contributor_username": page.revision.contributor.username,
                    "contributor_user_id": page.revision.contributor.user_id,
                    "sha1": page.revision.sha1,
                    "model": page.revision.model,
                },
                excluded_embed_metadata_keys=[
                    "namespace",
                    "revision_id",
                    "parent_revision_id",
                    "sha1",
                    "model",
                    "contributor_user_id",
                    "comment",
                    "timestamp",
                ],
💡 Verification agent
🧩 Analysis chain
Validate existence of page.revision before referencing.
Accessing page.revision.text may raise an AttributeError if page.revision is None. Confirm whether pages without revisions should be skipped or handled differently.
🏁 Script executed:
#!/bin/bash
# This script searches for references to page.revision in this file to see if they’re all guarded.
rg -A 3 'page\.revision' hivemind_etl/mediawiki/etl.py

Length of output: 939
Action Required: Add a guard for page.revision existence.
It was verified that the code on lines 35–62 in hivemind_etl/mediawiki/etl.py directly accesses attributes of page.revision without checking if it exists. This can lead to a runtime AttributeError if a page lacks a revision.
- File: hivemind_etl/mediawiki/etl.py (Lines 35–62)
- Issue: Directly accessing page.revision.text, page.revision.revision_id, etc., without confirming page.revision is not None.
- Recommendation: Introduce a conditional check before dereferencing the revision. For example, consider:
for page in pages:
    if page.revision is None:
        continue  # or handle accordingly
    try:
        documents.append(
            Document(
                doc_id=page.page_id,
                text=page.revision.text,
                metadata={
                    "title": page.title,
                    "namespace": page.namespace,
                    "revision_id": page.revision.revision_id,
                    "parent_revision_id": page.revision.parent_revision_id,
                    "timestamp": page.revision.timestamp,
                    "comment": page.revision.comment,
                    "contributor_username": page.revision.contributor.username,
                    "contributor_user_id": page.revision.contributor.user_id,
                    "sha1": page.revision.sha1,
                    "model": page.revision.model,
                },
                excluded_embed_metadata_keys=[
                    "namespace",
                    "revision_id",
                    "parent_revision_id",
                    "sha1",
                    "model",
                    "contributor_user_id",
                    "comment",
                    "timestamp",
                ],
            )
        )
    except Exception as e:
        # appropriate error handling if needed
        pass
- Action: Please confirm the intended behavior for pages missing a revision and update the code accordingly.
Actionable comments posted: 0
🔭 Outside diff range comments (1)
tests/integration/test_mediawiki_modules.py (1)
1-235: 🛠️ Refactor suggestion
Add tests for error handling scenarios.
There are no tests for the error handling in the get_learning_platforms method. Consider adding tests for:
- When path or base_url are not strings
- When an exception is raised during processing
Example test case:
def test_error_handling_invalid_types(self):
    """Test error handling when path or base_url are not strings"""
    platform_id = ObjectId("6579c364f1120850414e0dc6")
    community_id = ObjectId("6579c364f1120850414e0dc5")

    # Insert platform with non-string path
    self.client["Core"]["platforms"].insert_one({
        "_id": platform_id,
        "name": "mediaWiki",
        "metadata": {
            "baseURL": "http://example.com",
            "path": 123,  # Non-string path
        },
        "community": community_id,
        "disconnectedAt": None,
        "connectedAt": datetime.now(),
        "createdAt": datetime.now(),
        "updatedAt": datetime.now(),
    })

    # Insert corresponding module
    self.client["Core"]["modules"].insert_one({
        "name": "hivemind",
        "community": community_id,
        "options": {
            "platforms": [
                {
                    "platform": platform_id,
                    "name": "mediaWiki",
                    "metadata": {"namespaces": [0, 1, 2]},
                }
            ]
        },
    })

    # Should log error and return empty list
    result = self.modules_mediawiki.get_learning_platforms()
    self.assertEqual(result, [])
♻️ Duplicate comments (1)
hivemind_etl/mediawiki/module.py (1)
66-67: Fix the logical condition for string validation.
Currently, the check raises a ValueError only if both path and base_url are not strings. It should raise an error if either is not a string.

-if not isinstance(path, str) and not isinstance(base_url, str):
+if not isinstance(path, str) or not isinstance(base_url, str):
    raise ValueError("Wrong format for `path` and `base_url`!")
🧹 Nitpick comments (4)
hivemind_etl/mediawiki/module.py (2)
82-82: Remove unnecessary type ignore comment.
The type ignore comment isn't needed if you properly validate that base_url and path are strings before concatenation. Fix the validation condition as suggested earlier and you can safely remove this comment.

-"base_url": base_url + path,  # type: ignore
+"base_url": base_url + path,
56-56: Consider implementing the TODO for database optimization.
Retrieving baseURL and path in a single database call would reduce latency and improve performance, especially when processing multiple platforms; a sketch is shown below.
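A hypothetical sketch of the single-query variant; the collection and field names follow the integration tests in this PR, but the module's actual data-access helpers may differ:

```python
from pymongo import MongoClient


def get_base_url_and_path(client: MongoClient, platform_id) -> tuple[str | None, str | None]:
    """Fetch both metadata fields with one query and a projection."""
    platform = client["Core"]["platforms"].find_one(
        {"_id": platform_id},
        {"metadata.baseURL": 1, "metadata.path": 1},
    )
    if platform is None:
        return None, None
    metadata = platform.get("metadata", {})
    return metadata.get("baseURL"), metadata.get("path")
```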
tests/integration/test_mediawiki_modules.py (2)

156-158: Update docstring to reflect test purpose.
The current docstring describes the data setup but not the specific purpose of the test, which is filtering platforms by ID.
- """ - Two mediawiki platforms for one community - """ + """ + Test filtering platforms by ID when multiple platforms exist for one community + """
159-219: Reduce code duplication in test setup.
There's significant duplication in the test setup between this test and the previous one. Consider creating a helper method to set up the test data.
Here's a suggested approach:
def _setup_multiple_platforms(self):
    platform_id1 = ObjectId("6579c364f1120850414e0dc6")
    platform_id2 = ObjectId("6579c364f1120850414e0dc7")
    community_id = ObjectId("1009c364f1120850414e0dc5")

    # Insert module with two platforms
    self.client["Core"]["modules"].insert_one({
        "name": "hivemind",
        "community": community_id,
        "options": {
            "platforms": [
                {
                    "platform": platform_id1,
                    "name": "mediaWiki",
                    "metadata": {"namespaces": [0, 1, 2]},
                },
                {
                    "platform": platform_id2,
                    "name": "mediaWiki",
                    "metadata": {"namespaces": [3, 4, 5]},
                },
            ]
        },
    })

    # Insert platform records
    for i, platform_id in enumerate([platform_id1, platform_id2]):
        self.client["Core"]["platforms"].insert_one({
            "_id": platform_id,
            "name": "mediaWiki",
            "metadata": {
                "baseURL": f"http://example{i+1}.com",
                "path": "/api",
            },
            "community": community_id,
            "disconnectedAt": None,
            "connectedAt": datetime.now(),
            "createdAt": datetime.now(),
            "updatedAt": datetime.now(),
        })

    return platform_id1, platform_id2, community_id
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (4)
- hivemind_etl/mediawiki/activities.py (1 hunks)
- hivemind_etl/mediawiki/module.py (1 hunks)
- tests/integration/test_mediawiki_modules.py (1 hunks)
- tests/unit/test_mediawiki_etl.py (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (2)
- hivemind_etl/mediawiki/activities.py
- tests/unit/test_mediawiki_etl.py
🧰 Additional context used
🧬 Code Graph Analysis (1)
tests/integration/test_mediawiki_modules.py (1)
hivemind_etl/mediawiki/module.py (2)
ModulesMediaWiki (6-91), get_learning_platforms (11-91)
⏰ Context from checks skipped due to timeout of 90000ms (2)
- GitHub Check: ci / test / Test
- GitHub Check: ci / lint / Lint
🔇 Additional comments (7)
hivemind_etl/mediawiki/module.py (3)
6-9: LGTM: Class definition and constructor are well structured.
The class correctly extends ModulesBase and initializes the platform_name attribute before calling the parent constructor.

11-36: LGTM: Method signature and documentation are clear.
The method signature provides appropriate type hints and default values. The docstring clearly describes the purpose, parameters and return value with a helpful example.

37-54: LGTM: Query and filtering implementation looks good.
The code correctly queries modules based on platform name and implements proper filtering logic for platforms.

tests/integration/test_mediawiki_modules.py (4)

9-16: LGTM: Test setup is clean and properly initializes the environment.
The setUp method correctly drops existing collections and initializes the ModulesMediaWiki instance.

17-19: LGTM: Empty data test is straightforward.
This test correctly verifies the behavior when no data is available.

21-69: LGTM: Single data test covers basic functionality.
This test properly sets up a single platform and module, then verifies the correct output format and values.

70-154: LGTM: Multiple platforms test is comprehensive.
The test correctly verifies that multiple platforms for a single community are properly retrieved and formatted.
Actionable comments posted: 1
🧹 Nitpick comments (6)
hivemind_etl/mediawiki/workflows.py (6)
28-35: Consider using more specific exception handling.
While catching all exceptions is appropriate for the workflow's main try block, consider adding logging for specific exception types that might be common or require different handling.
38-43: Consider using a dedicated data structure for platform configuration.
Using a dictionary for the platform configuration works, but a Pydantic model would provide better type safety and validation.

from pydantic import BaseModel


class MediaWikiPlatform(BaseModel):
    base_url: str
    community_id: str
    namespaces: list[int]  # Assuming namespaces is a list of integers
58-63: Implement retry logic for the load operation.
Loading data often encounters transient issues. Consider implementing a retry mechanism for the load_mediawiki_data activity using Temporal's built-in retry options.

# Example retry policy
from temporalio.common import RetryPolicy

retry_policy = RetryPolicy(
    initial_interval=timedelta(seconds=1),
    backoff_coefficient=2.0,
    maximum_interval=timedelta(minutes=5),
    maximum_attempts=5,
)

# Then use it with the activity
await workflow.execute_activity(
    load_mediawiki_data,
    documents,
    platform["community_id"],
    start_to_close_timeout=timedelta(minutes=30),
    retry_policy=retry_policy,
)
65-72: Improve error reporting granularity.
Consider adding more detailed logging for different stages of processing (extraction, transformation, loading) to help diagnose issues more effectively.

# Example of more detailed logging
try:
    # Extraction
    logging.info(f"Starting extraction for community: {platform['community_id']}")
    await workflow.execute_activity(...)
    logging.info(f"Extraction completed for community: {platform['community_id']}")

    # Transformation
    logging.info(f"Starting transformation for community: {platform['community_id']}")
    documents = await workflow.execute_activity(...)
    logging.info(f"Transformation completed for community: {platform['community_id']}, processed {len(documents)} documents")

    # Loading
    logging.info(f"Starting data loading for community: {platform['community_id']}")
    await workflow.execute_activity(...)
    logging.info(f"Successfully completed ETL for community id: {platform['community_id']}")
except Exception as e:
    # Error handling remains the same
74-76: Consider returning workflow execution summary.
The workflow currently returns None. Consider returning a summary of the processing results, such as the number of communities processed successfully and failed.

@workflow.run
async def run(self, platform_id: str | None = None) -> dict:
    results = {"successful": [], "failed": []}
    try:
        # ... existing code ...
        for platform in platforms:
            try:
                # ... existing processing code ...
                results["successful"].append(platform["community_id"])
            except Exception as e:
                results["failed"].append({
                    "community_id": platform["community_id"],
                    "error": str(e)
                })
                continue
    except Exception as e:
        # ... existing error handling ...
    return results
36-68: Consider implementing parallel processing for multiple platforms.
Currently, platforms are processed sequentially. If processing numerous platforms, consider implementing parallel processing to improve efficiency.
You could use Temporal's child workflows or activity execution options to implement parallel processing:
# Example using concurrent.futures for parallel processing
from concurrent.futures import ThreadPoolExecutor
import asyncio

# Inside the run method:
with ThreadPoolExecutor(max_workers=3) as executor:
    # Create a list of platform processing tasks
    tasks = []
    for platform in platforms:
        task = asyncio.create_task(self._process_platform(platform))
        tasks.append(task)

    # Wait for all tasks to complete
    await asyncio.gather(*tasks)

# Add a helper method to process a single platform
async def _process_platform(self, platform):
    try:
        # Existing platform processing code
        mediawiki_platform = {...}
        await workflow.execute_activity(...)
        # ...
    except Exception as e:
        logging.error(f"Error processing community id: {platform['community_id']}: {str(e)}")

This approach would need careful consideration of resource usage and potential rate limits.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
- hivemind_etl/mediawiki/workflows.py (1 hunks)
⏰ Context from checks skipped due to timeout of 90000ms (1)
- GitHub Check: ci / test / Test
🔇 Additional comments (3)
hivemind_etl/mediawiki/workflows.py (3)
1-13: Imports look well-structured and follow best practices.
The code correctly uses the Temporal workflow's unsafe.imports_passed_through() context manager to handle imports for distributed execution. All necessary activities from the mediawiki module are properly imported.

15-27: Documentation is clear and comprehensive.
The workflow class is properly decorated with @workflow.defn and the method with @workflow.run. The docstring effectively explains the purpose and parameters of the run method.

51-55: Details
❓ Verification inconclusive
Be cautious with memory consumption during transformation.
The documents variable might contain a large amount of data. Consider implementing streaming or batch processing patterns if memory consumption becomes an issue.
🏁 Script executed:
#!/bin/bash
# Check if there are any memory management patterns in the transform activity
rg -A 5 "transform_mediawiki_data" --glob "*.py" | grep -E "batch|stream|chunk|yield"

Length of output: 85
Below is a new shell script to gather more context about the transform_mediawiki_data function. The initial check returned no output, so this script will extract around 50 lines following the function's definition and then search for keywords related to memory management (e.g. batch, stream, chunk, yield). Please run this command and share the output so we can verify if any memory optimization patterns have been implemented.

If the output confirms that no such memory management patterns are present, please review whether implementing streaming or batch processing is feasible to control memory usage when documents could be large.
🏁 Script executed:
#!/bin/bash
# Extract extended context around the transform_mediawiki_data function definition
rg -A 50 "def transform_mediawiki_data" --glob "*.py" > transform_context.txt && \
  grep -E "batch|stream|chunk|yield" transform_context.txt

Length of output: 236
Memory Usage Concern — Manual Verification Recommended
Based on our verification attempts, there is no evidence of streaming, batching, chunking, or yielding patterns within the transform_mediawiki_data function. This suggests that the transformation currently loads all data into memory, which might lead to high memory consumption when processing large datasets. Please verify manually that this behavior is acceptable under expected workloads, and consider refactoring the transformation to support streaming or batch processing patterns to mitigate potential memory issues; a batching sketch follows.
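If batching is pursued, a generic sketch of bounding memory to one chunk of documents at a time (function names and batch size are illustrative, not the PR's code):

```python
from typing import Callable, Iterable, Iterator, TypeVar

T = TypeVar("T")


def batched(items: Iterable[T], batch_size: int) -> Iterator[list[T]]:
    """Group an iterable into lists of at most batch_size items."""
    batch: list[T] = []
    for item in items:
        batch.append(item)
        if len(batch) >= batch_size:
            yield batch
            batch = []
    if batch:
        yield batch


def load_in_batches(documents: Iterable[T], ingest: Callable[[list[T]], None]) -> None:
    """Feed the loader one bounded chunk at a time instead of a full list."""
    for chunk in batched(documents, batch_size=500):
        ingest(chunk)
```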
await workflow.execute_activity(
    extract_mediawiki,
    mediawiki_platform,
    start_to_close_timeout=timedelta(days=5),
)
🛠️ Refactor suggestion
Five-day timeout may require monitoring strategy.
The 5-day timeout for extract_mediawiki indicates this could be a very long-running operation. Consider:
- Implementing progress tracking/reporting
- Breaking this into smaller, checkpointed activities
- Adding monitoring to detect stalled extractions (see the heartbeat sketch after this list)
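As an illustration of the monitoring point, a hedged sketch using Temporal activity heartbeats (the function name, page counter, and timeout values are illustrative):

```python
from datetime import timedelta

from temporalio import activity


@activity.defn
async def extract_with_progress(api_url: str) -> None:
    pages_done = 0
    # ... inside the extraction loop:
    pages_done += 1
    activity.heartbeat(f"extracted {pages_done} pages")


# Pair the long start_to_close_timeout with a heartbeat_timeout so a stalled
# extraction is detected and retried instead of silently hanging for days:
# await workflow.execute_activity(
#     extract_with_progress,
#     api_url,
#     start_to_close_timeout=timedelta(days=5),
#     heartbeat_timeout=timedelta(minutes=10),
# )
```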
Summary by CodeRabbit
New Features
Bug Fixes
Chores
Updated dependencies to include wikiteam3.