
Conversation


@amindadgar amindadgar commented Apr 9, 2025

Summary by CodeRabbit

  • New Features

    • Introduced robust MediaWiki data integration with end-to-end support for extracting, transforming, and loading content from dumps.
    • Added capabilities for efficient XML data crawling and parsing to convert raw data into structured content.
    • Enhanced content models and error-handling workflows to improve overall data management and cleanup processes.
    • Implemented new asynchronous activity functions for processing MediaWiki data and website data related to communities.
    • Added structured workflows for managing website data ingestion and MediaWiki data processing.
    • Introduced a new ETL workflow specifically for MediaWiki data processing.
    • Added new methods for retrieving and processing community data from MediaWiki platforms.
  • Bug Fixes

    • Improved error handling and logging mechanisms across various data processing activities.
  • Chores

    • Updated dependencies to include the latest version of wikiteam3.


coderabbitai bot commented Apr 9, 2025

Walkthrough

This pull request introduces a MediaWiki ETL pipeline. It adds new modules to extract, transform, and load MediaWiki XML dumps into structured data models using Pydantic. Additionally, it implements an XML reader, a crawler for handling dumps via an external package, and corresponding test suites. Minor updates were also made to the .gitignore file and dependency configurations, along with the restructuring of asynchronous activity functions and workflows related to community data processing.

Changes

File(s) Change Summary
.gitignore Added *.xml and dump_* to ignore XML files and dump files; ensured a newline after main.ipynb.
hivemind_etl/mediawiki/etl.py Introduced the MediawikiETL class with methods for extracting (via crawling), transforming (using parse_mediawiki_xml), and loading MediaWiki data.
hivemind_etl/mediawiki/llama_xml_reader.py Added XMLReader for XML parsing.
hivemind_etl/mediawiki/schema.py Added new Pydantic models (Contributor, Revision, Page, SiteInfo) to structure MediaWiki data.
hivemind_etl/mediawiki/transform_xml.py Introduced parse_mediawiki_xml function for parsing MediaWiki XML dumps.
hivemind_etl/mediawiki/wikiteam_crawler.py Added WikiteamCrawler class for crawling MediaWiki dumps and deleting dump files.
hivemind_etl/activities.py Removed several asynchronous activity functions related to community data processing.
hivemind_etl/website/activities.py Added new asynchronous activity functions for processing website data related to communities.
hivemind_etl/website/workflows.py Introduced workflow classes for managing website data processing activities.
registry.py Updated function names in the ACTIVITIES list to reflect changes in naming conventions.
workflows.py Removed workflow classes related to community website processing.
requirements.txt Added new dependency: wikiteam3==4.4.1.
tests/unit/test_mediawiki_etl.py Added a test suite for MediawikiETL, covering initialization, extraction, transformation, load, error handling, and dump deletion.
tests/integration/test_mediawiki_modules.py Added a test suite for get_learning_platforms in ModulesMediaWiki, validating its functionality under various conditions.

Sequence Diagram(s)

sequenceDiagram
    participant U as User
    participant ETL as MediawikiETL
    participant C as WikiteamCrawler
    participant DG as DumpGenerator
    participant PT as parse_mediawiki_xml
    participant CIP as CustomIngestionPipeline

    U->>ETL: Initiate ETL process
    ETL->>C: Call crawl(api_url, dump_path)
    C->>DG: Generate XML dump
    DG-->>C: Dump file created
    C-->>ETL: Return dump file path
    ETL->>PT: Parse XML dump
    PT-->>ETL: Return list of Document objects
    ETL->>CIP: Ingest documents
    alt Delete dump enabled
        ETL->>ETL: Delete dump file
    end

Poem

I hopped through lines of code so neat,
Parsing XML with a joyful beat.
MediaWiki dumps now gleam in view,
Each ETL step brisk and true.
With a nibble of tests and a hop so light,
I celebrate this code-day delight!
🐇✨

Tip

⚡💬 Agentic Chat (Pro Plan, General Availability)
  • We're introducing multi-step agentic chat in review comments and issue comments, within and outside of PRs. This feature enhances review and issue discussions with the CodeRabbit agentic chat by enabling advanced interactions, including the ability to create pull requests directly from comments and add commits to existing pull requests.

🪧 Tips

Chat

There are 3 ways to chat with CodeRabbit:

  • Review comments: Directly reply to a review comment made by CodeRabbit. Example:
    • I pushed a fix in commit <commit_id>, please review it.
    • Generate unit testing code for this file.
    • Open a follow-up GitHub issue for this discussion.
  • Files and specific lines of code (under the "Files changed" tab): Tag @coderabbitai in a new review comment at the desired location with your query. Examples:
    • @coderabbitai generate unit testing code for this file.
    • @coderabbitai modularize this function.
  • PR comments: Tag @coderabbitai in a new PR comment to ask questions about the PR branch. For the best results, please provide a very specific query, as very limited context is provided in this mode. Examples:
    • @coderabbitai gather interesting stats about this repository and render them as a table. Additionally, render a pie chart showing the language distribution in the codebase.
    • @coderabbitai read src/utils.ts and generate unit testing code.
    • @coderabbitai read the files in the src/scheduler package and generate a class diagram using mermaid and a README in the markdown format.
    • @coderabbitai help me debug CodeRabbit configuration file.

Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments.

CodeRabbit Commands (Invoked using PR comments)

  • @coderabbitai pause to pause the reviews on a PR.
  • @coderabbitai resume to resume the paused reviews.
  • @coderabbitai review to trigger an incremental review. This is useful when automatic reviews are disabled for the repository.
  • @coderabbitai full review to do a full review from scratch and review all the files again.
  • @coderabbitai summary to regenerate the summary of the PR.
  • @coderabbitai generate docstrings to generate docstrings for this PR.
  • @coderabbitai resolve to resolve all the CodeRabbit review comments.
  • @coderabbitai plan to trigger planning for file edits and PR creation.
  • @coderabbitai configuration to show the current CodeRabbit configuration for the repository.
  • @coderabbitai help to get help.

Other keywords and placeholders

  • Add @coderabbitai ignore anywhere in the PR description to prevent this PR from being reviewed.
  • Add @coderabbitai summary to generate the high-level summary at a specific location in the PR description.
  • Add @coderabbitai anywhere in the PR title to generate the title automatically.

CodeRabbit Configuration File (.coderabbit.yaml)

  • You can programmatically configure CodeRabbit by adding a .coderabbit.yaml file to the root of your repository.
  • Please see the configuration documentation for more information.
  • If your editor has YAML language server enabled, you can add the path at the top of this file to enable auto-completion and validation: # yaml-language-server: $schema=https://coderabbit.ai/integrations/schema.v2.json

Documentation and Community

  • Visit our Documentation for detailed information on how to use CodeRabbit.
  • Join our Discord Community to get help, request features, and share feedback.
  • Follow us on X/Twitter for updates and announcements.


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 7

🧹 Nitpick comments (7)
hivemind_etl/mediawiki/schema.py (1)

6-9: Consider type constraints.
user_id might benefit from explicit numeric or string constraints. For instance, if user_id is strictly numeric, consider validating with constr(regex="^\d+$") or an integer type.
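
For illustration, a minimal sketch of such a constraint on the PR's Contributor model (assuming Pydantic v2, where the keyword is pattern=; Pydantic v1 uses regex= instead):

from typing import Optional

from pydantic import BaseModel, constr


class Contributor(BaseModel):
    username: Optional[str] = None
    # Enforce a purely numeric id; alternatively declare user_id as Optional[int].
    user_id: Optional[constr(pattern=r"^\d+$")] = None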

hivemind_etl/mediawiki/transform_xml.py (1)

56-68: Check for large-scale performance.
Gathering all pages in memory can be expensive if the XML is huge. Consider a streaming approach with iterparse if you anticipate large dumps.
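
A rough sketch of what a streaming variant could look like with xml.etree.ElementTree.iterparse (the export namespace URI below is illustrative and must match the one used by the actual dump):

import xml.etree.ElementTree as ET
from typing import Iterator

MW_NS = "{http://www.mediawiki.org/xml/export-0.11/}"  # adjust to the dump's namespace


def iter_pages(xml_file: str) -> Iterator[ET.Element]:
    """Yield <page> elements one at a time instead of materializing the whole tree."""
    for _, elem in ET.iterparse(xml_file, events=("end",)):
        if elem.tag == f"{MW_NS}page":
            yield elem
            elem.clear()  # release memory held by the already-processed page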

hivemind_etl/mediawiki/llama_xml_reader.py (1)

1-4: Fix duplicate module docstrings

The file contains two string literals at the top of the file; only the first becomes the module docstring, so the attribution string below it is left as a stray expression indicating the code was copied from another source.

Combine these into a single comprehensive docstring:

-"""XML Reader."""
-
-"""Copied from https://github.com/run-llama/llama_index/blob/main/llama-index-integrations/readers/llama-index-readers-file/llama_index/readers/file/xml/base.py"""
+"""XML Reader.
+
+Adapted from https://github.com/run-llama/llama_index/blob/main/llama-index-integrations/readers/llama-index-readers-file/llama_index/readers/file/xml/base.py
+"""
tests/unit/test_mediawiki_etl.py (4)

34-44: Fix grammar in comment

Minor grammatical error in the comment.

def test_extract_with_default_path(self):
-   # Create a ETL instance with mocked wikiteam_crawler
+   # Create an ETL instance with mocked wikiteam_crawler
    etl = MediawikiETL(community_id=self.community_id)
    etl.wikiteam_crawler = Mock()

    etl.extract(self.api_url)

    etl.wikiteam_crawler.crawl.assert_called_once_with(
        self.api_url, f"dumps/{self.community_id}.xml"
    )

45-55: Fix grammar in comment

Minor grammatical error in the comment.

def test_extract_with_custom_path(self):
-   # Create a ETL instance with mocked wikiteam_crawler
+   # Create an ETL instance with mocked wikiteam_crawler
    etl = MediawikiETL(community_id=self.community_id)
    etl.wikiteam_crawler = Mock()

    etl.extract(self.api_url, self.custom_path)

    self.assertEqual(etl.dump_path, self.custom_path)
    etl.wikiteam_crawler.crawl.assert_called_once_with(
        self.api_url, self.custom_path
    )

57-90: Expand test coverage for transform method

The test validates several metadata fields but doesn't verify all fields, excluded_embed_metadata_keys, or excluded_llm_metadata_keys. This could miss issues with fields that aren't explicitly tested.

Expand the test to verify more metadata fields and configuration options:

@patch("hivemind_etl.mediawiki.etl.parse_mediawiki_xml")
def test_transform_success(self, mock_parse_mediawiki_xml):
    etl = MediawikiETL(community_id=self.community_id)

    # Mock page data
    mock_page = Mock()
    mock_page.page_id = "123"
    mock_page.title = "Test Page"
    mock_page.namespace = 0
    mock_page.revision = Mock(
        text="Test content",
        revision_id="456",
        parent_revision_id="455",
        timestamp="2024-01-01T00:00:00Z",
        comment="Test edit",
        contributor=Mock(username="testuser", user_id="789"),
        sha1="abc123",
        model="wikitext",
    )

    mock_parse_mediawiki_xml.return_value = [mock_page]

    documents = etl.transform()

    self.assertEqual(len(documents), 1)
    doc = documents[0]
    self.assertIsInstance(doc, Document)
    self.assertEqual(doc.doc_id, "123")
    self.assertEqual(doc.text, "Test content")
    self.assertEqual(doc.metadata["title"], "Test Page")
    self.assertEqual(doc.metadata["namespace"], 0)
    self.assertEqual(doc.metadata["revision_id"], "456")
    self.assertEqual(doc.metadata["contributor_username"], "testuser")
+   # Verify additional metadata fields
+   self.assertEqual(doc.metadata["parent_revision_id"], "455")
+   self.assertEqual(doc.metadata["timestamp"], "2024-01-01T00:00:00Z")
+   self.assertEqual(doc.metadata["comment"], "Test edit")
+   self.assertEqual(doc.metadata["contributor_user_id"], "789")
+   self.assertEqual(doc.metadata["sha1"], "abc123")
+   self.assertEqual(doc.metadata["model"], "wikitext")
+   
+   # Verify excluded_embed_metadata_keys
+   self.assertIn("namespace", doc.excluded_embed_metadata_keys)
+   self.assertIn("revision_id", doc.excluded_embed_metadata_keys)
+   self.assertIn("parent_revision_id", doc.excluded_embed_metadata_keys)
+   self.assertIn("sha1", doc.excluded_embed_metadata_keys)
+   self.assertIn("model", doc.excluded_embed_metadata_keys)
+   self.assertIn("contributor_user_id", doc.excluded_embed_metadata_keys)
+   self.assertIn("comment", doc.excluded_embed_metadata_keys)
+   self.assertIn("timestamp", doc.excluded_embed_metadata_keys)
+   
+   # Verify excluded_llm_metadata_keys
+   self.assertIn("namespace", doc.excluded_llm_metadata_keys)
+   self.assertIn("revision_id", doc.excluded_llm_metadata_keys)
+   self.assertIn("parent_revision_id", doc.excluded_llm_metadata_keys)
+   self.assertIn("sha1", doc.excluded_llm_metadata_keys)
+   self.assertIn("model", doc.excluded_llm_metadata_keys)
+   self.assertIn("contributor_user_id", doc.excluded_llm_metadata_keys)

9-24: Use a context manager for directory cleanup

The current implementation creates and cleans up directories in setUp and tearDown. tearDown still runs when a test assertion fails, but cleanup can be skipped if setUp fails partway through or if the test run is interrupted.

Consider using a context manager like tempfile.TemporaryDirectory for more reliable cleanup:

import os
import unittest
+import tempfile
from unittest.mock import Mock, patch

from llama_index.core import Document
from hivemind_etl.mediawiki.etl import MediawikiETL


class TestMediawikiETL(unittest.TestCase):
    def setUp(self):
        self.community_id = "test_community"
        self.api_url = "https://example.com/api.php"
        self.custom_path = "custom/path/dump.xml"

+       # Create a temporary directory for dumps
+       self.temp_dir = tempfile.TemporaryDirectory()
+       self.original_dumps_dir = "dumps"
+       # Create a symlink from dumps to our temporary directory
+       if os.path.exists(self.original_dumps_dir):
+           self.original_dumps_backup = tempfile.TemporaryDirectory()
+           os.rename(self.original_dumps_dir, os.path.join(self.original_dumps_backup.name, "dumps"))
+       os.symlink(self.temp_dir.name, self.original_dumps_dir)

    def tearDown(self):
+       # Remove the symlink
+       if os.path.islink(self.original_dumps_dir):
+           os.unlink(self.original_dumps_dir)
+       # Restore original dumps directory if it existed
+       if hasattr(self, 'original_dumps_backup'):
+           os.rename(os.path.join(self.original_dumps_backup.name, "dumps"), self.original_dumps_dir)
+           self.original_dumps_backup.cleanup()
+       # Clean up the temporary directory
+       self.temp_dir.cleanup()

This approach ensures cleanup happens even if tests fail, and it doesn't interfere with existing files in the current directory.

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between bb5c53a and c7c18c5.

📒 Files selected for processing (8)
  • .gitignore (1 hunks)
  • hivemind_etl/mediawiki/etl.py (1 hunks)
  • hivemind_etl/mediawiki/llama_xml_reader.py (1 hunks)
  • hivemind_etl/mediawiki/schema.py (1 hunks)
  • hivemind_etl/mediawiki/transform_xml.py (1 hunks)
  • hivemind_etl/mediawiki/wikiteam_crawler.py (1 hunks)
  • requirements.txt (1 hunks)
  • tests/unit/test_mediawiki_etl.py (1 hunks)
🧰 Additional context used
🧬 Code Graph Analysis (3)
hivemind_etl/mediawiki/transform_xml.py (1)
hivemind_etl/mediawiki/schema.py (4)
  • Contributor (6-8)
  • Page (23-27)
  • Revision (11-20)
  • SiteInfo (30-34)
hivemind_etl/mediawiki/etl.py (2)
hivemind_etl/mediawiki/transform_xml.py (1)
  • parse_mediawiki_xml (7-122)
hivemind_etl/mediawiki/wikiteam_crawler.py (2)
  • WikiteamCrawler (7-78)
  • crawl (22-59)
tests/unit/test_mediawiki_etl.py (2)
hivemind_etl/mediawiki/etl.py (4)
  • MediawikiETL (10-80)
  • extract (18-24)
  • transform (26-71)
  • load (73-80)
hivemind_etl/mediawiki/wikiteam_crawler.py (1)
  • crawl (22-59)
⏰ Context from checks skipped due to timeout of 90000ms (1)
  • GitHub Check: ci / lint / Lint
🔇 Additional comments (9)
requirements.txt (1)

11-11:


❓ Verification inconclusive

Ensure package version security.
The newly added wikiteam3==4.4.1 might have known vulnerabilities or require a version update. Consider verifying it with safety or a similar vulnerability-scanning tool.


🏁 Script executed:

#!/bin/bash
# Checking for known vulnerabilities in wikiteam3==4.4.1
pip install safety
safety check --full-report --file requirements.txt

Length of output: 6329


Attention: Manual Verification Needed for wikiteam3==4.4.1 Vulnerability Scan

  • The automated check using Safety did not produce any vulnerability report because the safety command wasn’t found. It appears Safety was installed in /home/jailuser/.local/bin which isn’t in the current PATH.
  • Action Required:
    • Ensure that your PATH includes /home/jailuser/.local/bin or run Safety via its absolute path.
    • Manually verify the package version (e.g., by running /home/jailuser/.local/bin/safety check --full-report --file requirements.txt) to confirm that wikiteam3==4.4.1 is not affected by known vulnerabilities.
.gitignore (1)

164-164: Confirm ignoring XML files is intentional.
Adding *.xml to .gitignore is helpful to avoid large or temporary support files. However, ensure that XML files aren’t crucial for version control or partial diffs that might be needed for debugging or data reconstruction.

Also applies to: 166-166

hivemind_etl/mediawiki/schema.py (4)

1-4: Imports look appropriate.
Good use of typing.Optional and pydantic.BaseModel. No immediate concerns here.


11-21: Revise default contributor construction.
contributor: Contributor = Contributor() reads as though every Revision instance would share one default Contributor. In practice Pydantic copies mutable defaults, so each Revision gets its own instance; if a single shared contributor object was the intent, document that explicitly. In most cases a separate Contributor per Revision is the right approach.
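
If the intent is explicitly one fresh Contributor per Revision, default_factory states that directly; a minimal sketch based on the models in this PR:

from typing import Optional

from pydantic import BaseModel, Field


class Contributor(BaseModel):
    username: Optional[str] = None
    user_id: Optional[str] = None


class Revision(BaseModel):
    # Each Revision instance gets its own Contributor.
    contributor: Contributor = Field(default_factory=Contributor)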


23-28: Model looks consistent.
Page model aligns well with optional fields. For some data ingestion, you might add validations (e.g., title string length).


30-34: SiteInfo usage.
All fields are optional, which is practical for diverse MediaWiki dumps. Consider clarifying or validating at least sitename or dbname if your downstream logic relies on them.

hivemind_etl/mediawiki/transform_xml.py (3)

7-26: Documentation is thorough.
The docstring provides a clear understanding of the function’s purpose, parameters, and return value. This is very helpful.


47-54: Optional SiteInfo usage.
Your code properly checks if <siteinfo> exists. Ensure the missing site info scenario is indeed valid for your downstream logic. If site details are absolutely required, you may want to log a warning when it’s missing.


119-122: Logging usage is good.
You log the total pages processed, which is helpful. Consider structured logging (e.g., JSON format) if you need advanced analytics or container-based orchestration logs.
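
Should structured logs ever be needed, a standard-library-only sketch (purely illustrative; the project may prefer a library such as python-json-logger):

import json
import logging


class JsonFormatter(logging.Formatter):
    def format(self, record: logging.LogRecord) -> str:
        # Emit one JSON object per log record.
        return json.dumps(
            {
                "time": self.formatTime(record),
                "level": record.levelname,
                "logger": record.name,
                "message": record.getMessage(),
            }
        )


handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logging.getLogger("hivemind_etl.mediawiki").addHandler(handler)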

Comment on lines +69 to +117
# Extract revision details
revision = page.find("mw:revision", namespaces)
if revision is not None:
    rev_data = Revision()
    rev_id_el = revision.find("mw:id", namespaces)
    rev_data.revision_id = rev_id_el.text if rev_id_el is not None else None

    parentid_el = revision.find("mw:parentid", namespaces)
    rev_data.parent_revision_id = (
        parentid_el.text if parentid_el is not None else None
    )

    timestamp_el = revision.find("mw:timestamp", namespaces)
    rev_data.timestamp = timestamp_el.text if timestamp_el is not None else None

    # Revision comment (present only on some pages)
    comment_el = revision.find("mw:comment", namespaces)
    rev_data.comment = comment_el.text if comment_el is not None else ""

    # Contributor information
    contributor = revision.find("mw:contributor", namespaces)
    if contributor is not None:
        cont_data = Contributor()
        username_el = contributor.find("mw:username", namespaces)
        cont_data.username = (
            username_el.text if username_el is not None else None
        )

        user_id_el = contributor.find("mw:id", namespaces)
        cont_data.user_id = user_id_el.text if user_id_el is not None else None

        rev_data.contributor = cont_data

    # Other revision details like model and format
    model_el = revision.find("mw:model", namespaces)
    rev_data.model = model_el.text if model_el is not None else None

    format_el = revision.find("mw:format", namespaces)
    rev_data.format = format_el.text if format_el is not None else None

    # Extract the full text content; note that XML escapes are retained (e.g., &lt;)
    text_el = revision.find("mw:text", namespaces)
    rev_data.text = text_el.text if text_el is not None else ""

    # Capture sha1 if needed
    sha1_el = revision.find("mw:sha1", namespaces)
    rev_data.sha1 = sha1_el.text if sha1_el is not None else None

    page_data.revision = rev_data

🛠️ Refactor suggestion

Optional deeper contributor logic.
You have robust coverage for revision extraction. If you want to handle the full history, extend the logic to iterate over every <revision> element per page; right now find only returns the first <revision> block (see the sketch below for a findall-based variant).
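
A hedged sketch of that variant, reusing the Revision model from this PR (only two fields are shown; a list-valued field on Page would be a new, hypothetical addition to the current schema):

from xml.etree.ElementTree import Element

from hivemind_etl.mediawiki.schema import Revision


def parse_revisions(page: Element, namespaces: dict[str, str]) -> list[Revision]:
    """Collect every <revision> under a <page> instead of only the first one."""
    revisions: list[Revision] = []
    for revision in page.findall("mw:revision", namespaces):
        rev_data = Revision()
        rev_id_el = revision.find("mw:id", namespaces)
        rev_data.revision_id = rev_id_el.text if rev_id_el is not None else None
        text_el = revision.find("mw:text", namespaces)
        rev_data.text = text_el.text if text_el is not None else ""
        # ...extract the remaining fields exactly as in the single-revision branch above.
        revisions.append(rev_data)
    return revisions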


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

🔭 Outside diff range comments (1)
hivemind_etl/activities.py (1)

1-21: 🛠️ Refactor suggestion

Consider adding explicit re-exports if the imported functions should be available through this module.

If the intention is to make the website activity functions available through this module, you should explicitly re-export them to maintain the same public API.

 import logging
 from typing import Any

 from hivemind_etl.website.activities import (
     get_hivemind_website_comminities,
     extract_website,
     transform_website_data,
     load_website_data,
 )

 from temporalio import activity


 logging.basicConfig(level=logging.INFO)
 logger = logging.getLogger(__name__)


 @activity.defn
 async def say_hello():
     return 7
+
+# Re-export website activities to maintain the public API
+__all__ = [
+    "get_hivemind_website_comminities",
+    "extract_website",
+    "transform_website_data",
+    "load_website_data",
+    "say_hello",
+]

Additionally, update the function name if you fix the typo in the original function:

+# Re-export website activities to maintain the public API
+__all__ = [
+    "get_hivemind_website_communities",  # Fixed typo
+    "extract_website",
+    "transform_website_data",
+    "load_website_data",
+    "say_hello",
+]
🧰 Tools
🪛 Ruff (0.8.2)

2-2: typing.Any imported but unused

Remove unused import: typing.Any

(F401)


5-5: hivemind_etl.website.activities.get_hivemind_website_comminities imported but unused

Remove unused import

(F401)


6-6: hivemind_etl.website.activities.extract_website imported but unused

Remove unused import

(F401)


7-7: hivemind_etl.website.activities.transform_website_data imported but unused

Remove unused import

(F401)


8-8: hivemind_etl.website.activities.load_website_data imported but unused

Remove unused import

(F401)

🧹 Nitpick comments (5)
workflows.py (1)

7-8: Remove unused imports.

CommunityWebsiteWorkflow and WebsiteIngestionSchedulerWorkflow are not referenced in this file. Consider removing them to satisfy lint and reduce clutter.

-from hivemind_etl.website.workflows import (
-    CommunityWebsiteWorkflow,
-    WebsiteIngestionSchedulerWorkflow,
-)
🧰 Tools
🪛 Ruff (0.8.2)

7-7: hivemind_etl.website.workflows.CommunityWebsiteWorkflow imported but unused

Remove unused import

(F401)


8-8: hivemind_etl.website.workflows.WebsiteIngestionSchedulerWorkflow imported but unused

Remove unused import

(F401)

registry.py (1)

3-6: Check spelling for function name.

get_hivemind_website_comminities likely contains a typographical error.

- get_hivemind_website_comminities
+ get_hivemind_website_communities
hivemind_etl/website/workflows.py (1)

16-63: Consider unifying retry configurations or making them configurable.

Each activity call has individual retry policies. For maintainability, store these in a shared constant or config if you need to adjust them in multiple places later.
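
One possible shape for such a shared setting, sketched with the Temporal Python SDK (values and the activity shown in the commented usage are illustrative, not the PR's actual configuration):

from datetime import timedelta

from temporalio.common import RetryPolicy

# Single place to tune retry behaviour for all website activities (illustrative values).
DEFAULT_ACTIVITY_RETRY = RetryPolicy(
    initial_interval=timedelta(seconds=5),
    backoff_coefficient=2.0,
    maximum_attempts=3,
)

# Inside a workflow method it would then be passed along, e.g.:
#   await workflow.execute_activity(
#       extract_website,
#       community_info,
#       start_to_close_timeout=timedelta(minutes=30),
#       retry_policy=DEFAULT_ACTIVITY_RETRY,
#   )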

hivemind_etl/website/activities.py (2)

1-10: Imports look good but consider organizing by category.

Your imports are correctly structured, but consider organizing them by standard library, third-party, and local imports with blank lines separating each category for better readability.

 import logging 
 from typing import Any
 
 from temporalio import activity, workflow
 
+
 with workflow.unsafe.imports_passed_through():
     from hivemind_etl.website.module import ModulesWebsite
     from hivemind_etl.website.website_etl import WebsiteETL
     from llama_index.core import Document

13-42: Fix typo in function name "comminities" → "communities".

There's a spelling error in the function name that should be corrected for consistency and clarity.

-async def get_hivemind_website_comminities(
+async def get_hivemind_website_communities(
     platform_id: str | None = None,
 ) -> list[dict[str, Any]]:

Also ensure you update all references to this function in other files, particularly in hivemind_etl/activities.py.

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between c7c18c5 and f132990.

📒 Files selected for processing (5)
  • hivemind_etl/activities.py (1 hunks)
  • hivemind_etl/website/activities.py (1 hunks)
  • hivemind_etl/website/workflows.py (1 hunks)
  • registry.py (2 hunks)
  • workflows.py (1 hunks)
🧰 Additional context used
🧬 Code Graph Analysis (5)
registry.py (1)
hivemind_etl/website/activities.py (4)
  • get_hivemind_website_comminities (13-42)
  • load_website_data (78-87)
  • transform_website_data (62-74)
  • extract_website (46-58)
hivemind_etl/website/activities.py (2)
hivemind_etl/website/module.py (2)
  • ModulesWebsite (6-74)
  • get_learning_platforms (11-74)
hivemind_etl/website/website_etl.py (1)
  • WebsiteETL (9-106)
hivemind_etl/activities.py (1)
hivemind_etl/website/activities.py (4)
  • get_hivemind_website_comminities (13-42)
  • extract_website (46-58)
  • transform_website_data (62-74)
  • load_website_data (78-87)
hivemind_etl/website/workflows.py (1)
hivemind_etl/website/activities.py (4)
  • extract_website (46-58)
  • get_hivemind_website_comminities (13-42)
  • load_website_data (78-87)
  • transform_website_data (62-74)
workflows.py (2)
hivemind_etl/activities.py (1)
  • say_hello (19-20)
hivemind_etl/website/workflows.py (2)
  • CommunityWebsiteWorkflow (18-61)
  • WebsiteIngestionSchedulerWorkflow (66-93)
🪛 Ruff (0.8.2)
hivemind_etl/activities.py

5-5: hivemind_etl.website.activities.get_hivemind_website_comminities imported but unused

Remove unused import

(F401)


6-6: hivemind_etl.website.activities.extract_website imported but unused

Remove unused import

(F401)


7-7: hivemind_etl.website.activities.transform_website_data imported but unused

Remove unused import

(F401)


8-8: hivemind_etl.website.activities.load_website_data imported but unused

Remove unused import

(F401)

workflows.py

7-7: hivemind_etl.website.workflows.CommunityWebsiteWorkflow imported but unused

Remove unused import

(F401)


8-8: hivemind_etl.website.workflows.WebsiteIngestionSchedulerWorkflow imported but unused

Remove unused import

(F401)

⏰ Context from checks skipped due to timeout of 90000ms (2)
  • GitHub Check: ci / test / Test
  • GitHub Check: ci / lint / Lint
🔇 Additional comments (8)
workflows.py (1)

5-5: Import usage is valid.

The say_hello import is used later in line 25, so no concerns here.

registry.py (1)

20-26: ACTIVITIES list update looks consistent.

These references align with the new website activity modules.

hivemind_etl/website/workflows.py (2)

1-7: Imports and setup look good.

All needed imports (logging, asyncio, Temporal classes, etc.) are correctly specified here.


64-94: Workflow scheduling logic is straightforward.

Child workflows are started in parallel for each community. This sets the stage for concurrent ingestion, which looks correct for your use case.

hivemind_etl/website/activities.py (3)

45-58: Good implementation of extract_website activity with proper logging.

The function correctly initializes the WebsiteETL class and calls its extract method with appropriate error handling and logging.


61-74: Naming consistency between function and module.

The function name transform_website_data follows a good naming pattern that clearly indicates it's for website data. This is more explicit than the previous function name mentioned in the summary (transform_data).


77-87: Consistent ETL pattern implementation.

The load_website_data function follows the same pattern as the extract and transform functions, providing good consistency across the ETL pipeline.

hivemind_etl/activities.py (1)

18-20:


✅ Verification successful

Verify if the say_hello function is still needed.

This function appears to be a simple test function that returns a constant value (7). Check if it's still being used and consider removing it if it's no longer needed.


🏁 Script executed:

#!/bin/bash
# Check if say_hello is used anywhere in the codebase
echo "Checking if say_hello is used anywhere:"
rg -A 2 "say_hello" --exclude="activities.py"

Length of output: 544


Below is a new shell script using the correct ripgrep flag. Please re-run it manually to re-verify whether any references to say_hello exist outside of hivemind_etl/activities.py; if none are found, it is safe to remove the function.


🏁 Script executed:

#!/bin/bash
echo "Checking if say_hello is referenced in files excluding hivemind_etl/activities.py:"
rg -A 2 "say_hello" -g '!hivemind_etl/activities.py'

Length of output: 743


Confirmation on say_hello Usage:
The verification shows that the say_hello function is actively referenced in other parts of the codebase (specifically in workflows.py and registry.py via imports). No removal is needed at this time.

Comment on lines +4 to +9
from hivemind_etl.website.activities import (
    get_hivemind_website_comminities,
    extract_website,
    transform_website_data,
    load_website_data,
)

💡 Verification agent

🧩 Analysis chain

Fix unused imports identified by static analysis.

The imports are not being used in this file according to static analysis. If they're being imported for use in other modules that import from this file, consider adding a comment explaining this purpose. Otherwise, they should be removed.

-from hivemind_etl.website.activities import (
-    get_hivemind_website_comminities,
-    extract_website,
-    transform_website_data,
-    load_website_data,
-)

Let's verify if these functions are used elsewhere:


🏁 Script executed:

#!/bin/bash
# Search for imports or uses of these functions in other files
echo "Checking where these functions are imported or used:"
rg -l "from hivemind_etl.activities import" | xargs rg -l "(get_hivemind_website_comminities|extract_website|transform_website_data|load_website_data)"

Length of output: 280


Action Required: Clarify Re-Exported Functions Usage
The static analysis warning is a false positive since our grep confirms that these functions are imported in registry.py, meaning they’re intentionally re-exported as part of the module's public API. To prevent future confusion and accidental removal, please add a comment next to the import block in hivemind_etl/activities.py indicating that these imports exist solely for external use (re-export).

  • Location: hivemind_etl/activities.py (Lines 4-9)
  • Suggestion: Add a comment such as:
    # These functions are imported here to expose them as part of the public API for external modules (e.g., used in registry.py).
🧰 Tools
🪛 Ruff (0.8.2)

5-5: hivemind_etl.website.activities.get_hivemind_website_comminities imported but unused

Remove unused import

(F401)


6-6: hivemind_etl.website.activities.extract_website imported but unused

Remove unused import

(F401)


7-7: hivemind_etl.website.activities.transform_website_data imported but unused

Remove unused import

(F401)


8-8: hivemind_etl.website.activities.load_website_data imported but unused

Remove unused import

(F401)


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 4

♻️ Duplicate comments (5)
hivemind_etl/mediawiki/wikiteam_crawler.py (1)

24-65: 🛠️ Refactor suggestion

Wrap DumpGenerator calls in try-except.
External calls to DumpGenerator can fail unexpectedly. Placing them in a try-except allows for logging and controlled error handling.

     # Directly call the DumpGenerator static __init__ method which will parse these parameters,
     # execute the dump generation process, and run through the rest of the workflow.
-    DumpGenerator(params)
+    try:
+        DumpGenerator(params)
+    except Exception as e:
+        logging.error(f"Error occurred while generating dump: {e}")
+        raise
🧰 Tools
🪛 Ruff (0.8.2)

51-51: f-string without any placeholders

Remove extraneous f prefix

(F541)

hivemind_etl/mediawiki/transform_xml.py (2)

56-57: Add error handling for malformed XML.

Wrapping ET.parse(xml_file) in a try-except block helps gracefully handle malformed or unreadable XML files.

Apply a try-except around the parse call:

 try:
     tree = ET.parse(xml_file)
     root = tree.getroot()
 except ET.ParseError as e:
     logging.error(f"Error parsing XML file {xml_file}: {str(e)}")
     raise

82-83: Consider supporting multiple revisions.

Currently, page.find("mw:revision", ...) retrieves only one revision. If you need full revision history, use findall and process each <revision>.

hivemind_etl/activities.py (1)

4-15: Unused imports flagged by static analysis.

These imported functions appear unused here. If they're intentionally re-exported, add a clarifying comment; otherwise, remove them to avoid confusion and reduce clutter.

🧰 Tools
🪛 Ruff (0.8.2)

5-5: hivemind_etl.website.activities.get_hivemind_website_comminities imported but unused

Remove unused import

(F401)


6-6: hivemind_etl.website.activities.extract_website imported but unused

Remove unused import

(F401)


7-7: hivemind_etl.website.activities.transform_website_data imported but unused

Remove unused import

(F401)


8-8: hivemind_etl.website.activities.load_website_data imported but unused

Remove unused import

(F401)


11-11: hivemind_etl.mediawiki.activities.get_hivemind_mediawiki_platforms imported but unused

Remove unused import

(F401)


12-12: hivemind_etl.mediawiki.activities.extract_mediawiki imported but unused

Remove unused import

(F401)


13-13: hivemind_etl.mediawiki.activities.transform_mediawiki_data imported but unused

Remove unused import

(F401)


14-14: hivemind_etl.mediawiki.activities.load_mediawiki_data imported but unused

Remove unused import

(F401)

hivemind_etl/mediawiki/etl.py (1)

84-85: Add error handling when removing the dump directory.

If shutil.rmtree(self.dump_dir) fails or the directory doesn't exist, an unhandled exception may arise. Consider surrounding it with a try-except block to log or handle errors gracefully.
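
A minimal sketch of that guard (the helper name and the choice to re-raise on OSError are assumptions, not the PR's behaviour):

import logging
import shutil


def delete_dump_dir(dump_dir: str) -> None:
    """Remove the dump directory, logging instead of crashing when it is already gone."""
    try:
        shutil.rmtree(dump_dir)
    except FileNotFoundError:
        logging.warning(f"Dump directory {dump_dir} does not exist; nothing to delete.")
    except OSError as exc:
        logging.error(f"Failed to delete dump directory {dump_dir}: {exc}")
        raise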

🧹 Nitpick comments (8)
workflows.py (2)

6-8: Remove unused imports from hivemind_etl.website.workflows.
Static analysis flags CommunityWebsiteWorkflow and WebsiteIngestionSchedulerWorkflow as unused. Consider removing them to reduce dead code.

 from hivemind_etl.website.workflows import (
-    CommunityWebsiteWorkflow,
-    WebsiteIngestionSchedulerWorkflow,
 )
🧰 Tools
🪛 Ruff (0.8.2)

7-7: hivemind_etl.website.workflows.CommunityWebsiteWorkflow imported but unused

Remove unused import

(F401)


8-8: hivemind_etl.website.workflows.WebsiteIngestionSchedulerWorkflow imported but unused

Remove unused import

(F401)


10-12: Remove unused import from hivemind_etl.mediawiki.workflows.
MediaWikiETLWorkflow is not used in this file, so you can safely remove it to clean up the imports.

 from hivemind_etl.mediawiki.workflows import (
-    MediaWikiETLWorkflow,
 )
🧰 Tools
🪛 Ruff (0.8.2)

11-11: hivemind_etl.mediawiki.workflows.MediaWikiETLWorkflow imported but unused

Remove unused import: hivemind_etl.mediawiki.workflows.MediaWikiETLWorkflow

(F401)

tests/integration/test_mediawiki_modules.py (1)

17-20: Consider adding negative test coverage.
There is no test coverage for cases where baseURL or path is malformed or where namespaces are missing. Adding negative tests can help ensure error handling is robust.

hivemind_etl/mediawiki/workflows.py (1)

36-72: Consider parallelizing or using child workflows for each platform.
Currently, the ETL processes each platform sequentially in a for loop. If concurrency is desired (or if a single failure shouldn't stall other communities), you can leverage child workflows or asyncio concurrency for improved throughput and resilience.
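
For illustration only, a sketch of the concurrent variant using the Temporal Python SDK; CommunityMediaWikiWorkflow and the community_id key are hypothetical, since the PR currently runs everything inside a single sequential workflow:

import asyncio

from temporalio import workflow


async def run_platforms_concurrently(platforms: list[dict]) -> None:
    """Start one (hypothetical) child workflow per platform and wait for all of them.

    Must be awaited from inside a @workflow.defn run method.
    """
    await asyncio.gather(
        *(
            workflow.execute_child_workflow(
                "CommunityMediaWikiWorkflow",  # hypothetical per-community child workflow
                platform,
                id=f"mediawiki-etl-{platform['community_id']}",
            )
            for platform in platforms
        ),
        return_exceptions=True,  # one failing community does not stall the others
    )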

registry.py (2)

3-4: Consider correcting the spelling for “comminities”.
The function name get_hivemind_website_comminities seems to have a typographical error; consider renaming it to match the domain concept “communities.”

-from hivemind_etl.activities import (
-    extract_website,
-    get_hivemind_website_comminities,
-    ...
+from hivemind_etl.activities import (
+    extract_website,
+    get_hivemind_website_communities,
+    ...

26-36: Check for code duplication in ACTIVITIES.
say_hello is included at the end of ACTIVITIES. Consider whether reordering or grouping activities by domain (Website vs. MediaWiki) would improve readability.

hivemind_etl/mediawiki/wikiteam_crawler.py (1)

51-51: Remove extraneous “f” prefix.
Static analysis indicates f"--namespaces" has no placeholders. Remove the f to avoid confusion.

-            params.append(f"--namespaces")
+            params.append("--namespaces")
🧰 Tools
🪛 Ruff (0.8.2)

51-51: f-string without any placeholders

Remove extraneous f prefix

(F541)

hivemind_etl/mediawiki/activities.py (1)

86-96: Review data load concurrency.
Simultaneous calls to load_mediawiki_data for the same community_id might overlap. Consider additional safeguards or concurrency checks, e.g., locking or version checks.
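
One lightweight safeguard is a per-community lock inside the worker process; a sketch only (this does not protect against concurrent loads from separate workers, which would need Temporal-level controls such as deterministic workflow IDs):

import asyncio
from collections import defaultdict
from typing import Awaitable, Callable

# One asyncio.Lock per community_id, scoped to this worker process.
_community_locks: dict[str, asyncio.Lock] = defaultdict(asyncio.Lock)


async def load_exclusively(community_id: str, load_fn: Callable[[], Awaitable[None]]) -> None:
    """Serialize load operations for the same community within this worker."""
    async with _community_locks[community_id]:
        await load_fn()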

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between f132990 and 4c30962.

📒 Files selected for processing (11)
  • .gitignore (1 hunks)
  • hivemind_etl/activities.py (1 hunks)
  • hivemind_etl/mediawiki/activities.py (1 hunks)
  • hivemind_etl/mediawiki/etl.py (1 hunks)
  • hivemind_etl/mediawiki/module.py (1 hunks)
  • hivemind_etl/mediawiki/transform_xml.py (1 hunks)
  • hivemind_etl/mediawiki/wikiteam_crawler.py (1 hunks)
  • hivemind_etl/mediawiki/workflows.py (1 hunks)
  • registry.py (1 hunks)
  • tests/integration/test_mediawiki_modules.py (1 hunks)
  • workflows.py (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • .gitignore
🧰 Additional context used
🧬 Code Graph Analysis (8)
workflows.py (2)
hivemind_etl/activities.py (1)
  • say_hello (25-26)
hivemind_etl/website/workflows.py (2)
  • CommunityWebsiteWorkflow (18-61)
  • WebsiteIngestionSchedulerWorkflow (66-93)
tests/integration/test_mediawiki_modules.py (1)
hivemind_etl/mediawiki/module.py (2)
  • ModulesMediaWiki (6-80)
  • get_learning_platforms (11-80)
hivemind_etl/mediawiki/activities.py (2)
hivemind_etl/mediawiki/module.py (2)
  • ModulesMediaWiki (6-80)
  • get_learning_platforms (11-80)
hivemind_etl/mediawiki/etl.py (4)
  • MediawikiETL (10-85)
  • extract (23-29)
  • transform (31-76)
  • load (78-85)
hivemind_etl/mediawiki/etl.py (2)
hivemind_etl/mediawiki/transform_xml.py (1)
  • parse_mediawiki_xml (9-134)
hivemind_etl/mediawiki/wikiteam_crawler.py (2)
  • WikiteamCrawler (7-83)
  • crawl (24-64)
hivemind_etl/mediawiki/workflows.py (1)
hivemind_etl/mediawiki/activities.py (4)
  • get_hivemind_mediawiki_platforms (13-49)
  • extract_mediawiki (53-69)
  • transform_mediawiki_data (73-83)
  • load_mediawiki_data (87-96)
hivemind_etl/activities.py (2)
hivemind_etl/website/activities.py (4)
  • get_hivemind_website_comminities (13-42)
  • extract_website (46-58)
  • transform_website_data (62-74)
  • load_website_data (78-87)
hivemind_etl/mediawiki/activities.py (4)
  • get_hivemind_mediawiki_platforms (13-49)
  • extract_mediawiki (53-69)
  • transform_mediawiki_data (73-83)
  • load_mediawiki_data (87-96)
registry.py (2)
hivemind_etl/website/activities.py (3)
  • get_hivemind_website_comminities (13-42)
  • load_website_data (78-87)
  • transform_website_data (62-74)
hivemind_etl/mediawiki/activities.py (3)
  • get_hivemind_mediawiki_platforms (13-49)
  • transform_mediawiki_data (73-83)
  • load_mediawiki_data (87-96)
hivemind_etl/mediawiki/transform_xml.py (1)
hivemind_etl/mediawiki/schema.py (4)
  • Contributor (6-8)
  • Page (23-27)
  • Revision (11-20)
  • SiteInfo (30-34)
🪛 Ruff (0.8.2)
workflows.py

7-7: hivemind_etl.website.workflows.CommunityWebsiteWorkflow imported but unused

Remove unused import

(F401)


8-8: hivemind_etl.website.workflows.WebsiteIngestionSchedulerWorkflow imported but unused

Remove unused import

(F401)


11-11: hivemind_etl.mediawiki.workflows.MediaWikiETLWorkflow imported but unused

Remove unused import: hivemind_etl.mediawiki.workflows.MediaWikiETLWorkflow

(F401)

hivemind_etl/mediawiki/wikiteam_crawler.py

14-14: Do not use mutable data structures for argument defaults

Replace with None; initialize within function

(B006)


51-51: f-string without any placeholders

Remove extraneous f prefix

(F541)

hivemind_etl/activities.py

5-5: hivemind_etl.website.activities.get_hivemind_website_comminities imported but unused

Remove unused import

(F401)


6-6: hivemind_etl.website.activities.extract_website imported but unused

Remove unused import

(F401)


7-7: hivemind_etl.website.activities.transform_website_data imported but unused

Remove unused import

(F401)


8-8: hivemind_etl.website.activities.load_website_data imported but unused

Remove unused import

(F401)


11-11: hivemind_etl.mediawiki.activities.get_hivemind_mediawiki_platforms imported but unused

Remove unused import

(F401)


12-12: hivemind_etl.mediawiki.activities.extract_mediawiki imported but unused

Remove unused import

(F401)


13-13: hivemind_etl.mediawiki.activities.transform_mediawiki_data imported but unused

Remove unused import

(F401)


14-14: hivemind_etl.mediawiki.activities.load_mediawiki_data imported but unused

Remove unused import

(F401)

⏰ Context from checks skipped due to timeout of 90000ms (2)
  • GitHub Check: ci / lint / Lint
  • GitHub Check: ci / test / Test
🔇 Additional comments (8)
workflows.py (1)

5-5: Good use of the imported activity.
The say_hello import is correctly utilized later in the SayHello workflow.

registry.py (3)

7-10: Verify that these newly imported MediaWiki activities are only added where needed.
Ensure that extract_mediawiki, get_hivemind_mediawiki_platforms, transform_mediawiki_data, and load_mediawiki_data are actively used in your workflows or scheduled tasks. If they are not invoked, consider removing them or adding usage and test coverage.


16-16: Workflow inclusion appears consistent.
Adding MediaWikiETLWorkflow to the workflow list makes sense for the new MediaWiki ETL feature.


23-23: Great addition to WORKFLOWS.
No concerns about adding MediaWikiETLWorkflow here. Looks consistent with the existing pattern.

hivemind_etl/mediawiki/wikiteam_crawler.py (1)

66-84: Ensure consistent log levels for file deletion.
The delete_dump method uses logging.info and logging.warning for events; confirm that the chosen log levels agree with your logging strategy (e.g., info vs. debug).

hivemind_etl/mediawiki/activities.py (2)

52-70: Good separation of concerns for extract_mediawiki.
Extraction is clearly delegated to MediawikiETL.extract(). This is an excellent pattern for maintainability.


72-84: Check namespaces parameter usage.
transform_mediawiki_data creates MediawikiETL(community_id=community_id) without specifying namespaces. Confirm if that is correct or if you should pass them here as well.

hivemind_etl/mediawiki/transform_xml.py (1)

46-47: Looks Good for File Existence Check.

Raising a FileNotFoundError if no XML file is located aligns well with robust error handling. This ensures early detection of misconfiguration.

Comment on lines +55 to +56
if not isinstance(path, str) and not isinstance(base_url, str):
    raise ValueError("Wrong format for `path` and `base_url`!")

⚠️ Potential issue

Potential logic error with string validation.
Currently, the check raises a ValueError only if both path and base_url are not strings. If either is not a string, it should likely raise.

- if not isinstance(path, str) and not isinstance(base_url, str):
+ if not isinstance(path, str) or not isinstance(base_url, str):
     raise ValueError("Wrong format for `path` and `base_url`!")

Comment on lines +8 to +22
def __init__(
    self,
    community_id: str,
    xml: bool = True,
    force: bool = True,
    curonly: bool = True,
    namespaces: list[int] = [],
    **kwargs,
) -> None:
    self.community_id = community_id
    self.xml = xml
    self.force = force
    self.curonly = curonly
    self.extra_params = kwargs
    self.namespaces = namespaces

🛠️ Refactor suggestion

Avoid using a mutable default argument for namespaces.
In Python, using a mutable as a default ([]) can lead to unexpected behavior if the list is modified. Initialize it with None and set the default inside the constructor.

-    def __init__(
-        self,
-        community_id: str,
-        xml: bool = True,
-        force: bool = True,
-        curonly: bool = True,
-        namespaces: list[int] = [],
-        **kwargs,
-    ) -> None:
-        self.community_id = community_id
-        self.xml = xml
-        self.force = force
-        self.curonly = curonly
-        self.extra_params = kwargs
-        self.namespaces = namespaces
+    def __init__(
+        self,
+        community_id: str,
+        xml: bool = True,
+        force: bool = True,
+        curonly: bool = True,
+        namespaces: list[int] | None = None,
+        **kwargs,
+    ) -> None:
+        self.community_id = community_id
+        self.xml = xml
+        self.force = force
+        self.curonly = curonly
+        self.extra_params = kwargs
+        self.namespaces = namespaces if namespaces is not None else []
🧰 Tools
🪛 Ruff (0.8.2)

14-14: Do not use mutable data structures for argument defaults

Replace with None; initialize within function

(B006)

Comment on lines +35 to +62
for page in pages:
    try:
        documents.append(
            Document(
                doc_id=page.page_id,
                text=page.revision.text,
                metadata={
                    "title": page.title,
                    "namespace": page.namespace,
                    "revision_id": page.revision.revision_id,
                    "parent_revision_id": page.revision.parent_revision_id,
                    "timestamp": page.revision.timestamp,
                    "comment": page.revision.comment,
                    "contributor_username": page.revision.contributor.username,
                    "contributor_user_id": page.revision.contributor.user_id,
                    "sha1": page.revision.sha1,
                    "model": page.revision.model,
                },
                excluded_embed_metadata_keys=[
                    "namespace",
                    "revision_id",
                    "parent_revision_id",
                    "sha1",
                    "model",
                    "contributor_user_id",
                    "comment",
                    "timestamp",
                ],

💡 Verification agent

🧩 Analysis chain

Validate existence of page.revision before referencing.

Accessing page.revision.text may raise an AttributeError if page.revision is None. Confirm whether pages without revisions should be skipped or handled differently.


🏁 Script executed:

#!/bin/bash
# This script searches for references to page.revision in this file to see if they’re all guarded.
rg -A 3 'page\.revision' hivemind_etl/mediawiki/etl.py

Length of output: 939


Action Required: Add a guard for page.revision existence.
It was verified that the code on lines 35–62 in hivemind_etl/mediawiki/etl.py directly accesses attributes of page.revision without checking if it exists. This can lead to a runtime AttributeError if a page lacks a revision.

  • File: hivemind_etl/mediawiki/etl.py (Lines 35–62)
  • Issue: Directly accessing page.revision.text, page.revision.revision_id, etc., without confirming page.revision is not None.
  • Recommendation: Introduce a conditional check before dereferencing the revision. For example, consider:
    for page in pages:
        if page.revision is None:
            continue  # or handle accordingly
        try:
            documents.append(
                Document(
                    doc_id=page.page_id,
                    text=page.revision.text,
                    metadata={
                        "title": page.title,
                        "namespace": page.namespace,
                        "revision_id": page.revision.revision_id,
                        "parent_revision_id": page.revision.parent_revision_id,
                        "timestamp": page.revision.timestamp,
                        "comment": page.revision.comment,
                        "contributor_username": page.revision.contributor.username,
                        "contributor_user_id": page.revision.contributor.user_id,
                        "sha1": page.revision.sha1,
                        "model": page.revision.model,
                    },
                    excluded_embed_metadata_keys=[
                        "namespace",
                        "revision_id",
                        "parent_revision_id",
                        "sha1",
                        "model",
                        "contributor_user_id",
                        "comment",
                        "timestamp",
                    ],
                )
            )
        except Exception as e:
            # appropriate error handling if needed
            pass
  • Action: Please confirm the intended behavior for pages missing a revision and update the code accordingly.


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

🔭 Outside diff range comments (1)
tests/integration/test_mediawiki_modules.py (1)

1-235: 🛠️ Refactor suggestion

Add tests for error handling scenarios.

There are no tests for the error handling in the get_learning_platforms method. Consider adding tests for:

  1. When path or base_url are not strings
  2. When an exception is raised during processing

Example test case:

def test_error_handling_invalid_types(self):
    """Test error handling when path or base_url are not strings"""
    platform_id = ObjectId("6579c364f1120850414e0dc6")
    community_id = ObjectId("6579c364f1120850414e0dc5")

    # Insert platform with non-string path
    self.client["Core"]["platforms"].insert_one({
        "_id": platform_id,
        "name": "mediaWiki",
        "metadata": {
            "baseURL": "http://example.com",
            "path": 123,  # Non-string path
        },
        "community": community_id,
        "disconnectedAt": None,
        "connectedAt": datetime.now(),
        "createdAt": datetime.now(),
        "updatedAt": datetime.now(),
    })
    
    # Insert corresponding module
    self.client["Core"]["modules"].insert_one({
        "name": "hivemind",
        "community": community_id,
        "options": {
            "platforms": [
                {
                    "platform": platform_id,
                    "name": "mediaWiki",
                    "metadata": {"namespaces": [0, 1, 2]},
                }
            ]
        },
    })

    # Should log error and return empty list
    result = self.modules_mediawiki.get_learning_platforms()
    self.assertEqual(result, [])
♻️ Duplicate comments (1)
hivemind_etl/mediawiki/module.py (1)

66-67: Fix the logical condition for string validation.

Currently, the check raises a ValueError only if both path and base_url are not strings. It should raise an error if either is not a string.

- if not isinstance(path, str) and not isinstance(base_url, str):
+ if not isinstance(path, str) or not isinstance(base_url, str):
    raise ValueError("Wrong format for `path` and `base_url`!")
🧹 Nitpick comments (4)
hivemind_etl/mediawiki/module.py (2)

82-82: Remove unnecessary type ignore comment.

The type ignore comment isn't needed if you properly validate that base_url and path are strings before concatenation. Fix the validation condition as suggested earlier and you can safely remove this comment.

- "base_url": base_url + path,  # type: ignore
+ "base_url": base_url + path,

56-56: Consider implementing the TODO for database optimization.

Retrieving baseURL and path in a single database call would reduce latency and improve performance, especially when processing multiple platforms.
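
For illustration, both values could come back from one find_one with a projection; a sketch assuming the Core.platforms document shape used in the integration tests (the helper name is hypothetical and the real ModulesBase accessor may differ):

from bson import ObjectId
from pymongo import MongoClient


def get_platform_urls(client: MongoClient, platform_id: ObjectId) -> tuple[str | None, str | None]:
    """Fetch baseURL and path with a single query instead of two round-trips."""
    doc = client["Core"]["platforms"].find_one(
        {"_id": platform_id},
        {"metadata.baseURL": 1, "metadata.path": 1},
    )
    metadata = (doc or {}).get("metadata", {})
    return metadata.get("baseURL"), metadata.get("path")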

tests/integration/test_mediawiki_modules.py (2)

156-158: Update docstring to reflect test purpose.

The current docstring describes the data setup but not the specific purpose of the test, which is filtering platforms by ID.

- """
- Two mediawiki platforms for one community
- """
+ """
+ Test filtering platforms by ID when multiple platforms exist for one community
+ """

159-219: Reduce code duplication in test setup.

There's significant duplication in the test setup between this test and the previous one. Consider creating a helper method to set up the test data.

Here's a suggested approach:

def _setup_multiple_platforms(self):
    platform_id1 = ObjectId("6579c364f1120850414e0dc6")
    platform_id2 = ObjectId("6579c364f1120850414e0dc7")
    community_id = ObjectId("1009c364f1120850414e0dc5")

    # Insert module with two platforms
    self.client["Core"]["modules"].insert_one({
        "name": "hivemind",
        "community": community_id,
        "options": {
            "platforms": [
                {
                    "platform": platform_id1,
                    "name": "mediaWiki",
                    "metadata": {"namespaces": [0, 1, 2]},
                },
                {
                    "platform": platform_id2,
                    "name": "mediaWiki",
                    "metadata": {"namespaces": [3, 4, 5]},
                },
            ]
        },
    })

    # Insert platform records
    for i, platform_id in enumerate([platform_id1, platform_id2]):
        self.client["Core"]["platforms"].insert_one({
            "_id": platform_id,
            "name": "mediaWiki",
            "metadata": {
                "baseURL": f"http://example{i+1}.com",
                "path": "/api",
            },
            "community": community_id,
            "disconnectedAt": None,
            "connectedAt": datetime.now(),
            "createdAt": datetime.now(),
            "updatedAt": datetime.now(),
        })
    
    return platform_id1, platform_id2, community_id
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 5ed9028 and f065bf3.

📒 Files selected for processing (4)
  • hivemind_etl/mediawiki/activities.py (1 hunks)
  • hivemind_etl/mediawiki/module.py (1 hunks)
  • tests/integration/test_mediawiki_modules.py (1 hunks)
  • tests/unit/test_mediawiki_etl.py (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (2)
  • hivemind_etl/mediawiki/activities.py
  • tests/unit/test_mediawiki_etl.py
🧰 Additional context used
🧬 Code Graph Analysis (1)
tests/integration/test_mediawiki_modules.py (1)
hivemind_etl/mediawiki/module.py (2)
  • ModulesMediaWiki (6-91)
  • get_learning_platforms (11-91)
⏰ Context from checks skipped due to timeout of 90000ms (2)
  • GitHub Check: ci / test / Test
  • GitHub Check: ci / lint / Lint
🔇 Additional comments (7)
hivemind_etl/mediawiki/module.py (3)

6-9: LGTM: Class definition and constructor are well structured.

The class correctly extends ModulesBase and initializes the platform_name attribute before calling the parent constructor.


11-36: LGTM: Method signature and documentation are clear.

The method signature provides appropriate type hints and default values. The docstring clearly describes the purpose, parameters and return value with a helpful example.


37-54: LGTM: Query and filtering implementation looks good.

The code correctly queries modules based on platform name and implements proper filtering logic for platforms.

tests/integration/test_mediawiki_modules.py (4)

9-16: LGTM: Test setup is clean and properly initializes the environment.

The setUp method correctly drops existing collections and initializes the ModulesMediaWiki instance.


17-19: LGTM: Empty data test is straightforward.

This test correctly verifies the behavior when no data is available.


21-69: LGTM: Single data test covers basic functionality.

This test properly sets up a single platform and module, then verifies the correct output format and values.


70-154: LGTM: Multiple platforms test is comprehensive.

The test correctly verifies that multiple platforms for a single community are properly retrieved and formatted.


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

🧹 Nitpick comments (6)
hivemind_etl/mediawiki/workflows.py (6)

28-35: Consider using more specific exception handling.

While catching all exceptions is appropriate for the workflow's main try block, consider adding logging for specific exception types that might be common or require different handling.
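One possible shape, assuming the workflow logs with the standard logging module as in the other snippets here and that temporalio.exceptions.ActivityError is available in the SDK version used by this project (adjust the exception types to whatever actually applies):

import logging

from temporalio.exceptions import ActivityError

try:
    ...  # activity executions for this platform
except ActivityError as e:
    # an activity failed: log which community it was for before moving on
    logging.error(f"Activity failed for community {platform['community_id']}: {e}")
except Exception as e:
    # anything unexpected: log and re-raise so the workflow run surfaces the failure
    logging.error(f"Unexpected error in MediaWiki ETL workflow: {e}")
    raise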


38-43: Consider using a dedicated data structure for platform configuration.

Using a dictionary for the platform configuration works, but a Pydantic model would provide better type safety and validation.

from pydantic import BaseModel

class MediaWikiPlatform(BaseModel):
    base_url: str
    community_id: str
    namespaces: list[int]  # Assuming namespaces is a list of integers
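A possible follow-up use inside the run method (the field names mirror the dictionary keys already used in this workflow; this is a sketch, not the PR's actual code):

mediawiki_platform = MediaWikiPlatform(
    base_url=platform["base_url"],
    community_id=platform["community_id"],
    namespaces=platform["namespaces"],
)

# pass a plain dict to the activity unless a Pydantic-aware data converter is configured
await workflow.execute_activity(
    extract_mediawiki,
    mediawiki_platform.model_dump(),
    start_to_close_timeout=timedelta(days=5),
)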

58-63: Implement retry logic for the load operation.

Loading data often encounters transient issues. Consider implementing a retry mechanism for the load_mediawiki_data activity using Temporal's built-in retry options.

# Example retry policy
from datetime import timedelta

from temporalio.common import RetryPolicy

retry_policy = RetryPolicy(
    initial_interval=timedelta(seconds=1),
    backoff_coefficient=2.0,
    maximum_interval=timedelta(minutes=5),
    maximum_attempts=5,
)

# Then use it with the activity (multiple arguments go through `args=` in the Python SDK)
await workflow.execute_activity(
    load_mediawiki_data,
    args=[documents, platform["community_id"]],
    start_to_close_timeout=timedelta(minutes=30),
    retry_policy=retry_policy,
)

65-72: Improve error reporting granularity.

Consider adding more detailed logging for different stages of processing (extraction, transformation, loading) to help diagnose issues more effectively.

# Example of more detailed logging
try:
    # Extraction
    logging.info(f"Starting extraction for community: {platform['community_id']}")
    await workflow.execute_activity(...)
    logging.info(f"Extraction completed for community: {platform['community_id']}")
    
    # Transformation
    logging.info(f"Starting transformation for community: {platform['community_id']}")
    documents = await workflow.execute_activity(...)
    logging.info(f"Transformation completed for community: {platform['community_id']}, processed {len(documents)} documents")
    
    # Loading
    logging.info(f"Starting data loading for community: {platform['community_id']}")
    await workflow.execute_activity(...)
    logging.info(f"Successfully completed ETL for community id: {platform['community_id']}")
except Exception as e:
    # Error handling remains the same

74-76: Consider returning workflow execution summary.

The workflow currently returns None. Consider returning a summary of the processing results, such as the number of communities processed successfully and failed.

@workflow.run
async def run(self, platform_id: str | None = None) -> dict:
    results = {"successful": [], "failed": []}
    try:
        # ... existing code ...
        for platform in platforms:
            try:
                # ... existing processing code ...
                results["successful"].append(platform["community_id"])
            except Exception as e:
                results["failed"].append({
                    "community_id": platform["community_id"],
                    "error": str(e)
                })
                continue
    except Exception as e:
        # ... existing error handling ...
    return results

36-68: Consider implementing parallel processing for multiple platforms.

Currently, platforms are processed sequentially. If processing numerous platforms, consider implementing parallel processing to improve efficiency.

You could use Temporal's child workflows or activity execution options to implement parallel processing:

# Example using asyncio.gather for parallel processing (workflow-safe in Temporal)
import asyncio

# Inside the run method: schedule one coroutine per platform and await them together
tasks = [self._process_platform(platform) for platform in platforms]
await asyncio.gather(*tasks)

# Add a helper method to process a single platform
async def _process_platform(self, platform):
    try:
        # Existing platform processing code
        mediawiki_platform = {...}
        await workflow.execute_activity(...)
        # ...
    except Exception as e:
        logging.error(f"Error processing community id: {platform['community_id']}: {str(e)}")

This approach would need careful consideration of resource usage and potential rate limits.

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between f065bf3 and 56d997a.

📒 Files selected for processing (1)
  • hivemind_etl/mediawiki/workflows.py (1 hunks)
⏰ Context from checks skipped due to timeout of 90000ms (1)
  • GitHub Check: ci / test / Test
🔇 Additional comments (3)
hivemind_etl/mediawiki/workflows.py (3)

1-13: Imports look well-structured and follow best practices.

The code correctly uses the Temporal workflow's unsafe.imports_passed_through() context manager to handle imports for distributed execution. All necessary activities from the mediawiki module are properly imported.


15-27: Documentation is clear and comprehensive.

The workflow class is properly decorated with @workflow.defn and the method with @workflow.run. The docstring effectively explains the purpose and parameters of the run method.


51-55:

Details

❓ Verification inconclusive

Be cautious with memory consumption during transformation.

The documents variable might contain a large amount of data. Consider implementing streaming or batch processing patterns if memory consumption becomes an issue.


🏁 Script executed:

#!/bin/bash
# Check if there are any memory management patterns in the transform activity
rg -A 5 "transform_mediawiki_data" --glob "*.py" | grep -E "batch|stream|chunk|yield"

Length of output: 85


Below is a new shell script to gather more context about the transform_mediawiki_data function. The initial check returned no output, so this script will extract around 50 lines following the function’s definition and then search for keywords related to memory management (e.g. batch, stream, chunk, yield). Please run this command and share the output so we can verify if any memory optimization patterns have been implemented:

If the output confirms that no such memory management patterns are present, please review whether implementing streaming or batch processing is feasible to control memory usage when documents could be large.


🏁 Script executed:

#!/bin/bash
# Extract extended context around the transform_mediawiki_data function definition
rg -A 50 "def transform_mediawiki_data" --glob "*.py" > transform_context.txt && \
grep -E "batch|stream|chunk|yield" transform_context.txt

Length of output: 236


Memory Usage Concern — Manual Verification Recommended

Based on our verification attempts, there is no evidence of streaming, batching, chunking, or yielding patterns within the transform_mediawiki_data function. This suggests that the transformation currently loads all data into memory, which might lead to high memory consumption when processing large datasets. Please verify manually that this behavior is acceptable under expected workloads, and consider refactoring the transformation to support streaming or batch processing patterns to mitigate potential memory issues.
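If batching is adopted, a rough sketch of the shape it could take (the helpers page_to_document and load_documents are illustrative, not the project's actual functions): yield documents in fixed-size chunks so the full dump never has to sit in memory at once.

from typing import Iterator

def transform_in_batches(pages, batch_size: int = 500) -> Iterator[list]:
    batch: list = []
    for page in pages:
        batch.append(page_to_document(page))  # hypothetical per-page transform
        if len(batch) >= batch_size:
            yield batch
            batch = []
    if batch:
        yield batch

# caller side: load each chunk as it is produced instead of building one giant list
# for chunk in transform_in_batches(parsed_pages):
#     load_documents(chunk)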

Comment on lines +44 to +48
await workflow.execute_activity(
    extract_mediawiki,
    mediawiki_platform,
    start_to_close_timeout=timedelta(days=5),
)

🛠️ Refactor suggestion

Five-day timeout may require monitoring strategy.

The 5-day timeout for extract_mediawiki indicates this could be a very long-running operation. Consider:

  1. Implementing progress tracking/reporting (e.g. activity heartbeats; see the sketch below)
  2. Breaking this into smaller, checkpointed activities
  3. Adding monitoring to detect stalled extractions
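For point 1, one possible shape is activity heartbeating paired with a heartbeat_timeout, so stalled extractions are detected long before the 5-day deadline. This is only a sketch: the chunked loop and its helpers (iter_dump_chunks, process_chunk) are hypothetical, since the current crawler call is a single blocking operation.

from temporalio import activity

@activity.defn
async def extract_mediawiki(mediawiki_platform: dict) -> None:
    # hypothetical chunked extraction loop, only to show where heartbeats would go
    for i, chunk in enumerate(iter_dump_chunks(mediawiki_platform["base_url"])):
        activity.heartbeat(f"processed chunk {i}")  # lets the server see the activity is alive
        process_chunk(chunk)                        # hypothetical

# workflow side: pair the long start_to_close_timeout with a heartbeat_timeout
# await workflow.execute_activity(
#     extract_mediawiki,
#     mediawiki_platform,
#     start_to_close_timeout=timedelta(days=5),
#     heartbeat_timeout=timedelta(minutes=10),
# )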

@amindadgar amindadgar merged commit d9f1fd6 into main Apr 13, 2025
3 checks passed
@amindadgar amindadgar linked an issue Apr 14, 2025 that may be closed by this pull request


Development

Successfully merging this pull request may close these issues.

feat: mediaWiki ETL!

2 participants