Skip to content

Conversation

@trivialfis
Copy link
Member

@trivialfis trivialfis commented Dec 18, 2025

  • high level test.
  • Make sure the cache is not trashed.

@david-cortes
Copy link
Contributor

How about deleting caches instead of erroring out on modifications?

@trivialfis
Copy link
Member Author

Yes, we can do that. But I would like to see the impact of making the DMatrix immutable first. I will update the PR to work with the cache.

@trivialfis trivialfis changed the title [wip] Test rejecting setter of DMatrix after begin used. [wip] Invalidate the DMatrix cache after modification. Dec 20, 2025
@trivialfis trivialfis requested a review from Copilot December 20, 2025 10:31
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR implements a cache invalidation mechanism for DMatrix objects after their metadata has been modified. It introduces a has_been_read_ flag in the MetaInfo class to track whether data has been read via GetBatches() methods, and resets this flag when data is modified via SetInfo(). The cache eviction logic is updated to invalidate entries for DMatrix objects that haven't been marked as read.

  • Adds tracking mechanism with has_been_read_ flag in MetaInfo class
  • Updates all GetBatches() template specializations to mark data as read
  • Modifies cache invalidation logic to consider the read flag
  • Includes minor documentation updates and spelling corrections

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 5 comments.

File Description
include/xgboost/data.h Adds has_been_read_ flag, accessor methods, and read flag setting in all GetBatches() specializations; updates documentation style
src/data/data.cc Adds SetReadFlag(false) call in MetaInfo::SetInfo() to invalidate cache when metadata is modified
include/xgboost/cache.h Updates cache eviction logic to check the read flag and invalidate unread entries; fixes spelling of "repeatedly"

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

* \param key The key of the information.
* \param interface_str String representation of json format array interface.
/**
* @brief Set data in the meta info with array interface.
Copy link

Copilot AI Dec 20, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The documentation change from "information" to "data" may be inaccurate. The SetInfo method accepts various types of metadata (labels, weights, base_margin, etc.) through the array interface, not just "data". The term "information" more accurately describes what this method sets. Consider reverting this documentation change or clarifying that "data" refers to metadata fields.

Copilot uses AI. Check for mistakes.
CHECK(it != container_.cend());
if (it->second.ref.expired()) {
// Re-new the cache if this has never been read.
if (it->second.ref.expired() || !it->second.ref.lock()->Info().HasBeenRead()) {
Copy link

Copilot AI Dec 20, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The cache invalidation logic may be incorrect. The condition checks if the DMatrix has NOT been read (!HasBeenRead()), which seems backwards. According to the comment "Re-new the cache if this has never been read", the logic appears to invalidate caches that haven't been read yet. However, the intent should be to invalidate caches for DMatrix objects that have been modified (where has_been_read_ was reset to false after being true). Consider whether the flag should be renamed to something like "is_modified_" or "needs_cache_refresh_" to better reflect its purpose, or verify that the invalidation condition is correct.

Copilot uses AI. Check for mistakes.
} // namespace

void MetaInfo::SetInfo(Context const& ctx, StringView key, StringView interface_str) {
this->SetReadFlag(false);
Copy link

Copilot AI Dec 20, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The SetReadFlag(false) call should be placed after the metadata modification is complete, not before. If the modification fails (e.g., Json::Load throws an exception), the flag will be incorrectly set to false even though no actual modification occurred. Consider moving this call to the end of the function, after all validation and modification operations have completed successfully.

Copilot uses AI. Check for mistakes.
} // namespace

void MetaInfo::SetInfo(Context const& ctx, StringView key, StringView interface_str) {
this->SetReadFlag(false);
Copy link

Copilot AI Dec 20, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The has_been_read_ flag should be reset when MetaInfo is modified through methods other than SetInfo(). Consider adding SetReadFlag(false) calls to other modification methods such as SetFeatureInfo(), Extend(), Clear(), LoadBinary(), and Cats(std::shared_ptr). Without this, cache invalidation will not work correctly when data is modified through these alternative paths.

Copilot uses AI. Check for mistakes.
/*! \brief argsort of labels */
mutable std::vector<size_t> label_order_cache_;
bool has_categorical_{false};
bool has_been_read_{false};
Copy link

Copilot AI Dec 20, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The has_been_read_ member variable is not thread-safe. The flag is being read and written from multiple locations without synchronization, which could lead to data races in multi-threaded scenarios. Consider either making this field std::atomic or adding proper synchronization mechanisms, especially since the cache operations in cache.h are already protected by a mutex.

Copilot uses AI. Check for mistakes.
@trivialfis
Copy link
Member Author

This is not going to be an easy change. Internally, the DMatrix doesn't have a concept of when the construction is "completed" and hence there's no easy way to check whether a modification has been made.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants