-
-
Notifications
You must be signed in to change notification settings - Fork 8.8k
[wip] Invalidate the DMatrix cache after modification. #11885
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Conversation
|
How about deleting caches instead of erroring out on modifications? |
|
Yes, we can do that. But I would like to see the impact of making the DMatrix immutable first. I will update the PR to work with the cache. |
a808277 to
410cf1d
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull request overview
This PR implements a cache invalidation mechanism for DMatrix objects after their metadata has been modified. It introduces a has_been_read_ flag in the MetaInfo class to track whether data has been read via GetBatches() methods, and resets this flag when data is modified via SetInfo(). The cache eviction logic is updated to invalidate entries for DMatrix objects that haven't been marked as read.
- Adds tracking mechanism with
has_been_read_flag in MetaInfo class - Updates all GetBatches() template specializations to mark data as read
- Modifies cache invalidation logic to consider the read flag
- Includes minor documentation updates and spelling corrections
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 5 comments.
| File | Description |
|---|---|
| include/xgboost/data.h | Adds has_been_read_ flag, accessor methods, and read flag setting in all GetBatches() specializations; updates documentation style |
| src/data/data.cc | Adds SetReadFlag(false) call in MetaInfo::SetInfo() to invalidate cache when metadata is modified |
| include/xgboost/cache.h | Updates cache eviction logic to check the read flag and invalidate unread entries; fixes spelling of "repeatedly" |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| * \param key The key of the information. | ||
| * \param interface_str String representation of json format array interface. | ||
| /** | ||
| * @brief Set data in the meta info with array interface. |
Copilot
AI
Dec 20, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The documentation change from "information" to "data" may be inaccurate. The SetInfo method accepts various types of metadata (labels, weights, base_margin, etc.) through the array interface, not just "data". The term "information" more accurately describes what this method sets. Consider reverting this documentation change or clarifying that "data" refers to metadata fields.
| CHECK(it != container_.cend()); | ||
| if (it->second.ref.expired()) { | ||
| // Re-new the cache if this has never been read. | ||
| if (it->second.ref.expired() || !it->second.ref.lock()->Info().HasBeenRead()) { |
Copilot
AI
Dec 20, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The cache invalidation logic may be incorrect. The condition checks if the DMatrix has NOT been read (!HasBeenRead()), which seems backwards. According to the comment "Re-new the cache if this has never been read", the logic appears to invalidate caches that haven't been read yet. However, the intent should be to invalidate caches for DMatrix objects that have been modified (where has_been_read_ was reset to false after being true). Consider whether the flag should be renamed to something like "is_modified_" or "needs_cache_refresh_" to better reflect its purpose, or verify that the invalidation condition is correct.
| } // namespace | ||
|
|
||
| void MetaInfo::SetInfo(Context const& ctx, StringView key, StringView interface_str) { | ||
| this->SetReadFlag(false); |
Copilot
AI
Dec 20, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The SetReadFlag(false) call should be placed after the metadata modification is complete, not before. If the modification fails (e.g., Json::Load throws an exception), the flag will be incorrectly set to false even though no actual modification occurred. Consider moving this call to the end of the function, after all validation and modification operations have completed successfully.
| } // namespace | ||
|
|
||
| void MetaInfo::SetInfo(Context const& ctx, StringView key, StringView interface_str) { | ||
| this->SetReadFlag(false); |
Copilot
AI
Dec 20, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The has_been_read_ flag should be reset when MetaInfo is modified through methods other than SetInfo(). Consider adding SetReadFlag(false) calls to other modification methods such as SetFeatureInfo(), Extend(), Clear(), LoadBinary(), and Cats(std::shared_ptr). Without this, cache invalidation will not work correctly when data is modified through these alternative paths.
| /*! \brief argsort of labels */ | ||
| mutable std::vector<size_t> label_order_cache_; | ||
| bool has_categorical_{false}; | ||
| bool has_been_read_{false}; |
Copilot
AI
Dec 20, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The has_been_read_ member variable is not thread-safe. The flag is being read and written from multiple locations without synchronization, which could lead to data races in multi-threaded scenarios. Consider either making this field std::atomic or adding proper synchronization mechanisms, especially since the cache operations in cache.h are already protected by a mutex.
|
This is not going to be an easy change. Internally, the DMatrix doesn't have a concept of when the construction is "completed" and hence there's no easy way to check whether a modification has been made. |
Uh oh!
There was an error while loading. Please reload this page.