Conversation
Pull request overview
This PR prepares the codebase for the v1.0 release by stabilizing training for v9-t models and fixing several bugs. The changes focus on implementing gradient accumulation with warmup, fixing a widespread typo, and enhancing distributed training support.
Key changes include:
- Implementation of a new GradientAccumulation callback to dynamically adjust gradient accumulation during training warmup
- Updates to the EMA callback to properly account for gradient accumulation steps
- Correction of the "quite" → "quiet" typo throughout configuration and documentation
Reviewed changes
Copilot reviewed 11 out of 11 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| yolo/utils/model_utils.py | Added GradientAccumulation callback class for dynamic gradient accumulation during warmup; updated EMA callback to work with gradient accumulation by tracking batch steps |
| yolo/utils/logging_utils.py | Integrated GradientAccumulation into setup; corrected "quite" typo; changed batch progress description; removed clear_live() call |
| yolo/lazy.py | Added configurable accelerator and device settings; enabled sync_batchnorm for distributed training; corrected "quite" typo |
| yolo/config/task/train.yaml | Added equivalent_batch_size parameter to support gradient accumulation configuration |
| yolo/config/general.yaml | Changed device configuration from fixed 0 to auto and added accelerator setting |
| yolo/config/config.py | Added equivalent_batch_size field to DataConfig and accelerator field to Config (see the sketch below the table) |
| tests/test_utils/test_bounding_box_utils.py | Removed .to(device) call from model creation in test (model doesn't need to be on device for this test) |
| tests/conftest.py | Removed equivalent_batch_size from test configuration to prevent GradientAccumulation callback activation during tests |
| docs/HOWTO.md | Corrected "quite" → "quiet" in example command |
| README.md | Corrected "quite" → "quiet" in documentation; removed WIP and PWC badges |
| .github/workflows/deploy.yaml | Corrected "quite" → "quiet" in workflow command |
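To make the new configuration surface concrete, here is a minimal sketch of the two fields named above; the dataclass layout, types, and defaults are assumptions for illustration rather than the actual contents of yolo/config/config.py.

```python
# Illustrative sketch only; field types and defaults are assumptions, not the
# actual yolo/config/config.py definitions.
from dataclasses import dataclass
from typing import Optional


@dataclass
class DataConfig:
    batch_size: int = 16
    # Target "virtual" batch size. The GradientAccumulation callback derives its
    # accumulation steps from equivalent_batch_size / (batch_size * world_size);
    # leaving it unset (as the test config now does) skips the callback.
    equivalent_batch_size: Optional[int] = None


@dataclass
class Config:
    # "auto" lets the Lightning Trainer pick the accelerator, matching the
    # general.yaml change from a fixed device id to auto.
    accelerator: str = "auto"
```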
```python
class GradientAccumulation(Callback):
    def __init__(self, data_cfg: DataConfig, scheduler_cfg: SchedulerConfig):
        super().__init__()
        self.equivalent_batch_size = data_cfg.equivalent_batch_size
        self.actual_batch_size = data_cfg.batch_size
        self.warmup_epochs = getattr(scheduler_cfg.warmup, "epochs", 0)
        self.current_batch = 0
        self.max_accumulation = 1
        self.warmup_batches = 0
        logger.info(":arrows_counterclockwise: Enable Gradient Accumulation")
```
The GradientAccumulation class lacks a docstring explaining its purpose, parameters, and behavior. Consider adding documentation to describe how this callback manages gradient accumulation during training warmup.
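One way such a docstring might read, sketched here as a suggestion rather than code that exists in the PR; it would sit directly under the class definition:

```python
"""Dynamically adjust ``trainer.accumulate_grad_batches`` during training warmup.

The target number of accumulation steps is derived from
``equivalent_batch_size / (batch_size * world_size)`` and is ramped linearly
from 1 over the warmup epochs defined by the scheduler configuration.

Args:
    data_cfg: Provides ``batch_size`` and ``equivalent_batch_size``.
    scheduler_cfg: Its ``warmup.epochs`` determines the length of the ramp.
"""
```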
```python
class GradientAccumulation(Callback):
    def __init__(self, data_cfg: DataConfig, scheduler_cfg: SchedulerConfig):
        super().__init__()
        self.equivalent_batch_size = data_cfg.equivalent_batch_size
        self.actual_batch_size = data_cfg.batch_size
        self.warmup_epochs = getattr(scheduler_cfg.warmup, "epochs", 0)
        self.current_batch = 0
        self.max_accumulation = 1
        self.warmup_batches = 0
        logger.info(":arrows_counterclockwise: Enable Gradient Accumulation")

    def setup(self, trainer: "Trainer", pl_module: "LightningModule", stage: str) -> None:
        effective_batch_size = self.actual_batch_size * trainer.world_size
        self.max_accumulation = max(1, round(self.equivalent_batch_size / effective_batch_size))
        batches_per_epoch = int(len(pl_module.train_loader) / trainer.world_size)
        self.warmup_batches = int(self.warmup_epochs * batches_per_epoch)

    def on_train_epoch_start(self, trainer: "Trainer", pl_module: "LightningModule") -> None:
        self.current_batch = trainer.global_step

    def on_train_batch_start(self, trainer: "Trainer", pl_module: "LightningModule", *args, **kwargs) -> None:
        if self.current_batch < self.warmup_batches:
            current_accumulation = round(lerp(1, self.max_accumulation, self.current_batch, self.warmup_batches))
        else:
            current_accumulation = self.max_accumulation
        trainer.accumulate_grad_batches = current_accumulation

    def on_train_batch_end(self, trainer: "Trainer", pl_module: "LightningModule", *args, **kwargs) -> None:
        self.current_batch += 1
```
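As a rough worked example (the numbers are chosen for illustration, not taken from the PR): with equivalent_batch_size=64, batch_size=16 and a single GPU, setup() gives max_accumulation = round(64 / 16) = 4, and on_train_batch_start ramps accumulate_grad_batches linearly from 1 up to 4 over the warmup batches, after which it stays at 4.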
The new GradientAccumulation callback class lacks test coverage. Consider adding tests to verify the gradient accumulation logic, warmup behavior, and integration with the trainer's accumulate_grad_batches setting.
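A minimal sketch of what such a test could look like. It uses SimpleNamespace stand-ins instead of the project's real fixtures, and the start-of-warmup assertion assumes that lerp(a, b, step, total) interpolates linearly from a to b, so treat it as illustrative rather than a drop-in test:

```python
# Illustrative test sketch; the SimpleNamespace stand-ins and the lerp
# assumption noted below are not part of the repository.
from types import SimpleNamespace

from yolo.utils.model_utils import GradientAccumulation


def test_gradient_accumulation_warmup_ramp():
    data_cfg = SimpleNamespace(batch_size=16, equivalent_batch_size=64)
    scheduler_cfg = SimpleNamespace(warmup=SimpleNamespace(epochs=1))
    callback = GradientAccumulation(data_cfg, scheduler_cfg)

    trainer = SimpleNamespace(world_size=1, global_step=0, accumulate_grad_batches=1)
    pl_module = SimpleNamespace(train_loader=range(100))  # 100 batches per epoch

    callback.setup(trainer, pl_module, stage="fit")
    assert callback.max_accumulation == 4   # round(64 / 16)
    assert callback.warmup_batches == 100   # 1 warmup epoch * 100 batches

    callback.on_train_epoch_start(trainer, pl_module)
    callback.on_train_batch_start(trainer, pl_module)
    # Assumes lerp(1, 4, 0, 100) == 1 at the very start of warmup.
    assert trainer.accumulate_grad_batches == 1

    callback.current_batch = 200            # simulate being past warmup
    callback.on_train_batch_start(trainer, pl_module)
    assert trainer.accumulate_grad_batches == 4
```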
@copilot open a new pull request to apply changes based on this feedback
@henrytsui000 I've opened a new pull request, #226, to work on those changes. Once the pull request is ready, I'll request review from you.
* [Fix] typo of quiet (from quite)
* 🐛 [Add] Gradient acc for stable single gpu
* 🐛 [Fix] for setting multigpu, enable pick cuda id
* 💚 [Fix] bug for pytest for inference
* 📝 [Update] Docs, remove broken badge
Stabilize training for v9-t and fix some previous bugs.