How to fine tune the model

#19
by al376646 - opened

Hi, I need to adapt the model to detect food-related objects. Is it possible to continue training from the pretrained model, and if so, how? I would also like to know how my dataset has to be labeled so that it can be fed to the model. Thanks.

Just use an MLP classifier. I use YOLO v.whatever.

Hi there, here are some useful resources on how to fine-tune DETR:
https://huggingface.co/docs/transformers/main/en/model_doc/detr#resources

Is it possible to continue training from the pretrained model, and if so, how?

To fine-tune this model, I personally used the Jupyter Notebook "How to Train DETR with 🤗 Transformers on a Custom Dataset" as a guide.
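As a rough sketch of what "training over the pretrained model" usually looks like with 🤗 Transformers: the pretrained weights are loaded, and only the classification head is re-initialized for the new label set. The checkpoint name `facebook/detr-resnet-50`, the food class names, and both helper functions below are assumptions made for illustration; they are not taken from the notebook.

```python
def build_label_maps(categories):
    """Turn COCO-style category dicts into id2label / label2id mappings."""
    id2label = {int(c["id"]): c["name"] for c in categories}
    label2id = {name: idx for idx, name in id2label.items()}
    return id2label, label2id


def load_detr_for_finetuning(id2label, label2id, checkpoint="facebook/detr-resnet-50"):
    """Load a pretrained DETR and resize its head for a custom label set.

    The checkpoint name is an assumption; substitute the model you start from.
    """
    # Imported lazily so the label helper above works without transformers.
    from transformers import DetrForObjectDetection

    return DetrForObjectDetection.from_pretrained(
        checkpoint,
        num_labels=len(id2label),
        id2label=id2label,
        label2id=label2id,
        # The pretrained head was sized for a different label set, so its
        # weights cannot be reused; this flag re-initializes them instead
        # of raising a shape-mismatch error.
        ignore_mismatched_sizes=True,
    )


# Hypothetical food classes, for illustration only.
food_categories = [{"id": 0, "name": "pizza"}, {"id": 1, "name": "salad"}]
id2label, label2id = build_label_maps(food_categories)
print(id2label)
```

Calling `load_detr_for_finetuning(id2label, label2id)` downloads the checkpoint, so it is left out of the snippet itself. The returned model can then be trained with whichever loop you prefer.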

I would also like to know how my dataset has to be labeled so that it can be fed to the model.

There are many ways to label your dataset. The approach I've taken is to use Label Studio, an open-source solution for labeling data collaboratively. You can export the labels in whatever format suits you. COCO works best for the Notebook I've shared.
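For reference, here is a minimal sketch of what a COCO detection annotation file contains, plus a small sanity check. The file name, the class name, and the `check_coco` helper are made up for illustration. The structure itself (`images`, `annotations`, `categories`, and `bbox` given as `[x_min, y_min, width, height]` in pixels) follows the COCO detection format.

```python
import json

# A tiny COCO-style dataset: one 512x512 image, one class, one box.
coco = {
    "images": [
        {"id": 1, "file_name": "meal_0001.jpg", "width": 512, "height": 512}
    ],
    "categories": [{"id": 0, "name": "pizza"}],
    "annotations": [
        {
            "id": 1,
            "image_id": 1,
            "category_id": 0,
            "bbox": [100.0, 120.0, 200.0, 150.0],  # x_min, y_min, width, height
            "area": 200.0 * 150.0,
            "iscrowd": 0,
        }
    ],
}


def check_coco(dataset):
    """Return a list of human-readable problems found in the annotations."""
    problems = []
    images = {img["id"]: img for img in dataset["images"]}
    category_ids = {cat["id"] for cat in dataset["categories"]}
    for ann in dataset["annotations"]:
        img = images.get(ann["image_id"])
        if img is None:
            problems.append(f"annotation {ann['id']}: unknown image_id")
            continue
        if ann["category_id"] not in category_ids:
            problems.append(f"annotation {ann['id']}: unknown category_id")
        x, y, w, h = ann["bbox"]
        if w <= 0 or h <= 0:
            problems.append(f"annotation {ann['id']}: empty box")
        if x < 0 or y < 0 or x + w > img["width"] or y + h > img["height"]:
            problems.append(f"annotation {ann['id']}: box outside image")
    return problems


print(check_coco(coco))
print(json.dumps(coco["annotations"][0]))
```

Running a check like this on the exported file before training can catch boxes that fall outside the image or reference a missing category.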

Thanks for providing the Jupyter Notebook link. It is nice to run code in the cloud, but if the dataset is too large, Google doesn't allow long training runs.
Is it possible to just copy and paste all the code to a local PC and train the model according to my own needs?

Is it possible to just copy and paste all the code to a local PC and train the model according to my own needs?

Yup! That's what I did.

I'm trying to fine-tune with my local COCO dataset.
It has only one class, and the image size is 512x512.
The same notebook as above gives this error:

RuntimeError: view size is not compatible with input tensor's size and stride (at least one dimension spans across two contiguous subspaces). Use .reshape(...) instead.

Hi @martiannomad! Can you provide a full traceback and a minimal example? Also, try making your image contiguous; that might help (I can't say more without additional information).
https://pytorch.org/docs/stable/generated/torch.Tensor.contiguous.html#torch.Tensor.contiguous
https://numpy.org/doc/stable/reference/generated/numpy.ascontiguousarray.html
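To illustrate what the error message means, here is a minimal sketch that reproduces it on a small transposed tensor and shows the two usual workarounds. It only demonstrates the mechanism. In the thread the error is raised during the backward pass inside library code, so this is not a fix for that case.

```python
import torch

x = torch.arange(6).reshape(2, 3)
y = x.t()  # transpose: same storage, different strides, so not contiguous

try:
    y.view(6)  # view() needs a memory layout compatible with the new shape
    error_message = None
except RuntimeError as err:
    error_message = str(err)

# Either copy into contiguous memory first, or let reshape() copy if needed.
flat_copy = y.contiguous().view(6)
flat_reshape = y.reshape(6)

print(y.is_contiguous(), error_message)
print(flat_copy.tolist(), flat_reshape.tolist())
```

For NumPy arrays coming out of an image pipeline, `numpy.ascontiguousarray` plays the same role as `.contiguous()` does here.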

@qubvel-hf

Here is the complete traceback. I only used the same notebook as above. The images are 512 x 512 like this one, with only one class. Sure, I'll look into these images, but if you think of something while looking at the traceback, please share.

image.png

Line of Code

from pytorch_lightning import Trainer

%cd {HOME}

# settings
MAX_EPOCHS = 1

# pytorch_lightning < 2.0.0
# trainer = Trainer(gpus=1, max_epochs=MAX_EPOCHS, gradient_clip_val=0.1, accumulate_grad_batches=8, log_every_n_steps=5)

# pytorch_lightning >= 2.0.0
#
trainer = Trainer(devices=1, accelerator="mps", max_epochs=MAX_EPOCHS, gradient_clip_val=0.1, accumulate_grad_batches=8, log_every_n_steps=5)

trainer.fit(model)
: UserWarning: This is now an optional IPython functionality, setting dhist requires you to install the `pickleshare` library.
  self.shell.db['dhist'] = compress_dhist(dhist)[-100:]
GPU available: True (mps), used: True
TPU available: False, using: 0 TPU cores
HPU available: False, using: 0 HPUs
/defect_detection

  | Name  | Type                   | Params | Mode
--------------------------------------------------------
0 | model | DetrForObjectDetection | 41.5 M | eval
--------------------------------------------------------
41.3 M    Trainable params
222 K     Non-trainable params
41.5 M    Total params
166.007   Total estimated model params size (MB)
0         Modules in train mode
399       Modules in eval mode




{
    "name": "RuntimeError",
    "message": "view size is not compatible with input tensor's size and stride (at least one dimension spans across two contiguous subspaces). Use .reshape(...) instead.",
    "stack": "---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
Cell In[26], line 15
      8 # pytorch_lightning < 2.0.0
      9 # trainer = Trainer(gpus=1, max_epochs=MAX_EPOCHS, gradient_clip_val=0.1, accumulate_grad_batches=8, log_every_n_steps=5)
     10 
     11 # pytorch_lightning >= 2.0.0
     12 #
     13 trainer = Trainer(devices=1, accelerator=\"mps\", max_epochs=MAX_EPOCHS, gradient_clip_val=0.1, accumulate_grad_batches=8, log_every_n_steps=5)
---> 15 trainer.fit(model)

File /.env/lib/python3.11/site-packages/pytorch_lightning/trainer/trainer.py:538, in Trainer.fit(self, model, train_dataloaders, val_dataloaders, datamodule, ckpt_path)
    536 self.state.status = TrainerStatus.RUNNING
    537 self.training = True
--> 538 call._call_and_handle_interrupt(
    539     self, self._fit_impl, model, train_dataloaders, val_dataloaders, datamodule, ckpt_path
    540 )

File /.env/lib/python3.11/site-packages/pytorch_lightning/trainer/call.py:47, in _call_and_handle_interrupt(trainer, trainer_fn, *args, **kwargs)
     45     if trainer.strategy.launcher is not None:
     46         return trainer.strategy.launcher.launch(trainer_fn, *args, trainer=trainer, **kwargs)
---> 47     return trainer_fn(*args, **kwargs)
     49 except _TunerExitException:
     50     _call_teardown_hook(trainer)

File /.env/lib/python3.11/site-packages/pytorch_lightning/trainer/trainer.py:574, in Trainer._fit_impl(self, model, train_dataloaders, val_dataloaders, datamodule, ckpt_path)
    567 assert self.state.fn is not None
    568 ckpt_path = self._checkpoint_connector._select_ckpt_path(
    569     self.state.fn,
    570     ckpt_path,
    571     model_provided=True,
    572     model_connected=self.lightning_module is not None,
    573 )
--> 574 self._run(model, ckpt_path=ckpt_path)
    576 assert self.state.stopped
    577 self.training = False

File /.env/lib/python3.11/site-packages/pytorch_lightning/trainer/trainer.py:981, in Trainer._run(self, model, ckpt_path)
    976 self._signal_connector.register_signal_handlers()
    978 # ----------------------------
    979 # RUN THE TRAINER
    980 # ----------------------------
--> 981 results = self._run_stage()
    983 # ----------------------------
    984 # POST-Training CLEAN UP
    985 # ----------------------------
    986 log.debug(f\"{self.__class__.__name__}: trainer tearing down\")

File /.env/lib/python3.11/site-packages/pytorch_lightning/trainer/trainer.py:1025, in Trainer._run_stage(self)
   1023         self._run_sanity_check()
   1024     with torch.autograd.set_detect_anomaly(self._detect_anomaly):
-> 1025         self.fit_loop.run()
   1026     return None
   1027 raise RuntimeError(f\"Unexpected state {self.state}\")

File /.env/lib/python3.11/site-packages/pytorch_lightning/loops/fit_loop.py:205, in _FitLoop.run(self)
    203 try:
    204     self.on_advance_start()
--> 205     self.advance()
    206     self.on_advance_end()
    207     self._restarting = False

File /.env/lib/python3.11/site-packages/pytorch_lightning/loops/fit_loop.py:363, in _FitLoop.advance(self)
    361 with self.trainer.profiler.profile(\"run_training_epoch\"):
    362     assert self._data_fetcher is not None
--> 363     self.epoch_loop.run(self._data_fetcher)

File /.env/lib/python3.11/site-packages/pytorch_lightning/loops/training_epoch_loop.py:140, in _TrainingEpochLoop.run(self, data_fetcher)
    138 while not self.done:
    139     try:
--> 140         self.advance(data_fetcher)
    141         self.on_advance_end(data_fetcher)
    142         self._restarting = False

File /.env/lib/python3.11/site-packages/pytorch_lightning/loops/training_epoch_loop.py:250, in _TrainingEpochLoop.advance(self, data_fetcher)
    247 with trainer.profiler.profile(\"run_training_batch\"):
    248     if trainer.lightning_module.automatic_optimization:
    249         # in automatic optimization, there can only be one optimizer
--> 250         batch_output = self.automatic_optimization.run(trainer.optimizers[0], batch_idx, kwargs)
    251     else:
    252         batch_output = self.manual_optimization.run(kwargs)

File /.env/lib/python3.11/site-packages/pytorch_lightning/loops/optimization/automatic.py:183, in _AutomaticOptimization.run(self, optimizer, batch_idx, kwargs)
    172 if (
    173     # when the strategy handles accumulation, we want to always call the optimizer step
    174     not self.trainer.strategy.handles_gradient_accumulation and self.trainer.fit_loop._should_accumulate()
   (...)
    180     # -------------------
    181     # automatic_optimization=True: perform ddp sync only when performing optimizer_step
    182     with _block_parallel_sync_behavior(self.trainer.strategy, block=True):
--> 183         closure()
    185 # ------------------------------
    186 # BACKWARD PASS
    187 # ------------------------------
    188 # gradient update with accumulated gradients
    189 else:
    190     self._optimizer_step(batch_idx, closure)

File /.env/lib/python3.11/site-packages/pytorch_lightning/loops/optimization/automatic.py:144, in Closure.__call__(self, *args, **kwargs)
    142 @override
    143 def __call__(self, *args: Any, **kwargs: Any) -> Optional[Tensor]:
--> 144     self._result = self.closure(*args, **kwargs)
    145     return self._result.loss

File /.env/lib/python3.11/site-packages/torch/utils/_contextlib.py:116, in context_decorator.<locals>.decorate_context(*args, **kwargs)
    113 @functools.wraps(func)
    114 def decorate_context(*args, **kwargs):
    115     with ctx_factory():
--> 116         return func(*args, **kwargs)

File /.env/lib/python3.11/site-packages/pytorch_lightning/loops/optimization/automatic.py:138, in Closure.closure(self, *args, **kwargs)
    135     self._zero_grad_fn()
    137 if self._backward_fn is not None and step_output.closure_loss is not None:
--> 138     self._backward_fn(step_output.closure_loss)
    140 return step_output

File /.env/lib/python3.11/site-packages/pytorch_lightning/loops/optimization/automatic.py:239, in _AutomaticOptimization._make_backward_fn.<locals>.backward_fn(loss)
    238 def backward_fn(loss: Tensor) -> None:
--> 239     call._call_strategy_hook(self.trainer, \"backward\", loss, optimizer)

File /.env/lib/python3.11/site-packages/pytorch_lightning/trainer/call.py:319, in _call_strategy_hook(trainer, hook_name, *args, **kwargs)
    316     return None
    318 with trainer.profiler.profile(f\"[Strategy]{trainer.strategy.__class__.__name__}.{hook_name}\"):
--> 319     output = fn(*args, **kwargs)
    321 # restore current_fx when nested context
    322 pl_module._current_fx_name = prev_fx_name

File /.env/lib/python3.11/site-packages/pytorch_lightning/strategies/strategy.py:212, in Strategy.backward(self, closure_loss, optimizer, *args, **kwargs)
    209 assert self.lightning_module is not None
    210 closure_loss = self.precision_plugin.pre_backward(closure_loss, self.lightning_module)
--> 212 self.precision_plugin.backward(closure_loss, self.lightning_module, optimizer, *args, **kwargs)
    214 closure_loss = self.precision_plugin.post_backward(closure_loss, self.lightning_module)
    215 self.post_backward(closure_loss)

File /.env/lib/python3.11/site-packages/pytorch_lightning/plugins/precision/precision.py:72, in Precision.backward(self, tensor, model, optimizer, *args, **kwargs)
     52 @override
     53 def backward(  # type: ignore[override]
     54     self,
   (...)
     59     **kwargs: Any,
     60 ) -> None:
     61     r\"\"\"Performs the actual backpropagation.
     62 
     63     Args:
   (...)
     70 
     71     \"\"\"
---> 72     model.backward(tensor, *args, **kwargs)

File /.env/lib/python3.11/site-packages/pytorch_lightning/core/module.py:1101, in LightningModule.backward(self, loss, *args, **kwargs)
   1099     self._fabric.backward(loss, *args, **kwargs)
   1100 else:
-> 1101     loss.backward(*args, **kwargs)

File /.env/lib/python3.11/site-packages/torch/_tensor.py:581, in Tensor.backward(self, gradient, retain_graph, create_graph, inputs)
    571 if has_torch_function_unary(self):
    572     return handle_torch_function(
    573         Tensor.backward,
    574         (self,),
   (...)
    579         inputs=inputs,
    580     )
--> 581 torch.autograd.backward(
    582     self, gradient, retain_graph, create_graph, inputs=inputs
    583 )

File /.env/lib/python3.11/site-packages/torch/autograd/__init__.py:347, in backward(tensors, grad_tensors, retain_graph, create_graph, grad_variables, inputs)
    342     retain_graph = create_graph
    344 # The reason we repeat the same comment below is that
    345 # some Python versions print out the first line of a multi-line function
    346 # calls in the traceback and some print out the last line
--> 347 _engine_run_backward(
    348     tensors,
    349     grad_tensors_,
    350     retain_graph,
    351     create_graph,
    352     inputs,
    353     allow_unreachable=True,
    354     accumulate_grad=True,
    355 )

File /.env/lib/python3.11/site-packages/torch/autograd/graph.py:825, in _engine_run_backward(t_outputs, *args, **kwargs)
    823     unregister_hooks = _register_logging_hooks_on_whole_graph(t_outputs)
    824 try:
--> 825     return Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
    826         t_outputs, *args, **kwargs
    827     )  # Calls into the C++ engine to run the backward pass
    828 finally:
    829     if attach_logging_hooks:

RuntimeError: view size is not compatible with input tensor's size and stride (at least one dimension spans across two contiguous subspaces). Use .reshape(...) instead."
}

The traceback is not that useful; I can't identify the cause from it. Let me know if you find the reason. Did you try other models, e.g. RT-DETR? Or other transformers / lightning versions?
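When comparing library versions, it helps to post the exact ones installed. A minimal sketch using only the standard library follows; the package list is an assumption based on the imports that appear in the traceback, and `installed_versions` is a hypothetical helper name.

```python
from importlib import metadata


def installed_versions(packages):
    """Map each distribution name to its installed version, or None if absent."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


# Distribution names as published on PyPI (assumed relevant to this thread).
report = installed_versions(["torch", "transformers", "pytorch-lightning"])
for name, version in report.items():
    print(f"{name}: {version or 'not installed'}")
```

Pasting this output next to the traceback makes it easier for others to reproduce the problem with the same versions.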

@qubvel-hf

I tried YOLO with a YOLO-format dataset and a YOLO OBB dataset, and YOLO works fine.
The data was labeled in Label Studio with orientation and exported from there in COCO format. The annotations also work perfectly in the other notebook cells, which suggests the dataset and annotations are fully aligned. What are your thoughts?

Also, no, I have not tried RT-DETR. Which models would you recommend trying other than YOLO? Any material or code that already does the fine-tuning, so that I don't have to write it myself, would be helpful.

The image you shared, is it for car damage detection?

@Mayank2024
Yes, it's car damage indeed :)
