Hello,
I've been using AI Training for several large training jobs, and every time the training time is abnormally long (and billed accordingly). A training run that should take no longer than 5 days takes 30 days.
The Job Monitoring interface shows that both CPU and GPU are idle most of the time.
I use a custom image based on python:3.9 (which is based on Debian buster).
The data is located in a mounted object storage container with RW:cache (note that the problem was the same without the cache). Outputs are stored in the same storage.
I suspect an I/O or journaling problem, or a problem related to the use/sync of object storage, but I cannot investigate it because I/O monitoring is not available.
Am I doing something wrong?
GPU and CPU idle during training
Hello @DamienL38,
If the problem is still occurring, I invite you to provide more details and/or the tests you have carried out so that the community can get back to you.
^FabL
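Since I/O monitoring is not available in the interface, one test that could be reported back is a raw read-throughput measurement taken from inside the job. The following is only a minimal sketch, not an official diagnostic: the function name `measure_read_throughput` and the mount path `/workspace/data` are hypothetical, and it only measures sequential reads through the mounted volume.

```python
import os
import time


def measure_read_throughput(root, max_bytes=1 << 30, chunk_size=1 << 20):
    """Sequentially read files under `root` until about `max_bytes` are read.

    Returns (bytes_read, elapsed_seconds, mib_per_second).
    """
    total = 0
    start = time.perf_counter()
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if total >= max_bytes:
                break
            path = os.path.join(dirpath, name)
            with open(path, "rb") as f:
                # Read in fixed-size chunks so large files do not fill memory.
                while total < max_bytes:
                    chunk = f.read(chunk_size)
                    if not chunk:
                        break
                    total += len(chunk)
        if total >= max_bytes:
            break
    elapsed = time.perf_counter() - start
    rate = total / (1024 * 1024) / elapsed if elapsed > 0 else float("inf")
    return total, elapsed, rate


if __name__ == "__main__":
    # Hypothetical mount point: replace with the path where the
    # object storage container is mounted in the job.
    n_bytes, seconds, mib_s = measure_read_throughput("/workspace/data")
    print(f"read {n_bytes} bytes in {seconds:.2f} s ({mib_s:.1f} MiB/s)")
```

If the measured rate is far below what the training loop needs per step, that would support the I/O hypothesis. Running it twice may also help, since a second pass can be served from a cache and look much faster than the first.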