GPU and CPU idle during training
Question


by
DamienL38
Created on 2022-09-20 09:12:05 (edited on 2024-09-04 13:04:56) in AI and Machine Learning OVHcloud

Hello,
I've been using AI Training for several large training runs, and every time the training takes abnormally long (and is billed accordingly). A run that should take no more than 5 days takes 30 days.
The Job Monitoring interface shows that both CPU and GPU are idle most of the time.
I use a custom image based on python:3.9 (which is based on Debian buster).
The data is located in mounted object storage (RW:cache). Note that the problem was the same without the cache. Outputs are written to the same storage.
I suspect an IO or journaling problem, or a problem related to the use/sync of object storage, but I cannot investigate it because IO monitoring is not available.
Am I doing something wrong?


1 Reply (latest reply on 2022-09-26 09:54:25 by ^FabL)

Hello @DamienL38,

If the problem is still occurring, please provide more details and/or describe the tests you have run so that the community can give you feedback.
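
As an illustration of the kind of test that could help here, the IO hypothesis can be checked from inside the job by timing sequential reads from the mounted volume. This is only a minimal sketch using the Python standard library: the function name `measure_read_throughput` and the mount path `/workspace/data` used below are assumptions for illustration, not something taken from the original post or from the AI Training documentation.

```python
import os
import time


def measure_read_throughput(root, max_bytes=1 << 30, chunk_size=1 << 20):
    """Sequentially read files under `root` and report read speed.

    Stops once roughly `max_bytes` have been read.
    Returns (bytes_read, elapsed_seconds, mib_per_second).
    """
    total = 0
    start = time.perf_counter()
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in sorted(filenames):
            path = os.path.join(dirpath, name)
            with open(path, "rb") as f:
                # Read in fixed-size chunks so large files do not fill memory.
                while total < max_bytes:
                    chunk = f.read(chunk_size)
                    if not chunk:
                        break
                    total += len(chunk)
            if total >= max_bytes:
                break
        if total >= max_bytes:
            break
    elapsed = time.perf_counter() - start
    mib_per_s = (total / (1 << 20)) / elapsed if elapsed > 0 else float("inf")
    return total, elapsed, mib_per_s
```

Calling `measure_read_throughput("/workspace/data")` (hypothetical path) at the start of a job and comparing the result with the same call on a local scratch directory would give a rough idea of whether reads from the mounted storage are the bottleneck. Files read a second time may be served from the operating system's page cache, so the first measurement on fresh data is the meaningful one.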

^FabL
