Studied CS & Mathematics at UC Santa Cruz. Led the Audio team at Hive AI, where we trained and deployed 500M-parameter-scale models to production.
Co-founder and CTO at Trainy, building a training platform to make deep learning go faster. Previously a lead Machine Learning Engineer for Hive AI's object detection products. I completed my Physics Ph.D. at UC Berkeley ('22), where my thesis focused on applying computer vision and deep learning to nanoscience. Physics & Computer Science B.A., UC Berkeley '17.
tl;dr We help ML engineers training large models isolate performance bottlenecks. Trainy summarizes profiling information from large distributed training runs so that you know exactly what is limiting your training speed.
Hello everyone, we are Andrew and Roanak, ML engineers from the Bay Area. We’ve experienced firsthand the challenges of distributed training and of getting the most out of your compute. That’s why we are building tools that help ML engineers training large models optimize training speed and take the guesswork out of estimating the time and cost of training.
Distributed training has enabled the training of ever-growing generative AI models. However, gains in speed are not simply proportional to the number of GPUs recruited, and they can quickly hit diminishing returns depending on the infrastructure and the model being trained. Existing tools mainly focus on profiling a few GPUs, but they become unwieldy the moment you scale past about 10 GPUs.
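To make the "not simply proportional" point concrete, here is a back-of-the-envelope sketch with illustrative numbers (not measurements from any real cluster): if every step pays a roughly fixed cost for gradient synchronization and stragglers, the per-step speedup from adding GPUs flattens out quickly.

```python
# Illustrative scaling model (assumed numbers, not benchmarks): a fixed global
# batch split across GPUs, plus a fixed per-step communication overhead.
compute_single_gpu = 1.00    # seconds of compute per step on one GPU
comm_overhead_per_step = 0.15  # seconds lost per step to all-reduce / stragglers

for n_gpus in (1, 8, 64, 256):
    ideal = compute_single_gpu / n_gpus            # perfect linear scaling
    step_time = ideal + comm_overhead_per_step     # what you actually pay
    efficiency = ideal / step_time                 # fraction of linear speedup kept
    print(f"{n_gpus:4d} GPUs: step {step_time:.3f}s, scaling efficiency {efficiency:.0%}")
```

With these assumed numbers, 8 GPUs retain under half of the ideal speedup, and 256 GPUs spend almost the entire step waiting on communication; the exact numbers depend entirely on your model and network.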
We created a dashboard to display timing information across many GPUs. It’s a graph interface built on top of TensorBoard and the PyTorch Profiler, existing tools already familiar to ML engineers.
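For context, this is roughly how a per-rank PyTorch Profiler trace that TensorBoard can display gets captured. The model, step counts, and the "./traces" output directory below are illustrative assumptions, not our dashboard's required setup.

```python
# Minimal sketch: capture one PyTorch Profiler trace per rank for TensorBoard.
# Directory name, model, and step counts are illustrative only.
import os
import torch
import torch.nn as nn
from torch.profiler import (
    profile, schedule, tensorboard_trace_handler, ProfilerActivity
)

# In a real multi-GPU job the rank comes from torch.distributed; fall back to
# the RANK env var so this snippet also runs standalone.
rank = int(os.environ.get("RANK", 0))
device = "cuda" if torch.cuda.is_available() else "cpu"

activities = [ProfilerActivity.CPU]
if torch.cuda.is_available():
    activities.append(ProfilerActivity.CUDA)

model = nn.Sequential(nn.Linear(1024, 1024), nn.ReLU(), nn.Linear(1024, 10)).to(device)
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

with profile(
    activities=activities,
    # Skip a step, warm up, then record a few steps to keep traces small.
    schedule=schedule(wait=1, warmup=1, active=3, repeat=1),
    # One trace file per rank so traces from many GPUs can be compared.
    on_trace_ready=tensorboard_trace_handler("./traces", worker_name=f"rank{rank}"),
    record_shapes=True,
    with_stack=True,
) as prof:
    for step in range(6):
        x = torch.randn(64, 1024, device=device)
        y = torch.randint(0, 10, (64,), device=device)
        loss = loss_fn(model(x), y)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        prof.step()  # advance the profiler schedule each iteration
```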
We summarize profiling information over many GPUs in a few key views that help ML developers identify where in the ensemble of GPUs there are inefficiencies. Statistics on computation, communication, and memory operations across many GPUs surface straggling GPUs. Distributed training can only proceed as fast as the slowest GPU participating, so by isolating slow outliers, an ML developer can zoom in on the operations running on the straggler and make code optimizations that balance timings across GPUs and reduce idle resources.
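As a sketch of the idea (not our actual implementation), straggler detection can be as simple as comparing each rank's average step time against the group median and flagging the outliers; the tolerance below is an assumed threshold.

```python
# Illustrative straggler detection: flag ranks whose mean step time exceeds
# the median across ranks by a chosen tolerance. Numbers below are made up.
from statistics import median

def find_stragglers(step_times_by_rank, tolerance=1.10):
    """step_times_by_rank: {rank: [step_time_seconds, ...]}.
    Returns ranks whose mean step time is > tolerance * median of rank means."""
    means = {r: sum(t) / len(t) for r, t in step_times_by_rank.items()}
    baseline = median(means.values())
    return sorted(r for r, m in means.items() if m > baseline * tolerance)

# Example: rank 2 is ~25% slower than its peers, so it gates every step.
timings = {
    0: [0.41, 0.40, 0.42],
    1: [0.40, 0.41, 0.40],
    2: [0.52, 0.51, 0.53],
    3: [0.41, 0.40, 0.41],
}
print(find_stragglers(timings))  # -> [2]
```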
We’d love for you to try out our profiler if you are training your own models. https://github.com/Trainy-ai/nodify
Also join our community if you want to stay up to date with the development of our tools. https://discord.com/invite/d67CMuKY5V