Studied CS & Mathematics at UC Santa Cruz. Led the Audio team at Hive AI, where we trained and deployed 500M-parameter-scale models to production.
Co-founder and CTO at Trainy, building a training platform to make deep learning go faster. Previously a lead Machine Learning Engineer for Hive AI's object detection products. I completed my Physics Ph.D. at UC Berkeley ('22), where my thesis focused on applying computer vision and deep learning to nanoscience. Physics & Computer Science B.A., UC Berkeley '17.
tl;dr We help ML engineers training large models isolate performance bottlenecks. Trainy summarizes profiling information from large distributed training runs so that you know exactly what is limiting your training speed.
Hello everyone, we are Andrew and Roanak, ML engineers from the Bay Area. We’ve experienced firsthand the challenges of distributed training and of getting the most out of your compute. That’s why we are building tools that help ML engineers training large models optimize training speed and take the guesswork out of estimating the time and cost of training.
Distributed training has enabled the training of ever-growing generative AI models. However, gains in speed are not simply proportional to the number of GPUs recruited, and they can quickly hit diminishing returns depending on the infrastructure and the model being trained. Existing tools mainly focus on profiling a few GPUs, but they become unwieldy the moment you scale past about 10 GPUs.
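To make the "not simply proportional" point concrete, here is a back-of-the-envelope sketch with illustrative numbers (not measurements from any real cluster): if every step pays a roughly fixed cost for gradient synchronization and stragglers, the per-step speedup from adding GPUs flattens out quickly.

```python
# Illustrative scaling model (assumed numbers, not benchmarks): a fixed global
# batch split across GPUs, plus a fixed per-step communication overhead.
compute_single_gpu = 1.00    # seconds of compute per step on one GPU
comm_overhead_per_step = 0.15  # seconds lost per step to all-reduce / stragglers

for n_gpus in (1, 8, 64, 256):
    ideal = compute_single_gpu / n_gpus            # perfect linear scaling
    step_time = ideal + comm_overhead_per_step     # what you actually pay
    efficiency = ideal / step_time                 # fraction of linear speedup kept
    print(f"{n_gpus:4d} GPUs: step {step_time:.3f}s, scaling efficiency {efficiency:.0%}")
```

With these assumed numbers, 8 GPUs retain under half of the ideal speedup, and 256 GPUs spend almost the entire step waiting on communication; the exact numbers depend entirely on your model and network.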
We created a dashboard to display timing information across many GPUs. It’s a graph interface built on top of TensorBoard and the PyTorch Profiler, existing tools already familiar to ML engineers.
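For context, this is roughly how a per-rank PyTorch Profiler trace that TensorBoard can display gets captured. The model, step counts, and the "./traces" output directory below are illustrative assumptions, not our dashboard's required setup.

```python
# Minimal sketch: capture one PyTorch Profiler trace per rank for TensorBoard.
# Directory name, model, and step counts are illustrative only.
import os
import torch
import torch.nn as nn
from torch.profiler import (
    profile, schedule, tensorboard_trace_handler, ProfilerActivity
)

# In a real multi-GPU job the rank comes from torch.distributed; fall back to
# the RANK env var so this snippet also runs standalone.
rank = int(os.environ.get("RANK", 0))
device = "cuda" if torch.cuda.is_available() else "cpu"

activities = [ProfilerActivity.CPU]
if torch.cuda.is_available():
    activities.append(ProfilerActivity.CUDA)

model = nn.Sequential(nn.Linear(1024, 1024), nn.ReLU(), nn.Linear(1024, 10)).to(device)
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

with profile(
    activities=activities,
    # Skip a step, warm up, then record a few steps to keep traces small.
    schedule=schedule(wait=1, warmup=1, active=3, repeat=1),
    # One trace file per rank so traces from many GPUs can be compared.
    on_trace_ready=tensorboard_trace_handler("./traces", worker_name=f"rank{rank}"),
    record_shapes=True,
    with_stack=True,
) as prof:
    for step in range(6):
        x = torch.randn(64, 1024, device=device)
        y = torch.randint(0, 10, (64,), device=device)
        loss = loss_fn(model(x), y)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        prof.step()  # advance the profiler schedule each iteration
```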
We summarize profiling information over many GPUs in a few key views that help ML developers identify where in the ensemble of GPUs there are inefficiencies. Statistics on computation, communication, and memory operations across many GPUs surface straggling GPUs. Distributed training can only proceed as fast as the slowest GPU participating, so by isolating slow outliers, an ML developer can zoom in on the operations running on the straggler and make code optimizations that balance timings across GPUs and reduce idle resources.
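As a sketch of the idea (not our actual implementation), straggler detection can be as simple as comparing each rank's average step time against the group median and flagging the outliers; the tolerance below is an assumed threshold.

```python
# Illustrative straggler detection: flag ranks whose mean step time exceeds
# the median across ranks by a chosen tolerance. Numbers below are made up.
from statistics import median

def find_stragglers(step_times_by_rank, tolerance=1.10):
    """step_times_by_rank: {rank: [step_time_seconds, ...]}.
    Returns ranks whose mean step time is > tolerance * median of rank means."""
    means = {r: sum(t) / len(t) for r, t in step_times_by_rank.items()}
    baseline = median(means.values())
    return sorted(r for r, m in means.items() if m > baseline * tolerance)

# Example: rank 2 is ~25% slower than its peers, so it gates every step.
timings = {
    0: [0.41, 0.40, 0.42],
    1: [0.40, 0.41, 0.40],
    2: [0.52, 0.51, 0.53],
    3: [0.41, 0.40, 0.41],
}
print(find_stragglers(timings))  # -> [2]
```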
We’d love for you to try out our profiler if you are training your own models. https://github.com/Trainy-ai/nodify
Also join our community if you want to stay up to date with the development of our tools. https://discord.com/invite/d67CMuKY5V