About the Role
We are looking for an experienced engineer to lead our large-scale data processing efforts. In this role, you will design, build, and maintain robust distributed systems that process terabytes of image and video data used to train state-of-the-art generative models.
Key Responsibilities
- Design, implement, and optimize complex data processing pipelines that ingest and transform large media datasets.
- Manage containerized applications on Kubernetes; deploy and scale distributed systems built on Ray to schedule tasks and orchestrate compute workloads.
- Implement and deploy state-of-the-art ML models for data cleaning, processing, and preparation.
- Ensure data quality, diversity, and proper annotation (including captioning) so datasets are ready for training.
- Work closely within the model development loop, updating data as the training trajectory requires.
Ideal Experience
- Deep understanding of Python and a variety of file systems for data-intensive manipulation and analysis.
- Demonstrable experience deploying, managing, and scaling containerized applications on Kubernetes clusters.
- Hands-on experience with distributed computing engines such as Ray, including task scheduling, fault tolerance, and resource management.
- Experience with image and video processing libraries (e.g., OpenCV, FFmpeg).
- Experience working with large image/video datasets, including efficient data handling, transformation, and feature extraction.
- Familiarity with data annotation and captioning processes for ML training datasets.