Lead Machine Learning Engineer (MLOps, KServe + building Kubernetes Clusters, PyTorch, TensorFlow on AWS)
Job Description
Capital One is looking for a Lead Machine Learning Engineer in New York City to spearhead MLOps initiatives, oversee KServe-powered pipelines, and design scalable Kubernetes-based environments for PyTorch and TensorFlow workloads on AWS. The role sits at the intersection of advanced modeling and production engineering, collaborating closely with Product and Data Science teams to translate analytics into reliable, scalable solutions.
Responsibilities
- Design, build, and deliver ML models and components that address real business needs, partnering with Product and Data Science colleagues.
- Guide ML infrastructure decisions by applying modeling insights, including model selection, data and feature choices, training workflows, hyperparameter tuning, dimensionality, bias/variance considerations, and validation.
- Tackle complex problems through coding, model development, validation, and automation of tests and deployment processes.
- Work within a cross-functional Agile team to create and enhance software for cutting-edge big data and ML applications.
- Retrain, maintain, and monitor models in production to ensure ongoing performance.
- Use or build cloud-native architectures and platforms to deliver optimized ML models at scale.
- Construct efficient data pipelines that feed ML models with quality data.
- Apply continuous integration and continuous deployment best practices, including test automation and monitoring, to ensure successful deployment of models and code.
- Ensure code quality and governance, and adhere to Responsible and Explainable AI practices.
- Program primarily in Python, Scala, or Java.
Requirements
- Bachelor’s degree
- At least 6 years of experience designing and building data-intensive solutions using distributed computing (Internship experience does not apply)
- At least 4 years of experience programming with Python, Scala, or Java
- At least 2 years of experience building, scaling, and optimizing ML systems
Technologies
- Python
- Scala
- Java
- PyTorch
- TensorFlow
- scikit-learn
- Dask
- Spark
- Kubernetes
- KServe
- AWS
- Azure
- Google Cloud Platform
Benefits
- Health benefits
- Financial benefits
- Incentives (cash bonuses and/or long-term incentives)
Preferred Qualifications
- Master’s or doctoral degree in computer science, electrical engineering, mathematics, or a related field
- 3+ years of experience building production-ready data pipelines that feed ML models
- 3+ years of on-the-job experience with an industry-recognized ML framework such as scikit-learn, PyTorch, Dask, Spark, or TensorFlow
- 2+ years of experience developing performant, resilient, and maintainable code
- 2+ years of experience gathering and preparing data for ML models
- 2+ years of people leadership experience
- 1+ years leading teams developing ML solutions using industry best practices, patterns, and automation
- Experience deploying ML solutions in public cloud environments like AWS, Azure, or Google Cloud
- Experience designing and scaling complex data pipelines for ML models and evaluating their performance
- Notable ML industry impact through conference talks, papers, open source contributions, or patents