Job Description
Founded in 2012, H2O.ai is on a mission to democratize AI. As the world’s leading agentic AI company, H2O.ai converges Generative and Predictive AI to help enterprises and public sector agencies develop purpose-built GenAI applications on their private data. Its open-source technology is trusted by over 20,000 organizations worldwide, including more than half of the Fortune 500. H2O.ai powers AI transformation for companies like AT&T, Commonwealth Bank of Australia, Singtel, Chipotle, Workday, Progressive Insurance, and NIH.
H2O.ai partners include Dell Technologies, Deloitte, Ernst & Young (EY), NVIDIA, Snowflake, AWS, Google Cloud Platform (GCP) and VAST. H2O.ai’s AI for Good program supports nonprofit groups, foundations, and communities in advancing education, healthcare, and environmental conservation. With a vibrant community of 2 million data scientists worldwide, H2O.ai aims to co-create valuable AI applications for all users.
H2O.ai has raised $256 million from investors, including Commonwealth Bank, NVIDIA, Goldman Sachs, Wells Fargo, Capital One, Nexus Ventures and New York Life.
About This Opportunity
We are seeking a Senior Machine Learning Engineer with exceptional technical expertise in deploying, scaling, and maintaining production ML systems. This role requires a strong combination of software engineering skills, ML/AI knowledge, and system architecture experience to build robust, scalable machine learning infrastructure. The ideal candidate will have experience with end-to-end ML pipelines, modern MLOps practices, and the ability to bridge research and production environments.
What You Will Do
ML System Architecture & Development
- Design and implement end-to-end machine learning pipelines from research to production
- Build scalable ML infrastructure supporting multiple models and high-throughput inference
- Develop automated systems for model training, validation, deployment, and monitoring
- Create efficient data processing pipelines with multiprocessing optimization and performance tuning
- Architect feature stores, model registries, and ML metadata management systems
Production ML Operations
- Deploy and maintain production ML models with a focus on reliability, scalability, and performance
- Implement MLOps best practices including CI/CD for ML, automated testing, and model versioning
- Monitor model performance, data drift, and system health in production environments
- Optimize model inference for latency and throughput requirements
- Manage model lifecycle including retraining, rollback, and A/B testing strategies
Advanced ML Implementation
- Implement cutting-edge ML techniques including generative AI, diffusion models, and large language models
- Develop and optimize deep learning models using modern frameworks (TensorFlow, PyTorch)
- Build systems for handling multimodal data (text, images, video, time-series)
- Create solutions for challenging ML problems including out-of-distribution detection and feature alignment
- Implement efficient algorithms achieving significant performance improvements (orders of magnitude speedups)
Technical Leadership & Collaboration
- Lead technical design reviews and architecture decisions for ML systems
- Mentor junior engineers and data scientists on ML engineering best practices
- Collaborate with research teams to transition experimental models to production
- Work with infrastructure teams to ensure optimal resource utilization and scaling
- Provide technical guidance on complex ML system design and implementation
What We Are Looking For
Education & Experience
- Master's degree in Computer Science, Engineering, Physics, Mathematics, or related technical field
- 7+ years of experience in machine learning engineering, software development, or related roles
- 5+ years of experience building and deploying production ML systems
- Proven track record of leading technical projects and mentoring team members
Core Programming & ML
- Expert-level proficiency in Python with strong knowledge of Bash, SQL, C/C++
- Deep experience with ML frameworks: TensorFlow, PyTorch, Scikit-learn
- Extensive experience with data processing libraries: NumPy, Pandas, Matplotlib
- Hands-on experience with Hugging Face ecosystem and modern NLP/LLM tools
ML Ops & Infrastructure
- Strong experience with containerization and orchestration: Docker, Kubernetes
- Knowledge of cloud platforms: AWS, GCP, Azure and their ML services
- Experience with ML workflow orchestration tools: Airflow, Kubeflow, MLflow
- Proficiency in Infrastructure as Code: Terraform, CloudFormation
- Experience with monitoring and observability tools: Prometheus, Grafana, ELK stack
Advanced ML Technologies
- Proven expertise in generative AI including diffusion models, GANs, VAEs, and normalizing flows
- Experience with large language models (LLMs) and agentic AI systems
- Knowledge of advanced architectures: CNNs, U-Nets, transformers, and attention mechanisms
- Experience with model optimization techniques: quantization, pruning, distillation
- Understanding of distributed training and inference systems
Software Engineering
- Strong software development practices including version control, testing, and code review
- Experience with microservices architecture and API development
- Knowledge of database systems and data storage solutions
- Understanding of distributed systems and concurrent programming
- Experience with performance profiling and optimization
System Design & Architecture
- Experience designing large-scale ML systems and data pipelines
- Knowledge of real-time and batch processing architectures
- Understanding of model serving patterns and inference optimization
- Experience with auto-scaling and resource management in production environments
- Knowledge of security best practices for ML systems
Problem-Solving & Innovation
- Track record of solving complex technical problems with innovative engineering solutions
- Experience working with real-world, noisy datasets across multiple domains
- Ability to achieve significant performance improvements and system optimizations
- Strong debugging and troubleshooting skills for production ML systems
- Experience with A/B testing and experimentation frameworks
How to Stand Out From the Crowd
- PhD in Computer Science, Engineering, Physics, Mathematics, or related quantitative field
- Deep background in computational sciences (astrophysics, physics, computational biology)
- Experience in technology companies with large-scale ML infrastructure
- Knowledge of financial services, healthcare, or other regulated industries
- Background in research environments with transition to production systems
- Experience building and deploying LLM applications and chatbot systems
- Background in computer vision and image processing applications
- Knowledge of time-series analysis and forecasting systems
- Experience with automated content generation and summarization systems
- Understanding of federated learning and privacy-preserving ML techniques
Technical Specializations
- Experience with edge deployment and model optimization for mobile/IoT devices
- Knowledge of multi-cloud and hybrid cloud architectures
- Background in streaming data processing and real-time ML systems
- Experience with graph neural networks and knowledge graphs
- Understanding of reinforcement learning and multi-agent systems
Leadership & Communication
- Experience mentoring engineering teams and establishing technical standards
- Strong project management skills with experience in Agile/Scrum methodologies
- Ability to communicate complex technical concepts to diverse audiences
- Experience with technical writing and documentation
- Track record of driving technical innovation and process improvements
Success Metrics
- System uptime and reliability of production ML services
- Model performance and accuracy in production environments
- Deployment velocity and time-to-production for new models
- Resource utilization efficiency and cost optimization
- Team productivity and knowledge sharing initiatives
- Technical innovation and patent applications
Technical Environment
- Access to cutting-edge ML infrastructure and computing resources
- Opportunity to work with the latest ML frameworks and tools
- Collaborative environment with research and product teams
- Support for experimentation and technical innovation
- Flexible architecture allowing for rapid prototyping and iteration
Why H2O.ai?
- Market leader in total rewards
- Remote-friendly culture
- Flexible working environment
- Be part of a world-class team
- Career growth
H2O.ai is committed to creating a diverse and inclusive culture. All qualified applicants will receive consideration for employment without regard to their race, ethnicity, religion, gender, sexual orientation, age, disability status or any other legally protected basis.
H2O.ai is an innovative AI cloud platform company, leading the mission to democratize AI for everyone. Thousands of organizations from all over the world have used our cutting-edge technology across a variety of industries. We’ve made it easy for people at all levels to generate breakthrough solutions to complex business problems and advance the discovery of new ideas and revenue streams. We push the boundaries of what is possible with artificial intelligence.
H2O.ai employs the world’s top Kaggle Grandmasters, the community of best-in-the-world machine learning practitioners and data scientists. A strong AI for Good ethos and responsible AI drive the company’s purpose.
Please visit www.H2O.ai to learn more.