Machine Learning Infrastructure Engineer

Job Overview

Location: New York City, New York
Job Type: Full Time
Job ID: 122909
Date Posted: 6 months ago
Recruiter: Dennis Ruth

Job Description

We’ll trust you to:

  •  Work with data engineers and ML experts across the company to understand their workflows and requirements, and use that input to shape the next set of platform features
  •  Provide GPU management solutions that improve distributed training performance and resource efficiency
  •  Improve the distributed training user experience across mainstream and internal training frameworks
  •  Design a seamless workflow from model training to model inference
  •  Troubleshoot and debug user issues
  •  Provide operational and user-facing documentation
  •  Provide performance analysis and capacity planning for clusters

What we are looking for:

  •  A strong sense of curiosity for solving new problems and learning new technologies
  •  A passion for providing reliable and scalable infrastructure
  •  Experience building and scaling Docker-based systems using Kubernetes, Swarm or Mesos
  •  Experience with open source ML infrastructure projects such as Kubeflow, Triton, MLflow, or Feast
  •  Experience with mainstream machine learning frameworks such as PyTorch or TensorFlow
  •  Experience with distributed systems, e.g., Kubernetes, Kafka, ZooKeeper, Spark
  •  Proficiency in two or more languages (Go, Python, C++, Java, or JavaScript) and willingness to learn more as needed
  •  At least 2 years of experience as a software engineer

Nice to haves:

  •  Experience working with authentication and authorization systems such as SPIFFE and SPIRE
  •  Experience with data encryption
  •  Experience working with GPU compute software and hardware
  •  Ability to identify and perform OS and hardware-level optimizations
  •  Open source involvement such as a well-curated blog, accepted contributions, or community presence
  •  Experience with cloud providers such as AWS, GCP or Azure
  •  Experience with configuration management systems (Chef, Puppet, Ansible, or Salt)
  •  Experience with continuous integration tools and technologies (Jenkins, Git, ChatOps)
  •  Passion for education, e.g., providing workshops for tenants
