
ML Infrastructure Engineer / Startup / Digital Realities

San Francisco, CA, USA

Job Type: In Office

Workspace: San Francisco

About the Role

What you’ll do

  • Own the LLM serving stack end-to-end: inference frameworks, routing, batching, autoscaling.

  • Optimise latency, throughput, GPU utilisation and cost across the whole pipeline.

  • Work with GPU workloads (CUDA/Triton, kernels, profiling, memory optimisation).

  • Deploy and scale models using frameworks like TensorRT-LLM / Triton Inference Server / vLLM.

  • Run everything on Kubernetes or Ray with robust observability (metrics, logs, tracing, alerts).

  • Collaborate with core product/ML teams to turn research ideas into production-grade systems.

Requirements

What they’re looking for

  • Strong experience in ML infrastructure, distributed systems, or performance engineering.

  • Proficient in Python plus one of C++ / Rust / Go (or a similar systems language).

  • Hands-on with GPU-based training/serving – you’ve profiled and fixed real bottlenecks.

  • Comfortable with Kubernetes/Ray, CI/CD, and production on-call.

  • Bonus: experience with quantisation (AWQ/GPTQ/FP8) or PEFT (LoRA/DoRA), or scaling multi-node training/serving.

  • Happy working in person in a small, high-ownership team.

About the Company

We've partnered with a VC-backed AI startup building real-time, interactive products on top of large language models. They need an ML Infrastructure Engineer who can make model serving faster, cheaper, and more reliable – this is deep systems work, not just wiring up APIs.

Apply Now