
Senior Site Reliability Engineer - AI Infrastructure

Build multi-region GPU cluster architecture for large-scale AI training

Design and operate GPU infrastructure for AI training and inference, focusing on performance optimization, reliability, and cost efficiency.

Why It's Interesting

Direct access to the founders and real impact from day one

Required Skills

SRE · GPU · AI Infrastructure · Networking · Cloud Computing · Troubleshooting

Keywords

AI SRE · GPU Cluster · InfiniBand · RoCE · NVLink · AI Training · Large-Scale Infrastructure

Original description from RemoteOK

Senior Site Reliability Engineer - AI Infrastructure
Location: Global Remote / San Francisco · Full-Time

About Andromeda

Andromeda Cluster was founded by Nat Friedman and Daniel Gross to give early-stage startups access to the kind of scaled AI infrastructure once reserved only for hyperscalers. We began with a single managed cluster — but it filled almost instantly. Since then, we've been quietly building the systems, network, and orchestration layer that makes the world's AI infrastructure more accessible. Today, Andromeda works with leading AI labs, data centers, and cloud providers to deliver compute when and where it's needed most. Our platform routes training and inference jobs across global supply, unlocking flexibility and efficiency in one of the fastest-growing markets on earth. Our long-term vision is to build the liquidity layer for global AI compute — a marketplace that moves the infrastructure and workloads powering AGI, not dissimilar to the flows of capital in the world's financial markets. We are expanding to new frontiers to find the brightest people working in AI infrastructure, research, and engineering.

The Role

This is not a generalist SRE role. You will design, operate, and debug large-scale GPU infrastructure used for distributed training and inference, working directly with customers pushing the limits of modern AI systems. We're looking for engineers who have personally run GPU clusters in production, understand the failure modes of distributed training, and can reason about performance from network fabric → kernel → framework.

What You'll Own

GPU Cluster Architecture: Design and evolve multi-provider, multi-region GPU compute clusters optimized for large-scale training. Make topology-aware scheduling, networking, and storage decisions that directly impact training throughput and cost efficiency.

Customer Technical Partnership: Serve as the primary technical point of contact for customers running large-scale training workloads. Onboard, troubleshoot, and optimize, often in real time.

Reliability & Performance Engineering: Define SLOs and error budgets that account for the unique failure modes of GPU infrastructure (ECC errors, NVLink degradation, NCCL timeouts). Own capacity planning across heterogeneous GPU fleets optimized for training throughput.

Networking & Fabric Health: Ensure the health and performance of the high-speed interconnects (InfiniBand, RoCE, NVLink) that underpin distributed training. Diagnose and resolve fabric-level issues that degrade collective operations.

Observability: Build deep visibility into GPU utilization, memory pressure, interconnect throughput, training job performance, and hardware health. Go well beyond standard infrastructure metrics.

Automation & Tooling: Build production-grade automation for cluster provisioning, GPU health checks, job scheduling, self-healing, and firmware/driver lifecycle management.

Incident Leadership: Lead incident response for complex, multi-layer failures spanning hardware, networking, orchestration, and ML frameworks. Drive blameless postmortems and systemic fixes.

What We're Looking For

GPU Systems Expertise: Deep, hands-on experience operating large-scale GPU clusters (NVIDIA A100/H100/B200 or equivalent). You understand GPU memory hierarchies, ECC behavior, thermal throttling, and hardware failure modes from direct experience, not documentation.

High-Performance Networking: Production experience with InfiniBand, RoCE, or NVLink fabrics in the context of distributed training. You can diagnose why an all-reduce is slow, identify a degraded link in a fat-tree topology, and reason about congestion control at scale.

Distributed Training & ML Frameworks: Working knowledge of how large training jobs actually run — NCCL, CUDA, PyTorch distributed, DeepSpeed, Megatron, FSDP, or similar. You don't need to write the models, but you need to understand what's happening at the systems level when a 1,000-GPU training run stalls.

Linux &
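The error-budget framing in the listing reduces to simple arithmetic: the budget is the fraction of the window the SLO permits you to be unavailable. A minimal sketch, assuming a 30-day window and an illustrative 99.5% availability target (the function name and target are assumptions for illustration, not figures from the listing):

```python
def error_budget_minutes(slo_target: float, window_minutes: int = 30 * 24 * 60) -> float:
    """Minutes of allowed unavailability in the window for a given SLO target."""
    return (1.0 - slo_target) * window_minutes

# A 99.5% monthly availability SLO leaves roughly 216 minutes (3.6 hours) of budget.
print(round(error_budget_minutes(0.995), 1))  # 216.0
```

In a GPU fleet the interesting part is deciding what counts against the budget — e.g. whether an NCCL timeout that forces a job restart burns availability for the whole cluster or only for the affected nodes.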
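The GPU health-check automation mentioned above can be sketched as a small parser over `nvidia-smi` query output. This is a hypothetical helper, not Andromeda's tooling; the thresholds are illustrative assumptions, and the input is presumed to match the CSV shape of `nvidia-smi --query-gpu=index,temperature.gpu,ecc.errors.uncorrected.volatile.total --format=csv,noheader`:

```python
import csv
import io

# Illustrative thresholds, not vendor guidance.
MAX_TEMP_C = 85       # sustained temperature above this suggests thermal throttling
MAX_ECC_ERRORS = 0    # any uncorrected ECC error warrants draining the node

def flag_unhealthy_gpus(nvidia_smi_csv: str) -> list[int]:
    """Return indices of GPUs that a health check should flag for draining."""
    unhealthy = []
    for row in csv.reader(io.StringIO(nvidia_smi_csv.strip())):
        index, temp_c, ecc = (field.strip() for field in row)
        if int(temp_c) > MAX_TEMP_C or int(ecc) > MAX_ECC_ERRORS:
            unhealthy.append(int(index))
    return unhealthy

sample = """\
0, 61, 0
1, 92, 0
2, 70, 3
"""
print(flag_unhealthy_gpus(sample))  # → [1, 2]
```

In production such a check would typically run per node and feed a scheduler's cordon/drain workflow rather than just printing indices.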

The source site may be blocked by Indonesian ISPs

Some Indonesian ISPs (Telkomsel, IndiHome) block RemoteOK. If the Apply button does not open, switch networks or enable a VPN, then click Apply again.




Source
RemoteOK
Job Type
full time
Location
Regional Remote · Remote
Category
Design
Level
senior
Posted
29 Apr 2026