Crusoe is on a mission to accelerate the abundance of energy and intelligence. As the only vertically integrated AI infrastructure company built from the ground up, we own and operate each layer of the stack — from electrons to tokens — to power the world's most ambitious AI workloads. When you join Crusoe, you join a team that is building the future, faster.

We're in the midst of the greatest industrial revolution of our time. The demand for AI compute is boundless, and power is a bottleneck. We're solving that — with an energy-first approach that makes AI infrastructure better for the world and faster for the people innovating with AI.

We're looking for problem-solving, opportunity-finding teammates with a sense of urgency, who believe in the scale of our ambition and thrive on a path not fully paved — people who want to grow their careers alongside a team of experts across energy, manufacturing, data center construction, and cloud services.

If you want to do the most meaningful work of your career, help our customers and partners advance their AI strategies, and be part of a high-performing team that believes in each other, come build with us at Crusoe.

Senior Virtualization Validation Engineer

San Francisco, Sunnyvale (Onsite)

Role Mission

At Crusoe, we are pioneering the future of sustainable computing. As a Senior Virtualization Validation Engineer, you will be responsible for the end-to-end validation of large-scale, multi-node GPU clusters. You will focus on high-performance GPU virtualization using QEMU and Cloud Hypervisor, ensuring that distributed workloads scale efficiently across multiple virtualized nodes. Your role is critical in validating the interconnect fabric and collective communication libraries (NCCL/RCCL) that power the world’s most demanding AI and HPC applications.

What You’ll Be Working On:

Multi-Node Scaling Validation: Design and execute large-scale validation tests across multi-node virtualized clusters to ensure linear scaling and stability of GPU workloads.
Interconnect & Fabric Testing: Validate high-speed interconnects—including NVLink, Infinity Fabric, InfiniBand, and RoCE—within virtualized environments to ensure low-latency, high-bandwidth communication.
Hypervisor & GPU Virtualization: Lead the validation of QEMU and Cloud Hypervisor with a focus on PCIe passthrough (VFIO), IOMMU, and direct device assignment for GPUs and high-speed NICs.
Collective Communication Benchmarking: Architect and run comprehensive test suites using nccl-tests and rccl-tests (e.g., AllReduce, AllGather) to verify performance across node boundaries.
Network Stack Validation: Validate SR-IOV and RDMA configurations to ensure that virtualized guests achieve near-bare-metal networking performance for distributed GPU tasks.
Automated Cluster Orchestration: Develop and maintain automation frameworks in Python or Go to dynamically provision, configure, and stress-test multi-node virtualized environments.
Performance Bottleneck Analysis: Perform deep-dive analysis of performance regressions in multi-node communication, identifying root causes across the guest OS, hypervisor, and physical fabric.

What You’ll Bring to the Team:

Education & Experience: 2-5+ years of experience with a demonstrated ability to perform the role's responsibilities competently and independently, plus a Bachelor’s or Master’s degree in Computer Science, Electrical Engineering, or a related technical field.
Virtualization Expertise: Proven experience with QEMU/KVM and Cloud Hypervisor in a production or research environment.
Distributed GPU Ecosystems: Deep familiarity with NVIDIA (CUDA/NCCL) and/or AMD (ROCm/RCCL) stacks in a multi-node context.
Networking Knowledge: Strong understanding of RDMA, RoCE, and InfiniBand protocols and their implementation in virtualized systems.
System Internals: Expert-level knowledge of Linux kernel internals, specifically PCIe topology, VFIO, and memory management (HugePages, IOMMU).
Automation & Scripting: Advanced proficiency in Python and/or Bash for automating complex cluster-wide test scenarios.
Experience with MNNVL (Multi-Node NVLink) or specialized AI fabric architectures.
Familiarity with hardware-level debugging tools and performance profilers (e.g., NVIDIA Nsight, AMD Omniperf).
Knowledge of containerized orchestration for GPUs (e.g., Kubernetes with specialized device plugins).

Compensation:

Compensation will be paid in the range of $172,500 - $210,000. Restricted Stock Units are included in all offers. Compensation will be determined by the applicant’s education, experience, knowledge, skills, and abilities, as well as internal equity and alignment with market data.

Crusoe is an Equal Opportunity Employer. Employment decisions are made without regard to race, color, religion, disability, genetic information, pregnancy, citizenship, marital status, sex/gender, sexual preference/orientation, gender identity, age, veteran status, national origin, or any other status protected by law or regulation.
