Senior System Architect, Infrastructure Reliability

NVIDIA
Location 📍 Santa Clara, United States
Posted 📅 June 06, 2026
Work Type ⏰ Full-time

Position Overview

NVIDIA is seeking a Senior System Architect, Heterogeneous EDA Systems, to solve a complex challenge in accelerated computing: failure attribution at scale. As EDA workloads scale across thousands of heterogeneous nodes, a single failure can cause massive resource waste. We need an engineer to design and build an automated framework that ingests telemetry from CPU and GPU clusters, identifies the root cause of job failures in real time, and distinguishes between hardware faults, infrastructure instability, and software defects.

What you'll be doing:
+ Architect Failure Attribution Frameworks: Build a scalable flight recorder for EDA jobs that captures high-fidelity state across the CPU, GPU, and Fabric at the moment of failure.
+ Build automated diagnostics that correlate GPU XID errors, PCIe bus failures, and CUDA memory exceptions. Connect these errors with system-level events such as OOM kills or NUMA-related hangs.
+ Distr...

Job Details

Employment Type: Full-time
Category: other-general
Work Arrangement: On-site
Location: Santa Clara, United States