Position Overview
Job Description
Insight Global is looking for a senior-level Linux systems administrator with strong HPC (High Performance Computing) skills. This individual will be responsible for all L1/L2 operational support for HPC operations for Colorado Springs. They will support HPC business users at the OS level, troubleshoot issues related to HPC hardware, and monitor SLURM and overall HPC cluster health (nodes, storage, interconnect, schedulers). They will handle all operational issues through SNOW, respond to alerts from monitoring tools (Nagios, Prometheus, Grafana), restart failed services and jobs where procedures exist, and coordinate with the hardware vendor on hardware-related issues.
Day-to-day activities:
1. Monitor SNOW tickets and perform basic triage
2. Perform storage checks (quota and utilization)
3. Monitor SLURM (queue, job state) and escalate to the next level where necessary
4. Attend regular standup meetings
5. Ana...