Position Overview
Description
As a Neuron Collectives Software Developer, you will:
* Enhance collective algorithms and topologies for optimal training performance
* Use tools like Neuron Explorer to identify bottlenecks in compute and bus bandwidth utilization
* Monitor and analyze processor, DMA, firmware, and workload metrics
* Optimize collective operations to scale AI compute across the data center through low-level device driver development
* Work closely with the hardware team to co-optimize software and Trainium silicon
* Develop and optimize C/C++ implementations of collective communication patterns
* Investigate and implement improvements for specific training topologies used by modern LLMs
* Build and maintain analysis frameworks and automation solutions
The role offers opportunities to work on cutting-edge AI training hardware while contributing to one of Amazon's most critical initiatives.
A day in the life
Annapurna Labs, a crucial part...