Team Overview: The OCI Cluster Networking team is at the forefront of building ultra-high-performance networking solutions to support advanced AI/ML/HPC workloads. This is your chance to join the AI revolution by designing scalable systems that support thousands of GPUs without compromising on performance.
Role Summary: As a Senior Principal Member of Technical Staff, you'll be part of a dynamic team responsible for designing, developing, and optimizing a software and hardware stack capable of running distributed AI/ML/HPC workloads across thousands of GPUs. You will work with cutting-edge libraries like NCCL, leverage high-performance networking, and build innovative, scalable solutions for our customers.
Who You Are: We're looking for adaptable, self-motivated engineers who can learn quickly. You are a solid developer and distributed systems generalist who can work across the stack, from low-level systems to high-level distributed system interactions. You value simplicity and scalability, and you thrive in a collaborative, agile environment.
Career Level: IC5
Key Responsibilities:
• Design and develop scalable, high-performance software and hardware solutions for distributed AI/ML/HPC workloads.
• Tune the performance of networking libraries (e.g., NCCL) and integrate them with our distributed systems.
• Collaborate with cross-functional teams on new initiatives and deliver innovative solutions to complex networking challenges.
Basic Qualifications:
• 10+ years of software development experience in systems or application-level engineering
• 2+ years of experience with collective communication libraries (e.g., NCCL, RCCL, MPI) and GPU frameworks (e.g., CUDA, ROCm)
• 2+ years of experience with ML training frameworks (e.g., PyTorch, TensorFlow)
• Proficiency in at least two of the following programming languages: Go, Java, C/C++, Python
• Strong knowledge of data structures, algorithms, and operating systems
• Excellent communication skills, both verbal and written
• Bachelor's degree in Computer Science, Engineering, or a related field
Preferred Qualifications:
• Master's degree in Computer Science or a related field
• Experience with RDMA programming, including GPUDirect RDMA
• Experience with distributed workload managers (e.g., Kubernetes)
• Proficiency with Linux performance tools
• Familiarity with SDN, NFV, and cloud networking
• Experience with Infrastructure-as-a-Service platforms (e.g., AWS, Azure, GCP)