Job Description
Are you ready to define the technological landscape of 2026? Nebula Horizon is seeking a visionary Senior AI Infrastructure Engineer to build the robust, scalable systems that will power the next generation of artificial intelligence. In this role, you will not just maintain systems; you will architect the future.
We are on a mission to revolutionize how AI models are trained and deployed. As part of our 2026 strategic roadmap, you will lead the charge in optimizing our global GPU clusters and designing resilient distributed computing environments. If you are passionate about cutting-edge technology and want to leave a lasting impact on the industry, we want to hear from you.
Responsibilities
- Architect 2026-Ready AI Pipelines: Design and implement high-throughput, low-latency infrastructure for machine learning workloads that scales with future growth.
- Optimize GPU Clusters: Manage and optimize large-scale GPU clusters (NVIDIA/AWS) to maximize training efficiency and reduce operational costs.
- Implement MLOps: Build and maintain CI/CD pipelines for model deployment, automating testing, validation, and rollout processes.
- Ensure Data Sovereignty: Enforce strict security protocols and compliance standards to protect sensitive training data and proprietary algorithms.
- Collaborate with Researchers: Partner with data scientists and ML researchers to translate theoretical models into production-ready infrastructure.
- Maintain System Reliability: Proactively monitor system health, troubleshoot complex issues, and implement disaster recovery strategies.
Qualifications
- Experience: 5+ years in software engineering, including at least 2 years focused on AI/ML infrastructure.
- Programming: Proficiency in Python, Go, or Rust, with deep knowledge of Kubernetes and containerization technologies.
- Cloud Expertise: Strong experience with AWS, Azure, or Google Cloud Platform, particularly with GPU instances and serverless architectures.
- Distributed Systems: Deep understanding of distributed systems principles, networking, and database management (SQL/NoSQL).
- Problem Solving: Exceptional ability to debug complex, multi-layered infrastructure problems under pressure.
- Education: Bachelor’s degree in Computer Science, Engineering, or a related technical field (Master’s preferred).