Careers - MLSys 2026

KERNEL ENGINEER

MakerMaker.AI · In Person · United States

Location: San Francisco · On-site

ABOUT THE COMPANY

We're building autonomous research agents for recursive self-improvement (multi-agent systems that propose, run, and analyze machine learning experiments). We're a small team based in San Francisco, working on-site.

ABOUT THE ROLE

You'll write and optimize the GPU kernels and supporting systems software that make our training and inference workloads fast. This is deep, low-level work (performance counters, memory bandwidth, warp-level scheduling) applied to the specific shapes and patterns our models actually use.

We hire kernel engineers because the gap between "this works" and "this is fast on the hardware we have" is enormous, and that gap directly bounds what our researchers can try. You'll close that gap.

WHAT YOU'LL DO

Write and optimize GPU kernels (CUDA, ROCm, Triton, or similar) for training and inference workloads: attention variants, MoE layers, custom activations, communication primitives
Profile real workloads with hardware counters and translate findings into specific kernel-level optimizations
Co-design kernels with the research teams: when the kernel and the algorithm need to change together, you participate in both
Integrate optimized kernels into our training and serving stacks; benchmark before and after; verify the win is real end-to-end
Maintain kernel quality over time as hardware, frameworks, and workloads shift underneath
Spread kernel-level fluency across the team; we want this expertise shared, not siloed

WHAT WE'RE LOOKING FOR

4+ years writing performant GPU kernels (CUDA, ROCm, Triton, or production-grade equivalent)
Hardware-level fluency: memory hierarchy, occupancy, register pressure, tensor cores, warp scheduling
Profiling fluency (Nsight, ncu, or comparable tools) and the discipline to measure before changing
Track record of shipping kernel-level optimizations that moved a measurable metric in a real system
Strong systems expertise: you understand how kernels live inside larger frameworks and how integration choices affect end-to-end performance
Comfortable reading framework-level Python and C++ around your kernels

NICE TO HAVE

Open-source contributions to kernel libraries, compilers, or ML frameworks
Experience with multiple accelerator architectures (different GPU families, TPUs, custom ASICs), preferably AMD GPUs
Familiarity with collective communication primitives (NCCL or equivalent)
Compiler or runtime background

THIS ROLE IS PROBABLY NOT FOR YOU IF

You haven't gotten your hands dirty at the kernel level: this isn't a higher-level systems role rebranded
You want to stay narrowly in one library; we expect breadth across the kernel surface our models actually use
Performance work without measurable end-to-end impact frustrates you

P.S. We’re also hosting a small private dinner during MLSys for people interested in agents, recursive self-improvement, and AI infrastructure. Apply to join us here: https://luma.com/u6yt1gri

Performance Engineer

PDT Partners · United States

We’re looking for an exceptional Performance Engineer to join our growing technology organization. Interviewing at PDT is intentionally focused on finding great people who can build long-term, impactful careers with us.

Performance Engineers at PDT are responsible for deeply understanding and optimizing the systems that enable our trading strategies at scale. You will work at the intersection of software, systems, and hardware to analyze performance, drive infrastructure efficiency, and free up critical compute capacity. Your work directly amplifies researcher velocity and scales our core models, creating massive impact through both cost savings and accelerated innovation. You'll thrive at PDT if you love open-ended problems, enjoy diving into GPU optimization and system optimization/design, and are excited to take your discoveries all the way to production at scale.

This is a hybrid position that requires working from our New York City office at least 3 days a week.

Why join us

PDT Partners has a stellar 30+ year track record and a reputation for excellence. Our goal is to be the best quantitative investment manager in the world, measured by the quality of our products, not their size. PDT’s very high employee-retention rate speaks for itself. Our people are intellectually extraordinary, and our community is close-knit, down-to-earth, and diverse.

Key Responsibilities

Analyze and understand system performance to enhance researcher throughput and velocity.

Focus on infrastructure/system-level efficiency, working across Python, PyTorch, OS, networking, storage, and CPU/GPU layers to optimize compute resource utilization

Read and understand software layers, providing suggestions/PRs that optimize parts of codebases.

Free up capacity and reduce costs by improving computational efficiency

Support scaling of core models by ensuring efficient implementation

Propose and implement systems to improve performance telemetry

Conduct proof-of-concept (PoC) evaluations and contribute to system design

Identify and act on optimization opportunities across the stack

Below is a list of skills and experiences we think are relevant. Even if you don’t think you’re a perfect match, we still encourage you to apply because we are committed to developing our people.

Strong proficiency in Linux and its associated performance engineering toolset.

Experience with PyTorch, GPUs and CUDA for optimization.

Deep understanding and appreciation of what happens at the hardware-software interface.

Versatile engineering mindset: ability to learn quickly, tackle diverse challenges, and adapt.

Skills in coding, micro-optimization, and understanding multiple programming languages.

Ability to analyze performance without being solely focused on heads-down optimization.

The salary range for this role is between $195,000 and $225,000. This range is not inclusive of any potential bonus amounts. Factors that may impact the agreed-upon salary within the range for a particular candidate include years of experience, level of education obtained, skill set, and other external factors.

PRIVACY STATEMENT: For information on ways PDT may collect, use, and process your personal information, please see PDT’s privacy notices.

Performance Modeling

Netpreme · United States

About the Role

We are seeking a Member of Technical Staff, Performance Modeling to develop performance models for end-to-end ML systems based on our silicon products. The role focuses on building rigorous, decision-useful models that connect architectural choices to real end-to-end workload behavior.

You'll work closely with silicon architects and workload teams to explore design tradeoffs, validate performance assumptions, and identify bottlenecks early in the development cycle. This role is well-suited for engineers who enjoy reasoning from first principles and co-exploring the design space as hardware and software evolve together.

This role will be performed onsite in Santa Clara, CA or Boston, MA.

Essential Duties & Responsibilities

Build and maintain system-level performance models for rack-scale ML infrastructure with our custom silicon components.
Evaluate end-to-end value proposition on ML workloads, translating model outputs into actionable guidance for the silicon team.
Work day-to-day with silicon architects, system designers, and workload owners to align performance expectations and constraints.
Identify performance bottlenecks, scaling limits, and sensitivity points across compute, memory, and interconnects.
Clearly communicate modeling assumptions, limitations, and conclusions to both technical and non-specialist stakeholders.

Qualifications

Bachelor's or Master's in Electrical Engineering, Computer Engineering, or a closely related field.
Deep understanding of ML systems: workload sharding, KV caching hierarchies, attention optimizations, and trade-offs when deploying ML models at scale.
Ability to quickly learn new ML architectures as they emerge and build performance models for them.
5–10+ years of experience in performance modeling for computer architectures, accelerators, or high-performance networking systems.
Ability to reason across multiple abstraction layers, from architectural details to system-level performance behavior.

Preferred Qualifications

PhD in Computer Science, Electrical Engineering, or a related field.
Prior experience modeling performance for ML accelerators and/or ML systems broadly.
Familiarity with shared memory systems and frameworks (e.g. CUDA VMM).
Experience with scale-up and high-bandwidth interconnects (e.g. NVLink or similar).

Compensation & Benefits

Competitive base salary, performance-based bonus, and early-stage equity grant
Comprehensive health, dental, vision, and life insurance
Relocation assistance and visa sponsorship
Daily lunch stipend, 401k match, and more
Sunny offices in Santa Clara, CA and Boston, MA

The Opportunity

Impact: We are tackling a fundamental challenge at the infrastructure layer: unlocking greater AI capability while dramatically improving efficiency. The work we do here compounds across state-of-the-art AI models, systems, and real-world applications.
Timing: Joining now means real ownership of the company and meaningful influence over product direction and execution. You'll work from first principles, move quickly from insight to execution, and see your contributions directly reflected in what we build.
Culture: You'll work alongside a group of people who care deeply about rigor, clarity, and impact. We value thoughtful disagreement, fast learning, and intellectual fearlessness. This is a place where strong ideas shine, curiosity is encouraged, and growth is a daily practice.

Senior Engineering Manager, AI Runtime

Databricks, Inc. · Hybrid · United States

Senior ML Kernel Performance Engineer

Amazon · In Person · United States

Senior Research Scientist, World Action Modeling

Waymo · Hybrid · United States

MTS, Kernels

Inception · In Person · United States

Inception creates the world’s fastest, most efficient AI models. Our Mercury model is the world’s fastest reasoning LLM and first commercially available diffusion LLM, delivering 5x greater speed and efficiency than today’s LLMs, with best-in-class quality.

We are the AI researchers and engineers behind such breakthrough AI technologies as diffusion models, flash attention, and DPO.

The Role

We're looking for engineers and scientists to design, optimize, and maintain the compute foundations that power large-scale language model training and inference. You will develop high-performance ML kernels, enable efficient low-precision arithmetic, and improve the distributed compute stack that makes training and serving large models possible.

Key Responsibilities

Design and implement custom ML kernels (CUDA, CuTe, Triton) for core dLLM operations such as attention, matrix multiplication, gating, and normalization, optimized for modern GPU architectures.
Design compute primitives to reduce memory bandwidth bottlenecks and improve kernel efficiency.
Contribute to infrastructure stability and scalability, ensuring reproducibility, consistency across precision formats, and high utilization of compute resources.

Qualifications

BS/MS/PhD in Computer Science, Engineering, or a related field (or equivalent experience).
Proficiency in CUDA, CuTe, Triton, or other GPU programming frameworks.
Understanding of ML frameworks (PyTorch, TensorFlow) from a systems perspective.
Background in performance optimization and profiling of ML systems.
Experience implementing low-precision formats (FP8, INT8, block floating point) or contributing to related compiler stacks (XLA, TVM).
Familiarity with distributed training techniques (data parallel, model parallel, pipeline parallel).
Proficiency in Python and at least one systems programming language (C++/Rust/Go).
Experience with containerization (Docker), orchestration (Kubernetes), and CI/CD pipelines.

Preferred Skills

Experience building and maintaining large-scale language models with tens of billions of parameters or more.
Experience with distributed systems and cloud computing platforms (AWS/GCP/Azure).
Familiarity with distributed frameworks such as PyTorch/XLA, DeepSpeed, Megatron-LM.
Prior contributions to open-source deep learning infrastructure such as PyTorch, DeepSpeed, or XLA.

Why Join Inception

Work with World-Class Talent: Collaborate with the inventors of diffusion models and leading AI researchers
Shape Foundational Technology: Your decisions will influence how the next generation of AI products are built and used
Immediate Impact: Join at the ground floor where your contributions directly shape product direction and company trajectory

Perks & Benefits

Competitive salary and equity in a rapidly growing startup
Flexible vacation and paid time off (PTO)
Health, dental, and vision insurance
Catered meals (breakfast, lunch, & dinner)
Commuter subsidies
A collaborative and inclusive culture

Onboard Infrastructure Software Engineer

Waymo · Hybrid · United States

Senior Developer Tools Engineer

Lemurian Labs · Hybrid · United States

Location: Santa Clara, California, USA or Toronto, Canada

Description

At Lemurian Labs, we’re on a mission to bring the power of AI to everyone—without leaving a massive environmental footprint. We care deeply about the impact AI has on our society and planet, and we’re building a rock-solid foundation for its future, ensuring AI grows sustainably and responsibly. Because let’s face it, what good is innovation if it doesn’t help the world?

We are building a high-performance, portable compiler that lets developers “build once, deploy anywhere.” Yes, anywhere. We’re talking about seamless cross-platform compatibility, so you can train your models in the cloud, deploy them to the edge, and everything in between—all while optimizing for resource efficiency and scalability.

If the idea of sustainably scaling AI motivates you and you’re excited about making AI development both powerful and accessible, then we’d love to have you. Join us at Lemurian Labs, where you can have fun building the future—without leaving a mess behind.

About the Role

As the founding member of our Developer Experience (DevX) team, you will be instrumental in shaping how engineers interact with our compiler infrastructure. You'll build the tools that give developers deep visibility into system performance—from profiling and debugging capabilities to hardware introspection interfaces. Your work will bridge the gap between our core compiler technology and the engineers who use it, transforming complex system data into actionable insights.

This role sits at the intersection of systems programming and developer tooling. You'll work closely with our compiler engineers to surface server-side telemetry through intuitive client-side interfaces, ultimately creating a best-in-class development experience for our users.

Here is what you will do:

Design and build developer tools for profiling, debugging, and performance introspection across our compiler stack.
Create client-side tooling that transforms server-side compiler telemetry into clear, actionable information for engineers.
Develop interfaces that expose hardware performance metrics and interrupt data in meaningful ways.
Build GPU debugging capabilities and visualization tools to help engineers understand execution on heterogeneous hardware.
Define formats and protocols for debug information exchange, working with standard debugger formats (DWARF, JTAG) and object file formats (ELF, COFF).
Collaborate with internal engineering teams to understand their needs and iterate on tooling, with a path toward external customer-facing tools.

Essential Skills and Experience:

3+ years of professional experience in systems-level software development.
Strong proficiency in C++ with experience writing performance-critical code.
Working knowledge of assembly language and low-level debugging techniques.
Familiarity with debugger formats (DWARF, JTAG) and object file formats (ELF, COFF).
Understanding of profiling methodologies and performance analysis tools.
Ability to work on-site at our Toronto or Santa Clara office.

Preferred Skills and Experience:

Experience with GPU programming and debugging (CUDA, ROCm, or similar).
Experience with OS-level interfaces including I/O subsystems and interrupt handling.
Background in compiler development or toolchain infrastructure.
Experience building developer-facing tools or IDEs.
Contributions to open-source debugging or profiling tools.

Salary depends on experience and geographical location.

This salary range may be inclusive of several career levels and will be narrowed during the interview process based on a number of factors, such as the candidate’s experience, knowledge, skills, and abilities, as well as internal equity among our team.

Additional benefits for this role may include: equity, company bonus opportunities, medical, dental, and vision benefits; retirement savings plan; and supplemental wellness benefits.

Lemurian Labs ensures equal employment op

Software Development Manager, ML Accelerators, AWS Neuron, Annapurna Labs