474 Tracked Repos | 107,299 Commits | 9,603 Contributors
THE SIGNAL
The centre of gravity of ML systems work shifted decisively toward production inference this quarter. LLM Serving & Inference pulled in 2,144 unique contributors (up 13% QoQ), overtaking Training Frameworks (2,140, down 9%) for the first time in D33P S1GNL’s tracking history. The headline number: 594 net-new contributors appeared across the vllm and sglang ecosystems in a single quarter. The talent pool you’re competing for has moved; the engineers who were building training loops 12 months ago are now writing schedulers, KV cache logic, and batch-shaping code.
Last quarter’s predictions: hardware-specific serving acceleration (confirmed), AMD overtaking NVIDIA in kernel contributors (wrong on timeline, right on direction), distributed training normalising (punted). Two out of three on direction.
This edition tracks 9,603 contributors across 474 tracked repositories and 107,299 commits.
Q-over-Q Snapshot
The table captures a market rotating from “make the model” to “run the model cheaply.”
| Category | Commits | Contributors | Active Repos | Commits QoQ | Contribs QoQ |
|---|---|---|---|---|---|
| GPU Kernels & Performance | 14,795 | 1,303 | 66 | +24% | +5% |
| ML Compilers & Graph Optimization | 14,089 | 809 | 33 | -1% | +6% |
| Distributed Training & Parallelism | 6,236 | 645 | 25 | +10% | +18% |
| Inference Runtimes & Engines | 1,769 | 338 | 21 | -33% | -10% |
| LLM Serving & Inference | 16,525 | 2,144 | 60 | +14% | +13% |
| Training Frameworks & Model Architecture | 19,469 | 2,140 | 69 | -16% | -9% |
| ML Platform & Orchestration | 6,619 | 912 | 28 | -17% | +6% |
| Edge & On-Device ML | 8,228 | 622 | 34 | -24% | +7% |
| Model Optimization & Compression | 4,254 | 299 | 23 | +12% | -3% |
| Hardware-Software Co-Design | 10,388 | 1,262 | 24 | -6% | +2% |
| ML Debugging & Tooling | 2,736 | 290 | 18 | -7% | -16% |
| Agent Framework | 2,191 | 262 | 6 | -2% | -27% |
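The QoQ columns are plain percentage deltas against last quarter's totals. A minimal sketch of the arithmetic (the helper name and the back-calculated prior-quarter figure are illustrative, not taken from the report's pipeline):

```python
def qoq_change(current: int, previous: int) -> int:
    """Quarter-over-quarter percentage change, rounded to the nearest whole percent."""
    return round((current - previous) / previous * 100)

# Illustrative: recover an implied prior-quarter figure from this
# quarter's 16,525 serving commits and the reported +14% QoQ.
previous = round(16_525 / 1.14)      # roughly 14,500 commits last quarter
print(qoq_change(16_525, previous))  # → 14
```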
What’s Moving
🚀 LLM Serving & Inference
Every major serving project accelerated, but the nature of the work changed. The engineering across vllm and sglang has shifted from model integration toward production runtime internals: request scheduling, memory allocation, and kernel-level batch optimisation. A year ago, most serving commits were model-onboarding patches. Now the codebases read more like distributed systems projects than ML frameworks.
vllm crossed 618 contributors in Q4 (350 first-time), sglang hit 402 (244 new). vllm’s codebase now shows the engineering discipline of a production platform; its test and validation infrastructure has grown to match enterprise distributed systems. A newer NVIDIA-backed project is building at a different layer: closer to systems orchestration than model serving. If its growth holds, it will split how NVIDIA-ecosystem serving talent is distributed.
Hardware-specific serving forks targeting non-NVIDIA accelerators are building contributor communities at a pace that will tighten supply. The largest now carries well over 100 contributors, with the majority arriving this quarter alone. These are production-grade backends with dedicated CI and coordinated engineering teams. The contributor-level migration data here tells a more granular story; one we’re making available to a small number of hiring teams directly.
⚙️ GPU Kernels & Performance
Kernel commits surged 24% while contributor growth was a modest 5%: the existing engineers are writing more code, not being joined by new ones. Churn hit 38%, the highest of any category.
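Churn in this report presumably measures last quarter's contributors who did not return this quarter. A hedged sketch of that common definition (the handles and sets are made up for illustration):

```python
def churn_rate(prev_quarter: set[str], this_quarter: set[str]) -> float:
    """Share of last quarter's contributors absent this quarter."""
    departed = prev_quarter - this_quarter
    return len(departed) / len(prev_quarter)

# Hypothetical contributor handles, purely illustrative.
q3 = {"alice", "bob", "carol", "dave", "erin"}
q4 = {"alice", "bob", "carol", "frank"}
print(f"{churn_rate(q3, q4):.0%}")  # → 40%
```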
AMD’s primary kernel project now carries over 180 contributors, with more than half new this quarter. The work is tightly coupled to serving workloads: inference-specific attention and matmul optimisation, not generic compute. A kernel authoring framework from Meta’s PyTorch team shipped nearly 1,000 commits, built test-first at a density that signals production intent. Expect these frameworks to compress the hand-written-CUDA hiring pool by mid-2026.
🧪 Training Frameworks & Model Architecture
The contraction is structural. PyTorch’s engineering has shifted toward compiler passes and graph lowering. TensorFlow’s codebase is dominated by XLA backend work. Hugging Face transformers tilted toward serving integration. The framework label is legacy; the actual engineering is compiler delivery.
Where did the energy go? NVIDIA’s NeMo RL absorbed 75 contributors focused on distributed RL training loops. A newer project from a major Chinese research lab pulled 70 contributors with a 91% new-contributor ratio. The category is bifurcating: general-purpose maintenance is cooling while specialised training infrastructure concentrates into fewer, more focused projects.
🔧 Hardware-Software Co-Design
Tenstorrent’s combined output makes it the most active non-incumbent silicon programme in our tracking. The engineering has matured beyond prototyping: the codebase shows the kind of validation and testing discipline you’d expect from production infrastructure, not early-stage bring-up. Contributors here are the most deeply embedded of any category we track.
AMD’s ROCm ecosystem and Intel’s LLVM fork both maintained steady output, but the talent pool for non-NVIDIA silicon remains fragmented across vendor-specific ecosystems. Engineers who understand multiple backends are vanishingly rare.
🔀 Distributed Training & Parallelism
Contributor growth of 18% made this the fastest-expanding pool by headcount. The new projects are working at a lower level than traditional distributed training: GPU-level communication primitives and custom transport layers, not high-level orchestration. Churn at 26% confirms this talent is mobile; sourcing windows are narrow.
Quiet Corners
ML Compilers held flat on commits while adding 6% more contributors, concentrating around hardware-specific compiler backends. Edge & On-Device ML dropped 24% in commits while contributors grew 7%: ExecuTorch dominates and llama.cpp keeps pulling new arrivals, but smaller projects are going quiet. Inference Runtimes contracted sharply as general-purpose runtime work loses ground to LLM-specific serving stacks.
Model Optimization grew 12% on PyTorch’s INT4/FP8 quantisation work. ML Debugging & Tooling lost 16% of its contributors. Agent Frameworks shed 27% of their contributors while commit volume barely moved; sustained engineering effort is thinning.
Where Talent Is Moving
The strongest cross-pollination signal runs between ML Compilers and Training Frameworks: over 200 contributors worked across both. This reflects the merger of compiler optimisation into the training stack. For hiring, compiler engineers with training-loop context are the largest bridge population in ML systems.
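The cross-pollination figure is a set overlap: contributors who committed in both categories during the quarter. A minimal sketch of that intersection (handles are hypothetical):

```python
# Hypothetical contributor handles, purely for illustration.
compiler_contributors = {"ada", "ben", "cho", "dee"}
training_contributors = {"ben", "cho", "eli", "fay", "gus"}

# The bridge population: contributors active in both categories.
bridge = compiler_contributors & training_contributors
print(sorted(bridge))  # → ['ben', 'cho']
```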
Kernel engineers are splitting into two migration paths: one toward hardware co-design (silicon-specific optimisation), another toward LLM serving (attention kernels, quantised matmul). These populations barely intersect. A kernel engineer who understands ROCm primitives is a fundamentally different hire from one writing attention variants for vllm.
The direction of movement is overwhelmingly from training-adjacent work toward serving and kernel optimisation. There’s a layer beneath this finding that changes how you’d prioritise outreach. It’s part of what we share in our hiring intelligence briefings.
What This Means If You’re Hiring
Serving engineers with hardware-specific experience are the scarcest profile in the data. Non-NVIDIA serving fork pools remain small (low hundreds), and most contributors are still ramping. If you need someone who can optimise a serving stack for custom silicon today, you’re fishing in a pool measured in dozens. Specialist inference and kernel engineers command a 30-50% premium over generalist ML engineers at equivalent seniority, and that gap is widening.
Kernel engineers are producing at unprecedented intensity but the contributor base barely grew. Churn at 38% creates brief sourcing windows, but retention requires more than a competitive base salary. The emergence of kernel authoring frameworks is segmenting the pool: abstraction-stack engineers versus hardware-depth engineers. Your job description needs to specify which.
Cross-domain profiles (compiler + training, kernel + serving, distributed systems + hardware) represent the highest-leverage hires for teams building production ML infrastructure. The compiler-training bridge is the largest corridor; the kernel-serving overlap is growing fastest. These engineers rarely surface on traditional sourcing channels.
Q1 2026 will test whether the serving talent surge is sustainable or whether contributor fatigue sets in. If any of these patterns match what you’re seeing in your own pipeline, that’s a conversation worth having.
Predictions
Q3 scorecard: one hit, one miss, one punt. Batting .333 in this market is honest.
- By Q1 2026: The fastest-growing non-NVIDIA serving backend will cross 150 contributors, and at least one additional silicon-specific fork will emerge with 50+. The serving layer is fragmenting by hardware target faster than most hiring plans account for.
- Watch for Q1: Training Frameworks will contract for a third consecutive quarter, dropping below 2,000 unique contributors. The talent is redistributing into serving, RL-specific training, and compiler-adjacent work.
- By mid-2026: Kernel authoring frameworks will compress the hand-written-CUDA hiring pool by 15-20%. Teams that haven’t adjusted sourcing will find their candidate pipelines drying up.
The engineers you want next quarter are already deep in someone elseโs codebase. The question is whether you can see them before your competitors do.
This report is powered by D33P S1GNL: a proprietary contributor intelligence engine. For access to the full contributor-level dataset or to discuss ML Systems hiring, contact [email protected]
