474 Tracked Repos | 107,299 Commits | 9,603 Contributors
THE SIGNAL
The centre of gravity of ML systems work shifted decisively toward production inference this quarter. LLM Serving & Inference pulled in 2,144 unique contributors (up 13% QoQ), overtaking Training Frameworks (2,140, down 9%) for the first time in D33P S1GNL’s tracking history. The headline number: 594 net-new contributors appeared across the vllm and sglang ecosystems in a single quarter. The talent pool you’re competing for has moved; the engineers who were building training loops 12 months ago are now writing schedulers, KV cache logic, and batch-shaping code.
Last quarter’s predictions: hardware-specific serving acceleration (confirmed), AMD overtaking NVIDIA in kernel contributors (wrong on timeline, right on direction), distributed training normalising (punted). Two out of three on direction.
This edition tracks 9,603 contributors across 474 tracked repositories and 107,299 commits.
Q-over-Q Snapshot
The table captures a market rotating from “make the model” to “run the model cheaply.”
| Category | Commits | Contributors | Active Repos | Commits QoQ | Contribs QoQ |
|---|---|---|---|---|---|
| GPU Kernels & Performance | 14,795 | 1,303 | 66 | +24% | +5% |
| ML Compilers & Graph Optimization | 14,089 | 809 | 33 | -1% | +6% |
| Distributed Training & Parallelism | 6,236 | 645 | 25 | +10% | +18% |
| Inference Runtimes & Engines | 1,769 | 338 | 21 | -33% | -10% |
| LLM Serving & Inference | 16,525 | 2,144 | 60 | +14% | +13% |
| Training Frameworks & Model Architecture | 19,469 | 2,140 | 69 | -16% | -9% |
| ML Platform & Orchestration | 6,619 | 912 | 28 | -17% | +6% |
| Edge & On-Device ML | 8,228 | 622 | 34 | -24% | +7% |
| Model Optimization & Compression | 4,254 | 299 | 23 | +12% | -3% |
| Hardware-Software Co-Design | 10,388 | 1,262 | 24 | -6% | +2% |
| ML Debugging & Tooling | 2,736 | 290 | 18 | -7% | -16% |
| Agent Framework | 2,191 | 262 | 6 | -2% | -27% |
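The QoQ columns are plain percentage deltas against last quarter's totals. A minimal sketch of the arithmetic (the helper name and the back-calculated prior-quarter figure are illustrative, not taken from the report's pipeline):

```python
def qoq_change(current: int, previous: int) -> int:
    """Quarter-over-quarter percentage change, rounded to the nearest whole percent."""
    return round((current - previous) / previous * 100)

# Illustrative: recover an implied prior-quarter figure from this
# quarter's 16,525 serving commits and the reported +14% QoQ.
previous = round(16_525 / 1.14)      # roughly 14,500 commits last quarter
print(qoq_change(16_525, previous))  # → 14
```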
What’s Moving
🚀 LLM Serving & Inference
Every major serving project accelerated, but the nature of the work changed. The engineering across vllm and sglang has shifted from model integration toward production runtime internals: request scheduling, memory allocation, and kernel-level batch optimisation. A year ago, most serving commits were model-onboarding patches. Now the codebases read more like distributed systems projects than ML frameworks.
vllm crossed 618 contributors in Q4 (350 first-time), sglang hit 402 (244 new). vllm’s codebase now shows the engineering discipline of a production platform; its test and validation infrastructure has grown to match enterprise distributed systems. A newer NVIDIA-backed project is building at a different layer: closer to systems orchestration than model serving. If its growth holds, it will split how NVIDIA-ecosystem serving talent is distributed.
Hardware-specific serving forks targeting non-NVIDIA accelerators are building contributor communities at a pace that will tighten supply. The largest now carries well over 100 contributors, with the majority arriving this quarter alone. These are production-grade backends with dedicated CI and coordinated engineering teams. The contributor-level migration data here tells a more granular story; one we’re making available to a small number of hiring teams directly.
⚙️ GPU Kernels & Performance
Kernel commits surged 24% while contributor growth was a modest 5%: the existing engineers are writing more code, not being joined by new ones. Churn hit 38%, the highest of any category.
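Churn in this report presumably measures last quarter's contributors who did not return this quarter. A hedged sketch of that common definition (the handles and sets are made up for illustration):

```python
def churn_rate(prev_quarter: set[str], this_quarter: set[str]) -> float:
    """Share of last quarter's contributors absent this quarter."""
    departed = prev_quarter - this_quarter
    return len(departed) / len(prev_quarter)

# Hypothetical contributor handles, purely illustrative.
q3 = {"alice", "bob", "carol", "dave", "erin"}
q4 = {"alice", "bob", "carol", "frank"}
print(f"{churn_rate(q3, q4):.0%}")  # → 40%
```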
AMD’s primary kernel project now carries over 180 contributors, with more than half new this quarter. The work is tightly coupled to serving workloads: inference-specific attention and matmul optimisation, not generic compute. A kernel authoring framework from Meta’s PyTorch team shipped nearly 1,000 commits, built test-first at a density that signals production intent. Expect these frameworks to compress the hand-written-CUDA hiring pool by mid-2026.
🧪 Training Frameworks & Model Architecture
The contraction is structural. PyTorch’s engineering has shifted toward compiler passes and graph lowering. TensorFlow’s codebase is dominated by XLA backend work. Hugging Face transformers tilted toward serving integration. The framework label is legacy; the actual engineering is compiler delivery.
Where did the energy go? NVIDIA’s NeMo RL absorbed 75 contributors focused on distributed RL training loops. A newer project from a major Chinese research lab pulled 70 contributors with a 91% new-contributor ratio. The category is bifurcating: general-purpose maintenance is cooling while specialised training infrastructure concentrates into fewer, more focused projects.
🔧 Hardware-Software Co-Design
Tenstorrent’s combined output makes it the most active non-incumbent silicon programme in our tracking. The engineering has matured beyond prototyping: the codebase shows the kind of validation and testing discipline you’d expect from production infrastructure, not early-stage bring-up. Contributors here are the most deeply embedded of any category we track.
AMD’s ROCm ecosystem and Intel’s LLVM fork both maintained steady output, but the talent pool for non-NVIDIA silicon remains fragmented across vendor-specific ecosystems. Engineers who understand multiple backends are vanishingly rare.
🔀 Distributed Training & Parallelism
Contributor growth of 18% made this the fastest-expanding pool by headcount. The new projects are working at a lower level than traditional distributed training: GPU-level communication primitives and custom transport layers, not high-level orchestration. Churn at 26% confirms this talent is mobile; sourcing windows are narrow.
Quiet Corners
ML Compilers held flat on commits while adding 6% more contributors, concentrating around hardware-specific compiler backends. Edge & On-Device ML dropped 24% in commits while contributors grew 7%: ExecuTorch dominates and llama.cpp keeps pulling new arrivals, but smaller projects are going quiet. Inference Runtimes contracted sharply as general-purpose runtime work loses ground to LLM-specific serving stacks.
Model Optimization grew 12% on PyTorch’s INT4/FP8 quantisation work. ML Debugging & Tooling lost 16% of its contributors. Agent Frameworks shed 27% of their contributors while commit volume barely moved; sustained engineering effort is thinning.
Where Talent Is Moving
The strongest cross-pollination signal runs between ML Compilers and Training Frameworks: over 200 contributors worked across both. This reflects the merger of compiler optimisation into the training stack. For hiring, compiler engineers with training-loop context are the largest bridge population in ML systems.
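The cross-pollination figure is a set overlap: contributors who committed in both categories during the quarter. A minimal sketch of that intersection (handles are hypothetical):

```python
# Hypothetical contributor handles, purely for illustration.
compiler_contributors = {"ada", "ben", "cho", "dee"}
training_contributors = {"ben", "cho", "eli", "fay", "gus"}

# The bridge population: contributors active in both categories.
bridge = compiler_contributors & training_contributors
print(sorted(bridge))  # → ['ben', 'cho']
```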
Kernel engineers are splitting into two migration paths: one toward hardware co-design (silicon-specific optimisation), another toward LLM serving (attention kernels, quantised matmul). These populations barely intersect. A kernel engineer who understands ROCm primitives is a fundamentally different hire from one writing attention variants for vllm.
The direction of movement is overwhelmingly from training-adjacent work toward serving and kernel optimisation. There’s a layer beneath this finding that changes how you’d prioritise outreach. It’s part of what we share in our hiring intelligence briefings.
What This Means If You’re Hiring
Serving engineers with hardware-specific experience are the scarcest profile in the data. Non-NVIDIA serving fork pools remain small (low hundreds), and most contributors are still ramping. If you need someone who can optimise a serving stack for custom silicon today, you’re fishing in a pool measured in dozens. Specialist inference and kernel engineers command a 30-50% premium over generalist ML engineers at equivalent seniority, and that gap is widening.
Kernel engineers are producing at unprecedented intensity but the contributor base barely grew. Churn at 38% creates brief sourcing windows, but retention requires more than a competitive base salary. The emergence of kernel authoring frameworks is segmenting the pool: abstraction-stack engineers versus hardware-depth engineers. Your job description needs to specify which.
Cross-domain profiles (compiler + training, kernel + serving, distributed systems + hardware) represent the highest-leverage hires for teams building production ML infrastructure. The compiler-training bridge is the largest corridor; the kernel-serving overlap is growing fastest. These engineers rarely surface on traditional sourcing channels.
Q1 2026 will test whether the serving talent surge is sustainable or whether contributor fatigue sets in. If any of these patterns match what you’re seeing in your own pipeline, that’s a conversation worth having.
Predictions
Q3 scorecard: one hit, one miss, one punt. Batting .333 in this market is honest.
- By Q1 2026: The fastest-growing non-NVIDIA serving backend will cross 150 contributors, and at least one additional silicon-specific fork will emerge with 50+. The serving layer is fragmenting by hardware target faster than most hiring plans account for.
- Watch for Q1: Training Frameworks will contract for a third consecutive quarter, dropping below 2,000 unique contributors. The talent is redistributing into serving, RL-specific training, and compiler-adjacent work.
- By mid-2026: Kernel authoring frameworks will compress the hand-written-CUDA hiring pool by 15-20%. Teams that haven’t adjusted sourcing will find their candidate pipelines drying up.
The engineers you want next quarter are already deep in someone elseโs codebase. The question is whether you can see them before your competitors do.
This report is powered by D33P S1GNL: a proprietary contributor intelligence engine. For access to the full contributor-level dataset or to discuss ML Systems hiring, contact [email protected]
