
Deep Signal Quarterly – Q3 2024

15th October 2024
By Steve Kilpatrick
Founder & Director
ML Systems and Infrastructure

474 Tracked Repos  |  76,258 Commits  |  7,341 Contributors

THE SIGNAL


Our Q2 predictions, revisited: the vLLM retention question was answered decisively. Churn dropped 35% QoQ while the contributor base grew 10%. vLLM is retaining talent, and that changes the serving talent map. The PyTorch compiler subsystem formalisation call needs another quarter of data. Agent framework absorption is tracking as predicted; the category shed 14% of its contributors this quarter, and the shallow engagement pattern in the dominant projects confirms the category is maturing into tooling rather than sustaining itself as a distinct engineering discipline.

GPU kernel engineering posted its strongest quarter in two years, and the growth came from an unexpected direction. While the largest vendor’s toolchain repos held steady, the contributor surge landed disproportionately in AMD’s kernel libraries and ARM’s compute primitives: projects focused on quantised operations, low-precision arithmetic, and attention mechanisms. Across LLM serving, vLLM pulled in 223 contributors, 149 of them brand new, yet churn dropped 35%. vLLM is not just attracting people; it is retaining them at a rate no other serving project matches.

This edition covers 7,341 contributors across 474 tracked repositories and 76,258 commits.

Q-over-Q Snapshot


The macro rotation is visible in a single scan: every category tied to inference cost (kernels, compilers, optimisation, serving) grew commits, while training-adjacent and agent categories contracted or flatlined.

Category                                    Commits  Contributors  Active Repos  Commits QoQ  Contribs QoQ
GPU Kernels & Performance                     4,455           631            35         +18%          +21%
ML Compilers & Graph Optimization            12,013           704            29         +18%           +0%
Distributed Training & Parallelism            1,877           319            19          -2%           -6%
Inference Runtimes & Engines                  1,921           369            19          -6%           -3%
LLM Serving & Inference                       7,518           886            42          +2%          +10%
Training Frameworks & Model Architecture     18,529         2,025            60          +2%           -2%
ML Platform & Orchestration                   6,645           807            29          -2%           +5%
Edge & On-Device ML                           8,095           554            29         +12%           -3%
Model Optimization & Compression              2,744           263            24         +10%           +4%
Hardware-Software Co-Design                   7,874           940            21          -6%           +7%
ML Debugging & Tooling                        2,342           274            16          -6%           +4%
Agent Framework                               2,245           616             5          -5%          -14%
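The rotation is easy to check against the snapshot itself. A minimal Python sketch, with the QoQ figures transcribed from the table above and the inference-cost grouping taken from the kernels/compilers/optimisation/serving framing used in this report:

```python
# QoQ deltas (commits %, contributors %) transcribed from the Q3 snapshot table.
qoq = {
    "GPU Kernels & Performance":                (+18, +21),
    "ML Compilers & Graph Optimization":        (+18, 0),
    "Distributed Training & Parallelism":       (-2, -6),
    "Inference Runtimes & Engines":             (-6, -3),
    "LLM Serving & Inference":                  (+2, +10),
    "Training Frameworks & Model Architecture": (+2, -2),
    "ML Platform & Orchestration":              (-2, +5),
    "Edge & On-Device ML":                      (+12, -3),
    "Model Optimization & Compression":         (+10, +4),
    "Hardware-Software Co-Design":              (-6, +7),
    "ML Debugging & Tooling":                   (-6, +4),
    "Agent Framework":                          (-5, -14),
}

# Categories tied to inference cost, per the framing in this report.
inference_cost = {
    "GPU Kernels & Performance",
    "ML Compilers & Graph Optimization",
    "Model Optimization & Compression",
    "LLM Serving & Inference",
}

# Every category whose commit volume grew quarter over quarter.
grew = {cat for cat, (commits_qoq, _) in qoq.items() if commits_qoq > 0}

# The claimed rotation: all inference-cost categories grew commits.
print(inference_cost <= grew)  # → True
```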

[Chart: Top Projects by Contributor Count]

What’s Moving


โš™๏ธ GPU Kernels & Performance

Attention kernels and low-precision arithmetic absorbed most of the new contributor energy. The deeper signal is in the composition: the ratio of performance-tuning work to feature work climbed sharply. Projects are concentrating on runtime primitives rather than new operator coverage. Launch overhead and memory bandwidth ceilings are gating throughput, and kernel engineers are responding accordingly.

A project focused on fused training kernels appeared in our tracking for the first time, drawing contributors from both the training framework and kernel performance pools. Coordination between contributors is unusually tight for a project this young, which typically indicates a close-knit core team rather than drive-by contributions. A vendor’s compiler-adjacent kernel project posted substantial commits from just 5 contributors, split between test infrastructure and compiler source modifications. That profile (tiny team, heavy test investment, compiler-adjacent kernel work) describes the engineer every GPU compute team wants and almost nobody can find.

🧠 ML Compilers & Graph Optimization

The same compiler engineers wrote substantially more code with zero net contributor growth. Mean active weeks hit 8.5, the highest of any category, confirming this remains the domain of deeply embedded specialists. Churn spiked, but that reflects integration-phase contributors cycling through while core teams remained stable.

The predicted split between graph-level and hardware-targeting compiler profiles materialised. Graph-level projects focused on autotuning and cost-per-token optimisation for GPU and TPU backends. Hardware-targeting projects concentrated on backend lowering and runtime library work that requires engineers who think simultaneously about hardware constraints and compiler IR semantics. By year-end, expect these two tracks to formalise into distinct hiring profiles with minimal overlap.

🚀 LLM Serving & Inference

vLLM’s contributor dynamics deserve scrutiny beyond the headlines. The project’s work evolved: scheduling logic, cache reuse, and batch-shaping code now dominate over model integration. That maturation pattern (from “support more models” to “optimise the runtime itself”) is exactly what happened with earlier serving infrastructure, and it preceded a wave of operational engineering hires.

A hardware-specific serving fork produced significant commits from just 16 contributors, with kernel-level quantisation and memory management work dominating. These two project types represent divergent strategies for the same problem, and the contributor profiles they attract are correspondingly different: one draws systems generalists; the other draws kernel specialists. There’s a layer beneath this finding that changes how you’d prioritise outreach. It’s part of what we share in our hiring intelligence briefings.

🔌 Hardware-Software Co-Design

Hardware co-design drew more people while producing less code, a pattern that typically precedes a velocity ramp as new contributors finish onboarding. One major hardware startup’s test-to-implementation ratio confirms verification and integration engineering, not kernel authoring, is the current bottleneck.

Intel’s presence across this category is sprawling. Between their compiler fork, compute runtime, and backend projects, Intel-affiliated work accounted for a substantial share of the category’s engineering hours. Commit patterns point to tightly coordinated internal teams. For hiring teams competing with Intel for hardware-software engineers: these contributors are deeply embedded in multi-quarter projects with high switching costs.

🧪 Training Frameworks & Model Architecture

Both major training frameworks are converging on the same thesis: value lives in the compilation pipeline, not the user-facing API. PyTorch’s commit energy concentrated in compiler infrastructure. TensorFlow’s work is dominated by XLA backend consolidation.

Specialised fine-tuning projects are producing fused kernel work from extraordinarily small teams. By Q1 2025, expect fine-tuning optimisation to pull enough contributor energy from both frameworks and model optimisation to warrant its own tracking.

Quiet Corners


Edge & On-Device ML surged 12% in commits but shed contributors; ExecuTorch alone accounted for nearly half the category’s output. Model Optimization grew 10% in commits with quantisation-aware training paths as the dominant theme. Inference Runtimes contracted modestly, with hardware plugin integration as the bulk of activity.

Distributed Training posted its second consecutive decline, with commit profiles suggesting major projects are in maintenance mode. ML Debugging & Tooling saw contributors tick up, driven by evaluation harness projects absorbing LLM benchmarking work. Agent Framework contracted 14% in contributors; the dominant projects still command hundreds of contributors but the majority are new with shallow engagement, consistent with tutorial-driven traffic. ML Platform grew in contributors with the deepest engagement in the dataset at nearly 11 mean active weeks.

Where Talent Is Moving


The largest overlap sits between ML compilers and training frameworks: 192 engineers active in both. That number has grown for three consecutive quarters, reflecting framework engineering converging with compilation. The second-largest connects serving to training frameworks at roughly 150 engineers, driven by work on model loading, weight management, and mixed-precision paths spanning both domains. Together, these two overlaps describe roughly 340 engineers who operate at the intersection of “how models are built” and “how models are run.”

The kernel-to-hardware overlap grew as hardware backend projects pulled in kernel performance contributors. This cross-pollination is asymmetric: kernel engineers move toward hardware backends more readily than the reverse.

[Chart: Talent Migration – Contributor Overlap Between Categories]

What This Means If You’re Hiring


Serving engineers with kernel-level depth are the scarcest profile in our data. vLLM absorbed 149 new contributors, but the core systems engineers number in the low dozens. Staff-level serving specialists with kernel fluency command $700K to $1.3M total comp at the upper end.

Compiler engineers who understand hardware targets represent a different kind of scarcity. Mean engagement of 8.5 weeks tells you they’re embedded in multi-quarter projects with high context-switching costs. Sourcing them requires understanding which compiler sub-domain they work in (graph optimisation vs hardware targeting vs automatic differentiation) and making a case that your problem is technically distinct. Generic compiler job descriptions will not reach them. ML compiler roles command $250K to $450K+ total comp.

Cross-domain profiles number in the low hundreds globally. They don’t respond to outreach that treats “ML engineer” as a single category.

If these patterns match what you’re tracking in your own hiring data, that’s a conversation worth having before Q4 intensifies demand.

Predictions


  • Q4 2024: PyTorch’s optimization library will cross 100 contributors and begin absorbing work from standalone quantisation libraries. Watch for contributor migration into its orbit as FP8 and INT4 paths stabilise.
  • Q4 2024: vLLM’s core team will tighten review processes to manage the contributor influx, slowing commit velocity but improving quality. That gatekeeping will push some contributors toward competing forks.
  • Q1 2025: Fine-tuning optimisation (LoRA kernels, memory-efficient gradient computation, adapter merging) will emerge as a distinct hiring category, pulling talent from both frameworks and model optimisation.

Q4 hiring will be shaped by two forces: inference cost reduction intensifying demand for serving, kernel, and compiler engineers, while distributed training contraction releases some senior talent. Teams that identify training-infrastructure engineers whose skills transfer to inference optimisation will find a window that closes by mid-2025.


This report is powered by D33P S1GNL: a proprietary contributor intelligence engine. For access to the full contributor-level dataset or to discuss ML Systems hiring, contact [email protected]
