MicroML - Dev Patel

Overview

MicroML started as an attempt to understand what modern deep learning systems actually look like beneath PyTorch abstractions. I wanted to build the core primitives myself: tensors, computational graphs, reverse-mode autodiff, neural network layers, optimizers, and vectorized kernels.

The framework is written in C++20 with minimal dependencies and supports dynamic computational graph construction, automatic differentiation, broadcasting semantics, configurable MLP architectures, and SIMD-accelerated matrix multiplication using AVX2/FMA instructions. Under the hood, the engine builds graphs lazily through operator overloading, topologically sorts the graph so that each node's gradient is fully accumulated before it is propagated to that node's inputs, and executes reverse-mode backpropagation across arbitrary acyclic computation graphs.
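The graph construction and backward pass described above can be sketched in a few dozen lines. This is a minimal scalar illustration, not MicroML's actual code: the names Node, Var, make_var, topo_sort, and backward are hypothetical, and the real engine works on tensors rather than single doubles.

```cpp
#include <cassert>
#include <functional>
#include <memory>
#include <unordered_set>
#include <vector>

// Hypothetical scalar graph node (assumption: the real engine stores tensors).
struct Node {
    double value = 0.0;
    double grad = 0.0;
    std::vector<std::shared_ptr<Node>> inputs;  // shared ownership of operands
    std::function<void(Node&)> backward_fn;     // adds this node's grad into its inputs
};

// Thin handle so that arithmetic operators can be overloaded on it.
struct Var {
    std::shared_ptr<Node> node;
};

Var make_var(double v) {
    auto n = std::make_shared<Node>();
    n->value = v;
    return Var{n};
}

// Each operator computes the forward value and records how to route gradients back.
Var operator+(const Var& a, const Var& b) {
    Var out = make_var(a.node->value + b.node->value);
    out.node->inputs = {a.node, b.node};
    out.node->backward_fn = [](Node& self) {
        self.inputs[0]->grad += self.grad;
        self.inputs[1]->grad += self.grad;
    };
    return out;
}

Var operator*(const Var& a, const Var& b) {
    Var out = make_var(a.node->value * b.node->value);
    out.node->inputs = {a.node, b.node};
    out.node->backward_fn = [](Node& self) {
        self.inputs[0]->grad += self.inputs[1]->value * self.grad;
        self.inputs[1]->grad += self.inputs[0]->value * self.grad;
    };
    return out;
}

// Depth-first post-order: every input is appended before the node that uses it.
void topo_sort(const std::shared_ptr<Node>& n,
               std::unordered_set<Node*>& seen,
               std::vector<Node*>& order) {
    if (!seen.insert(n.get()).second) return;  // already visited
    for (const auto& in : n->inputs) topo_sort(in, seen, order);
    order.push_back(n.get());
}

// Walk the order in reverse so a node's grad is complete before it is pushed to its inputs.
void backward(const Var& root) {
    assert(root.node != nullptr);
    std::unordered_set<Node*> seen;
    std::vector<Node*> order;
    topo_sort(root.node, seen, order);
    root.node->grad = 1.0;  // d(root)/d(root)
    for (auto it = order.rbegin(); it != order.rend(); ++it) {
        if ((*it)->backward_fn) (*it)->backward_fn(**it);
    }
}
```

With x = 2 and y = 3, building z = x * y + x and calling backward(z) leaves x.node->grad at 4 and y.node->grad at 2, because x reaches z along two paths and its gradient contributions are summed. The backward lambdas deliberately capture nothing and read operands through self.inputs, which avoids a node holding a shared pointer to itself.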

One of the most interesting parts of the project was designing the tensor and graph system itself. I implemented memory-aware tensor layouts with explicit shape and stride handling, NumPy-style broadcasting, and graph nodes backed by smart pointers for safe gradient propagation and parameter sharing. The training stack includes custom loss functions, Xavier initialization, AdamW optimization, and visualization tooling for inspecting gradient flow through generated DOT computation graphs.
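To make the shape, stride, and broadcasting handling concrete, here is a small sketch of the standard NumPy broadcasting rule expressed with row-major strides. The helper names (contiguous_strides, broadcast_shape, broadcast_strides) are assumptions for illustration and are not taken from MicroML; strides are counted in elements, not bytes.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <optional>
#include <vector>

using Shape = std::vector<std::size_t>;

// Row-major (C-order) strides in elements: the last dimension is contiguous.
std::vector<std::size_t> contiguous_strides(const Shape& shape) {
    std::vector<std::size_t> strides(shape.size());
    std::size_t step = 1;
    for (std::size_t i = shape.size(); i-- > 0;) {
        strides[i] = step;
        step *= shape[i];
    }
    return strides;
}

// NumPy-style rule: align shapes on the right; paired dims must match or one must be 1.
// Returns std::nullopt when the shapes are incompatible.
std::optional<Shape> broadcast_shape(const Shape& a, const Shape& b) {
    const std::size_t rank = std::max(a.size(), b.size());
    Shape out(rank);
    for (std::size_t i = 0; i < rank; ++i) {
        // i counts from the trailing dimension; missing leading dims behave like 1.
        const std::size_t da = i < a.size() ? a[a.size() - 1 - i] : 1;
        const std::size_t db = i < b.size() ? b[b.size() - 1 - i] : 1;
        if (da != db && da != 1 && db != 1) return std::nullopt;
        out[rank - 1 - i] = (da == 1) ? db : da;
    }
    return out;
}

// Strides for viewing a tensor of `shape` as `target` without copying.
// Broadcast dimensions get stride 0, so the same element is reread.
// Assumes `shape` is already known to broadcast to `target`.
std::vector<std::size_t> broadcast_strides(const Shape& shape, const Shape& target) {
    assert(shape.size() <= target.size());
    const auto base = contiguous_strides(shape);
    std::vector<std::size_t> out(target.size(), 0);  // new leading dims: stride 0
    const std::size_t offset = target.size() - shape.size();
    for (std::size_t i = 0; i < shape.size(); ++i) {
        out[offset + i] = (shape[i] == target[offset + i]) ? base[i] : 0;
    }
    return out;
}
```

For example, shapes {3, 1} and {4} broadcast to {3, 4}; the first operand is then read with strides {1, 0} and the second with strides {0, 1}, so neither needs to be materialized at the full size.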

I also spent a lot of time optimizing the linear algebra path. Matrix multiplication kernels were vectorized with xsimd, targeting AVX2/FMA instructions, and restructured for cache-friendly memory access. This produced roughly 8-9x speedups over naive implementations across multiple training workloads. End-to-end training time dropped from nearly five minutes to around thirty seconds, making experimentation significantly more interactive.
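The cache-friendly access pattern is the easiest part of this to show in isolation. The sketch below is portable scalar C++ and does not use xsimd or intrinsics, so it is not the MicroML kernel; the function names and the flat row-major layout are assumptions. It only contrasts a naive loop order with a reordered one whose inner loop touches memory contiguously, which is also the form that vectorizes naturally.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Row-major matrices in flat vectors: A is m x k, B is k x n, C is m x n.

// Naive i-j-k order: the inner loop walks down a column of B,
// jumping n elements per step, which is hard on the cache.
std::vector<float> matmul_naive(const std::vector<float>& a,
                                const std::vector<float>& b,
                                std::size_t m, std::size_t k, std::size_t n) {
    assert(a.size() == m * k && b.size() == k * n);
    std::vector<float> c(m * n, 0.0f);
    for (std::size_t i = 0; i < m; ++i) {
        for (std::size_t j = 0; j < n; ++j) {
            float acc = 0.0f;
            for (std::size_t p = 0; p < k; ++p) {
                acc += a[i * k + p] * b[p * n + j];
            }
            c[i * n + j] = acc;
        }
    }
    return c;
}

// Reordered i-k-j: the inner loop reads one row of B and updates one row of C,
// both contiguous, and each step is a multiply-add on adjacent elements.
std::vector<float> matmul_ikj(const std::vector<float>& a,
                              const std::vector<float>& b,
                              std::size_t m, std::size_t k, std::size_t n) {
    assert(a.size() == m * k && b.size() == k * n);
    std::vector<float> c(m * n, 0.0f);
    for (std::size_t i = 0; i < m; ++i) {
        for (std::size_t p = 0; p < k; ++p) {
            const float a_ip = a[i * k + p];  // reused across the whole inner loop
            for (std::size_t j = 0; j < n; ++j) {
                c[i * n + j] += a_ip * b[p * n + j];
            }
        }
    }
    return c;
}
```

Both functions compute the same product; only the traversal order differs. An explicitly vectorized kernel would typically keep the i-k-j structure, process the inner loop several floats at a time with fused multiply-adds, and add blocking so tiles of B stay in cache.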

More than anything, this project forced me to deeply understand how modern ML systems actually work: computational graphs, autodiff internals, numerical stability, memory layout, optimization dynamics, and the systems tradeoffs behind high-performance ML infrastructure.

Technical Highlights

Reverse-mode automatic differentiation engine
SIMD-optimized tensor kernels with AVX2/FMA
Dynamic computational graph construction and topological sort
NumPy-style tensor broadcasting and shape inference
AdamW optimizer with momentum and decoupled weight decay
Graph visualization tooling for gradient inspection
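
As an illustration of the AdamW item above, here is a minimal sketch of one update step following the standard formulation with decoupled weight decay and bias correction. The struct layout, field names, and default hyperparameters are assumptions chosen for the example, not MicroML's actual interface, and parameters are modeled as a flat vector of doubles.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical optimizer state for a single flat parameter vector.
struct AdamW {
    double lr = 1e-3;
    double beta1 = 0.9;
    double beta2 = 0.999;
    double eps = 1e-8;
    double weight_decay = 1e-2;
    std::vector<double> m;  // first-moment (momentum) estimate
    std::vector<double> v;  // second-moment estimate
    long step_count = 0;

    void step(std::vector<double>& params, const std::vector<double>& grads) {
        assert(params.size() == grads.size());
        if (m.empty()) {  // lazily size the state on first use
            m.assign(params.size(), 0.0);
            v.assign(params.size(), 0.0);
        }
        assert(m.size() == params.size());
        ++step_count;
        // Bias correction compensates for m and v starting at zero.
        const double bc1 = 1.0 - std::pow(beta1, step_count);
        const double bc2 = 1.0 - std::pow(beta2, step_count);
        for (std::size_t i = 0; i < params.size(); ++i) {
            const double g = grads[i];
            // Decoupled weight decay: shrink the parameter directly,
            // rather than adding an L2 term to the gradient.
            params[i] -= lr * weight_decay * params[i];
            m[i] = beta1 * m[i] + (1.0 - beta1) * g;
            v[i] = beta2 * v[i] + (1.0 - beta2) * g * g;
            const double m_hat = m[i] / bc1;
            const double v_hat = v[i] / bc2;
            params[i] -= lr * m_hat / (std::sqrt(v_hat) + eps);
        }
    }
};
```

Because the decay is applied to the parameter itself, a parameter with a zero gradient still shrinks by a factor of (1 - lr * weight_decay) per step, which is the behavior that distinguishes AdamW from Adam with L2 regularization.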