Crayon: The Tokenizer That Broke the Speed Barrier

Most people don't think about tokenizers. They should. Tokenization is the silent bottleneck in every LLM pipeline, and we just made it roughly 22x faster than tiktoken on CPU. Here's how Crayon works under the hood.

The Problem

Every LLM needs text broken into tokens before it can think. Traditional tokenizers use hash maps or pointer-heavy trees. They're slow. They thrash your cache. They choke on large vocabularies. tiktoken, the industry standard, processes about 1M tokens per second on CPU. That sounds fast until you're preprocessing terabytes of training data.

The Cartridge System

Crayon's core idea is simple. Precompile your vocabulary into a binary profile called a cartridge. Ship it as a .DAT file. Load it instantly using zero-copy memory mapping. No parsing. No allocation. No startup delay.

Two cartridges ship out of the box. Lite, at 50k subwords, for general-purpose work. Standard, at 206k, for rich multilingual and multi-domain coverage. Swapping between them takes less than a millisecond.

Double Array Trie

Instead of hash maps or trees, Crayon compiles vocabularies into a Double Array Trie. Two flat arrays, BASE and CHECK, encode every transition. Each character step is O(1): one read from BASE to find the next state, one read from CHECK to confirm it. Matching a token costs time proportional to its length, not to the size of the vocabulary. No pointer chasing. Far fewer cache misses. The entire structure is cache-aligned and fits in L3.

The binary format is dead simple. A 12-byte header with magic bytes and version. Then three contiguous int32 arrays for BASE, CHECK, and VALUES. Serialized directly to disk. Loaded directly into memory via mmap. Zero copy. Zero overhead.

BPE Training

The trainer represents the corpus as a linked list stored in parallel arrays. Four contiguous arrays: tokens, prev_pos, next_pos, and active. Merging two adjacent tokens is a constant-time index update. No struct allocation. No heap pressure.

An inverted index maps every byte pair to the exact positions where it occurs. The trainer never rescans the corpus. It jumps directly to merge sites.

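
To make the parallel-array list and the inverted index concrete, here is a minimal Python sketch of a single merge step. It is an illustration under stated assumptions, not Crayon's C++ trainer: the four array names come from the description above, while build_corpus, merge_pair, live_tokens, and the new token id 256 are hypothetical names chosen for this example.

```python
from collections import defaultdict

def build_corpus(ids):
    """Lay a token sequence out as four parallel arrays plus a pair index."""
    n = len(ids)
    tokens = list(ids)
    prev_pos = [i - 1 for i in range(n)]                      # -1 means no neighbour
    next_pos = [i + 1 if i + 1 < n else -1 for i in range(n)]
    active = [True] * n
    index = defaultdict(set)      # (left, right) -> positions of the left token
    for i in range(n - 1):
        index[(tokens[i], tokens[i + 1])].add(i)
    return tokens, prev_pos, next_pos, active, index

def merge_pair(tokens, prev_pos, next_pos, active, index, pair, new_id):
    """Merge every live occurrence of `pair`, visiting only indexed sites."""
    left, right = pair
    merged = 0
    for i in sorted(index.pop(pair, ())):
        j = next_pos[i]
        # Skip sites invalidated by an earlier merge in this same pass.
        if not active[i] or j == -1 or tokens[i] != left or tokens[j] != right:
            continue
        prv, nxt = prev_pos[i], next_pos[j]
        # The pairs touching this site are about to change: unregister them.
        if prv != -1:
            index[(tokens[prv], left)].discard(prv)
        if nxt != -1:
            index[(right, tokens[nxt])].discard(j)
        # Constant-time splice: position i takes the new token, j is retired.
        tokens[i] = new_id
        active[j] = False
        next_pos[i] = nxt
        if nxt != -1:
            prev_pos[nxt] = i
        # Register the pairs that now exist around the merged token.
        if prv != -1:
            index[(tokens[prv], new_id)].add(prv)
        if nxt != -1:
            index[(new_id, tokens[nxt])].add(i)
        merged += 1
    return merged

def live_tokens(tokens, next_pos):
    """Walk the list from position 0, which is never retired by a merge."""
    out, i = [], (0 if tokens else -1)
    while i != -1:
        out.append(tokens[i])
        i = next_pos[i]
    return out

corpus = build_corpus(b"abababc")                  # bytes iterate as ints
count = merge_pair(*corpus, pair=(97, 98), new_id=256)
print(count, live_tokens(corpus[0], corpus[2]))    # 3 [256, 256, 256, 99]
```

The merge loop touches only the positions listed for the chosen pair, and each splice rewrites a fixed number of array slots. A production trainer would also keep pair counts in sync with these index updates.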
A lazy max heap tracks pair frequencies. Counts are never updated in place. A stale entry is detected with an O(1) check when it is popped and simply discarded. No expensive heap rearrangements.

CPU Engine

The CPU backend uses AVX2 SIMD to check 32 bytes for ASCII in a single instruction. If the next 32 bytes are pure ASCII, the engine bypasses all UTF-8 boundary checks and runs an aggressive fast path. For mixed text it falls back to safe UTF-8 validation.

GPU Engine

CUDA and HIP backends map one document per thread. The entire DAT lives in VRAM. Each thread performs lookahead up to 128 characters without block synchronization. We fixed a truncation bug where sequences over 4096 tokens were silently dropped. Capacity is now sized dynamically from the longest sequence in the batch.

Benchmarks

On a 68 KB mixed corpus, Crayon Lite processes 24.6M tokens per second on CPU. tiktoken manages 1.1M. That is roughly 22x faster. On a 100 million character stress test, Crayon CPU sustained 23.8M tokens per second. Cold profile load takes 0.54 milliseconds versus 1200 to 2100 milliseconds for JSON-based tokenizers.

Why This Matters

Tokenization is not glamorous. But it is the first thing that happens in every LLM pipeline. If your tokenizer is slow, everything downstream waits. Training costs go up. Inference latency goes up. Iteration speed goes down.

Crayon makes tokenization free. Not cheap. Free. At 24M tokens per second on a single CPU core, it stops being a bottleneck and starts being invisible.

The code is open source. MIT licensed. Written in C++17 with Python bindings. CUDA and HIP for GPU. AVX2 and AVX-512 for CPU. Link in the comments.

Repo: https://github.com/Xerv-Org/CRAYON