https://xerv.netlify.app/crayon_final.pdf

Updated and final CRAYON paper abstract:

```
Subword tokenization is a critical preprocessing stage in Large Language Model (LLM) inference and training pipelines. Traditional tokenizers rely on pointer-heavy trie structures or dynamic hash tables, which introduce severe memory fragmentation, high pointer-chasing latency, and significant cold-start loading overhead. This paper presents CRAYON, a systems-first BPE tokenization framework that performs vocabulary matching with memory-aligned Double-Array Tries (DAT). By memory-mapping vocabulary files through the operating system, CRAYON achieves zero-copy, sub-millisecond vocabulary swaps. To optimize inference, CRAYON integrates an optimistic AVX2 SIMD scanning pathway that processes 32-byte ASCII blocks with a single vector instruction, bypassing UTF-8 validation overhead when it is safe to do so. For massively parallel batch processing, CRAYON introduces a GPU-accelerated parallel lookup engine for CUDA and ROCm/HIP that avoids lock contention between threads through dynamic batch capacity planning. Furthermore, CRAYON implements a mathematically exact greedy BPE training algorithm built on a parallel-array linked list, an inverted occurrence index, and a lazy max-heap priority queue. Empirical evaluation shows that CRAYON achieves CPU throughput exceeding 18.4 million tokens/sec on standard benchmarks, outperforming Rust-based implementations by up to 35×, while maintaining a cold-start initialization latency of only 0.54 ms.
```
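To make the training part of the abstract concrete, here is a minimal single-threaded sketch of greedy BPE training that combines the three structures it names: a parallel-array linked list, an inverted occurrence index, and a lazy max-heap. It is not CRAYON's implementation. The function name `train_bpe`, the helpers `add` and `remove`, the index name `where`, the use of pre-split words as input, and the tie-breaking rule (lexicographically smallest pair wins) are all assumptions made for illustration.

```python
import heapq
from collections import defaultdict


def train_bpe(words, num_merges):
    """Greedy BPE over pre-split words. Returns (merges, segmented_words)."""
    # Parallel-array linked list: one slot per character, links by index.
    sym, prev, nxt, heads = [], [], [], []
    for word in words:
        start = len(sym)
        heads.append(start if word else -1)
        for k, ch in enumerate(word):
            sym.append(ch)
            prev.append(start + k - 1 if k > 0 else -1)
            nxt.append(start + k + 1 if k < len(word) - 1 else -1)

    counts = defaultdict(int)  # pair -> current frequency
    where = defaultdict(set)   # inverted index: pair -> left-hand positions
    heap = []                  # lazy max-heap of (-count, pair)

    def add(i):
        pair = (sym[i], sym[nxt[i]])
        counts[pair] += 1
        where[pair].add(i)
        heapq.heappush(heap, (-counts[pair], pair))

    def remove(i):
        pair = (sym[i], sym[nxt[i]])
        counts[pair] -= 1
        where[pair].discard(i)
        if counts[pair] > 0:
            heapq.heappush(heap, (-counts[pair], pair))

    for i in range(len(sym)):
        if nxt[i] != -1:
            add(i)

    merges = []
    while heap and len(merges) < num_merges:
        neg, pair = heapq.heappop(heap)
        if counts.get(pair, 0) != -neg:
            continue  # stale entry: the count changed after it was pushed
        merged = pair[0] + pair[1]
        merges.append(pair)
        for i in sorted(where[pair]):  # left to right within each word
            if i not in where[pair]:
                continue  # consumed by an overlapping merge, e.g. "aaa"
            j = nxt[i]
            p, n = prev[i], nxt[j]
            if p != -1:
                remove(p)  # old pair (left neighbour, a)
            if n != -1:
                remove(j)  # old pair (b, right neighbour)
            sym[i], sym[j] = merged, None  # slot j is now dead
            nxt[i] = n
            if n != -1:
                prev[n] = i
                add(i)     # new pair (merged, right neighbour)
            if p != -1:
                add(p)     # new pair (left neighbour, merged)
        del counts[pair], where[pair]

    segmented = []
    for h in heads:
        out = []
        while h != -1:
            out.append(sym[h])
            h = nxt[h]
        segmented.append(out)
    return merges, segmented


merges, segmented = train_bpe(["abab", "abab", "ab"], 2)
print(merges)     # [('a', 'b'), ('ab', 'ab')]
print(segmented)  # [['abab'], ['abab'], ['ab']]
```

The heap is "lazy" in that outdated entries are never deleted in place. Each count change pushes a fresh entry, and an entry is discarded at pop time if its count no longer matches `counts`. The inverted index means each merge visits only the positions where the pair occurs, rather than rescanning the corpus.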