Soham
@sohampal
https://xerv.netlify.app/crayon_final.pdf
Updated and final CRAYON paper.
Abstract:
```
Subword tokenization is a critical preprocessing stage in Large Language Model (LLM) inference and training pipelines.
Traditional tokenizers rely on pointer-heavy trie structures or dynamic hash tables, which introduce severe memory fragmentation, high
pointer-chasing latencies, and significant cold-start loading overheads. This paper presents CRAYON, a systems-first BPE tokenization
framework that performs vocabulary matching over memory-aligned Double-Array Tries (DATs). CRAYON achieves zero-copy,
sub-millisecond vocabulary swaps via operating-system memory mapping. To optimize inference, CRAYON integrates an optimistic AVX2
SIMD scanning pathway that processes 32-byte ASCII blocks per 256-bit vector operation, bypassing UTF-8 validation overhead when
it is safe to do so. For massively parallel batch processing, CRAYON introduces a GPU-accelerated parallel lookup engine in CUDA and
ROCm/HIP that avoids thread-wide lock contention through dynamic batch capacity planning. Furthermore, CRAYON implements a
mathematically exact greedy BPE training algorithm optimized via a parallel-array linked list, an inverted occurrence index, and a lazy
max-heap priority queue. Empirical evaluation demonstrates that CRAYON achieves CPU throughput exceeding 18.4 million tokens/sec on
standard benchmarks, outperforming Rust-based implementations by up to 35×, while maintaining a cold-start initialization latency of only 0.54 ms.
```
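The abstract does not spell out the lookup path, so here is a rough illustration of how a double-array trie replaces pointer chasing with flat integer arrays. It is a minimal sketch under stated assumptions, not CRAYON's code or memory layout: a byte-level vocabulary, the transition rule `t = base[s] + byte + 1` accepted only when `check[t] == s`, a naive first-fit construction, and greedy longest-prefix matching standing in for the encoding step. The class `DoubleArrayTrie` and its methods are hypothetical names.

```python
from collections import deque


class DoubleArrayTrie:
    """Byte-level double-array trie (illustrative sketch, hypothetical API).

    The child of state s on byte c lives at slot t = base[s] + c + 1 and is
    valid only if check[t] == s. value[t] holds a token id, or -1.
    """

    def __init__(self, vocab):
        # 1) Build an ordinary pointer-based trie from the byte-string vocabulary.
        root = {"next": {}, "id": -1}
        for token_id, token in enumerate(vocab):
            node = root
            for byte in token:
                node = node["next"].setdefault(byte, {"next": {}, "id": -1})
            node["id"] = token_id

        # 2) Flatten it into parallel integer arrays; slot 0 is the root.
        self.base, self.check, self.value = [0], [0], [-1]
        queue = deque([(root, 0)])
        while queue:
            node, state = queue.popleft()
            self.value[state] = node["id"]
            codes = sorted(node["next"])
            if not codes:
                continue  # leaf: base stays 0 and no slot records it as a parent
            b = 1
            while True:  # first-fit search for a base whose child slots are all free
                self._ensure(b + 257)
                if all(self.check[b + c + 1] == -1 for c in codes):
                    break
                b += 1
            self.base[state] = b
            for c in codes:
                self.check[b + c + 1] = state  # claim the slot for this parent
                queue.append((node["next"][c], b + c + 1))

    def _ensure(self, size):
        # Grow all three arrays together so they stay index-aligned.
        extra = size - len(self.check)
        if extra > 0:
            self.base.extend([0] * extra)
            self.check.extend([-1] * extra)
            self.value.extend([-1] * extra)

    def longest_match(self, data, start):
        """Return (token_id, length) of the longest vocab entry at data[start:]."""
        state, best = 0, (-1, 0)
        for i in range(start, len(data)):
            t = self.base[state] + data[i] + 1
            if t >= len(self.check) or self.check[t] != state:
                break  # no transition: two array reads, no pointer dereference
            state = t
            if self.value[state] != -1:
                best = (self.value[state], i - start + 1)
        return best

    def tokenize(self, data):
        """Greedy longest-prefix segmentation (a simplification, not BPE merge order)."""
        ids, i = [], 0
        while i < len(data):
            token_id, length = self.longest_match(data, i)
            if length == 0:
                raise ValueError(f"no vocab entry matches byte {data[i]} at offset {i}")
            ids.append(token_id)
            i += length
        return ids
```

With `vocab = [b"a", b"b", b"c", b"ab", b"abc", b"bc"]`, `DoubleArrayTrie(vocab).tokenize(b"abcab")` returns `[4, 3]` (`abc`, then `ab`). Because the structure is nothing but integer arrays, it could in principle be written to a file and memory-mapped without deserialization, which is presumably what makes the abstract's zero-copy loading possible.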
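The "optimistic" ASCII pathway can be illustrated with a scalar stand-in. This sketch assumes the fast path works block by block: a 32-byte block in which no byte has its top bit set is pure ASCII and needs no UTF-8 validation, and full validation resumes at the first block that fails. With AVX2 the per-block test would be a vector load plus `_mm256_movemask_epi8`; here a 256-bit integer mask plays that role. `ascii_prefix_len` and `validate_utf8` are hypothetical names.

```python
BLOCK = 32  # bytes per AVX2 register (256 bits)
HIGH_BITS = int.from_bytes(b"\x80" * BLOCK, "little")  # top bit of every byte


def ascii_prefix_len(data: bytes) -> int:
    """Length of the prefix made of whole 32-byte blocks that are pure ASCII."""
    i = 0
    while i + BLOCK <= len(data):
        # Scalar stand-in for load + movemask: any set top bit means non-ASCII.
        if int.from_bytes(data[i:i + BLOCK], "little") & HIGH_BITS:
            break
        i += BLOCK
    return i


def validate_utf8(data: bytes) -> bool:
    """Validate UTF-8, skipping the block-aligned ASCII prefix."""
    n = ascii_prefix_len(data)
    try:
        data[n:].decode("utf-8")  # full validation only past the fast path
    except UnicodeDecodeError:
        return False
    return True
```

Resuming at a block boundary is safe because every byte before it is ASCII, so no multi-byte UTF-8 sequence can straddle that boundary.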
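The three training data structures named in the abstract fit together roughly as follows. This is a hedged sketch, not CRAYON's algorithm: it assumes byte-level training on one contiguous sequence, new token ids starting at 256, ties broken by the smaller pair, and a lazy heap that re-pushes an entry whenever its stored count turns out to be stale. `train_bpe` is a hypothetical name.

```python
import heapq
from collections import defaultdict


def train_bpe(data: bytes, num_merges: int):
    """Greedy BPE over one byte sequence; returns (merges, final_symbols)."""
    n = len(data)
    sym = list(data)                                      # symbol per slot; -1 = deleted
    prev = [i - 1 for i in range(n)]                      # parallel-array linked list
    nxt = [i + 1 if i + 1 < n else -1 for i in range(n)]
    counts = defaultdict(int)                             # pair -> live adjacent count
    occ = defaultdict(set)                                # inverted index: pair -> left slots
    heap = []                                             # lazy max-heap of (-count, pair)

    def add(pair, pos):
        counts[pair] += 1
        occ[pair].add(pos)
        heapq.heappush(heap, (-counts[pair], pair))

    def remove(pair, pos):
        counts[pair] -= 1
        occ[pair].discard(pos)

    for i in range(n - 1):
        pair = (sym[i], sym[i + 1])
        counts[pair] += 1
        occ[pair].add(i)
    heap.extend((-c, p) for p, c in counts.items())
    heapq.heapify(heap)

    merges, next_id = [], 256
    while heap and len(merges) < num_merges:
        neg, pair = heapq.heappop(heap)
        if -neg != counts[pair]:                          # stale entry: refresh lazily
            if counts[pair] > 0:
                heapq.heappush(heap, (-counts[pair], pair))
            continue
        a, b = pair
        new, next_id = next_id, next_id + 1
        merges.append((a, b, new))
        for i in sorted(occ[pair]):                       # left-to-right, non-overlapping
            j = nxt[i]
            if sym[i] != a or j == -1 or sym[j] != b:
                continue                                  # destroyed by an earlier merge
            p, q = prev[i], nxt[j]
            if p != -1:                                   # left neighbour pair changes
                remove((sym[p], a), p)
                add((sym[p], new), p)
            if q != -1:                                   # right neighbour pair changes
                remove((b, sym[q]), j)
                add((new, sym[q]), i)
            remove(pair, i)
            sym[i], sym[j] = new, -1                      # splice j out of the list
            nxt[i] = q
            if q != -1:
                prev[q] = i
    return merges, [s for s in sym if s != -1]
```

`train_bpe(b"abab", 1)` merges `(97, 98)` into id 256 and leaves `[256, 256]`. Each merge touches only the slots listed in the inverted index, and stale heap entries are corrected when popped rather than updated in place; because every live pair always has an entry whose stored count is at least its current count, the popped non-stale pair is a true maximum, which keeps the greedy choice exact without rescanning the corpus.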