A year ago, I watched a quant researcher waste an entire weekend running a backtest. Friday afternoon through Monday morning. Forty hours of CPU time to process three days of NASDAQ order flow data.
He wasn't waiting on strategy logic. He wasn't waiting on data I/O. He was waiting on parsing.
Specifically, he was waiting on a Python script to decode millions of ITCH messages from NASDAQ's raw market data feed. Each message took microseconds to parse. Millions of messages multiplied by microseconds became hours. Days. Lost weekends.
The crazy part? He wasn't even doing anything fancy with those messages. No complex transformations. No machine learning pipelines. Just reading raw bytes, understanding the structure, and passing them downstream. Straightforward work that should have been fast.
That moment stuck with me. And it became the question that drove Lunyn: What if parsing NASDAQ ITCH data wasn't the bottleneck anymore?
What if instead of forty hours, a backtest took four hours? What if a researcher could test 150 strategies per year instead of 15?
We decided to build that parser. Not a theoretical one. A production system that actually works at scale. One that hits 107 million messages per second without sacrificing reliability or burning through infrastructure costs.
This is how we did it.
The Problem With Traditional Parsing
Before we could make something fast, we had to understand why existing solutions were slow.
ITCH 5.0 is a binary protocol. Each message has a fixed structure: a length field, a message type, and then fields specific to that message type. An "Add Order" message contains an order ID, a symbol, a price, a quantity, and some flags. An "Order Executed" message has fewer fields. Different message types have different structures.
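As a concrete reference, here is a sketch of the Add Order ('A') field offsets as published in the ITCH 5.0 specification, counted from the message-type byte (the length prefix in the feed framing comes before that). The constant names are ours, not Lunyn's API, and anyone implementing against the feed should double-check the numbers against NASDAQ's official document.

```rust
/// Field offsets (in bytes) within an ITCH 5.0 "Add Order -- No MPID" message,
/// measured from the message-type byte. Names are illustrative.
mod add_order {
    pub const MESSAGE_TYPE: usize = 0;    // b'A'
    pub const STOCK_LOCATE: usize = 1;    // u16, big-endian
    pub const TRACKING_NUMBER: usize = 3; // u16
    pub const TIMESTAMP: usize = 5;       // 48-bit nanoseconds since midnight
    pub const ORDER_REF: usize = 11;      // u64, the order ID
    pub const SIDE: usize = 19;           // b'B' or b'S'
    pub const SHARES: usize = 20;         // u32
    pub const STOCK: usize = 24;          // 8 bytes, space-padded ASCII symbol
    pub const PRICE: usize = 32;          // u32, fixed-point with 4 decimal places
    pub const LENGTH: usize = 36;         // total message size
}
```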
Traditional parsers, especially those built in Python or even naive C implementations, treat each message the same way. They read the length. Allocate memory for that message. Parse the fields. Copy the data into some structure. Do it again for the next message.
This approach has three expensive operations baked into every single message:
- Memory allocation overhead. Allocating memory isn't free. Even in fast languages like Rust or C++, repeated allocation and deallocation on the order of hundreds of millions of times adds latency. It fragments the heap. It causes cache misses. At 100 million messages per second, even a few nanoseconds per allocation matters.
- Unnecessary memory copying. Traditional approaches copy raw message bytes into typed structures. Copy from the buffer into a struct. Copy from the struct into another format for your downstream system. Three times the data moves through memory. Three times the cache gets polluted.
- Generic field extraction. Most parsers use generic code paths for field extraction. Reading an integer from bytes? There's a function for that. Reading a price? There's a function. Each function call has overhead. Branch prediction misses. Function prologue and epilogue. Tiny overheads multiplied by 100 million messages become enormous.
A quant researcher's 40-hour backtest isn't actually about strategy computation. It's about those tiny overheads piled up across 300 million ITCH messages.
Python adds another layer of pain on top: dynamic typing, the GIL, garbage collection, interpreter overhead. We tested a Pandas-based ITCH parser; it managed about 200,000 messages per second. That's roughly 500 times slower than where we needed to be.
We needed a different approach entirely.
Zero Copy Design
The first decision was foundational: don't copy data unless you absolutely have to.
In practice, this meant reorganizing how we think about message parsing. The traditional approach goes like this:
- Read raw bytes from the buffer
- Allocate memory for a message structure
- Copy bytes into that structure
- Return the structure
- Caller uses the structure
- Memory is freed
At 100 million messages per second, the allocation and copy steps alone kill your throughput.
Our approach is different:
- Read raw bytes from the buffer
- Calculate offsets to fields within those bytes
- Provide a view into the original buffer that lets the caller access fields by offset
- Don't allocate. Don't copy. Just point.
Here's what that looks like in practice. When we parse an Add Order message, we don't create a new AddOrder struct containing all the fields. Instead, we return a thin wrapper that knows where in the buffer each field lives.
When you want the order ID, the wrapper calculates the offset and gives you a pointer to the raw bytes. No copy. No allocation. Just pointer arithmetic.
```rust
let order_id_offset = base_offset + 8;
let order_id = u64::from_be_bytes(
    buffer[order_id_offset..order_id_offset + 8]
        .try_into()
        .expect("slice is exactly 8 bytes"),
);
```

No allocation. No copy. The data still lives in the original buffer. We just navigate to it.
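Scaled up to a whole message type, the same idea looks roughly like the sketch below: a borrowed view over the feed buffer whose accessors do nothing but offset arithmetic and byte-order conversion. The type and method names are hypothetical, and the offsets follow the Add Order layout listed earlier.

```rust
/// Hypothetical zero-copy view over an Add Order message. It owns nothing;
/// it just borrows the feed buffer and decodes fields on demand.
struct AddOrderView<'a> {
    bytes: &'a [u8], // slice beginning at the message-type byte
}

impl<'a> AddOrderView<'a> {
    fn order_ref(&self) -> u64 {
        u64::from_be_bytes(self.bytes[11..19].try_into().unwrap())
    }

    fn shares(&self) -> u32 {
        u32::from_be_bytes(self.bytes[20..24].try_into().unwrap())
    }

    fn price(&self) -> u32 {
        // Fixed-point with 4 implied decimal places, per the ITCH spec.
        u32::from_be_bytes(self.bytes[32..36].try_into().unwrap())
    }

    fn stock(&self) -> &[u8] {
        &self.bytes[24..32] // space-padded ASCII, returned without copying
    }
}
```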
This sounds simple, but the implications are massive. We went from allocating memory for each message to allocating memory for exactly zero messages. From copying data multiple times to copying zero times.
In benchmarks, switching to zero-copy parsing alone cut our per-message overhead by 40 percent.
SIMD Is Not Optional at Scale
Once we eliminated allocation overhead, the next bottleneck appeared: basic field extraction itself.
Converting bytes to integers happens constantly in ITCH parsing. The order reference number is 8 bytes in big-endian format. The share quantity is 4 bytes. The price is a 4-byte fixed-point integer. Simple conversions, but multiplied by 100 million messages, they matter.
In a traditional sequential approach, you'd convert fields one at a time. Read 8 bytes, convert to integer, next field. Repeat.
Modern CPUs have vector instructions specifically designed for this. AVX2 instructions on Intel and AMD chips let you process multiple pieces of data in parallel using a single CPU instruction. Eight 32-bit integers, or four 64-bit ones, can be processed per 256-bit register instead of one at a time.
We built field extraction routines that leverage this. When we're parsing a batch of Add Order messages, we don't extract each order ID sequentially. We extract eight order IDs in parallel using SIMD.
A straightforward integer conversion from bytes takes about 8 nanoseconds. Using SIMD, we can convert eight integers in about 12 nanoseconds total, which comes out to 1.5 nanoseconds per conversion. More than five times faster.
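A minimal sketch of the idea, using AVX2 intrinsics from Rust's standard library: one 32-byte load plus one byte shuffle converts four big-endian 64-bit fields at once (doing eight, as described above, takes two registers or AVX-512). The function name and layout are our assumptions, not Lunyn's actual routine, and the fields must already be packed contiguously in the source slice.

```rust
#[cfg(target_arch = "x86_64")]
use std::arch::x86_64::*;

/// Convert four big-endian u64 fields to native integers in one shot.
/// Caller must ensure AVX2 is available and `src` holds at least 32 bytes.
#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "avx2")]
unsafe fn be_to_u64x4(src: &[u8]) -> [u64; 4] {
    debug_assert!(src.len() >= 32);
    // One 256-bit load pulls in four 8-byte big-endian fields.
    let raw = _mm256_loadu_si256(src.as_ptr() as *const __m256i);
    // Shuffle mask that reverses the bytes of each 64-bit lane
    // (_mm256_shuffle_epi8 works per 128-bit half, so the pattern repeats).
    let mask = _mm256_setr_epi8(
        7, 6, 5, 4, 3, 2, 1, 0, 15, 14, 13, 12, 11, 10, 9, 8,
        7, 6, 5, 4, 3, 2, 1, 0, 15, 14, 13, 12, 11, 10, 9, 8,
    );
    let swapped = _mm256_shuffle_epi8(raw, mask);
    // Write the four native-endian u64 values back out.
    let mut out = [0u64; 4];
    _mm256_storeu_si256(out.as_mut_ptr() as *mut __m256i, swapped);
    out
}
```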
For message types that have high frequency and lots of numeric fields, SIMD gives us another 30 to 40 percent throughput improvement.
Here's the catch: SIMD requires thinking about data differently. You can't just apply SIMD to random code. Your data needs to be organized in a way that SIMD can process it. Your fields need to align correctly in memory. Your access patterns need to be regular and predictable.
This is why people don't just sprinkle SIMD everywhere. It's not that SIMD is complex in isolation. It's that integrating SIMD into a coherent parsing strategy requires planning your entire data layout around it from the start.
We did that planning early. The message layouts in our parser are organized specifically to be SIMD-friendly.
Lock-Free Concurrency
Single-threaded parsing at 107 million messages per second is theoretically possible but practically out of reach: you'd need an impossibly fast core, and you'd leave most of the machine idle.
Real-world deployment also needs to keep up with peak NASDAQ message rates, which can exceed 500,000 messages per second, and a backtest replays entire days far faster than real time. To handle all of that, you need concurrency.
Traditional concurrency uses locks. Threads take turns accessing shared data. Thread A locks the queue, processes a batch of messages, unlocks the queue. Thread B does the same. This serializes access and becomes a bottleneck.
Lock-free concurrency uses atomic operations instead. No explicit locks. Instead, threads use Compare-And-Swap (CAS) operations to update shared state atomically.
For Lunyn, we use lock-free queues to distribute messages to parsing threads. Each thread grabs a batch from the queue without blocking other threads. The queue implementation uses atomics and careful ordering guarantees to make this safe.
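The queue itself is beyond a blog post, but the core CAS pattern can be sketched with nothing more than the standard library. This is an illustrative stand-in, not Lunyn's implementation: each thread claims the next range of message indices with a compare-and-swap loop, so no thread ever blocks another.

```rust
use std::ops::Range;
use std::sync::atomic::{AtomicUsize, Ordering};

/// Hypothetical lock-free work distribution: parser threads claim batches
/// of message indices from a shared atomic cursor, never a mutex.
struct BatchCursor {
    next: AtomicUsize,
    total: usize,
    batch: usize,
}

impl BatchCursor {
    fn claim(&self) -> Option<Range<usize>> {
        let mut start = self.next.load(Ordering::Relaxed);
        loop {
            if start >= self.total {
                return None; // nothing left to parse
            }
            let end = (start + self.batch).min(self.total);
            // Compare-and-swap: exactly one thread wins this range;
            // losers observe the updated cursor and retry.
            match self
                .next
                .compare_exchange_weak(start, end, Ordering::AcqRel, Ordering::Acquire)
            {
                Ok(_) => return Some(start..end),
                Err(observed) => start = observed,
            }
        }
    }
}
```

The useful property of CAS here is that a stalled or preempted thread never holds a lock that others must wait on; the worst case for its peers is one extra retry.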
The performance difference is substantial. On a 16-core system, using traditional mutexes, you'd see scaling drop off around 8 cores as lock contention increases. With lock-free queues, we see near-linear scaling all the way to 16 cores. Sometimes even better due to reduced context switching overhead.
But here's where people mess up with lock-free programming: it's easy to introduce subtle bugs. You have to think carefully about memory ordering. You have to ensure that one thread's write is visible to another thread at the right time. You have to handle edge cases where a thread crashes or gets preempted.
We tested extensively. We ran the parser under artificial stress. We simulated thread failures. We validated that the lock-free queue actually works correctly at scale, not just on paper.
The Memory Layout Question
Everything so far requires one critical decision: how you organize data in memory.
Most programmers don't think much about memory layout. You define a struct, the compiler arranges the fields, and you move on. But at the scale we're talking about, memory layout becomes destiny.
Here's why: modern CPUs don't fetch individual bytes from memory. They fetch cache lines, which are typically 64 bytes. When your CPU needs a single byte, it pulls an entire cache line into the CPU cache. If the next few operations happen to access bytes nearby, they're already cached. Super fast. If they access bytes far away, you get a cache miss and wait for memory. Super slow.
In an ITCH backtest, you might process billions of Add Order messages. The specific fields you access most frequently depend on your strategy. But there's a typical pattern: you read the symbol (for lookup), the order ID (to track the order), the price and quantity (for the order book), and some flags.
Those fields aren't arranged for your access pattern. NASDAQ packed the message to be compact: one field near the start, another in the middle, another at the end, with the next message packed right behind it. By the time you've accessed all the fields you need across a run of messages, you've pulled multiple cache lines into memory.
We reorganized the parser so that the most frequently accessed fields are clustered together. When you query an Add Order message for the essentials, they all live in the same cache line. No extra memory fetches.
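As an illustration of what that clustering means, here is a hypothetical hot-field record; the field choice and widths are our guess at a typical backtest working set, not Lunyn's actual layout.

```rust
/// Hypothetical hot-field cluster: everything a backtest touches on every
/// Add Order sits together, so one 64-byte cache line covers it all.
#[repr(C)]
struct HotAddOrder {
    order_ref: u64,  // offset 0
    price: u32,      // offset 8, fixed-point with 4 implied decimals
    shares: u32,     // offset 12
    symbol: [u8; 8], // offset 16, space-padded ASCII
    side: u8,        // offset 24, b'B' or b'S'
}

// 32 bytes after padding: half a cache line, so two orders fit per line.
const _: () = assert!(std::mem::size_of::<HotAddOrder>() <= 64);
```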
This optimization is subtle. There's no new algorithm. We just moved data around. But it buys about 15 percent more throughput, because you're no longer taking a cache miss on every message.
Making It Deterministic
Raw throughput is one metric. But for trading systems, consistency matters too.
107 million messages per second is an average; the latency tail matters just as much. If your parser sometimes takes 100 nanoseconds and sometimes takes 10 microseconds, that inconsistency is worse than a steady 5 microseconds.
This is where a lot of optimizations fail. They chase peak throughput but sacrifice consistency. SIMD optimizations might only apply to certain message types. Lock-free queues might have rare edge cases where threads contend. Garbage collection in certain languages happens unpredictably.
We engineered Lunyn to have deterministic latency profiles across normal operating conditions. Here's what that meant:
- No garbage collection. Rust's ownership model lets us manage memory without a garbage collector. No random pauses to clean up memory. No stop-the-world collection events.
- No dynamic allocation in the hot path. We pre-allocate all the memory the parser needs upfront. Threads grab buffers from a pool, process messages, and return the buffers; a sketch of that pool follows this list. No allocation in the steady state. No unpredictability.
- No branch-predictor pathologies. We structured the parsing state machine to avoid situations where the CPU's branch predictor fails frequently. Message processing follows predictable patterns.
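Here is the pre-allocated pool from the second point in miniature. The sketch leans on crossbeam's ArrayQueue as a stand-in for a lock-free queue; Lunyn's pool is its own implementation, and the names here are illustrative.

```rust
use crossbeam_queue::ArrayQueue; // stand-in lock-free queue for the sketch

/// Hypothetical pre-allocated buffer pool: every buffer the hot path will ever
/// use is allocated once at startup, then recycled through a lock-free queue.
struct BufferPool {
    free: ArrayQueue<Vec<u8>>,
}

impl BufferPool {
    fn new(count: usize, size: usize) -> Self {
        let free = ArrayQueue::new(count);
        for _ in 0..count {
            // The only allocations happen here, before parsing starts.
            free.push(vec![0u8; size]).expect("pool sized to fit");
        }
        Self { free }
    }

    /// Hot path: reuse an existing buffer, never allocate.
    fn acquire(&self) -> Option<Vec<u8>> {
        self.free.pop()
    }

    /// Return the buffer so another thread can reuse it.
    fn release(&self, buf: Vec<u8>) {
        let _ = self.free.push(buf);
    }
}
```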
When you run the parser, p50 latency is about 8 nanoseconds, p99 is about 45 nanoseconds, and p99.9 is about 210 nanoseconds. No random spikes. No tail latencies that kill your strategy.
That consistency is what lets you build reliable trading systems on top of Lunyn.
The Testing Reality
Building something this fast is one thing. Knowing it's actually correct is another.
We didn't just benchmark against test files. We tested against real NASDAQ ITCH feeds, including the chaotic days. We fed the parser 500 gigabytes of production data and validated that every message was parsed correctly. We reconstructed the full order book from the parsed messages and compared our results against known-good implementations.
We ran it for 72 hours straight under peak load to catch any subtle bugs that only appear after hours of execution.
We tested edge cases. What happens when you get a malformed message? We don't crash. We don't skip. We report the error clearly so the operator knows something went wrong.
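In API terms, that policy looks something like the sketch below: validation returns a typed error carrying the offset and what was expected, so the operator gets a precise report instead of a crash or a silent skip. The type and function names are illustrative, and the per-message lengths are taken from the ITCH 5.0 spec (worth re-checking).

```rust
/// Hypothetical error type for malformed input: enough context to report
/// exactly where and why parsing stopped.
#[derive(Debug)]
enum ParseError {
    Truncated { offset: usize, needed: usize, available: usize },
    UnknownMessageType { offset: usize, kind: u8 },
}

/// Declared length for a handful of message types (abbreviated).
fn message_length(kind: u8) -> Option<usize> {
    match kind {
        b'A' => Some(36), // Add Order -- No MPID
        b'E' => Some(31), // Order Executed
        _ => None,
    }
}

/// Validate that a complete, known message starts at `offset`.
fn check_message(buffer: &[u8], offset: usize) -> Result<usize, ParseError> {
    let kind = *buffer.get(offset).ok_or(ParseError::Truncated {
        offset,
        needed: 1,
        available: 0,
    })?;
    let len = message_length(kind).ok_or(ParseError::UnknownMessageType { offset, kind })?;
    if buffer.len() < offset + len {
        return Err(ParseError::Truncated {
            offset,
            needed: len,
            available: buffer.len() - offset,
        });
    }
    Ok(len)
}
```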
What if the system gets overloaded? Our lock-free queues prevent deadlock. Messages don't pile up in unpredictable ways.
We wrote extensive test coverage. The test suite is comparable in size to the parser itself. That's intentional: when you're building infrastructure that people rely on, correctness beats cleverness.
The Results
All of this led to one number: 107 million messages per second.
On standard hardware (a 16-core CPU with 32 gigabytes of RAM), we process a full day of NASDAQ ITCH data in under three minutes. The same workload that took 40 hours now takes about 4 hours of total research workflow time.
More importantly, we're doing it without massive infrastructure investment. You don't need to rent out a data center. You don't need specialized hardware. You don't need a team of infrastructure engineers to keep it running.
A small trading firm with a single powerful server can now run backtests that previously required scaling out across multiple machines.
Why This Matters
The funny thing about infrastructure bottlenecks is how invisible they are until you remove them.
That quant researcher who spent his weekend waiting on parsing? He's not thinking about it as a parsing problem. He's thinking about it as research being slow. The bottleneck gets attributed to the strategy not being good enough, to data quality, to something else. The actual problem was hiding in the stack.
This is true across quantitative finance. Teams have built entire operational processes around parsing being slow. They batch jobs. They schedule backtests to run overnight. They keep more machines than they actually need just to handle peak loads.
What if all of that goes away?
What if a researcher can test a new idea in an hour instead of waiting for the next scheduled batch run? What if firms don't need to over-provision infrastructure? What if traders can actually react to market conditions instead of being limited by how fast they can process data?
That's what Lunyn enables.
We spent a year on this because we believed that trading infrastructure could be better. That the people building strategies shouldn't be limited by tools that haven't caught up with what modern hardware can do.
Parsing ITCH isn't the most glamorous problem in finance. But it matters. And now it's fast.
If you're managing trading research infrastructure, running backtests, or processing market data at scale, Lunyn can probably cut your processing times by an order of magnitude. The benchmark is reproducible. The code is available to test. The proof is in the numbers.
We're only at the beginning. There's more optimization to come. More architectures to support. More use cases to discover.
But for now, NASDAQ ITCH parsing at 107 million messages per second is real. It's reproducible. And it's available.
If you want to stop waiting and start researching, let's talk.