Newline Delimited JSON (NDJSON) is a format where each line is usually a completed JSON value and newline separates records
NDJSON = JSONL (JSON Lines)
Pros of NDJSON:
- You can process without having to download the whole file
- You don’t need the full dataset in memory
- You can use a plain JSON parser repeatedly on each line
There’s also a really cool concept of Parallel Splitting
You can have multiple workers attend to specific byte ranges (eg. 0 -> 1GB, 1GB -> 2GB)
It’s likely those boundaries cut through middle of a line
As a result each worker:
- Seek to its start offset
- If it’s not at the beginning of a line, skip forward to the next newline
- From there, read lines normally, parsing each as JSON
- Stop when you reach the end offset (and often “finish the last line” safely)