Newline Delimited JSON (NDJSON) is a format where each line is usually a completed JSON value and newline separates records

NDJSON = JSONL (JSON Lines)

Pros of NDJSON:

  • You can process without having to download the whole file
  • You don’t need the full dataset in memory
  • You can use a plain JSON parser repeatedly on each line

There’s also a really cool concept of Parallel Splitting

You can have multiple workers attend to specific byte ranges (eg. 0 -> 1GB, 1GB -> 2GB)

It’s likely those boundaries cut through middle of a line

As a result each worker:

  • Seek to its start offset
  • If it’s not at the beginning of a line, skip forward to the next newline
  • From there, read lines normally, parsing each as JSON
  • Stop when you reach the end offset (and often “finish the last line” safely)