RLE masks are run-length encoded binary masks. Instead of storing every pixel in a segmentation mask, they store the lengths of consecutive runs of background and foreground pixels.

This is useful because masks often contain large contiguous regions, so the compressed representation is usually much smaller than the full H x W array.

Idea

A binary mask might look like this when flattened into a 1D array:

0 0 0 1 1 1 1 0 0 1

One way to encode it is as runs:

3 zeros, 4 ones, 2 zeros, 1 one

Keeping only the run lengths, and counting zeros first, the RLE representation becomes:

[3, 4, 2, 1]
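
The encoding step above can be sketched in a few lines of Python. This is a minimal illustration, not any particular library's implementation: the function name rle_encode is hypothetical, and it assumes the input contains only 0s and 1s and that the first count always refers to zeros.

```python
def rle_encode(flat):
    """Encode a flat sequence of 0/1 values as alternating run lengths.

    Assumed convention: the first count always refers to zeros, so a
    sequence that starts with 1 gets a leading run of length 0.
    """
    counts = []
    current = 0  # value whose run is being counted; start with background
    run = 0
    for value in flat:
        if value == current:
            run += 1
        else:
            # The run ended: record its length and start counting the other value.
            counts.append(run)
            current = value
            run = 1
    counts.append(run)  # record the final run
    return counts


print(rle_encode([0, 0, 0, 1, 1, 1, 1, 0, 0, 1]))  # [3, 4, 2, 1]
print(rle_encode([1, 1, 0]))                       # [0, 2, 1]
```

The leading 0 in the second result is what keeps the "zeros first" convention unambiguous when a mask starts with foreground.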

Depending on the dataset or library, the exact convention changes:

  • some formats store alternating run lengths
  • some store (start, length) pairs
  • some flatten row-major order
  • some flatten column-major order
  • some use 1-indexed positions instead of 0-indexed

Those details matter a lot. Two systems can both say “RLE mask” while meaning slightly different encodings.
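
To make the difference concrete, here is a small sketch that converts alternating run lengths into (start, length) pairs for the foreground runs only. The function name counts_to_pairs and the index_base parameter are made up for this illustration; the sketch assumes the counts start with a run of zeros.

```python
def counts_to_pairs(counts, index_base=0):
    """Convert alternating run lengths (zeros first) into (start, length)
    pairs that describe only the foreground runs.

    index_base=0 gives 0-indexed starts, index_base=1 gives 1-indexed starts.
    """
    pairs = []
    position = 0  # 0-indexed position where the current run begins
    for i, length in enumerate(counts):
        is_foreground = i % 2 == 1  # runs at odd positions in the list are ones
        if is_foreground and length > 0:
            pairs.append((position + index_base, length))
        position += length
    return pairs


counts = [3, 4, 2, 1]
print(counts_to_pairs(counts, index_base=0))  # [(3, 4), (9, 1)]
print(counts_to_pairs(counts, index_base=1))  # [(4, 4), (10, 1)]
```

The same mask yields three different-looking encodings here, which is why the convention has to be known before decoding.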

Why It Is Used

RLE masks are common in computer vision because they:

  • reduce storage size for sparse or contiguous masks
  • are easy to serialize in JSON or dataset annotations
  • are widely used in segmentation competitions and labeling tools

You will often see them in:

  • COCO-style annotations
  • medical image segmentation datasets
  • instance segmentation pipelines

Example

Suppose a 4 x 4 mask is:

0 0 1 1
0 0 1 1
0 0 0 0
1 1 0 0

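The encoded result depends on the flattening order. Flattened row by row, the mask gives the alternating run lengths [2, 2, 2, 2, 4, 2, 2] (zeros first); flattened column by column, it gives [3, 1, 3, 3, 2, 2, 2]. Mixing up the two orders is one of the main sources of bugs when decoding RLE masks.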
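
The effect of the flattening order on this 4 x 4 mask can be checked with a short sketch. It uses plain Python lists rather than an array library, and the helper names flatten and run_lengths are hypothetical. Run lengths are assumed to alternate starting with zeros.

```python
from itertools import groupby

mask = [
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 0, 0],
    [1, 1, 0, 0],
]


def flatten(mask, order="row"):
    """Flatten a 2D list row by row ("row") or column by column ("column")."""
    if order == "row":
        return [value for row in mask for value in row]
    if order == "column":
        return [row[c] for c in range(len(mask[0])) for row in mask]
    raise ValueError(f"unknown order: {order!r}")


def run_lengths(flat):
    """Alternating run lengths, with the first count referring to zeros."""
    counts = [len(list(group)) for _, group in groupby(flat)]
    if flat and flat[0] == 1:
        counts.insert(0, 0)  # the mask starts with foreground: empty zero run
    return counts


print(run_lengths(flatten(mask, "row")))     # [2, 2, 2, 2, 4, 2, 2]
print(run_lengths(flatten(mask, "column")))  # [3, 1, 3, 3, 2, 2, 2]
```

Both lists sum to 16, the number of pixels, but they are not interchangeable.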

Practical Notes

  • Always check the expected flattening order before decoding.
  • Always check whether the encoding starts with background or foreground.
  • If the mask looks rotated, shifted, or scrambled, the flattening convention is probably wrong.
  • RLE is great for binary masks, but it compresses poorly when pixel values change frequently, as in noisy masks or natural photographs.
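
The checks in the notes above can be built into a decoder. The following is a sketch under stated assumptions, not a drop-in replacement for any library's decoder: rle_decode, order, and first_value are names invented for this example, and the input is assumed to be a list of alternating run lengths.

```python
def rle_decode(counts, height, width, order="row", first_value=0):
    """Decode alternating run lengths into a height x width list of lists.

    order: "row" if the mask was flattened row by row, "column" if it was
           flattened column by column.
    first_value: the pixel value (0 or 1) that the first count refers to.
    """
    if sum(counts) != height * width:
        raise ValueError("run lengths do not add up to height * width")

    # Expand the runs back into a flat list of pixel values.
    flat = []
    value = first_value
    for length in counts:
        flat.extend([value] * length)
        value = 1 - value  # runs alternate between 0 and 1

    if order == "row":
        return [flat[r * width:(r + 1) * width] for r in range(height)]
    if order == "column":
        # In column-major order, pixel (r, c) sits at index c * height + r.
        return [[flat[c * height + r] for c in range(width)]
                for r in range(height)]
    raise ValueError(f"unknown order: {order!r}")


column_counts = [3, 1, 3, 3, 2, 2, 2]  # column-major encoding, zeros first
right = rle_decode(column_counts, 4, 4, order="column")
wrong = rle_decode(column_counts, 4, 4, order="row")
print(right)  # [[0, 0, 1, 1], [0, 0, 1, 1], [0, 0, 0, 0], [1, 1, 0, 0]]
print(wrong)  # [[0, 0, 0, 1], [0, 0, 0, 1], [1, 1, 0, 0], [1, 1, 0, 0]]
```

Decoding with the wrong order returns the transpose of the original mask here. On non-square masks the result usually looks scrambled instead.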

In short, an RLE mask is a compact representation of a segmentation mask that stores runs of identical pixels rather than every individual pixel.