Tokenizer Logs

Format: month/day/year

12/19/2024

I am going to write my tokenizer application in C++.

I learned about Byte Pair Encoding (BPE) today, which helped me better understand how tokens work.

I also learned a lot about Unicode. I hadn’t explored it much in the past, but it’s quite useful for understanding how LLMs process data.
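
A minimal sketch of one Unicode detail that matters in C++: a std::string stores UTF-8 bytes, not characters, so its size() can be larger than the number of code points. The helper name count_code_points is hypothetical, and the sketch assumes the input is already valid UTF-8 (it does no validation).

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: counts Unicode code points in a UTF-8 string.
// Assumption: the input is valid UTF-8; malformed input is not detected.
// Every code point starts with exactly one byte that is NOT a continuation
// byte, and continuation bytes always have the bit pattern 10xxxxxx.
std::size_t count_code_points(const std::string& utf8) {
    std::size_t count = 0;
    for (unsigned char byte : utf8) {
        if ((byte & 0xC0) != 0x80) {  // skip continuation bytes
            ++count;
        }
    }
    return count;
}
```

For example, "héllo" written with the escape "\xC3\xA9" for é occupies 6 bytes but holds 5 code points, because é is encoded as two bytes in UTF-8.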

I will start implementing the algorithms tomorrow.

My goal with this implementation is to write good OOP code in C++ and to learn how to work with Unicode in C++.

Questions I have: If Byte Pair Encoding splits based on token frequency, why are certain big tokens split up into smaller tokens?
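
A rough sketch of a single BPE training step may help frame this question: count every adjacent pair of tokens, pick the most frequent pair, and replace each occurrence with one merged token. The names Token, Pair, most_frequent_pair, and merge_pair are hypothetical, tokens are plain strings rather than bytes or IDs, and ties are broken by std::map ordering. This is an illustration under those assumptions, not a full tokenizer.

```cpp
#include <cstddef>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Hypothetical aliases for this sketch: a token is a string and a
// candidate merge is an ordered pair of adjacent tokens.
using Token = std::string;
using Pair = std::pair<Token, Token>;

// Returns the most frequent adjacent pair in the sequence.
// Ties go to the pair that sorts first, since std::map iterates in key
// order and only a strictly larger count replaces the current best.
// Returns a pair of empty strings if there are fewer than two tokens.
Pair most_frequent_pair(const std::vector<Token>& tokens) {
    std::map<Pair, std::size_t> counts;
    for (std::size_t i = 0; i + 1 < tokens.size(); ++i) {
        ++counts[Pair{tokens[i], tokens[i + 1]}];
    }
    Pair best;
    std::size_t best_count = 0;
    for (const auto& entry : counts) {
        if (entry.second > best_count) {
            best = entry.first;
            best_count = entry.second;
        }
    }
    return best;
}

// Replaces each non-overlapping occurrence of `target`, scanning left to
// right, with a single token formed by concatenating its two halves.
std::vector<Token> merge_pair(const std::vector<Token>& tokens,
                              const Pair& target) {
    std::vector<Token> merged;
    std::size_t i = 0;
    while (i < tokens.size()) {
        const bool match = i + 1 < tokens.size() &&
                           tokens[i] == target.first &&
                           tokens[i + 1] == target.second;
        if (match) {
            merged.push_back(target.first + target.second);
            i += 2;  // consume both halves of the pair
        } else {
            merged.push_back(tokens[i]);
            ++i;
        }
    }
    return merged;
}
```

Running one step on the characters of "aaabdaaabac" picks the pair ("a", "a") and shrinks the sequence from 11 tokens to 9. The sketch also shows that a larger token only exists if its merge was actually learned: a long string whose pieces were never the most frequent pair during training stays split into the smaller tokens that were.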