Format: month/day/year
12/19/2024
I am going to write my tokenizer application in C++.
I learned about Byte Pair Encoding today, which helped me better understand how tokens work.
I also learned a lot about Unicode. I hadn’t explored it much in the past, but it’s quite useful for understanding how LLMs process data.
I will start implementing the algorithms tomorrow.
My goal with this implementation is to write good OOP code in C++ and to learn how to work with Unicode in C++.
Questions I have: If Byte Pair Encoding splits based on token frequency, why are certain big tokens split up into smaller tokens?
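To make that question concrete, here is a minimal sketch of a single Byte Pair Encoding training step: count adjacent token pairs, pick the most frequent one, and merge it everywhere. The names count_pairs, most_frequent_pair, and merge_pair are hypothetical, and the sketch assumes tokens are stored as std::string for readability; a byte-level tokenizer would start from raw UTF-8 bytes instead. Breaking ties by map order is also just an assumption made here.

```cpp
#include <cstddef>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Hypothetical aliases for this sketch only.
using Token = std::string;
using Pair = std::pair<Token, Token>;

// Count how often each adjacent pair of tokens occurs in the sequence.
std::map<Pair, int> count_pairs(const std::vector<Token>& tokens) {
    std::map<Pair, int> counts;
    for (std::size_t i = 0; i + 1 < tokens.size(); ++i) {
        ++counts[{tokens[i], tokens[i + 1]}];
    }
    return counts;
}

// Return the pair with the highest count. On a tie, the pair that sorts
// first in the map wins. Returns an empty pair if counts is empty.
Pair most_frequent_pair(const std::map<Pair, int>& counts) {
    Pair best;
    int best_count = 0;
    for (const auto& entry : counts) {
        if (entry.second > best_count) {
            best = entry.first;
            best_count = entry.second;
        }
    }
    return best;
}

// Replace every non-overlapping occurrence of `target`, scanning left to
// right, with a single token made by concatenating its two halves.
std::vector<Token> merge_pair(const std::vector<Token>& tokens,
                              const Pair& target) {
    std::vector<Token> out;
    for (std::size_t i = 0; i < tokens.size(); ++i) {
        if (i + 1 < tokens.size() && tokens[i] == target.first &&
            tokens[i + 1] == target.second) {
            out.push_back(target.first + target.second);
            ++i;  // skip the second half of the merged pair
        } else {
            out.push_back(tokens[i]);
        }
    }
    return out;
}
```

Running one step on the tokens a b a b c merges the pair (a, b), which occurs twice, and gives ab ab c. This hints at one likely answer to the question: merges are learned from pair counts in the training text, and training stops at a fixed vocabulary size, so a long word only ends up as one token if every merge needed to build it was frequent enough to be learned before that limit.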