
A step-by-step guide to building a Thai multilingual subword tokenizer based on the BPE algorithm, trained on Thai and English datasets, using only Python

[Image by author]: The Thai tokenizer encodes and decodes Thai text to token IDs and vice versa.

The main job of a tokenizer is to convert the raw input text (in this case Thai, but it could be any foreign language) into numbers and pass them to the model's transformer, which produces its output as numbers. The tokenizer then converts these numbers back into text that the end user can understand. The high-level diagram below explains this flow.

[Image by author]: Diagram showing the role of the tokenizer in the input and output flow of an LLM.
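Before building the real thing, the toy snippet below mimics that round trip using Unicode code points as stand-in token ids, just to make the flow concrete; a real tokenizer produces learned subword ids instead.

# Toy stand-in for the tokenizer round trip: text -> numbers -> text.
prompt = "สวัสดี Hi"
input_ids = [ord(ch) for ch in prompt]        # "encode": every character becomes a number
print(input_ids)                              # the model only ever sees numbers like these
reply = "".join(chr(i) for i in input_ids)    # "decode": numbers back to readable text
print(reply)                                  # สวัสดี Hi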

Typically, many of us are only interested in learning how the transformer architecture of our model works internally, and we often overlook the details of important components such as the tokenizer. Understanding how the tokenizer works internally, and controlling its behavior properly, can improve the accuracy and performance of our model.

Just like the tokenizer, some of the most important components of an LLM implementation pipeline are data preprocessing, evaluation, guardrails/security, testing/monitoring, and so on. I highly recommend learning more about these topics. I only realized the importance of these components when I worked on a real-world production implementation of the underlying multilingual model, ThaiLLM.

Why do we need a Thai tokenizer, or a tokenizer for any other foreign language?

  • Suppose you have pre-trained a large-scale language model for several languages such as Thai, Hindi, Indonesian, Arabic, Chinese, and so on, using a typical English-based tokenizer. In that case, the model may not produce output that is acceptable for your particular domain or use case. Building your own tokenizer for your language of choice makes the model output more consistent and easier to understand.
  • Building your own tokenizer also gives you full control over how comprehensive and inclusive the vocabulary is. A comprehensive vocabulary lets the attention mechanism attend to and learn from more tokens within the limited context length of a sequence, which makes learning more consistent and ultimately helps improve model inference.

The good news is that once you have built the Thai tokenizer, you can easily build tokenizers for other languages. All of the building steps are the same; you just need to train on a dataset in the language of your choice.

Now that there is a good reason to build your own tokenizer, here are the steps to build a tokenizer for the Thai language.

  1. Build your own BPE algorithm
  2. Train the tokenizer
  3. Tokenizer encoding and decoding functions
  4. Load and test the tokenizer

Step 1: Build your own BPE (Byte Pair Encoding) algorithm

The BPE algorithm is used to build the tokenizers of many popular LLMs, such as Llama, GPT, and others. If your model is based on English, you can pick one of those LLM tokenizers. Since we are building a Thai tokenizer, the best option is to write our own BPE algorithm from scratch and use it to build our tokenizer. First, let's use the simple flow diagram below as a guide to understand how the BPE algorithm works, and then start building it accordingly.

[Image by author]: BPE flow diagram. Reference example from the Wikipedia page (https://en.wikipedia.org/wiki/Byte_pair_encoding)

The flow diagram examples are presented in English for ease of understanding.
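As a concrete warm-up, the short snippet below applies the same idea to the classic string from the Wikipedia page, repeatedly replacing the most frequent adjacent pair with a new symbol. It works at the character level for readability (our tokenizer will do the same thing on bytes), and ties between equally frequent pairs may be broken differently than in the Wikipedia walk-through.

# Character-level BPE warm-up on the Wikipedia example string "aaabdaaabac".
from collections import Counter

tokens = list("aaabdaaabac")
for new_symbol in ["Z", "Y", "X"]:                      # mint three new symbols
    pairs = Counter(zip(tokens, tokens[1:]))            # count adjacent pairs
    (a, b), _ = pairs.most_common(1)[0]                 # pick the most frequent pair
    merged, i = [], 0
    while i < len(tokens):
        if i < len(tokens) - 1 and (tokens[i], tokens[i + 1]) == (a, b):
            merged.append(new_symbol)                   # replace the pair with the new symbol
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    tokens = merged
    print(f"merge {(a, b)} -> {new_symbol}: {''.join(tokens)}")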

Let's write the code that implements the BPE algorithm for the Thai tokenizer.

# A simple practice example to get familiar with UTF-8 encoding, which converts strings to bytes.
text = "How are you คุณเป็นอย่างไร" # Text string in both English and Thai
text_bytes = text.encode("utf-8")
print(f"Text in bytes: {text_bytes}")

text_list = list(text_bytes) # Converts text bytes to a list of integers
print(f"Text list in integers: {text_list}")

# As I don't want to reinvent the wheel, I will be referencing most of the code blocks from Andrej Karpathy's GitHub (https://github.com/karpathy/minbpe?tab=readme-ov-file).
# However, I'll be modifying the code blocks specific to building our Thai language tokenizer, and explaining the code so that you understand how each block works and can easily adapt it for your own use case later.

# This module provides access to the Unicode Character Database (UCD), which defines character properties for all Unicode characters.
import unicodedata

# This function returns a dictionary with consecutive pairs of integers and their counts in the given list of integers.
def get_stats(ids, stats=None):

    stats = {} if stats is None else stats
    # The zip function lets us iterate over consecutive items from the two given lists.
    for pair in zip(ids, ids[1:]):
        # If a pair already exists in the stats dictionary, add 1 to its count; otherwise start it at 1.
        stats[pair] = stats.get(pair, 0) + 1
    return stats

# Once we have found the consecutive pairs of integers, we replace those pairs with new integer tokens.
def merge(ids, pair, idx):
    newids = []
    i = 0
    # As we'll be merging a pair of ids, the list needs to contain at least 2 ids.
    while i < len(ids):
        # If the current id and the next id (i+1) match the given pair, and the current id is not the last one, replace the 2 consecutive ids with the given index value.
        if ids[i] == pair[0] and i < len(ids) - 1 and ids[i+1] == pair[1]:
            newids.append(idx)
            i += 2 # If the pair is matched, the next iteration starts 2 positions ahead in the list.
        else:
            newids.append(ids[i])
            i += 1 # Since the current id pair didn't match, start the next iteration from the next position in the list.
    # Return the merged ids list
    return newids

# This function checks each character using 'unicodedata.category', which returns a category starting with "C" for control characters; those need to be replaced with a readable representation.
def replace_control_characters(s: str) -> str:
    chars = []
    for ch in s:
        # If the character is not a control character (its category doesn't start with "C"), append it to the chars list.
        if unicodedata.category(ch)[0] != "C":
            chars.append(ch)
        # If the character is a control character (its category starts with "C"), replace it with its readable escape sequence and append that to the chars list.
        else:
            chars.append(f"\\u{ord(ch):04x}")
    return "".join(chars)

# Some tokens, such as control characters like escape characters, can't be decoded into valid strings.
# Hence these need to be replaced with a readable character such as �
def render_token(t: bytes) -> str:
    s = t.decode('utf-8', errors='replace')
    s = replace_control_characters(s)
    return s

The two functions, get_stats and merge, defined in the code block above form the core of the BPE algorithm for the Thai tokenizer. Before moving on to training, the short snippet below gives a minimal sanity check of both helpers on a toy list of ids.
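The ids here are arbitrary; the pair counts and the merged output follow directly from the two functions above.

# Toy example: count consecutive pairs, then merge the most frequent pair into a new token id 256.
ids = [1, 2, 3, 1, 2]
stats = get_stats(ids)
print(stats)                       # {(1, 2): 2, (2, 3): 1, (3, 1): 1}
top_pair = max(stats, key=stats.get)
print(merge(ids, top_pair, 256))   # [256, 3, 256]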

Step 2: Train the tokenizer

Training a tokenizer involves generating a vocabulary, which is a database of unique tokens (words and subwords) along with a unique index number assigned to each token. Training the Thai tokenizer requires a large dataset: the Thai Wiki dataset from Hugging Face. Just as training an LLM requires a large amount of data, training a tokenizer does too. Although not required, you can use the same dataset to train both the LLM and the tokenizer. For a multilingual LLM, it is advisable to use English and Thai datasets in roughly a 2:1 ratio; this is the standard approach followed by many practitioners.
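If you want to pull a Thai Wikipedia dump yourself, a sketch like the one below works with the Hugging Face datasets library; the dataset id, config name, and output filename are assumptions here, so substitute whatever Thai corpus you actually use.

# A sketch for fetching a Thai Wikipedia subset and dumping it to a plain text file.
# The dataset id/config below are assumptions; swap in the Thai dataset you actually use.
from datasets import load_dataset

ds = load_dataset("wikimedia/wikipedia", "20231101.th", split="train[:1%]")
with open("thai_wiki_small.txt", "w", encoding="utf-8") as f:
    for row in ds:
        f.write(row["text"] + "\n")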

Let's start writing the training code.

# Import the regex module, which supports the \p{L} and \p{N} Unicode classes used in the split pattern
import regex as re

# Create a Thai Tokenizer class.
class ThaiTokenizer():

    def __init__(self):

        # Byte pairing should only happen within related words or sentences that provide proper context. Pairing across unrelated words or sentences may give undesirable output.
        # To prevent this behavior, we apply the Llama 3 regular expression pattern to split the text into meaningful chunks before running the byte-pair algorithm.
        self.pattern = r"(?i:'s|'t|'re|'ve|'m|'ll|'d)|[^\r\n\p{L}\p{N}]?\p{L}+|\p{N}{1,3}| ?[^\s\p{L}\p{N}]+[\r\n]*|\s*[\r\n]+|\s+(?!\S)|\s+"
        self.compiled_pattern = re.compile(self.pattern)

        # Special tokens are used to provide coherence in the sequence while training.
        # Special tokens are assigned unique index numbers and stored in the vocabulary.
        # The Llama 3-style token strings below are an assumed, illustrative set; substitute your own if needed.
        self.special_tokens = {
            '<|begin_of_text|>': 1101,
            '<|end_of_text|>': 1102,
            '<|start_header_id|>': 1103,
            '<|end_header_id|>': 1104,
            '<|eot_id|>': 1105,
        }

        # Initialize merges with an empty dictionary
        self.merges = {}

        # Initialize the vocab dictionary by calling the function _build_vocab, which is defined later in this class.
        self.vocab = self._build_vocab()

    # Tokenizer training function
    def train(self, text, vocab_size):

        # Make sure the vocab size is at least 256, since ids 0-255 are reserved for the raw byte values.
        assert vocab_size >= 256
        # Total number of merges into the vocabulary.
        num_merges = vocab_size - 256

        # The first step is to split the text up into text chunks using the pattern defined above.
        text_chunks = re.findall(self.compiled_pattern, text)

        # Each text chunk is utf-8 encoded to bytes and then converted into a list of integers.
        ids = [list(ch.encode("utf-8")) for ch in text_chunks]

        # Iteratively merge the most common pairs to create new tokens
        merges = {} # (int, int) -> int
        vocab = {idx: bytes([idx]) for idx in range(256)} # idx -> bytes

        # Until the total num_merges is reached, find the most common pair of consecutive ids in the ids list and merge them to create a new token
        for i in range(num_merges):
            # Count the number of times every consecutive pair appears
            stats = {}
            for chunk_ids in ids:
                # Passing in stats will update it in place, adding up counts
                get_stats(chunk_ids, stats)
            # Find the pair with the highest count
            pair = max(stats, key=stats.get)
            # Mint a new token: assign it the next available id
            idx = 256 + i
            # Replace all occurrences of pair in ids with idx
            ids = [merge(chunk_ids, pair, idx) for chunk_ids in ids]
            # Save the merge
            merges[pair] = idx
            vocab[idx] = vocab[pair[0]] + vocab[pair[1]]

        # Save class variables to be used later during tokenizer encode and decode
        self.merges = merges
        self.vocab = vocab

    # Function to return a vocab dictionary that combines the base bytes, merges and special tokens
    def _build_vocab(self):
        # The first 256 ids map directly to their raw byte values.
        vocab = {idx: bytes([idx]) for idx in range(256)}

        # Iterate through the merges dictionary and add entries to the vocab dictionary
        for (p0, p1), idx in self.merges.items():
            vocab[idx] = vocab[p0] + vocab[p1]

        # Iterate through the special token dictionary and add entries to the vocab dictionary
        for special, idx in self.special_tokens.items():
            vocab[idx] = special.encode("utf-8")

        return vocab

    # After training is complete, use the save function to save the model file and the vocab file.
    # The model file will be used to load the tokenizer for further use in the LLM.
    # The vocab file is only for the purpose of human verification.
    def save(self, file_prefix):
        # Writing to the model file
        model_file = file_prefix + ".model" # model file name

        # Model write begins
        with open(model_file, 'w') as f:
            f.write("thai tokenizer v1.0\n") # write the tokenizer version
            f.write(f"{self.pattern}\n") # write the pattern used in the tokenizer
            f.write(f"{len(self.special_tokens)}\n") # write the number of special tokens

            # Write each special token in the format below
            for tokens, idx in self.special_tokens.items():
                f.write(f"{tokens} {idx}\n")

            # Write only the keys from the merges dict
            for idx1, idx2 in self.merges:
                f.write(f"{idx1} {idx2}\n")

        # Writing to the vocab file
        vocab_file = file_prefix + ".vocab" # vocab file name

        # Swap the keys and values of the merges dict and store them in inverted_merges
        inverted_merges = {idx: pair for pair, idx in self.merges.items()}
        # Vocab write begins
        with open(vocab_file, "w", encoding="utf-8") as f:
            for idx, token in self.vocab.items():
                # The render_token function processes tokens and prevents distorted bytes by replacing them with a readable character
                s = render_token(token)
                # If the index is present in the merges dict, find its child indexes, convert their corresponding bytes in the vocab dict and write the characters
                if idx in inverted_merges:
                    idx0, idx1 = inverted_merges[idx]
                    s0 = render_token(self.vocab[idx0])
                    s1 = render_token(self.vocab[idx1])
                    f.write(f"[{s0}][{s1}] -> [{s}] {idx}\n")
                # If the index is not present in the merges dict, just write its index and the corresponding string
                else:
                    f.write(f"[{s}] {idx}\n")

    # Function to load the tokenizer model.
    # This function is invoked only after training is complete and the tokenizer model file has been saved.
    def load(self, model_file):

        merges = {} # Initialize merges with an empty dict
        special_tokens = {} # Initialize special_tokens with an empty dict
        idx = 256 # The range (0, 255) is already reserved in the vocab, so the next index starts from 256 onwards.

        # Read the model file
        with open(model_file, 'r', encoding="utf-8") as f:

            version = f.readline().strip() # Read the tokenizer version as written during model file saving
            self.pattern = f.readline().strip() # Read the pattern used in the tokenizer
            num_special = int(f.readline().strip()) # Read the number of special tokens

            # Read all the special tokens and store them in the special_tokens dict defined earlier
            for _ in range(num_special):
                special, special_idx = f.readline().strip().split()
                special_tokens[special] = int(special_idx)

            # Read all the merge pairs from the file. Make each one a key pair and store it in the merges dictionary defined earlier.
            # The value of each key pair starts at idx (256) as defined above and keeps increasing by 1.
            for line in f:
                idx1, idx2 = map(int, line.split())
                merges[(idx1, idx2)] = idx
                idx += 1

        self.merges = merges
        self.special_tokens = special_tokens

        # Create the final vocabulary dictionary by combining merges, special tokens and the base vocab (0-255). The _build_vocab function does just that.
        self.vocab = self._build_vocab()
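To see what the saved model file actually contains, a quick peek like the one below can help once you have trained and saved the tokenizer (with the file prefix used later in Step 4). The layout follows the write order in save(): a version line, the split pattern, the special-token count, the special tokens, and then the merge pairs, whose values depend entirely on your training data.

# Peek at the first lines of the saved model file (run after training and saving in Step 4).
with open("./models/thaitokenizer.model", "r", encoding="utf-8") as f:
    head = f.read().splitlines()
print(head[0])  # version string, e.g. "thai tokenizer v1.0"
print(head[1])  # the regex split pattern
print(head[2])  # number of special tokens that follow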

Step 3: Tokenizer encoding and decoding functions

  • Tokenizer encode: The tokenizer's encode function looks up the vocabulary and converts the given input text or prompt into a list of integer IDs, which are then sent to the transformer block.
  • Tokenizer decode: The tokenizer's decode function looks up the vocabulary and converts the list of IDs produced by the transformer's classifier block back into output text.

To make this easier to understand, let's take a look at the diagram below.

[Image by author]: Encoding and decoding functions of the Thai tokenizer
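A quick way to convince yourself both functions are wired up correctly, once a trained ThaiTokenizer instance is available (as built in Step 4), is the round-trip property: decoding the encoded ids should reproduce the original string.

# Round-trip sanity check: encode followed by decode should return the input text.
sample = "สวัสดี Hello"
assert tokenizer.decode(tokenizer.encode(sample)) == sample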

Let's write the code that implements the encode and decode functionality of the tokenizer.

    # (These two methods are a continuation of the ThaiTokenizer class defined in Step 2.)
    # The tokenizer encode function takes text as a string and returns a list of integer ids
    def encode(self, text):

        # Define a pattern to identify special tokens present in the text
        special_pattern = "(" + "|".join(re.escape(k) for k in self.special_tokens) + ")"
        # Split special tokens (if present) from the rest of the text
        special_chunks = re.split(special_pattern, text)
        # Initialize an empty ids list
        ids = []

        # Loop through each of the parts in the special chunks list.
        for part in special_chunks:
            # If this part of the text is a special token, get its idx from the special token dictionary and append it to the ids list.
            if part in self.special_tokens:
                ids.append(self.special_tokens[part])
            # If this part of the text is not a special token
            else:
                # Split the part into multiple chunks using the pattern we defined earlier.
                text_chunks = re.findall(self.compiled_pattern, part)

                # All text chunks are encoded separately, then the results are joined
                for chunk in text_chunks:
                    chunk_bytes = chunk.encode("utf-8") # Encode text to bytes
                    chunk_ids = list(chunk_bytes) # Convert bytes to a list of integers

                    while len(chunk_ids) >= 2: # the chunk ids list must contain at least 2 ids to form a byte pair
                        # Count the number of times every consecutive pair appears
                        stats = get_stats(chunk_ids)
                        # Some id pairs may themselves be merged into other ids in the merges dictionary. Hence we find the pair with the lowest merge index to ensure we cover all byte pairs in the merges dict.
                        pair = min(stats, key=lambda p: self.merges.get(p, float("inf")))

                        # Break the loop if the pair is not present in the merges dictionary
                        if pair not in self.merges:
                            break
                        # Find the idx of the pair in the merges dictionary
                        idx = self.merges[pair]
                        # Replace the occurrences of the pair in the chunk ids list with this idx and continue
                        chunk_ids = merge(chunk_ids, pair, idx)

                    ids.extend(chunk_ids)
        return ids

    # The tokenizer decode function takes a list of integer ids and returns a string
    def decode(self, ids):

        # Initialize an empty byte list
        part_bytes = []
        # Swap the keys and values of the special_tokens dict and store them in inverse_special_tokens
        inverse_special_tokens = {v: k for k, v in self.special_tokens.items()}

        # Loop through each idx in the ids list
        for idx in ids:
            # If the idx is found in the vocab dict, get its bytes and append them to the part_bytes list
            if idx in self.vocab:
                part_bytes.append(self.vocab[idx])
            # If the idx is found in the inverse_special_tokens dict, get the corresponding token string, convert it to bytes using utf-8 encoding and append it to the part_bytes list
            elif idx in inverse_special_tokens:
                part_bytes.append(inverse_special_tokens[idx].encode("utf-8"))
            # If the idx is found in neither the vocab nor the special token dict, raise an invalid token error
            else:
                raise ValueError(f"invalid token id: {idx}")

        # Join all the individual bytes from the part_bytes list
        text_bytes = b"".join(part_bytes)

        # Convert the bytes to a text string using the utf-8 decode function. Make sure to use errors="replace" so malformed bytes are replaced with a readable character such as �.
        text = text_bytes.decode("utf-8", errors="replace")
        return text

Step 4: Load and test the tokenizer

Finally, here comes the best part of this article: in this section, we'll perform two interesting tasks.

  • First, we train the tokenizer using the Hugging Face Thai Wiki dataset. We chose a small dataset (2.2 MB) to speed up training. However, in a real implementation, you should choose a much larger dataset to get better results. After training is complete, we save the model.
  • Next, we load the saved tokenizer model and run tests on the tokenizer's encode and decode functions.

Let's get started.

# Train the tokenizer

import time # To calculate the duration of training
# Load the raw training text data (thai_wiki dataset) from Hugging Face. thai_wiki_small.txt: https://github.com/tamangmilan/thai_tokenizer
texts = open("/content/thai_wiki_small.txt", "r", encoding="utf-8").read()
texts = texts.strip()
# Define the vocab size
vocab_size = 512
# Initialize a tokenizer model class
tokenizer = ThaiTokenizer()
# Start training the tokenizer
start_time = time.time()
tokenizer.train(texts, vocab_size)
end_time = time.time()
# Save the tokenizer: you can change the path and filename.
tokenizer.save("./models/thaitokenizer")
print(f"Total time to complete tokenizer training: {end_time-start_time:.2f} seconds")

# Output: Total time to complete tokenizer training: 186.11 seconds (3m 6s) [Note: Training takes longer for a bigger vocab_size and less time for a smaller one]

# Test the tokenizer

# Initialize a tokenizer model class
tokenizer = ThaiTokenizer()
# Load the tokenizer model. This model was saved during training.
tokenizer.load("./models/thaitokenizer.model")
# Invoke and verify the tokenizer encode and decode functions for English text
eng_texts = "When society developed in numerous lands"
print(f"English Text: {eng_texts}")
encoded_ids = tokenizer.encode(eng_texts)
print(f"Encoded Ids: {encoded_ids}")
decoded_texts = tokenizer.decode(encoded_ids)
print(f"Decoded Texts: {decoded_texts}\n")

# Invoke and verify the tokenizer encode and decode functions for Thai text
thai_texts = "เมื่อสังคมมีวิวัฒนาการขึ้นในดินแดนต่าง"
print(f"Thai Text: {thai_texts}")
thai_encoded_ids = tokenizer.encode(thai_texts)
print(f"Encoded Ids: {thai_encoded_ids}")
thai_decoded_texts = tokenizer.decode(thai_encoded_ids)
print(f"Decoded Texts: {thai_decoded_texts}")

[Thai Tokenizer]: Thai and English text encoding and decoding output.

Good! Our Thai tokenizer can now successfully and accurately encode and decode both Thai and English text.

Did you notice that the encoded IDs for the English text are longer than the Thai encoded IDs? This is because we trained the tokenizer only on the Thai dataset, so it can only build a comprehensive vocabulary for Thai. Because we didn't train on an English dataset, the tokenizer has to encode English almost at the character level, which results in longer encoded IDs. As mentioned before, for a multilingual LLM we should train on both English and Thai datasets in a 2:1 ratio; this gives balanced, high-quality results.
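You can quantify this difference directly with the strings from the test above: fewer ids per character means the vocabulary compresses that language better.

# Rough compression comparison: token ids per character for Thai vs. English.
print(len(tokenizer.encode(thai_texts)) / len(thai_texts))  # ids per Thai character (benefits from the learned Thai merges)
print(len(tokenizer.encode(eng_texts)) / len(eng_texts))    # ids per English character (mostly byte-level, since English wasn't in the training data)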

That's it! You have now written your own Thai tokenizer from scratch using only Python. This is quite handy, as it allows you to easily build a tokenizer for any foreign language, which is a great help when implementing a multilingual LLM.

Thanks for reading!

Link to Google Colab notebook

References

[1] Andrej Karpathy, GitHub: karpathy/minbpe (https://github.com/karpathy/minbpe)
