As you might imagine, this post is closely related to my previous post: Force a language model not to use the letter ‘e’. In that post, we forbade the language model from generating any token containing the letter ‘e’. While that constraint may have some literary interest, it does not hold much practical value. Here we want to forbid the language model from generating any token that contains Chinese characters. This has some practical interest, especially for Qwen models, which tend to generate Chinese characters when it is not desired.
As in my previous post, we first find all the tokens containing Chinese characters and then forbid them with a logits processor.
We start by downloading a model and its tokenizer.
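A minimal sketch of this step, assuming a small Qwen model such as Qwen/Qwen3-0.6B (the exact checkpoint is my choice; any Hugging Face causal language model and its tokenizer would work the same way):

from transformers import AutoModelForCausalLM, AutoTokenizer

# Assumed checkpoint; swap in whichever model you want to constrain.
model_name = "Qwen/Qwen3-0.6B"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)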
We now create a function to check whether a single character is Chinese and verify that it works.

import unicodedata

def is_chinese_char(char):
    """Check if a single character is Chinese"""
    return "CJK" in unicodedata.name(char, "")

is_chinese_char("国"), is_chinese_char("E")
(True, False)
We then create a function to check whether a given token contains any Chinese characters and verify that it works.

def has_chinese_chars(token):
    """Check if token contains any Chinese characters"""
    return any(is_chinese_char(char) for char in token)

has_chinese_chars("中国"), has_chinese_chars("Country 国"), has_chinese_chars("China")
(True, True, False)
The list of tokens containing Chinese characters is given by:
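One way to write it, iterating over the whole vocabulary and decoding each token (the variable name tokens_containing_chinese_characters is an illustrative choice on my part):

tokens_containing_chinese_characters = [
    token_id
    for token_id in range(tokenizer.vocab_size)
    if has_chinese_chars(tokenizer.decode(token_id))
]

len(tokens_containing_chinese_characters)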
There are 26,071 tokens containing Chinese characters which corresponds to 17.2% of all the tokens.
We probably also want to remove tokens containing the Unicode replacement character, which is used to represent unknown, unrecognized, or invalid characters. We follow the same procedure.

def is_replacement_char(char):
    """Check if a single character is a replacement character"""
    return 'REPLACEMENT CHARACTER' == unicodedata.name(char, "")

def has_replacement_chars(token):
    return any(is_replacement_char(char) for char in token)

tokens_containing_replacement_characters = [
    token_id
    for token_id in range(tokenizer.vocab_size)
    if has_replacement_chars(tokenizer.decode(token_id))
]
There are 1,457 tokens containing replacement characters which corresponds to 1% of all the tokens.
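Both lists can be merged into a single collection of forbidden token IDs; one way to do it (the name forbidden_token_ids is an illustrative choice):

forbidden_token_ids = sorted(
    set(tokens_containing_chinese_characters)
    | set(tokens_containing_replacement_characters)
)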
We are ready to create our logits processor. The logits processor receives the list of forbidden token IDs and sets their raw scores to negative infinity.
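A minimal sketch of such a processor, built on the transformers LogitsProcessor interface (the class name NoForbiddenTokensLogitsProcessor is an illustrative choice):

from transformers import LogitsProcessor, LogitsProcessorList

class NoForbiddenTokensLogitsProcessor(LogitsProcessor):
    """Set the scores of all forbidden token IDs to -inf at every generation step."""

    def __init__(self, forbidden_token_ids):
        self.forbidden_token_ids = list(forbidden_token_ids)

    def __call__(self, input_ids, scores):
        scores[:, self.forbidden_token_ids] = float("-inf")
        return scores

To try it out, we can ask the model a question while explicitly requesting a Chinese answer and pass the processor to generate (again a sketch; the prompt and generation settings are my own):

messages = [{"role": "user", "content": "What's 2+2? Answer in Chinese."}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
)
outputs = model.generate(
    inputs,
    max_new_tokens=512,
    logits_processor=LogitsProcessorList(
        [NoForbiddenTokensLogitsProcessor(forbidden_token_ids)]
    ),
)
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))

Even though the prompt asks for an answer in Chinese, the model cannot emit any token containing a Chinese character, and we get something like the following: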
<think>
Okay, the user is asking, "What's 2+2?" and wants an answer in Chinese. Let me think. The question is straightforward, right? It's a simple math problem. In Chinese, the answer would be "4". But wait, maybe there's more to it? Like, could there be a trick or something? No, 2+2 is just 4. So the answer should be 4. I need to make sure there's no hidden meaning or cultural context that I'm missing. But since it's a basic math question, the answer is clear. Just confirm that the user is asking for the sum, not a different question. Alright, the answer is 4.
</think>
2+2=4