mirror of
https://github.com/open-goal/jak-project
synced 2026-06-02 02:00:40 -04:00
006d24b29a
Resolves #3075

TODO before merge:
- [x] Properly draw non-Korean strings while in Korean mode (language selection)
- [x] Check Jak 3
- [x] Translation scaffolding (allow Korean characters, add to Crowdin, fix Japanese locale, etc.)
- [x] Check translation of text lines
- [x] Check translation of subtitle lines
- [x] Clean up PR / some performance optimization (it takes a bit too long to build the text, and it shouldn't since the information is in a giant lookup table)
- [x] Wait until release is cut

I confirmed the font textures are identical between Jak 2 and Jak 3, so thank god for that.

Some examples of converting the Korean encoding to UTF-8. These show off all scenarios: pure Korean / Korean with ASCII and Japanese / Korean with replacements (flags):

<img width="316" height="611" alt="Screenshot 2025-07-26 191511" src="https://github.com/user-attachments/assets/614383ba-8049-4bf4-937e-24ad3e605d41" />
<img width="254" height="220" alt="Screenshot 2025-07-26 191529" src="https://github.com/user-attachments/assets/1f6e5a6c-8527-4f98-a988-925ec66e437d" />

And it working in game. `Input Options` is a custom not-yet-translated string. It now shows up properly instead of a disgusting block of glyphs, and all the original strings are hopefully the same semantically!

<img width="550" height="493" alt="Screenshot 2025-07-26 202838" src="https://github.com/user-attachments/assets/9ebdf6c0-f5a3-4a30-84a1-e5840809a1a2" />

Quite the challenge. The crux of the problem is that Naughty Dog came up with their own encoding for representing Korean syllable blocks, and that source information is lost, so it has to be reverse engineered. Instead of trying to figure out their encoding from the text, I went at it from the angle of just "how do I draw every single Korean character using their glyph set".

One might think this is way too time consuming, but it's important to remember:
- Korean letters are designed to be composable from a relatively small number of glyphs (more on this later)
- Someone at Naughty Dog did basically this exact process
- There is no other way! While there are loose patterns, there isn't an overarching rhyme or reason; they just picked the right glyph for the writing context (more on this later). And there are even situations where there IS NO good-looking glyph, or the one ND chose looks awful and unreadable (we could technically fix this by adjusting the positioning of the glyphs but... no more)!

Information on their encoding that gets passed to `convert-korean-text`:
- It's a raw stream of bytes
- It can contain normal font letters
- Every syllable block begins with: `0x04 <num_glyphs> <...the glyph bytes...>`
  - DO NOT confuse `num_glyphs` with the number of jamo, because some glyphs can have multiple jamo!
- Every section of normal text starts with `0x03`. For example, a space would be `0x03 0x20`
- There are a select few jamo glyphs on a secondary texture page; these glyph bytes are preceded by a `0x05`. These jamo are variants of some of the final jamo, moving them as low down as possible.

Crash course on Korean writing:
- Nice resource, as this is basically what we are doing - https://glyphsapp.com/learn/creating-a-hangeul-font
- Korean syllable blocks have either 2 or 3 jamo. Jamo are basically letters and are the individual pieces that make up the syllable blocks.
- The jamo are split up into "initial", "medial" and "final" categories. Within the "medial" category there are obvious visual variants:
  - Horizontal
  - Vertical
  - Combination (horizontal + a vertical)
- These jamo are laid out in 6 main pre-defined "orientations":
  - initial + vertical medial
  - initial + horizontal medial
  - initial + combination
  - initial + vertical medial + final
  - initial + horizontal medial + final
  - initial + combination + final
- Sometimes, for stylistic reasons, jamo will be written in different ways (e.g. if there is nothing below, a vertical vowel will be extended).
  - Annoying, and ND's glyph set supports this stylistic choice!
- There are some combinations of jamo that are never used, and some that are only used for a single word in the entire language!

With all that in mind, my basic process was:
- Scan the game's entire corpus of Korean text, including subtitles. It's very easy to look at the font texture's glyphs and assign them to their respective jamo
  - This let me construct a mapping and see which glyphs were used under which context
- I then shoved this information into a 2-D matrix in Excel, and created an in-game tool to check every single jamo permutation to fill in the gaps / change them if Naughty Dog's choice was bad. Most of the time, ND's encoding was fine.
  - https://docs.google.com/spreadsheets/d/e/2PACX-1vTtyMeb5-mL5rXseS9YllVj32BGCISOGZFic6nkRV5Er5aLZ9CLq1Hj_rTY7pRCn-wrQDH1rvTqUHwB/pubhtml?gid=886895534&single=true anything in red is an addition / modification on my part.
- This was the most lengthy part, but not as long as you may think, because you can do a lot of pruning. For example, if you are checking a 3-jamo variant (the ones with the most permutations) and you've verified that the medial jamo is as far up vertically as it can be, and you are using the lowest final jamo that are available, there is nothing to check or improve, for better or worse! So those end up being the permutations between the initial and medial instead of a three-way permutation nightmare.
  - Also, while it is a 2-D matrix, there's a lot of pruning even within that. For example, for the first 3 orientations, you don't have to care about final jamo at all.
- At the end, I'm left with a lookup table that I can use to encode the best-looking Korean syllable blocks possible given the context of the jamo combination.
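To make the byte layout described above concrete, here is a minimal sketch of a tokenizer for such a stream. It is not the project's `convert-korean-text`. The function name `tokenize_korean_stream` and the token tuples are hypothetical, and two details are assumptions that the description does not spell out: that a run of normal text continues until the next `0x03`/`0x04` marker, and that a `0x05` prefix does not count toward `num_glyphs`.

```python
TEXT_MARKER = 0x03      # starts a section of normal text
SYLLABLE_MARKER = 0x04  # starts a syllable block: 0x04 <num_glyphs> <glyph bytes>
ALT_PAGE_MARKER = 0x05  # prefixes a glyph that lives on the secondary texture page


def tokenize_korean_stream(data):
    """Split a raw byte stream into ("text", bytes) and ("syllable", glyphs) tokens.

    Each glyph is a (page, glyph_byte) tuple, where page 1 means the glyph byte
    was preceded by 0x05. ASSUMPTION: the 0x05 prefix is not counted in num_glyphs.
    """
    tokens = []
    i = 0
    while i < len(data):
        marker = data[i]
        if marker == SYLLABLE_MARKER:
            if i + 1 >= len(data):
                raise ValueError("syllable block is missing its glyph count")
            num_glyphs = data[i + 1]
            i += 2
            glyphs = []
            while len(glyphs) < num_glyphs:
                if i >= len(data):
                    raise ValueError("syllable block is truncated")
                page = 0
                if data[i] == ALT_PAGE_MARKER:
                    page = 1
                    i += 1
                    if i >= len(data):
                        raise ValueError("0x05 prefix without a glyph byte")
                glyphs.append((page, data[i]))
                i += 1
            tokens.append(("syllable", glyphs))
        elif marker == TEXT_MARKER:
            i += 1
            start = i
            # ASSUMPTION: normal text runs until the next marker or end of stream
            while i < len(data) and data[i] not in (TEXT_MARKER, SYLLABLE_MARKER):
                i += 1
            tokens.append(("text", bytes(data[start:i])))
        else:
            raise ValueError("unexpected byte {:#04x} at offset {}".format(marker, i))
    return tokens


if __name__ == "__main__":
    # two-glyph syllable (second glyph on the secondary page), a space, one-glyph syllable
    sample = bytes([0x04, 0x02, 0x10, 0x05, 0x20, 0x03, 0x20, 0x04, 0x01, 0x11])
    for token in tokenize_korean_stream(sample):
        print(token)
```

Reading glyphs by count rather than scanning for the next marker matters here: `num_glyphs` is a glyph count, not a jamo count, so one syllable block may carry fewer glyph bytes than it has jamo.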
256 lines
15 KiB
Python
import argparse
import glob
import json
import re

parser = argparse.ArgumentParser()
parser.add_argument("--fix", action="store_true")
parser.set_defaults(fix=False)
args = parser.parse_args()

# fmt: off
JAK1_ALLOWED_CHARACTERS = [
"_", # NOTE - not an actual underscore, adds a long space!
"A", "B", "C", "D", "E", "F", "G", "H", "I", "J", "K", "L", "M", "N", "O", "P", "Q", "R", "S", "T", "U", "V", "W", "X", "Y", "Z",
"0", "1", "2", "3", "4", "5", "6", "7", "8", "9",
"'", "!", "(", ")", "+", "-", ",", ".", "/", ":", "=", "<", ">", "*", "%", "?", "\"",
"`", "ˇ", "¨", "º", "¡", "¿", "Æ", "Ç", "ß", "™", "、", " ", "Å", "Ø", "Ą", "Ę", "Ł", "Ż","Ů", "Ý", "Č", "Ň", "Ř", "Š", "Ť", "Ž",
"Ñ", "Ã", "Õ", "Á", "É", "Í", "Ó", "Ú", "Ć", "Ń", "Ś", "Ź", "Ő", "Ű", "Â", "Đ", "Ê", "Î", "Ô", "Û", "À", "È", "Ì", "Ò", "Ù", "Ä", "Ë", "Ï", "Ö", "ö", "Ü", "Ė","Č","Š","Ž","Ų","Ū","Į","Ǎ","Ě","Ǧ","Ǐ","Ǒ","Ǔ","Y̌",
"海", "界", "学", "ワ", "ヲ", "ン", "岩", "旧", "空", "ヮ", "撃", "賢", "湖", "口", "行", "合", "士", "寺", "山", "者", "所", "書", "小", "沼", "上", "城", "場", "出", "闇", "遺", "黄", "屋", "下", "家", "火", "花", "レ", "ロ", "青", "・", "゛", "゜", "ー", "『", "』", "宝", "石", "赤", "跡", "川", "戦", "村", "隊", "台", "長", "鳥", "艇", "洞", "道", "発", "飛", "噴", "池", "中", "塔", "島", "部", "砲", "産", "眷", "力", "緑", "岸", "像", "谷", "心", "森", "水", "船", "世",
"ぁ", "あ", "ぃ", "い", "ぅ", "う", "ぇ", "え", "ぉ", "お", "か", "き", "く", "け", "こ", "さ", "し", "す", "せ", "そ", "た", "ち", "っ", "つ", "て", "と", "な", "に", "ぬ", "ね", "の", "は", "ひ", "ふ", "へ", "ほ", "ま", "み", "む", "め", "も", "ゃ", "や", "ゅ", "ゆ", "ょ", "よ", "ら", "り", "る", "れ", "ろ", "ゎ", "わ", "を", "ん",
"が", "ぎ", "ぐ", "げ", "ご", "ざ", "じ", "ず", "ぜ", "ぞ", "だ", "ぢ", "づ", "で", "ど", "ば", "び", "ぶ", "べ", "ぼ",
"ぱ", "ぴ", "ぷ", "ぺ", "ぽ",
"ァ", "ア", "ィ", "イ", "ゥ", "ウ", "ェ", "エ", "ォ", "オ", "カ", "キ", "ク", "ケ", "コ", "サ", "シ", "ス", "セ", "ソ", "タ", "チ", "ッ", "ツ", "テ", "ト", "ナ", "ニ", "ヌ", "ネ", "ノ", "ハ", "ヒ", "フ", "ヘ", "ホ", "マ", "ミ", "ム", "メ", "モ", "ャ", "ヤ", "ュ", "ユ", "ョ", "ヨ", "ラ", "リ", "ル",
"ヴ", "ガ", "ギ", "グ", "ゲ", "ゴ", "ザ", "ジ", "ズ", "ゼ", "ゾ", "ダ", "ヂ", "ヅ", "デ", "ド", "バ", "ビ", "ブ", "ベ", "ボ",
"パ", "ピ", "プ", "ペ", "ポ",
"~", "Œ"
]

JAK1_ALLOWED_CODES = [
"<TIL>",
"<PAD_X>", "<PAD_TRIANGLE>", "<PAD_CIRCLE>", "<PAD_SQUARE>"
]

JAK1_AUTO_REPLACEMENTS = {
"ª": "º",
"\n": "",
"’": "'",
"·": "-",
"–": "-",
"": "",
"„": ",,",
"”": "\"",
" ": " ",
"!": "!",
"(": "(",
")": ")",
"。": ".",
"×": "x",
"?": "?"
}

JAK2_ALLOWED_CHARACTERS = [
"_", # NOTE - not an actual underscore, adds a long space!
"A", "B", "C", "D", "E", "F", "G", "H", "I", "J", "K", "L", "M", "N", "O", "P", "Q", "R", "S", "T", "U", "V", "W", "X", "Y", "Z",
"a", "b", "c", "d", "e", "f", "g", "h", "i", "j", "k", "l", "m", "n", "o", "p", "q", "r", "s", "t", "u", "v", "w", "x", "y", "z",
"0", "1", "2", "3", "4", "5", "6", "7", "8", "9",
"'", "!", "(", ")", "+", "-", ",", ".", "/", ":", "=", "<", ">", "*", "%", "?", "\"",
"`", "ˇ", "¨", "º", "¡", "¿", "Æ", "Ç", "ß", "™", "、", " ", "Å", "Ø", "Ą", "Ę", "Ł", "Ż",
"Ů", "ů", "Ý", "ý", "Č", "č", "Ň", "ň", "Ř", "ř", "Š", "š", "Ť", "ť", "Ž", "ž", "Đ", "đ",
"æ", "ø", "œ",
"Ñ", "Ã", "Õ", "Á", "É", "Í", "Ó", "Ú", "Ć", "Ń", "Ś", "Ź", "ź", "Ő", "Ű", "Â", "Ê", "Î", "Ô", "Û", "À", "È", "Ì", "Ò", "Ù", "Ä", "Ë", "Ï", "ï", "Ö", "ö", "Ü", "Ė","Č","Š","Ž","Ų","Ū","Į",
"ñ", "á", "é", "í", "ó", "ú", "â", "ê", "î", "ô", "û", "à", "è", "ì", "ò", "ù", "ë", "ä", "ö", "ü", "ś", "å", "õ", "ã", "ę", "ż", "ć", "ą", "ł", "ń", "ű", "ő", "ė","č","š","ž","ų","ū","į",
"Ǎ","Ě","Ǧ","Ǐ","Ǒ","Ǔ","Y̌","ǎ","ě","ǧ","ǐ","ǒ","ǔ","y̌",
"・", "゛", "゜", "ー", "『", "』",
"海", "界", "学", "ワ", "ヲ", "ン", "岩", "旧", "空", "ヮ", "撃", "賢", "湖", "口", "行", "合", "士", "寺", "山", "者", "所", "書", "小", "沼", "上", "城", "場", "出", "闇", "遺", "黄", "屋", "下", "家", "火", "花", "レ", "ロ", "青", "宝", "石", "赤", "跡", "川", "戦", "村", "隊", "台", "長", "鳥", "艇", "洞", "道", "発", "飛", "噴", "池", "中", "塔", "島", "部", "砲", "産", "眷", "力", "緑", "岸", "像", "谷", "心", "森", "水", "船", "世",
"位", "遺", "院", "映", "衛", "応", "下", "画", "解", "開", "外", "害", "蓋", "完", "換", "監", "間", "器", "記", "逆", "救", "金", "空", "掘", "警", "迎", "撃", "建", "源", "現", "言", "限", "個", "庫", "後", "語", "護", "交", "功", "向", "工", "攻", "溝", "行", "鉱", "降", "合", "告", "獄", "彩", "作", "山", "使", "始", "試", "字", "寺", "時", "示", "自", "式", "矢", "射", "者", "守", "手", "終", "週", "出", "所", "書", "勝", "章", "上", "乗", "場", "森", "進", "人", "水", "数", "制", "性", "成", "聖", "石", "跡", "先", "戦", "船", "選", "走", "送", "像", "造", "続", "対", "袋", "台", "弾", "地", "中", "敵", "転", "電", "塔", "頭", "動", "内", "日", "入", "年", "能", "廃", "排", "敗",
"発", "反", "必", "表", "武", "壁", "墓", "放", "方", "砲", "妨", "北", "本", "幕", "無", "迷", "面", "戻", "紋", "薬", "輸", "勇", "友", "遊", "容", "要", "利", "了", "量", "力", "練", "連", "録", "話", "墟", "脱", "旗", "破", "壊", "全", "滅", "機", "仲", "渓", "谷", "優", "探", "部", "索", "前", "右", "左", "会", "高", "低", "押", "切", "替", "秒", "箱", "泳", "~",
"闇", "以", "屋", "俺", "化", "界", "感", "気", "却", "曲", "継", "権", "見", "古", "好", "才", "士", "子", "次", "主", "種", "讐", "女", "小", "焼", "証", "神", "身", "寸", "世", "想", "退", "第", "着", "天", "倒", "到", "突", "爆", "番", "負", "復", "物", "眠", "予", "用", "落", "緑", "封", "印", "扉", "最", "刻", "足",
"ぁ", "あ", "ぃ", "い", "ぅ", "う", "ぇ", "え", "ぉ", "お", "か", "き", "く", "け", "こ", "さ", "し", "す", "せ", "そ", "た", "ち", "っ", "つ", "て", "と", "な", "に", "ぬ", "ね", "の", "は", "ひ", "ふ", "へ", "ほ", "ま", "み", "む", "め", "も", "ゃ", "や", "ゅ", "ゆ", "ょ", "よ", "ら", "り", "る", "れ", "ろ", "ゎ", "わ", "を", "ん",
"が", "ぎ", "ぐ", "げ", "ご", "ざ", "じ", "ず", "ぜ", "ぞ", "だ", "ぢ", "づ", "で", "ど", "ば", "び", "ぶ", "べ", "ぼ",
"ぱ", "ぴ", "ぷ", "ぺ", "ぽ",
"ァ", "ア", "ィ", "イ", "ゥ", "ウ", "ェ", "エ", "ォ", "オ", "カ", "キ", "ク", "ケ", "コ", "サ", "シ", "ス", "セ", "ソ", "タ", "チ", "ッ", "ツ", "テ", "ト", "ナ", "ニ", "ヌ", "ネ", "ノ", "ハ", "ヒ", "フ", "ヘ", "ホ", "マ", "ミ", "ム", "メ", "モ", "ャ", "ヤ", "ュ", "ユ", "ョ", "ヨ", "ラ", "リ", "ル",
"ヴ", "ガ", "ギ", "グ", "ゲ", "ゴ", "ザ", "ジ", "ズ", "ゼ", "ゾ", "ダ", "ヂ", "ヅ", "デ", "ド", "バ", "ビ", "ブ", "ベ", "ボ",
"パ", "ピ", "プ", "ペ", "ポ",
"~", "Œ", "°", "ç"
]

JAK2_ALLOWED_CODES = [
"<TIL>", "<SUPERSCRIPT_QUOTE>",
"<PAD_X>", "<PAD_TRIANGLE>", "<PAD_CIRCLE>", "<PAD_SQUARE>", "<PAD_DPAD_UP>", "<PAD_DPAD_DOWN>", "<PAD_DPAD_ANY>", "<PAD_L1>", "<PAD_R1>", "<PAD_R2>", "<PAD_L2>", "<PAD_ANALOG_ANY>", "<PAD_ANALOG_LEFT_RIGHT>", "<PAD_ANALOG_UP_DOWN>", "<ICON_MISSION_COMPLETE>", "<ICON_MISSION_TODO>", "<FLAG_ITALIAN>", "<FLAG_SPAIN>", "<FLAG_GERMAN>", "<FLAG_FRANCE>", "<FLAG_UK>", "<FLAG_USA>", "<FLAG_KOREA>", "<FLAG_JAPAN>", "<FLAG_FINLAND>", "<FLAG_SWEDEN>", "<FLAG_DENMARK>", "<FLAG_NORWAY>", "<FLAG_ICELAND>"
]

JAK2_AUTO_REPLACEMENTS = {
"ª": "º",
"\n": "",
"’": "'",
"·": "-",
"–": "-",
"": "",
"„": ",,",
"”": "\"",
" ": " ",
"!": "!",
"(": "(",
")": ")",
"〜": "~",
"。": ".",
"×": "x",
"?": "?",
"一": "-",
";": ",",
":": ": ",
"…": "...",
"«": "<",
"»": ">",
" ": " ",
"“": "\"",
"'̂'": "",
"ų": "ų",
"‘": "'"
}
# fmt: on

return_error = False


def is_korean_syllable(char):
    # Precomposed Hangul syllables occupy U+AC00 through U+D7A3
    return '\uAC00' <= char <= '\uD7A3'


def is_char_allowed(game_name, char, allowed_characters):
    # Korean syllables are only accepted for games other than jak1
    if game_name == "jak1":
        return char in allowed_characters
    return char in allowed_characters or is_korean_syllable(char)


def is_allowed_code(pos, text, allowed_codes):
    # Find any occurrences of allowed codes in the string.
    # If the position falls inside one of them, return the index just past that code.
    for code in allowed_codes:
        # escape the code so it is always matched literally
        for match in re.finditer(re.escape(code), text):
            # match.end() is exclusive; treating it as inclusive would hand the
            # caller the position it is already at, so it would never advance
            if match.start() <= pos < match.end():
                return match.end()
    return -1


def fix_character(game_name, char, allowed_characters, auto_replacements):
    # First let's try upper-casing it, if that's allowed, let's use that instead
    upper_case = char.upper()
    if is_char_allowed(game_name, upper_case, allowed_characters):
        return upper_case
    if char in auto_replacements:
        return auto_replacements[char]
    return char


def replace_character(string, position, new_character):
    # Strings are immutable, so rebuild the string around the replaced position
    return string[:position] + new_character + string[position + 1:]


def lint_characters(game_name, text, allowed_characters, allowed_codes, auto_replacements):
    invalid_characters_found = False
    pos = 0
    while pos < len(text):
        character = text[pos]
        if not is_char_allowed(game_name, character, allowed_characters):
            # Check to see if it's an allowed code
            code_end_pos = is_allowed_code(pos, text, allowed_codes)
            if code_end_pos == -1:
                # If we are fixing instances, attempt to do so
                char_fixed = False
                if args.fix:
                    new_char = fix_character(game_name, character, allowed_characters, auto_replacements)
                    if new_char != character:
                        text = replace_character(text, pos, new_char)
                        char_fixed = True
                if not char_fixed:
                    print(
                        "Character '{}' not allowed - Found in string {}".format(
                            character, text
                        )
                    )
                    # text = replace_character(text, pos, "?")
                    invalid_characters_found = True
                pos = pos + 1
            else:
                # advance to the end of the code and continue checking
                pos = code_end_pos
        else:
            pos = pos + 1
    return invalid_characters_found, text


def fix_games_translations(game_name, allowed_characters, allowed_codes, auto_replacements):
    global return_error
    print(f"Checking {game_name} translations")
    # Iterate through the translations making sure there are no characters that are not allowed
    text_files = glob.glob(f"./game/assets/{game_name}/text/*.json")

    for text_file in text_files:
        print("Checking {}".format(text_file))
        with open(text_file, encoding="utf-8") as f:
            file_data = json.load(f)
        for id, text in file_data.items():
            invalid_chars_exist, new_text = lint_characters(game_name, text, allowed_characters, allowed_codes, auto_replacements)
            if args.fix:
                file_data[id] = new_text
            if invalid_chars_exist:
                return_error = True
        if args.fix:
            # save the modified file back out
            with open(text_file, "w", encoding="utf-8") as f:
                json.dump(file_data, f, indent=2, ensure_ascii=False)
                f.write("\n")

    subtitle_files = glob.glob(f"./game/assets/{game_name}/subtitle/*lines*.json")

    for subtitle_file in subtitle_files:
        print("Checking {}...".format(subtitle_file))
        with open(subtitle_file, encoding="utf-8") as f:
            file_data = json.load(f)
        # Check Speakers
        for id, text in file_data["speakers"].items():
            invalid_chars_exist, new_text = lint_characters(game_name, text, allowed_characters, allowed_codes, auto_replacements)
            if args.fix and new_text != text:
                file_data["speakers"][id] = new_text
            if invalid_chars_exist:
                return_error = True
        # Check Lines
        for id, lines in file_data["cutscenes"].items():
            for i, line in enumerate(lines):
                invalid_chars_exist, new_text = lint_characters(game_name, line, allowed_characters, allowed_codes, auto_replacements)
                if args.fix and new_text != line:
                    lines[i] = new_text
                if invalid_chars_exist:
                    return_error = True
        if game_name == "jak1":
            for id, lines in file_data["hints"].items():
                for i, line in enumerate(lines):
                    invalid_chars_exist, new_text = lint_characters(game_name, line, allowed_characters, allowed_codes, auto_replacements)
                    if args.fix and new_text != line:
                        lines[i] = new_text
                    if invalid_chars_exist:
                        return_error = True
        else:
            for id, lines in file_data["other"].items():
                for i, line in enumerate(lines):
                    invalid_chars_exist, new_text = lint_characters(game_name, line, allowed_characters, allowed_codes, auto_replacements)
                    if args.fix and new_text != line:
                        lines[i] = new_text
                    if invalid_chars_exist:
                        return_error = True
        if args.fix:
            # save the modified file back out
            with open(subtitle_file, "w", encoding="utf-8") as f:
                json.dump(file_data, f, indent=2, ensure_ascii=False)
                f.write("\n")


fix_games_translations("jak1", JAK1_ALLOWED_CHARACTERS, JAK1_ALLOWED_CODES, JAK1_AUTO_REPLACEMENTS)
fix_games_translations("jak2", JAK2_ALLOWED_CHARACTERS, JAK2_ALLOWED_CODES, JAK2_AUTO_REPLACEMENTS)

if return_error:
    print("Invalid characters were found, see above")
    # exit with a non-zero status so CI fails
    raise SystemExit(1)
else:
    print("No invalid characters found!")
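The `is_korean_syllable` check in the script accepts the precomposed Hangul syllable range, U+AC00 through U+D7A3. As a rough illustration of how such a syllable relates to the initial / medial / final jamo and the six orientations from the description, here is a minimal sketch using the standard Unicode arithmetic for that block. It is not code from the project. The names `decompose_syllable`, `medial_shape`, and `orientation` are hypothetical, and the grouping of medials into vertical / horizontal / combination is my reading of those terms rather than something taken from the lookup table.

```python
HANGUL_BASE = 0xAC00  # first precomposed syllable
HANGUL_LAST = 0xD7A3  # last precomposed syllable
NUM_MEDIALS = 21
NUM_FINALS = 28       # index 0 means "no final jamo"

# Medial indexes in Unicode order. ASSUMPTION: this grouping reflects the
# visual variants described above, not the project's own tables.
VERTICAL_MEDIALS = {0, 1, 2, 3, 4, 5, 6, 7, 20}  # e.g. the vowels in 가, 거, 기
HORIZONTAL_MEDIALS = {8, 12, 13, 17, 18}         # e.g. the vowels in 고, 구, 그
# every remaining medial combines a horizontal and a vertical stroke (과, 궈, 긔)


def decompose_syllable(char):
    """Return (initial, medial, final) jamo indexes for a precomposed syllable."""
    code = ord(char)
    if not HANGUL_BASE <= code <= HANGUL_LAST:
        raise ValueError("not a precomposed Hangul syllable: {!r}".format(char))
    offset = code - HANGUL_BASE
    initial = offset // (NUM_MEDIALS * NUM_FINALS)
    medial = (offset // NUM_FINALS) % NUM_MEDIALS
    final = offset % NUM_FINALS
    return initial, medial, final


def medial_shape(medial):
    if medial in VERTICAL_MEDIALS:
        return "vertical medial"
    if medial in HORIZONTAL_MEDIALS:
        return "horizontal medial"
    return "combination"


def orientation(char):
    """Name which of the six orientations a syllable falls into."""
    _, medial, final = decompose_syllable(char)
    name = "initial + " + medial_shape(medial)
    if final != 0:
        name += " + final"
    return name


if __name__ == "__main__":
    for syllable in "한구과":
        print(syllable, decompose_syllable(syllable), orientation(syllable))
```

A lookup table like the one described could then be keyed on these indexes, with the orientation deciding whether the final jamo needs to be considered at all.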