Tyler Wilding 006d24b29a game: Support korean in Jak 2 and Jak 3 (#3988)
Resolves #3075 

TODO before merge:
- [x] Properly draw non-korean strings while in korean mode (language
selection)
- [x] Check jak 3
- [x] Translation scaffolding (allow korean characters, add to Crowdin,
fix japanese locale, etc)
- [x] Check translation of text lines
- [x] Check translation of subtitle lines
- [x] Cleanup PR / some performance optimization (it takes a bit too
long to build the text, and it shouldn't, since the information is in a
giant lookup table)
- [x] Wait until release is cut

I confirmed the font textures are identical between Jak 2 and Jak 3, so
thank god for that.

Some examples of converting the korean encoding to utf-8. These show off
all scenarios, pure korean / korean with ascii and japanese / korean
with replacements (flags):
<img width="316" height="611" alt="Screenshot 2025-07-26 191511"
src="https://github.com/user-attachments/assets/614383ba-8049-4bf4-937e-24ad3e605d41"
/>
<img width="254" height="220" alt="Screenshot 2025-07-26 191529"
src="https://github.com/user-attachments/assets/1f6e5a6c-8527-4f98-a988-925ec66e437d"
/>

And here it is working in game. `Input Options` is a custom not-yet-translated
string. It now shows up properly instead of a disgusting block of
glyphs, and all the original strings are hopefully the same
semantically!:
<img width="550" height="493" alt="Screenshot 2025-07-26 202838"
src="https://github.com/user-attachments/assets/9ebdf6c0-f5a3-4a30-84a1-e5840809a1a2"
/>

Quite the challenge. The crux of the problem is that Naughty Dog came up
with their own encoding for representing Korean syllable blocks, and
that source information is lost, so it has to be reverse engineered.
Instead of trying to figure out their encoding from the text, I went
at it from the angle of just "how do I draw every single Korean
character using their glyph set".

One might think this is way too time consuming but it's important to
remember:
- Korean letters are designed to be composable from a relatively small
number of glyphs (more on this later)
- Someone at Naughty Dog did basically this exact process
- There is no other way! While there are loose patterns, there isn't an
overarching rhyme or reason, they just picked the right glyph for the
writing context (more on this later). And there are even situations
where there IS NO good looking glyph, or the one ND chose looks awful
and unreadable (we could technically fix this by adjusting the
positioning of the glyphs but....no more)!

Information on their encoding that gets passed to `convert-korean-text`:
- It's a raw stream of bytes
- It can contain normal font letters
- Every syllable block begins with: `0x04 <num_glyphs> <...the glyph
bytes...>`
- DO NOT confuse `num_glyphs` with num jamo, because some glyphs can
have multiple jamo!
- Every section of normal text starts with `0x03`. For example, a space
would be `0x03 0x20`
- A select few jamo glyphs live on a secondary texture page; these glyph
bytes are preceded by `0x05`. These jamo are variants of some of the
final consonants, moved as low down as possible.
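
The framing rules above can be sketched as a small tokenizer. This is only a sketch under stated assumptions, not the logic of `convert-korean-text`: `KoreanToken` and `tokenize_korean_stream` are hypothetical names, and it assumes a `0x05` prefix does not count toward `num_glyphs`.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <string_view>
#include <utility>
#include <vector>

// Hypothetical token type for this sketch; not a type from the codebase.
struct KoreanToken {
  bool is_syllable;              // true: one syllable block, false: normal text
  std::string bytes;             // glyph bytes (syllable) or plain font bytes (text)
  std::vector<bool> extra_page;  // per glyph: was it prefixed with 0x05?
};

std::vector<KoreanToken> tokenize_korean_stream(std::string_view in) {
  std::vector<KoreanToken> out;
  size_t i = 0;
  while (i < in.size()) {
    const uint8_t b = static_cast<uint8_t>(in[i]);
    if (b == 0x04 && i + 1 < in.size()) {
      // 0x04 <num_glyphs> <...the glyph bytes...>
      int remaining = static_cast<uint8_t>(in[i + 1]);
      i += 2;
      KoreanToken tok{true, "", {}};
      while (remaining > 0 && i < in.size()) {
        bool extra = false;
        if (static_cast<uint8_t>(in[i]) == 0x05 && i + 1 < in.size()) {
          extra = true;  // glyph lives on the secondary texture page
          ++i;
        }
        tok.bytes.push_back(in[i]);
        tok.extra_page.push_back(extra);
        ++i;
        --remaining;
      }
      assert(tok.bytes.size() == tok.extra_page.size());
      out.push_back(std::move(tok));
    } else if (b == 0x03) {
      // 0x03 opens a section of normal text
      out.push_back({false, "", {}});
      ++i;
    } else {
      // Normal byte: append to the current text section (open one if needed)
      if (out.empty() || out.back().is_syllable) {
        out.push_back({false, "", {}});
      }
      out.back().bytes.push_back(in[i]);
      ++i;
    }
  }
  return out;
}
```

Feeding it the bytes `04 02 10 05 20 03 20 41` would give one syllable token with two glyphs (the second from the extra page), followed by a text token containing a space and `A`. The glyph byte values here are made up for illustration.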

Crash course on korean writing:
- Nice resource as this is basically what we are doing -
https://glyphsapp.com/learn/creating-a-hangeul-font
- Korean syllable blocks have either 2 or 3 jamo. Jamo are basically
letters and are the individual pieces that make up the syllable blocks.
- The jamo are split up into "initial", "medial" and "final" categories.
Within the "medial" category there are obvious visual variants:
  - Horizontal
  - Vertical
  - Combination (horizontal + a vertical)
- These jamo are laid out in 6 main pre-defined "orientations":
  - initial + vertical medial
  - initial + horizontal medial
  - initial + combination
  - initial + vertical medial + final
  - initial + horizontal medial + final
  - initial + combination + final
- Sometimes, for stylistic reasons, jamo will be written in different
ways (i.e. if there is nothing below it, a vertical vowel will be extended).
  - Annoying, and ND's glyph set supports this stylistic choice!
- There are some combinations of jamo that are never used, and some that
are only used for a single word in the entire language!
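
To make the six orientations concrete, here is a minimal sketch that splits a precomposed Hangul syllable into jamo indices using the standard Unicode arithmetic (base `U+AC00`, 21 medials, 28 finals counting "none") and maps it to one of the orientations listed above. The names, and the grouping of medials into vertical / horizontal / combination, are assumptions made for illustration and are not code from this PR.

```cpp
#include <cassert>
#include <cstdint>

enum class MedialShape { Vertical = 0, Horizontal = 1, Combination = 2 };

struct JamoIndices {
  int initial;  // 0..18
  int medial;   // 0..20
  int final_;   // 0 = no final jamo, otherwise 1..27
};

constexpr char32_t kSyllableBase = 0xAC00;  // first precomposed syllable
constexpr char32_t kSyllableLast = 0xD7A3;  // last precomposed syllable

bool is_precomposed_syllable(char32_t cp) {
  return cp >= kSyllableBase && cp <= kSyllableLast;
}

// Standard Unicode decomposition: cp = base + (initial * 21 + medial) * 28 + final
JamoIndices decompose(char32_t cp) {
  assert(is_precomposed_syllable(cp));
  const int s = static_cast<int>(cp - kSyllableBase);
  return {s / (21 * 28), (s % (21 * 28)) / 28, s % 28};
}

// Assumed grouping of the 21 medials by their visual shape.
MedialShape medial_shape(int medial) {
  switch (medial) {
    case 8: case 12: case 13: case 17: case 18:  // ㅗ ㅛ ㅜ ㅠ ㅡ
      return MedialShape::Horizontal;
    case 9: case 10: case 11: case 14: case 15: case 16: case 19:  // ㅘ ㅙ ㅚ ㅝ ㅞ ㅟ ㅢ
      return MedialShape::Combination;
    default:  // ㅏ ㅐ ㅑ ㅒ ㅓ ㅔ ㅕ ㅖ ㅣ
      return MedialShape::Vertical;
  }
}

// Returns 0..5, in the same order as the orientation list above.
int orientation_index(char32_t cp) {
  const JamoIndices j = decompose(cp);
  const int shape = static_cast<int>(medial_shape(j.medial));
  return shape + (j.final_ != 0 ? 3 : 0);
}
```

For example, `orientation_index(0xD55C)` (한) lands on "initial + vertical medial + final", while `orientation_index(0xACFC)` (과) lands on "initial + combination".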

With all that in mind, my basic process was:
- Scan the game's entire corpus of Korean text, including subtitles.
It's very easy to look at the font texture's glyphs and assign them to
their respective jamo
- This let me construct a mapping and see which glyphs were used under
which context
- I then shoved this information into a 2-D matrix in excel, and created
an in-game tool to check every single jamo permutation to fill in the
gaps / change them if Naughty Dog's choice was bad. Most of the time, ND's
encoding was fine.
-
https://docs.google.com/spreadsheets/d/e/2PACX-1vTtyMeb5-mL5rXseS9YllVj32BGCISOGZFic6nkRV5Er5aLZ9CLq1Hj_rTY7pRCn-wrQDH1rvTqUHwB/pubhtml?gid=886895534&single=true
anything in red is an addition / modification on my part.
- This was the most lengthy part but not as long as you may think, you
can do a lot of pruning. For example if you are checking a 3-jamo
variant (the ones with the most permutations) and you've verified that
the medial jamo is as far up vertically as it can be, and you are using
the lowest final jamo that are available -- there is nothing to check or
improve -- for better or worse! So those end up being the permutations
between the initial and medial instead of a three-way permutation
nightmare.
- Also, while it is a 2-D matrix, there's a lot of pruning even within
that. For example, for the first 3 orientations, you don't have to care
about final jamo at all.
- At the end, I'm left with a lookup table that I can use to encode the
best looking Korean syllable blocks possible given the context of the
jamo combination.
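
The "default glyph, unless the context calls for a different one" idea behind such a lookup table can be sketched roughly as follows. The types, keys and glyph bytes here are hypothetical and chosen for illustration; the real table's schema is not shown in this description.

```cpp
#include <cstdint>
#include <map>
#include <string>

// Hypothetical entry: one default glyph plus context-specific overrides.
struct GlyphChoice {
  uint8_t default_glyph;
  std::map<std::string, uint8_t> alternatives;  // key: neighbouring jamo (the context)
};

// Hypothetical table layout: jamo -> orientation index (0..5) -> glyph choice.
using GlyphTable = std::map<std::string, std::map<int, GlyphChoice>>;

// Pick the glyph for a jamo in a given orientation, preferring an alternative
// registered for this context and falling back to the default otherwise.
// Throws std::out_of_range if the jamo/orientation pair is not in the table.
uint8_t pick_glyph(const GlyphTable& table,
                   const std::string& jamo,
                   int orientation,
                   const std::string& context) {
  const GlyphChoice& choice = table.at(jamo).at(orientation);
  const auto it = choice.alternatives.find(context);
  return it != choice.alternatives.end() ? it->second : choice.default_glyph;
}
```

The point of this shape is that most lookups hit the default, and only the hand-verified exceptions (the cells marked in red in the spreadsheet) need an explicit alternative.
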
2025-08-16 19:35:47 -04:00


/*!
 * @file FontUtils.cpp
 *
 * Code for handling text and strings in Jak 1's "large font" format.
 *
 * MAKE SURE THIS FILE IS ENCODED IN UTF-8!!! The various strings here depend on it.
 * Always verify the encoding if string detection suddenly goes awry.
 */

#include "font_utils.h"

#include <algorithm>
#include <stdexcept>
#include <string_view>

#include "common/util/Assert.h"
#include "common/util/FileUtil.h"
#include "common/util/font/dbs/font_db_jak1.h"
#include "common/util/font/dbs/font_db_jak2.h"
#include "common/util/font/dbs/font_db_jak3.h"
#include "common/util/font/font_utils_korean.h"
#include "common/util/string_util.h"
#include "common/versions/versions.h"

#include "fmt/format.h"

void from_json(const json& j, KoreanLookupEntry& obj) {
  json_deserialize_if_exists(defaultGlyph);
  json_deserialize_if_exists(alternatives);
}

std::map<GameTextVersion, GameTextFontBank*> g_font_banks = {
    {GameTextVersion::JAK1_V1, &g_font_bank_jak1_v1},
    {GameTextVersion::JAK1_V2, &g_font_bank_jak1_v2},
    {GameTextVersion::JAK2, &g_font_bank_jak2},
    {GameTextVersion::JAK3, &g_font_bank_jak3}};

const std::unordered_map<std::string, GameTextVersion> sTextVerEnumMap = {
    {"jak1-v1", GameTextVersion::JAK1_V1},
    {"jak1-v2", GameTextVersion::JAK1_V2},
    {"jak2", GameTextVersion::JAK2},
    {"jak3", GameTextVersion::JAK3}};

const std::string& get_text_version_name(GameTextVersion version) {
  for (auto& [name, ver] : sTextVerEnumMap) {
    if (ver == version) {
      return name;
    }
  }
  throw std::runtime_error(fmt::format("invalid text version {}", fmt::underlying(version)));
}

GameTextVersion get_text_version_from_name(const std::string& name) {
  return sTextVerEnumMap.at(name);
}

GameTextFontBank::GameTextFontBank(GameTextVersion version,
                                   std::vector<EncodeInfo>* encode_info,
                                   std::vector<ReplaceInfo>* replace_info,
                                   std::unordered_set<char>* passthrus)
    : m_version(version), m_passthrus(passthrus) {
  // Insert the encode and replacement info into a Trie, much faster lookups that way
  for (const auto& encoding : *encode_info) {
    m_encode_to_utf8_trie.insert(encoding.game_bytes, encoding);
    m_encode_to_game_trie.insert(encoding.utf8, encoding);
  }
  for (const auto& replacement : *replace_info) {
    m_replace_to_utf8_trie.insert(replacement.game_encoding, replacement);
    m_replace_to_game_trie.insert(replacement.utf8_string, replacement);
  }
}

bool GameTextFontBank::is_language_id_korean(const int language_id) const {
  if (m_version == GameTextVersion::JAK2 && language_id == 6) {
    return true;
  } else if (m_version == GameTextVersion::JAK3 && language_id == 7) {
    return true;
  }
  return false;
}

GameTextFontBank* get_font_bank(GameTextVersion version) {
  return g_font_banks.at(version);
}

GameTextFontBank* get_font_bank_from_game_version(GameVersion version) {
  if (version == GameVersion::Jak1) {
    // Jak 1 has been patched to use V2
    return get_font_bank(GameTextVersion::JAK1_V2);
  } else if (version == GameVersion::Jak2) {
    auto font_bank = get_font_bank(GameTextVersion::JAK2);
    return font_bank;
  } else if (version == GameVersion::Jak3) {
    return get_font_bank(GameTextVersion::JAK3);
  } else {
    ASSERT_MSG(false, "Unsupported game for get_font_bank_from_game_version");
  }
}

GameTextFontBank* get_font_bank(const std::string& name) {
  if (auto it = sTextVerEnumMap.find(name); it == sTextVerEnumMap.end()) {
    throw std::runtime_error(fmt::format("unknown text version {}", name));
  } else {
    return get_font_bank(it->second);
  }
}

bool font_bank_exists(GameTextVersion version) {
  return g_font_banks.find(version) != g_font_banks.cend();
}

std::string GameTextFontBank::replace_to_game(const std::string& str) const {
  std::string newstr;
  newstr.reserve(str.size());
  // size_t index avoids a signed/unsigned comparison against str.length()
  for (size_t i = 0; i < str.length();) {
    const ReplaceInfo* remap = m_replace_to_game_trie.find_longest_prefix(str, i);
    if (!remap) {
      newstr.push_back(str[i]);
      i += 1;
    } else {
      if (!remap->utf8_alternative.empty()) {
        newstr.append(remap->utf8_alternative);
      } else {
        newstr.append(remap->game_encoding);
      }
      i += remap->utf8_string.size();
    }
  }
  return newstr;
}

std::string GameTextFontBank::encode_utf8_to_game(const std::string& str) const {
  std::string newstr;
  newstr.reserve(str.size());
  for (size_t i = 0; i < str.length();) {
    auto match = m_encode_to_game_trie.find_longest_prefix(str, i);
    if (!match) {
      newstr.push_back(str[i]);
      i += 1;
    } else {
      for (auto b : match->game_bytes) {
        newstr.push_back(b);
      }
      i += match->utf8.size();
    }
  }
  return newstr;
}

/*!
 * Turns a normal readable string into a string readable in the in-game font encoding and
 * converts \cXX escape sequences.
 */
std::string GameTextFontBank::convert_utf8_to_game(const std::string& str) const {
  return encode_utf8_to_game(replace_to_game(str));
}

std::string GameTextFontBank::replace_to_utf8(const std::string& str) const {
  std::string result;
  result.reserve(str.size());
  for (size_t i = 0; i < str.size();) {
    const ReplaceInfo* remap = m_replace_to_utf8_trie.find_longest_prefix(str, i);
    if (!remap) {
      result.push_back(str[i]);
      i += 1;
    } else {
      result.append(remap->utf8_string);
      i += remap->game_encoding.size();
    }
  }
  return result;
}

bool GameTextFontBank::valid_char_range(const char& in) const {
  if (m_version == GameTextVersion::JAK1_V1 || m_version == GameTextVersion::JAK1_V2) {
    return ((in >= '0' && in <= '9') || (in >= 'A' && in <= 'Z') ||
            m_passthrus->find(in) != m_passthrus->end()) &&
           in != '\\';
  } else if (m_version == GameTextVersion::JAK2 || m_version == GameTextVersion::JAK3 ||
             m_version == GameTextVersion::JAKX) {
    return ((in >= '0' && in <= '9') || (in >= 'A' && in <= 'Z') || (in >= 'a' && in <= 'z') ||
            m_passthrus->find(in) != m_passthrus->end()) &&
           in != '\\';
  }
  return false;
}

std::string GameTextFontBank::encode_game_to_utf8(const std::string& str) const {
  std::string newstr;
  newstr.reserve(str.size());
  for (size_t i = 0; i < str.size();) {
    auto encoding = m_encode_to_utf8_trie.find_longest_prefix(str, i);
    if (!encoding) {
      // No match: copy valid characters as-is, or escape unknown bytes
      unsigned char c = static_cast<unsigned char>(str[i]);
      if (valid_char_range(c) || c == '\n' || c == '\t' || c == '\\' || c == '"') {
        newstr.push_back(c);
      } else {
        newstr += fmt::format("\\c{:02x}", c);
      }
      ++i;
    } else {
      // Found a match: append its UTF-8 sequence
      newstr.append(encoding->utf8);
      i += encoding->game_bytes.size();  // advance past matched game bytes
    }
  }
  return newstr;
}

std::string GameTextFontBank::convert_game_to_utf8(const char* in) const {
  // Encode and apply replacement ONCE
  std::string decoded = replace_to_utf8(encode_game_to_utf8(in));
  // Escape special characters while writing directly into result
  std::string result;
  result.reserve(decoded.size());
  for (size_t i = 0; i < decoded.size(); ++i) {
    char c = decoded[i];
    if (c == '\n') {
      result += "\\n";
    } else if (c == '\t') {
      result += "\\t";
    } else if (c == '\\') {
      if (i < decoded.size() - 1 && decoded[i + 1] == 'c') {
        result.push_back(c);  // preserve \cXX
      } else {
        result += "\\\\";
      }
    } else if (c == '"') {
      result += "\\\"";
    } else {
      result.push_back(c);
    }
  }
  return result;
}

std::string GameTextFontBank::convert_utf8_to_game_korean(const std::string& str) {
  ASSERT_MSG(m_version == GameTextVersion::JAK2 || m_version == GameTextVersion::JAK3,
             "Korean is not supported for any game other than Jak 2 and Jak 3 right now");
  if (!m_korean_db.has_value()) {
    const auto db_file_path =
        file_util::get_file_path({"game/assets/fonts/jak2_jak3_korean_db.json"});
    if (file_util::file_exists(db_file_path)) {
      auto raw_data = file_util::read_text_file(db_file_path);
      auto json_data = parse_commented_json(raw_data, "jak2_jak3_korean_db.json");
      std::unordered_map<std::string, KoreanLookupOrientations> temp_db;
      json_data.get_to(temp_db);
      m_korean_db = temp_db;
    }
  }
  std::string output;
  output.reserve(str.size());
  std::string non_korean_buffer = "";
  size_t i = 0;
  while (i < str.size()) {
    char32_t cp = str_util::next_utf8_char(str, i);
    if (font_util_korean::is_korean_syllable(cp)) {
      // flush any non-korean buffer
      if (!non_korean_buffer.empty()) {
        output += 0x3;
        // encode / remap it
        output += encode_utf8_to_game(replace_to_game(non_korean_buffer));
        non_korean_buffer = "";
      }
      // write out the korean character
      output += font_util_korean::game_encode_korean_syllable(str, cp, m_korean_db.value());
    } else {
      non_korean_buffer += str_util::utf8_encode(cp);
    }
  }
  // flush any non-korean buffer
  if (!non_korean_buffer.empty()) {
    output += 0x3;
    // encode / remap it
    output += encode_utf8_to_game(replace_to_game(non_korean_buffer));
    non_korean_buffer = "";
  }
  return output;
}

std::string GameTextFontBank::convert_korean_game_to_utf8(const char* in) const {
  ASSERT_MSG(m_version == GameTextVersion::JAK2 || m_version == GameTextVersion::JAK3,
             "Korean is not supported for any game other than Jak 2 and Jak 3 right now");
  // Korean strings are raw byte strings; in other words, it's just a bunch of bytes
  // Some info on the layout:
  // - Every korean syllable block starts with a `4`
  // - The following byte indicates how many glyphs are drawn for that syllable block
  // - Each glyph that makes up the syllable block follows as a single byte (a single glyph can
  //   contain multiple jamo)
  //   - Unless the glyph is part of the "extra" texture page, in which case it's preceded by a
  //     `5`. There are very few glyphs that are, and they are only applicable for the final
  //     consonant
  // - The korean strings can contain non-korean characters. These are preceded by a `3`
  //   - For example a space would be `3 20`
  //   - It might be more accurate to say that a 3 signifies "consume characters as normal until
  //     something else is encountered (i.e. flags or more complex font encodings)"
  std::string result;
  std::string_view str(in);
  u64 index = 0;
  u8 curr_byte = 0;
  bool in_syllable_block = false;
  std::string jamo_buffer = "";
  std::string non_korean_buffer = "";
  int num_syllable_glyphs = 0;
  while (index < str.length()) {
    curr_byte = str.at(index);
    // new syllable block
    if (curr_byte == 4) {
      in_syllable_block = true;
      if (index + 1 < str.length()) {
        num_syllable_glyphs = str.at(index + 1);
        index++;
      }
      index++;
      // flush any non-korean characters
      if (!non_korean_buffer.empty()) {
        // handle remap
        std::string remapped_str = replace_to_utf8(encode_game_to_utf8(non_korean_buffer));
        result += remapped_str;
        non_korean_buffer = "";
      }
      continue;
    }
    if (in_syllable_block) {
      // extra page
      std::string glyph_key;
      u8 hex_byte = curr_byte;
      if (curr_byte == 5 && index + 1 < str.length()) {
        hex_byte = str.at(index + 1);
        glyph_key = fmt::format("extra_0x{:02x}", hex_byte);
        index++;
      } else {
        glyph_key = fmt::format("0x{:02x}", hex_byte);
      }
      const auto jamo_list = jamo_glyph_mappings_jak2.find(glyph_key);
      ASSERT_MSG(jamo_list != jamo_glyph_mappings_jak2.end(),
                 fmt::format("{} not found in jamo glyph lookup table", glyph_key));
      // a single glyph may expand to several jamo
      for (const auto& jamo : jamo_list->second) {
        jamo_buffer += jamo;
      }
      num_syllable_glyphs--;
      if (num_syllable_glyphs == 0) {
        in_syllable_block = false;
        result += font_util_korean::compose_korean_containing_text(jamo_buffer);
        jamo_buffer = "";
      }
    } else {
      if (curr_byte != 0x3) {
        non_korean_buffer.push_back(curr_byte);
      }
    }
    index++;
  }
  // flush any non-korean characters
  if (!non_korean_buffer.empty()) {
    // handle remap
    std::string remapped_str = replace_to_utf8(encode_game_to_utf8(non_korean_buffer));
    result += remapped_str;
    non_korean_buffer = "";
  }
  return result;
}