This tutorial shows you how to use BudouX to introduce smart, phrase-aware line breaks in languages where whitespace does not naturally occur, such as Japanese, Chinese, and Thai. First, set up the library and work with its default parsers to see how raw text is split into meaningful chunks. Next, move on to HTML conversion, where you can see how BudouX improves readability in constrained layouts by inserting invisible breakpoints. As you progress, you will dig into the underlying model and examine its learned features and weights to understand how decisions are made. You will also experiment with custom models, integrate BudouX into practical workflows such as line wrapping and JSON-based pipelines, and benchmark its performance. Finally, we build a minimal end-to-end training pipeline to develop an intuition for how such lightweight ML models are constructed.
import subprocess, sys

def pip(*pkgs):
    subprocess.check_call([sys.executable, "-m", "pip", "install", "-q", *pkgs])

pip("budoux")
import json, time, textwrap, html, random, re, os, tempfile
from pathlib import Path
import budoux
from IPython.display import HTML, display, Markdown
print(f"BudouX version: {budoux.__version__ if hasattr(budoux, '__version__') else 'installed'}")

def header(title):
    display(Markdown(f"## {title}"))
header("1⃣ Default parsers: Japanese / Chinese (Simplified & Traditional) / Thai")
samples = {
    "Japanese (ja)": ("今日は天気です。BudouXは機械学習を用いた改行整形ツールです。",
                      budoux.load_default_japanese_parser()),
    "Simplified Chinese": ("今天是晴天。BudouX 是一个使用机器学习的换行整理工具。",
                           budoux.load_default_simplified_chinese_parser()),
    "Traditional Chinese": ("今天是晴天。BudouX 是一個使用機器學習的換行整理工具。",
                            budoux.load_default_traditional_chinese_parser()),
    "Thai (th)": ("วันนี้อากาศดีมากและฉันอยากออกไปเดินเล่นที่สวนสาธารณะ",
                  budoux.load_default_thai_parser()),
}
for name, (text, parser) in samples.items():
    chunks = parser.parse(text)
    print(f"\n• {name}")
    print(f"  raw   : {text}")
    print(f"  parsed: {' | '.join(chunks)} ({len(chunks)} phrases)")  # ' | ' used as a visible chunk separator
Install BudouX and set up all the imports needed to start working with the library. Load the default parsers for several languages, pass sample sentences to them, and watch how the text is split into meaningful phrases. This will help you understand the core features of BudouX and how it handles different languages out of the box.
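To make the effect of segmentation concrete: the chunks a parser returns can be rejoined with a zero-width space (U+200B), which is exactly the invisible break opportunity browsers honor. A minimal sketch; the chunk list below is a hand-written, illustrative segmentation rather than real parser output:

```python
# Hypothetical chunks, mimicking what parser.parse() returns for a short sentence.
chunks = ["今日は", "とても", "いい", "天気です。"]

ZWSP = "\u200b"  # zero-width space: invisible, but a legal line-break point
joined = ZWSP.join(chunks)

# The visible text is unchanged; only invisible break opportunities were added.
print(joined.replace(ZWSP, "|"))           # make the break points visible
print(len(joined) - len("".join(chunks)))  # number of separators inserted
```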
header("2⃣ HTML translation with `translate_html_string`")
ja_parser = budoux.load_default_japanese_parser()
html_in = "今日は<b>とても天気</b>です。"
html_out = ja_parser.translate_html_string(html_in)
visible = html_out.replace("\u200b", "·")
print("Input HTML  :", html_in)
print("Output HTML :", html_out)
print("Visualised  :", visible)
demo_text = ("BudouXは機械学習を用いて、CJK言語の文章を意味のある"
"フレーズに分割し、自然な位置で改行できるようにします。")
demo_html = ja_parser.translate_html_string(demo_text)
display(HTML(f"""
<div style="display:flex; gap:16px; font-family:'Hiragino Sans',sans-serif;">
  <div style="width:140px; border:2px solid #c33; padding:8px;">
    <b style="color:#c33;">Plain</b><br>{demo_text}
  </div>
  <div style="width:140px; border:2px solid #2a8; padding:8px;">
    <b style="color:#2a8;">BudouX</b><br>{demo_html}
  </div>
</div>
"""))
Use BudouX to transform HTML strings by inserting hidden breakpoints that improve text wrapping. Visualize the effect by comparing how plain text renders in a constrained layout with the BudouX-enhanced output. We also inspect the internal model structure and explore its feature categories and weights to understand how segmentation decisions are learned.
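The paragraph above mentions inspecting feature categories and weights. In the Python port, the parser's weights are the same nested dict that `budoux.Parser(model)` consumes in the next section: `{feature_category: {feature_string: integer_weight}}`. The sketch below walks a toy dict of that shape; the category names and weights are invented for illustration, not taken from the real model:

```python
# Toy stand-in with the same shape as a BudouX model dict.
# Real models map feature categories (character n-grams at various
# offsets around the break candidate) to integer weights.
toy_model = {
    "UW3": {"は": 2100, "。": -800, "の": 1500},
    "BW2": {"今日": 1200, "です": -300},
}

for cat, feats in sorted(toy_model.items()):
    strongest = max(feats, key=lambda k: abs(feats[k]))
    print(f"{cat}: {len(feats)} features, "
          f"strongest: {strongest!r} -> {feats[strongest]:+d}")
```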
header("4⃣ Loading a custom model with `budoux.Parser(model)`")
ja_model = ja_parser.model  # the parser exposes its weights as {category: {feature: weight}}
neutered = {cat: {k: 0 for k in d} for cat, d in ja_model.items()}
flat_parser = budoux.Parser(neutered)
print("All-zero model output:", flat_parser.parse("今日は天気です。"))
print("Default model output :", ja_parser.parse("今日は天気です。"))
header("5⃣ Practical: custom separators, line wrapping, JSON export")
def wrap_with_budoux(text, parser, max_width=12, sep="\n"):
    lines, current = [], ""
    for phrase in parser.parse(text):
        if len(current) + len(phrase) > max_width and current:
            lines.append(current); current = phrase
        else:
            current += phrase
    if current: lines.append(current)
    return sep.join(lines)
novel = ("吾輩は猫である。名前はまだ無い。どこで生れたかとんと見当がつかぬ。"
"何でも薄暗いじめじめした所でニャーニャー泣いていた事だけは記憶している。")
print("Wrapped at width 12:")
print(wrap_with_budoux(novel, ja_parser, max_width=12))
seg = {"text": novel, "phrases": ja_parser.parse(novel)}
print("\nJSON payload (first 120 chars):", json.dumps(seg, ensure_ascii=False)[:120], "...")
Experiment with your own custom model by setting all feature weights to zero and observing how segmentation behavior changes. Next, implement a practical text-wrapping function that respects BudouX phrase boundaries to improve readability. Finally, export the segmented output as JSON for easy integration into downstream systems or front-end applications.
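The greedy wrapper can be exercised without a live parser by feeding it a pre-segmented phrase list; a self-contained sketch of the same accumulation logic, where the phrase list is hand-written to mimic parser output:

```python
# Greedy line filling over phrase boundaries: a phrase is appended to the
# current line unless doing so would exceed max_width, in which case a
# new line starts. Phrases themselves are never split.
def wrap_phrases(phrases, max_width=12):
    lines, current = [], ""
    for phrase in phrases:
        if current and len(current) + len(phrase) > max_width:
            lines.append(current)
            current = phrase
        else:
            current += phrase
    if current:
        lines.append(current)
    return lines

# Hand-segmented opening of "I Am a Cat" (illustrative, not parser output).
phrases = ["吾輩は", "猫である。", "名前は", "まだ", "無い。"]
print(wrap_phrases(phrases, max_width=8))
```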
header("6⃣ Performance benchmark")
big_text = novel * 200
t0 = time.perf_counter()
phrases = ja_parser.parse(big_text)
elapsed = time.perf_counter() - t0
print(f"Parsed {len(big_text):,} chars → {len(phrases):,} phrases "
      f"in {elapsed*1000:.1f} ms ({len(big_text)/elapsed/1000:.0f}k chars/sec)")
header("7⃣ Mini end-to-end trainer (toy demo)")
training_lines = [
"私は▁遅刻魔で、▁待ち合わせに▁いつも▁遅刻して▁しまいます。",
"メールで▁待ち合わせ▁相手に▁一言、▁「ごめんね」と▁謝れば▁どうにか▁なると▁思って▁いました。",
"海外では▁ケータイを▁持って▁いない。",
"今日は▁とても▁いい▁天気です。",
"明日は▁雨が▁降る▁かも▁しれません。",
"週末は▁友達と▁映画を▁見に▁行きます。",
] * 20
SEP = "\u2581"  # "▁" marks an allowed break point in the training data
def extract_features(s, i):
    def g(idx): return s[idx] if 0 <= idx < len(s) else ""
    feats = []
    for off in (-3, -2, -1, 0, 1, 2):
        feats.append(f"U{off}:{g(i+off)}")
    for off in (-2, -1, 0, 1):
        feats.append(f"B{off}:{g(i+off)}{g(i+off+1)}")
    for off in (-1, 0):
        feats.append(f"T{off}:{g(i+off)}{g(i+off+1)}{g(i+off+2)}")
    return feats
def make_examples(lines):
    X, y = [], []
    for line in lines:
        clean = line.replace(SEP, "")
        breaks = set()
        j = 0
        for ch in line:
            if ch == SEP: breaks.add(j)
            else: j += 1
        for i in range(1, len(clean)):
            X.append(extract_features(clean, i))
            y.append(1 if i in breaks else -1)
    return X, y
X, y = make_examples(training_lines)
print(f"Training examples: {len(X)} (positives: {sum(1 for v in y if v == 1)})")
Benchmark BudouX’s performance to gauge how efficiently it processes large amounts of text. Then start building a minimal training pipeline by preparing labeled data and extracting features around potential breakpoints. This gives insight into how the training data is structured and how features contribute to segmentation decisions.
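The ▁-delimited training lines encode break positions in-band; the following minimal sketch shows how one labeled line decomposes into clean text plus a set of allowed break indices, the same bookkeeping `make_examples` performs:

```python
# "▁" (U+2581) in a training line marks a permitted break *before* the
# character that follows it. Strip the marks and record their positions.
SEP = "\u2581"
line = "今日は" + SEP + "とても" + SEP + "いい" + SEP + "天気です。"

clean, breaks, j = line.replace(SEP, ""), set(), 0
for ch in line:
    if ch == SEP:
        breaks.add(j)   # a break is allowed before clean[j]
    else:
        j += 1

print(clean)            # the sentence with markers removed
print(sorted(breaks))   # indices where a break is permitted
```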
def adaboost(X, y, rounds=80):
    n = len(y)
    w = [1/n]*n
    feat_set = sorted({f for fx in X for f in fx})
    fmap = [set(fx) for fx in X]
    model_rounds = []
    for r in range(rounds):
        best_feat, best_err, best_pol = None, 1.0, 1
        for f in feat_set:
            err_pos = sum(w[i] for i in range(n) if (f in fmap[i]) != (y[i] == 1))
            err_neg = 1 - err_pos
            if err_pos < best_err: best_feat, best_err, best_pol = f, err_pos, +1
            if err_neg < best_err: best_feat, best_err, best_pol = f, err_neg, -1
        if best_err >= 0.5 - 1e-9: break
        eps = max(best_err, 1e-6)
        alpha = 0.5 * ((1 - eps) / eps) ** 0.5  # toy confidence score (classical AdaBoost uses 0.5*ln((1-eps)/eps))
        new_w = []
        for i in range(n):
            pred = best_pol if best_feat in fmap[i] else -best_pol
            new_w.append(w[i] * (0.5 if pred == y[i] else 2.0))  # toy reweighting: halve correct, double wrong
        s = sum(new_w); w = [x/s for x in new_w]
        model_rounds.append((best_feat, best_pol, alpha))
    return model_rounds
print("Training (this is a toy trainer; be patient, ~10 s)...")
t0 = time.perf_counter()
rounds = adaboost(X, y, rounds=60)
print(f"Done in {time.perf_counter()-t0:.1f}s, {len(rounds)} stumps kept.")
correct = 0
for fx, label in zip(X, y):
    score = sum(a if (f in fx) == (p == 1) else -a for f, p, a in rounds)
    pred = 1 if score > 0 else -1
    correct += (pred == label)
print(f"Training accuracy of toy model: {correct/len(X)*100:.1f}%")
print("\nFor a production model, use scripts/train.py from the BudouX repo with the matching feature extractor; this section is illustrative.")
header("8⃣ Real-world demo: narrow-column comparison")
paragraph = ("BudouXはGoogleが開発したオープンソースの改行ライブラリです。"
"機械学習モデルを使って、文章を意味のあるフレーズに分割し、"
"読みやすい位置でのみ改行が起こるようにします。"
"依存関係がなく軽量なため、ウェブサイトやモバイルアプリに"
"簡単に組み込むことができます。")
display(HTML(f"""
<div style="display:flex; gap:24px; font-family:'Hiragino Sans','Yu Gothic',sans-serif; font-size:15px;">
  <div style="flex:1; border:2px solid #c33; padding:12px; max-width:180px;">
    <b style="color:#c33;">Without BudouX</b>
    <p style="line-height:1.7;">{paragraph}</p>
  </div>
  <div style="flex:1; border:2px solid #2a8; padding:12px; max-width:180px;">
    <b style="color:#2a8;">With BudouX</b>
    <p style="line-height:1.7;">{ja_parser.translate_html_string(paragraph)}</p>
  </div>
</div>
<p style="font-size:12px; color:#666;">Resize the browser/Colab pane to see the difference more clearly; BudouX never breaks a phrase mid-word.</p>
"""))
print("\nTutorial complete. Try connecting BudouX's output to your own UI.")
Implement a simple AdaBoost-style training loop to build a toy segmentation model from scratch. Evaluate the model's accuracy to see how well it learns phrase boundaries from your data. Finally, a real-world comparison shows how BudouX improves readability in narrow layouts, underscoring its practical value.
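For contrast with the toy halve/double reweighting in `adaboost` above, classical AdaBoost sets alpha = 0.5*ln((1-err)/err) and scales each example weight by exp(-alpha*y*h(x)). A sketch of one classical reweighting step on four hypothetical examples, chosen so that exactly one example (carrying weight 0.25) is misclassified:

```python
import math

# One classical AdaBoost reweighting step for binary labels in {+1, -1}.
def reweight(w, preds, labels, err):
    alpha = 0.5 * math.log((1 - err) / err)
    new_w = [wi * math.exp(-alpha * p * yl)
             for wi, p, yl in zip(w, preds, labels)]
    s = sum(new_w)               # renormalize so the weights sum to 1
    return [x / s for x in new_w], alpha

# Hypothetical round: example 1 is misclassified (pred +1, label -1), err = 0.25.
w, alpha = reweight([0.25] * 4, [1, 1, -1, -1], [1, -1, -1, -1], 0.25)
print(round(alpha, 3))                 # 0.549 (= 0.5 * ln 3)
print([round(x, 3) for x in w])        # the misclassified example now carries half the mass
```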
In conclusion, we have developed a comprehensive understanding of how BudouX applies machine learning to solve the delicate problem of natural line breaks in CJK and similar languages. It works well without heavy dependencies and is well suited to web and mobile integration. Through hands-on exploration, from parsing and HTML rendering to model introspection, customization, and even training, we learned how to use BudouX and how to extend and adapt it to our own use cases. This gives you both the practical tools and the conceptual clarity needed to confidently incorporate phrase-aware text segmentation into real-world applications.
This post, "How to build smarter multilingual text wrapping through parsing, HTML rendering, model introspection, and toy training with BudouX," was first published on MarkTechPost.

