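How Google SODA's Japanese On-Device ASR Model Works

Background

I took apart the Japanese SODA model that ships with ChromeOS.
SODA is the component Google uses for on-device speech recognition. It sits behind live captions, voice input, and voice commands.
In an earlier technical memo on speech systems I sorted out terms such as ASR, VAD, TTS, full-duplex conversation, ring buffers, punctuation, and endpointing. As a follow-up, this time I looked at what parts a real on-device ASR model is made of.
What I wanted to find out:
- How large an on-device ASR model is in practice
- How the VAD and the ASR model proper are separated
- What prefetch does
- What is done to avoid dropping the start of a word
- How EOT (end of turn) is decided
- How RNN-T, LAS, and FST are combined
- What dictionaries and context injection exist for Japanese
- What it would take to build something similar from OSS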

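Conclusions first

The architecture I found is quite solid.
A lightweight VAD runs all the time, and the heavy Conformer encoder runs only on speech segments.
RNN-T does the streaming recognition, an FST supplies language scores, and a LAS rescorer re-ranks the candidates.
Finally, punctuation, the normalizers, and context injection shape the output.

Three points matter most:

- prefetch.decision is not a separate model; it is the same VAD posterior interpreted with a different threshold
- In SHORT mode the prefetch threshold is as low as 0.17, which is quite aggressive, in order to catch word onsets
- EOT is decided not by the VAD alone but by the acoustic score, the FST/LAS language scores, hotwords, and the combination parameters of DecoderEndpointerStream

From the outside it looks like "you speak and text appears", but underneath there is a lot of fine-grained traffic control.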

Basic information

Item	Value
Language	ja-JP
Version identifier	`cnch24d3` (Conch 2024, revision 3)
Internal version	v5072
Architecture	Causal Conformer Encoder + RNN-T Decoder
Total size	about 138MB (all files combined)
Runtime	TFLite (`TFL3` FlatBuffer, MLIR-converted)
Config format	Protocol Buffers (binary)
Build timestamp	2024-07-02 15:00:00 (training complete)
Release date	2025-01-07 (ChromeOS R133 DLC)
Age	released about 17 months before June 2026

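Generations and versioning

Generation	Identifier	Period	Notes
Legacy	`df24d2`	2024 and earlier	previous generation
Conch	`cnch24d3`	January 2025	← this model

cnch24d3 = Conch 24 (designed in 2024), d3 (third revision). The architecture was overhauled in the Conch generation.

It is a server model converted for on-device use: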

/placer/prod/home/speech-placer/.../20240702150000_OD_CONVERTED_DEFAULT_SERVER_V101_WITH_CONFIDENCE/

OD_CONVERTED = converted for on-device use; V101_WITH_CONFIDENCE = v101 with confidence scores.

Overall pipeline

PCM audio (s16le, 16kHz, mono)
    │
    ▼
[AudioDecoderStream] → waveform
    │
    ▼
[Framer] frame=25ms (400 samples), shift=10ms (160 samples)
    │
    ▼
[Window (Hann)] → windowed frames
    │
    ▼
[FFT]
    │
    ▼
[FilterBank] 80-dim log-mel filter bank
    │
    ▼
[FrameStacker × N] → [SubsampleStream]
    │
    ├──────────────────────────────────────────────────┐
    │                                                  │
    ▼ [VAD path: lightweight, parallel]                │ [Encoder path: heavy, gated]
[AppendClusterIdStream]                                │
(VOICE_SEARCH/CAPTION/FARFIELD/TELEPHONY)              │
    │                                                  │
    ▼                                                  │
[VAD LSTM: 436KB]                                      │
 → prefetch.decision                                   │
 → vad.audio_level_events                              │
 → vad.decision ─────────────────────────────────┐     │
 → vad.decision_for_segmenter                    │     │
         │                                       │     │
         ▼                                       │     │
    [EndpointerEventStream]                      │     │
    [SegmenterStream] ◄──────────────────────────┘     │
         │ (only speech frames pass)                   │
         ├─────────────────────────────────────────────┘
         ▼
[FrameNormalizeStream] (mean_stddev: 240 dims)
    │
    ▼
[AppendClusterIdStream (TELEPHONY)]
    │
    ▼
[FrameStacker] → [SubsampleStream] (stacking for the encoder)
    │
    ▼
[CausalEncoderStream]
 Conformer × 12 layers (107MB)
 ← streaming states: 85 tensors
    │
    ▼
[DecoderEndpointerStream]
 RNN-T Decoder + EndpointerStream
 ← vad.decision (acoustic speech decision)
 ← prefetch.decision (look-ahead acoustic decision)
 ← FST language model (linguistic sentence-completeness decision)
    │
    ├── 2nd Pass: LAS Rescorer (3.8MB)
    │
    ▼
[CombinedResultStream]
    │
    ▼
[Post-processing pipeline]
 1. remove_decorators (FST)
 2. pron_cleanup (FST)
 3. CapitalizationNormalizer (FST)
 4. PunctuationNormalizer (BiLSTM 2.4MB)
 5. porn_normalizer (FST)
    │
    ▼
[RecognitionEventStream] → text output

File list and sizes

endtoendmodel/ (main ASR)

File	Size	Role
`large-encoder.tflite`	107MB (108,803KB)	Causal Conformer Encoder
`large-decoder.tflite`	8.6MB (9,025KB)	RNN-T LSTM Decoder
`large-joint_posterior.tflite`	4.6MB (4,865KB)	Joint Network: AM+LM posterior
`large-joint_prior.tflite`	4.3MB (4,429KB)	Joint Network: LM prior only
`large-rank_candidates_acoustic.tflite`	3.8MB (4,030KB)	LAS N-best Rescorer
`large.wpm.portable`	188KB	WordPiece model (subword segmentation)
`large.syms.compact`	31KB	Vocabulary symbol table (OpenFST CompactSymbolTable)
`ONDEVICE_LARGE_CONTINUOUS.mean_stddev`	1.9KB	Filter bank normalization parameters (240 dims)
`ONDEVICE_LARGE_SHORT.mean_stddev`	1.9KB	Same, for SHORT mode (identical values)

The encoder accounts for about 83% of endtoendmodel/ (roughly 78% of the 138MB total). This is the bottleneck.

acousticmodel/ (VAD / endpointer)

File	Size	Purpose
`DEFAULT_BF_VAD_EP_MD_UF_DARWINN_CONTINUOUS.endpointer_portable_lstm_model`	436KB	VAD LSTM for continuous speech
`DEFAULT_BF_EOQ_EP_MD_UF_DARWINN_SHORT.endpointer_portable_lstm_model`	436KB	VAD LSTM for short utterances (EOQ)
`SODA_DICTATION_EP_UNIFIED_FRONTEND_LANGID.endpointer_portable_lstm_model`	471KB	VAD unified with language ID
`*.endpointer_portable_lstm_mean_stddev`	2.0KB	VAD feature normalization (256 dims)
`SODA_DICTATION_EP_UNIFIED_FRONTEND_LANGID.endpointer_portable_lstm_mean_stddev`	4.1KB	Normalization for the LangID-unified VAD (528 dims)

denorm/ (post-processing)

File	Size	Role
`lm.pruned.sorted.fst`	12MB	FST language model (pruned and sorted)
`lm.pruned.sorted.syms`	1.4MB	FST language model vocabulary
`transducer.pruned.fst`	3.6MB	Transducer FST
`PUNCTUATION_LSTM.model.int8.tflite`	2.4MB	Punctuation insertion BiLSTM (int8-quantized)
`PUNCTUATION_LSTM.syms`	20KB	Punctuation LSTM vocabulary, 2,332 tokens
`porn_normalizer_ondevice.mfar`	616KB	Inappropriate-word filter (MFAR = Mapped FST Archive)
`pron_cleanup_denormalizer.mfar`	28KB	Pronunciation cleanup
`remove_decorators_ondevice.mfar`	336KB	Decorator removal
`punctuation_converter_config.pb`	76KB	Punctuation conversion mapping (protobuf)

langid/ (language identification)

File	Size	Role
`ONDEVICE_langid.tflite`	2.4MB	Conformer model identifying 43 languages
`application_params_langid_stream_multiclass_ONDEVICE`	528B	Supported language list and settings

context_prebuilt/ (context injection)

File	Size	Role
`mozc.dic`	9.4MB	Mozc dictionary (Japanese morphological analysis)
`japanese_family_name.txt`	125KB	Japanese family name dictionary
`portable_ja_verbalizer.far`	1.2MB	FST mapping digits and symbols to Japanese readings
`ja-JP_nga_popular-media_STD_FST.fst`	228KB	Media title FST
`ja-JP_nga_hotword-popular-media_STD_FST.fst`	228KB	Hotword + media FST
`ja-JP_nga_device-actions_STD_FST.fst`	99KB	Device action command FST
`ja-JP_assistant_hotword_STD_FST.fst`	2KB	Assistant hotword FST
`songs.txt` / `apps.txt` / `contacts.txt`, etc.	a few KB to 128KB each	Text files for dynamic context

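Audio front end in detail

Feature parameters (confirmed from the config)

Step	Parameter
Sampling rate	16,000 Hz
Frame length	400 samples = 25ms
Frame shift	160 samples = 10ms
Window function	Hann window
FilterBank dimensions	80-dim log-mel
VAD input	80-mel frames × 3 = 240 dims + 16-dim cluster ID = 256 dims
Encoder normalization input	80-mel × 3 stacked frames = 240 dims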
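
The framing step can be illustrated with a small sketch. It uses only the numbers from the table (16 kHz, 400-sample frames, 160-sample shift). The function names `hann_window` and `frame_signal` are my own, and the symmetric Hann variant and the absence of padding are assumptions, since the config does not say which conventions SODA uses.

```python
import math

SAMPLE_RATE = 16_000
FRAME_LEN = 400    # 25 ms at 16 kHz
FRAME_SHIFT = 160  # 10 ms at 16 kHz


def hann_window(n: int) -> list[float]:
    """Symmetric Hann window of length n (assumed variant)."""
    return [0.5 - 0.5 * math.cos(2.0 * math.pi * i / (n - 1)) for i in range(n)]


def frame_signal(samples: list[float]) -> list[list[float]]:
    """Split samples into overlapping, Hann-windowed frames (no padding)."""
    window = hann_window(FRAME_LEN)
    frames = []
    for start in range(0, len(samples) - FRAME_LEN + 1, FRAME_SHIFT):
        chunk = samples[start:start + FRAME_LEN]
        frames.append([s * w for s, w in zip(chunk, window)])
    return frames


one_second = [0.0] * SAMPLE_RATE
frames = frame_signal(one_second)
print(len(frames))  # 98 frames for 1 s of audio without padding
```

Each frame would then go through the FFT and the 80-band mel filter bank, and three consecutive 80-dim frames are stacked into the 240-dim vector listed in the table.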

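Normalization parameters (measured from mean_stddev)

Statistics for the encoder's 240-dim input (CONTINUOUS = SHORT, identical values):

Statistic	Min	Max	Notes
Mean	7.398	9.262	log-mel energy scale
Standard deviation (stddev)	1.643	4.026	larger variance toward high-frequency bands

The means of about 7.4 to 9.3 correspond to ln(1,600) to ln(10,500), a typical range for log-mel filter bank energies ✓
The standard deviation grows toward the later (high-frequency) bins, consistent with high frequencies being more affected by environmental noise ✓

The VAD's 256 dims:

Field 1 (mean): range [0.000, 8.379] (the cluster ID dims sit near 0)
Field 2 (stddev): range [1.000, 3.703]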
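
As a sanity check on the dimension counts, the file sizes line up with one float32 mean and one float32 stddev per dimension. The sketch below assumes exactly that layout with negligible protobuf overhead; this is my assumption, not something read from a schema, and the helper names are hypothetical.

```python
BYTES_PER_FLOAT32 = 4


def dims_from_size(num_bytes: int) -> int:
    """Dimensions implied by a file holding float32 mean + stddev vectors."""
    return num_bytes // (2 * BYTES_PER_FLOAT32)


def expected_size_kb(dims: int) -> float:
    """Approximate file size in KB for the given number of dimensions."""
    return dims * 2 * BYTES_PER_FLOAT32 / 1024


for dims in (240, 256, 528):
    # Compare against the 1.9KB, 2.0KB, and 4.1KB files listed earlier
    print(dims, round(expected_size_kb(dims), 1))
```

The three results match the encoder mean_stddev, the VAD mean_stddev, and the LangID-unified VAD mean_stddev sizes respectively.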

Conformer Encoder in detail

Model scale

Item	Value
File size	107MB
Conformer blocks	12 layers (conf_0 to conf_11)
Input dimensions	240 (80-mel × 3 stacked frames)
Streaming state tensors	85 (previous_state.00 to .84)
Streaming state per layer	≈ 7 state tensors × 12 layers + 1
Model conversion	MLIR-converted (TensorFlow → TFLite)
TFLite version	TFL3 (min runtime: 2.19.0)

Structure of each Conformer block (shared by all 12 layers)

Input x
  │
  ▼
[fflayer_start]: Feed-Forward 1 (×0.5 Macaron scale)
  │
  ▼
[trans_atten]:   Multi-Head Self-Attention (causal, dynamic length)
  └─ multihead_atten/StreamStep/dynamic_length
     ← in streaming, attends to left context only (causal)
  │
  ▼
[lconv]:         Local Convolution Module
  └─ depthwise_conv/strided_slice (depthwise convolution)
  │
  ▼
[fflayer_end]:   Feed-Forward 2 (×0.5 Macaron scale)
  │
  ▼
[output_ln]:     Layer Normalization
  │
  ▼
Output

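This is a Macaron-style Conformer (Gulati et al., 2020):

- Feed-forward modules sit at the start and end of each block, sandwiching the MHSA and convolution modules (each scaled by 0.5)
- Local convolution learns local acoustic patterns
- Causal MHSA learns global dependencies
- For streaming inference, dynamic_length restricts attention to the left context

Streaming implementation structure

The TFLite state tensor names reveal a three-part layout: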

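conformer_encoder_1: previous_state.00 to .84 (input state tensors, 85)
conformer_encoder_2: the actual compute graph (extractor/graph_layer/conf_0 to conf_11)
conformer_encoder_3: next_state.00 to .84 (output state tensors, 85)

A streaming design: frames are processed one at a time, with the state tensors carrying context between steps.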

RNN-T Decoder in detail

Architecture

Item	Value
File size	8.6MB
Type	Simple LSTM decoder
State tensors	`previous_decoder_state_0` / `next_decoder_state_0` (one pair)
Inputs/outputs	`inputs` → `outputs`

A simple prediction network. With only one state tensor pair, it is most likely a single LSTM cell.

Separated Joint Network design

The Joint Network is split into a posterior part and a prior part:

Encoder output (am) + Decoder output (lm)
         │
    ┌────┴────────────────────┐
    │                         │
    ▼                         ▼
[joint_posterior.tflite]   [joint_prior.tflite]
am_proj + lm_proj_1         lm_proj_2 only
    │ Tanh → LogSoftmax       │ Tanh
    │                         │
    ▼                         ▼
P(y|x, y_{<t})          P(y|y_{<t})
 (acoustic + language)   (language only)

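Why split it:

- Having both a posterior and a prior enables a form of implicit LM shallow fusion
- The LM weight can be tuned at test time
- joint_prior depends on the LM side only, so it can also be used to swap in another LM (LM subtraction)
- lm_proj_1/MatMul = the projection used in the combined acoustic + language path
- lm_proj_2/MatMul = the language-only projection (prior)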
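
One common way to use a separate prior is to subtract the model's internal LM score before adding an external LM, all in log space. The sketch below shows only that arithmetic. The function name, the weights `prior_weight` and `lm_weight`, and the toy probabilities are hypothetical, and nothing here is confirmed to be how SODA actually combines the two joint outputs.

```python
import math


def fused_score(log_posterior: float, log_prior: float, log_ext_lm: float,
                prior_weight: float = 0.3, lm_weight: float = 0.5) -> float:
    """log P(y|x,y<t) - a * log P_prior(y|y<t) + b * log P_ext(y|y<t)."""
    return log_posterior - prior_weight * log_prior + lm_weight * log_ext_lm


# Toy numbers (hypothetical): the posterior slightly prefers the wrong token,
# but the external LM considers it very unlikely.
candidates = {
    "天気": {"post": math.log(0.40), "prior": math.log(0.20), "ext": math.log(0.30)},
    "点気": {"post": math.log(0.45), "prior": math.log(0.30), "ext": math.log(0.001)},
}
scores = {tok: fused_score(v["post"], v["prior"], v["ext"])
          for tok, v in candidates.items()}
best = max(scores, key=scores.get)
print(best)  # 天気
```

Keeping the prior in its own TFLite graph means it can be evaluated from the decoder state alone, without any encoder output.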

Joint Network compute graph (posterior)

concat(am_proj, lm_proj) → Tanh → add_1
    ↓
am_proj_1/MatMul → lm_proj_1/MatMul → MatMul;add_2
    ↓
Softplus → Softplus1 → Softplus2
    ↓
strided_slice_1 → Neg_1 → sub_12 → add_3;Neg_1
    ↓
LogSoftmax → outputs

2nd Pass Rescorer (LAS)

Architecture overview

Item	Value
File size	3.8MB
Architecture	Listen, Attend and Spell (LAS)
Purpose	Rescoring N-best hypotheses + confidence

Compute graph (child_decoders_0)

Inputs: encoded (encoder output), candidate_labels (N-best hypotheses), encoded_padding
    │
    ▼
[embedding_lookup] + [positional encoding (Sin/Cos)]
    │
    ▼
[self_atten]: Self-Attention (dependencies within a hypothesis)
  └─ MultiHeadedAttention/ComputeContextVectorWithSource
     PackSource (K,V cache)
  │
  ▼
[aux_atten]: Cross-Attention (attends to the encoder output)
  └─ MultiHeadedAttention/ComputeContextVectorWithSource
  │
  ▼
[fflayer]: Feed-Forward (tr_fflayer, with layer norm)
  └─ fflayer_0 (Relu) → fflayer_1
  │
  ▼
[confidence_source_proj/MatMul]
    │
    ▼
[confidence_mlp_0]: confidence score output
  └─ Sigmoid (confidence 0 to 1)
    │
    ▼
outputs: score (for ranking hypotheses)

INFLUENCE_MODEL_TWIDDLER = a mechanism for adjusting rescoring weights at run time (for A/B tests and experiments).

VAD / EOQ LSTM in detail

Three VAD models

Model name	Size	Purpose
`DEFAULT_BF_VAD_EP_MD_UF_DARWINN_CONTINUOUS`	436KB	Continuous speech (captions, live transcription)
`DEFAULT_BF_EOQ_EP_MD_UF_DARWINN_SHORT`	436KB	Short-utterance EOQ (IME, commands)
`SODA_DICTATION_EP_UNIFIED_FRONTEND_LANGID`	471KB	VAD unified with language ID (multilingual)

Input features

VAD input = 256 dims
  = 80-mel × 3 frames (240 dims)    ... measured mean ~6-8 (log-mel scale)
  + cluster ID embedding (16 dims)  ... mean near 0

Cluster ID types:
  CONTINUOUS mode: VOICE_SEARCH, CAPTION, FARFIELD, TELEPHONY (4 classes)
  SHORT mode:      FARFIELD only (1 class)

FARFIELD (for far-field microphones) is the default in SHORT mode, so even close-range speech is treated as far-field, which favors robustness.

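How prefetch works: it is not a separate model

Conclusion

prefetch.decision is not a new model. It is vad.posterior (a speech probability between 0 and 1) from the same VAD LSTM, processed by a separate EndpointerStream instance with its own threshold and parameters.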

vad.posterior (output of a single LSTM)
  │
  ├─ EndpointerStream(prefetch threshold)  → prefetch.decision          ← pre-roll buffering trigger
  ├─ EndpointerStream(decision threshold)  → vad.decision               ← speech start / end decision
  └─ EndpointerStream(segmenter threshold) → vad.decision_for_segmenter → Encoder gate

Threshold parameters measured from the config binary

CONTINUOUS mode (captions, live transcription):

Stream	Speech-start threshold	Notes
`prefetch.decision`	0.69	prefetch starts when posterior > 69%
`vad.decision`	0.30	speech is confirmed when posterior > 30%
`vad.decision_for_segmenter`	0.30	gate to the Encoder, same threshold

SHORT mode (IME, commands): three-stage design

Stage	Stream	Threshold	Role
1st	`prefetch.decision`	0.17	buffering starts immediately when posterior > 17%
2nd	`confirmation`	0.60	speech is confirmed when posterior > 60% ← SHORT only
3rd	`vad.decision`	0.10	final decision
gate	`vad.decision_for_segmenter`	0.01	gate to the Encoder (very low)
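
The idea of "one posterior, several thresholds" can be sketched as follows. The threshold values come from the two tables above; the `decisions` helper and the plain `posterior > threshold` comparison are my simplification. The real EndpointerStream presumably also applies timing and hysteresis parameters that are not modeled here.

```python
THRESHOLDS = {
    "CONTINUOUS": {
        "prefetch.decision": 0.69,
        "vad.decision": 0.30,
        "vad.decision_for_segmenter": 0.30,
    },
    "SHORT": {
        "prefetch.decision": 0.17,
        "confirmation": 0.60,
        "vad.decision": 0.10,
        "vad.decision_for_segmenter": 0.01,
    },
}


def decisions(posterior: float, mode: str) -> dict[str, bool]:
    """Evaluate every stream's threshold against a single VAD posterior."""
    return {name: posterior > thr for name, thr in THRESHOLDS[mode].items()}


for p in (0.05, 0.20, 0.65):
    print(p, decisions(p, "SHORT"))
```

At a posterior of 0.20, SHORT mode has already started prefetching while confirmation is still pending, whereas CONTINUOUS mode has not reacted at all.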

Why word onsets are recognized well

The intent readable from the SHORT mode design:

CONTINUOUS:  prefetch threshold 0.69  ← reacts only once speech is fairly certain
SHORT:       prefetch threshold 0.17  ← starts buffering as soon as any sound arrives

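Timeline:
Waveform:    ...silence... |あ────────────────|
Speech prob: ...0.0... 0.05→0.20→0.60→0.90...

SHORT mode:
  ├─ prefetch (0.17): buffering starts the moment the posterior exceeds 0.17
  │    → captures the onset of「あ」(to within the frame resolution)
  ├─ confirmation (0.60): above 0.60, speech is confirmed
  │    → the buffered past audio is fed back into the encoder
  └─ vad.decision (0.10): end decision (very low = keeps going even on faint speech)

CONTINUOUS mode:
  └─ prefetch (0.69): reacts only to confident speech → risk of onset clipping

Why SHORT mode catches word onsets: the prefetch threshold is as low as 0.17, so buffering begins as soon as the speech probability rises slightly. confirmation (0.60) then confirms the utterance, and the buffered audio (including the onset) is passed to the encoder.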

Audio buffer (pre-roll) implementation

From the config parameters:

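Additional parameters of prefetch.decision:
  field=3 int=360   ← buffer length 360ms (inferred)
  field=6 int=1     ← mode setting
  field=7 float=0.0 ← offset

360ms of audio is buffered, so after speech is confirmed, processing can reach back up to 360ms.

Always-on recording and the ring buffer

SODA keeps recording microphone input at all times. It does not start recording after detecting speech.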

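Always running:
  microphone → writes into the ring buffer (continuously)
  VAD LSTM inference (continuously)

Started only after speech is detected:
  Conformer Encoder (heavy)
  RNN-T Decoder
  LAS Rescorer

This is why the VAD LSTM is small (436KB): the design principle is that always-on components must be cheap.

Ring buffer structure: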

Physical memory (an array of 11,520 bytes)
  16000 Hz × 2 bytes × 0.36s = 11,520 bytes

[0      ][1      ][2      ]...[N-1    ]
                                ↑
                    when the write pointer reaches the end it wraps to 0
                    → the most recent data is always present

The moment prefetch fires (posterior > 0.17):

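read_start = (write_pos - 360ms worth of bytes) % buffer_size

→ reading starts from the position 360ms in the past
→ the Encoder is fed from "360ms ago", not from "now"

On wrap-around:
When the 360ms read straddles the end of the buffer, it is handled as two reads: the tail, then the head. There is no physical seam, and no data is lost at the wrap point. Note, however, that because the buffer size (360ms) equals the lookback (360ms), read_start coincides with the current write position, so the next incoming samples overwrite the oldest unread ones. The reader therefore has to copy the pre-roll out before further writes land, or the buffer needs to be somewhat larger than the lookback.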
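
A minimal ring buffer with the two-part wrap-around read might look like the sketch below. The `RingBuffer` class and its method names are hypothetical and only illustrate the description above; SODA's actual buffer implementation was not inspected. `read_last` returns a copy, which is one way to avoid the overwrite problem when the buffer is exactly as long as the lookback.

```python
class RingBuffer:
    """Fixed-size byte ring buffer that keeps the most recent audio."""

    def __init__(self, size: int) -> None:
        self.buf = bytearray(size)
        self.size = size
        self.write_pos = 0
        self.filled = 0  # number of valid bytes, capped at size

    def write(self, data: bytes) -> None:
        for b in data:
            self.buf[self.write_pos] = b
            self.write_pos = (self.write_pos + 1) % self.size
        self.filled = min(self.size, self.filled + len(data))

    def read_last(self, n: int) -> bytes:
        """Return a copy of the most recent n bytes, oldest first."""
        n = min(n, self.filled)
        start = (self.write_pos - n) % self.size
        if start + n <= self.size:
            return bytes(self.buf[start:start + n])
        first = self.buf[start:]             # tail of the array
        second = self.buf[:n - len(first)]   # wrapped part at the head
        return bytes(first + second)


PREROLL_BYTES = 16_000 * 2 * 360 // 1000  # 11,520 bytes = 360 ms of s16 mono

rb = RingBuffer(PREROLL_BYTES)
stream = bytes(range(256)) * 50  # 12,800 bytes, so the write pointer wraps
rb.write(stream)
snapshot = rb.read_last(PREROLL_BYTES)
print(len(snapshot), snapshot == stream[-PREROLL_BYTES:])  # 11520 True
```

Taking the snapshot when prefetch fires and then appending live audio to it gives the encoder a continuous stream that begins 360ms in the past.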

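Summary of each EndpointerStream's role

Stream	Role	Downstream
`prefetch.decision`	Buffering trigger (0.17 in SHORT, 0.69 in CONTINUOUS)	Injected into the Decoder (buffering signal)
`vad.decision`	Speech confirmation / end decision (0.10 in SHORT, 0.30 in CONTINUOUS)	Injected into the Decoder (endpointer decision)
`vad.decision_for_segmenter`	Encoder gate (0.01 in SHORT, 0.30 in CONTINUOUS)	SegmenterStream → Encoder
`confirmation`	(SHORT only) speech confirmation	Filter in front of vad.decision
`vad.audio_level_events`	Audio level monitoring	AudioLevelEventStream → for debugging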

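The full mechanism of EOT (end-of-turn) prediction

After measuring all parameters from the config binary, it turned out that the EOT decision is made by four cooperating layers.

Layer 1: acoustic layer (VAD LSTM → EndpointerStream × 3 to 4)

The same VAD LSTM posterior is passed through several EndpointerStreams with different thresholds:

Stream	CONTINUOUS threshold	SHORT threshold	Behavior
`prefetch.decision`	0.69	0.17	Buffering trigger (in SHORT, reacts even at low confidence)
`confirmation`	n/a	0.60	SHORT only: speech confirmation stage
`vad.decision`	0.30	0.10	Speech confirmation / end decision
`vad.decision_for_segmenter`	0.30	0.01	Gate to the Encoder (final stage)

Why the prefetch threshold is so low (0.17) in SHORT mode:
→ Buffering starts the moment the speech probability exceeds 17%, which prevents onset clipping.
→ confirmation (0.60) then filters out false triggers, giving a two-stage design.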

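Layer 2: language layer (FST + LAS Rescorer)

FST beam search scores whether the text is complete as a sentence:

Parameter	CONTINUOUS	SHORT	Meaning
`acoustic_scale`	2.300	2.125	Acoustic model weight
`beam`	5.0	5.0	FST beam width
`max_beam`	10.0	10.0	Maximum beam width
`word_end_weight`	0.0693	0.0693	Word boundary weight
`insertion_penalty`	−0.002	−0.002	Insertion penalty
LAS rescorer weight	1.3595	1.7043	N-best rescoring weight
LAS extra weight	0.0	1.4000	Extra weight, SHORT only

What the acoustic_scale difference means:

- CONTINUOUS (2.300) → trusts the acoustic model more (prioritizes what the speaker actually said)
- SHORT (2.125) → gives the LM relatively more influence, favoring command grammar

What the LAS weight difference means:

- SHORT (1.70) is 25% higher than CONTINUOUS (1.36) → the 2nd-pass rescorer is trusted more
- SHORT has a dedicated extra weight of 1.40 → presumably combined with the confirmation-stage score

Layer 3: integration layer (internal parameters of DecoderEndpointerStream)

Fixed parameters measured from the binary extension fields of the config (shared by CONTINUOUS and SHORT):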

acoustic_endpointer_weight  =   3.803557   ← weight of the VAD acoustic signal
silence_penalty_floor       = −32.002899   ← log-probability floor during silence
large_beam_floor            = −26837.197   ← floor value for beam search
lm_endpointer_weight        =   3.631859   ← weight of the language score
threshold                   =  −2.046974   ← end-of-turn threshold (log space)
score_bias                  = 604.288      ← bias on the decision score (log space)

Additional SHORT-only parameters (extra 221 bytes):

normalization_a    = −2.1251    ← score normalization
normalization_b    =  0.3751    ← score normalization
weight_extra_1     =  3.1500    ← extra EOQ weight
weight_extra_2     =  2.0186    ← confirmation weight
large_penalty_1    = −57.502    ← strong penalty (false-trigger prevention)
small_penalty      = −0.3797
weight_extra_3     =  1.6000
micro_threshold    =  0.0007    ← tiny threshold (hotword decision?)
small_threshold    =  0.1000
large_penalty_2    = −47.450    ← strong penalty
weight_final       =  2.0000

Layer 4: hotword layer (SHORT only)

In SHORT mode, hotwords are built directly into the EOT logic of DecoderEndpointerStream:

"ok google"    ← detected → turn is reset immediately
"okay google"  ← same
"hey google"   ← same
" google"      ← same

When one of these is detected, it is treated as the start of a new turn rather than as EOT.

Overall decision flow

Audio frames (paired values are CONTINUOUS/SHORT)
    │
    ├─[VAD LSTM]─────────────────────────────────────────────────────────
    │  posterior → EndpointerStream (4 stages)
    │                 │
    │         prefetch (0.69/0.17) → start buffering
    │         confirmation (0.60 / SHORT only) → speech confirmed
    │         vad.decision (0.30/0.10) ──────────────────────────────────┐
    │         vad.decision_for_segmenter ──→ [Segmenter] ─→ [Encoder]   │
    │                                                           │         │
    │                                                    [FST Decoder]   │
    │                                                    acoustic_scale  │
    │                                                    = 2.300/2.125   │
    │                                                           │         │
    │                                                    [LAS Rescorer]  │
    │                                                    weight          │
    │                                                    = 1.36/1.70     │
    │                                                           │         │
    └──────────────────────────────────────────────────→ [DecoderEndpointerStream]
                                                          │
                           acoustic_weight=3.80           │
                           lm_weight=3.63                 │
                           silence_penalty=−32.0          │
                           threshold=−2.05                │
                                                          ▼
                                                   end turn / continue

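The essential differences between CONTINUOUS and SHORT

Aspect	CONTINUOUS	SHORT
VAD sensitivity	conservative (prefetch=0.69)	aggressive (prefetch=0.17)
Onset clipping	more likely	prevented by buffering + two stages
False-trigger prevention	high prefetch threshold	confirmation (0.60)
LM reliance	acoustics first (2.300)	more balanced with the LM (2.125)
LAS Rescorer	lighter (1.36)	heavier (1.70+1.40)
Hotwords	none	yes (ok/okay/hey google)
Extra parameters	none	221 bytes of extended settings

All parameters in detail (measured from the config)

FstSearchParams (beam search settings)

Parameter	CONTINUOUS	SHORT	Meaning
beam	5.0	5.0	FST beam width
max_beam	10.0	10.0	Maximum beam width
word_end_weight	0.0693	0.0693	Word boundary weight (≈ln(1.072))
insertion_penalty	−0.002	−0.002	Insertion penalty
acoustic_scale	2.300	2.125	Acoustic model scale factor
n-best	100	100	Number of hypotheses kept by the search (the top 8 are rescored by LAS)

LasNbestRescorer parameters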

n-best candidates: 8     ← the top 8 hypotheses are rescored with LAS
score_weight: 1.0

SegmenterStream parameters

CONTINUOUS: max_silence ≈ 30,000 units / max_session ≈ 60,000 units
SHORT:      shorter settings

EOQ connection graph (CONTINUOUS mode, complete)

input
  └─ AudioDecoderStream → waveform_frame → windowed_frame → fft_energies
                                                    └─ filter_bank (80-dim log-mel)
                                                              └─ frame_stacker
                                                                    └─ sampled_stacked_filterbanks
                                                                              │
                    ┌─────────────────────────────────────────────────────────┤
                    │ [VAD LSTM path]                                          │
                    ▼                                                          │
             vad.cluster_id (AppendClusterIdStream)                            │
                    └─ vad.posterior (LstmComputeStream, 436KB LSTM)           │
                              │                                                │
                    ┌─────────┼─────────────────────┐                         │
                    │         │                      │                         │
                    ▼         ▼                      ▼                         │
           prefetch.decision  vad.audio_level_events  vad.decision             │
           (EndpointerStream) (AudioLevelEventStream)  (EndpointerStream)      │
                    │                                  │                        │
                    │               ┌──────────────────┤                        │
                    │               │ vad.decision      │                        │
                    │               │ _for_segmenter    │                        │
                    │               │ (EndpointerStream)│                        │
                    │               │      │            │                        │
                    │               │      ▼            │                        │
                    │               │ endpointer_events │                        │
                    │               │ (EndpointerEventStream)                    │
                    │               │      │            │                        │
                    │               └─► segmenter ◄────┘                        │
                    │               (SegmenterStream)                            │
                    │                    │ ◄──────────────────────────────────────┘
                    │        [only speech frames pass]
                    │                    │
                    │         tflite_frame_normalize_stream (FrameNormalizeStream)
                    │              └─ mean_stddev (240 dims)
                    │                    │
                    │         audio_features.clusterid (AppendClusterIdStream)
                    │                    │
                    │         stack_for_stacking_layer (FrameStacker)
                    │                    │
                    │         sample_for_stacking_layer (SubsampleStream)
                    │                    │
                    │              conformer_encoder (CausalEncoderStream)
                    │                    │
                    └──► decoder (DecoderEndpointerStream)
                         (vad.decision + prefetch.decision + encoder output → turn decision)
                              │
                         concat_endpointer_events (ParallelConcatStream)
                              │
                         concat_endpointer_event_filter (EndpointerEventFilterStream)
                              │
                         finalize_result.combined_result (CombinedResultStream)
                              │
                         frame_events → nbest_event_filter → recognition_events

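Why the VAD sits in front of the encoder

Concern	Countermeasure
The encoder is heavy (107MB) → wasted compute on silent frames	Filter first with the VAD (436KB)
Silent frames could pollute the Conformer state	SegmenterStream blocks silent frames
Keeping things real-time	The VAD is light (436KB) → low-latency decisions

How the end of a turn is decided

DecoderEndpointerStream integrates several signals: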

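Acoustic signals ─── vad.decision (silence detection)
                     prefetch.decision (look-ahead prediction)
                     │
                     ▼
Language signal ──── FST score (sentence completeness)
                     │
                     ▼
              DecoderEndpointerStream
              ┌──────────────────────────────────────┐
              │ VAD: "silence present"               │
              │ FST score high → sentence complete   │ → end session, emit output
              │ FST score low  → sentence incomplete │ → continue session
              └──────────────────────────────────────┘

EOQ alone does not close the turn. It is judged together with linguistic sentence completeness.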
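
The rule in the box can be written as a tiny decision function. This is a deliberately simplified sketch: `should_end_turn`, the normalized `lm_completion_score`, and the 0.5 threshold are all hypothetical, and the real DecoderEndpointerStream combines weighted log-space scores using the parameters listed earlier rather than a boolean AND.

```python
def should_end_turn(vad_silence: bool, lm_completion_score: float,
                    completion_threshold: float = 0.5) -> bool:
    """End the turn only when silence and linguistic completeness agree."""
    return vad_silence and lm_completion_score >= completion_threshold


# Silence after a complete sentence: close the turn
print(should_end_turn(True, 0.9))   # True
# Silence in the middle of a sentence: keep the session open
print(should_end_turn(True, 0.2))   # False
# Still speaking: never close, whatever the LM says
print(should_end_turn(False, 0.9))  # False
```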

Recognition mode comparison

Item	CONTINUOUS	SHORT
Use	Captions, live transcription	IME input, voice commands
VAD model	`DEFAULT_BF_VAD_EP_MD_UF_DARWINN_CONTINUOUS`	`DEFAULT_BF_EOQ_EP_MD_UF_DARWINN_SHORT`
Endpointer	relaxed (conversation continues)	strict (detects end of utterance)
Cluster ID	VOICE_SEARCH, CAPTION, FARFIELD, TELEPHONY	FARFIELD only
Confirmation step	none	`confirmation` EndpointerStream
Entity Injection	none	JDQ_ONDEVICE (contacts, songs, etc.)
Hotwords	not referenced	"ok google", "okay google", "hey google"
Identifier suffix	`ONDEVICE_LARGE_CONTINUOUS`	`ONDEVICE_LARGE_SHORT`

Tokenizer in detail

Vocabulary size

Item	Value
Total tokens	about 7,388
Of which Japanese tokens	about 6,902 (93%)
Remainder	ASCII / symbols / special tokens
File	`endtoendmodel/large.wpm.portable` (188KB, protobuf)
Vocabulary symbol table	`endtoendmodel/large.syms.compact` (31KB, OpenFST CompactSymbolTable)

Word boundary marker

▁ (U+2581) is prepended to the first piece of each word (the same convention as SentencePiece):

▁ありがとう   ← start of a word
う             ← middle of a word (subword continuation)
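
Given that convention, word pieces can be merged back into words by starting a new word at every ▁. The sketch below assumes nothing beyond the marker itself; `join_pieces` and the example pieces are hypothetical and were not taken from the actual vocabulary file.

```python
WORD_START = "\u2581"  # ▁


def join_pieces(pieces: list[str]) -> list[str]:
    """Group word pieces into words using the ▁ word-start marker."""
    words: list[str] = []
    for piece in pieces:
        if piece.startswith(WORD_START) or not words:
            words.append(piece.lstrip(WORD_START))
        else:
            words[-1] += piece
    return words


print(join_pieces(["▁ありがと", "う", "▁ござい", "ます"]))
# ['ありがとう', 'ございます']
```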

Special tokens

<sos>        ← start of sentence (initial decoder input)
</S>         ← end of sentence
<UNK>        ← unknown word
<noise>      ← environmental noise
<sil>        ← silence
<epsilon>    ← FST epsilon transition
<text_only>  ← text-only mode
<phi>        ← for Entity Injection
<sigma>      ← for Entity Injection

Entity class tokens (dynamic context injection)

Enabled in SHORT mode only:

<contacts> / </contacts>     ← contact names (CONTACTS)
<songs> / </songs>           ← song titles (SONG)
<pkg_contacts> / </pkg_contacts>   ← packaged contacts
<pkg_songs> / </pkg_songs>         ← packaged songs
<pkg_artists> / </pkg_artists>     ← artist names
<apps> / </apps>             ← app names (APP)
#NUMBER                      ← number entity

Injected at run time by the JDQ_ONDEVICE (Just-in-time Dynamic Query) system:

- pattern_tagger.fst → maps phoneme sequences to entities
- token_masker.fst → vocabulary masking
- blocklist.fst → applies the blocklist
- base_wordlist.syms → base word list

Punctuation LSTM in detail

Architecture

Item	Value
File size	2.4MB (int8-quantized)
Architecture	Bidirectional LSTM
Vocabulary size	2,332 tokens
Quantization	int8

Weight tensor structure (BidirectionalSequenceLSTM)

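Forward LSTM (fw):
  fw_input2input_weights1    (input → input gate)
  fw_input2forget_weights2   (input → forget gate)
  fw_input2cell_weights3     (input → cell gate)
  fw_input2output_weights4   (input → output gate)
  fw_rec2input_weights5      (recurrent → input gate)
  fw_rec2forget_weights6     (recurrent → forget gate)
  fw_rec2cell_weights7       (recurrent → cell gate)
  fw_rec2output_weights8     (recurrent → output gate)
  fw_cell2input_weights9     (peephole, input gate)
  fw_cell2forget_weights10   (peephole, forget gate)
  fw_cell2output_weights11   (peephole, output gate)
  fw_*_bias12-15
  fw_activation_state35, fw_input_cell_state36

Backward LSTM (bw): same structure (bw_*18-38)

Downstream:
  weights50, bias51 → output52 → ... → output57

Peephole connections are present: this is the LSTM variant in which the cell state feeds directly into the gates.

Punctuation conversion rules (from punctuation_converter_config.pb)

Japanese punctuation is converted from spoken form to written form:

- "まる" / "ピリオド" → 。 / ． (several spoken and written variants are handled)
- "てん" / "コンマ" → 、 / ，
- type: INTERMEDIATE = may also be inserted mid-sentence
- upcase_after: false = unlike English, Japanese has no capitalization after punctuation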
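
The spoken-to-written mapping can be pictured as a lookup table. The sketch below is a naive illustration only: the `SPOKEN_TO_WRITTEN` dictionary covers just the four spoken forms listed above, picks one written variant for each, and replaces every matching token unconditionally. The real system decides where punctuation belongs with the BiLSTM, so it would not blindly convert an ordinary word that happens to read "てん".

```python
# Hypothetical, minimal mapping based on the rules listed above
SPOKEN_TO_WRITTEN = {
    "まる": "。",
    "ピリオド": "。",
    "てん": "、",
    "コンマ": "、",
}


def convert_punctuation(tokens: list[str]) -> str:
    """Replace spoken punctuation tokens with their written form."""
    return "".join(SPOKEN_TO_WRITTEN.get(tok, tok) for tok in tokens)


print(convert_punctuation(["今日は", "てん", "晴れ", "です", "まる"]))
# 今日は、晴れです。
```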

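Language model FST in detail

What an FST is

FST = Finite State Transducer.

Here it holds n-gram probabilities learned from a text corpus, compiled ahead of time into a huge graph (a state transition diagram). It scores linguistic naturalness: which word is likely to come after this one?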

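Which words are likely after "今日は"?
  → "天気" (high probability)
  → "何を" (high probability)
  → "宇宙船" (low probability)

The FST takes text tokens as input, not audio (mel spectrograms and the like).
Audio is first converted into token sequences by the Encoder → RNN-T Decoder, and those are passed to the FST.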
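
The n-gram idea can be shown with a toy bigram model. The three-sentence corpus and the add-one smoothing are illustrative assumptions on my part; a production LM uses far more data and backoff smoothing, and is pruned and compiled into an FST rather than kept in Python counters.

```python
import math
from collections import Counter

# Tiny pre-segmented toy corpus (hypothetical)
corpus = [
    ["今日は", "天気", "が", "いい"],
    ["今日は", "何を", "する"],
    ["今日は", "天気", "が", "悪い"],
]

unigrams = Counter(w for sent in corpus for w in sent)
bigrams = Counter((a, b) for sent in corpus for a, b in zip(sent, sent[1:]))
vocab_size = len(unigrams)


def bigram_logprob(prev: str, word: str) -> float:
    """Add-one smoothed log P(word | prev)."""
    return math.log((bigrams[(prev, word)] + 1) / (unigrams[prev] + vocab_size))


for candidate in ("天気", "何を", "宇宙船"):
    print(candidate, round(bigram_logprob("今日は", candidate), 3))
```

The unseen continuation "宇宙船" still gets a finite score thanks to smoothing, but it ranks below the continuations that appear in the corpus.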

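Why an FST rather than a neural LM

Aspect	FST	Neural LM
Speed	Very fast (graph traversal only)	Slow (matrix operations)
Integration with beam search	Can run in lockstep	Needs batching
Memory	Small (12MB after pruning)	Large
Accuracy	n-gram level	High (contextual understanding)

The FST is used so that FST scoring can run alongside the RNN-T beam search (100 hypotheses at once) in real time.

File layout

File	Size	Description
`lm.pruned.sorted.fst`	12MB	Main LM FST (pruned and sorted)
`transducer.pruned.fst`	3.6MB	Transducer FST
`lm.pruned.sorted.syms`	1.4MB	Vocabulary symbol table
`large.syms.compact` (ja)	31KB	Symbol table used by the Decoder (compact format; a symbol table, not an FST)
`large.syms.compact` (en)	22KB	English version

- SigmaFst type = an FST with sigma transitions (which match any symbol)
- Beam width and related settings are controlled through FstSearchParams during beam search
- DecoderEndpointerStream consults the FST score to judge sentence completeness (lm_weight=3.632)

Position within the SODA pipeline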

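RNN-T Decoder (beam search, n-best=100)
    ↓ token-sequence hypotheses
FST language model
    ↓ hypotheses with language scores
LAS Rescorer (narrows to the top 8 for careful evaluation)
    ↓
DecoderEndpointerStream (EOT decision)

Why the LAS Rescorer is needed

Limits of RNN-T alone

RNN-T emits tokens frame by frame, strictly left to right (unidirectional and local). It is weak at the global judgment of which interpretation is most natural given the whole sentence.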

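Beam search results (scores nearly tied):
  Hypothesis 1: "今日は天気がいいですね"   score: -12.3
  Hypothesis 2: "今日は点気がいいですね"   score: -12.5  ← "天" misheard as "点"
  Hypothesis 3: "今日は天気がいい世ね"     score: -12.7

When scores are this close, RNN-T + FST alone sometimes cannot reject the errors.

How the LAS Rescorer solves this

It re-evaluates each hypothesis against the Encoder output with attention over the whole utterance (self-attention within the hypothesis plus cross-attention to the encoder output).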

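Encoder output (full audio context)
    ↓
LAS cross-attention ← is "点気" acoustically plausible?
    ↓
"天気" is more natural both acoustically and linguistically → hypothesis 1 is selected

The division of labor: RNN-T generates with speed as the priority, and LAS selects with accuracy as the priority.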
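
N-best rescoring itself is simple once the second-pass scores exist. In the sketch below the first-pass scores are the ones from the example above and the weight is the CONTINUOUS LAS rescorer weight reported earlier; the second-pass scores are made up, and the additive combination `first + weight * second` is my assumption about the form, not something confirmed from the binary.

```python
RESCORER_WEIGHT = 1.3595  # CONTINUOUS value reported in the config tables


def rescore(nbest: list[tuple[str, float, float]],
            weight: float = RESCORER_WEIGHT) -> list[tuple[str, float]]:
    """Combine first-pass and rescorer log scores, best hypothesis first."""
    combined = [(text, first + weight * second) for text, first, second in nbest]
    return sorted(combined, key=lambda item: item[1], reverse=True)


# (text, first-pass score, hypothetical LAS log score)
nbest = [
    ("今日は天気がいいですね", -12.3, -1.0),
    ("今日は点気がいいですね", -12.5, -6.0),
    ("今日は天気がいい世ね", -12.7, -4.0),
]
best_text, best_score = rescore(nbest)[0]
print(best_text)  # 今日は天気がいいですね
```

A gap of 0.2 in the first pass becomes a gap of several points after rescoring, which is what lets the second pass break near-ties.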

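Japanese vs. English model sizes

Component	Japanese	English
VAD LSTM	436KB	471KB
Symbol table (syms.compact)	31KB	22KB
LAS Rescorer (acoustic)	3.8MB	3.1MB
LAS Rescorer (full)	none	48MB

Japanese has no 48MB full LAS.
English ships a full LAS model in addition to the acoustic one, while Japanese relies on the lightweight acoustic rescorer alone (which, as its compute graph shows, still contains self-attention and cross-attention).

Special tokens (learned from the English version)

Special tokens in the speech recognition vocabulary (wpm.portable):

Token	Use
`<s>` / `</s>`	Sentence start / end (used by the FST)
`<UNK>`	Unknown word
`<noise>`	Noise
`<epsilon>`	RNN-T blank ("emit nothing")
`<reject>`	Recognition rejected
`{comma}`, `{period}`, `{new-line}`, etc.	Punctuation entered by voice command (for dictation)
`{smiley-face}`, `{heart}`, etc.	Spoken emoji
`<contacts>`, `<song>`, `<app>`, etc.	Context tags for entity injection

The <pause> and <eos> tokens described in Google's papers are not present. The EOT decision is made not at the token level but by DecoderEndpointerStream.

Building the FST and LAS Rescorer yourself

FST (n-gram LM) → no GPU needed; can be built with KenLM

The FST is not a neural network: you just count n-grams in a text corpus and compile them into a graph.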

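# Train the n-gram model (Japanese text must be pre-segmented into space-separated tokens)
kenlm/build/bin/lmplz -o 4 --memory 8G < corpus.txt > lm.arpa

# Convert to binary
kenlm/build/bin/build_binary trie lm.arpa lm.bin

Rough data requirements:

Data	Size	Quality
Wikipedia only	about 20GB (uncompressed)	written language only, weak on colloquial speech
Wikipedia + news	30 to 50GB	practical minimum
+ Common Crawl (CC-100)	85GB more	adds colloquial and social-media expressions

For Japanese, whether colloquial data (social media, conversation) is included makes a large difference in quality.

Integration with sherpa-onnx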

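sherpa-onnx provides `--lm` and `--lm-scale` options for LM rescoring. As far as I know, `--lm` expects a neural LM exported to ONNX rather than a KenLM binary, so check the sherpa-onnx documentation for the currently supported n-gram path before relying on the command below.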

# --lm-scale loosely plays the role of the inverse of SODA's acoustic_scale
sherpa-onnx \
    --lm=lm_4gram.bin \
    --lm-scale=0.1 \
    ...

Search --lm-scale in the range 0.05 to 0.3 for the value that minimizes WER.
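
The scale search boils down to decoding a dev set at each candidate value and keeping the one with the lowest error rate. The sketch below uses character error rate, which is a common substitute for WER in Japanese when no word segmentation is applied; that substitution, the helper names, and the toy decoder outputs in `results` are all assumptions for illustration.

```python
def edit_distance(ref: str, hyp: str) -> int:
    """Character-level Levenshtein distance."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        cur = [i]
        for j, h in enumerate(hyp, start=1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (r != h)))  # substitution or match
        prev = cur
    return prev[-1]


def cer(refs: list[str], hyps: list[str]) -> float:
    """Character error rate over a set of utterances."""
    errors = sum(edit_distance(r, h) for r, h in zip(refs, hyps))
    return errors / sum(len(r) for r in refs)


def pick_best_scale(results: dict[float, list[str]], refs: list[str]) -> float:
    """Return the LM scale whose decoded output has the lowest CER."""
    return min(results, key=lambda scale: cer(refs, results[scale]))


refs = ["今日は天気がいい"]
# Hypothetical decoder outputs for each candidate --lm-scale value
results = {
    0.05: ["今日は点気がいい"],
    0.1: ["今日は天気がいい"],
    0.3: ["今日は天気いい"],
}
print(pick_best_scale(results, refs))  # 0.1
```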

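LAS Rescorer → trainable on a 3090 (24GB), but data is the problem

Item	Status
Model size (48MB)	Easily fits on a 3090 (24GB VRAM)
Architecture	Transformer, many OSS implementations
Training data	This is the bottleneck

Training LAS requires encoder outputs over a large amount of audio. Since SODA's Encoder cannot be used, the setup ends up substituting an encoder such as sherpa-onnx's Zipformer.

A realistic setup (feasible on a 3090):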

Zipformer encoder (sherpa-onnx, OSS)
    ↓ Encoder output
Custom LAS Rescorer (trained in PyTorch)
    ↓
Custom FST (KenLM + large corpus)

The GPU is not the issue. The wall is the amount of paired data (audio + reference transcripts).
A realistic guideline is at least several hundred hours for Japanese and several thousand hours for English.

Language identification (LangID) model

Item	Value
File size	2.4MB
Architecture	Conformer Encoder (12 layers, smaller than the ASR encoder)
Supported languages	43
VAD used	`SODA_DICTATION_EP_UNIFIED_FRONTEND_LANGID` (471KB)
Feature dimensions	528 (derived from the mean_stddev file size)

Supported languages: af-ZA, ar-EG, bg-BG, ca-ES, cmn-Hans-CN, cs-CZ, da-DK, de-DE, el-GR, en-US, es-US, eu-ES, fa-IR, fi-FI, fil-PH, fr-FR, gl-ES, he-IL, hi-IN, hu-HU, id-ID, is-IS, it-IT, ja-JP, ko-KR, lt-LT, ms-MY, nb-NO, nl-NL, pl-PL, pt-BR, ro-RO, ru-RU, sk-SK, sl-SI, sr-RS, sv-SE, th-TH, tr-TR, uk-UA, vi-VN, yue-Hant-HK, zu-ZA

The LangID model uses the same Conformer architecture as the ASR model (conf_0 to conf_11 and the multihead_atten of trans_atten are visible).

Context features (DeviceContext)

Device-state features readable from the config (only when integrated with the Assistant):

alarm-status      ← alarm state (FIRING/SCHEDULED/SNOOZED)
timer-status      ← timer state (FIRING/PAUSED/RUNNING)
call-state        ← call state (INCOMING/OUTGOING/IN_CALL)
media-state       ← playback state (PLAYING)
hotword-active    ← whether hotword detection is active
client-id         ← client type:
                     AUTO, MOBILE_WARM_WORDS, NGA, ENHANCED_VOICE_DICTATION, HUB_MODE
foreground-app    ← foreground app (Android package name):
                     com.google.android.gm (Gmail)
                     com.google.android.keep (Keep)
                     com.google.android.apps.messaging
                     com.whatsapp
                     com.google.android.apps.dynamite (Chat)
                     com.google.android.apps.docs.editors.docs
input-field-type  ← input field type (TYPE_TEXT_VARIATION_EMAIL_ADDRESS, etc.)
latitude/longitude ← location
asr-state         ← ASR state (HOTQUERY/not-found)
experiment-labels ← A/B test labels

The context module switches depending on the foreground app:

- Gmail running → email addresses and contacts are prioritized
- Keep running → note content is prioritized
- Messages/WhatsApp → contact names are prioritized
Design notes for a clone

Required components and alternatives

Component	This model	OSS alternative
Conformer Encoder (Causal)	12 layers, Macaron style, 107MB	ESPnet / k2 / NeMo
RNN-T Decoder	Simple LSTM	Torchaudio RNN-T / ESPnet
Joint Network (posterior+prior)	Split design that allows LM fusion	A standard RNN-T joint also works
LAS Rescorer	2nd pass, 3.8MB	Optional (for quality)
Vocabulary	WordPiece ~7400	SentencePiece (BPE/Unigram)
Language model	Pruned FST (12MB)	KenLM + OpenFST
VAD	LSTM 436KB	Silero VAD / WebRTC VAD
Punctuation	BiLSTM int8 2.4MB	CTranslate2 + punct model

Japanese-specific processing

- mozc.dic (9.4MB): the Google Japanese Input dictionary → alternatives: NEologd, UniDic
- JapaneseSegmentedVariantsProvider: generates reading (kana) variants with Mozc
- JapaneseContactNameTextTransformer: handles readings of personal names
- JapaneseTextSegmentationTransformer: word segmentation for TTS
- portable_ja_verbalizer.far (1.2MB): FST that verbalizes digits and symbols in Japanese

Scale reference for a clone

Aspect	This SODA model	Typical OSS Japanese ASR
Encoder layers	12 (Conformer)	12 to 24 (Transformer/Conformer)
Input features	80-mel, 10ms shift	80-mel, 10ms shift ← same
Vocabulary	7,388 WordPiece	3,000 to 10,000 BPE/Unigram
Total size	138MB	100 to 300MB (depends on accuracy)
Training data scale	unknown (Google internal)	ESPnet recommendation: 1,000h+

デバッグメモ

問題1: libsoda.so が読み込めない

OSError: libc++.so.1: cannot open shared object file → sudo apt-get install -y libc++1

問題2: .img 展開失敗 (0バイト)

ChromeOS DLC は Squashfs 形式。7z 非対応。 → unsquashfs を使う (squashfs-tools パッケージ)

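Problem 3: arecord “Device or resource busy”

PipeWire holds plughw:X,X exclusively. → Use parec (PulseAudio recording)

Problem 4: audio is received but no recognition results appear (volteer)

The bitflip patch is not implemented for the volteer libsoda.so. → Use the libsoda.so from hatch (Intel 10th gen), which is x86_64 compatible

Problem 5: microphone gain is too low

The default is 27% → the RMS is too small for recognition. → pactl set-source-volume <source> 65% → 100% clips (peak=32768), so 60-70% is stable

Problem 6: the rms in `AUDIO_LEVEL` is always 0.0

The rms in AUDIO_LEVEL messages is the level of the loopback audio, not the microphone. This is normal behavior.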

Checking the microphone audio level:

import struct, math
# `audio` is assumed to hold raw s16le PCM bytes captured from the microphone
samples = [struct.unpack_from('<h', audio, i)[0] for i in range(0, len(audio)-1, 2)]
rms = math.sqrt(sum(s*s for s in samples) / len(samples))
# silence: ~100-500, speech: ~1000-5000
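
The same check can be wrapped into a function that also reports the peak, which makes the clipping case from Problem 5 visible. The following is a minimal sketch, assuming `audio` is raw s16le mono PCM. The function name `audio_level`, the returned keys, and the `clipped` rule (peak reaching full scale) are assumptions made for illustration.

```python
import math
import struct


def audio_level(audio: bytes) -> dict:
    """Compute RMS and peak of raw s16le PCM bytes."""
    n = len(audio) // 2  # number of complete 16-bit samples
    if n == 0:
        return {"rms": 0.0, "peak": 0, "clipped": False}
    samples = struct.unpack(f"<{n}h", audio[: n * 2])
    rms = math.sqrt(sum(s * s for s in samples) / n)
    peak = max(abs(s) for s in samples)
    # s16 full scale is 32767 / -32768; treat reaching it as clipping (assumption).
    return {"rms": rms, "peak": peak, "clipped": peak >= 32767}
```

By the rough figures above, an rms around 100-500 suggests silence and around 1000-5000 suggests speech, while `clipped` being True suggests the gain is set too high.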

Run log

Log from actually running it locally for debugging:

I0000 00:00:1780981769.242031  131504 terse_processor.cc:956] Final recognition has been created.
I0000 00:00:1780981769.292684   81168 terse_processor.cc:3025] TerseProcessor HasRecognitionEvent With Final
I0000 00:00:1780981769.292715   81168 terse_processor.cc:3355] Audio pin advanced by: 3.27s
I0000 00:00:1780981769.292757   81168 terse_processor.cc:3415] Longform resets session because this session is 3.27s long.
I0000 00:00:1780981769.292768   81168 terse_processor.cc:1645] Cancelling session.
W0000 00:00:1780981769.293127   81168 decoder_endpointer_stream.cc:242] Acoustic ep reader thread cancelled.
W0000 00:00:1780981769.293153   81168 decoder_endpointer_stream.cc:242] Prefetch reader thread cancelled.
I0000 00:00:1780981769.293413   81168 terse_processor.cc:1492] Starting session.
I0000 00:00:1780981772.313959  131538 terse_processor.cc:956] Final recognition has been created.
I0000 00:00:1780981772.361874   81168 terse_processor.cc:3025] TerseProcessor HasRecognitionEvent With Final
I0000 00:00:1780981772.361964   81168 terse_processor.cc:3355] Audio pin advanced by: 3.09s
I0000 00:00:1780981772.362024   81168 terse_processor.cc:3415] Longform resets session because this session is 3.09s long.
I0000 00:00:1780981772.362034   81168 terse_processor.cc:1645] Cancelling session.
W0000 00:00:1780981772.362443   81168 decoder_endpointer_stream.cc:242] Acoustic ep reader thread cancelled.
W0000 00:00:1780981772.362457   81168 decoder_endpointer_stream.cc:242] Prefetch reader thread cancelled.
I0000 00:00:1780981772.362697   81168 terse_processor.cc:1492] Starting session.
* こんにちは。
* 今日 は 暑 いっ す ね。
I0000 00:00:1780981778.144007  131597 terse_processor.cc:956] Final recognition has been created.
I0000 00:00:1780981778.183203   81168 terse_processor.cc:3025] TerseProcessor HasRecognitionEvent With Final
I0000 00:00:1780981778.183354   81168 terse_processor.cc:3355] Audio pin advanced by: 5.82s
I0000 00:00:1780981778.183476   81168 terse_processor.cc:3415] Longform resets session because this session is 5.82s long.
I0000 00:00:1780981778.183500   81168 terse_processor.cc:1645] Cancelling session.
W0000 00:00:1780981778.183914   81168 decoder_endpointer_stream.cc:242] Acoustic ep reader thread cancelled.
W0000 00:00:1780981778.183933   81168 decoder_endpointer_stream.cc:242] Prefetch reader thread cancelled.
I0000 00:00:1780981778.184291   81168 terse_processor.cc:1492] Starting session.
I0000 00:00:1780981780.743660   81168 soda_async_impl.cc:1291] Current audio timestamp: 1780981780087778
I0000 00:00:1780981782.286159  131631 terse_processor.cc:956] Final recognition has been created.
I0000 00:00:1780981782.343491   81168 terse_processor.cc:3025] TerseProcessor HasRecognitionEvent With Final
I0000 00:00:1780981782.343561   81168 terse_processor.cc:3355] Audio pin advanced by: 4.08s
I0000 00:00:1780981782.343734   81168 terse_processor.cc:3415] Longform resets session because this session is 4.08s long.
I0000 00:00:1780981782.343751   81168 terse_processor.cc:1645] Cancelling session.
W0000 00:00:1780981782.344365   81168 decoder_endpointer_stream.cc:242] Acoustic ep reader thread cancelled.
W0000 00:00:1780981782.344380   81168 decoder_endpointer_stream.cc:242] Prefetch reader thread cancelled.
I0000 00:00:1780981782.344607   81168 terse_processor.cc:1492] Starting session.
* トゥワード ファースト アンド アキュレイト お なかなか いい。 度 い や い い。 よ く で き
I0000 00:00:1780981785.503698  131657 terse_processor.cc:956] Final recognition has been created.
I0000 00:00:1780981785.543589   81168 terse_processor.cc:3025] TerseProcessor HasRecognitionEvent With Final
I0000 00:00:1780981785.543892   81168 terse_processor.cc:3355] Audio pin advanced by: 3.24s
I0000 00:00:1780981785.543985   81168 terse_processor.cc:3415] Longform resets session because this session is 3.24s long.
I0000 00:00:1780981785.543999   81168 terse_processor.cc:1645] Cancelling session.
W0000 00:00:1780981785.544823   81168 decoder_endpointer_stream.cc:242] Acoustic ep reader thread cancelled.
W0000 00:00:1780981785.544850   81168 decoder_endpointer_stream.cc:242] Prefetch reader thread cancelled.
I0000 00:00:1780981785.545216   81168 terse_processor.cc:1492] Starting session.
* 制度 いや いい。
I0000 00:00:1780981787.976933  131684 terse_processor.cc:956] Final recognition has been created.
I0000 00:00:1780981788.042046   81168 terse_processor.cc:3025] TerseProcessor HasRecognitionEvent With Final
I0000 00:00:1780981788.042132   81168 terse_processor.cc:3355] Audio pin advanced by: 2.46s
I0000 00:00:1780981788.042187   81168 terse_processor.cc:3415] Longform resets session because this session is 2.46s long.
I0000 00:00:1780981788.042198   81168 terse_processor.cc:1645] Cancelling session.
W0000 00:00:1780981788.042530   81168 decoder_endpointer_stream.cc:242] Acoustic ep reader thread cancelled.
W0000 00:00:1780981788.042550   81168 decoder_endpointer_stream.cc:242] Prefetch reader thread cancelled.
I0000 00:00:1780981788.042832   81168 terse_processor.cc:1492] Starting session.
I0000 00:00:1780981791.378381  131706 terse_processor.cc:956] Final recognition has been created.
I0000 00:00:1780981791.434164   81168 terse_processor.cc:3025] TerseProcessor HasRecognitionEvent With Final
I0000 00:00:1780981791.434272   81168 terse_processor.cc:3355] Audio pin advanced by: 3.42s
I0000 00:00:1780981791.434355   81168 terse_processor.cc:3415] Longform resets session because this session is 3.42s long.
I0000 00:00:1780981791.434371   81168 terse_processor.cc:1645] Cancelling session.
W0000 00:00:1780981791.434811   81168 decoder_endpointer_stream.cc:242] Acoustic ep reader thread cancelled.
W0000 00:00:1780981791.434831   81168 decoder_endpointer_stream.cc:242] Prefetch reader thread cancelled.
I0000 00:00:1780981791.435088   81168 terse_processor.cc:1492] Starting session.
* よく でき てる じゃん ま そら そっ か モデル の 重 さ が 全然 違う から な。
I0000 00:00:1780981810.275178  131733 terse_processor.cc:956] Final recognition has been created.
I0000 00:00:1780981810.314296   81168 terse_processor.cc:3025] TerseProcessor HasRecognitionEvent With Final
I0000 00:00:1780981810.314378   81168 terse_processor.cc:3355] Audio pin advanced by: 18.87s
I0000 00:00:1780981810.314577   81168 terse_processor.cc:3415] Longform resets session because this session is 18.87s long.
I0000 00:00:1780981810.314622   81168 terse_processor.cc:1645] Cancelling session.
W0000 00:00:1780981810.315045   81168 decoder_endpointer_stream.cc:242] Acoustic ep reader thread cancelled.
W0000 00:00:1780981810.315065   81168 decoder_endpointer_stream.cc:242] Prefetch reader thread cancelled.
I0000 00:00:1780981810.315482   81168 terse_processor.cc:1492] Starting session.
I0000 00:00:1780981810.692990   81168 soda_async_impl.cc:1291] Current audio timestamp: 1780981810087778
* ふう、 疲れ た。
I0000 00:00:1780981818.837511  131828 terse_processor.cc:956] Final recognition has been created.
I0000 00:00:1780981818.893274   81168 terse_processor.cc:3025] TerseProcessor HasRecognitionEvent With Final
I0000 00:00:1780981818.893306   81168 terse_processor.cc:3355] Audio pin advanced by: 8.55s
I0000 00:00:1780981818.893395   81168 terse_processor.cc:3415] Longform resets session because this session is 8.55s long.
I0000 00:00:1780981818.893421   81168 terse_processor.cc:1645] Cancelling session.
W0000 00:00:1780981818.893979   81168 decoder_endpointer_stream.cc:242] Acoustic ep reader thread cancelled.
W0000 00:00:1780981818.893999   81168 decoder_endpointer_stream.cc:242] Prefetch reader thread cancelled.
I0000 00:00:1780981818.894312   81168 terse_processor.cc:1492] Starting session.
I0000 00:00:1780981822.227670  131905 terse_processor.cc:956] Final recognition has been created.
I0000 00:00:1780981822.282227   81168 terse_processor.cc:3025] TerseProcessor HasRecognitionEvent With Final
I0000 00:00:1780981822.282250   81168 terse_processor.cc:3355] Audio pin advanced by: 3.39s
I0000 00:00:1780981822.282292   81168 terse_processor.cc:3415] Longform resets session because this session is 3.39s long.
I0000 00:00:1780981822.282306   81168 terse_processor.cc:1645] Cancelling session.
W0000 00:00:1780981822.282704   81168 decoder_endpointer_stream.cc:242] Acoustic ep reader thread cancelled.
W0000 00:00:1780981822.282722   81168 decoder_endpointer_stream.cc:242] Prefetch reader thread cancelled.
I0000 00:00:1780981822.282949   81168 terse_processor.cc:1492] Starting session.
I0000 00:00:1780981839.129265  131926 terse_processor.cc:956] Final recognition has been created.
I0000 00:00:1780981839.173272   81168 terse_processor.cc:3025] TerseProcessor HasRecognitionEvent With Final
I0000 00:00:1780981839.173296   81168 terse_processor.cc:3355] Audio pin advanced by: 16.92s
I0000 00:00:1780981839.173470   81168 terse_processor.cc:3415] Longform resets session because this session is 16.92s long.
I0000 00:00:1780981839.173500   81168 terse_processor.cc:1645] Cancelling session.
W0000 00:00:1780981839.173929   81168 decoder_endpointer_stream.cc:242] Acoustic ep reader thread cancelled.
W0000 00:00:1780981839.173948   81168 decoder_endpointer_stream.cc:242] Prefetch reader thread cancelled.
I0000 00:00:1780981839.174332   81168 terse_processor.cc:1492] Starting session.
I0000 00:00:1780981840.714665   81168 soda_async_impl.cc:1291] Current audio timestamp: 1780981840087778
I0000 00:00:1780981870.732931   81168 soda_async_impl.cc:1291] Current audio timestamp: 1780981870087778
I0000 00:00:1780981877.912018  132051 terse_processor.cc:956] Final recognition has been created.
I0000 00:00:1780981877.962263   81168 terse_processor.cc:3025] TerseProcessor HasRecognitionEvent With Final
I0000 00:00:1780981877.962296   81168 terse_processor.cc:3355] Audio pin advanced by: 38.85s
I0000 00:00:1780981877.962633   81168 terse_processor.cc:3415] Longform resets session because this session is 38.85s long.
I0000 00:00:1780981877.962689   81168 terse_processor.cc:1645] Cancelling session.
W0000 00:00:1780981877.963052   81168 decoder_endpointer_stream.cc:242] Acoustic ep reader thread cancelled.
W0000 00:00:1780981877.963075   81168 decoder_endpointer_stream.cc:242] Prefetch reader thread cancelled.
I0000 00:00:1780981877.963615   81168 terse_processor.cc:1492] Starting session.
I0000 00:00:1780981886.348369  132223 terse_processor.cc:956] Final recognition has been created.
I0000 00:00:1780981886.412668   81168 terse_processor.cc:3025] TerseProcessor HasRecognitionEvent With Final
I0000 00:00:1780981886.412775   81168 terse_processor.cc:3355] Audio pin advanced by: 8.37s
I0000 00:00:1780981886.412905   81168 terse_processor.cc:3415] Longform resets session because this session is 8.37s long.
I0000 00:00:1780981886.412921   81168 terse_processor.cc:1645] Cancelling session.
W0000 00:00:1780981886.413291   81168 decoder_endpointer_stream.cc:242] Acoustic ep reader thread cancelled.
W0000 00:00:1780981886.413313   81168 decoder_endpointer_stream.cc:242] Prefetch reader thread cancelled.
I0000 00:00:1780981886.413729   81168 terse_processor.cc:1492] Starting session.
* そら そっ か。
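
Session lengths can be pulled out of a log like this to see how the endpointer segments speech. The following is a minimal sketch, assuming the `Longform resets session because this session is ...s long.` line format shown above stays stable. The function name `session_lengths` is hypothetical.

```python
import re

# Matches the "Longform resets session ..." lines in the log above.
SESSION_RE = re.compile(r"this session is (\d+(?:\.\d+)?)s long")


def session_lengths(log_text: str) -> list:
    """Return session lengths in seconds, in the order they appear."""
    return [float(m.group(1)) for m in SESSION_RE.finditer(log_text)]
```

Applied to the log above, this gives sessions ranging from 2.46 s to 38.85 s.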

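Impressions

Google's SODA is not a single huge model but a pipeline built very much as a practical product.
As with a typical icefall model trained on ReazonSpeech, about 90% of the parameters are in the Encoder.
Rather than running that huge Encoder all the time, a 436KB VAD is placed in front of it.
prefetch and a ring buffer are used so that word onsets are not dropped.
End of utterance is not decided by VAD alone; the FST and LAS language scores are also considered.
For Japanese, it also includes the Mozc dictionary, reading conversion, a personal-name dictionary, and dynamic context injection.
When building a voice conversation system, it is easy to look only at the accuracy of the ASR model.
But what matters for the experience is all of it: utterance start, word-onset retention, utterance end, partial/final results, punctuation, context injection, microphone gain, and UI display.
As you would expect of something used by billions of people, it is well made.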
