Kneser-Ney Smoothing を試してみた

Kneser-Ney Smoothing は高性能な言語モデルである。と、よく聞かされて知っているつもりだけど、まだ一度も試したことがなかったので、試してみた。
コードはここ。

https://github.com/shuyo/iir/blob/master/ngram/knsmooth.py

実験用にべったり書いているのでコピペは多いし、速度やメモリの効率も悪いが、まあ気にしないで。

コーパスは最初手元の適当なニュースコーパスを使っていたんだけど、それだと再現検証できないので、nltk のコーパスを使うように変更した。
nltk.corpus の下にあるコーパスモジュール名を -c オプションで与えると、そのコーパスを使って additive smoothing と Kneser-Ney smoothing の perplexity を出してくれる。
デフォルトでは Brown コーパスを使うのだが、例えばロイターコーパスを使いたければこんな感じ。

$ ./knsmooth.py -c reuters

ちなみに、与えられた名前に対して対応するコーパスモジュールを動的に読み込むところはこんなコード。

    m = __import__('nltk.corpus', globals(), locals(), [opt.corpus], -1)
    corpus = getattr(m, opt.corpus)

これで opt.corpus に "brown" が入っていれば corpus は nltk.corpus.brown になるし、"movie_reviews" なら nltk.corpus.movie_reviews になるって仕掛け。

出力はこんな感じ。brown コーパスの場合。

$ ./knsmooth.py -c brown --seed=0
found corpus : brown (D=500)
# of terms=1045022, vocabulary size=47349

UNIGRAM:
additive smoother: alpha1=1.0000, perplexity=1171.142
Kneser-Ney: heuristic D=0.604, perplexity=1171.068
Kneser-Ney: minimum D=1.000, perplexity=1152.640
modified Kneser-Ney: perplexity=1155.159

BIGRAM:
additive smoother: alpha2=0.0031, perplexity=1335.287
Kneser-Ney: heuristic D=0.761, perplexity=562.420
Kneser-Ney: minimum D=0.872, perplexity=556.637
modified Kneser-Ney: perplexity=543.748

TRIGRAM:
additive smoother: alpha3=0.0010, perplexity=9491.406
Kneser-Ney: heuristic D=0.881, perplexity=549.436
Kneser-Ney: minimum D=0.965, perplexity=542.305
modified Kneser-Ney: perplexity=520.433

コーパスからランダムにある割合でテストデータを選び、残りを訓練コーパスとしている。テストデータの割合は -r オプションで与えられる(デフォルト = 0.1)。テストデータを選ぶ乱数のシードは --seed で与える。
単語はコーパスモジュールの words() が返すものをすべて使っている（つまり記号類も有効）。正規化は小文字化のみしている。

結果は 1-gram から 3-gram まで、それぞれ以下の perplexity を求めている。

加算スムージング。パラメータは [0.0, 1.0] の範囲で perplexity が最小になる値を選ぶ
Kneser-Ney スムージング。パラメータはヒューリスティックな値を選ぶ
Kneser-Ney スムージング。パラメータは [0.0, 1.0] の範囲で perplexity が最小になる値を選ぶ
modified Kneser-Ney スムージング。パラメータはヒューリスティックな値を選ぶ

結果をまとめると、こんな感じか。

1-gram では大差はない( interpolate ほぼしないから当然っちゃあ当然 )
2-gram 以上では Kneser-Ney は加算スムージングよりはるかに良い性能を叩き出す
modified Kneser-Ney はオリジナルの Kneser-Ney より確かに性能が良いが、定性的に違いがわかるほどのレベルではないだろう。計算コストを考えるとオリジナルで十分に思える。
3-gram で加算スムージングが悪化しているのはコーパスが小さいせいか（それにしても悪すぎるんだが……）。Kneser-Ney はそんな状況でもちゃんと性能アップしている。

というわけで、確かに Kneser-Ney はなかなかいいね、という結果に。
KN で尤度を求めるには N1+ やらを数えて覚えておかないといけないので大変。modified KN だと数えておかないといけないものがさらに多くて、KN の2倍くらい大変。
でも尤度を求めるのではなく、「文生成などのために次の単語の分布を求める」という場合には N1 類を数えて保持しておく必要がない（語彙をぐるっと舐めている間に、どれだけ discount したかは当然わかる）ので、KN / modified KN ともに気楽に使える。

ロイターコーパスと movie_reviews コーパスの結果もおまけ。

$ ./knsmooth.py -c reuters --seed=0
found corpus : reuters (D=10788)
# of terms=1549612, vocabulary size=29774

UNIGRAM:
additive smoother: alpha1=1.0000, perplexity=854.850
Kneser-Ney: heuristic D=0.564, perplexity=855.384
Kneser-Ney: minimum D=1.000, perplexity=850.667
modified Kneser-Ney: perplexity=850.833

BIGRAM:
additive smoother: alpha2=0.0018, perplexity=207.485
Kneser-Ney: heuristic D=0.670, perplexity=143.591
Kneser-Ney: minimum D=0.704, perplexity=143.527
modified Kneser-Ney: perplexity=142.154

TRIGRAM:
additive smoother: alpha3=0.0004, perplexity=512.933
Kneser-Ney: heuristic D=0.762, perplexity=93.349
Kneser-Ney: minimum D=0.799, perplexity=93.246
modified Kneser-Ney: perplexity=89.477

$ ./knsmooth.py -c movie_reviews --seed=0
found corpus : movie_reviews (D=2000)
# of terms=1422742, vocabulary size=38181

UNIGRAM:
additive smoother: alpha1=1.0000, perplexity=908.244
Kneser-Ney: heuristic D=0.561, perplexity=908.691
Kneser-Ney: minimum D=1.000, perplexity=902.749
modified Kneser-Ney: perplexity=903.249

BIGRAM:
additive smoother: alpha2=0.0023, perplexity=515.530
Kneser-Ney: heuristic D=0.739, perplexity=283.122
Kneser-Ney: minimum D=0.806, perplexity=282.376
modified Kneser-Ney: perplexity=278.357

TRIGRAM:
additive smoother: alpha3=0.0007, perplexity=2735.080
Kneser-Ney: heuristic D=0.851, perplexity=239.729
Kneser-Ney: minimum D=0.918, perplexity=238.210
modified Kneser-Ney: perplexity=230.498