極大部分文字列の味見 / 自然言語処理勉強会＠東京 #3

[岡野原+ 2008] 全ての部分文字列を考慮した文書分類

http://ci.nii.ac.jp/naid/110006980330

n-gram と異なり、任意長の部分文字列を素性の候補としたい
- ただしそのままでは素性数が文章長の二乗のオーダー
素性の候補となる「極大部分文字列」を suffix array/lcp/WDT から線形時間で求める
- ２回以上現れる任意の部分文字列を素性とするのと同等
- 一般に、極大部分文字列は全ての部分文字列よりはるかに少ない(trigram〜fivegram ぐらい)
grafting/L1 正則化付きロジスティック回帰により、有効な素性(重みが非ゼロ)を効率的に学習

極大部分文字列

例) abracadabra
- "bra" は "abra" の部分文字列としてしか現れない　→　"bra" という素性は "abra" の中に含まれているので必要ない
- そのような関係がある部分文字列同士を同値類としてまとめ、それぞれの類のなかで最長のものを「極大部分文字列」とする。
極大部分文字列は suffix tree の内節点でかつ２種類以上の BWT 文字列をもつことと同値
- ２回以上現れる極大部分文字列を suffix array/lcp/WDT から線形時間で求めることが可能
- suffix array は「ほとんどの場合」(文字列が十分ランダムなら)線形時間で構築可能

ナイーブな実装

極大部分文字列の抽出の実装 by Ruby 1.9
- とってもナイーブ( 時間・空間ともに >= O(N^2) )

#!/usr/bin/env ruby
# -*- coding: utf-8 -*-
# maximal substrings extraction

END_MARK = "\u0000"
#text = "abracadabra" + END_MARK
text = ARGV.map{|fn| open(fn) {|f| f.read }.gsub(/\u000d/, "")}.join("\u0001") + END_MARK
sarray = Array.new(text.length){|i|i}.sort_by{|i|text[i..-1]}
lcp = Array.new(text.length){0}
bwt = Array.new(text.length)

pre_pos = -1
pre_bwt = nil
maximal = Hash.new
sarray.each_with_index do |pos, i|
  bwt[i] = if pos==0 then END_MARK else text[pos-1] end
  if i>0 then
    l = lcp[i] = (0..(text.length-pos)).find{|j| text[pre_pos+j] != text[pos+j]}
    maximal[text[pos..(pos+l-1)]]=1 if l>0 && pre_bwt != bwt[i]
  end
  pre_pos = pos
  pre_bwt = bwt[i]
end

#sarray.each_with_index do |pos, i|
#  puts "#{i} #{lcp[i]} #{bwt[i]} #{text[pos..(pos+10)]}"
#end
p maximal.keys

実験(素性の数)

Alice's Adventures in Wonderland (145283 バイト)
- unigram : 69
- bigram : 1186
- trigram : 6283
- fourgram : 17709
- fivegram : 34912
- maximal substrings : 34058

吾輩は猫である１〜４章 (256147 バイト : utf-8)
- unigram : 2188
- bigram : 19749
- tringram : 45032
- fourgram : 63754
- fivegram : 74997
- maximal substrings : 21088