How to write an Apache Nutch plugin

This post continues "Overview of the open-source Web search engine Apache Nutch" and "Apache Nutch plugins and language detection" (both on Mi manca qualche giovedi`?).
Based on Apache Nutch 1.2, it walks through how to write a plugin, using a plugin for the IndexingFilter extension-point as the example.

IndexingFilter extension-point

The IndexingFilter extension-point is for plugins that add metadata to the index; it is called from the MapTask of the Indexer job running on Hadoop.
The other extension-points should likewise be called from the MapTask (or possibly the ReduceTask?) of jobs such as Injector, Generator, Fetcher, and Parser (see "Overview of the open-source Web search engine Apache Nutch"), so debugging and similar tasks follow the usual Hadoop practice.

A plugin for the IndexingFilter extension-point is implemented as a class that implements org.apache.nutch.indexer.IndexingFilter.
IndexingFilter declares the following two methods.

  void addIndexBackendOptions(Configuration conf);
  NutchDocument filter(NutchDocument doc, Parse parse, Text url, CrawlDatum datum, Inlinks inlinks);

addIndexBackendOptions() specifies the definitions of the metadata fields to be added to the index.
It should be enough to write one LuceneWriter.addFieldOptions() call per field (probably). For details, look into LuceneWriter.addFieldOptions() (again, probably).

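filter() is where the metadata is extracted and added.
These methods and their contracts differ from one extension-point to another, so each has to be looked up individually. The surest and quickest way is to find a standard plugin that uses the extension-point you want to implement and read its source.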

If the extension-point extends org.apache.hadoop.conf.Configurable (true of most extension-points), also implement setConf() and getConf().

  void setConf(Configuration conf);
  Configuration getConf();

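setConf() stores the org.apache.hadoop.conf.Configuration object it receives and performs any required initialization. getConf() returns the stored Configuration object. See the code example later in this post.
The Configuration object gives access to the parameters set in nutch-(default|site).xml, so use those values when writing the initialization.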
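
The setConf()/getConf() contract described above can be sketched without any Hadoop dependency. In the sketch below, MiniConf is a hypothetical stand-in for org.apache.hadoop.conf.Configuration (only the shape of getFloat(name, default) is borrowed from it), and ThresholdPlugin and the parameter name sample.plugin.threshold are made up for illustration. It is a minimal sketch of the pattern, not Nutch code.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for org.apache.hadoop.conf.Configuration,
// so that the sketch runs without Hadoop on the classpath.
class MiniConf {
    private final Map<String, String> values = new HashMap<>();

    void set(String name, String value) {
        values.put(name, value);
    }

    // Same shape as Configuration.getFloat(name, defaultValue):
    // the default is returned when the parameter is not defined.
    float getFloat(String name, float defaultValue) {
        String raw = values.get(name);
        return raw == null ? defaultValue : Float.parseFloat(raw);
    }
}

// Plugin-like class (hypothetical): setConf() keeps the configuration
// and derives its own settings from it; getConf() hands it back.
class ThresholdPlugin {
    private MiniConf conf;
    private float threshold;

    public void setConf(MiniConf conf) {
        this.conf = conf;
        // "sample.plugin.threshold" is an invented parameter name.
        this.threshold = conf.getFloat("sample.plugin.threshold", 0.5f);
    }

    public MiniConf getConf() {
        return conf;
    }

    public float getThreshold() {
        return threshold;
    }
}
```

With an empty MiniConf the plugin falls back to 0.5; after conf.set("sample.plugin.threshold", "0.8") and another setConf() call it picks up 0.8. In a real plugin the value would come from nutch-site.xml through the Configuration object instead.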

Implementation

To develop a plugin, unpack apache-nutch-1.2-bin.tar.gz and reference the following two libraries.

nutch-1.2.jar
lib/hadoop-0.20.2-core.jar

A code example (skeleton) follows.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.Text;
import org.apache.nutch.crawl.CrawlDatum;
import org.apache.nutch.crawl.Inlinks;
import org.apache.nutch.indexer.IndexingException;
import org.apache.nutch.indexer.IndexingFilter;
import org.apache.nutch.indexer.NutchDocument;
import org.apache.nutch.indexer.lucene.LuceneWriter;
import org.apache.nutch.metadata.Metadata;
import org.apache.nutch.parse.Parse;

public class SampleFilter implements IndexingFilter {
	private Configuration conf = null;

	public SampleFilter() {
		// No-argument constructor
	}

	public void setConf(Configuration conf) {
		this.conf = conf;
		// TODO: put any required initialization here
		// Read settings with e.g. conf.getFloat([parameter name], [default value when undefined])
	}

	public Configuration getConf() {
		return this.conf;
	}

	// Everything above is common to almost all extension-points

	// Everything below is specific to the IndexingFilter extension-point

	public NutchDocument filter(NutchDocument doc, Parse parse, Text url,
			CrawlDatum datum, Inlinks inlinks) throws IndexingException {
		// TODO: implement the body of the plugin here
		return doc; // the method must return a NutchDocument
	}

	public void addIndexBackendOptions(Configuration conf) {
		// Define the metadata fields that this plugin adds
		LuceneWriter.addFieldOptions("lang", LuceneWriter.STORE.YES,
				LuceneWriter.INDEX.UNTOKENIZED, conf);
	}
}
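
The skeleton keeps a no-argument constructor and does its setup in setConf(). A framework that creates plugin objects by class typically instantiates them reflectively and only then hands over the configuration, which is one plausible reason for that shape. The sketch below illustrates the idea with hypothetical names only (SimpleConfigurable, EchoPlugin, PluginLoader); it assumes nothing about the actual Nutch loader and uses a plain Map in place of Configuration.

```java
import java.util.Map;

// Hypothetical stand-in for a Configurable-style interface.
interface SimpleConfigurable {
    void setConf(Map<String, String> conf);
}

// Hypothetical plugin: no-argument constructor, setup deferred to setConf().
class EchoPlugin implements SimpleConfigurable {
    Map<String, String> conf;

    EchoPlugin() {
        // Nothing to do here; the configuration is not available yet.
    }

    @Override
    public void setConf(Map<String, String> conf) {
        this.conf = conf;
    }
}

// Hypothetical loader: creates the object through its no-argument
// constructor, then passes the configuration if the object accepts one.
class PluginLoader {
    static <T> T load(Class<T> type, Map<String, String> conf) {
        try {
            T instance = type.getDeclaredConstructor().newInstance();
            if (instance instanceof SimpleConfigurable) {
                ((SimpleConfigurable) instance).setConf(conf);
            }
            return instance;
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException("Cannot instantiate " + type.getName(), e);
        }
    }
}
```

A call such as PluginLoader.load(EchoPlugin.class, conf) returns a ready-to-use object. A class whose only constructor takes arguments would fail in getDeclaredConstructor(), which is why the skeleton avoids one.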

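IndexingFilter#filter() extracts metadata from NutchDocument doc and the other arguments, and adds it with doc.add(). Here too, real code explains it faster.
The following is an example implementation for a language detection plugin that uses the Language Detection Library for Java. Detector, DetectorFactory, and LangDetectException come from that library's com.cybozu.labs.langdetect package, and the language profiles have to be loaded with DetectorFactory.loadProfile() (for example in setConf()) before DetectorFactory.create() is called.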

	public NutchDocument filter(NutchDocument doc, Parse parse, Text url,
			CrawlDatum datum, Inlinks inlinks) throws IndexingException {

		// If a meta tag specifies the language, use that value.
		String lang = parse.getData().getParseMeta().get(Metadata.LANGUAGE);
		if (lang == null) {

			// Otherwise, estimate the language statistically with the Language Detection Library
			StringBuilder text = new StringBuilder();
			text.append(parse.getData().getTitle()).append(" ").append(parse.getText());
			try {
				Detector detector = DetectorFactory.create();
				detector.append(text.toString());
				lang = detector.detect();
			} catch (LangDetectException e) {
				throw new IndexingException("Detection failed.", e);
			}
		}
		if (lang == null) lang = "unknown";

		// Add the language metadata to the document's index entry and return it
		doc.add("lang", lang);
		return doc;
	}
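
The decision order in the filter above (declared language first, statistical guess second, "unknown" last) can be isolated into a small function that is testable without Nutch or the detection library. This is only a sketch: LanguageResolver is a hypothetical name, and the detector is passed in as a plain Function instead of the real Detector class.

```java
import java.util.function.Function;

// Hypothetical helper that mirrors the fallback order of the filter:
// 1. language declared in the page metadata,
// 2. language guessed by a detector from the text,
// 3. the literal "unknown".
class LanguageResolver {
    static String resolve(String declared, String text,
                          Function<String, String> detector) {
        if (declared != null) {
            return declared; // the page states its own language
        }
        String guessed = detector.apply(text);
        return guessed != null ? guessed : "unknown";
    }
}
```

For example, LanguageResolver.resolve(null, "some text", t -> "en") yields "en", while a non-null declared value is returned as is and the detector is never called.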

Next: how to write plugin.xml, and an introduction to a plugin that detects 49 languages.