Going Nutchy

Posted on 2007-11-08 09:49:01 EET.

I've been meaning to write a "note to self" about this for a while but have just been too busy...

Anyway, we recently needed to make non-MidCOM sites searchable via midcom.helper.search. Merging results at the style level, or making an iframe for a separate set of results from some other indexer, would have been too ugly a hack (though it does work), so I went looking for a spider that we could use to feed data to Solr.

Apache Nutch already uses Lucene, but only internally. Luckily, this page explained a way to make Nutch talk to a Solr backend.

I got it working with the 0.9 release of Nutch and a combined Nutch+MidCOM Solr schema (get it from Midgard SVN). However, since Nutch builds the abstracts or summaries at result display time instead of storing them in the index, we don't have those available. It should be possible to hook a generic summarizer into the index phase, but I didn't have time to look into that, and my Java skills are very rusty...
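To make the summarizer idea a bit more concrete, here is a rough Python sketch of what an index-time abstract could look like. It is not what the Nutch patch does, and it is not Java: it just truncates the page text at a word boundary and wraps the result in a Solr XML update message (`<add><doc><field name="...">`). The field names (`url`, `title`, `content`, `abstract`) and both function names are assumptions made up for illustration; the real ones would have to match the combined Nutch+MidCOM schema.

```python
import xml.etree.ElementTree as ET


def naive_summary(text, max_chars=200):
    """Collapse whitespace and cut at the last word boundary before max_chars.

    Hypothetical helper: a stand-in for a real summarizer.
    """
    collapsed = " ".join(text.split())
    if len(collapsed) <= max_chars:
        return collapsed
    # Drop the (possibly partial) last word so the abstract ends cleanly.
    cut = collapsed[:max_chars].rsplit(" ", 1)[0]
    return cut + "..."


def solr_add_xml(url, title, content):
    """Build a Solr XML <add> message for one crawled page.

    Field names are assumptions, not the actual schema.
    """
    add = ET.Element("add")
    doc = ET.SubElement(add, "doc")
    fields = {
        "url": url,
        "title": title,
        "content": content,
        # Summary computed once at index time and stored with the document.
        "abstract": naive_summary(content),
    }
    for name, value in fields.items():
        field = ET.SubElement(doc, "field", name=name)
        field.text = value
    return ET.tostring(add, encoding="unicode")


if __name__ == "__main__":
    print(solr_add_xml("http://example.com/", "Example", "Some page text " * 30))
```

The obvious downside compared to Nutch's own approach is that a stored abstract like this knows nothing about the query, so it cannot highlight the matching terms.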

As the linked page states, the implementation is a bit naive and stub-like, but it works for the basic need of getting stuff into the index and searchable. While doing this I also got the idea of making a Nutch backend for the MidCOM indexer service (Nutch has RSS output of the results as well). That would work only for 100% public sites, but the result-time summarizing is rather nice.

