sirdf.com

zootreeves · Newbie Joined: 10 Dec 2005 Posts: 8

Does anyone know which open source full-text database would be best to use?

http://www.seg.rmit.edu.au/zettair/
http://swish-e.org/
http://lucene.apache.org/java/docs/
http://www.xapian.org/

Or any others? I understand Nutch uses Lucene, right?

runarb · Site Admin Joined: 29 Oct 2006 Posts: 4

Yes, Nutch is built on Lucene. But I am not sure whether Nutch uses Lucene directly or its own modified version.

How many documents are you planning to index?

If you only need to index a few million documents, you should look into Berkeley DB.
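To illustrate the idea of keeping an index in a key-value store, here is a rough sketch. It uses Python's standard `dbm.dumb` module as a stand-in, not Berkeley DB itself, and the helper names `add_document` and `lookup` are made up for this example. Each term is a key, and the value is a JSON list of the ids of documents containing that term.

```python
import dbm.dumb
import json
import os
import tempfile


def add_document(db, doc_id, text):
    """Add doc_id to the stored list of every distinct term in text."""
    for term in set(text.lower().split()):
        key = term.encode("utf-8")
        # Load the existing list for this term, or start a new one.
        postings = json.loads(db[key]) if key in db else []
        if doc_id not in postings:
            postings.append(doc_id)
        db[key] = json.dumps(postings).encode("utf-8")


def lookup(db, term):
    """Return the list of document ids stored for a term (empty if none)."""
    key = term.lower().encode("utf-8")
    return json.loads(db[key]) if key in db else []


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "index")

    # Build the index on disk.
    db = dbm.dumb.open(path, "c")
    add_document(db, 1, "Lucene is written in Java")
    add_document(db, 2, "Mifluz is written in C++")
    add_document(db, 3, "Nutch is built on Lucene")
    db.close()

    # Reopen read-only to show that the index survives between runs.
    db = dbm.dumb.open(path, "r")
    found_lucene = lookup(db, "lucene")
    found_missing = lookup(db, "zettair")
    db.close()

print(found_lucene)
print(found_missing)
```

Rewriting the whole list on every update is slow, so a real indexer would batch its writes. For a few million documents the basic layout (term as key, document list as value) is the same.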

If you are planning to index the web, the statement on Swish-e's front page, "Swish-e is ideally suited for collections of a million documents or smaller", is not a good sign.

Lucene is the only open source indexing system I know of that has been used successfully to index large portions of the web. It is widely used and well documented. There is even a book on how to use it to build a search engine (Lucene in Action, http://www.amazon.com/gp/product/193239428...lance&n=283155).

But Lucene is written in Java, and Java isn't especially fast. Traditionally, search engines are written in C/C++.

Has anyone looked at Mifluz ( http://gnu.mirrormonster.com/software/mifluz/doc.en.html )? It is a GNU project to build a library for storing a full-text inverted index. It is written in C++.
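For anyone unfamiliar with the term, an inverted index maps each term to the set of documents that contain it, so a query only has to look up its own terms instead of scanning every document. A minimal in-memory sketch in Python follows. The names `tokenize`, `build_index` and `search` are invented for illustration and have nothing to do with the Mifluz API.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(docs):
    """Map each term to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in tokenize(text):
            index[term].add(doc_id)
    return index


def search(index, query):
    """Return ids of documents that contain every query term (AND search)."""
    terms = tokenize(query)
    if not terms:
        return set()
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result


docs = {
    1: "Lucene is written in Java",
    2: "Mifluz is written in C++",
    3: "Nutch is built on Lucene",
}
index = build_index(docs)
print(sorted(search(index, "lucene")))
print(sorted(search(index, "written in")))
```

Real engines also store term positions and frequencies for ranking and keep the index on disk, but the term-to-documents mapping is the core of it.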

_________________
CTO @ Searchdaimon company search.

Phoog · Newbie Joined: 11 Dec 2005 Posts: 7

From what I have heard, Nutch uses Lucene in its regular form; what Nutch adds is a spider and so on.

Also, I have a small project coming up where I'm only going to index about 1-2 million documents. Which full-text DB do you guys suggest for that?

By the way, has anyone here actually tried Lucene or Mifluz?

Thanks

masidani · Member Joined: 10 Jan 2006 Posts: 23

Lucene is written in Java, but has been ported to other languages. Do a Google search for CLucene to find the C++ version, for example.

masidani · Member Joined: 10 Jan 2006 Posts: 23

QUOTE (Phoog @ Dec 22 2005, 02:45 AM)

From what I have heard, Nutch uses Lucene in its regular form; what Nutch adds is a spider and so on.

Also, I have a small project coming up where I'm only going to index about 1-2 million documents. Which full-text DB do you guys suggest for that?

By the way, has anyone here actually tried Lucene or Mifluz?

Thanks

I think this is true, but Nutch seems to have fully integrated Lucene into its package structure, rather than having it exist as a separate "lucene" package. The Nutch website seems a bit vague as to the version relationship between Nutch and Lucene.

Simon
Masidani

masidani · Member Joined: 10 Jan 2006 Posts: 23

See this link for an interesting comparison of various off-the-shelf search engine indexing components: http://www.cs.yorku.ca/~cs211299/pdf/Read6...ttardi.tera.pdf.

It appears to suggest that Zettair performs better than Lucene.

Simon
Masidani

runarb · Site Admin Joined: 29 Oct 2006 Posts: 4

It seems that page cannot be found.

But the document is also available here:
http://www.cs.yorku.ca/~mladen/pdf/Read6_u...ttardi.tera.pdf

Btw, nice find.

old_expat · Newbie Joined: 12 Apr 2006 Posts: 9

runarb · Site Admin Joined: 29 Oct 2006 Posts: 4

try http://trec.nist.gov/pubs/trec13/papers/upisa-tera.pdf

masidani · Member Joined: 10 Jan 2006 Posts: 23

Just one thing to note about the Nutch-Lucene relationship: although Nutch uses Lucene, Nutch is not always kept in step with the latest Lucene release. I suspect this is because there is some tailoring of Lucene in the Nutch version, i.e., it's not a straightforward drop of Lucene into Nutch.

Simon