zootreeves Newbie
Joined: 10 Dec 2005 Posts: 8
|
|
runarb Site Admin
Joined: 29 Oct 2006 Posts: 4
|
Posted: Sun Dec 11, 2005 3:43 am Post subject: |
|
|
Yes, Nutch is built on Lucene, but I am not sure whether Nutch uses Lucene directly or its own modified version.
How many documents are you planning to index?
If you only need to index a few million documents, you should look into Berkeley DB.
If you are planning to index the web, the statement on the Swish-e front page, “Swish-e is ideally suited for collections of a million documents or smaller”, is not a good sign.
Lucene is the only open source indexing system I know of that has been used successfully to index large portions of the web. It is used by many and is well documented. There is even a book on how to use it to create a search engine (Lucene in Action, http://www.amazon.com/gp/product/193239428...lance&n=283155).
But Lucene is written in Java, and Java isn’t especially fast. Traditionally, search engines are written in C/C++.
Has anyone looked at Mifluz ( http://gnu.mirrormonster.com/software/mifluz/doc.en.html )? It is a GNU project to build a library for storing a full-text inverted index. It is written in C++.
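For anyone unfamiliar with the term: an inverted index maps each term to the documents that contain it, so a query only has to look up its terms instead of scanning every document. Below is a minimal in-memory sketch in Python, just to show the idea. It is not how Mifluz or Lucene is implemented, and the names tokenize, build_index and search, as well as the sample documents, are made up for illustration.

```python
from collections import defaultdict
import re


def tokenize(text):
    """Lowercase the text and split it into alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(docs):
    """Map each term to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in tokenize(text):
            index[term].add(doc_id)
    return index


def search(index, query):
    """Return the sorted ids of documents containing every query term."""
    terms = tokenize(query)
    if not terms:
        return []
    # Look up the posting set for each term, then intersect them (AND query).
    postings = [index.get(term, set()) for term in terms]
    return sorted(set.intersection(*postings))


# Hypothetical sample documents, keyed by document id.
docs = {
    1: "Lucene is written in Java",
    2: "Mifluz is written in C++",
    3: "Nutch is built on Lucene",
}
index = build_index(docs)
print(search(index, "written in"))
print(search(index, "lucene"))
```

A real engine would keep the postings on disk, compress them and rank the results; this sketch only shows the term-to-documents mapping.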
_________________ CTO @ Searchdaimon company search.
|
Phoog Newbie
Joined: 11 Dec 2005 Posts: 7
|
Posted: Thu Dec 22, 2005 2:45 am Post subject: |
|
|
From what I have heard, Nutch uses Lucene in its regular form; what Nutch adds is a spider and so on.
Also, I have a small project coming up. I'm only going to index about 1-2 million documents. What full-text DB do you guys suggest for that?
By the way, has anyone here actually tried Lucene or Mifluz?
Thanks
|
masidani Member
Joined: 10 Jan 2006 Posts: 23
|
Posted: Tue Jan 10, 2006 4:05 pm Post subject: |
|
|
Lucene is written in Java, but it has been ported to other languages. Search Google for CLucene, the C++ version, for example.
|
|
masidani Member
Joined: 10 Jan 2006 Posts: 23
|
Posted: Tue Jan 10, 2006 4:09 pm Post subject: |
|
|
QUOTE (Phoog @ Dec 22 2005, 02:45 AM) | From what I have heard, Nutch uses Lucene in its regular form; what Nutch adds is a spider and so on.
Also, I have a small project coming up. I'm only going to index about 1-2 million documents. What full-text DB do you guys suggest for that?
By the way, has anyone here actually tried Lucene or Mifluz?
Thanks
I think this is true, but Nutch seems to have fully integrated Lucene into its package structure, rather than having it exist as a separate "lucene" package. The Nutch website seems a bit vague about the version relationship between Nutch and Lucene.
Simon
Masidani
|
|
masidani Member
Joined: 10 Jan 2006 Posts: 23
|
Posted: Tue Jan 10, 2006 4:19 pm Post subject: |
|
|
See this link for an interesting comparison of various off-the-shelf search engine indexing components: http://www.cs.yorku.ca/~cs211299/pdf/Read6...ttardi.tera.pdf.
It appears to suggest that Zettair performs better than Lucene.
Simon
Masidani
|
|
runarb Site Admin
Joined: 29 Oct 2006 Posts: 4
|
|
old_expat Newbie
Joined: 12 Apr 2006 Posts: 9
|
Posted: Wed Apr 26, 2006 10:48 am Post subject: |
|
|
I got a NOT FOUND for this link as well.
|
runarb Site Admin
Joined: 29 Oct 2006 Posts: 4
|
|
masidani Member
Joined: 10 Jan 2006 Posts: 23
|
Posted: Wed Apr 26, 2006 5:35 pm Post subject: |
|
|
Just one thing to note about the Nutch-Lucene relationship: although Nutch uses Lucene, Nutch is not always kept in step with the latest release of Lucene. This is, I suspect, because there is some tailoring of Lucene in the Nutch version, i.e., it's not a straightforward drop of Lucene into Nutch.
Simon
|
|