sirdf.com Forum Index sirdf.com
Search & Information Retrieval Development Forum
 

Best Fulltext database?

 
zootreeves
Newbie


Joined: 10 Dec 2005
Posts: 8

Posted: Sat Dec 10, 2005 4:28 pm

Does anyone know which open source fulltext database would be best to use?

http://www.seg.rmit.edu.au/zettair/
http://swish-e.org/
http://lucene.apache.org/java/docs/
http://www.xapian.org/

Or any others? I understand Nutch uses Lucene, right?
runarb
Site Admin


Joined: 29 Oct 2006
Posts: 4

Posted: Sun Dec 11, 2005 3:43 am

Yes, Nutch is built on Lucene. But I am not sure whether Nutch uses Lucene directly or its own modified version.

How many documents are you planning to index?

If you only need to index a few million documents, you should look into Berkeley DB.


If you are planning to index the web, the statement on Swish-e's front page, “Swish-e is ideally suited for collections of a million documents or smaller”, is not a good sign.

Lucene is the only open source indexing system I know of that has been used successfully to index large portions of the web. It is widely used and well documented. There is even a book on how to use it to build a search engine (Lucene in Action, http://www.amazon.com/gp/product/193239428...lance&n=283155).

But Lucene is written in Java, and Java isn’t especially fast. Traditionally, search engines are written in C/C++.


Has anyone looked at Mifluz ( http://gnu.mirrormonster.com/software/mifluz/doc.en.html )? It is a GNU project to build a library for storing a full-text inverted index. It is written in C++.
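
For anyone unsure what a full-text inverted index stores, here is a minimal Python sketch. It is only an illustration of the idea (a map from each term to the documents containing it), not how Mifluz or Lucene actually implement it. The names `tokenize`, `build_index` and `search` and the sample documents are made up for this example.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(docs):
    """Map each term to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in tokenize(text):
            index[term].add(doc_id)
    return index


def search(index, query):
    """Return ids of documents that contain every query term (AND query)."""
    terms = tokenize(query)
    if not terms:
        return set()
    # Start from the first term's postings and intersect with the rest.
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result


# Hypothetical sample documents, keyed by document id.
docs = {
    1: "Lucene is written in Java",
    2: "Zettair is written in C",
    3: "Nutch is built on Lucene",
}
index = build_index(docs)
print(sorted(search(index, "lucene")))      # [1, 3]
print(sorted(search(index, "written in")))  # [1, 2]
```

A real engine would also keep term positions and frequencies, compress the postings lists and store them on disk, which is where the libraries discussed here differ.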

_________________
CTO @ Searchdaimon company search.
Phoog
Newbie


Joined: 11 Dec 2005
Posts: 7

Posted: Thu Dec 22, 2005 2:45 am

From what I have heard, Nutch uses Lucene in its regular form; what Nutch adds is a spider and so on.

Also, I have a small project coming up, and I'm only going to index about 1-2 million documents. What fulltext db do you guys suggest for that?

By the way, has anyone here actually tried Lucene or Mifluz?

Thanks
masidani
Member


Joined: 10 Jan 2006
Posts: 23

Posted: Tue Jan 10, 2006 4:05 pm

Lucene is written in Java, but it has been ported to other languages. Search Google for CLucene to find the C++ version, for example.
masidani
Member


Joined: 10 Jan 2006
Posts: 23

Posted: Tue Jan 10, 2006 4:09 pm

QUOTE (Phoog @ Dec 22 2005, 02:45 AM)
From what I have heard, Nutch uses Lucene in its regular form; what Nutch adds is a spider and so on.

Also, I have a small project coming up, and I'm only going to index about 1-2 million documents. What fulltext db do you guys suggest for that?

By the way, has anyone here actually tried Lucene or Mifluz?

Thanks



I think this is true, but Nutch seems to have fully integrated Lucene into its package structure, rather than having it exist as a separate "lucene" package. The Nutch website seems a bit vague about the version relationship between Nutch and Lucene.

Simon
Masidani
masidani
Member


Joined: 10 Jan 2006
Posts: 23

Posted: Tue Jan 10, 2006 4:19 pm

See this link for an interesting comparison of various off-the-shelf search engine indexing components: http://www.cs.yorku.ca/~cs211299/pdf/Read6...ttardi.tera.pdf.

It appears to suggest that Zettair performs better than Lucene.

Simon
Masidani
runarb
Site Admin


Joined: 29 Oct 2006
Posts: 4

Posted: Tue Jan 10, 2006 6:29 pm

It seems that page cannot be found.

But the document is also here:
http://www.cs.yorku.ca/~mladen/pdf/Read6_u...ttardi.tera.pdf

Btw, nice find.

Quote:
Indexing times were 13 min. for IXE, 6 min. for Zettair and 4 hours for
Lucene
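
To put the quoted times on one scale, here is a quick back-of-the-envelope calculation in Python. It only restates the three numbers from the quote as ratios; it says nothing about the hardware, collection, or settings behind them, which are not given in this thread. The variable names are made up for this example.

```python
# Indexing times from the quote above, converted to minutes.
times_min = {"IXE": 13, "Zettair": 6, "Lucene": 4 * 60}

# Express every time as a multiple of the fastest one.
fastest = min(times_min.values())
ratios = {name: minutes / fastest for name, minutes in times_min.items()}

for name, ratio in sorted(ratios.items(), key=lambda kv: kv[1]):
    print(f"{name}: {ratio:.1f}x the fastest time")
# Zettair: 1.0x the fastest time
# IXE: 2.2x the fastest time
# Lucene: 40.0x the fastest time
```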


Is Lucene's poor performance because it is written in Java? Is Java really that slow?
old_expat
Newbie


Joined: 12 Apr 2006
Posts: 9

Posted: Wed Apr 26, 2006 10:48 am

Quote:
But the document is also here:
http://www.cs.yorku.ca/~mladen/pdf/Read6_u...ttardi.tera.pdf


I got a NOT FOUND for this link as well.
runarb
Site Admin


Joined: 29 Oct 2006
Posts: 4

Posted: Wed Apr 26, 2006 1:53 pm

Try http://trec.nist.gov/pubs/trec13/papers/upisa-tera.pdf
masidani
Member


Joined: 10 Jan 2006
Posts: 23

Posted: Wed Apr 26, 2006 5:35 pm

Just one thing to note about the Nutch-Lucene relationship: although Nutch uses Lucene, the Lucene version inside Nutch does not always keep pace with the latest Lucene release. This is, I suspect, because there's some tailoring of Lucene in the Nutch version, i.e., it's not a straightforward drop of Lucene into Nutch.

Simon
All times are GMT