sirdf.com Forum Index sirdf.com
Search & Information Retrieval Development Forum
 

Essential reading

 
runarb
Site Admin


Joined: 29 Oct 2006
Posts: 4

Posted: Sat May 08, 2004 4:45 pm

What is the essential reading on search engine development?

I have found help in these:

Books:
Modern Information Retrieval
Ricardo Baeza-Yates, Berthier Ribeiro-Neto
http://www.sims.berkeley.edu/~hearst/irbook/

Managing Gigabytes
Ian H. Witten, Alistair Moffat, Timothy C. Bell
http://www.cs.mu.oz.au/mg/


Webbooks:
INFORMATION RETRIEVAL
C. J. van RIJSBERGEN
http://www.dcs.gla.ac.uk/Keith/Preface.html


Articles:
The Anatomy of a Large-Scale Hypertextual Web Search Engine
Sergey Brin, Larry Page
http://citeseer.ist.psu.edu/brin98anatomy.html

Building a Distributed Full-Text Index for the Web
Sergey Melnik, Sriram Raghavan, Beverly Yang, Hector Garcia-Molina
http://citeseer.ist.psu.edu/478324.html

The PageRank Citation Ranking: Bringing Order to the Web
Larry Page, Sergey Brin, R. Motwani, T. Winograd
http://citeseer.ist.psu.edu/page98pagerank.html


Search Engines and Web Dynamics
Knut Magne Risvik, Rolf Michelsen
http://citeseer.ist.psu.edu/risvik02search.html

Focused Crawling Using Context Graphs
M. Diligenti, F.M. Coetzee, S. Lawrence, C.L. Giles, M. Gori
http://citeseer.ist.psu.edu/diligenti00focused.html

Authoritative Sources in a Hyperlinked Environment
Jon M Kleinberg
http://citeseer.ist.psu.edu/kleinberg99aut...horitative.html
_________________
CTO @ Searchdaimon company search.
Fischerlaender
Member


Joined: 08 May 2004
Posts: 11
Location: Osterhofen, Bavaria, Germany

Posted: Sun May 23, 2004 4:55 pm

Great post, thanks. Here are some additions:

Books:
Mining the Web
Soumen Chakrabarti
http://www.cs.berkeley.edu/~soumen/mining-the-web/
This book covers the basic concepts of Web IR. It is less about handling large amounts of data than about how to exploit the link structure of the web to build a good search engine.

Articles:
Searching the Web
Arasu, Arvind; Cho, Junghoo; Garcia-Molina, Hector; Paepcke, Andreas; Raghavan, Sriram
http://dbpubs.stanford.edu:8090/pub/2000-37
This article offers a very comprehensive overview of current Web search engine design. IMHO a must-read.

Efficient Crawling Through URL Ordering
Cho, J.; Garcia-Molina, H.; Page, L.
http://dbpubs.stanford.edu:8090/pub/1998-51
Shows methods for retrieving the "best" pages first during a crawl.

Inferring Web Communities from Link Topology
David Gibson, Jon Kleinberg, Prabhakar Raghavan
http://citeseer.ist.psu.edu/gibson98inferring.html
Explains how the link structure of the web can be used to find web communities and hence authority pages.
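
To make the link-structure idea concrete, here is a minimal sketch of hub/authority scoring in the style of Kleinberg's HITS algorithm (the "Authoritative Sources" paper listed earlier in this thread). It is not taken from any of the papers; the function names `hits` and `normalize` and the dict-based toy graph are assumptions for illustration only. A page gets a high authority score when good hubs link to it, and a high hub score when it links to good authorities.

```python
import math


def normalize(scores):
    """Scale scores to unit Euclidean length (left unchanged if all zero)."""
    norm = math.sqrt(sum(v * v for v in scores.values())) or 1.0
    return {page: v / norm for page, v in scores.items()}


def hits(graph, iterations=50):
    """Hypothetical helper: graph maps each page to the pages it links to.

    Returns (hubs, authorities) as dicts of page -> score.
    """
    # Collect every page that appears as a source or a target.
    pages = set(graph)
    for targets in graph.values():
        pages.update(targets)

    hubs = {p: 1.0 for p in pages}
    auths = {p: 1.0 for p in pages}

    for _ in range(iterations):
        # Authority score: sum of hub scores of the pages linking in.
        new_auths = {p: 0.0 for p in pages}
        for src, targets in graph.items():
            for dst in targets:
                new_auths[dst] += hubs[src]
        # Hub score: sum of the updated authority scores of linked pages.
        new_hubs = {
            p: sum(new_auths[dst] for dst in graph.get(p, []))
            for p in pages
        }
        auths = normalize(new_auths)
        hubs = normalize(new_hubs)

    return hubs, auths


if __name__ == "__main__":
    # Toy graph (made up): two pages that both point at a third one.
    toy = {"a": ["c"], "b": ["c"], "c": []}
    hub_scores, auth_scores = hits(toy)
    print(max(auth_scores, key=auth_scores.get))
```

In a real engine the graph would typically be a query-focused subgraph of the crawl rather than a hand-written dict, and you would stop on convergence instead of running a fixed number of iterations.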
_________________
http://www.neomo.de - the search engine alternative (test version)
cwenz
Newbie


Joined: 23 May 2004
Posts: 1

Posted: Fri Jun 11, 2004 1:13 pm

One more:

Books:
Web Document Analysis
Apostolos Antonacopoulos, Jianying Hu [Eds.]
http://www.worldscientific.com/books/compsci/5375.html
This book contains several papers on various aspects of document analysis, including web image processing, content extraction, CAPTCHAs, and so on.

runarb
Site Admin


Joined: 29 Oct 2006
Posts: 4

Posted: Mon Dec 26, 2005 1:51 am

Lucene in Action
Erik Hatcher and Otis Gospodnetić
http://www.manning.com/books/hatcher2

Apache Lucene is an open source, high-performance, full-featured text search engine library written entirely in Java. It is suitable for nearly any application that requires full-text search, especially cross-platform ones.

Lucene in Action is the authoritative guide to it. It covers analysis, indexing, searching and querying, as well as crawling the web and handling different file formats such as HTML, Word, PDF and XML.

Reviews:
http://www.oscom.org/events/oscom4/proposals/lucene
http://www.amazon.com/exec/obidos/tg/detai...394281?v=glance
runarb
Site Admin


Joined: 29 Oct 2006
Posts: 4

Posted: Wed Apr 19, 2006 1:02 am

Practical Issues of Crawling Large Web Collections
Carlos Castillo and Ricardo Baeza-Yates

During large crawls of the Web, we have observed several anomalies in the implementation of the basic protocols by some Web sites. These anomalies impose costs on the design of a Web crawler and reduce the findability of information on the Web.

We document several issues related to networking, DNS, HTTP, HTML and application programming. Our aim is to help Web crawler designers and Web application developers to improve the interoperability of their Web systems.


http://www.dcc.uchile.cl/~ccastill/papers/...eb_crawling.pdf


All times are GMT