sirdf.com

runarb · Site Admin Joined: 29 Oct 2006 Posts: 4

What is the essential reading on search engine development?

I have found help in these:

Books:
Modern Information Retrieval
Ricardo Baeza-Yates, Berthier Ribeiro-Neto
http://www.sims.berkeley.edu/~hearst/irbook/

Managing Gigabytes
Ian H. Witten, Alistair Moffat, Timothy C. Bell
http://www.cs.mu.oz.au/mg/

Webbooks:
Information Retrieval
C. J. van Rijsbergen
http://www.dcs.gla.ac.uk/Keith/Preface.html

Articles:
The Anatomy of a Large-Scale Hypertextual Web Search Engine
Sergey Brin, Larry Page
http://citeseer.ist.psu.edu/brin98anatomy.html

Building a Distributed Full-Text Index for the Web
Sergey Melnik, Sriram Raghavan, Beverly Yang, Hector Garcia-Molina
http://citeseer.ist.psu.edu/478324.html

The PageRank Citation Ranking: Bringing Order to the Web
Larry Page, Sergey Brin, R. Motwani, T. Winograd
http://citeseer.ist.psu.edu/page98pagerank.html

Search Engines and Web Dynamics
Knut Magne Risvik, Rolf Michelsen
http://citeseer.ist.psu.edu/risvik02search.html

Focused Crawling Using Context Graphs
M. Diligenti, F.M. Coetzee, S. Lawrence, C.L. Giles, M. Gori
http://citeseer.ist.psu.edu/diligenti00focused.html

Authoritative Sources in a Hyperlinked Environment
Jon M. Kleinberg
http://citeseer.ist.psu.edu/kleinberg99aut...horitative.html
_________________
CTO @ Searchdaimon company search.

Fischerlaender · Posted: Sun May 23, 2004 4:55 pm

Great post, thanks. Here are some additions:

Books:
Mining the Web
Soumen Chakrabarti
http://www.cs.berkeley.edu/~soumen/mining-the-web/
This book covers the basic concepts of Web IR in particular. It is not about handling large amounts of data, but about how to exploit the link structure of the web to build a good search engine.

Articles:
Searching the Web
Arasu, Arvind; Cho, Junghoo; Garcia-Molina, Hector; Paepcke, Andreas; Raghavan, Sriram
http://dbpubs.stanford.edu:8090/pub/2000-37
This article offers a very comprehensive overview of current Web search engine design. IMHO a must-read.

Efficient Crawling Through URL Ordering
Cho, J.; Garcia-Molina, H.; Page, L.
http://dbpubs.stanford.edu:8090/pub/1998-51
Shows methods for ordering the URLs to crawl so that the "best" pages are retrieved first.

Inferring Web Communities from Link Topology
David Gibson, Jon Kleinberg, Prabhakar Raghavan
http://citeseer.ist.psu.edu/gibson98inferring.html
Explains how the link structure of the web can be used to find web communities and hence authority pages.
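
For anyone who wants to see the idea in code before reading the papers: below is a minimal sketch of hub/authority scoring, i.e. the iteration from Kleinberg's "Authoritative Sources in a Hyperlinked Environment" listed earlier in this thread, which link-based community finding builds on. It is not the exact method of the Gibson et al. paper. The function names, the toy graph and the fixed iteration count are my own assumptions for illustration.

```python
from math import sqrt


def normalize(scores):
    """Scale scores to unit Euclidean length (no-op if all are zero)."""
    norm = sqrt(sum(v * v for v in scores.values())) or 1.0
    return {page: v / norm for page, v in scores.items()}


def hits(links, iterations=50):
    """Compute hub and authority scores.

    links: dict mapping a page to the list of pages it links to
           (hypothetical input format, chosen for this sketch).
    Returns (hub, auth), two dicts of page -> score.
    """
    # Collect every page that appears as a source or a target.
    pages = set(links)
    for targets in links.values():
        pages.update(targets)

    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}

    for _ in range(iterations):
        # Authority of a page: sum of the hub scores of pages linking to it.
        new_auth = {p: 0.0 for p in pages}
        for src, targets in links.items():
            for dst in targets:
                new_auth[dst] += hub[src]

        # Hub score of a page: sum of the authority scores it links to.
        new_hub = {
            p: sum(new_auth[dst] for dst in links.get(p, ()))
            for p in pages
        }

        auth = normalize(new_auth)
        hub = normalize(new_hub)

    return hub, auth


if __name__ == "__main__":
    # Toy graph: "a" and "b" both point at "c".
    toy = {"a": ["c"], "b": ["c"], "c": []}
    hub, auth = hits(toy)
    print("hubs:", hub)
    print("authorities:", auth)
```

In the toy graph, "c" ends up as the only authority and "a" and "b" as equally good hubs. On real data you would first restrict the graph to pages related to a query or topic, as the papers describe.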
_________________
http://www.neomo.de - the search engine alternative (test version)

cwenz · Newbie Joined: 23 May 2004 Posts: 1

one more:

Books:
Web Document Analysis
Apostolos Antonacopoulos, Jianying Hu [Eds.]
http://www.worldscientific.com/books/compsci/5375.html
This book consists of several papers on various aspects of document analysis, including web image processing, content extraction, CAPTCHAs, and so on.

runarb · Site Admin Joined: 29 Oct 2006 Posts: 4

Lucene in Action
Erik Hatcher and Otis Gospodnetić
http://www.manning.com/books/hatcher2

Apache Lucene is an open-source, high-performance, full-featured text search engine library written entirely in Java. It is a technology suitable for nearly any application that requires full-text search, especially cross-platform.

Lucene in Action is the authoritative guide to it. It covers analysis, indexing, searching and querying, as well as crawling the web and handling different file formats such as HTML, Word, PDF and XML.

Reviews:
http://www.oscom.org/events/oscom4/proposals/lucene
http://www.amazon.com/exec/obidos/tg/detai...394281?v=glance

runarb · Site Admin Joined: 29 Oct 2006 Posts: 4

Practical Issues of Crawling Large Web Collections
Carlos Castillo and Ricardo Baeza-Yates

During large crawls of the Web, we have observed several anomalies in the implementation of the basic protocols by some Web sites. These anomalies impose costs on the design of a Web crawler and reduce the findability of information on the Web.

We document several issues related to networking, DNS, HTTP, HTML and application programming. Our aim is to help Web crawler designers and Web application developers to improve the interoperability of their Web systems.

http://www.dcc.uchile.cl/~ccastill/papers/...eb_crawling.pdf
