sirdf.com Search & Information Retrieval Development Forum
runarb Site Admin
Joined: 29 Oct 2006 Posts: 4
Posted: Sat May 08, 2004 4:45 pm
What is the essential reading on search engine development?
I have found help in these:
Books:
Modern Information Retrieval
Ricardo Baeza-Yates, Berthier Ribeiro-Neto
http://www.sims.berkeley.edu/~hearst/irbook/
Managing Gigabytes
Ian H. Witten, Alistair Moffat, Timothy C. Bell
http://www.cs.mu.oz.au/mg/
Webbooks:
Information Retrieval
C. J. van Rijsbergen
http://www.dcs.gla.ac.uk/Keith/Preface.html
Articles:
The Anatomy of a Large-Scale Hypertextual Web Search Engine
Sergey Brin, Larry Page
http://citeseer.ist.psu.edu/brin98anatomy.html
Building a Distributed Full-Text Index for the Web
Sergey Melnik, Sriram Raghavan, Beverly Yang, Hector Garcia-Molina
http://citeseer.ist.psu.edu/478324.html
The PageRank Citation Ranking: Bringing Order to the Web
Larry Page, Sergey Brin, R. Motwani, T. Winograd
http://citeseer.ist.psu.edu/page98pagerank.html
Search Engines and Web Dynamics
Knut Magne Risvik, Rolf Michelsen
http://citeseer.ist.psu.edu/risvik02search.html
Focused Crawling Using Context Graphs
M. Diligenti, F.M. Coetzee, S. Lawrence, C.L. Giles, M. Gori
http://citeseer.ist.psu.edu/diligenti00focused.html
Authoritative Sources in a Hyperlinked Environment
Jon M Kleinberg
http://citeseer.ist.psu.edu/kleinberg99aut...horitative.html
_________________ CTO @ Searchdaimon company search.
Fischerlaender Member
Joined: 08 May 2004 Posts: 11 Location: Osterhofen, Bavaria, Germany
Posted: Sun May 23, 2004 4:55 pm
Great post, thanks. Here are some additions:
Books:
Mining the Web
Soumen Chakrabarti
http://www.cs.berkeley.edu/~soumen/mining-the-web/
This book covers the basic concepts of Web IR. It is not about handling large amounts of data, but about how to exploit the link structure of the web to build a good search engine.
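To give a rough idea of what "exploiting the link structure" can mean, here is a minimal sketch of a PageRank-style power iteration in Python. It is only an illustration, not code taken from the book or from any of the papers listed in this thread: the function name `pagerank`, the `damping` and `iterations` parameters, and the four-page `toy_web` graph are all assumptions made up for this example.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Power-iteration PageRank over a dict {page: [pages it links to]}.

    Illustrative sketch only; names and defaults are assumptions.
    """
    # Collect every page that appears as a source or as a link target.
    pages = set(links)
    for targets in links.values():
        pages.update(targets)
    n = len(pages)

    # Start from a uniform distribution over all pages.
    rank = {p: 1.0 / n for p in pages}

    for _ in range(iterations):
        # Rank held by pages without outlinks is spread evenly over all pages.
        dangling = sum(rank[p] for p in pages if not links.get(p))
        new_rank = {p: (1.0 - damping) / n + damping * dangling / n for p in pages}

        # Every page passes its damped rank on in equal shares to its outlinks.
        for page, targets in links.items():
            if targets:
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
        rank = new_rank
    return rank


# Hypothetical four-page web: "c" is linked from three pages, "d" from none.
toy_web = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
ranks = pagerank(toy_web)
for page in sorted(ranks, key=ranks.get, reverse=True):
    print(page, round(ranks[page], 3))
```

In this toy graph the page with the most inlinks ends up ranked first and the page nobody links to ends up last. A real engine would work on a sparse, disk-backed representation of the graph rather than a Python dict.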
Articles:
Searching the Web
Arasu, Arvind; Cho, Junghoo; Garcia-Molina, Hector; Paepcke, Andreas; Raghavan, Sriram
http://dbpubs.stanford.edu:8090/pub/2000-37
This article offers a very comprehensive overview of current Web search engine design. IMHO a must-read.
Efficient Crawling Through URL Ordering
Cho, J.; Garcia-Molina, H.; Page, L.
http://dbpubs.stanford.edu:8090/pub/1998-51
Describes methods for ordering the crawl so that the "best" pages are retrieved first.
Inferring Web Communities from Link Topology
David Gibson, Jon Kleinberg, Prabhakar Raghavan
http://citeseer.ist.psu.edu/gibson98inferring.html
Explains how the link structure of the web can be used to find web communities and hence authority pages.
_________________ http://www.neomo.de - the search engine alternative (test version)
cwenz Newbie
Joined: 23 May 2004 Posts: 1
Posted: Fri Jun 11, 2004 1:13 pm
One more:
Books:
Web Document Analysis
Apostolos Antonacopoulos, Jianying Hu [Eds.]
http://www.worldscientific.com/books/compsci/5375.html
This book consists of several papers on various aspects of document analysis, including web image processing, content extraction, CAPTCHAs, and so on.
runarb Site Admin
Joined: 29 Oct 2006 Posts: 4
Posted: Mon Dec 26, 2005 1:51 am
Lucene in Action
Erik Hatcher and Otis Gospodnetić
http://www.manning.com/books/hatcher2
Apache Lucene is an open source, high-performance, full-featured text search engine library written entirely in Java. It is suitable for nearly any application that requires full-text search, especially cross-platform ones.
Lucene in Action is the authoritative guide to it. It covers analysis, indexing, searching and querying, as well as crawling the web and handling different file formats such as HTML, Word, PDF and XML.
Reviews:
http://www.oscom.org/events/oscom4/proposals/lucene
http://www.amazon.com/exec/obidos/tg/detai...394281?v=glance
runarb Site Admin
Joined: 29 Oct 2006 Posts: 4
Posted: Wed Apr 19, 2006 1:02 am
Practical Issues of Crawling Large Web Collections
Carlos Castillo and Ricardo Baeza-Yates
During large crawls of the Web, we have observed several anomalies in the implementation of the basic protocols by some Web sites. These anomalies impose costs on the design of a Web crawler and reduce the findability of information on the Web.
We document several issues related to networking, DNS, HTTP, HTML and application programming. Our aim is to help Web crawler designers and Web application developers to improve the interoperability of their Web systems.
http://www.dcc.uchile.cl/~ccastill/papers/...eb_crawling.pdf