sirdf.com Forum Index
Search & Information Retrieval Development Forum

Crawler Policy

 
sirdf.com Forum Index -> Running a search engine
Fischerlaender
Member


Joined: 08 May 2004
Posts: 11
Location: Osterhofen, Bavaria, Germany

Posted: Tue Jul 13, 2004 10:14 am

One thing that is essential for any crawler-based search engine is a reasonable policy for your crawler.

Crawlers can consume a lot of other people's resources, so you have to be careful what you do. The most efficient way for a crawler to fetch documents would be to start with the robots.txt file, analyze it, and then download all documents from that host as fast as possible. This minimizes crawling time, because you only have to download the robots.txt file once and DNS requests are also kept to a minimum.

Obviously the webmaster of the host wouldn't be too happy about such a crawler policy. So I'm doing it like this: the crawler takes a chunk (about ten or twenty) of URLs from a single host in a row. It fetches the appropriate robots.txt file and downloads all of the ten or twenty URLs from that host with a minimum time lag of one second between requests. Once those URLs are downloaded, the crawler moves on to a different host.

I found that this policy is a good compromise between keeping DNS and robots.txt traffic low and being nice to the web hosts out there.
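A minimal Python sketch of that policy (all names here are illustrative, including the "MyCrawler" agent string; this is not the actual Neomo crawler code): group the URL frontier into per-host batches of at most twenty, then for each batch fetch robots.txt once and download the allowed URLs with a polite pause between requests.

```python
import time
import urllib.request
import urllib.robotparser
from collections import defaultdict
from urllib.parse import urlparse

BATCH_SIZE = 20   # ten to twenty URLs per host, as described above
MIN_DELAY = 1.0   # minimum seconds between requests to the same host

def batches_by_host(urls, batch_size=BATCH_SIZE):
    """Group the URL frontier into per-host batches of at most batch_size."""
    by_host = defaultdict(list)
    for url in urls:
        by_host[urlparse(url).netloc].append(url)
    return [(host, hurls[i:i + batch_size])
            for host, hurls in by_host.items()
            for i in range(0, len(hurls), batch_size)]

def crawl_batch(host, urls, agent="MyCrawler"):
    """Fetch robots.txt once, then download the batch with a polite delay."""
    rp = urllib.robotparser.RobotFileParser(f"http://{host}/robots.txt")
    rp.read()  # one robots.txt fetch (and one DNS lookup) per batch
    pages = []
    for url in urls:
        if rp.can_fetch(agent, url):
            with urllib.request.urlopen(url, timeout=10) as resp:
                pages.append((url, resp.read()))
            time.sleep(MIN_DELAY)  # be nice before the next request
    return pages
```

The batching step keeps the per-host overhead (DNS lookup, robots.txt fetch) amortized over the whole batch, which is the compromise the post describes.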
_________________
http://www.neomo.de - the search engine alternative (test version)
runarb
Site Admin


Joined: 29 Oct 2006
Posts: 4

Posted: Wed Jul 14, 2004 9:31 pm

Quote:

It fetches the appropriate robots.txt file and downloads all of the ten or twenty URLs from that host with a minimum time lag of one second between requests. Once those URLs are downloaded, the crawler moves on to a different host.


Does the crawler have to sit around waiting for the 1-second delay, or does it crawl other URLs in between?

How many URLs can you crawl per day with this setup? The Boitho crawler can crawl about 1.2 million URLs per day on a 2.4 GHz laptop with 256 MB RAM and a 2 Mbit ADSL line.
_________________
CTO @ Searchdaimon company search.
Fischerlaender
Member


Joined: 08 May 2004
Posts: 11
Location: Osterhofen, Bavaria, Germany

Posted: Wed Jul 14, 2004 11:40 pm

Quote:
Does the crawler have to sit around waiting for the 1-second delay, or does it crawl other URLs in between?

The crawler isn't sitting idle in the meantime, it is sleeping. :)
Because the crawler consists of several crawling processes, this doesn't hurt performance much, if at all.
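The several-processes idea can be sketched with worker threads pulling host batches from a shared queue (a hypothetical illustration, not the actual crawler): a worker that sleeps between requests to one host only blocks itself, while its siblings keep fetching from other hosts.

```python
import queue
import threading
import time

def worker(batch_queue, results, delay):
    """Process one host batch at a time. Sleeping between requests to a
    host only blocks this worker, not the other workers."""
    while True:
        try:
            host, urls = batch_queue.get_nowait()
        except queue.Empty:
            return  # no more batches: this worker is done
        for url in urls:
            results.append(url)  # placeholder for the actual page fetch
            time.sleep(delay)    # polite per-host pause

def run_crawlers(batches, num_workers=4, delay=1.0):
    """Start num_workers workers over a shared queue of host batches."""
    batch_queue = queue.Queue()
    for batch in batches:
        batch_queue.put(batch)
    results = []
    workers = [threading.Thread(target=worker,
                                args=(batch_queue, results, delay))
               for _ in range(num_workers)]
    for w in workers:
        w.start()
    for w in workers:
        w.join()
    return results
```

With enough workers relative to the delay, the per-host pauses overlap and total throughput is barely affected, which matches the "doesn't hurt performance" observation above.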

Quote:

How many URLs can you crawl per day with this setup? The Boitho crawler can crawl about 1.2 million URLs per day on a 2.4 GHz laptop with 256 MB RAM and a 2 Mbit ADSL line.

It crawls about 700,000 URLs a day on a 1500 MHz Celeron with 512 MB RAM. Bandwidth isn't the bottleneck; it's the hardware the crawler runs on. But because crawling is very easy to parallelize, I didn't put too much effort into performance improvements.
I should add that the crawler isn't just fetching pages: it also extracts links from the crawled pages and even parses the HTML on the fly.
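On-the-fly link extraction can be done with a streaming parser, so links are collected while the page is being parsed rather than in a second pass. A minimal sketch using Python's standard-library html.parser (illustrative only; the post doesn't say which parser the crawler actually uses):

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

class LinkExtractor(HTMLParser):
    """Collect absolute link targets as the HTML stream is parsed."""
    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    # resolve relative links against the page's own URL
                    self.links.append(urljoin(self.base_url, value))

def extract_links(base_url, html):
    parser = LinkExtractor(base_url)
    parser.feed(html)
    return parser.links
```

Because feed() accepts partial input, the same extractor works incrementally on chunks as they arrive from the network, which is what makes "parsing while crawling" cheap.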
_________________
http://www.neomo.de - the search engine alternative (test version)

Powered by phpBB © 2001, 2005 phpBB Group