Fischerlaender Member
Joined: 08 May 2004 Posts: 11 Location: Osterhofen, Bavaria, Germany
Posted: Tue Jul 13, 2004 10:14 am Post subject:
One thing that is essential for any crawler-based search engine is a reasonable policy for your crawler.
Crawlers can consume a lot of resources from other people's servers, so you have to be careful about what you do. The most efficient way for a crawler to fetch documents would be to start with the robots.txt file, analyze it, and then download all documents from that host as fast as possible. That would minimize crawling time, because you only have to download the robots.txt file once and DNS requests are also kept to a minimum.
Obviously the webmaster of the host wouldn't be too happy about such a crawler policy. So I'm doing it like this: the crawler takes a chunk (about ten or twenty) of URLs from a single host in a row. It fetches the appropriate robots.txt file and downloads those ten or twenty URLs from that host, with a minimum delay of one second between requests. Once those URLs are downloaded, the crawler contacts a different host.
I found that this policy is a good compromise between keeping DNS and robots.txt traffic low and being nice to the web hosts out there. _________________ <a href='http://www.neomo.de' target='_blank'>http://www.neomo.de</a> - the search-engine alternative (test version)
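The chunked per-host policy described above can be sketched roughly like this (Python; the function names, chunk size, and the caller-supplied `fetch` are illustrative, not the actual crawler's code):

```python
import time
from urllib.parse import urlsplit

def host_batches(urls, batch_size=10):
    """Group a frontier of URLs by host and yield batches of at most
    batch_size URLs from one host in a row, so robots.txt only needs
    to be fetched once per batch."""
    by_host = {}
    order = []
    for url in urls:
        host = urlsplit(url).netloc
        if host not in by_host:
            by_host[host] = []
            order.append(host)
        by_host[host].append(url)
    for host in order:
        pending = by_host[host]
        for i in range(0, len(pending), batch_size):
            yield host, pending[i:i + batch_size]

def crawl_batch(batch, fetch, delay=1.0):
    """Download one batch with at least `delay` seconds between
    requests; `fetch` is a caller-supplied download function that
    should already have checked robots.txt for this host."""
    pages = []
    for i, url in enumerate(batch):
        if i:
            time.sleep(delay)  # be nice: minimum gap between requests
        pages.append(fetch(url))
    return pages
```

In a real crawler, `fetch` would honour the host's robots.txt before each download; Python's stdlib `urllib.robotparser` can parse that file.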
runarb Site Admin
Joined: 29 Oct 2006 Posts: 4
Posted: Wed Jul 14, 2004 9:31 pm Post subject:
Quote:
It fetches the appropriate robots.txt file and downloads those ten or twenty URLs from that host with a minimum delay of one second between requests. Once those URLs are downloaded, the crawler contacts a different host.
Does the crawler have to sit around waiting for the 1-second delay, or is it crawling other URLs in between?
How many URLs can you crawl per day with this setup? The Boitho crawler can crawl about 1.2 million URLs per day on a 2.4 GHz laptop with 256 MB RAM and a 2 Mbit ADSL line. _________________ CTO @ Searchdaimon company search.
Fischerlaender Member
Joined: 08 May 2004 Posts: 11 Location: Osterhofen, Bavaria, Germany
Posted: Wed Jul 14, 2004 11:40 pm Post subject:
Quote: Does the crawler have to sit around waiting for the 1-second delay, or is it crawling other URLs in between?
The crawler isn't sitting around in the meantime, it is sleeping.
Because every crawler consists of several crawling processes, this doesn't hurt performance too badly - if at all.
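The idea of several crawling processes hiding the per-host delay can be sketched like this: one worker per host queue, so while one worker sleeps between requests, the others keep downloading. A minimal illustration (threads stand in for the crawling processes; all names are hypothetical, not the poster's actual code):

```python
import threading
import time

def crawl_host(host, urls, results, lock, delay=0.01):
    """Fetch one host's URLs with a delay between requests.
    Other workers keep running while this one sleeps."""
    for i, url in enumerate(urls):
        if i:
            time.sleep(delay)  # politeness gap for this host only
        with lock:
            results.append((host, url))

def crawl_all(hosts, delay=0.01):
    """Run one worker per host so the per-host delays overlap
    instead of adding up."""
    results, lock, threads = [], threading.Lock(), []
    for host, urls in hosts.items():
        t = threading.Thread(target=crawl_host,
                             args=(host, urls, results, lock, delay))
        t.start()
        threads.append(t)
    for t in threads:
        t.join()
    return results

hosts = {"a.com": ["/1", "/2"], "b.com": ["/1", "/2"]}
fetched = crawl_all(hosts)
```

With two hosts of two URLs each, the total wall-clock time is roughly one delay rather than two, because both workers sleep concurrently.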
Quote:
How many URLs can you crawl per day with this setup? The Boitho crawler can crawl about 1.2 million URLs per day on a 2.4 GHz laptop with 256 MB RAM and a 2 Mbit ADSL line.
It crawls about 700,000 URLs a day on a Celeron with 1500 MHz and 512 MB RAM. The bandwidth isn't the bottleneck; it's the hardware the crawler is running on. But because crawling is very easily parallelized, I didn't put too much effort into performance improvements.
I should state that the crawler isn't just crawling: it is also extracting links from the crawled pages and even parsing the HTML code on the fly. _________________ <a href='http://www.neomo.de' target='_blank'>http://www.neomo.de</a> - the search-engine alternative (test version)
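Extracting links while parsing can be done in a single streaming pass over the HTML; here is a small sketch using Python's stdlib `html.parser` (the thread does not show the original crawler's parser, so this is only an illustration of the technique):

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

class LinkExtractor(HTMLParser):
    """Collect absolute link targets while the HTML streams through
    the parser, so no second pass over the document is needed."""
    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    # Resolve relative links against the page's URL.
                    self.links.append(urljoin(self.base_url, value))

parser = LinkExtractor("http://example.com/dir/")
parser.feed('<a href="page.html">x</a> <a href="http://other.org/">y</a>')
```

After `feed()`, `parser.links` holds the absolute URLs ready to be pushed back into the crawl frontier.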