Fischerlaender Member
Joined: 08 May 2004 Posts: 11 Location: Osterhofen, Bavaria, Germany
Posted: Tue Jul 13, 2004 10:14 am Post subject:
One thing that is essential for any crawler-based search engine is a reasonable policy for your crawler.
Crawlers can consume a lot of resources from other people's servers, so you have to be careful about what you do. The most efficient way for a crawler to fetch documents would be to start with the robots.txt file, analyze it, and then download all documents from that host as fast as possible. That would minimize crawling time, because you only have to download the robots.txt file once and DNS requests are also kept to a minimum.
Obviously the webmaster of the host wouldn't be too happy about such a crawler policy. So I'm doing it like this: the crawler takes a chunk (about ten or twenty) of URLs from a single host in a row. It fetches the appropriate robots.txt file and downloads those ten or twenty URLs from that host, with a minimum delay of one second between requests. Once those URLs are downloaded, the crawler contacts a different host.
I found that this policy is a good compromise between keeping DNS and robots.txt traffic low and being nice to the web hosts out there. _________________ <a href='http://www.neomo.de' target='_blank'>http://www.neomo.de</a> - the search-engine alternative (test version)
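The chunked per-host policy described above can be sketched roughly like this (Python; the function names, chunk size, and the caller-supplied `fetch` are illustrative, not the actual crawler's code):

```python
import time
from urllib.parse import urlsplit

def host_batches(urls, batch_size=10):
    """Group a frontier of URLs by host and yield batches of at most
    batch_size URLs from one host in a row, so robots.txt only needs
    to be fetched once per batch."""
    by_host = {}
    order = []
    for url in urls:
        host = urlsplit(url).netloc
        if host not in by_host:
            by_host[host] = []
            order.append(host)
        by_host[host].append(url)
    for host in order:
        pending = by_host[host]
        for i in range(0, len(pending), batch_size):
            yield host, pending[i:i + batch_size]

def crawl_batch(batch, fetch, delay=1.0):
    """Download one batch with at least `delay` seconds between
    requests; `fetch` is a caller-supplied download function that
    should already have checked robots.txt for this host."""
    pages = []
    for i, url in enumerate(batch):
        if i:
            time.sleep(delay)  # be nice: minimum gap between requests
        pages.append(fetch(url))
    return pages
```

In a real crawler, `fetch` would honour the host's robots.txt before each download; Python's stdlib `urllib.robotparser` can parse that file.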
runarb Site Admin
Joined: 29 Oct 2006 Posts: 4
Posted: Wed Jul 14, 2004 9:31 pm Post subject:
Quote:
It fetches the appropriate robots.txt file and downloads those ten or twenty URLs from that host with a minimum delay of one second between requests. Once those URLs are downloaded, the crawler contacts a different host.
Does the crawler have to sit around waiting for the 1-second delay, or is it crawling other URLs in between?
How many URLs can you crawl per day with this setup? The Boitho crawler can crawl about 1.2 million URLs per day on a 2.4 GHz laptop with 256 MB RAM and a 2 Mbit ADSL line. _________________ CTO @ Searchdaimon company search.
Fischerlaender Member
Joined: 08 May 2004 Posts: 11 Location: Osterhofen, Bavaria, Germany
Posted: Wed Jul 14, 2004 11:40 pm Post subject:
Quote: Does the crawler have to sit around waiting for the 1-second delay, or is it crawling other URLs in between?
The crawler isn't sitting around in the meantime, it is sleeping.
Because every crawler consists of several crawling processes, this doesn't hurt performance too badly - if at all.
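The idea of several crawling processes hiding the per-host delay can be sketched like this: one worker per host queue, so while one worker sleeps between requests, the others keep downloading. A minimal illustration (threads stand in for the crawling processes; all names are hypothetical, not the poster's actual code):

```python
import threading
import time

def crawl_host(host, urls, results, lock, delay=0.01):
    """Fetch one host's URLs with a delay between requests.
    Other workers keep running while this one sleeps."""
    for i, url in enumerate(urls):
        if i:
            time.sleep(delay)  # politeness gap for this host only
        with lock:
            results.append((host, url))

def crawl_all(hosts, delay=0.01):
    """Run one worker per host so the per-host delays overlap
    instead of adding up."""
    results, lock, threads = [], threading.Lock(), []
    for host, urls in hosts.items():
        t = threading.Thread(target=crawl_host,
                             args=(host, urls, results, lock, delay))
        t.start()
        threads.append(t)
    for t in threads:
        t.join()
    return results

hosts = {"a.com": ["/1", "/2"], "b.com": ["/1", "/2"]}
fetched = crawl_all(hosts)
```

With two hosts of two URLs each, the total wall-clock time is roughly one delay rather than two, because both workers sleep concurrently.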
Quote:
How many URLs can you crawl per day with this setup? The Boitho crawler can crawl about 1.2 million URLs per day on a 2.4 GHz laptop with 256 MB RAM and a 2 Mbit ADSL line.
It crawls about 700,000 URLs a day on a Celeron with 1500 MHz and 512 MB RAM. The bandwidth isn't the bottleneck; it's the hardware the crawler is running on. But because crawling is very easily parallelized, I didn't put too much effort into performance improvements.
I should state that the crawler isn't just crawling: it is also extracting links from the crawled pages and even parsing the HTML code on the fly. _________________ <a href='http://www.neomo.de' target='_blank'>http://www.neomo.de</a> - the search-engine alternative (test version)
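Extracting links while parsing can be done in a single streaming pass over the HTML; here is a small sketch using Python's stdlib `html.parser` (the thread does not show the original crawler's parser, so this is only an illustration of the technique):

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

class LinkExtractor(HTMLParser):
    """Collect absolute link targets while the HTML streams through
    the parser, so no second pass over the document is needed."""
    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    # Resolve relative links against the page's URL.
                    self.links.append(urljoin(self.base_url, value))

parser = LinkExtractor("http://example.com/dir/")
parser.feed('<a href="page.html">x</a> <a href="http://other.org/">y</a>')
```

After `feed()`, `parser.links` holds the absolute URLs ready to be pushed back into the crawl frontier.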