runarb Site Admin
Joined: 29 Oct 2006 Posts: 4
Posted: Wed Jun 02, 2004 1:49 pm
What kind of time frame is one looking at to build the first large setup/demo with 200-1,000 million pages?
I would guess 2-3 years of work time, but I have never seen any numbers on this. What time estimates have you made?
_________________ CTO @ Searchdaimon company search.
mex Newbie
Joined: 02 Jun 2004 Posts: 1
Posted: Wed Jun 02, 2004 5:35 pm
If my memory serves me correctly, Google was built by 2 people in 1.5 years.
Fischerlaender Member
Joined: 08 May 2004 Posts: 11 Location: Osterhofen, Bavaria, Germany
Posted: Sun Jun 06, 2004 12:13 pm
I think this depends largely on the knowledge and experience of the people trying to build the engine, so a simple answer is not that easy.
Quote: | Google was built by 2 people in 1.5 years |
I think there were a lot more people involved. E.g. the Stanford WebBase project was one of the foundations of Google, so Page and Brin had co-workers of a sort at the university. _________________ http://www.neomo.de - the search engine alternative (test version)
sfk Newbie
Joined: 16 Jul 2004 Posts: 3
Posted: Fri Jul 16, 2004 12:17 am
To add a more specific question from another newbie:
What kind of hardware and operating system (Linux) add-ons would you recommend for such a goal?
I'm personally pursuing another specialized search engine project (geometa.info) and started with a single PC running Linux and Java.
Now I really want to scale up my infrastructure from crawling 200,000 pages to between 3 and 5 million pages.
Fischerlaender Member
Joined: 08 May 2004 Posts: 11 Location: Osterhofen, Bavaria, Germany
Posted: Fri Jul 16, 2004 9:18 am
Quote: | What kind of hardware and operating system (Linux) add-ons would you recommend for such a goal? |
My very first version of a crawler ran on an old Linux (Debian, no modifications) box that I once used as a web server: Celeron 800, 768MB RAM, 40GB IDE HD, motherboard with Intel BX chipset. (You see, very basic hardware ...) It was connected to the internet via an ADSL line (768kbit down, 128kbit up). The crawler was written in Perl and did HTML parsing and link extraction on the fly. With this configuration I could crawl 500,000 URLs per day and built an index of about 3 million pages within a week.
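(Not the original Perl - just a rough Java sketch of the same on-the-fly idea: read the page as it arrives and pull out href targets as you go, rather than saving the HTML and parsing it in a second pass. The LinkExtractor name, the charset and the regex are simplifications for illustration, not anything from the actual crawler.)
Code:
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URL;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LinkExtractor {

    // Crude href matcher; a real crawler should use a proper HTML parser.
    private static final Pattern HREF =
            Pattern.compile("href\\s*=\\s*[\"']([^\"'#]+)", Pattern.CASE_INSENSITIVE);

    public static List<String> extractLinks(String pageUrl) throws Exception {
        URL base = new URL(pageUrl);
        List<String> links = new ArrayList<String>();
        BufferedReader in = new BufferedReader(
                new InputStreamReader(base.openStream(), "ISO-8859-1"));
        try {
            String line;
            while ((line = in.readLine()) != null) {            // parse while reading, no temp file
                Matcher m = HREF.matcher(line);
                while (m.find()) {
                    links.add(new URL(base, m.group(1)).toString()); // resolve relative links
                }
            }
        } finally {
            in.close();
        }
        return links;
    }

    public static void main(String[] args) throws Exception {
        for (String link : extractLinks("http://example.com/")) {
            System.out.println(link);
        }
    }
}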
While it's not strictly necessary, running a DNS cache can improve things a lot. I found that DNS traffic can grow to 25% of all network traffic during a crawl. With a DNS cache on my crawler box, the DNS share was down to about 6%.
Try these things to speed up your crawling, in order of decreasing importance:
* Use the time spent waiting for one web server's response to contact other sites, i.e. fetch many URLs in parallel (a sketch follows below).
* Speed up the way you feed URLs to your crawler.
* Use a DNS cache. _________________ http://www.neomo.de - the search engine alternative (test version)
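A minimal Java sketch of the first two points, reusing the hypothetical LinkExtractor from above: a bounded queue feeds URLs to a small pool of worker threads, so the time one thread spends waiting on a slow server is used by the others. Whatever generates URLs just keeps the frontier topped up and never blocks on the network. Politeness (per-host delays, robots.txt) is deliberately left out.
Code:
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class CrawlerPool {

    private static final String POISON = "STOP";               // sentinel to shut workers down

    public static void main(String[] args) throws Exception {
        final BlockingQueue<String> frontier = new ArrayBlockingQueue<String>(10000);
        final int workers = 20;                                 // tune to your bandwidth
        ExecutorService pool = Executors.newFixedThreadPool(workers);

        for (int i = 0; i < workers; i++) {
            pool.submit(new Runnable() {
                public void run() {
                    try {
                        String url;
                        while (!(url = frontier.take()).equals(POISON)) {
                            try {
                                // While this worker waits on a slow server,
                                // the other workers keep fetching.
                                List<String> links = LinkExtractor.extractLinks(url);
                                // TODO: de-duplicate and push new links back into the frontier
                                System.out.println(url + " -> " + links.size() + " links");
                            } catch (Exception fetchError) {
                                // a dead host should never stall the whole crawl
                            }
                        }
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt();
                    }
                }
            });
        }

        // Feeding the queue is decoupled from fetching (tip 2).
        frontier.put("http://example.com/");
        // ... keep feeding URLs here ...
        for (int i = 0; i < workers; i++) {
            frontier.put(POISON);                               // one sentinel per worker
        }
        pool.shutdown();
    }
}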
sfk Newbie
Joined: 16 Jul 2004 Posts: 3
Posted: Sun Jul 18, 2004 12:07 pm
Thanks for your reply.
Our poor count of about 200,000 pages has several reasons:
One reason is that we are using Heritrix with a dozen plugins doing classification and focused crawling. Another is the limited CPU power and memory.
Java itself does not seem to be the bottleneck yet - it's rather the architecture and the lack of DNS caching, as you suggest. It should be noted that until now we were only interested in very specific pages (here: only those relating to geographic data, services or information). That is different for the upcoming project: now we want *any* URL we can get.
From the answers we got, it seems that memory and CPU are the bottleneck.
What I expected - but no one has mentioned yet - was CPU-level caching (or something similar), or Linux on multiple processors...?
There seems to be no "best practice" available out there yet.
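On the DNS caching point in a Java crawler: the JVM already caches lookups inside InetAddress, and the time-to-live can be raised with the networkaddress.cache.ttl security property, set before the first lookup. The values below are only examples; a local caching name server on the crawler box, as suggested above, still helps every process on the machine, not just the JVM.
Code:
import java.security.Security;

public class DnsTuning {
    // Call this once, early in the crawler's main(), before the first lookup.
    public static void configureJvmDnsCache() {
        // Cache successful lookups for one hour instead of the JVM default.
        Security.setProperty("networkaddress.cache.ttl", "3600");
        // Cache failed lookups briefly too, so dead hosts don't hammer the resolver.
        Security.setProperty("networkaddress.cache.negative.ttl", "60");
    }
}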
scolls Newbie
Joined: 08 Apr 2006 Posts: 8
Posted: Sat Apr 08, 2006 11:21 pm
I wrote mine in about 10 months (part-time) using Delphi, with the interface in PHP. Of course, with me nothing is ever finished, so over the years I expect it will get better and better.
I actually started writing it by mistake! I was writing myself a small keyword density analysis tool to analyse web pages still on my PC, and figured it would be nice if it could also work on live URLs... a few months later I ended up with the search engine listed in my signature... I get side-tracked, you see! _________________ WebWobot Search Engine - http://www.webwobot.com / MassDebation.com ~ No Ordinary Debate!
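For anyone wondering what such a tool actually computes: keyword density is usually just the number of occurrences of a term divided by the total word count of the page. A rough sketch in Java, with deliberately naive tokenisation:
Code:
import java.util.Locale;

public class KeywordDensity {

    // Fraction of words on the page that match the keyword (0.0 - 1.0).
    public static double density(String pageText, String keyword) {
        String[] words = pageText.toLowerCase(Locale.ENGLISH).split("\\W+");
        if (words.length == 0) {
            return 0.0;
        }
        int hits = 0;
        for (String word : words) {
            if (word.equals(keyword.toLowerCase(Locale.ENGLISH))) {
                hits++;
            }
        }
        return (double) hits / words.length;
    }

    public static void main(String[] args) {
        String text = "Search engines rank pages; a search engine crawls and indexes pages.";
        System.out.printf("density = %.2f%%%n", 100 * density(text, "search"));
    }
}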
masidani Member
Joined: 10 Jan 2006 Posts: 23
Posted: Fri Apr 21, 2006 2:33 pm
This is a really good discussion, if only because it seems my initial projections for my search engine (masidani.com) were hopelessly unrealistic! I'm trying to build this part-time (whilst working a full-time job) and I find the time slipping away so fast! But then I'm not writing everything from scratch: I am using larbin for the crawling (plus extra features) and then Lucene (plus extra packages) for indexing etc. So my current estimate is another year of (part-time) work.
Of course, this doesn't really take into account what would happen if the initial prototype doesn't meet expectations and requires further painstaking research... Project management isn't one of my strong points.
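For anyone curious what the Lucene side of such a setup amounts to, here is a minimal indexing sketch against the Lucene 1.9/2.0-era API; the field names, index path and example values are just placeholders. Searching then goes through IndexSearcher and QueryParser, but the loop below is the part the crawler has to feed.
Code:
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;

public class PageIndexer {

    public static void main(String[] args) throws Exception {
        // true = create a new index in the "index" directory
        IndexWriter writer = new IndexWriter("index", new StandardAnalyzer(), true);

        Document doc = new Document();
        // Store the URL so it can be shown in results; tokenise the body for searching.
        doc.add(new Field("url", "http://example.com/", Field.Store.YES, Field.Index.UN_TOKENIZED));
        doc.add(new Field("contents", "page text extracted by the crawler ...",
                          Field.Store.NO, Field.Index.TOKENIZED));
        writer.addDocument(doc);

        writer.optimize();   // merge segments before closing
        writer.close();
    }
}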
Simon
old_expat Newbie
Joined: 12 Apr 2006 Posts: 9
Posted: Sat Apr 22, 2006 10:08 am
QUOTE (Fischerlaender @ Jul 16 2004, 09:18 AM) | Quote: | What kind of hardware and operating system (Linux) add-ons would you recommend for such a goal? |
My very first version of a crawler ran on an old Linux (Debian, no modifications) box that I once used as a web server: Celeron 800, 768MB RAM, 40GB IDE HD, motherboard with Intel BX chipset. (You see, very basic hardware ...) It was connected to the internet via an ADSL line (768kbit down, 128kbit up). The crawler was written in Perl and did HTML parsing and link extraction on the fly. With this configuration I could crawl 500,000 URLs per day and built an index of about 3 million pages within a week.
While it's not strictly necessary, running a DNS cache can improve things a lot. I found that DNS traffic can grow to 25% of all network traffic during a crawl. With a DNS cache on my crawler box, the DNS share was down to about 6%.
Try these things to speed up your crawling, in order of decreasing importance:
* Use the time spent waiting for one web server's response to contact other sites, i.e. fetch many URLs in parallel.
* Speed up the way you feed URLs to your crawler.
* Use a DNS cache. |
Hello Fischerlaender - your crawler sounds like it was very efficient. Would a crawler as you describe above also be able to focus on specific topics at the same time it did the parsing and link extraction?
Are you still using that crawler?
Would it somehow be available to a poor startup? :)
What about indexing and search - how did you handle that?
BTW, if I am being rude by asking these questions in this manner, my apologies.