runarb Site Admin
Joined: 29 Oct 2006 Posts: 4
Posted: Wed Jun 02, 2004 1:49 pm
What kind of time frame is one looking at to build the first large setup/demo with 200-1,000 million pages?
I would guess 2-3 years of work time, but I have never seen any numbers on this. What time estimates have you made?
_________________ CTO @ Searchdaimon company search.
mex Newbie
Joined: 02 Jun 2004 Posts: 1
Posted: Wed Jun 02, 2004 5:35 pm
If my memory serves me correctly, Google was built by 2 people in 1.5 years.
Fischerlaender Member
Joined: 08 May 2004 Posts: 11 Location: Osterhofen, Bavaria, Germany
Posted: Sun Jun 06, 2004 12:13 pm
I think this depends largely on the knowledge and experience of the people trying to build the engine, so a simple answer is not that easy.
Quote: | Google was built by 2 people in 1.5 years |
I think there were a lot more people involved. E.g. the Stanford WebBase project was one of the foundations of Google, so Page and Brin had co-workers of a sort at the university. _________________ http://www.neomo.de - the search engine alternative (test version)
sfk Newbie
Joined: 16 Jul 2004 Posts: 3
Posted: Fri Jul 16, 2004 12:17 am
To add a more specific question from another newbie:
What kind of hardware and operating system (Linux) add-ons would you recommend for such a goal?
I'm personally pursuing another specialized search engine project (geometa.info) and started with a single PC running Linux and Java.
Now I really want to scale up my infrastructure from crawling 200,000 pages to between 3 and 5 million pages.
Fischerlaender Member
Joined: 08 May 2004 Posts: 11 Location: Osterhofen, Bavaria, Germany
Posted: Fri Jul 16, 2004 9:18 am
Quote: | What kind of hardware and operating system (Linux) add-ons would you recommend for such a goal? |
My very first version of a crawler ran on an old Linux (Debian, no modifications) box that I once used as a web server: Celeron 800, 768MB RAM, 40GB IDE HD, motherboard with Intel BX chipset. (You see, very basic hardware ...) It was connected to the internet via an ADSL line (768kbit down, 128kbit up). The crawler was written in Perl and did HTML parsing and link extraction on the fly. With this configuration I could crawl 500,000 URLs per day and built an index of about 3 million pages within a week.
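(Not the original Perl - just a rough Java sketch of the same on-the-fly idea: read the page as it arrives and pull out href targets as you go, rather than saving the HTML and parsing it in a second pass. The LinkExtractor name, the charset and the regex are simplifications for illustration, not anything from the actual crawler.)
Code:
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URL;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LinkExtractor {

    // Crude href matcher; a real crawler should use a proper HTML parser.
    private static final Pattern HREF =
            Pattern.compile("href\\s*=\\s*[\"']([^\"'#]+)", Pattern.CASE_INSENSITIVE);

    public static List<String> extractLinks(String pageUrl) throws Exception {
        URL base = new URL(pageUrl);
        List<String> links = new ArrayList<String>();
        BufferedReader in = new BufferedReader(
                new InputStreamReader(base.openStream(), "ISO-8859-1"));
        try {
            String line;
            while ((line = in.readLine()) != null) {            // parse while reading, no temp file
                Matcher m = HREF.matcher(line);
                while (m.find()) {
                    links.add(new URL(base, m.group(1)).toString()); // resolve relative links
                }
            }
        } finally {
            in.close();
        }
        return links;
    }

    public static void main(String[] args) throws Exception {
        for (String link : extractLinks("http://example.com/")) {
            System.out.println(link);
        }
    }
}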
While it's not strictly necessary, running a DNS cache can improve things a lot. I found that DNS traffic can grow to 25% of all network traffic during a crawl. With a DNS cache on my crawler box, the DNS share was down to about 6%.
Try these things to speed up your crawling, in order of decreasing importance:
* Use the time spent waiting for one web server's response to contact other sites, i.e. fetch many URLs in parallel (a sketch follows below).
* Speed up the way you feed URLs to your crawler.
* Use a DNS cache. _________________ http://www.neomo.de - the search engine alternative (test version)
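A minimal Java sketch of the first two points, reusing the hypothetical LinkExtractor from above: a bounded queue feeds URLs to a small pool of worker threads, so the time one thread spends waiting on a slow server is used by the others. Whatever generates URLs just keeps the frontier topped up and never blocks on the network. Politeness (per-host delays, robots.txt) is deliberately left out.
Code:
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class CrawlerPool {

    private static final String POISON = "STOP";               // sentinel to shut workers down

    public static void main(String[] args) throws Exception {
        final BlockingQueue<String> frontier = new ArrayBlockingQueue<String>(10000);
        final int workers = 20;                                 // tune to your bandwidth
        ExecutorService pool = Executors.newFixedThreadPool(workers);

        for (int i = 0; i < workers; i++) {
            pool.submit(new Runnable() {
                public void run() {
                    try {
                        String url;
                        while (!(url = frontier.take()).equals(POISON)) {
                            try {
                                // While this worker waits on a slow server,
                                // the other workers keep fetching.
                                List<String> links = LinkExtractor.extractLinks(url);
                                // TODO: de-duplicate and push new links back into the frontier
                                System.out.println(url + " -> " + links.size() + " links");
                            } catch (Exception fetchError) {
                                // a dead host should never stall the whole crawl
                            }
                        }
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt();
                    }
                }
            });
        }

        // Feeding the queue is decoupled from fetching (tip 2).
        frontier.put("http://example.com/");
        // ... keep feeding URLs here ...
        for (int i = 0; i < workers; i++) {
            frontier.put(POISON);                               // one sentinel per worker
        }
        pool.shutdown();
    }
}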
sfk Newbie
Joined: 16 Jul 2004 Posts: 3
Posted: Sun Jul 18, 2004 12:07 pm
Thanks for your reply.
Our poor count of about 200,000 pages has several reasons:
One reason is that we are using Heritrix with a dozen plugins doing classification and focused crawling. Another is the limited CPU power and memory.
Java itself does not seem to be the bottleneck yet - it's rather the architecture and the lack of DNS caching, as you suggest. It should be noted that until now we were only interested in very specific pages (here: only those relating to geographic data, services or information). That is different for the upcoming project: now we want *any* URL we can get.
From the answers we got, it seems that memory and CPU are the bottleneck.
What I expected - but no one has mentioned yet - was CPU-level caching (or something similar), or Linux on multiple processors...?
There seems to be no "best practice" available out there yet.
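On the DNS caching point in a Java crawler: the JVM already caches lookups inside InetAddress, and the time-to-live can be raised with the networkaddress.cache.ttl security property, set before the first lookup. The values below are only examples; a local caching name server on the crawler box, as suggested above, still helps every process on the machine, not just the JVM.
Code:
import java.security.Security;

public class DnsTuning {
    // Call this once, early in the crawler's main(), before the first lookup.
    public static void configureJvmDnsCache() {
        // Cache successful lookups for one hour instead of the JVM default.
        Security.setProperty("networkaddress.cache.ttl", "3600");
        // Cache failed lookups briefly too, so dead hosts don't hammer the resolver.
        Security.setProperty("networkaddress.cache.negative.ttl", "60");
    }
}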
scolls Newbie
Joined: 08 Apr 2006 Posts: 8
Posted: Sat Apr 08, 2006 11:21 pm
I wrote mine in about 10 months (part-time) using Delphi, with the interface in PHP. Of course, with me nothing is ever finished, so over the years I expect it will get better and better.
I actually started writing it by mistake! I was writing myself a small keyword density analysis tool to analyse web pages still on my PC, and figured it would be nice if it could also work on live URLs... a few months later I ended up with the search engine listed in my signature... I get side-tracked, you see! _________________ WebWobot Search Engine - http://www.webwobot.com / MassDebation.com ~ No Ordinary Debate!
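For anyone wondering what such a tool actually computes: keyword density is usually just the number of occurrences of a term divided by the total word count of the page. A rough sketch in Java, with deliberately naive tokenisation:
Code:
import java.util.Locale;

public class KeywordDensity {

    // Fraction of words on the page that match the keyword (0.0 - 1.0).
    public static double density(String pageText, String keyword) {
        String[] words = pageText.toLowerCase(Locale.ENGLISH).split("\\W+");
        if (words.length == 0) {
            return 0.0;
        }
        int hits = 0;
        for (String word : words) {
            if (word.equals(keyword.toLowerCase(Locale.ENGLISH))) {
                hits++;
            }
        }
        return (double) hits / words.length;
    }

    public static void main(String[] args) {
        String text = "Search engines rank pages; a search engine crawls and indexes pages.";
        System.out.printf("density = %.2f%%%n", 100 * density(text, "search"));
    }
}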
masidani Member
Joined: 10 Jan 2006 Posts: 23
Posted: Fri Apr 21, 2006 2:33 pm
This is a really good discussion, if only because it seems my initial projections for my search engine (masidani.com) were hopelessly unrealistic! I'm trying to build this part-time (whilst working a full-time job) and I find the time slipping away so fast! But then I'm not writing everything from scratch: I am using larbin for the crawling (plus extra features) and then Lucene (plus extra packages) for indexing etc. So my current estimate is another year of (part-time) work.
Of course, this doesn't really take into account what would happen if the initial prototype doesn't meet expectations and requires further painstaking research... Project management isn't one of my strong points.
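For anyone curious what the Lucene side of such a setup amounts to, here is a minimal indexing sketch against the Lucene 1.9/2.0-era API; the field names, index path and example values are just placeholders. Searching then goes through IndexSearcher and QueryParser, but the loop below is the part the crawler has to feed.
Code:
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;

public class PageIndexer {

    public static void main(String[] args) throws Exception {
        // true = create a new index in the "index" directory
        IndexWriter writer = new IndexWriter("index", new StandardAnalyzer(), true);

        Document doc = new Document();
        // Store the URL so it can be shown in results; tokenise the body for searching.
        doc.add(new Field("url", "http://example.com/", Field.Store.YES, Field.Index.UN_TOKENIZED));
        doc.add(new Field("contents", "page text extracted by the crawler ...",
                          Field.Store.NO, Field.Index.TOKENIZED));
        writer.addDocument(doc);

        writer.optimize();   // merge segments before closing
        writer.close();
    }
}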
Simon
old_expat Newbie
Joined: 12 Apr 2006 Posts: 9
Posted: Sat Apr 22, 2006 10:08 am
QUOTE (Fischerlaender @ Jul 16 2004, 09:18 AM) | Quote: | What kind of hardware and operating system (Linux) add-ons would you recommend for such a goal? |
My very first version of a crawler ran on an old Linux (Debian, no modifications) box that I once used as a web server: Celeron 800, 768MB RAM, 40GB IDE HD, motherboard with Intel BX chipset. (You see, very basic hardware ...) It was connected to the internet via an ADSL line (768kbit down, 128kbit up). The crawler was written in Perl and did HTML parsing and link extraction on the fly. With this configuration I could crawl 500,000 URLs per day and built an index of about 3 million pages within a week.
While it's not strictly necessary, running a DNS cache can improve things a lot. I found that DNS traffic can grow to 25% of all network traffic during a crawl. With a DNS cache on my crawler box, the DNS share was down to about 6%.
Try these things to speed up your crawling, in order of decreasing importance:
* Use the time spent waiting for one web server's response to contact other sites, i.e. fetch many URLs in parallel.
* Speed up the way you feed URLs to your crawler.
* Use a DNS cache. |
Hello Fischerlaender - your crawler sounds like it was very efficient. Would a crawler as you describe above also be able to focus on specific topics at the same time it did the parsing and link extraction?
Are you still using that crawler?
Would it somehow be available to a poor startup? :)
What about indexing and search - how did you handle that?
BTW, if I am being rude by asking these questions in this manner, my apologies.