sirdf.com - Search & Information Retrieval Development Forum

Time to build a search engine?

 
runarb
Site Admin


Joined: 29 Oct 2006
Posts: 4

Posted: Wed Jun 02, 2004 1:49 pm

What kind of time frame is one looking at to build a first large setup/demo with 200 to 1,000 million pages?

I would guess 2-3 years of work, but I have never seen any numbers on this. What time estimates have you made?

_________________
CTO @ Searchdaimon company search.
mex
Newbie


Joined: 02 Jun 2004
Posts: 1

Posted: Wed Jun 02, 2004 5:35 pm

If my memory serves me correctly, Google was built by 2 people in 1.5 years.
Fischerlaender
Member


Joined: 08 May 2004
Posts: 11
Location: Osterhofen, Bavaria, Germany

Posted: Sun Jun 06, 2004 12:13 pm

I think this depends largely on the knowledge and experience of the people trying to build the engine, so a simple answer isn't that easy.

Quote:
Google was built by 2 people in 1.5 years

I think a lot more people were involved. For example, the Stanford WebBase project was one of the foundations of Google, so Larry Page and Sergey Brin had co-workers of a sort at the university.
_________________
http://www.neomo.de - the search engine alternative (test version)
sfk
Newbie


Joined: 16 Jul 2004
Posts: 3

Posted: Fri Jul 16, 2004 12:17 am

To add a more specific question from another newbie:
What kind of hardware and operating system (Linux) add-ons would you recommend for such a goal?
I'm personally pursuing another specialized search engine project (geometa.info) and started with a single PC running Linux and Java.
Now I really want to scale my infrastructure up from crawling 200,000 pages to between 3 and 5 million pages.
Fischerlaender
Member


Joined: 08 May 2004
Posts: 11
Location: Osterhofen, Bavaria, Germany

Posted: Fri Jul 16, 2004 9:18 am

Quote:
What kind of hardware and operating system (Linux) add-ons would you recommend for such a goal?


My very first version of a crawler ran on an old Linux (Debian, no modifications) box that I once used as a web server: Celeron 800, 768MB RAM, 40GB IDE HD, motherboard with Intel BX chipset. (You see, very basic hardware ...) It was connected to the internet via an ADSL line (768kbit down, 128kbit up). The crawler was written in Perl and did HTML parsing and link extraction on the fly. With this configuration I could crawl 500,000 URLs per day and built an index of about 3 million pages within a week.
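
By "parsing and link extraction on the fly" I just mean that links are pulled out of the HTML the moment a page arrives and pushed straight onto the URL queue, rather than storing pages and post-processing them later. Very roughly this idea - an untested Java sketch, not my actual Perl code, and a naive regex rather than a proper HTML parser:

Code:

// Untested sketch: pull href targets straight out of the downloaded HTML
// and hand them to the URL queue. A regex like this is crude (no real HTML
// parsing, no resolution of relative links yet), but it shows the idea.
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LinkExtractor {
    private static final Pattern HREF =
            Pattern.compile("href\\s*=\\s*[\"']([^\"'#]+)", Pattern.CASE_INSENSITIVE);

    public static List<String> extractLinks(String html) {
        List<String> links = new ArrayList<>();
        Matcher m = HREF.matcher(html);
        while (m.find()) {
            links.add(m.group(1)); // still relative - resolve against the page URL later
        }
        return links;
    }

    public static void main(String[] args) {
        String html = "<a href=\"http://www.example.com/a.html\">a</a> <A HREF='/b.html'>b</A>";
        System.out.println(extractLinks(html)); // [http://www.example.com/a.html, /b.html]
    }
}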

While it's not strictly necessary, running a DNS cache can improve things a lot. I found that DNS traffic can grow to 25% of all network traffic during a crawl. With a DNS cache on my crawler box, the DNS share was down to about 6%.

Try these things to speed up your crawling, in order of decreasing importance (a rough sketch of the first point follows below):
* Use the time spent waiting for one web server's response to contact other sites.
* Speed up the way you feed URLs to your crawler.
* Use a DNS cache.
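
To make the first point concrete, here is a rough sketch (untested Java, not my actual Perl code, and the URLs are just placeholders): a small pool of worker threads downloads from different hosts at the same time, so waiting on one slow server doesn't stall the whole crawl. The same effect can be had with non-blocking I/O or simply by running several crawler processes in parallel.

Code:

// Untested sketch: while one thread waits for a slow web server, the other
// threads keep fetching from different sites, so the waiting time is used
// instead of wasted.
import java.io.InputStream;
import java.net.URL;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class ParallelFetcher {
    public static void main(String[] args) throws Exception {
        // placeholder URLs - in a real crawler these come from the URL queue
        List<String> urls = List.of(
                "http://www.example.com/",
                "http://www.example.org/",
                "http://www.example.net/");

        ExecutorService pool = Executors.newFixedThreadPool(8);
        for (String u : urls) {
            pool.submit(() -> fetch(u));
        }
        pool.shutdown();
        pool.awaitTermination(10, TimeUnit.MINUTES);
    }

    static void fetch(String urlString) {
        try (InputStream in = new URL(urlString).openStream()) {
            byte[] page = in.readAllBytes();
            System.out.println(urlString + ": " + page.length + " bytes");
            // ... parse the HTML, extract links, feed them back to the queue ...
        } catch (Exception e) {
            System.err.println(urlString + ": " + e);
        }
    }
}
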
_________________
http://www.neomo.de - the search engine alternative (test version)
sfk
Newbie


Joined: 16 Jul 2004
Posts: 3

Posted: Sun Jul 18, 2004 12:07 pm

Thanks for your reply.

Our modest count of about 200,000 pages has several reasons:
One is that we are using Heritrix with a dozen plugins running for classification and focused crawling. Another is the limited power and size of our CPU and memory.

Java itself does not yet seem to be the bottleneck - it's rather the architecture and the lack of DNS caching, as you suggest. It should be noted that until now we have only been interested in very specific pages (here: only those that relate to geographic data, services or information). That is different for the upcoming project: we now want *any* URL we can get.
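
One low-effort thing we will probably try first on the Java side - untested, and I'm only going by the JDK documentation here - is raising the JVM's own DNS cache TTL before setting up a separate caching name server:

Code:

// Untested: ask the JVM's built-in resolver cache to keep successful DNS
// lookups for an hour (and failed ones for a minute) instead of the short
// default. Must run before the first lookup, i.e. before the crawler starts.
import java.security.Security;

public class DnsCacheConfig {
    public static void main(String[] args) {
        Security.setProperty("networkaddress.cache.ttl", "3600");        // seconds, -1 = forever
        Security.setProperty("networkaddress.cache.negative.ttl", "60"); // failed lookups

        // ... start the crawler threads from here ...
    }
}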

From the answers we got, it seems that memory and CPU are the bottleneck.
What I expected - but nobody has mentioned yet - was something about CPU register caching (or something similar), or Linux on multiple processors...?

There seems to be no "best practice" available out there yet.
Back to top
View user's profile Send private message
scolls
Newbie


Joined: 08 Apr 2006
Posts: 8

Posted: Sat Apr 08, 2006 11:21 pm

I wrote mine in about 10 months (part-time) using Delphi. Interface in PHP. Of course, with me nothing is ever finished, so over the years I expect it should get better and better.

I actually started writing it by mistake! I was writing myself a small keyword density analysis tool to analyse webpages still on my PC, and figured it would be nice if it could also work on live URLs too... a few months later I ended up with the search engine listed in my signature... I get side-tracked, you see!!! :D
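
The keyword density part itself is nothing fancy - roughly this calculation (my real code is Delphi; this untested Java version is just to show the idea):

Code:

// Untested illustration of the keyword-density calculation: what share of
// the page's words are the given keyword, as a percentage.
public class KeywordDensity {
    public static double density(String pageText, String keyword) {
        // crude tokenisation: lower-case, split on anything that isn't a letter or digit
        String[] words = pageText.toLowerCase().split("[^\\p{L}\\p{N}]+");
        int total = 0, hits = 0;
        for (String w : words) {
            if (w.isEmpty()) continue;
            total++;
            if (w.equals(keyword.toLowerCase())) hits++;
        }
        return total == 0 ? 0.0 : 100.0 * hits / total;
    }

    public static void main(String[] args) {
        String text = "Search engines rank pages; a search engine counts words.";
        System.out.printf("density of 'search': %.1f%%%n", density(text, "search"));
    }
}
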
_________________
WebWobot Search Engine - http://www.webwobot.com | MassDebation.com ~ No Ordinary Debate! - http://www.massdebation.com
masidani
Member


Joined: 10 Jan 2006
Posts: 23

Posted: Fri Apr 21, 2006 2:33 pm

This is a really good discussion, if only because it seems my initial projections for my search engine (masidani.com) were hopelessly unrealistic! I'm trying to build this part-time (whilst working a full-time job) and I find the time slipping away so fast! But then I'm not writing everything from scratch: I'm using Larbin for the crawling (plus extra features) and then Lucene (plus extra packages) for indexing etc. So my current estimate is another year of (part-time) work.
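
For what it's worth, the Lucene side of it boils down to roughly this pattern (heavily simplified, and the exact class names depend on which Lucene release you use):

Code:

// Simplified shape of the indexing step: take a crawled page (URL plus
// extracted text) and add it to a Lucene index on disk. Class names may
// differ between Lucene versions.
import java.nio.file.Paths;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StringField;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

public class PageIndexer {
    public static void main(String[] args) throws Exception {
        Directory dir = FSDirectory.open(Paths.get("index"));
        IndexWriterConfig config = new IndexWriterConfig(new StandardAnalyzer());
        try (IndexWriter writer = new IndexWriter(dir, config)) {
            // in the real pipeline this runs for every page the crawler hands over
            addPage(writer, "http://www.example.com/", "extracted page text goes here");
        }
    }

    static void addPage(IndexWriter writer, String url, String text) throws Exception {
        Document doc = new Document();
        doc.add(new StringField("url", url, Field.Store.YES));   // stored as-is, not tokenised
        doc.add(new TextField("content", text, Field.Store.NO)); // tokenised for searching
        writer.addDocument(doc);
    }
}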

Of course, this doesn't really take into account what would happen if the initial prototype doesn't meet expectations and requires further painstaking research... Project management isn't one of my strong points.

Simon
old_expat
Newbie


Joined: 12 Apr 2006
Posts: 9

Posted: Sat Apr 22, 2006 10:08 am

Quote (Fischerlaender, Fri Jul 16, 2004 9:18 am):
[full post quoted above - the Celeron/Perl crawler setup, the DNS cache figures and the three crawl speed-up tips]

Hello Fischerlaender - your crawler sounds like it was very efficient. Would a crawler like the one you describe above also be able to focus on specific topics at the same time it did the parsing and link extraction?

Are you still using that crawler?

Would it somehow be available to a poor startup? :)

What about indexing and search - how did you handle those?

BTW, if I am being rude by asking these questions in this manner, my apologies.