sirdf.com
Search & Information Retrieval Development Forum

Introduction of DataGuy

 
Forum: Running a search engine
DataGuy (Newbie, Joined: 17 Nov 2004, Posts: 3)
Posted: Wed Nov 17, 2004 6:46 pm

Hello,

I was just referred here from WebmasterWorld. I would love to be able to converse openly with other SE operators so I'm glad to be here...

I operate a string of SEs, my most popular site being SearchSight.com. My sites run mostly on Windows-based machines, so I'm in an uphill battle from the start.

I've been doing this for 6 years, and the SE business has been good to me. I have always considered this somewhat of a hobby, though I do have employees and, at my wife's prompting, I try to run it as a real business.

I'm fascinated with data aggregation, and I hope to be able to contribute something to the Internet on a worldwide scale. I just haven't been able to do it yet!

Which SE sites are represented here?
runarb (Site Admin, Joined: 29 Oct 2006, Posts: 4)
Posted: Thu Nov 18, 2004 10:51 am

I run www.boitho.com, and Fischerlaender runs www.neomo.de . Both are algorithmic, crawler-based search engines built on our own file structures (not an SQL database), written in Perl and C.

What languages are you developing in?


(Please don't judge me based on the search engine you find at www.boitho.com ; it is a 1.5-year-old demo we made to show potential investors how thumbnail pictures can be used in search engines.)
_________________
CTO @ Searchdaimon company search.
DataGuy (Newbie, Joined: 17 Nov 2004, Posts: 3)
Posted: Thu Nov 18, 2004 3:37 pm

Quote:
What languages are you developing in?


Well, please don't judge me based on the languages that we use!

Our crawler uses Visual Basic, mostly because of the thumbnail image retrieval; speed has not been something we've been concerned with. You just can't get very fast when you need to download an entire web page instead of just the source code.

Our database runs on SQL Server... again, probably the worst choice for running a search engine. We are testing an index manager from surfinity that makes searching on SQL Server much more efficient, and we hope to have Microsoft's full-text search replaced with this product within the next few days.

Ease of development has been the primary concern up to this point. Since I am in charge of programming, marketing, and everything in between, I don't have the time to spend developing the fastest system.

I do have some pretty good marketing systems in place right now and I'd be interested in working with someone to create a new search engine based on a new platform. Anyone interested?
scolls (Newbie, Joined: 08 Apr 2006, Posts: 8)
Posted: Sat Apr 08, 2006 11:53 pm

Well, a belated reply to this post, but what the heck! :rolleyes:

Anyway, I've ended up writing the system for searchserf.market-uk.com.
I know... it's on a subdomain. :rolleyes: I'll get a domain name for it shortly...

See, I actually just got side-tracked while writing myself a little app for something else, and somehow ended up with this thing!

Anyhow, I've been finding it really fascinating, so I've stuck with it and am doing it as a hobby, with a view to perhaps creating a job for myself with it so I can quit chasing the end of the month and start chasing my dreams, man!!! Laughing

I'm actually quite shocked I even got this far... having zero idea about search engines other than some basic SEO, etc.

So basically it's four pieces of software I wrote in Delphi. One does the crawling, with seeds added from another piece that handles submissions (as well as sending out confirmation emails); another handles the indexing on keywords; and yet another monitors that everything is up and running and emails me daily reports.

It's running on one of those ol' wind-up computers, so it's sure gonna be interesting to see it run on its own server one day! And, of course, a whole lot bigger connection would work wonders! But... one step at a time I suppose. I'm learning as I go, sort of really just dreaming up how it should work - it's really such a trip doing it - I wish I had a job that was this much fun!!! :blink:
_________________
WebWobot Search Engine: http://www.webwobot.com
MassDebation.com ~ No Ordinary Debate!: http://www.massdebation.com
runarb (Site Admin, Joined: 29 Oct 2006, Posts: 4)
Posted: Mon Apr 10, 2006 12:08 am

I have been using some Delphi too. The crawler for boitho.com, for example, is in Delphi.

How many pages have you crawled so far?

_________________
CTO @ Searchdaimon company search.
scolls (Newbie, Joined: 08 Apr 2006, Posts: 8)
Posted: Sun Apr 23, 2006 4:00 pm

Quote (runarb @ Apr 10 2006, 12:08 AM):
I have been using some Delphi too. The crawler for boitho.com, for example, is in Delphi.

How many pages have you crawled so far?

Hi Runarb,

I've had a couple of test runs... debugging can be a pain, as you know.

Considering I really just banged a bit of code together at first, I've been learning more and more from the experiment, and I keep adapting the thing as I go along.

The crawler does maybe 60,000 pages a day, or a bit more, which is not good at all, but it's doing too much of the work itself: filtering unwanted sites, parsing entire pages, and feeding itself URLs from every page it downloads. The backend is MySQL, running on the same PC as the crawler. So it's not a fast setup by any standard.

But the results are encouraging enough for me to be planning a complete rewrite based upon the things I've learned so far from its behaviour.

For example, I'd like to multi-thread it so that it can be caching pages while waiting for other pages to download. At the moment it waits for a page to download, then caches it, then gets the next URL to parse, waiting for MySQL to execute the query, and so on.
Another thing I'd like to do is have it crawl only a limited-length route from the initial seeds, e.g. a maximum of xyz links away from the seed. The thing is, I'm finding that the further away from a good seed you go, the more 404s you find! I was actually quite shocked to see how many outdated links many people have on their sites!
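
A minimal sketch of that depth limit in Perl, assuming a simple single-threaded frontier where every URL carries its hop count from its seed; the LWP fetch and the crude regex link extraction are placeholders for the crawler's real download and parse code:

Code:
#!/usr/bin/perl
# Depth-limited frontier sketch: each queued URL carries the number of
# hops from its seed, and links found past $MAX_DEPTH are not followed.
use strict;
use warnings;
use LWP::Simple qw(get);

my $MAX_DEPTH = 3;
my %seen;
my @frontier = map { [$_, 0] } @ARGV;   # seeds enter at depth 0

while (my $item = shift @frontier) {
    my ($url, $depth) = @$item;
    next if $seen{$url}++;              # skip URLs we already visited
    my $html = get($url) or next;       # skip 404s and dead hosts
    # ... cache and index $html here ...
    next if $depth >= $MAX_DEPTH;       # stop following links this far out
    push @frontier,
        map { [$_, $depth + 1] } $html =~ /href="(http[^"#]+)"/gi;
}

Run it as, e.g., perl crawl.pl http://seed-one/ http://seed-two/ with your seed URLs.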

Any suggestions?
_________________
<b><a href='http://www.webwobot.com' target='_blank'>WebWobot Search Engine</a></b><br><a href='http://www.massdebation.com' target='_blank'>MassDebation.com ~ No Ordinary Debate!</a>
runarb (Site Admin, Joined: 29 Oct 2006, Posts: 4)
Posted: Mon Apr 24, 2006 1:56 pm

Just having one thread that does the crawling serially won't be optimal.

You should look into either multi-threading, as you mention, or asynchronous I/O.


You can implement asynchronous I/O by having a large array of non-blocking sockets and sending a request through each one, one by one, without waiting for any of them to finish. Then start back at the first socket and check whether all of its data has come in. If it has, give it another page to download.

Unfortunately this is more complicated, and may be tricky to get working.

Also see http://en.wikipedia.org/wiki/Asynchronous_I/O
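
For what it's worth, a minimal sketch of that pattern in Perl, using IO::Select over non-blocking sockets. The host list is a placeholder, there is no error handling, and the batch size of 500 is the figure suggested later in this thread:

Code:
#!/usr/bin/perl
# Socket-array sketch: open a batch of non-blocking sockets, fire a
# request through each, then poll them one by one and collect whatever
# data has arrived, never waiting on any single download.
use strict;
use warnings;
use IO::Socket::INET;
use IO::Select;

my @queue   = map { [$_, '/'] } qw(example.com example.org example.net);
my $writers = IO::Select->new;   # sockets still connecting
my $readers = IO::Select->new;   # sockets we are downloading from
my (%job, %buf);

sub start_one {
    my $next = shift @queue or return;
    my $s = IO::Socket::INET->new(PeerAddr => $next->[0],
                                  PeerPort => 80, Blocking => 0) or return;
    $job{$s} = $next;
    $writers->add($s);
}

start_one() for 1 .. 500;        # fire off the first batch

while ($writers->count or $readers->count) {
    # a writable socket means the connect has completed: send the request
    for my $s ($writers->can_write(1)) {
        my ($host, $path) = @{ $job{$s} };
        print $s "GET $path HTTP/1.0\r\nHost: $host\r\n\r\n";
        $writers->remove($s);
        $readers->add($s);
        $buf{$s} = '';
    }
    # read whatever has come in on each ready socket, without blocking
    for my $s ($readers->can_read(1)) {
        if (sysread($s, my $chunk, 8192)) {
            $buf{$s} .= $chunk;  # more data arrived; keep collecting
        } else {                 # EOF: the page is complete
            $readers->remove($s);
            # ... hand $buf{$s} to the parser here ...
            close $s;
            start_one();         # reuse the free slot for the next page
        }
    }
}

The key property is that the loop never blocks on any single download: slow servers just sit in the read set while faster ones complete and free their slots.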


_________________
CTO @ Searchdaimon company search.
scolls (Newbie, Joined: 08 Apr 2006, Posts: 8)
Posted: Sat May 06, 2006 4:15 am

Thanks runarb!

How many sockets in the array would you recommend, or should I just play around and monitor the difference in results?

Also, how would you recommend feeding and finding good seeds? I am thinking of keeping a scoreboard of all links spawned from crawled pages and cutting entire chains of those that fall below a certain score (e.g. site "A" gives x links, of which y spawn y2 good links and z spawn z2 bad links (404s etc.)).
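
A rough sketch of what that scoreboard might look like in Perl; scoring per source host, the 20-sample minimum, and the 0.5 good-link cutoff are all made-up starting points rather than anything from this thread:

Code:
#!/usr/bin/perl
# Scoreboard sketch: per source host, count how its outlinks turn out,
# and stop following links from hosts whose good-link ratio is too low.
use strict;
use warnings;

my %score;   # host => { good => n, bad => n }

sub record_outcome {     # call after each fetch: was the link alive?
    my ($source_host, $ok) = @_;
    $score{$source_host}{ $ok ? 'good' : 'bad' }++;
}

sub still_trusted {      # should we keep following this host's links?
    my ($host) = @_;
    my $s      = $score{$host} or return 1;   # no data yet: trust it
    my $good   = $s->{good} || 0;
    my $total  = $good + ($s->{bad} || 0);
    return 1 if $total < 20;                  # too few samples to judge
    return $good / $total >= 0.5;             # cut chains below 50% good
}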

I'm definitely going to do a complete rewrite of the crawler, so I'm really keen to hear any ideas I can before I start. B)
_________________
WebWobot Search Engine: http://www.webwobot.com
MassDebation.com ~ No Ordinary Debate!: http://www.massdebation.com
runarb (Site Admin, Joined: 29 Oct 2006, Posts: 4)
Posted: Mon May 08, 2006 4:25 am

Try with 500 to start, and see how that performs. Then adjust until you have good CPU utilization.

For seeds, most search engines use the Open Directory RDF dump from http://rdf.dmoz.org/ . That is an XML-like file containing all the links in dmoz.
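
A small sketch of pulling seed URLs out of that dump with Perl; the <ExternalPage about="..."> element is an assumption about the dump's markup and may need adjusting to the actual file:

Code:
#!/usr/bin/perl
# Seed extraction sketch for the dmoz RDF dump: print every unique URL
# that appears as an ExternalPage "about" attribute.
use strict;
use warnings;

my %seen;
while (<>) {
    while (/<ExternalPage about="([^"]+)"/g) {
        print "$1\n" unless $seen{$1}++;
    }
}

Run it as, e.g., perl dmoz_seeds.pl content.rdf.u8 > seeds.txt.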

_________________
CTO @ Searchdaimon company search.