A web crawler (also known as a web spider or ant) is a program which browses the World Wide Web in a methodical, automated manner. Web crawlers are mainly used to create a copy of all the visited pages for later processing by a search engine, that will index the downloaded pages to provide fast searches.
From Wikipedia, the free encyclopedia.
Why is OpenWebSpider grabbing webpages from my website?
OpenWebSpider is the agent software of oslookup.org which crawls web sites all over the world, in order to build a vertical search engine, in our case an Open Source Software search engine.
I do not want my website to be crawled, what should I do?
You can put a file named robots.txt in the top-level directory of your web server. It is the standard way to exclude robot programs from retrieving all or part of your web site. For a detailed description of robots.txt, please refer to: http://www.robotstxt.org/wc/norobots.html
Why does OpenWebSpider try to access some non-existing URLs from my website?
Other pages on the web may contain stale links that point to URLs that no longer exist on your web site. OpenWebSpider crawls the web by following the links in the pages it has gathered, and so it may request such non-existing URLs.
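To illustrate how following links can lead a crawler to a URL that no longer exists, here is a minimal Python sketch of link extraction. It is not OpenWebSpider's actual code; the class LinkCollector, the function extract_links, and the example pages are hypothetical names chosen for this illustration.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collects absolute URLs from the href attributes of <a> tags."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # Resolve relative links against the page they were found on.
                self.links.append(urljoin(self.base_url, value))


def extract_links(base_url, html):
    """Return every link found in html, in document order."""
    collector = LinkCollector(base_url)
    collector.feed(html)
    return collector.links


# A third-party page that still links to a page removed from www.mydomain.com.
page = (
    '<a href="http://www.mydomain.com/old-page.html">old</a> '
    '<a href="/about.html">about</a>'
)
links = extract_links("http://www.example.org/links.html", page)
```

A crawler built this way has no means of knowing that old-page.html was removed until it requests the URL, which is why such requests show up in your server log.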
Why doesn't OpenWebSpider obey my robots.txt?
We always suggest verifying that your syntax is correct against the
robots exclusion standard.
A common source of problems is that the robots.txt file isn't placed in the top
directory of the server (e.g., www.mydomain.com/robots.txt); placing the file in a
subdirectory won't have any effect.
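As a rough illustration of why the location matters, the sketch below derives the only robots.txt URL a crawler would look at for a given page. The function name robots_txt_url is a hypothetical helper, and the page URL is made up for the example.

```python
from urllib.parse import urlsplit, urlunsplit


def robots_txt_url(page_url):
    """Return the top-level robots.txt URL for the host serving page_url."""
    parts = urlsplit(page_url)
    # Keep the scheme and host; discard the page's path, query and fragment.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


location = robots_txt_url("http://www.mydomain.com/docs/manual/index.html?page=2")
```

Whatever directory the page lives in, the result is always the file at the top of the server, so a robots.txt stored under /docs/ would never be requested.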
OpenWebSpider obeys the longest (that is, the most specific) applicable rule.
This more intuitive practice matches what people actually do, and what they expect us to do.
For example, consider the following robots.txt file:
User-Agent: *
Disallow: /cgi-bin
It's obvious that the webmaster's intent here is to allow robots to crawl everything
except the /cgi-bin directory. Consequently, that's what we do.
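The longest-rule behaviour can be sketched in a few lines of Python. This is an illustration under assumptions, not OpenWebSpider's implementation: the function is_allowed is hypothetical, rules are given as (directive, path prefix) pairs, the "allow" directive is a common extension rather than part of the original standard, and letting "allow" win a tie between equally long rules is our own choice.

```python
def is_allowed(path, rules):
    """Decide whether path may be crawled using the longest matching rule.

    rules is a list of (directive, prefix) pairs, where directive is
    "allow" or "disallow". A path that matches no rule is allowed.
    """
    best = None
    for directive, prefix in rules:
        # An empty value excludes nothing, so it never matches.
        if not prefix or not path.startswith(prefix):
            continue
        longer = best is None or len(prefix) > len(best[1])
        tie_for_allow = (
            best is not None
            and len(prefix) == len(best[1])
            and directive == "allow"
        )
        if longer or tie_for_allow:
            best = (directive, prefix)
    return best is None or best[0] == "allow"


# The robots.txt example above, written as rule pairs.
rules = [("disallow", "/cgi-bin")]
home_ok = is_allowed("/index.html", rules)
cgi_ok = is_allowed("/cgi-bin/search", rules)

# A more specific rule overrides a shorter, more general one.
mixed = [("disallow", "/"), ("allow", "/public/")]
public_ok = is_allowed("/public/readme.html", mixed)
private_ok = is_allowed("/private/notes.html", mixed)
```

With the example file, everything outside /cgi-bin stays crawlable because no rule matches those paths.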
To prevent your site from being crawled by OpenWebSpider, add the following lines to your robots.txt:
User-agent: OpenWebSpider
Disallow: /
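If you want to check these lines before publishing them, one option is Python's standard urllib.robotparser module, as in the sketch below. Note that this is Python's parser, not OpenWebSpider's, so it only confirms that the file reads the way you intend; the page URL and the second agent name are made up for the example.

```python
from urllib.robotparser import RobotFileParser

robots_txt = """\
User-agent: OpenWebSpider
Disallow: /
"""

parser = RobotFileParser()
# parse() accepts the file's lines, so no network access is needed.
parser.parse(robots_txt.splitlines())

# OpenWebSpider is excluded from the whole site ...
spider_ok = parser.can_fetch("OpenWebSpider", "http://www.mydomain.com/index.html")
# ... while agents not named in the file are unaffected.
other_ok = parser.can_fetch("SomeOtherBot", "http://www.mydomain.com/index.html")
```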
OpenWebSpider tries to follow robots.txt by
filtering out URLs that match the rules in its robot exclusion database.
Once OpenWebSpider has fetched your robots.txt and learned its rules,
it will no longer grab the web pages those rules exclude.
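The filtering step might look roughly like the following sketch. It assumes the exclusion database is a simple dictionary mapping each host name to its parsed rules; the names filter_urls and exclusion_db are hypothetical, and keeping URLs for hosts with no stored rules is a simplification (a real crawler would fetch that host's robots.txt first).

```python
from urllib.parse import urlsplit
from urllib.robotparser import RobotFileParser


def filter_urls(urls, exclusion_db, user_agent="OpenWebSpider"):
    """Drop every URL that the stored rules for its host exclude."""
    kept = []
    for url in urls:
        rules = exclusion_db.get(urlsplit(url).netloc)
        # Assumption: a host with no stored rules is not filtered here.
        if rules is None or rules.can_fetch(user_agent, url):
            kept.append(url)
    return kept


# Rules learned earlier from http://www.mydomain.com/robots.txt.
learned = RobotFileParser()
learned.parse(["User-agent: *", "Disallow: /cgi-bin"])
exclusion_db = {"www.mydomain.com": learned}

queue = [
    "http://www.mydomain.com/index.html",
    "http://www.mydomain.com/cgi-bin/search",
    "http://www.example.org/page.html",
]
to_crawl = filter_urls(queue, exclusion_db)
```

Because the rules are stored per host, a change to your robots.txt takes effect only after the crawler has fetched the file again and updated its database.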
If you still have a question, please email