Spider Trap combines several methods to block a client, which makes it considerably more effective than any single method would be on its own. That said, Spider Trap cannot guarantee 100 % protection against bad spiders – but hardly any software can.
1. Build a trap
Most search engines follow the standard (the Robots Exclusion Protocol) and honour robots.txt, i.e. they do not read content that robots.txt disallows.
Google may still list such a link, but because no cache and no description are stored for it, the link does not show up in the search results. The article index != spidered discusses this topic in more detail.
This behaviour of search engines is the basis for our trap: the robots.txt lists a directory that must not be spidered.
This is done with the following lines in robots.txt:
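The exact lines are not reproduced in the source; a minimal robots.txt that disallows a directory named spider-trap, as described above, would look like this:

```text
User-agent: *
Disallow: /spider-trap/
```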
Google prefers to be addressed explicitly, as the following article describes:
Robots Experiment (only in German)
Now no spider should request pages in the directory spider-trap. To test this, we place a link that points into exactly this directory. If a robot follows this link, it violates the rules of the robots.txt: well-behaved web crawlers know the link but are not allowed to visit it. As soon as a crawler requests the file spider-trap/index.php, it is blocked from the website.
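The trap link mentioned above could, for example, be placed invisibly in a page. This is a sketch; the one-pixel image name is made up:

```html
<!-- invisible trap link: human visitors will not click it,
     but a rule-breaking bot will follow it and get banned -->
<a href="/spider-trap/index.php"><img src="pixel.gif" alt="" width="1" height="1"></a>
```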
2. Ban a robot
There are several ways to impose a ban.
2.1. Ban via user agent
The user agent is the name a client identifies itself with. Every browser or spider sends a user-agent string as part of the HTTP protocol.
Here are some typical user-agent strings:
Mozilla/5.0 (Windows; U; Windows NT 5.1; de; rv:1.8) Gecko/20051111 Firefox/1.5 (browser Firefox)
Googlebot/2.1 (+http://www.googlebot.com/bot.html) (Google)
Spider Trap works with two flat files.
The first one is a whitelist: it lists user agents that are always granted access to your website. These robots are never banned, even if they fall into the spider trap. The danger here is that the user-agent string can be chosen freely by the client – for Firefox, for example, there is an extension to change the user agent. Analysing the user agent alone is therefore not reliable.
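The source does not give the whitelist's file name or format. Assuming one user-agent substring per line (the file name whitelist.txt and the entries are only examples), it might look like this:

```text
# user agents that must never be banned (example entries)
Googlebot
Slurp
msnbot
```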
When a spider falls into the spider trap, its user agent is saved to the second file: it collects the user agents of the bad bots. Entries can of course be removed from this file again to lift the corresponding bans.
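This second file, blacklist.txt (the name is mentioned in section 2.3), would then simply accumulate the offending user-agent strings; the entries below are hypothetical:

```text
# user agents caught by the trap (hypothetical entries)
EmailCollector/1.0
WebCopier v4.2
```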
2.2. Ban via .htaccess
An IP address identifies a client uniquely, and a web crawler usually visits your site from the same IP. To ban by IP, the file .htaccess must be placed in the root directory of your web space. This file needs write access so that client IPs can be added or removed. The advantage of an IP ban is that no script has to run to check the IP, so no scripts need to be invoked on every request. A further advantage is that the ban applies to the whole domain: images are protected as well as texts. To avoid banning all users behind a shared IP, the IP check must be combined with the user-agent check. If somebody connects via t-online, for example, the whole IP would otherwise be banned – and that is not what Spider Trap intends. For this reason there are also ways to lift the ban on a client.
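A ban in the root .htaccess could be sketched as follows (Apache 2.2 access-control syntax; the IP addresses are placeholders):

```apacheconf
# deny access for banned client IPs (example addresses)
Order Allow,Deny
Allow from all
Deny from 192.0.2.17
Deny from 203.0.113.42
```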
2.3. Removing a ban
When a client is banned, access is denied and the HTTP status 403 is sent. In the .htaccess it is possible to define custom error pages for certain errors. We use this for "access denied" by calling the script forbid.php in the /spider-trap/ directory; an .htaccess file inside the spider-trap directory keeps access to forbid.php open. This script displays an error message with a short explanation. To lift the ban, the value of a captcha (a number image) must be entered into a form. If the captcha is entered correctly, the IP (in .htaccess) and the user agent (in blacklist.txt) are removed from the blacklist and access is granted again. This removal mechanism exists for ordinary users who simply want to try every link on a page. In addition, an automatic removal of bans is planned.
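Routing the 403 error to forbid.php can be done with an ErrorDocument directive in the .htaccess; this is a sketch, with the path following the directory layout described above:

```apacheconf
# show the unban form (captcha) instead of the plain 403 page
ErrorDocument 403 /spider-trap/forbid.php
```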
2.4. Warnings
In the configuration file you can additionally enter an e-mail address, so that you are informed about changes to the configuration file. This makes it easier to monitor and maintain your installation.