23.05.2018
Semalt Provides Tips On How To Deal With Bots, Spiders And Crawlers
Apart from creating search engine friendly URLs, the .htaccess file lets webmasters block specific bots from accessing their website. One way to block these robots is through the robots.txt file. However, Ross Barber, the Semalt Customer Success Manager, states that he has seen some crawlers ignore this request. One of the best ways to stop them from indexing your content is to use the .htaccess file.
What are these bots?
They are a type of software used by search engines to detect new content on the internet for indexing purposes.
They perform the following tasks:
- Visit the web pages that you've linked to
- Check your HTML code for errors
- Save what web pages you're linking to and see what web pages link to your content
- Index your content

However, some bots are malicious and search your site for email addresses and forms, which are then used to send you unwanted messages or spam. Others even look for security loopholes in your code.
https://rankexperience.com/articles/article1574.html
What is needed to block web crawlers?
Before using the .htaccess file, you need to check the following things:
1. Your site must be running on an Apache server. Nowadays, even web hosting companies that are only half decent at their job give you access to the required file.
2. You should have access to the raw server logs of your website so that you can find out which bots have been visiting your web pages.

Note that there is no way to block all harmful bots unless you block every bot, including those you consider helpful. New bots come up every day, and older ones are modified. The most efficient approach is to secure your code and make it hard for bots to spam you.
Identifying bots
Bots can be identified either by their IP address or by their "User Agent String," which they send in the HTTP headers. For instance, Google uses "Googlebot." This list of 302 bots may help if you already have the name of the bot that you would like to keep away using .htaccess.

Another way is to download all the log files from the server and open them in a text editor. Their location on the server depends on your server's configuration. If you cannot find them, seek assistance from your web host. If you know which page was visited, or the time of the visit, it is easier to track down an unwanted bot, because you can search the log file with these parameters. Once you've noted which bots you need to block, you can include them in the .htaccess file. Please note that blocking a bot isn't always enough to stop it. It may come back with a new IP address or name.
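Searching the logs by hand gets tedious on a busy site, so a short script can do the counting. The following is a minimal Python sketch, assuming the server writes its access log in the common "combined" format, where the user agent is the last quoted field on each line. The sample lines and the name count_agents are made up for illustration; your log format may differ.

```python
import re
from collections import Counter

# Assumes the "combined" log format: the client IP comes first and the
# user agent is the last double-quoted field on the line.
LOG_LINE = re.compile(r'^(?P<ip>\S+) .* "(?P<agent>[^"]*)"\s*$')


def count_agents(lines):
    """Count requests per (user agent, IP) pair in an iterable of log lines."""
    counts = Counter()
    for line in lines:
        match = LOG_LINE.match(line)
        if match:  # lines that do not fit the assumed format are skipped
            counts[(match.group("agent"), match.group("ip"))] += 1
    return counts


# Made-up sample data; in practice, pass an open log file instead.
sample = [
    '197.0.0.1 - - [23/May/2018:10:00:00 +0000] "GET / HTTP/1.1" 200 512 "-" "BotUserAgent/1.0"',
    '197.0.0.1 - - [23/May/2018:10:00:01 +0000] "GET /contact HTTP/1.1" 200 734 "-" "BotUserAgent/1.0"',
    '192.0.2.10 - - [23/May/2018:10:05:00 +0000] "GET / HTTP/1.1" 200 512 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"',
]

# Print the busiest visitors first.
for (agent, ip), hits in count_agents(sample).most_common():
    print(hits, ip, agent)
```

Visitors that appear near the top of the list with an unfamiliar user agent are the ones worth checking before you add them to the .htaccess file.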
How to block them
Download a copy of the .htaccess file. Make backups if required.
Method 1: Blocking by IP
This code snippet blocks the bot using the IP address 197.0.0.1. Note that there must be no space after the comma in the first line.

    Order Deny,Allow
    Deny from 197.0.0.1
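The first line sets the order in which the server evaluates the rules: Deny directives are processed first, so requests matching the patterns you've specified are blocked and all others are allowed. The second line tells the server to refuse requests from that address with a 403 Forbidden page.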
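If the list of addresses grows, typing the Deny lines by hand becomes error-prone. Here is a minimal Python sketch that builds the same kind of block from a list of addresses, assuming the Order/Deny syntax shown above (in Apache 2.4 these directives are provided by mod_access_compat). The function name deny_block and the range 203.0.113.0/24 are made up for illustration.

```python
import ipaddress


def deny_block(addresses):
    """Build an Order/Deny block that refuses each address or CIDR range."""
    lines = ["Order Deny,Allow"]
    for addr in addresses:
        # ip_network() raises ValueError on malformed input, which catches
        # typos before they reach the server configuration.
        network = ipaddress.ip_network(addr, strict=False)
        if network.num_addresses == 1:
            # A single host is written without the prefix length.
            lines.append(f"Deny from {network.network_address}")
        else:
            lines.append(f"Deny from {network}")
    return "\n".join(lines)


print(deny_block(["197.0.0.1", "203.0.113.0/24"]))
```

Paste the printed lines into your .htaccess file after checking them against your own list.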
Method 2: Blocking by User agents
The easiest way is to use Apache's rewrite engine:

    RewriteEngine On
    RewriteCond %{HTTP_USER_AGENT} BotUserAgent
    RewriteRule . - [F,L]

The first line switches the rewrite engine on. Line two is the condition to which the rule applies: the user agent string must contain "BotUserAgent". The "F" in line three tells the server to return a 403 Forbidden, while the "L" means this is the last rule. As with the first method, the flags must be written without a space after the comma.

You will then upload the .htaccess file to your server and overwrite the existing one. With time, you will need to update the bot's IP address. In case you make an error, just upload the backup that you made.
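When several bots need blocking, the conditions can be chained with the OR flag, and NC makes the match case-insensitive. The following Python sketch generates such a block from a list of names. It is one possible approach rather than the article's own method; the function name rewrite_block and the bot name "EvilScraper" are hypothetical.

```python
import re


def rewrite_block(bot_names):
    """Build mod_rewrite rules that return 403 for any listed user agent."""
    if not bot_names:
        return ""
    lines = ["RewriteEngine On"]
    last = len(bot_names) - 1
    for i, name in enumerate(bot_names):
        # Escape regex metacharacters so the name is matched literally.
        pattern = re.escape(name)
        # NC ignores case; OR chains every condition except the last one.
        flags = "[NC]" if i == last else "[NC,OR]"
        lines.append(f"RewriteCond %{{HTTP_USER_AGENT}} {pattern} {flags}")
    # Same rule as in the snippet above: F = 403 Forbidden, L = last rule.
    lines.append("RewriteRule . - [F,L]")
    return "\n".join(lines)


print(rewrite_block(["BotUserAgent", "EvilScraper"]))
```

Keeping the names in one list makes it easier to regenerate the block whenever a bot returns under a new name.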