Block Bad Search Engine Spiders and Save Money

Search engine spiders are valuable when they bring free audience to a site through their organic search results. But some search spiders are malicious and come at a cost.
They are malicious when they come from a product that doesn’t provide any audience. They are even more malicious when they pound a site to the point where they become a drain on server resources.
They are terrible when they pour through a site, consume vast amounts of bandwidth, server RAM and server CPU and come back again for more, ever hungry.
Obviously every site wants to attract the most audience possible from Google, Bing and other good search engines.
But sometimes blocking search engines is a good thing.
More Reasons to Block Search Engines
Some search engine robots and spiders represent companies that mine content for a paying audience that uses a proprietary product.
They may scrape the content to use in their product or use it for some other means.
As someone who spends a great deal of time with website analytics, I can promise that malicious bots and spiders don’t bring traffic to a website. They simply consume bandwidth and server capacity, which raises site hosting costs in addition to potentially dragging down server response times.
The truly bad ones will slow the site, which worsens the user experience for legitimate visitors. In time, they may also impact search rankings.
Some search engine spiders are legitimate but don’t offer much benefit. Examples include Yandex in Russia and Baidu in China, which for many sites bring too little audience, or none at all, to justify the impact their crawlers have on a site.
How to Identify Spiders with cPanel
Many sites have access in their hosting accounts to a product called cPanel, which is a popular and commercially available hosting control panel.
Three cPanel features help identify spiders: the Latest Visitors report, the Awstats reports and the Raw Access Logs.
The first two are the easiest for less technical website managers to use. They get the job done, so I’ll focus on them.
Click on the Latest Visitors report to see the IP addresses of anything visiting the site recently and also the User Agent.
Click on User Agent at the top of the column to order the user agents by type. Yandex may look like this:
Mozilla/5.0 (compatible; YandexBot/3.0; +http://yandex.com/bots)
The key piece of information is YandexBot/3.0
Likewise, note the IP address in the far left column.
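For anyone who prefers to pull that key piece out of a long list of user agents automatically, here is a minimal Python sketch. The function name bot_token and the rule of thumb it uses (look for a token containing "bot", "spider" or "crawl") are my own assumptions for illustration, not anything cPanel provides, and some spiders use names that this heuristic will miss.

```python
import re

# Assumed heuristic: a bot-like product token contains "bot", "spider" or
# "crawl", optionally followed by "/version" (for example YandexBot/3.0).
BOT_TOKEN = re.compile(
    r"([A-Za-z0-9_.-]*(?:bot|spider|crawl)[A-Za-z0-9_.-]*)(?:/([\w.]+))?",
    re.IGNORECASE,
)

def bot_token(user_agent):
    """Return (name, version) for the first bot-like token, or None."""
    match = BOT_TOKEN.search(user_agent)
    if match is None:
        return None
    return match.group(1), match.group(2)

ua = "Mozilla/5.0 (compatible; YandexBot/3.0; +http://yandex.com/bots)"
print(bot_token(ua))  # ('YandexBot', '3.0')
```

An ordinary browser user agent contains none of those words, so the function returns None for it.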
How to Identify Bad Robots with Awstats
Click on the Awstats icon and then into the most recent report. On the far left side, click on the Hosts report. Note the amount of bandwidth being consumed by IP addresses at the top of the list.
Some of them are quite legitimate, like Google, while others should be blocked.
To figure out the identity of those IP addresses, go to a search engine and search for IP lookup. Enter the address in the box on one of the resulting sites to see the location.
In my case, most of my client sites are local, so it’s hard to believe that the IP address located in Norway has any good reason to be eating up so much bandwidth.
When those addresses are blocked, analytics show that the site does not suffer from any audience loss at all.
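More technical site managers can get a similar bandwidth-by-address view from the Raw Access Logs. The sketch below assumes the log uses the common Apache Combined Log Format. The function name bandwidth_by_ip and the sample lines (which use reserved documentation IP addresses) are made up for illustration.

```python
import re
from collections import Counter

# Combined Log Format starts with:
# IP identity user [time] "request" status bytes ...
LOG_LINE = re.compile(r'^(\S+) \S+ \S+ \[[^\]]+\] "[^"]*" \d{3} (\d+|-)')

def bandwidth_by_ip(lines):
    """Sum the response bytes served to each IP address."""
    totals = Counter()
    for line in lines:
        match = LOG_LINE.match(line)
        if match is None:
            continue  # skip lines that don't fit the assumed format
        ip, size = match.groups()
        totals[ip] += 0 if size == "-" else int(size)  # "-" means no body sent
    return totals

# Hypothetical sample lines, not taken from a real log.
sample = [
    '203.0.113.5 - - [10/Oct/2023:13:55:36 +0000] "GET / HTTP/1.1" 200 5120 "-" "Mozilla/5.0 (compatible; YandexBot/3.0; +http://yandex.com/bots)"',
    '203.0.113.5 - - [10/Oct/2023:13:55:37 +0000] "GET /about HTTP/1.1" 200 2048 "-" "Mozilla/5.0 (compatible; YandexBot/3.0; +http://yandex.com/bots)"',
    '198.51.100.7 - - [10/Oct/2023:13:56:01 +0000] "GET / HTTP/1.1" 304 - "-" "Mozilla/5.0"',
]

# Heaviest consumers first, like the top of the Awstats Hosts report.
for ip, total in bandwidth_by_ip(sample).most_common():
    print(ip, total)
```

To run it against a real log, pass an open file instead of the sample list, since the function only needs something that yields lines.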
How to Block Bad Spiders
One easy way to block bad search engine spiders is with an .htaccess file, which Apache web servers read. It is a plain text file that goes into the root directory of the site. Note the dot at the beginning of htaccess. Please confirm this method with your hosting provider.
To block user agents such as Baidu and Yandex, put this in the file:
# Block user agents
RewriteEngine On
# Patterns are unanchored, so "Baiduspider" also matches Baiduspider/2.0 and later versions
RewriteCond %{HTTP_USER_AGENT} Baiduspider [NC,OR]
RewriteCond %{HTTP_USER_AGENT} YandexBot [NC]
# Return 403 Forbidden for every URL; the "-" must be a plain hyphen
RewriteRule ^ - [F,L]
Add other user agents like the ones above, but do so carefully because a mistake in the code could cause a server error.
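One common mistake is leaving an [OR] flag on the last condition or dropping it from an earlier one. A small Python sketch can generate the lines from a list so the flags always come out right. The function name rewrite_block is my own, and the output assumes mod_rewrite is enabled on the server, so treat it as a starting point rather than a guaranteed configuration.

```python
import re

def rewrite_block(user_agents):
    """Build .htaccess lines that answer 403 Forbidden to the given user agents."""
    agents = [a.strip() for a in user_agents if a.strip()]
    if not agents:
        raise ValueError("at least one user agent is required")
    lines = ["# Block user agents", "RewriteEngine On"]
    for i, agent in enumerate(agents):
        # Every condition except the last is joined with OR.
        flags = "[NC,OR]" if i < len(agents) - 1 else "[NC]"
        # re.escape keeps characters such as "." or spaces from being read as regex syntax.
        lines.append(f"RewriteCond %{{HTTP_USER_AGENT}} {re.escape(agent)} {flags}")
    # "^" matches every request; "-" (a plain hyphen) means no substitution.
    lines.append("RewriteRule ^ - [F,L]")
    return "\n".join(lines)

print(rewrite_block(["Baiduspider", "YandexBot"]))
```

Paste the printed lines into the .htaccess file, then load a page on the site right away to confirm nothing broke.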
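To block IP addresses, use the following. It is Apache 2.2 syntax, which Apache 2.4 still accepts when the mod_access_compat module is enabled: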
# Block IP addresses
order allow,deny
deny from 38.99.96.162
deny from 50.22.237.250
deny from 89.145.95.2
allow from all
But whatever you do, don’t block Google, Bing and other worthwhile search engine spiders.
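A mistyped IP address in the deny list is another easy way to break the file. As a rough safeguard, this Python sketch checks each address with the standard ipaddress module before writing the deny lines. The function name deny_block is hypothetical, and the output simply mirrors the order/deny/allow layout shown above.

```python
import ipaddress

def deny_block(addresses):
    """Build the IP-blocking lines, rejecting any address that is not valid."""
    lines = ["# Block IP addresses", "order allow,deny"]
    for raw in addresses:
        # ip_address raises ValueError on a typo such as "38.99.96" or "38.99.96.1622".
        ip = ipaddress.ip_address(raw.strip())
        lines.append(f"deny from {ip}")
    lines.append("allow from all")
    return "\n".join(lines)

print(deny_block(["38.99.96.162", "50.22.237.250", "89.145.95.2"]))
```

The check only confirms that each entry is a well-formed address. It cannot tell whether the address belongs to a spider worth keeping, so look it up first.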