Categories: Development

Block Bad Search Engine Spiders and Save Money

Search engine spiders are valuable when they bring free audience to a site through their organic search results. But some search spiders are malicious and come at a cost.

They are malicious when they come from a product that provides no audience in return. They are even more malicious when they pound a site so hard that they become a drain on server resources.

The worst of them plow through a site, consume vast amounts of bandwidth, server RAM and server CPU, and come back again for more, ever hungry.

Obviously every site wants to attract the most audience possible from Google, Bing and other good search engines.

But sometimes blocking search engines is a good thing.

More Reasons to Block Search Engines

Some search engine robots and spiders belong to companies that mine content for a paying audience that uses a proprietary product.

They may scrape the content to use in their product or use it for some other means.

As someone who spends a great deal of time with website analytics, I can promise that malicious bots and spiders don’t bring traffic to a website. They simply consume bandwidth and server capacity, which raises site hosting costs in addition to potentially dragging down server response times.

The truly bad ones will slow the site, which worsens the user experience for legitimate visitors. In time, they may also impact search rankings.

Some search engine spiders are legitimate but don’t offer much benefit. Examples include Yandex in Russia and Baidu in China, which bring little or no audience to justify the impact their crawlers have on a site.

How to Identify Spiders with cPanel

Many sites have access in their hosting accounts to a product called cPanel, which is a popular and commercially available hosting control panel.

cPanel has three features that are helpful in identifying spiders: the Latest Visitors report, the Awstats audience reports and the Raw Access Logs.

I find the first two the easiest to use for less technical website managers. They get the job done, so I’ll focus on them.

Click on the Latest Visitors report to see the IP addresses of anything visiting the site recently and also the User Agent.

Click on User Agent at the top of the column to order the user agents by type. Yandex may look like this:

Mozilla/5.0 (compatible; YandexBot/3.0; +http://yandex.com/bots)

The key piece of information is YandexBot/3.0.

Likewise, note the IP address in the far left column.
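For site managers comfortable with a command line, the same identification can be done directly against the Raw Access Logs. Here is a minimal sketch; the two log entries below are invented for illustration, so point the grep at your own downloaded log file instead of sample.log:

```shell
# Create a small sample in the common Apache access-log format
# (hypothetical entries; use your real Raw Access Log in practice)
cat > sample.log <<'EOF'
5.255.253.10 - - [01/Jan/2024:00:00:01 +0000] "GET / HTTP/1.1" 200 5120 "-" "Mozilla/5.0 (compatible; YandexBot/3.0; +http://yandex.com/bots)"
180.76.15.20 - - [01/Jan/2024:00:00:02 +0000] "GET /page HTTP/1.1" 200 4096 "-" "Mozilla/5.0 (compatible; Baiduspider/2.0; +http://www.baidu.com/search/spider.html)"
EOF

# Pull out known bot names with versions and count their requests
grep -oE '(YandexBot|Baiduspider)/[0-9.]+' sample.log | sort | uniq -c
```

On a real log, sorting the output by count quickly shows which crawlers are hitting the site hardest.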

How to Identify Bad Robots with Awstats

Click on the Awstats icon and then into the most recent report. On the far left side, click on the Hosts report. Note the amount of bandwidth being consumed by IP addresses at the top of the list.

Some of them are quite legitimate, like Google, while others should be blocked.

To figure out the identity of those IP addresses, go to a search engine and search for IP lookup. Enter the address in the box on one of the resulting sites to see the location.

In my case, most of my client sites are local, so it’s hard to believe that the IP address located in Norway has any good reason to be eating up so much bandwidth.

When those addresses are blocked, analytics show that the site does not suffer from any audience loss at all.

How to Block Bad Spiders

One easy way to block bad search engine spiders is with an .htaccess file, a plain text file that goes into the root directory of the site. Note the dot at the beginning of htaccess. Please confirm this method with your hosting provider.

To block user agents such as Baidu and Yandex, put this in the file:

# Block user agents
RewriteEngine On
# "Baiduspider" matches every version, including Baiduspider/2.0 and /3.0
RewriteCond %{HTTP_USER_AGENT} Baiduspider [NC,OR]
RewriteCond %{HTTP_USER_AGENT} YandexBot [NC]
RewriteRule . - [F,L]

Add other user agents like the ones above, but do so carefully because a mistake in the code could cause a server error.
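For crawlers that obey the robots exclusion standard, which both YandexBot and Baiduspider document that they do, a robots.txt rule is a gentler first step. This is an alternative to the .htaccess method, not part of it, and it only works for bots that choose to honor it:

```text
# robots.txt — place in the site's root directory
User-agent: Baiduspider
Disallow: /

User-agent: YandexBot
Disallow: /
```

Truly malicious bots ignore robots.txt, which is why the .htaccess block above remains the reliable option.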

To block IP addresses, use the following:

# Block IP addresses
order allow,deny
deny from 38.99.96.162
deny from 50.22.237.250
deny from 89.145.95.2
allow from all
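Note that the Order/Allow/Deny directives above are the older Apache 2.2 syntax. On servers running Apache 2.4 or newer they are deprecated in favor of Require, so a rough equivalent would look like the following sketch; check your Apache version with your host before using it:

```apache
# Block IP addresses (Apache 2.4+ syntax)
<RequireAll>
    Require all granted
    Require not ip 38.99.96.162
    Require not ip 50.22.237.250
    Require not ip 89.145.95.2
</RequireAll>
```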

But whatever you do, don’t block Google, Bing and other worthwhile search engine spiders.

Scott S. Bateman

Tags: SEO
