Promise Media

Block Search Engine Spiders and Save Money

Stored in Website Development

Search engines are great when they bring free audience to a site through organic search results. But some search spiders are simply terrible.

They are terrible because they come from products that send no audience back to the site.

They are especially terrible when they pour through a site, consume vast amounts of bandwidth and come back again for more, ever hungry.

SEO might stand for not just Search Engine Optimization but also Search Engine Opposition.

Obviously every site wants to attract the most audience possible from Google, Bing and other good search engines.

But sometimes blocking search engines is a good thing.

More Reasons to Block Search Engines

Some search engine robots and spiders represent companies that mine content for a paying audience that uses a proprietary product.

They may scrape the content for use in their product or use it for some other purpose.

As someone who spends a great deal of time with Web analytics, I can promise that they don’t bring traffic to a Web site. They simply consume bandwidth, which raises site hosting costs.

Another form is search engines like Yandex in Russia and Baidu in China, which bring little or no audience, not enough to justify the impact their crawlers have on a site.

Dozens of these crawlers can hit a site in a single day and drag down its performance. They can even crash the server if enough of them arrive at the same time and the load is too heavy.

How to Identify Search Robots with cPanel

Many sites have access in their hosting accounts to a product called cPanel, which is a commercially available hosting control panel.

cPanel has three features that are helpful in identifying spiders: the Latest Visitors report, Awstats and the Raw Access Logs.

I find the first two the easiest to use for less technical Web site managers, and they get the job done, so I’ll focus on them.

Click on the Latest Visitors report to see the IP addresses of anything visiting the site recently and also the User Agent.

Click on User Agent at the top of the column to order the user agents by type. Yandex may look like this:

Mozilla/5.0 (compatible; YandexBot/3.0; +http://yandex.com/bots)

The key piece of information is YandexBot/3.0

Likewise, note the IP address in the far left column.

How to Identify Bad Robots with Awstats

Click on the Awstats icon and then into the most recent report. On the far left side, click on the Hosts report. Note the amount of bandwidth being consumed by IP addresses at the top of the list.

Some of them are quite legitimate, like Google, while others should be blocked.

To figure out the identity of those IP addresses, go to a search engine and search for IP lookup. Enter the address in the box on one of the resulting sites to see the location.

In my case, most of my client sites are local, so it’s hard to believe that the IP address located in Norway has any good reason to be eating up so much bandwidth.

When those addresses are blocked, analytics show that the site does not suffer from any audience loss at all.

How to Block Bad Spiders

One easy way to block bad spiders is with a .htaccess file, which is a plain text file that goes into the root directory of the site. Note the dot at the beginning of htaccess.

To block user agents such as Baidu and Yandex, put this in the file:

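# Block user agents
RewriteEngine On
# A pattern matches anywhere in the User-Agent string, so "Baiduspider"
# also covers versioned names such as Baiduspider/2.0.
RewriteCond %{HTTP_USER_AGENT} Baiduspider [NC,OR]
RewriteCond %{HTTP_USER_AGENT} YandexBot [NC]
# Answer every matching request with 403 Forbidden.
# The dash is a plain hyphen, meaning "no substitution".
RewriteRule ^ - [F,L]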

Add other user agents like the ones above, but do so carefully because a mistake in the code could cause a server error.
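Two small changes can lower that risk. The sketch below assumes the site runs on Apache with .htaccess overrides enabled. It wraps the rules in an IfModule check, so they are skipped rather than causing a server error when mod_rewrite is not loaded, and it keeps all the user agents in a single case-insensitive pattern. The two names are the ones already used above; any others would go between the parentheses, separated by a pipe.

```apache
# Sketch only: assumes Apache with .htaccess overrides allowed.
<IfModule mod_rewrite.c>
    RewriteEngine On
    # One condition for all unwanted crawlers; NC ignores case.
    RewriteCond %{HTTP_USER_AGENT} (Baiduspider|YandexBot) [NC]
    # Return 403 Forbidden for any request that matched.
    RewriteRule ^ - [F]
</IfModule>
```

After saving the file, load a page on the site right away. If the server returns an error, remove the new lines and check them for typos.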

To block IP addresses, use the following:

# Block IP addresses
# Allow rules are checked first, then Deny rules; a Deny match wins.
Order Allow,Deny
Allow from all
Deny from 38.99.96.162
Deny from 50.22.237.250
Deny from 89.145.95.2
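
The Order, Allow and Deny directives above are the Apache 2.2 style. On Apache 2.4 they only work when the mod_access_compat module is loaded. A rough equivalent in the 2.4 syntax, assuming the host permits these directives in .htaccess, might look like this, using the same three example addresses:

```apache
# Sketch only: Apache 2.4 syntax (mod_authz_core).
<RequireAll>
    # Let everyone in by default...
    Require all granted
    # ...except requests from these addresses.
    Require not ip 38.99.96.162
    Require not ip 50.22.237.250
    Require not ip 89.145.95.2
</RequireAll>
```

Use one style or the other in a given file; mixing the old and new directives can produce confusing results.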

But whatever you do, don’t block Google, Bing and other worthwhile search engines.

© 2007-2018 Promise Media LLC