[Image: Toy robot standing on desk in front of computer]

SEO for ecommerce: Managing search engine crawlers

Most of the tools we use in SEO for ecommerce influence search engine bots cumulatively. All of the page elements that we have discussed so far—keywords, links, images, video—help the search engine bot assess what the page is about.

But they work in conjunction with each other. As we’ve pointed out before, keywords by themselves have only a limited influence on the search engine crawlers. Much more important is the context in which those keywords appear. The use of links, with or without keywords, is another important piece. All these factors add up, and in an ideal situation, have a positive effect on your ranking.

At the end of the day, however, your efforts still depend on the algorithms that define the search engines’ behavior. This is why you have to continually refine your search engine strategy over time. There are very few ways to influence search engine crawlers directly—that is, to tell them exactly what to do with a page on your website.

However, there are two highly specialized tools in your SEO toolkit which let you do exactly that.

Canonical URLs

As we know, search engines tend to penalize duplicate content. When the exact same content appears on more than one unique page, it can result in penalties for whichever page(s) the search engine crawlers determine to be derivative.

[Image: Solitary red figure in a crowd of white figures]

The most obvious reason for these rules is to discourage content theft. The penalties deter competitors from lifting blog posts or product descriptions from your website and republishing them on their own sites.

These rules also exist to help encourage the development of high-quality, useful content. Years ago, when search engines looked primarily for keywords, it was easy to simply duplicate pages with keyword-rich content to boost your ranking. Such duplicate pages, however, hold little value for human readers. As such, today’s search engine crawlers penalize them heavily.

However, there are legitimate situations in which you may want to use the same (or very similar) content on more than one page. A classic example would be product pages for a line of products with only slight variations in size or color. Assuming that the functionality and specifications are otherwise identical, it would be more practical to use the same description for all the products in this series.

So how can you do it without being penalized? By using a canonical URL.

A canonical URL is one that you define as the official or “preferred” version of a given page. When search engine spiders crawl your website, they will only index the canonical page. In this way, other pages with the same content don’t get penalized.

You can define the canonical version of a page by adding a link element with rel="canonical" (pointing to the preferred URL) to the <head> section of the page’s HTML. However, if you are using an ecommerce solution with a content management system, you can usually define canonical URLs in the page’s settings instead.
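For example, suppose the red and blue versions of a toy robot each have their own product page, but you want a single preferred page to be indexed. Each variant page could carry a line like this in its <head> (the URL here is a placeholder, not a real page):

<link rel="canonical" href="https://www.example.com/products/toy-robot" />

Every page carrying this tag tells the crawlers that https://www.example.com/products/toy-robot is the official version, so the duplicate descriptions on the variant pages are consolidated rather than penalized.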

Robots Exclusion Standard

What happens if you don’t want search engine crawlers indexing your web store’s pages?

Obviously this won’t be the case very often, since SEO for ecommerce depends entirely on favorable indexing. Nevertheless, there may be certain pages on your website that you don’t want indexed. Common examples include terms and conditions, privacy policies, and other legal documents.

For cases like this, we have the robots exclusion standard.

[Image: Outstretched hand keeping large blue goldfish from crowd of smaller orange goldfish]

Also known as robots.txt, the robots exclusion standard is essentially a text file that lists the pages on your website that you do not want search engine spiders crawling. In its simplest form, this file specifies the search engine crawler (“user-agent”) to which a rule applies, and the URL path(s) that crawler is asked not to crawl.

An example would look something like this:

User-agent: Googlebot
Disallow: /privacy

A robots.txt file with these lines would tell Google’s crawler not to crawl any URL whose path begins with /privacy. You can block additional URLs by adding more Disallow lines, and you can target other bots by adding separate User-agent groups, each with its own Disallow rules (see the expanded example below).

If you want a set of exclusions to apply to all bots (not just Googlebot or some other specific crawler), you can simply set the User-agent to an asterisk (*).
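Putting these pieces together, a slightly fuller robots.txt might look something like this (the paths here are placeholders for illustration):

User-agent: Googlebot
Disallow: /privacy
Disallow: /terms

User-agent: *
Disallow: /legal

Most major crawlers follow only the most specific group that matches their name, so in this sketch Googlebot would obey its own two rules, while every other bot would fall back to the rules under the asterisk.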

Again, the robots exclusion standard is really just a text file. Once you have created it and defined your exclusions, you upload it to your site’s root directory. Alternatively, as with canonical URLs, if you are using a content management system you may be able to modify your robots.txt file through it.
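For instance, if your store were hosted at www.example.com (a placeholder domain), crawlers would look for the file at:

https://www.example.com/robots.txt

Placing it anywhere other than the root means the crawlers simply won’t find it.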

It’s important to note that the robots exclusion standard does not actually prevent bots from crawling or indexing a page. It only acts as an instruction for the crawlers to follow. For the most part, the major search engine crawlers respect the instructions in this file and will skip the disallowed pages. However, because the file has no means of enforcing compliance, you should not consider it an effective security measure against malicious bots.
