
Preventing Google and Bing from Indexing Duplicate Content Sections and Search Spam
When a website is first set up, one of the first hurdles to get past is making sure that search engines find it, crawl all over it (via your structure and perfect internal linking) and then index it. It’s important that the “engines of search” get in there, begin to evaluate your content and start to give it some love.
Crawling and indexing are two concepts that are frequently misunderstood by website owners, who typically try to avoid being sucked into the technicalities of getting online.
Crawling is the process where a piece of software, in the depths of a search engine’s computers, sets out to discover, check and assess a website. The crawler (or spider) goes to a URL and reads the content it finds there. That URL will either be something already in the search engine’s database or one submitted for indexing by a website owner through Google Search Console.
Indexing is simply the process of being added to a search engine’s database, being evaluated and ranked somewhere in the search food chain for the various topics (or entities) that Google, Bing et al recognise.
So What’s This No-Index Nonsense?
The “noindex” rule that you as a website owner need to be aware of tells a search engine to leave that content out of its index. It can be set with either a <meta> tag in the page’s HTML or an HTTP response header. Contrary to popular belief, it is not set through the “robots.txt” file – that file controls crawling rather than indexing, and we’ll come back to why the distinction matters.
Noindex is used to prevent the listing (indexing) of content by compliant search engines – those that support the “noindex” rule, which grew out of the robots exclusion standard drawn up way back in 1994.
The big search monsters such as Google and Bing both honour it. When Googlebot crawls a page and extracts a noindex tag or header statement, Google will drop that page entirely from Google Search results, regardless of whether other sites link to it.
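For the avoidance of doubt, the meta tag version is just one line dropped into the <head> of the page you want left out. As a hedged sketch (the page itself is made up):

    <!DOCTYPE html>
    <html>
      <head>
        <!-- Tells compliant crawlers (Googlebot, Bingbot and friends) not to index this page -->
        <meta name="robots" content="noindex">
        <title>A page we'd rather keep out of search results</title>
      </head>
      <body>
        <p>Printer-friendly copy, internal search results, or anything else best left out of the index.</p>
      </body>
    </html>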
“You cannot be serious!” I hear you cry. Why the hell would I ask search engines to ignore my own content? It’s a valid question and one that requires some explanation. Websites are typically built on a platform. That platform might be WordPress, Joomla, Drupal, Wix, Squarespace, CMS Made Simple or an ecommerce platform like OpenCart. Then there are all of the hosting companies’ proprietary “website builders” or – Bog forbid… Yell sites! Aaaargh!
Imperfection In the Backend and Duplication in the Eye of the Beholder!
Each of these platforms has quirks that mean your great bit of content for a page or a blog post can show up several times under different URLs. Because these platforms attempt to be all things to all humans, there are times when that inbuilt flexibility in architecture can have unintended consequences.
For instance, you may have printer or mobile versions of a content piece, and it’s the web version that you want the engines to recognise as the one true version. Marking one version as the canonical version is the main defence against diluting your content with multiple copies of the same thing. However, non-canonical versions can still end up indexed even when you have told Google and its peers which one they should catalogue. That’s when you’d also add a noindex rule to the specific versions of pages you don’t want indexed.
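As a sketch of what that looks like in practice (the URLs here are hypothetical), the duplicate version points back at the one true version with a canonical link and, where the canonical hint alone isn’t doing the job, carries the noindex rule as well:

    <!-- In the <head> of the printer/mobile duplicate -->
    <!-- Declare the web version as the one true (canonical) version -->
    <link rel="canonical" href="https://www.example.co.uk/guides/widget-care/">
    <!-- Belt and braces: keep this particular copy out of the index -->
    <meta name="robots" content="noindex">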
WordPress and other blogging-based platforms love to structure in category and tag pages, plus author archives and various other “taxonomies” or types of pages/posts. On many builds these pages append a category/tag/author section into the link. For most site owners’ purposes you want people to be able to use these taxonomy-based listings on your website to help them navigate to a piece of content once they have landed, but in most cases you will want the article itself to be indexed rather than the tag or category version of it. The differences are subtle but can limit your visibility if not managed properly.
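Most SEO plugins (Yoast and its peers) can manage this for you, but conceptually what ends up in the <head> of a tag or category archive you don’t want indexed is something like the example below – the “follow” part keeps the links on that archive crawlable even though the page itself stays out of the index:

    <!-- In the <head> of a category/tag/author archive page -->
    <meta name="robots" content="noindex, follow">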
Internal Search Spam: Clogging up your Search Console Data
The other case for using noindex that has cropped up increasingly is when you are afflicted by something known as internal search spam. In these cases you will notice that the Pages section of your Google Search Console lists dozens of weird and sometimes disturbing-looking pages on your website, like this example: https://www.mysensiblewebsite.co.uk/index.php?route=product/search&tag=PHWin+TikTok+Filipinas+%E2%8F%A9+%28+gp9.site+%29+%E2%8F%AA+Bagong+bukas%2C+laro+na%21+Huwag+palampasin+ang+malaking+premyo%21PHWin+TikTok+Filipinas+++GAMING+1411242.htm
(and no, I’m not making it a live link) which is trying to tempt you to click through to a junky Filipino gaming(?) site, or something far dodgier in reality.
Effectively, Google has been deceived into thinking that this blatantly spammy piece of bullshit is actually a page on your website. The problem with trying to handle this through robots.txt – blocking the crawler from certain content – is that the spam or non-canonical pages can still end up indexed, especially if there are links from other websites pointing at these, usually non-existent, pages.
This is the flaw exploited by spammers using the internal search spam methods that clog up Search Console indexing lists. On a website somewhere in the ether, a spammer will create a link to a page generated by your site’s internal search.
Ecommerce product searches offer the perfect vehicle for this. The spammer will pick on an ecommerce site and search for non-existent products with a string of crap that will link to their nefarious destination.
A string of nonsense like the one in the example above – and semi-simulated in the video clip – will generate a search results page URL like the one you see copied into a notebook file at the end. You’ll be pleased to know that we haven’t linked to this spurious URL, as we do business cleanly. And neither have we revealed the client’s website.
One of my client websites was being inundated by tens of thousands of these kinds of spurious page results. On a website with around 2,000 products, there were over 143,000 indexed pages – 90% of which were non-existent, spurious URLs – and over 400,000 more that Google had decided not to index itself. It was becoming difficult to find information about the real pages.
Implementing noindex meta tags
There are two ways to implement noindex: as a <meta> tag, and as an HTTP response header, which is the route for non-HTML content such as PDFs. They have the same effect; choose the method that is more convenient for your site and appropriate for the content type. Specifying the noindex rule in the robots.txt file is not supported by Google.
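As a rough illustration of the header route (assuming an Apache server with mod_headers enabled – the equivalent on Nginx or another stack will look different), a snippet like this in the site’s configuration attaches the noindex rule to every PDF, which has no <head> for a meta tag to live in:

    # Send the noindex rule as an HTTP response header for non-HTML files
    <FilesMatch "\.pdf$">
      Header set X-Robots-Tag "noindex"
    </FilesMatch>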
To combat this flood of spam on our client’s website, we have implemented a blanket noindex directive for this URL pattern: “https://www.mysensiblewebsite.co.uk/index.php?route=product/”.
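How you apply a pattern-based rule like that depends on your platform and server. Purely as a sketch (assuming an Apache 2.4 host with mod_headers, and that the genuine product pages live on SEO-friendly URLs rather than raw index.php?route=... ones), it can be done at the server level along these lines:

    # Any request whose query string matches the abused pattern gets the
    # noindex header, so compliant engines leave those URLs out of the index
    <If "%{QUERY_STRING} =~ m#route=product/#">
      Header set X-Robots-Tag "noindex"
    </If>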
To deal with the pre-existing spam that had already been indexed, we went into Search Console and used the Removals tool (under the Indexing section) to temporarily remove the spam URLs – the block lasts around six months – while we implemented the fix on the website itself. As usual, the fix is imperfect. Google is still indexing new URLs that conform to the banned patterns, but in much lower numbers.
For the noindex rule to be effective via the meta tag method, the page or resource must NOT be blocked by a robots.txt file, and it has to be otherwise accessible to the crawler. If the page is blocked by robots.txt, the crawler won’t fetch the page and so will never see the noindex rule, which means the page can still appear in search results, particularly if other websites or pages link to it.
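In other words, resist the temptation to also Disallow the same pattern in robots.txt while you are relying on the noindex tag. A hypothetical rule like the one below would stop the crawler fetching those URLs at all, so it would never see the noindex instruction sitting on them:

    # Counter-productive alongside a noindex tag: blocked URLs are never
    # fetched, so the tag is never seen
    User-agent: *
    Disallow: /index.php?route=product/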
There is much more to the whole indexing and robots.txt domain but this article just deals with the specifics of what we have run into recently.