How to Stop Russians from Grinding Axes in Analytics

Google Analytics has a “ghost spam” issue lately and one really annoying Russian is responsible for most of it. I’ve read through buckets of posts from Analytics professionals about how to deal with this tomfoolery. I’d like to share a few of the resources and provide some commentary.

This is what some of the recent ghost spam looks like in Analytics:

This is from the Audience Overview report, the first report you see when you log in to Analytics. This spam is designed to be intrusive. People are calling it ghost spam to distinguish it from referral spam, in part because this new spam is more intrusive in Analytics, and in part because many of the traditional mitigation methods aren’t working.

This spam spoofs just about every meaningful user-end variable that’s processed by the Analytics script, and because of that it’s relatively intrusive and difficult to block. The hostname, operating system, device category, browser, screen resolution, source, medium, etc., and of course, language are all spoofed. And in addition to spoofing these things, each field is a variable that can be updated at will. Analytics filters based on fixed variables are useless (e.g., each of the spam languages above was sent at different times over the last few months).
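To see why every one of these fields can be spoofed, it helps to look at what a single hit carries. The sketch below assumes the hits arrive as Universal Analytics Measurement Protocol payloads (the mechanism discussed later in this post). The sample payload, the placeholder property ID, and the describe_hit helper are made up for illustration.

```python
from urllib.parse import parse_qs

# Hypothetical hit payload. Every value is a plain string chosen by whoever
# sends the request; none of it is verified against a real browser.
SAMPLE_HIT = (
    "v=1&t=pageview&tid=UA-XXXXX-1&cid=555"
    "&dh=example.com&dp=%2F"
    "&ul=Vote+for+spam+this+is+not+a+language"
    "&dr=http%3A%2F%2Fspam.example%2F"
)

# Measurement Protocol parameter names mapped to readable labels.
FIELD_NAMES = {
    "tid": "property ID",
    "dh": "hostname",
    "dp": "page path",
    "ul": "language",
    "dr": "referrer",
}


def describe_hit(payload: str) -> dict:
    """Return the readable fields carried by one URL-encoded hit payload."""
    params = parse_qs(payload)
    return {
        label: params[key][0]
        for key, label in FIELD_NAMES.items()
        if key in params
    }


if __name__ == "__main__":
    for label, value in describe_hit(SAMPLE_HIT).items():
        print(f"{label}: {value}")
```

Nothing in the payload distinguishes a real browser from a script, which is why a filter on any single fixed value is easy to sidestep.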

My Favorite Solution

My favorite solution to combat the spam shown above is from AnalyticsEdge.com. In section 3 of their very extensive (and very un-jump-linkable) guide to combat spam in Analytics they recommend this language filter:

.{13,}|\.

It’s a simple and elegant solution. If the user’s language setting contains 13 or more characters, or contains a period, it’s excluded. It works.
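Here is a quick way to sanity-check the pattern before adding it to a view. This is a minimal sketch that assumes the filter matches anywhere in the language field (a partial match), so it uses re.search; the sample language values and the is_spam_language helper are my own.

```python
import re

# The exclude pattern quoted above: 13 or more characters, or any period.
LANGUAGE_SPAM = re.compile(r".{13,}|\.")


def is_spam_language(language: str) -> bool:
    """True if the language value would be caught by the exclude filter."""
    # Assumption: the filter is a partial match, hence search() not fullmatch().
    return LANGUAGE_SPAM.search(language) is not None


SAMPLES = [
    "en-us",                          # ordinary browser language
    "zh-hans-cn",                     # longer tag, still under 13 characters
    "spam.example",                   # short, but contains a period
    "Vote for spam, it is great",     # long message stuffed into the field
]

if __name__ == "__main__":
    for value in SAMPLES:
        verdict = "exclude" if is_spam_language(value) else "keep"
        print(f"{verdict:8} {value}")
```

Swapping the 13 for a 7 in the pattern is an easy experiment to run against an export of your own language values before changing a live filter.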

That said, I’m not entirely sure why the author chose 13 characters. Language and location codes were an issue in the early days of the web but ISO 639-1 has existed since 2002 and browsers have reported it with remarkable consistency for years. You’re probably completely safe using 7 characters instead of 13. Even if you’re working with Analytics in technologically underdeveloped areas (e.g., rural China where a computer is running Windows XP with IE6), you’re probably fine with IETF tags if you use 7 characters. Regardless, it works either way because Vitaly Popov is a narcissist and as such he requires at least 13 characters to send his very important messages to you.

Vitaly Popov?

The guy who’s creating this spam is a Russian named Vitaly Popov. His main website is ilovevitaly.com (he also pushes o-o-8-o-o.com but that simply redirects to ilovevitaly.com). He wants people to use his website for keyword searches and to know he exists, so he’s spamming Analytics. It’s annoying. I’m waving a tiny flag by not linking to his site, but you can feel free to visit it directly and test it out. Just try typing “o-o-8-o-o.com” into an address bar and you’ll know why Russia has a smaller GDP than Italy, Canada and South Korea. It looks like this:

The search box is located behind the large instructional modal.

Legend has it Google shut down his AdSense account a few years ago. This would have stopped him from making any money from Google ads—similar to how Google stopped showing ads on some fake news sites after the 2016 US election. This disagreed with his disposition, and he…created a horrible search proxy portal thingy. He also started using the Google Analytics Measurement Protocol to spam and spoof an unprecedented number of Analytics sessions.

I’m somewhat hilariously confident in assuming that his AdSense account was suspended because he was sending fraudulent ad impressions. And I wouldn’t be surprised if his Analytics spamming script is a branch of his impression spoofing script.

In a recent email to Motherboard he listed some of his motivations (or narcissistic rants, depending on how you read them):

Help to Trump and Russia. I like Trump. I even sacrificed traffic to help him.

I was fully prepared from April, but I wait. I could begin in a month before the elections and on a wave of the anti-Russian hysteria to receive a lot of traffic.

Personal glory. I like my name—Vitaly Popov and want that it was known.

I can because I live in Russia. If I live in USA or Europe, I’ll not begin. God bless Russia!

Revenge to Google.

Money. Traffic is money.

Traffic is money. He’s correct about that. But fake traffic isn’t real traffic, and fake traffic isn’t (or shouldn’t be) money.

Hostname Filters!

The standard solution to Analytics spam (for years) has been a hostname filter. A “hostname” is basically a domain or subdomain. (For example, the hostname of this website is empirical.digital.) Most traditional Analytics spam (most of it is known as “referral spam”) is sent from other hostnames, so if I wanted to exclude spam I could set up a hostname filter in Analytics that excluded data from visits on other domains. Many people automatically started recommending this approach when Vitaly Popov’s ghost spam started showing up.

Although this approach typically works to exclude Analytics spam, it doesn’t work to exclude all ghost spam. That’s because some of this spam spoofs the original hostname. In other words, ghost spam to example.com sometimes spoofs the hostname as example.com. That’s quite impressive and it means that Vitaly Popov is most likely running a spider that traverses the web and picks up Analytics IDs to correlate them with spoofed hostnames.
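The gap is easy to see in a small simulation. This is a sketch only: the hits, the example.com hostname pattern, and the passes_hostname_filter helper are hypothetical stand-ins for an include filter on hostname.

```python
import re

# Hypothetical include filter: keep only hits reported on our own hostname.
VALID_HOSTNAME = re.compile(r"^(www\.)?example\.com$")

HITS = [
    # Classic referral spam, reported on some other hostname.
    {"hostname": "spam-referrer.example", "language": "en-us"},
    # A real visit.
    {"hostname": "example.com", "language": "en-us"},
    # Ghost spam that spoofs our hostname.
    {"hostname": "example.com", "language": "Vote for spam, it is great"},
]


def passes_hostname_filter(hit: dict) -> bool:
    """True if the hit's hostname matches the include pattern."""
    return VALID_HOSTNAME.search(hit["hostname"]) is not None


KEPT = [hit for hit in HITS if passes_hostname_filter(hit)]

if __name__ == "__main__":
    for hit in KEPT:
        print(hit["hostname"], "|", hit["language"])
```

The first hit is dropped, but the spoofed one survives next to the real visit, so the hostname filter needs to be paired with something like the language filter.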

Hostname Filters?

Hostname filters will most likely be a less effective spam mitigation method moving forward. Additionally, there’s another reason not to use hostname filters. Setting up a hostname filter (and doing nothing else) means that you’ll never know how frequently people are viewing your website’s content on other domains. This happens more than you may assume. Some examples:

Whitehat Hostname Discrepancies:

  • Translation services. When website visitors use a translation service, the resulting pageview happens with a hostname of translate.googleusercontent.com, translate.baiducontent.com, etc.
  • Google cache. When Google users view a cached version of your website, the resulting pageview happens with a hostname of webcache.googleusercontent.com.
  • Archive.org. When people view previous versions of your website on archive.org (or other archives), the pageview happens with a hostname of web.archive.org.

Blackhat Hostname Discrepancies:

  • Content scrapers. When people scrape content from websites and put that content on other domains, they usually take all of the HTML (including the Analytics script or GTM container). This means that you can usually detect content scrapers by viewing hostnames (as a secondary dimension) in Analytics. None of this traffic is logged in Analytics if you exclude other hostnames from showing up.
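If you export hostname and pageview counts from an unfiltered view, a few lines of Python will surface the hostnames worth a closer look. This is a rough sketch: the row format, the KNOWN_HOSTNAMES set, and the sample numbers are assumptions for illustration, not real data.

```python
from collections import Counter

# Assumption: the hostnames that legitimately serve this site.
KNOWN_HOSTNAMES = {"example.com", "www.example.com"}

# Hypothetical export rows: (hostname, pageviews).
ROWS = [
    ("example.com", 1200),
    ("translate.googleusercontent.com", 14),
    ("web.archive.org", 3),
    ("scraper.example", 40),
    ("webcache.googleusercontent.com", 9),
]


def unexpected_hostnames(rows):
    """Return (hostname, pageviews) for hostnames we don't own, busiest first."""
    counts = Counter()
    for hostname, pageviews in rows:
        name = hostname.lower()
        if name not in KNOWN_HOSTNAMES:
            counts[name] += pageviews
    return counts.most_common()


if __name__ == "__main__":
    for hostname, pageviews in unexpected_hostnames(ROWS):
        print(f"{pageviews:6} {hostname}")
```

Translation and cache hostnames are expected; an unfamiliar domain with a meaningful pageview count is the one to investigate.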

Preserving Raw Analytics Data

If you still want to use a hostname filter to exclude more traditional spam, consider doing one of the following in order to capture pageviews reported from extraneous hostnames:

  1. Create a view in your Analytics property to store all data without any filters.
  2. Use a lookup table in Google Tag Manager to separate traffic from different hostnames to different Analytics properties. For example, if you have a dev site and a live site, you can automatically use different Analytics IDs for each (while using the same GTM container). Similarly, all traffic to undefined hostnames (archive.org, translate.googleusercontent.com, etc.) can be sent to a third property (that is used primarily to track undefined views). Note that using this method in GTM doesn’t help mitigate this new form of spam (because the spammers are using the Measurement Protocol to trigger pageviews, not GTM), but it still helps you detect scraped content and pageviews on other extraneous domains.
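The lookup table in option 2 is configured in the Tag Manager interface rather than in code, but the logic it applies is simple enough to sketch. The hostnames and the placeholder property IDs below are hypothetical.

```python
# Hypothetical mapping of hostnames to Analytics property IDs.
PROPERTY_BY_HOSTNAME = {
    "example.com": "UA-XXXXX-1",      # live site
    "www.example.com": "UA-XXXXX-1",  # live site
    "dev.example.com": "UA-XXXXX-2",  # dev site
}

# Default output of the lookup table: everything we did not list.
FALLBACK_PROPERTY = "UA-XXXXX-3"


def property_for(hostname: str) -> str:
    """Pick the property ID for a hostname, falling back to the catch-all."""
    return PROPERTY_BY_HOSTNAME.get(hostname.lower(), FALLBACK_PROPERTY)


if __name__ == "__main__":
    for host in ("example.com", "dev.example.com", "web.archive.org"):
        print(f"{host} -> {property_for(host)}")
```

The important design choice is the default value: without it, pageviews on undefined hostnames would land in the live property or nowhere at all.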

 

What It Means for the Future of Google Analytics

Ghost spam could become a huge issue if a lot of people start spoofing pageviews like this. Right now it’s primarily an ax-grinding Russian, but if the method spreads it could feasibly make Analytics meaningless quite quickly for a lot of accounts. There’s a possibility that more people will now start implementing this sort of hack. There’s also the possibility that hacks will only spoof the language setting with 5 (or fewer) characters, making my recommended solution meaningless.

Google hasn’t responded in a meaningful way to these hacks, which makes it seem like they’re relying on an open source community to solve the problem with aftermarket filters and edits. They need to start blocking certain IPs and traffic patterns from sending pageviews through the Measurement Protocol at the very least.

The system we know as Google Analytics was originally developed by a company named Urchin. The main thing Urchin did to convince Google to buy them was create a JavaScript file (what became the Analytics script) to augment log files. They did it primarily to classify ad traffic correctly, since ad traffic shows up as referral traffic in log files, and *this* is why Google purchased them. It stands as a worthy footnote that Urchin also developed that JavaScript file to pick up pageviews through ISPs that had started caching HTML (e.g., AOL), where pageviews wouldn’t hit the server; i.e., to represent accurate pageviews regardless of advertising traffic. Once Analytics was separated from log files it became easy to implement and hugely popular, but it also lost its connection to its log file roots and as such became much more susceptible to spoofing. Honestly, it’s quite remarkable that Google Analytics data has remained relatively unhacked at massive scales for so long.

All of the traditional enterprise competitors to Google Analytics (Webtrends, Omniture/Adobe, etc.) have always followed the same model as Urchin: JavaScript complements logs (to varying degrees, and with varying degrees of transparency). The JavaScript-only model of Analytics has become a convenient and easy-to-implement standard, but it’s also far easier to fool.

I’m interested in seeing where this goes 5 years down the road, but as always I recommend that all clients store log files on their servers indefinitely. If nothing else, two long-term sources of data are better than one.
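As a rough illustration of what those log files can do, here is a sketch that counts page requests per day so the totals can be compared against Analytics. It assumes the server writes Common or Combined Log Format lines; the sample lines, the list of static file extensions, and the pageviews_per_day helper are made up for the example.

```python
import re
from collections import Counter

# Assumption: Common/Combined Log Format, e.g.
# 203.0.113.5 - - [10/Dec/2016:13:55:36 +0000] "GET / HTTP/1.1" 200 5120 ...
LOG_LINE = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[(?P<day>[^:]+):[^\]]+\] '
    r'"(?P<method>\S+) (?P<path>\S+) [^"]*" (?P<status>\d{3}) '
)

# Requests for these file types are assets, not pageviews.
STATIC_EXTENSIONS = (".css", ".js", ".png", ".jpg", ".gif", ".ico", ".svg")

SAMPLE_LINES = [
    '203.0.113.5 - - [10/Dec/2016:13:55:36 +0000] "GET / HTTP/1.1" 200 5120 "-" "Mozilla/5.0"',
    '203.0.113.5 - - [10/Dec/2016:13:55:37 +0000] "GET /style.css HTTP/1.1" 200 900 "-" "Mozilla/5.0"',
    '198.51.100.7 - - [11/Dec/2016:08:01:02 +0000] "GET /blog/ HTTP/1.1" 200 7300 "-" "Mozilla/5.0"',
    '198.51.100.7 - - [11/Dec/2016:08:01:09 +0000] "GET /missing HTTP/1.1" 404 300 "-" "Mozilla/5.0"',
]


def pageviews_per_day(lines):
    """Count successful GET requests for non-static paths, grouped by day."""
    counts = Counter()
    for line in lines:
        match = LOG_LINE.match(line)
        if match is None:
            continue  # skip lines that are not in the expected format
        path = match.group("path").split("?", 1)[0].lower()
        if (
            match.group("method") == "GET"
            and match.group("status") == "200"
            and not path.endswith(STATIC_EXTENSIONS)
        ):
            counts[match.group("day")] += 1
    return counts


if __name__ == "__main__":
    for day, total in pageviews_per_day(SAMPLE_LINES).items():
        print(day, total)
```

Ghost spam sent straight to Analytics never touches your server, so it cannot show up in these counts. A day where Analytics reports far more pageviews than the logs is worth a second look.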