Is Google Analytics Spam Messing Up Your Metrics? (Probably)
What’s the impact of a little spam data on your Google Analytics metrics? If you run a large website with tens of thousands of visitors or more per day, then maybe not much. However, if your site is smaller, there’s a good chance spam may be seriously skewing your metrics.
Below is the Acquisition > All Traffic > Referrals report from a small non-profit. I’ve checked the spam referral sources, and clicked “Plot Rows” to see the daily level of traffic from these spammers.
In the table above, you can see that spam accounts for the top two slots in this Referrals Report! Not only is this annoying, but it messes up the metrics pretty badly.
To analyze the spam’s impact on the non-profit’s metrics, I exported this table into Excel and did some calculations. In the results below, the Sources highlighted in yellow are spam referrals; the two summary lines at the bottom show metric calculations with and without spam.
The first thing to note is that 144 out of 283 referral Sessions are spam. That’s 50.9%! The impact on small websites like this one is huge, because these spam visits throw off the engagement metrics. As you can see from the spreadsheet, the Bounce Rate for most spam referral sources is 100%, the Pages/Session is close to 1.0 and the Avg. Session Duration is close to 0.00. When more than 50% of the referral traffic is spam, it seriously drags down the engagement numbers and gives you a false impression of the quality of your traffic.
Compare the Bounce Rate, Pages/Session, and Avg. Session Duration for “Including Spam” vs “Excluding Spam” (numbers inside the red rectangle). The spam is making these metrics look much worse than they really are. Bounce Rate, for example, is reading 77.74%. But, when we exclude the spam, the Bounce Rate is a much better 55.4%.
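If you want to sanity-check numbers like these without a spreadsheet, here is a minimal Python sketch using only the figures quoted above. The bounce counts are not given in the text, so the sketch reconstructs them by rounding rate × sessions; treat that step as an assumption.

```python
# Figures quoted in the text above.
total_sessions = 283
spam_sessions = 144
bounce_rate_with_spam = 0.7774     # "Including Spam"
bounce_rate_without_spam = 0.554   # "Excluding Spam"

clean_sessions = total_sessions - spam_sessions
spam_share = spam_sessions / total_sessions

# Assumption: whole-session bounce counts are recovered by rounding
# rate * sessions, since the raw counts are not listed in the post.
total_bounces = round(bounce_rate_with_spam * total_sessions)
clean_bounces = round(bounce_rate_without_spam * clean_sessions)
spam_bounces = total_bounces - clean_bounces
spam_bounce_rate = spam_bounces / spam_sessions

print(f"Spam share of referral sessions: {spam_share:.1%}")      # 50.9%
print(f"Implied bounce rate of spam sessions: {spam_bounce_rate:.1%}")  # 99.3%
```

The implied spam bounce rate of roughly 99% lines up with the observation that most spam sources bounce 100% of the time.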
Other than exporting data to Excel and re-calculating all the numbers, is there any way we can stop these spam referrals from polluting our Google Analytics data?
Filtering Out Google Analytics Spam
The techniques for removing spam rely on using Google Analytics View Filters. I first read about these techniques in this excellent article from the Analytics Edge blog: Removing Referral Spam from Google Analytics.
As explained in that article, there are two basic groups of spammers using two different techniques, and you need to use slightly different filters to combat each technique.
Eliminating Ghost Referrals
The first group is what people are starting to call “Ghost Referrals.” These are referrals generated in your reports by fake visits. In this scenario, the spammers don’t even visit your website. Instead, they transmit spammy data directly to Google Analytics that gets added to your reports.
To start cleaning this up, we create a new view and then add some filters. As shown below, you can create a view in the Admin section of Google Analytics. Pick the Account and Property where you want to create a spam-free view. [Note: Views do not contain data from before the date on which they are created. If you create a view on Jan 2nd, there will be no data in that view prior to Jan 2nd. So, this new spam-free view will not clean up your historical data, only the new data coming in.]
Next, we are going to create a list of the valid hostnames that should be showing up in your Google Analytics reports. The key to removing ghost referrals is that they come from hostnames that are not yours – and you can use that weakness to filter them out.
Below is a list of the valid hostnames of visits to our Megalytic website:
Note the last one – translate.googleusercontent.com. This is the hostname that shows when a user views your website through Google Translate – you do not want to filter that out.
If you are not sure of your list of valid hostnames, you can look at the Audience > Technology > Network report and select Hostname as the primary dimension. Set a long time range in the calendar – like a year or more if you have that much data. This will ensure that you capture all the valid hostnames.
Here is what that report looks like for Megalytic. The valid hostnames have little red arrows next to them. The rest (e.g., apple.com, iedit.ilovevitaly.com) are from spammers!
Once you have your list of valid hostnames, put them on a single line of text, separated by the “|” (OR) character. Also put a backslash in front of every “.” (period) character, so that it matches a literal period instead of any character. This creates a regular expression that matches your good hostnames; used in an Include filter, it keeps those and drops all the spammer hostnames.
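If your list of hostnames is long, you may prefer to generate the expression instead of typing it by hand. The following is a minimal Python sketch of the two steps just described. The function name `build_hostname_pattern` is made up for illustration, and the hostname list contains only the two hostnames mentioned in this post, not a complete list.

```python
import re


def build_hostname_pattern(hostnames):
    """Join hostnames with "|" and put a backslash before every "."."""
    return "|".join(host.replace(".", r"\.") for host in hostnames)


# Illustrative list only: both hostnames are mentioned in this post,
# but your own list will differ.
valid_hostnames = [
    "megalytic.com",
    "translate.googleusercontent.com",
]

pattern = build_hostname_pattern(valid_hostnames)
print(pattern)

# Compiling the result is a quick way to catch a malformed expression;
# re.compile raises re.error if the pattern is invalid.
re.compile(pattern)
```

The printed line is what you would paste into the filter or segment.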
For example, here is what we use:
Before you put this filter expression to use, we recommend that you build a segment to test it on your historical data. Filters permanently alter the data in a view, while segments do not, so it’s a good idea to try a filter expression in a segment before using it in a filter.
Another benefit of testing your filter expression in a segment is that you can use this segment to look at your historical data without the ghost referrals.
Here is the testing segment created for Megalytic, which we named “My Hosts.”
And here are the results, filtered by using the “My Hosts” segment:
As you can see, some of the sessions have been filtered out – the “My Hosts” segment has 19,934 Sessions vs 20,235 in “All Sessions.”
Next, apply the “My Hosts” segment to your Audience > Technology > Network report and select Hostname as the primary dimension. Check to see that only valid hostnames are showing up. Below are the results for Megalytic.
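You can run a similar check offline against a list of hostnames copied from the Hostname report. This is only a sketch: the include expression is a hypothetical one built as described earlier, the observed hostnames are sample values (apple.com and iedit.ilovevitaly.com are the spam examples named in this post), and it assumes the expression is applied as a partial match, which is why it uses `re.search`.

```python
import re

# Hypothetical include expression, built as described earlier in the post.
INCLUDE_PATTERN = re.compile(
    r"megalytic\.com|translate\.googleusercontent\.com"
)

# Sample hostnames as they might appear in the Hostname report.
observed_hostnames = [
    "megalytic.com",
    "www.megalytic.com",
    "translate.googleusercontent.com",
    "apple.com",
    "iedit.ilovevitaly.com",
]


def split_hostnames(hostnames, pattern):
    """Return (kept, dropped) lists for an include-style partial match."""
    kept = [h for h in hostnames if pattern.search(h)]
    dropped = [h for h in hostnames if not pattern.search(h)]
    return kept, dropped


kept, dropped = split_hostnames(observed_hostnames, INCLUDE_PATTERN)
print("Kept:   ", kept)
print("Dropped:", dropped)
```

If a hostname you recognize as your own lands in the dropped list, add it to the expression before turning it into a filter.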
Once you are confident your filter expression is working correctly, add it to your new view. We called our new view “Spam Free.” You can see below how we selected the “Filters” section to create a filter on this view, and then pasted in our filter expression as a Custom Filter Type. Make sure to select “Include” and to filter on the “Hostname” field.
Save this filter and you should be all set. This new view will now exclude all ghost-referral spam. Unfortunately, filtered views only include data from the date they were created. So, you cannot use this view to look back at historical data. However, you can use the segment “My Hosts” created during the testing process to view spam-free historical results.
Eliminating the non-Ghost Referral Spam
Unlike the ghost referrals, some of the spammer bots, like Semalt, actually visit your website. These will not be removed using the hostname filter described above. To remove these, you will need to create another filter that will exclude a list of known referral spam domains.
So, to clarify, the first filter INCLUDES only your valid hostnames. That kills the ghost-referral spammers. This second filter will EXCLUDE known spammer domains.
To find the non-ghost spammers visiting your website, open Acquisition > All Traffic > Referrals and add Hostname as a secondary dimension. Spam sources where the Hostname is valid (in our case, megalytic.com) are the non-ghost spammer domains we need to exclude.
From this list, you can see that semalt.com and buttons-for-website.com should go on our list. As before, create a filter, but this time use Referral as the Filter Field, and the filter is:
As shown below, we name this filter “Exclude non-Ghost Referral Spam.”
You should check your Acquisition > All Traffic > Referrals report periodically to identify any new spam referral domains that start showing up. Add these new ones to your filter as necessary to keep your data as spam-free as possible.
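Keeping the exclude list in one place makes these periodic updates easier. Here is a minimal Python sketch that rebuilds the exclude expression from a list of domains and previews which referral sources it would remove. The two spam domains come from this post; the sample referral sources and the function name `build_exclude_pattern` are hypothetical, and the preview assumes a partial match.

```python
import re


def build_exclude_pattern(spam_domains):
    """Build one "|"-separated expression from de-duplicated, sorted domains."""
    unique = sorted(set(spam_domains))
    return "|".join(domain.replace(".", r"\.") for domain in unique)


# The two non-ghost spam domains identified in this post.
spam_domains = ["semalt.com", "buttons-for-website.com"]

exclude_expression = build_exclude_pattern(spam_domains)
print(exclude_expression)

# Hypothetical referral sources, used only to preview the effect.
referral_sources = [
    "semalt.semalt.com",
    "buttons-for-website.com",
    "news.example.org",
]
exclude = re.compile(exclude_expression)
remaining = [s for s in referral_sources if not exclude.search(s)]
print(remaining)
```

When a new spam domain shows up in your Referrals report, add it to `spam_domains`, regenerate the expression, and paste it into the filter.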
Another approach to filtering out the non-ghost spammers is to stop them from visiting your website at all. If you are hosting your website on the Apache web server, this kind of blocking can be accomplished by modifying the .htaccess file, as described here: How to block referrer spam traffic.
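As a rough illustration of what such .htaccess rules can look like, the Python sketch below generates a block from a list of spam domains. It uses one common mod_rewrite pattern (a `RewriteCond` on `HTTP_REFERER` per domain, followed by a rule that returns 403 Forbidden); the linked article may do it differently. It assumes mod_rewrite is enabled on your server, and the function name is made up for illustration.

```python
def htaccess_block_rules(spam_domains):
    """Return mod_rewrite rules that answer 403 to the given referrers."""
    if not spam_domains:
        # Without any conditions the rule would block every request.
        return ""
    lines = ["RewriteEngine On"]
    last = len(spam_domains) - 1
    for i, domain in enumerate(spam_domains):
        escaped = domain.replace(".", r"\.")
        # Conditions are chained with OR; the final one omits it.
        flags = "[NC]" if i == last else "[NC,OR]"
        lines.append(f"RewriteCond %{{HTTP_REFERER}} {escaped} {flags}")
    lines.append("RewriteRule .* - [F]")
    return "\n".join(lines)


print(htaccess_block_rules(["semalt.com", "buttons-for-website.com"]))
```

Test any generated rules on a staging copy of your site first, since a mistake in .htaccess can block legitimate visitors.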
If you are running WordPress, there is now a plugin that will do this for you: SpamReferrerBlock. One advantage of using this plugin is that its developers claim to maintain a “blacklist” of known spammer domains and to filter those visits for you, so you do not have to keep your own filters up to date.
Referral spam is becoming a serious problem and I expect that Google will soon introduce new features to help us protect the integrity of our Google Analytics data. Until then, you can use the filtering techniques described in this post to create a view that is relatively free of referral spam.
It's been almost 2 years since I wrote this, and the Google Analytics spam problem is still with us! If you are looking for more details on this subject, I suggest that you check out Carlos Escalera's post: Ultimate Guide to Getting Rid of All the Spam in Google Analytics.
Update on March 23, 2017 ...
I've seen a few articles indicating that Google is taking action to solve this problem. If you have noticed an improvement, let me know in the comments.