Blocking aggressive Chinese crawlers/scrapers/bots

Over the last few days I’ve had a massive increase in traffic from Chinese data centres & ISPs. The traffic has been relentless & the CPU usage on my server kept spiking enough to cause a fault in my cPanel hosting. I’m on a great hosting package with UKHOST4U and the server is fast & stable, but it is shared with a few other websites. This means that I couldn’t just blanket ban Chinese IP ranges. Even though we don’t sell our products in China, a blanket ban seemed like a very heavy-handed approach, and blocking the entire range of Chinese IP addresses via .htaccess added a 2-3 second delay to page parsing (pages normally load in around 600ms).

First I tried blocking the individual IPs, but this seemed to make the bot more aggressive & requests went up as high as 800 every 30 seconds. The range of IPs seemed endless, which suggests some sort of bot farm or a whole range of compromised machines being used for DDoS.

After giving it some thought & checking the raw access logs, I could see a pattern in the user agents being used by the malicious traffic. Below are a few examples of those user agents:-

Mozilla/5.0(Linux;Android 5.1.1;OPPO A33 Build/LMY47V;wv) AppleWebKit/537.36(KHTML,link Gecko) Version/4.0 Chrome/42.0.2311.138 Mobile Safari/537.36 Mb2345Browser/9.0

Mozilla/5.0 (Linux; Android 7.0; FRD-AL00 Build/HUAWEIFRD-AL00; wv) AppleWebKit/537.36 (KHTML, like Gecko) Version/4.0 Chrome/53.0.2785.49 Mobile MQQBrowser/6.2 TBS/043602 Safari/537.36 MicroMessenger/6.5.16.1120 NetType/WIFI Language/zh_CN

Mozilla/5.0(Linux;Android 5.1.1;OPPO A33 Build/LMY47V;wv) AppleWebKit/537.36(KHTML,link Gecko) Version/4.0 Chrome/43.0.2357.121 Mobile Safari/537.36 LieBaoFast/4.51.3

Mozilla/5.0(Linux;U;Android 5.1.1;zh-CN;OPPO A33 Build/LMY47V) AppleWebKit/537.36(KHTML,like Gecko) Version/4.0 Chrome/40.0.2214.89 UCBrowser/11.7.0.953 Mobile Safari/537.36
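If you want to look for the same kind of pattern in your own raw access logs, a short script can tally requests per user agent. This is a minimal Python sketch, not the method used for this post: it assumes the Apache combined log format (where the user agent is the last quoted field on each line), and the function name count_user_agents and the sample log lines are made up for illustration.

```python
import re
from collections import Counter

# Assumption: combined log format, so the user agent is the last
# double-quoted field on each line.
UA_FIELD = re.compile(r'"([^"]*)"\s*$')


def count_user_agents(lines):
    """Return a Counter mapping each user-agent string to its request count."""
    counts = Counter()
    for line in lines:
        match = UA_FIELD.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


# Made-up sample lines (documentation IP ranges, shortened user agents).
SAMPLE_LOG = [
    '203.0.113.5 - - [01/Jan/2020:10:00:00 +0000] "GET / HTTP/1.1" 200 512 "-" '
    '"Mozilla/5.0 (Linux; Android 5.1.1) Mb2345Browser/9.0"',
    '203.0.113.9 - - [01/Jan/2020:10:00:01 +0000] "GET /shop HTTP/1.1" 200 734 "-" '
    '"Mozilla/5.0 (Linux; Android 5.1.1) Mb2345Browser/9.0"',
    '198.51.100.7 - - [01/Jan/2020:10:00:02 +0000] "GET / HTTP/1.1" 200 512 "-" '
    '"Mozilla/5.0 (Windows NT 10.0) Firefox/70.0"',
]

if __name__ == "__main__":
    # Print the busiest user agents first.
    for agent, hits in count_user_agents(SAMPLE_LOG).most_common():
        print(hits, agent)
```

Against a real log you would pass the open file instead of the sample list; the file name and location depend on your host.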

I broke down the user agents above & added a new rule to my root .htaccess file as follows:-

Options +FollowSymLinks
RewriteEngine On
RewriteBase /
RewriteCond %{HTTP_USER_AGENT} Mb2345Browser|LieBaoFast|zh-CN|MicroMessenger|zh_CN|Kinza [NC]
RewriteRule ^ - [F,L]

This rule uses a regular expression to block a user agent containing any of the following strings:-
Mb2345Browser
LieBaoFast
zh-CN
MicroMessenger
zh_CN
Kinza

The first two seem to be commonly used by Chinese crawlers, but as mentioned earlier, we do not ship products to China, so I’m not worried about blocking those browsers. The zh-CN & zh_CN strings refer to Chinese-specific localization settings such as OS & interface language. MicroMessenger is related to WeChat, but again, I’ve never had a customer browse/buy from within the WeChat app so that can be safely blocked. Finally, Kinza: I believe Kinza is an obscure Japanese browser, but on our site its name is commonly misused in the user agent strings of Russian email spammers.
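Before putting a pattern like this live, it can be worth checking what it would and would not catch. The sketch below is a rough Python approximation of the RewriteCond above, under the assumption that Python's re module treats this simple case-insensitive alternation the same way Apache's regex engine does. The helper name is_blocked and the desktop Firefox user agent are illustrative additions; the other four user agents are the examples quoted earlier in the post.

```python
import re

# Same alternation as the RewriteCond, with [NC] expressed as IGNORECASE.
BLOCK_PATTERN = re.compile(
    r"Mb2345Browser|LieBaoFast|zh-CN|MicroMessenger|zh_CN|Kinza",
    re.IGNORECASE,
)


def is_blocked(user_agent):
    """Return True if the pattern matches anywhere in the user agent."""
    return BLOCK_PATTERN.search(user_agent) is not None


# The four example user agents from the access logs.
POST_EXAMPLES = [
    "Mozilla/5.0(Linux;Android 5.1.1;OPPO A33 Build/LMY47V;wv) AppleWebKit/537.36(KHTML,link Gecko) "
    "Version/4.0 Chrome/42.0.2311.138 Mobile Safari/537.36 Mb2345Browser/9.0",
    "Mozilla/5.0 (Linux; Android 7.0; FRD-AL00 Build/HUAWEIFRD-AL00; wv) AppleWebKit/537.36 (KHTML, like Gecko) "
    "Version/4.0 Chrome/53.0.2785.49 Mobile MQQBrowser/6.2 TBS/043602 Safari/537.36 "
    "MicroMessenger/6.5.16.1120 NetType/WIFI Language/zh_CN",
    "Mozilla/5.0(Linux;Android 5.1.1;OPPO A33 Build/LMY47V;wv) AppleWebKit/537.36(KHTML,link Gecko) "
    "Version/4.0 Chrome/43.0.2357.121 Mobile Safari/537.36 LieBaoFast/4.51.3",
    "Mozilla/5.0(Linux;U;Android 5.1.1;zh-CN;OPPO A33 Build/LMY47V) AppleWebKit/537.36(KHTML,like Gecko) "
    "Version/4.0 Chrome/40.0.2214.89 UCBrowser/11.7.0.953 Mobile Safari/537.36",
]

# Illustrative "ordinary visitor" user agent that should pass through.
DESKTOP_FIREFOX = "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:70.0) Gecko/20100101 Firefox/70.0"

if __name__ == "__main__":
    for agent in POST_EXAMPLES + [DESKTOP_FIREFOX]:
        print("BLOCK" if is_blocked(agent) else "allow", agent[:60])
```

Bear in mind that the zh-CN and zh_CN tokens will also match any genuine visitor whose browser advertises a Chinese locale in its user agent, which is acceptable here only because the shop does not sell to China.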

This seems to be quite a simple solution to block traffic. Many spammy users will have something in the user agent string which isn’t common to popular browsers such as Chrome, Safari & Firefox on common devices. You will have to tailor this to your own website’s needs, but I’ve no doubt I’ll be adding other regex patterns from obscure user agents in the future to keep malicious users off the site.
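As the list of tokens grows, hand-editing the regex gets error-prone, because characters such as "." or "(" in a user agent have a special meaning in a regular expression. One possible approach, sketched here in Python, is to keep the tokens in a plain list and generate the RewriteCond line from it. The function name build_rewrite_cond is hypothetical, and the sketch assumes tokens contain no spaces.

```python
import re


def build_rewrite_cond(tokens):
    """Return a RewriteCond line matching any of the given substrings.

    Each token is escaped so that it is matched literally rather than
    interpreted as regex syntax. Assumes tokens contain no spaces.
    """
    if not tokens:
        raise ValueError("need at least one token")
    pattern = "|".join(re.escape(token) for token in tokens)
    return "RewriteCond %{HTTP_USER_AGENT} " + pattern + " [NC]"


if __name__ == "__main__":
    # The tokens from this post; add your own as new offenders turn up.
    tokens = ["Mb2345Browser", "LieBaoFast", "zh-CN", "MicroMessenger", "zh_CN", "Kinza"]
    print(build_rewrite_cond(tokens))
```

The printed line can then be pasted into .htaccess in place of the existing RewriteCond, followed by the same RewriteRule.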

I hope this helps & if you have anything to add, please get in touch or leave a comment.

John Large

My name is John Large. I am a Web Developer, E-commerce site owner & all-round geek. My areas of interest include hardware hacking, digital privacy & security, social media & computer hardware. I’m also a minimalist in the making, interested in the Tiny House movement & the experience economy along with a strong interest in sustainability & renewable energy. When I’m not tapping on a keyboard or swiping a smart phone I can be found sampling great coffee, travelling the world with my wife Vicki (who writes over at Let’s Talk Beauty) & generally trying to live my life as unconventionally as possible.

4 thoughts to “Blocking aggressive Chinese crawlers/scrapers/bots”

  1. Hi,

    I’m facing the same problem right now with my website. But, as I’m using a Varnish cache, I have to block these Chinese bots in the VCL file. Do you also have the lines of code to block user agents like “zh_CN” in the VCL file?

    This is a question from an absolute beginner.

    Kind Regards

  2. Hi Manuel.

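    If it’s an Apache server, a .htaccess rule would still work for any request that reaches Apache, regardless of the caching engine. Bear in mind though that Varnish normally sits in front of Apache, so pages already in the cache can be served without the .htaccess rule ever being consulted; blocking in the VCL stops those user agents earlier & saves even more resources.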

    You may find you already have a ‘.htaccess’ file, in which case you can just add the rules to it from my post. If you use FTP software such as FileZilla to manage files, ensure that show hidden files & folders is selected. Any file beginning with a ‘.’ is normally hidden from most users unless you instruct your client not to hide the file.

  3. Varnish: In sub vcl_recv
    # Reject any request whose User-Agent matches one of these patterns.
    # "error" is Varnish 3 syntax; Varnish 4 and later use return (synth(403, "...")) instead.
    if (
    req.http.user-agent ~ "bingbot"
    || req.http.user-agent ~ "DotBot"
    || req.http.user-agent ~ "Exabot"
    || req.http.user-agent ~ "Gigabot"
    || req.http.user-agent ~ "ICCrawler"
    || req.http.user-agent ~ "Snappy"
    || req.http.user-agent ~ "Yandex"
    || req.http.user-agent ~ "yandexbot"
    || req.http.user-agent ~ "Yeti"
    || req.http.user-agent ~ "Mb2345Browser"
    || req.http.user-agent ~ "QQBrowser"
    || req.http.user-agent ~ "LieBaoFast"
    || req.http.user-agent ~ "MicroMessenger"
    || req.http.user-agent ~ "Kinza"
    || req.http.user-agent ~ "slurp"
    || req.http.user-agent ~ "TheWorld"
    || req.http.user-agent ~ "YoudaoBot") {
    error 403 "Agent Banned. You are banned from this site. Please contact via a different client configuration if you believe that this is a mistake.";
    }

    I also use an external file of attackers’ IPs to build an ACL.

  4. Nice tip for other Varnish users. Thanks for sharing. I might add a few of those to my own .htaccess rule. Can I ask why you are blocking bingbot? Does Bing not drive any traffic to your website? I get a fair bit from Bing.
