Blocking aggressive Chinese crawlers/scrapers/bots

Over the last few days I’ve had a massive increase in traffic from Chinese data centres & ISPs. The traffic has been relentless & the CPU usage on my server kept spiking enough to cause a fault in my cPanel hosting. I’m on a great hosting package with UKHOST4U and the server is fast & stable, but it is shared with a few other websites, so I couldn’t just blanket-ban Chinese IP ranges at the server level. Even though we don’t sell our products in China, that seemed a very heavy-handed approach, and blocking the entire range of Chinese IP addresses via .htaccess added a 2-3 second delay to page parsing (pages normally load in around 600ms).

First I tried blocking the individual IPs, but this seemed to make the bot more aggressive & requests went up as high as 800 every 30 seconds. The range of IPs seemed endless, which suggests some sort of bot farm, or a whole range of compromised machines of the kind used for DDoS.
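For illustration, both dead ends looked something like this in .htaccess. This is only a sketch, assuming Apache 2.4; the addresses & ranges below are documentation placeholders rather than the real offenders:

# Deny-listing individual IPs, then whole allocations; with hundreds
# of entries evaluated on every request, page parsing crawled
<RequireAll>
    Require all granted
    Require not ip 203.0.113.45
    Require not ip 192.0.2.10
    Require not ip 198.51.100.0/24
</RequireAll>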

After giving it some thought & checking the raw access logs, I could see a pattern in the user agents being used by the malicious traffic. Below are a few examples of those user agents:-

Mozilla/5.0(Linux;Android 5.1.1;OPPO A33 Build/LMY47V;wv) AppleWebKit/537.36(KHTML,link Gecko) Version/4.0 Chrome/42.0.2311.138 Mobile Safari/537.36 Mb2345Browser/9.0

Mozilla/5.0 (Linux; Android 7.0; FRD-AL00 Build/HUAWEIFRD-AL00; wv) AppleWebKit/537.36 (KHTML, like Gecko) Version/4.0 Chrome/53.0.2785.49 Mobile MQQBrowser/6.2 TBS/043602 Safari/537.36 MicroMessenger/6.5.16.1120 NetType/WIFI Language/zh_CN

Mozilla/5.0(Linux;Android 5.1.1;OPPO A33 Build/LMY47V;wv) AppleWebKit/537.36(KHTML,link Gecko) Version/4.0 Chrome/43.0.2357.121 Mobile Safari/537.36 LieBaoFast/4.51.3

Mozilla/5.0(Linux;U;Android 5.1.1;zh-CN;OPPO A33 Build/LMY47V) AppleWebKit/537.36(KHTML,like Gecko) Version/4.0 Chrome/40.0.2214.89 UCBrowser/11.7.0.953 Mobile Safari/537.36

I broke down the user agents above & added a new rule to my root .htaccess file as follows:-

Options +FollowSymLinks
RewriteEngine On
RewriteBase /
# Return 403 Forbidden for any user agent containing one of these strings (case-insensitive)
RewriteCond %{HTTP_USER_AGENT} Mb2345Browser|LieBaoFast|zh-CN|MicroMessenger|zh_CN|Kinza [NC]
RewriteRule ^ - [F,L]

This rule uses a regular expression to block a user agent containing any of the following strings:-
Mb2345Browser
LieBaoFast
zh-CN
MicroMessenger
zh_CN
Kinza

The first two seem to be commonly used by Chinese crawlers, but as mentioned earlier, we do not ship products to China, so I’m not worried about blocking those browsers. The zh-CN & zh_CN strings refer to Chinese-specific localisation settings, such as the OS & interface language. MicroMessenger is related to WeChat, but again, I’ve never had a customer browse/buy from within the WeChat app, so that can be safely blocked. Finally, Kinza: I believe the Kinza browser itself is an obscure Japanese browser, but on our site its name is most commonly spoofed in the user agent strings of Russian email spam.

This seems to be quite a simple way to block the traffic. Many spammy clients will have something in the user agent string which isn’t common to the popular browsers such as Chrome, Safari & Firefox on common devices. You will have to tailor this to your own website’s needs, but I’ve no doubt I’ll be adding other regex arguments from obscure user agents in the future to keep malicious users off the site, as sketched below.
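Extending the rule later is just a matter of growing the alternation. Here is a minimal sketch, with BadBotOne & BadBotTwo as placeholder names for whatever turns up in your own logs:

RewriteEngine On
# Append new offending substrings to the end of the alternation
RewriteCond %{HTTP_USER_AGENT} Mb2345Browser|LieBaoFast|zh-CN|MicroMessenger|zh_CN|Kinza|BadBotOne|BadBotTwo [NC]
RewriteRule ^ - [F,L]

You can check the rule is firing with curl -A "Mb2345Browser" https://example.com/ (substitute your own domain), which should come back 403 Forbidden.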

I hope this helps & if you have anything to add, please get in touch or leave a comment.

John Large

My name is John Large. I am a Web Developer, E-commerce site owner & all round geek. My areas of interest include hardware hacking, digital privacy & security, social media & computer hardware. I’m also a minimalist in the making, interested in the Tiny House movement & the experience economy along with a strong interest in sustainability & renewable energy. When I’m not tapping on a keyboard or swiping a smart phone I can be found sampling great coffee, travelling the world with my wife Vicki (who writes over at Let’s Talk Beauty) & generally trying to live my life as unconventionally as possible.

21 thoughts to “Blocking aggressive Chinese crawlers/scrapers/bots”

  1. Hi,

    I’m facing the same problem right now with my website. But as I’m using a Varnish cache, I have to block these Chinese bots in the VCL file. Do you also have the code to block user agents like “zh_CN” in the VCL file?

    This is a question from an absolute beginner.

    Kind Regards

  2. Hi Manuel.

    If it’s an Apache server, a .htaccess rule will still work regardless of the caching engine: any request with those user agents that reaches Apache is rejected before your application runs, saving you even more resources.

    You may find you already have a ‘.htaccess’ file, in which case you can just add the rules from my post to it. If you use FTP software such as FileZilla to manage files, ensure that ‘show hidden files & folders’ is selected; any file beginning with a ‘.’ is normally hidden unless you instruct your client not to hide it.

  3. Varnish: in sub vcl_recv:

     # Varnish 3 syntax; on Varnish 4+ the error line becomes: return (synth(403, "Agent Banned."));
     if (req.http.user-agent ~ "bingbot"
     || req.http.user-agent ~ "DotBot"
     || req.http.user-agent ~ "Exabot"
     || req.http.user-agent ~ "Gigabot"
     || req.http.user-agent ~ "ICCrawler"
     || req.http.user-agent ~ "Snappy"
     || req.http.user-agent ~ "Yandex"
     || req.http.user-agent ~ "yandexbot"
     || req.http.user-agent ~ "Yeti"
     || req.http.user-agent ~ "Mb2345Browser"
     || req.http.user-agent ~ "QQBrowser"
     || req.http.user-agent ~ "LieBaoFast"
     || req.http.user-agent ~ "MicroMessenger"
     || req.http.user-agent ~ "Kinza"
     || req.http.user-agent ~ "slurp"
     || req.http.user-agent ~ "TheWorld"
     || req.http.user-agent ~ "YoudaoBot") {
         error 403 "Agent Banned. You are banned from this site. Please contact via a different client configuration if you believe that this is a mistake.";
     }

     I also use an external file with attacker IPs to build an ACL.

  4. Nice tip for other Varnish users. Thanks for sharing. I might add a few of those to my own .htaccess rule. Can I ask why you are blocking bingbot? Does Bing not drive any traffic to your website? I get a fair bit from Bing.

  5. Thanks for the tip. I’ve been having the exact same problem on my site and your .htaccess suggestion appears to have worked.

  6. Glad it helped. It’s really easy to expand upon, so if you see any obvious user agents you don’t like, with an identifier string unique to that user agent, feel free to add them & create your own rules. I’ve blocked a few more crawlers which scan my website for data & marketing purposes but ignore robots.txt; they waste bandwidth and sell data about my website, so they can go elsewhere.

  7. Very much the same here. As of October 1st we have seen a massive rise in traffic from ranges in Chinese /8 networks that are far too large to IP-block individually. The user agents are typically “LieBaoFast”, “Mb2345Browser/9.0” and “MicroMessenger”. We blocked them with a rewrite rule, which will keep working as long as they don’t change the strings.

  8. > Can I ask why you are blocking bingbot?

    Yeah, and he also blocked Yandex for some reason.

    It’s a Russian search engine and as far as I know it doesn’t do anything abusive like those Chinese bots. Also, Yandex Browser is quite a popular web browser in Russia.

  9. Facing the same problem; I closed it by blocking the country “China” in Cloudflare’s firewall rules, but some annoying requests still come through.

  10. It looks like this crawler/scraper/bot uses the same four user agents over and over again, from a bunch of IPs all over China (and Hong Kong) that belong to mobile networks. They hit once with one IP, then again with another IP from a different part of China, using one of the four user agents. I’m not sure what the crawler/scraper/bot wants, but I have no presence in China (or Hong Kong), so I block entire netblocks after the fact.

  11. Thanks.

    I banned these bots with fail2ban via CSF on an nginx server.

    My nginx-badbots.conf file:

    [Definition]
    # Note: nginx bad bots

    failregex = ^<HOST> - .* "(GET|POST|HEAD).*HTTP.*" ".*(LieBaoFast|Mb2345Browser|UCBrowser|MicroMessenger|Kinza).*"$

    ignoreregex =

    And jail.local config:

    [nginx-badbots]
    enabled = true
    maxretry = 1
    # ban for 90 days
    bantime = 7776000
    port = http,https
    filter = nginx-badbots
    logpath = /home/nginx/domains/*/log/access.log
              /usr/local/nginx/logs/*access*.log
    action = csfdeny[name=nginx-badbots]

  12. Regards. Recently I have seen these same user agents, coming from IPs that match the pattern /159.138.\d+.\d+/. I had to restrict them with .htaccess because they were pushing my server to its limit. Show these intruders no mercy, because given the chance they will devour you.
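    A range like that can also be dropped as a single netblock rather than matched with a regex. A minimal sketch, assuming Apache 2.4 (the /159.138.\d+.\d+/ pattern is equivalent to the netblock 159.138.0.0/16):

    <RequireAll>
        Require all granted
        Require not ip 159.138.0.0/16
    </RequireAll>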

  13. We set this rule on our server; these bots are a pest and were relentless:

      ## Block persistent and resource-hungry bots (nginx, inside the server block)
      if ($http_user_agent ~* (Baiduspider|python-requests/2.13.0|Mb2345Browser|LieBaoFast|zh-CN|MicroMessenger|zh_CN|Kinza)) {
          return 403;
      }

  14. If you are using Cloudflare, a really quick and easy way to deal with this issue is to set up a firewall rule for ASN “AS136907”.

      AS136907 contains all the IPs for the organization “Huawei-HK-CLOUDS”. You can choose to block all traffic from this range, or prompt it with a JS challenge.
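      For reference, the matching firewall rule expression should look something like (ip.geoip.asnum eq 136907), assuming Cloudflare’s ip.geoip.asnum field, with the action set to Block or JS Challenge.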

  15. Hi John, first of all, thanks for the blog post. I tried your code but it didn’t work (I tested with a Chrome user-agent switcher while monitoring the log file). I found similar code, adjusted it, and now my version works; I hope it will help someone else. If you compare it with the original code you will notice minor differences, which I assume were the main reason the original didn’t work for me. Once again, thanks for the blog post.

    #China spammers
    RewriteCond %{HTTP_USER_AGENT} ^.*(Mb2345Browser|LieBaoFast|zh-CN|MicroMessenger|zh_CN|Kinza|Bytespider|Baiduspider|Sogou).*$ [NC]
    RewriteRule .* - [F,L]

  16. Hi John, same here: about 800,000 requests per day without any sense (scanning the tag cloud in our forum). The requests originate from literally hundreds of IPs in 159.138.128.0/19 (Huawei Cloud Hong Kong), and I decided to block the whole range, since we have not had a single purchase or download from any IP in that range over the years.

  17. Thank you so much. This saved my website from a massive attack that got me really worried for several days.
    Adding the rules to .htaccess seems to have stopped the requests.
