Blocking aggressive Chinese crawlers/scrapers/bots

Over the last few days I’ve had a massive increase in traffic from Chinese data centres & ISPs. The traffic has been relentless, & the CPU usage on my server kept spiking enough to cause a fault in my cPanel hosting. I’m on a great hosting package with UKHOST4U and the server is fast & stable, but it is shared with a few other websites. This meant I couldn’t just blanket-ban Chinese IP ranges: even though we don’t sell our products in China, it seemed a very heavy-handed approach, and blocking the entire range of Chinese IP addresses via .htaccess was adding a 2-3 second delay to page parsing (pages normally load in around 600ms).

First I tried blocking the individual IPs, but this seemed to make the bot more aggressive, & requests went up as high as 800 every 30 seconds. The range of IPs seemed endless, which suggests some sort of bot farm, or a whole range of compromised machines of the kind used for DDoS attacks.
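For anyone starting down the same road, this is roughly what that first per-IP attempt looked like. A minimal sketch using Apache 2.4’s Require directive; the addresses below are documentation placeholders, not the actual offenders:-

# .htaccess - deny individual offending IPs (Apache 2.4+)
# Substitute the addresses from your own raw access logs.
<RequireAll>
    Require all granted
    Require not ip 203.0.113.45
    Require not ip 198.51.100.0/24
</RequireAll>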

After giving it some thought & checking the raw access logs, I could see a pattern in the user agents being used by the malicious traffic. Below are a few examples of those user agents:-

Mozilla/5.0(Linux;Android 5.1.1;OPPO A33 Build/LMY47V;wv) AppleWebKit/537.36(KHTML,link Gecko) Version/4.0 Chrome/42.0.2311.138 Mobile Safari/537.36 Mb2345Browser/9.0

Mozilla/5.0 (Linux; Android 7.0; FRD-AL00 Build/HUAWEIFRD-AL00; wv) AppleWebKit/537.36 (KHTML, like Gecko) Version/4.0 Chrome/53.0.2785.49 Mobile MQQBrowser/6.2 TBS/043602 Safari/537.36 MicroMessenger/6.5.16.1120 NetType/WIFI Language/zh_CN

Mozilla/5.0(Linux;Android 5.1.1;OPPO A33 Build/LMY47V;wv) AppleWebKit/537.36(KHTML,link Gecko) Version/4.0 Chrome/43.0.2357.121 Mobile Safari/537.36 LieBaoFast/4.51.3

Mozilla/5.0(Linux;U;Android 5.1.1;zh-CN;OPPO A33 Build/LMY47V) AppleWebKit/537.36(KHTML,like Gecko) Version/4.0 Chrome/40.0.2214.89 UCBrowser/11.7.0.953 Mobile Safari/537.36

I broke down the user agents above & added a new rule to my root .htaccess file as follows:-

Options +FollowSymLinks
RewriteEngine On
RewriteBase /
RewriteCond %{HTTP_USER_AGENT} Mb2345Browser|LieBaoFast|zh-CN|MicroMessenger|zh_CN|Kinza [NC]
RewriteRule ^ - [F,L]

This rule uses a case-insensitive regular expression (the [NC] flag) to return a 403 Forbidden ([F]) and stop further rule processing ([L]) for any user agent containing one of the following strings:-
Mb2345Browser
LieBaoFast
zh-CN
MicroMessenger
zh_CN
Kinza

The first two seem to be commonly used by Chinese crawlers, but as mentioned earlier, we do not ship products to China, so I’m not worried about blocking those browsers. The zh-CN & zh_CN strings refer to Chinese-specific localisation settings such as the OS & interface language. MicroMessenger is related to WeChat, but again, I’ve never had a customer browse/buy from within the WeChat app, so that can safely be blocked. Finally, Kinza: I believe it is an obscure Japanese browser, but on our site the string is commonly misused in the user agents of Russian email spam.

This seems to be quite a simple way to block this traffic. Many spammy users will have something in the user agent string which isn’t common to the popular browsers such as Chrome, Safari & Firefox on common devices. You will have to tailor this to your own website’s needs, but I’ve no doubt I’ll be adding other regex arguments for obscure user agents in the future to keep malicious users off the site.
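If you want to confirm a rule like this is actually firing, one quick check is to send a request with one of the blocked strings as the user agent and make sure a 403 comes back. A sketch using curl; substitute your own domain for example.com:-

curl -I -A "Mb2345Browser/9.0" https://www.example.com/
# Expected response when the rule matches:
# HTTP/1.1 403 Forbidden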

I hope this helps & if you have anything to add, please get in touch or leave a comment.

John Large

My name is John Large. I am a Web Developer, E-commerce site owner & all round geek. My areas of interest include hardware hacking, digital privacy & security, social media & computer hardware. I’m also a minimalist in the making, interested in the Tiny House movement & the experience economy along with a strong interest in sustainability & renewable energy. When I’m not tapping on a keyboard or swiping a smart phone I can be found sampling great coffee, travelling the world with my wife Vicki (who writes over at Let’s Talk Beauty) & generally trying to live my life as unconventionally as possible.

14 thoughts to “Blocking aggressive Chinese crawlers/scrapers/bots”

  1. Hi,

    I’m facing the same problem right now with my website. But, as I’m using a Varnish cache, I have to block these Chinese bots in the VCL file. Do you also have the code lines to block user agents like “zh_CN” in the VCL file?

    This is a question from an absolute beginner.

    Kind Regards

  2. Hi Manuel.

    If it’s an Apache server, a .htaccess file would still work regardless of the caching engine. It would stop those user agents connecting to the server before the cache even fires, saving you even more resources.

    You may find you already have a ‘.htaccess’ file, in which case you can just add the rules from my post to it. If you use FTP software such as FileZilla to manage files, ensure that ‘show hidden files & folders’ is selected, as any file beginning with a ‘.’ is normally hidden unless you instruct your client to show it.

  3. Varnish: in sub vcl_recv

    if (req.http.user-agent ~ "bingbot"
        || req.http.user-agent ~ "DotBot"
        || req.http.user-agent ~ "Exabot"
        || req.http.user-agent ~ "Gigabot"
        || req.http.user-agent ~ "ICCrawler"
        || req.http.user-agent ~ "Snappy"
        || req.http.user-agent ~ "Yandex"
        || req.http.user-agent ~ "yandexbot"
        || req.http.user-agent ~ "Yeti"
        || req.http.user-agent ~ "Mb2345Browser"
        || req.http.user-agent ~ "QQBrowser"
        || req.http.user-agent ~ "LieBaoFast"
        || req.http.user-agent ~ "MicroMessenger"
        || req.http.user-agent ~ "Kinza"
        || req.http.user-agent ~ "slurp"
        || req.http.user-agent ~ "TheWorld"
        || req.http.user-agent ~ "YoudaoBot") {
        error 403 "Agent Banned. You are banned from this site. Please contact via a different client configuration if you believe that this is a mistake.";
    }

    I also use an external file with attackers’ IPs to build an ACL.
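    For reference, that external-list approach might look something like this in VCL. A sketch only: the file name and addresses are placeholders, and note that Varnish 4+ replaces error with return (synth(...)):

    # banned_ips.vcl - generated from the external list of attacker IPs
    acl banned_ips {
        "203.0.113.45";
        "198.51.100.0"/24;
    }

    # main VCL
    include "banned_ips.vcl";

    sub vcl_recv {
        if (client.ip ~ banned_ips) {
            error 403 "IP banned.";
        }
    }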

  4. Nice tip for other Varnish users, thanks for sharing. I might add a few of those to my own .htaccess rule. Can I ask why you are blocking bingbot? Does Bing not drive any traffic to your website? I get a fair bit from Bing.

  5. Thanks for the tip. I’ve been having the exact same problem on my site and your .htaccess suggestion appears to have worked.

  6. Glad it helped. It’s really easy to expand upon, so if you see any obvious user agents you don’t like, with a unique (to that user agent) identifier string, feel free to add them & create your own rules. I’ve blocked a few more crawlers which scan my website for data & marketing purposes but ignore robots.txt; they waste bandwidth and sell data about my website, so they can go elsewhere.

  7. Very much the same here. As of October 1, we have seen a massive rise in traffic from ranges in Chinese /8 networks that are far too large to IP-block individually. The user agents are typically “LieBaoFast”, “Mb2345Browser/9.0” and “MicroMessenger”. We blocked them with a rewrite rule, which will work as long as they don’t change the string.

  8. > Can I ask why you are blocking bingbot?

    Yeah, and he also blocked Yandex for some reason.

    It’s a Russian search engine, and as far as I know it doesn’t do anything abusive like those Chinese bots. Also, Yandex Browser is quite a popular web browser in Russia.

  9. Facing the same problem; I closed it by blocking the country “China” in Cloudflare’s firewall rules. But some annoying requests are still coming through.
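    For anyone wanting to try the same, the rule in Cloudflare’s firewall expression language is roughly the one-liner below with the action set to Block. Treat it as a sketch and check the field name against your dashboard’s rule builder, as Cloudflare has renamed fields over time:

    (ip.geoip.country eq "CN")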

  10. It looks like this crawler/scraper/bot uses the same 4 user agents over and over again. It appears they use a bunch of IPs all over China (and Hong Kong) that belong to mobile networks. They hit once with one IP, then with another IP from another part of China, using one of the 4 user agents. Not sure what the crawler/scraper/bot wants, but I have no presence in China (or Hong Kong), so I block entire netblocks after the fact.

  11. Thanks.

    I banned these bots with fail2ban via CSF on an nginx server.

    My nginx-badbots.conf file:

    [Definition]
    # Note: nginx bad bots

    failregex = ^<HOST> - .* "(GET|POST|HEAD).*HTTP.*" ".*(LieBaoFast|Mb2345Browser|UCBrowser|MicroMessenger|Kinza).*"$

    ignoreregex =

    And the jail.local config:

    [nginx-badbots]
    enabled = true
    maxretry = 1
    # 90 days
    bantime = 7776000
    port = http,https
    filter = nginx-badbots
    logpath = /home/nginx/domains/*/log/access.log
              /usr/local/nginx/logs/*access*.log
    action = csfdeny[name=nginx-badbots]
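    Before enabling the jail, you can check the filter actually matches your log lines with fail2ban’s built-in tester. The paths here are illustrative; use your own log and filter locations:

    fail2ban-regex /usr/local/nginx/logs/access.log /etc/fail2ban/filter.d/nginx-badbots.conf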

  12. Regards. Recently I have seen these same user agents, coming from IPs that match the pattern /159.138.\d+.\d+/. I had to restrict them with .htaccess because they were pushing my server to its limit. Do not have mercy on these intruders, because at any moment they will devour you.
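    That pattern corresponds to the whole 159.138.0.0/16 netblock, so rather than matching each IP it can be refused in one directive. A sketch using the older Apache 2.2-style access syntax, which many shared hosts still honour via mod_access_compat:

    # Refuse the entire 159.138.0.0/16 range
    Order Allow,Deny
    Allow from all
    Deny from 159.138.0.0/16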
