Blocking aggressive Chinese crawlers/scrapers/bots

Over the last few days I’ve had a massive increase in traffic from Chinese data centres & ISPs. The traffic has been relentless & the CPU usage on my server kept spiking enough to cause a fault in my cPanel hosting. I’m on a great hosting package with UKHOST4U and the server is fast & stable, but it is shared with a few other websites, so I couldn’t just blanket-ban Chinese IP ranges at the server level. Even though we don’t sell our products in China, that seemed a very heavy-handed approach, and blocking the entire range of Chinese IP addresses via .htaccess added a 2-3 second delay to page parsing (pages normally load in around 600ms).

First I tried blocking the individual IPs, but this seemed to make the bot more aggressive & requests went up as high as 800 every 30 seconds. The range of IPs seemed endless, which suggests some sort of bot farm, or a whole range of compromised machines of the kind used for DDoS.
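For illustration, both dead ends looked something like this in .htaccess. This is only a sketch, assuming Apache 2.4; the addresses & ranges below are documentation placeholders rather than the real offenders:

# Deny-listing individual IPs, then whole allocations; with hundreds
# of entries evaluated on every request, page parsing crawled
<RequireAll>
    Require all granted
    Require not ip 203.0.113.45
    Require not ip 192.0.2.10
    Require not ip 198.51.100.0/24
</RequireAll>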

After giving it some thought & checking the raw access logs, I could see a pattern in the user agents being used by the malicious traffic. Below are a few examples of those user agents:-

Mozilla/5.0(Linux;Android 5.1.1;OPPO A33 Build/LMY47V;wv) AppleWebKit/537.36(KHTML,link Gecko) Version/4.0 Chrome/42.0.2311.138 Mobile Safari/537.36 Mb2345Browser/9.0

Mozilla/5.0 (Linux; Android 7.0; FRD-AL00 Build/HUAWEIFRD-AL00; wv) AppleWebKit/537.36 (KHTML, like Gecko) Version/4.0 Chrome/53.0.2785.49 Mobile MQQBrowser/6.2 TBS/043602 Safari/537.36 MicroMessenger/6.5.16.1120 NetType/WIFI Language/zh_CN

Mozilla/5.0(Linux;Android 5.1.1;OPPO A33 Build/LMY47V;wv) AppleWebKit/537.36(KHTML,link Gecko) Version/4.0 Chrome/43.0.2357.121 Mobile Safari/537.36 LieBaoFast/4.51.3

Mozilla/5.0(Linux;U;Android 5.1.1;zh-CN;OPPO A33 Build/LMY47V) AppleWebKit/537.36(KHTML,like Gecko) Version/4.0 Chrome/40.0.2214.89 UCBrowser/11.7.0.953 Mobile Safari/537.36

I broke down the user agents above & added a new rule to my root .htaccess file as follows:-

Options +FollowSymLinks
RewriteEngine On
RewriteBase /
# Return 403 Forbidden for any user agent containing one of these strings (case-insensitive)
RewriteCond %{HTTP_USER_AGENT} Mb2345Browser|LieBaoFast|zh-CN|MicroMessenger|zh_CN|Kinza [NC]
RewriteRule ^ - [F,L]

This rule uses a regular expression to block a user agent containing any of the following strings:-
Mb2345Browser
LieBaoFast
zh-CN
MicroMessenger
zh_CN
Kinza

The first two seem to be commonly used by Chinese crawlers, but as mentioned earlier, we do not ship products to China, so I’m not worried about blocking those browsers. The zh-CN & zh_CN strings refer to Chinese-specific localisation settings, such as the OS & interface language. MicroMessenger is related to WeChat, but again, I’ve never had a customer browse/buy from within the WeChat app, so that can be safely blocked. Finally, Kinza: I believe the Kinza browser itself is an obscure Japanese browser, but on our site its name is most commonly spoofed in the user agent strings of Russian email spam.

This seems to be quite a simple way to block the traffic. Many spammy clients will have something in the user agent string which isn’t common to the popular browsers such as Chrome, Safari & Firefox on common devices. You will have to tailor this to your own website’s needs, but I’ve no doubt I’ll be adding other regex arguments from obscure user agents in the future to keep malicious users off the site, as sketched below.
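Extending the rule later is just a matter of growing the alternation. Here is a minimal sketch, with BadBotOne & BadBotTwo as placeholder names for whatever turns up in your own logs:

RewriteEngine On
# Append new offending substrings to the end of the alternation
RewriteCond %{HTTP_USER_AGENT} Mb2345Browser|LieBaoFast|zh-CN|MicroMessenger|zh_CN|Kinza|BadBotOne|BadBotTwo [NC]
RewriteRule ^ - [F,L]

You can check the rule is firing with curl -A "Mb2345Browser" https://example.com/ (substitute your own domain), which should come back 403 Forbidden.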

I hope this helps & if you have anything to add, please get in touch or leave a comment.

John Large

My name is John Large. I am a Web Developer, E-commerce site owner & all round geek. My areas of interest include hardware hacking, digital privacy & security, social media & computer hardware. I’m also a minimalist in the making, interested in the Tiny House movement & the experience economy along with a strong interest in sustainability & renewable energy. When I’m not tapping on a keyboard or swiping a smart phone I can be found sampling great coffee, travelling the world with my wife Vicki (who writes over at Let’s Talk Beauty) & generally trying to live my life as unconventionally as possible.

21 thoughts to “Blocking aggressive Chinese crawlers/scrapers/bots”

  1. Hi,

    I’m facing the same problem right now with my website. But as I’m using a Varnish cache, I have to block these Chinese bots in the VCL file. Do you also have the code to block user agents like “zh_CN” in the VCL file?

    This is a question from an absolute beginner.

    Kind Regards

  2. Hi Manuel.

    If it’s an Apache server, a .htaccess rule will still work regardless of the caching engine: any request with those user agents that reaches Apache is rejected before your application runs, saving you even more resources.

    You may find you already have a ‘.htaccess’ file, in which case you can just add the rules from my post to it. If you use FTP software such as FileZilla to manage files, ensure that ‘show hidden files & folders’ is selected; any file beginning with a ‘.’ is normally hidden unless you instruct your client not to hide it.

  3. Varnish: in sub vcl_recv:

     # Varnish 3 syntax; on Varnish 4+ the error line becomes: return (synth(403, "Agent Banned."));
     if (req.http.user-agent ~ "bingbot"
     || req.http.user-agent ~ "DotBot"
     || req.http.user-agent ~ "Exabot"
     || req.http.user-agent ~ "Gigabot"
     || req.http.user-agent ~ "ICCrawler"
     || req.http.user-agent ~ "Snappy"
     || req.http.user-agent ~ "Yandex"
     || req.http.user-agent ~ "yandexbot"
     || req.http.user-agent ~ "Yeti"
     || req.http.user-agent ~ "Mb2345Browser"
     || req.http.user-agent ~ "QQBrowser"
     || req.http.user-agent ~ "LieBaoFast"
     || req.http.user-agent ~ "MicroMessenger"
     || req.http.user-agent ~ "Kinza"
     || req.http.user-agent ~ "slurp"
     || req.http.user-agent ~ "TheWorld"
     || req.http.user-agent ~ "YoudaoBot") {
         error 403 "Agent Banned. You are banned from this site. Please contact via a different client configuration if you believe that this is a mistake.";
     }

     I also use an external file with attacker IPs to build an ACL.

  4. Nice tip for other Varnish users. Thanks for sharing. I might add a few of those to my own .htaccess rule. Can I ask why you are blocking bingbot? Does Bing not drive any traffic to your website? I get a fair bit from Bing.

  5. Thanks for the tip. I’ve been having the exact same problem on my site and your .htaccess suggestion appears to have worked.

  6. Glad it helped. It’s really easy to expand upon, so if you see any obvious user agents you don’t like, with an identifier string unique to that user agent, feel free to add them & create your own rules. I’ve blocked a few more crawlers which scan my website for data & marketing purposes but ignore robots.txt; they waste bandwidth and sell data about my website, so they can go elsewhere.

  7. Very much the same here. As of October 1st we have seen a massive rise in traffic from ranges in Chinese /8 networks that are far too large to IP-block individually. The user agents are typically “LieBaoFast”, “Mb2345Browser/9.0” and “MicroMessenger”. We blocked them with a rewrite rule, which will keep working as long as they don’t change the strings.

  8. > Can I ask why you are blocking bingbot?

    Yeah, and he also blocked Yandex for some reason.

    It’s a Russian search engine and as far as I know it doesn’t do anything abusive like those Chinese bots. Also, Yandex Browser is quite a popular web browser in Russia.

  9. Facing the same problem; I closed it by blocking the country “China” in Cloudflare’s firewall rules, but some annoying requests still come through.

  10. It looks like this crawler/scraper/bot uses the same four user agents over and over again, from a bunch of IPs all over China (and Hong Kong) that belong to mobile networks. They hit once with one IP, then again with another IP from a different part of China, using one of the four user agents. I’m not sure what the crawler/scraper/bot wants, but I have no presence in China (or Hong Kong), so I block entire netblocks after the fact.

  11. Thanks.

    I banned these bots with fail2ban via CSF on an nginx server.

    My nginx-badbots.conf file:

    [Definition]
    # Note: nginx bad bots

    failregex = ^<HOST> - .* "(GET|POST|HEAD).*HTTP.*" ".*(LieBaoFast|Mb2345Browser|UCBrowser|MicroMessenger|Kinza).*"$

    ignoreregex =

    And jail.local config:

    [nginx-badbots]
    enabled = true
    maxretry = 1
    # ban for 90 days
    bantime = 7776000
    port = http,https
    filter = nginx-badbots
    logpath = /home/nginx/domains/*/log/access.log
              /usr/local/nginx/logs/*access*.log
    action = csfdeny[name=nginx-badbots]

  12. Regards. Recently I have seen these same user agents, coming from IPs that match the pattern /159.138.\d+.\d+/. I had to restrict them with .htaccess because they were pushing my server to its limit. Show these intruders no mercy, because given the chance they will devour you.
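    A range like that can also be dropped as a single netblock rather than matched with a regex. A minimal sketch, assuming Apache 2.4 (the /159.138.\d+.\d+/ pattern is equivalent to the netblock 159.138.0.0/16):

    <RequireAll>
        Require all granted
        Require not ip 159.138.0.0/16
    </RequireAll>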

  13. We set this rule on our server; these bots are a pest and were relentless:

      ## Block persistent and resource-hungry bots (nginx, inside the server block)
      if ($http_user_agent ~* (Baiduspider|python-requests/2.13.0|Mb2345Browser|LieBaoFast|zh-CN|MicroMessenger|zh_CN|Kinza)) {
          return 403;
      }

  14. If you are using Cloudflare, a really quick and easy way to deal with this issue is to set up a firewall rule for ASN “AS136907”.

      AS136907 contains all the IPs for the organization “Huawei-HK-CLOUDS”. You can choose to block all traffic from this range, or prompt it with a JS challenge.
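      For reference, the matching firewall rule expression should look something like (ip.geoip.asnum eq 136907), assuming Cloudflare’s ip.geoip.asnum field, with the action set to Block or JS Challenge.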

  15. Hi John, first of all, thanks for the blog post. I tried your code but it didn’t work (I tested with a Chrome user-agent switcher while monitoring the log file). I found similar code, adjusted it, and now my version works; I hope it will help someone else. If you compare it with the original code you will notice minor differences, which I assume were the main reason the original didn’t work for me. Once again, thanks for the blog post.

    #China spammers
    RewriteCond %{HTTP_USER_AGENT} ^.*(Mb2345Browser|LieBaoFast|zh-CN|MicroMessenger|zh_CN|Kinza|Bytespider|Baiduspider|Sogou).*$ [NC]
    RewriteRule .* - [F,L]

  16. Hi John, same here: about 800,000 requests per day without any sense (scanning the tag cloud in our forum). The requests originate from literally hundreds of IPs in 159.138.128.0/19 (Huawei Cloud Hong Kong), and I decided to block the whole range, since we have not had a single purchase or download from any IP in that range over the years.

  17. Thank you so much. This saved my website from a massive attack that got me really worried for several days.
    Adding the rules to .htaccess seems to have stopped the requests.
