Malicious Users in Analog web stats

Blocking aggressive Chinese crawlers/scrapers/bots

Over the last few days I've had a massive increase in traffic from Chinese data centres & ISPs. The traffic has been relentless, and the CPU usage on my server kept spiking enough to cause a fault in my cPanel hosting. I'm on a great hosting package with UKHOST4U and the server is fast & stable, but it is shared with a few other websites, so I couldn't simply blanket-ban Chinese IP ranges at server level. Even though we don't sell our products in China, that seemed a very heavy-handed approach anyway, and blocking the entire range of Chinese IP addresses via .htaccess was adding a 2-3 second delay to page parsing (pages normally load in around 600ms).

First I tried blocking the individual IPs, but this seemed to make the bot more aggressive, and requests went up as high as 800 every 30 seconds. The range of IPs seemed endless, which suggests some sort of bot farm, or a whole range of compromised machines of the kind used for DDoS.

 

After giving it some thought & checking the raw access logs, I could see a pattern in the user agents being used by the malicious traffic. Below are a few examples of those user agents:-

Mozilla/5.0(Linux;Android 5.1.1;OPPO A33 Build/LMY47V;wv) AppleWebKit/537.36(KHTML,link Gecko) Version/4.0 Chrome/42.0.2311.138 Mobile Safari/537.36 Mb2345Browser/9.0

Mozilla/5.0 (Linux; Android 7.0; FRD-AL00 Build/HUAWEIFRD-AL00; wv) AppleWebKit/537.36 (KHTML, like Gecko) Version/4.0 Chrome/53.0.2785.49 Mobile MQQBrowser/6.2 TBS/043602 Safari/537.36 MicroMessenger/6.5.16.1120 NetType/WIFI Language/zh_CN

Mozilla/5.0(Linux;Android 5.1.1;OPPO A33 Build/LMY47V;wv) AppleWebKit/537.36(KHTML,link Gecko) Version/4.0 Chrome/43.0.2357.121 Mobile Safari/537.36 LieBaoFast/4.51.3

Mozilla/5.0(Linux;U;Android 5.1.1;zh-CN;OPPO A33 Build/LMY47V) AppleWebKit/537.36(KHTML,like Gecko) Version/4.0 Chrome/40.0.2214.89 UCBrowser/11.7.0.953 Mobile Safari/537.36

Mozilla/5.0 (Linux; Android 7.0;) AppleWebKit/537.36 (KHTML, like Gecko) Mobile Safari/537.36 (compatible; PetalBot;+https://aspiegel.com/petalbot)

I broke down the user agents above & added a new rule to my root .htaccess file as follows:-

Options +FollowSymLinks
RewriteEngine On
RewriteBase /
RewriteCond %{HTTP_USER_AGENT} Mb2345Browser|LieBaoFast|zh-CN|MicroMessenger|zh_CN|Kinza|Datanyze|serpstatbot|spaziodati|OPPO\sA33|AspiegelBot|aspiegel|PetalBot [NC]
RewriteRule ^ - [F,L]

This rule uses a regular expression to block any user agent containing one of the following strings:-
Mb2345Browser
LieBaoFast
zh-CN
MicroMessenger
zh_CN
Kinza
Datanyze
serpstatbot
spaziodati
OPPO A33
Aspiegel
PetalBot

The first two seem to be commonly used by Chinese crawlers, but as mentioned earlier, we do not ship products to China, so I'm not worried about blocking those browsers. The zh-CN and zh_CN strings refer to Chinese-specific localisation settings such as the OS & interface language. MicroMessenger relates to WeChat, but again, I've never had a customer browse or buy from within the WeChat app, so that can be safely blocked. Finally, Kinza is an obscure Japanese browser, but on our site the string is commonly misused in the user agent by Russian email spammers.

This is quite a simple way to block the traffic. Many spammy users will have something in the user agent string which isn't common to the popular browsers such as Chrome, Safari & Firefox on common devices. You will have to tailor this to your own website's needs, but I've no doubt I'll be adding further regex arguments for obscure user agents in the future to keep malicious users off the site.
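
If you want to check the rule is actually firing, a quick command-line test with a spoofed user agent should come back with a 403 (the domain below is just a placeholder for your own site):

# Spoof a blocked user agent string - expect a 403 Forbidden response
curl -I -A "LieBaoFast/4.51.3" https://www.example.com/

# Spoof an ordinary desktop browser - expect a normal 200 response
curl -I -A "Mozilla/5.0 (Windows NT 10.0; Win64; x64)" https://www.example.com/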

I hope this helps & if you have anything to add, please get in touch or leave a comment.

John Large

My name is John Large. I am a Web Developer, E-commerce site owner & all-round geek. My areas of interest include hardware hacking, digital privacy & security, social media & computer hardware. I'm also a minimalist in the making, interested in the Tiny House movement & the experience economy, along with a strong interest in sustainability & renewable energy. When I'm not tapping on a keyboard or swiping a smartphone I can be found sampling great coffee, travelling the world with my wife Vicki (who writes over at Let's Talk Beauty) & generally trying to live my life as unconventionally as possible.

59 thoughts to “Blocking aggressive Chinese crawlers/scrapers/bots”

  1. Hi,

    I'm facing the same problem right now with my website. But, as I'm using a Varnish cache, I have to block these Chinese bots in the VCL file. Do you also have the lines of code to block user agents like "zh_CN" in the VCL file?

    This is a question from an absolute beginner.

    Kind Regards

  2. Hi Manuel.

    If it's an Apache server, a .htaccess file will still work regardless of the caching engine. It would stop those user agents connecting to the server before the cache even fires, saving you even more resources.

    You may find you already have a '.htaccess' file. If you use FTP software such as FileZilla to manage files, make sure 'show hidden files & folders' is selected, as any file beginning with a '.' is normally hidden unless you instruct your client to show it. If the file already exists, you can simply add the rules from my post to it.

  3. Varnish: In sub vcl_recv
    if(
    req.http.user-agent ~ "bingbot"
    || req.http.user-agent ~ "DotBot"
    || req.http.user-agent ~ "Exabot"
    || req.http.user-agent ~ "Gigabot"
    || req.http.user-agent ~ "ICCrawler"
    || req.http.user-agent ~ "Snappy"
    || req.http.user-agent ~ "Yandex"
    || req.http.user-agent ~ "yandexbot"
    || req.http.user-agent ~ "Yeti"
    || req.http.user-agent ~ "Mb2345Browser"
    || req.http.user-agent ~ "QQBrowser"
    || req.http.user-agent ~ "LieBaoFast"
    || req.http.user-agent ~ "MicroMessenger"
    || req.http.user-agent ~ "Kinza"
    || req.http.user-agent ~ "slurp"
    || req.http.user-agent ~ "TheWorld"
    || req.http.user-agent ~ "YoudaoBot"){
    error 403 "Agent Banned. You are banned from this site. Please contact via a different client configuration if you believe that this is a mistake.";
    }

    I also use an external file with the IPs of the attackers to build an ACL.

  4. Nice tip for other Varnish users. Thanks for sharing. I might add a few of those to my own .htaccess rule. Can I ask why you are blocking bingbot? Does Bing not drive any traffic to your website? I get a fair bit from Bing.
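
    One thing worth noting for anyone copying that snippet: the error 403 statement is Varnish 3 syntax. On Varnish 4 and later the rough equivalent (a sketch, untested) is a synthetic response returned from vcl_recv:

    if (req.http.user-agent ~ "(Mb2345Browser|LieBaoFast|MicroMessenger|Kinza)") {
        return (synth(403, "Agent Banned."));
    }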

  5. Thanks for the tip. I’ve been having the exact same problem on my site and your .htaccess suggestion appears to have worked.

  6. Glad it helped. It's really easy to expand upon, so if you see any obvious user agents you don't like, with an identifier string unique to that user agent, feel free to add it & create your own rules. I've blocked a few more crawlers which scan my website for data & marketing purposes but ignore robots.txt; they are wasting bandwidth and selling data about my website, so they can go elsewhere.

  7. Very much the same here. As of October 1, we have a massive rise in traffic from ranges in Chinese /8 networks that are way too large to ip-block individually. User agents are typically “LieBaoFast”, “Mb2345Browser/9.0” and “MicroMessenger”. Blocked them by a rewrite rule, which will work as long as they are not changing the string.

  8. > Can I ask why you are blocking bingbot?

    Yeah, and he also blocked Yandex for some reason.

    It’s a Russian search engine and as far as I know it doesn’t do anything abusive like those Chinese bots. Also Yandex Browser is a quite popular web browser in Russia.

  9. Facing the same issue here. I closed it off by adding a country block for "China" in Cloudflare's firewall rules, but some annoying requests are still coming through.

  10. It looks like this crawler/scraper/bot uses the same 4 user agents over and over again. It appears that they use a bunch of IPs all over China (and Hong Kong) that are from the mobile networks. They hit once with an IP and then with another IP from another part of China with one of the 4 user agents. Not sure what the crawler/scraper/bot wants, but I have no presence in China (or Hong Kong) so I block entire netblocks after the fact.

  11. Thanks.

    I banned these bots with fail2ban via CSF on an nginx server.

    My nginx-badbots.conf file:

    [Definition]
    # Note: nginx bad bots

    failregex = ^<HOST> - .* "(GET|POST|HEAD).*HTTP.*" ".*(LieBaoFast|Mb2345Browser|UCBrowser|MicroMessenger|Kinza).*"$

    ignoreregex =

    And jail.local config:

    [nginx-badbots]
    enabled = true
    maxretry = 1
    # 90 days
    bantime = 7776000
    port = http,https
    filter = nginx-badbots
    logpath = /home/nginx/domains/*/log/access.log
    /usr/local/nginx/logs/*access*.log
    action = csfdeny[name=nginx-badbots]
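
    A quick way to confirm the filter actually matches the offending log lines before enabling the jail is fail2ban-regex (the log path here is just an example, adjust it to your own layout):

    fail2ban-regex /home/nginx/domains/example.com/log/access.log /etc/fail2ban/filter.d/nginx-badbots.conf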

  12. Regards. Recently I have seen these same user agents, coming from IPs matching the pattern /159.138.\d+.\d+/, and I had to restrict them with .htaccess because they were pushing me to the limit. Do not have mercy on these intruders, because at any moment they will devour you.

  13. We set this rule on our server; these bots are a pest and were relentless:

    ##Block persistent and resource hungry bots
    if ($http_user_agent ~* (Baiduspider|python-requests/2.13.0|Mb2345Browser|LieBaoFast|zh-CN|MicroMessenger|zh_CN|Kinz)) {
        return 403;
    }
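
    For anyone copying this: the if block needs to live inside a server { } (or location { }) context in your nginx configuration, and the config should be tested and reloaded afterwards, for example (assuming a typical systemd setup):

    sudo nginx -t                  # check the edited configuration parses cleanly
    sudo systemctl reload nginx    # apply the change without dropping connections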

  14. If you are using Cloudflare.

    A really quick and easy way to deal with this issue is to set up a firewall rule for ASN “AS136907”.

    AS136907 contains all the IPs for the organization "Huawei-HK-CLOUDS". You can choose to block all traffic from this ASN or challenge it with a JS challenge.
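
    For reference, the matching firewall rule expression looks something like the line below (field name as per Cloudflare's firewall rules language; double-check it against your own dashboard), with the action set to Block or JS Challenge:

    (ip.geoip.asnum eq 136907)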

  15. Hi John, first of all – thanks for the blog post. I tried your code, it didn't work (I tested with a Chrome user agent switcher extension) and I was monitoring the log file. I found similar code, adjusted it, and now my version works – I hope it will help someone else. If you compare it with the original code you will notice minor differences, but I assume that was the main reason why the code didn't work initially. Once again – thanks for the blog post.

    #China spammers
    RewriteCond %{HTTP_USER_AGENT} ^.*(Mb2345Browser|LieBaoFast|zh-CN|MicroMessenger|zh_CN|Kinza|Bytespider|Baiduspider|Sogou).*$ [NC]
    RewriteRule .* - [F,L]

  16. Hi John, same here …
    about 800,000 requests per day without any rhyme or reason (scanning the tag cloud in our forum).
    The requests originate from literally hundreds of IPs in 159.138.128.0/19 (Huawei Cloud Hong Kong), and I decided to block the whole range, since we have not had a single purchase or download from any IP in this range in years.

  17. Thank you so much. This saved my website from a massive attack that got me really worried for several days.
    Adding the rules to .htaccess seems to have stopped the requests.

  18. Hi John, I've been under attack for 3 months now on one specific domain on my server. It started with about 500k to 1 million requests per day, all coming from China, so I blocked China and all was good until a week ago, when I realized the server was slow. I checked and saw that the traffic is now coming from all over the world using these 2 IP ranges: 159.138.*.* and 114.119.*.*

    Here are 3 entries from the log file.
    =============================
    1.)
    114.119.162.117 - - [27/Jan/2020:20:14:12 -0800] "GET /calendar-2/action~oneday/time_limit~1565593200/tag_ids~1842,1228,1495,1835,1321,1839/request_format~html HTTP/1.0" 403 670 "-" "Mozilla/5.0(Linux;Android 5.1.1;OPPO A33 Build/LMY47V;wv) AppleWebKit/537.36(KHTML,link Gecko) Version/4.0 Chrome/43.0.2357.121 Mobile Safari/537.36 LieBaoFast/4.51.3"
    2.)
    114.119.157.127 - - [27/Jan/2020:20:18:23 -0800] "GET /calendar-2/action~oneday/time_limit~1565593200/tag_ids~1217,1500,2108,2009,1516/request_format~html HTTP/1.0" 403 665 "-" "Mozilla/5.0(Linux;Android 5.1.1;OPPO A33 Build/LMY47V;wv) AppleWebKit/537.36(KHTML,link Gecko) Version/4.0 Chrome/42.0.2311.138 Mobile Safari/537.36 Mb2345Browser/9.0"

    3.)
    114.119.143.128 - - [27/Jan/2020:20:23:58 -0800] "GET /calendar-2/action~oneday/time_limit~1565593200/tag_ids~1407,1784,1431,1413/request_format~html HTTP/1.0" 403 660 "-" "Mozilla/5.0 (Linux; Android 7.0; FRD-AL00 Build/HUAWEIFRD-AL00; wv) AppleWebKit/537.36 (KHTML, like Gecko) Version/4.0 Chrome/53.0.2785.49 Mobile MQQBrowser/6.2 TBS/043602 Safari/537.36 MicroMessenger/6.5.16.1120 NetType/WIFI Language/zh_CN"

    ======================================================
    I have tried so many things. I have now blocked the whole world besides the US, but the traffic is also coming from the US with the same IP ranges. My server is flooded with about 1 million requests every 1-2 hours. It's insane. I keep deleting the logs because they become so big. Do you have any suggestions for me? I have tried every possible thing, even installed a paid version of DDoS protection, but so far nothing works.

  19. Have you tried my solution with .htaccess? Looking at those user agents you supplied, they would be matched by my rule.

    If you are having such a massive issue, have you thought about using Cloudflare? The free accounts are good, but the first tier of paid account is excellent. This isn’t a paid endorsement, but the free account has got me out of a few sticky spots in the past when I’ve been targeted. You can use them for a few months & see if it helps.

    They should filter out most of this bad traffic at a DNS level before it ever hits your server.

  20. Hi John, so at first I tried your code and it didn't work, then I changed to Edgars' code, added one more bot to it, and it's all gone. 🙂
    I mean the traffic is still coming into the log, but the server is not getting the hits.
    Thanks so much for this blog, you rock.

    This is what I use in my .htaccess file:
    ==
    Options +FollowSymLinks
    RewriteEngine On
    RewriteBase /
    RewriteCond %{HTTP_USER_AGENT} ^.*(Mb2345Browser|MQQBrowser|LieBaoFast|zh-CN|MicroMessenger|zh_CN|Kinza|Bytespider|Baiduspider|Sogou).*$ [NC]
    RewriteRule .* - [F,L]

    I also added these IP ranges, using just the leading octets:
    Deny from 159.138 114.119

  21. Hi Ron.

    Glad you got it to work. Strange how on some servers my code works & on others you need the mods in Edgars' code, such as the ^.* after %{HTTP_USER_AGENT}. It probably has something to do with the rest of my .htaccess file, which is 174 lines long (I have a lot of rules). It's flexible, as whenever you see a spammy user agent string, you can add it to your regex to block it.

    With regard to the log file, you could decrease the maximum size of the file so it prunes more often and reduces the server load from writes.

  22. Hi John. So now, a few days later, the attack has stopped completely. I guess when they realise they are blocked, they get some kind of error and stop targeting that server. The attack isn't even reaching the log.
    These Chinese bots will keep attacking harder every day until they are blocked, and then they leave.
    At least now I know how to deal with them 🙂

    Thanks a lot again for the help. Glad I found this blog.
    Cheers.

  23. They might have changed the user agents they use. I’m currently seeing the following.

    ● Mozilla/5.0(Linux;U;Android 5.1.1;zh-CN;OPPO A33 Build/LMY47V) AppleWebKit/537.36(KHTML,like Gecko) Version/4.0 Chrome/40.0.2214.89 Mobile Safari/537.36
    ● Mozilla/5.0(Linux;Android 5.1.1;OPPO A33 Build/LMY47V;wv) AppleWebKit/537.36(KHTML,link Gecko) Version/4.0 Chrome/42.0.2311.138 Mobile Safari/537.36
    ● Mozilla/5.0(Linux;Android 5.1.1;OPPO A33 Build/LMY47V;wv) AppleWebKit/537.36(KHTML,link Gecko) Version/4.0 Chrome/43.0.2357.121 Mobile Safari/537.36
    ● Mozilla/5.0 (iPad; CPU OS 7_0 like Mac OS X) AppleWebKit/537.51.1 (KHTML, like Gecko) Version/7.0 Mobile/11A465 Safari/9537.53 (I’m not certain about this one, but the general patterns are the same and sometimes they come from the same IP addresses as others)

    I doubt many people would still be using these old versions of Chrome, or the device (OPPO A33) would be popular, so blocking on such basis might be an option, but it looks like they are changing their strategy now.

  24. Feel free to add to the rules. I've not seen those user agent strings appear in my logs yet, but I'll keep an eye out for them. The A33 is quite an obscure mobile handset, so blocking A33 or OPPO in your string might be a good solution.

  25. Just a quick update. I had the same issue with the OPPO A33 and also with Huawei Cloud users.

    I changed my rule to include OPPO A33, using \s to match the space in the string.

    RewriteCond %{HTTP_USER_AGENT} Mb2345Browser|LieBaoFast|zh-CN|MicroMessenger|zh_CN|Kinza|Datanyze|serpstatbot|OPPO\sA33|spaziodati [NC]
    RewriteRule ^ - [F,L]

    I also added the Huawei Cloud IP ranges to my deny list as follows (these are in CIDR format and will block the entire cloud data centre):


    deny from 114.119.128.0/19
    deny from 114.119.160.0/21
    deny from 114.119.176.0/20
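
    If your host runs Apache 2.4 without the mod_access_compat module, the old deny from syntax isn't available; the mod_authz_core equivalent would be roughly the following (a sketch, same ranges as above):

    <RequireAll>
        Require all granted
        Require not ip 114.119.128.0/19
        Require not ip 114.119.160.0/21
        Require not ip 114.119.176.0/20
    </RequireAll>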

  26. Hi John,
    My .htaccess was working for a few weeks. Then my site slowed again, and I found just what wasaweb was saying.

    I just added your |OPPO\sA33| to my user agent string. It doesn’t seem to help. Is there something else I can do?

    What about that part: LMY47V ? Is it possible to just block on this string?

    Thanks!

  27. Hi.

    I added the rule mentioned above and it’s working fine.

    Check your .htaccess file to make sure there are no mistakes or missing | delimiters. If it was working and now isn’t, I’d start diagnosing the regex to make sure there aren’t any errors. If in doubt, start from the beginning fresh.

  28. What I meant was, the code I was using was working and after a few weeks noticed the site slowing down again. Then in the logs I noticed that OPPO A33 which I didn’t remember from before. So I added your code like |OPPO\sA33| in the middle of the others in my string.

    Anyway, 24 hours later, it seems like the site is loading faster now. Should the .htaccess changes work immediately, or could it take time like this?

  29. Hi.

    Sorry, I didn't mean to miss off the non-case-sensitive [NC] flag. I guess I mistyped it into my comment field. I'll go back and update that comment. Thanks for being so vigilant; I completely forgot to type it & didn't even notice it was missing.

    My actual rule still contains the [NC] tag

  30. I came across yet another mutation of the same bot.

    It uses the following spoofed useragent:-

    User Agent: Mozilla/5.0 (iPad; CPU OS 7_0 like Mac OS X) AppleWebKit/537.51.1 (KHTML, like Gecko) Version/7.0 Mobile/11A465 Safari/9537.53

    I added the following two new lines to catch this specific user agent. It's not really possible to block using shorter identifiers within the string, as Bingbot still uses the iOS 7_0 user agent. Blocking on the complete user agent string is working well.

    Add these below my previous rules & test. I’ve tested extensively with other iOS and iPad versions & browser combinations with no false positives. Bingbot is also getting through just fine.

    RewriteCond %{HTTP_USER_AGENT} Mozilla/5\.0\ \(iPad;\ CPU\ OS\ 7_0\ like\ Mac\ OS\ X\)\ AppleWebKit/537\.51\.1\ \(KHTML,\ like\ Gecko\)\ Version/7\.0\ Mobile/11A465\ Safari/9537\.53
    RewriteRule ^ - [F,L]

  31. I'm Brazilian, living in Araras, SP, in the interior of São Paulo. I had the same problems, as I was using shared hosting. I tried the blocks via Apache .htaccess and they worked, but I discovered that the attacks were coming from inside the hosting PROVIDER itself and from the companies and people on the same server as me.

    Solution:

    I migrated my pages to a cloud, and now the blocking is at layer 3 and layer 4; I no longer let these pests get anywhere near layer 7 HTTP/HTTPS ("Apache2").

    Run for the hills!
    The same thing is happening with the VPSes ("Virtual Private Servers") of various PROVIDERS spread across the ENTIRE world. Hackers are exploiting vulnerabilities in the hardware shared by the VPSes and are using the compromised servers to carry out SSH attacks, privilege escalation, etc.

  32. Hey John,

    I noticed you added 'Aspiegel' sometime in the last few days; I came here to let you know about it so you could update the list. Thanks for posting this in the first place, and for diligently updating it as new ones are discovered.

  33. I'm still getting bombarded with the following:

    114.119.166.252 - - [15/Mar/2020:14:35:52 +0000] "GET /page/20/14/4/11/11/4/11/6/15/1/7/3/blog-3.htm HTTP/1.1" 500 - "-" "Mozilla/5.0 (Linux; Android 7.0;) AppleWebKit/537.36 (KHTML, like Gecko) Mobile Safari/537.36 (compatible; AspiegelBot)"

    Any ideas guys?

  34. Have you added AspiegelBot to your regex rule set? I added it to mine a few weeks back. The user agent keeps morphing, so you need to look for the part of the string which is unique to these attackers; AspiegelBot is a good place to start.

  35. Hi Jonathan.

    Thanks for that, doing my best to fight this thing with all of you. I’ll update as & when I find anything suspicious or any new useragents.

  36. Hello all,

    I'm not very familiar with adding things to .htaccess. Does it matter where I paste your block?
    Below is my .htaccess. Where should I put John's code, or what do I have to change?

    RewriteEngine On
    RewriteBase /
    RewriteRule ^index\.php$ - [L]

    # add a trailing slash to /wp-admin
    RewriteRule ^wp-admin$ wp-admin/ [R=301,L]

    RewriteCond %{REQUEST_FILENAME} -f [OR]
    RewriteCond %{REQUEST_FILENAME} -d
    RewriteRule ^ - [L]
    RewriteRule ^(wp-(content|admin|includes).*) $1 [L]
    RewriteRule ^(.*\.php)$ $1 [L]
    RewriteRule . index.php [L]

  37. I've actually started experimenting with a different way of blocking. Ignore all of the code from my post and try adding all of this code to the very end of your .htaccess file:

    BrowserMatchNoCase "libwww-perl" bad_bot
    BrowserMatchNoCase "wget" bad_bot
    BrowserMatchNoCase "LieBaoFast" bad_bot
    BrowserMatchNoCase "Mb2345Browser" bad_bot
    BrowserMatchNoCase "zh-CN" bad_bot
    BrowserMatchNoCase "MicroMessenger" bad_bot
    BrowserMatchNoCase "zh_CN" bad_bot
    BrowserMatchNoCase "Kinza" bad_bot
    BrowserMatchNoCase "Bytespider" bad_bot
    BrowserMatchNoCase "Baiduspider" bad_bot
    BrowserMatchNoCase "Sogou" bad_bot
    BrowserMatchNoCase "Datanyze" bad_bot
    BrowserMatchNoCase "AspiegelBot" bad_bot
    BrowserMatchNoCase "adscanner" bad_bot
    BrowserMatchNoCase "serpstatbot" bad_bot
    BrowserMatchNoCase "spaziodat" bad_bot
    BrowserMatchNoCase "undefined" bad_bot
    Order Deny,Allow
    Deny from env=bad_bot

    I’m looking for the most efficient way of blocking without sacrificing page load speeds & TTFB. This way (so far) seems a little more efficient. If you need to add a line for another user agent it should be self-explanatory – just pick a section of text from the user agent which appears to be unique to that particular bot and add that text to another line in the same format.

  38. @John Large

    Rewrites are slow. Your latest idea is much better. I don’t think the quotes do anything. All of that can be put on one line:

    BrowserMatchNoCase (libwww-perl|wget|LieBaoFast|Mb2345Browser|zh-CN) bad_bot

    or in a more general syntax:

    SetEnvIfNoCase User-Agent (libwww-perl|wget|LieBaoFast|Mb2345Browser|zh-CN) bad_bot

    All of this is better in the main httpd.conf file (if you have access to it) rather than a .htaccess file, or, better still, in some software before it ever gets to Apache – firewall, proxy, load balancer, etc. (if you’re lucky enough to have access).
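
    One additional note: Order Deny,Allow / Deny from env= is the old Apache 2.2-style syntax, provided on 2.4 by mod_access_compat. On a plain Apache 2.4 setup the same idea can be expressed with mod_authz_core, roughly like this:

    SetEnvIfNoCase User-Agent (libwww-perl|wget|LieBaoFast|Mb2345Browser|zh-CN) bad_bot
    <RequireAll>
        Require all granted
        Require not env bad_bot
    </RequireAll>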

  39. Thanks for the clarification, Adrian.

    I'll update my rule as you mentioned. I've got it in a .htaccess at the top-level folder to cover all hosted sites. It's an environment I share with a few other sites, so I can't access the httpd.conf, and the host isn't willing to change it because it would affect a few other customers on this server.

  40. I believe AspiegelBot is related to Huawei’s new search engine. According to some other stuff I’ve read online, “Huawei has launched their own search service as they try to bounce back from being blocked from the Google ecosystem.”

    So, if you're doing any business with the Chinese, I suppose you may not want to block this bot. I'll add that I have no idea how accurate the statement above is, but I thought it was worth sharing.

  41. Massive thanks John and all that have contributed!

    So the latest recommendation (not being so aware of HTACCESS syntax) would be to use the following (in my case):

    SetEnvIfNoCase User-Agent (AspiegelBot|adscanner) bad_bot
    Order Deny,Allow
    Deny from env=bad_bot

  42. @John Large
    Thank you! My site on shared hosting also got attacked by Chinese bots, and my service provider enforced a bandwidth limit on me. With the help of these instructions I got the situation back on track.

  43. Have a cup of Joe on me.

    Petalbot was being a PITA. I tried denying its IPs, but after 5 or 6 I gave up. Robots.txt was not only ignored, it actually increased the number of IPs used.

    I pasted the first code into my .htaccess file, which in my case is like an orangutan doing brain surgery.

    But WOW, what a difference. That put the mojo on Petalbot and a few others.

  44. Hi William.

    Glad you cracked it. These bots are a pest. Best to block them at the top level of your hosting account to protect any & all websites you host. Be sure to keep an eye on the post. I add new bots to the list as they appear.

    The donation towards my caffeine habit is much appreciated 🙂

  45. Hi John,

    thanks for starting this interesting post & thread. I’m looking for a way to block sogou and baidu from my website (and any other bad Chinese actors), as I don’t offer my services in China and most of my site visits (to one specific, random page) seem to come from sogou. I’m not an expert at this, and I’ve noticed using FTP that I have many layers of htaccess files on my server – above and inside public_html and also one within the folder inside public_html where my site resides. Which one should I add the lines to, and which version should I add? (your 17 March comment seems to override your blog post version, but you’ve also mentioned you’ve amended that because of the comment by Adrian on 20 March)

    Many thanks.

    P

  46. You should make the rewrite rule send them back to their own site so they can feel the heat they dish out 🙂

  47. Hi John. This is so helpful, what a great thread! Question please. I have shared hosting with Bluehost.com with a total of five sites; the first site is the primary for the account and the other four are called add-on domains, so the result is five sites. My question is: if I add the directives above to my .htaccess to block bots, do I just do the main site, or do I need to do one for each site? Thank you and the others for the great information. Mark

  48. Excellent info. I used a fail2ban bot filter, but the .htaccess approach from the comments section that blocks using BrowserMatch has been extremely effective. Thank you and God bless.

  49. Hi,

    I'm being constantly bombarded by the petalsearch bot.

    I would be grateful if the following could be checked to ensure that it’s correct.

    # BEGIN BOT BLOCK

    Options +FollowSymLinks
    RewriteEngine On
    RewriteBase /
    RewriteCond %{HTTP_USER_AGENT} ^.*(Mb2345Browser|MQQBrowser|LieBaoFast|zh-CN|MicroMessenger|zh_CN|Kinza|Bytespider|Baiduspider|Sogou).*$ [NC]
    RewriteRule .* - [F,L]

    # END BOT BLOCK

    Also, I'm looking for a way to create a rewrite rule to send them back to their own originating servers, something that will raise a red flag on their side and, hopefully, irritate them enough to stop crawling the website.

    Example …

    # START REDIRECT BOT BLOCK

    RewriteCond %{HTTP_USER_AGENT} (petalsearch) [NC]
    RewriteRule ^(.*)$ http://petalsearch.com$1 [R=301,L]

    # END REDIRECT BOT BLOCK

  50. In 2023 it is still giving us grief…

    Do you have a new solution? I have tried the solutions above, but nothing works…
