You can configure website anti-crawler protection rules to defend against crawlers such as search engines, scanners, and script tools, and use JavaScript-based detection to create custom anti-crawler protection rules.
A website has been added to WAF.
CDN caching may impact JS anti-crawler performance and page accessibility.
Figure 1 shows how JavaScript anti-crawler detection works, which includes JavaScript challenges (step 1 and step 2) and JavaScript authentication (step 3).
If JavaScript anti-crawler is enabled, when a client sends a request, WAF returns a piece of JavaScript code to the client.
By collecting statistics on the number of JavaScript challenge and authentication responses, the system calculates how many requests the JavaScript anti-crawler has defended against. In Figure 2, the JavaScript anti-crawler has logged 18 events: 16 JavaScript challenge responses and 2 JavaScript authentication responses. Others indicates the number of WAF authentication requests fabricated by crawlers.
WAF only logs JavaScript challenge and JavaScript authentication events. No other protective actions can be configured for JavaScript challenge and authentication.
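Because a client that cannot execute JavaScript never completes the challenge, you can observe this behavior with a plain HTTP client. The following is a minimal sketch, assuming JavaScript anti-crawler is enabled for www.example.com (a placeholder domain); the exact content of the challenge response may differ from the simple check shown here.

```python
import requests

# Request a protected page with a client that cannot execute JavaScript.
# Assumption: JavaScript anti-crawler is enabled for this placeholder domain.
url = "https://www.example.com/"
resp = requests.get(url, timeout=10)

print("Status code:", resp.status_code)

# A browser would run the returned JavaScript and resend the request;
# a plain HTTP client stops here, so the body is typically the challenge
# script rather than the real page content.
if "<script" in resp.text.lower():
    print("Response contains JavaScript - likely the anti-crawler challenge page.")
else:
    print("Response looks like normal page content.")
```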
Block: WAF blocks and logs detected attacks.
Log only: Detected attacks are logged only. This is the default protective action.
| Type | Description | Remarks |
|---|---|---|
| Search Engine | This rule is used to block web crawlers, such as Googlebot and Baiduspider, from collecting content from your site. | If you enable this rule, WAF detects and blocks search engine crawlers. NOTE: If Search Engine is not enabled, WAF does not block POST requests from Googlebot or Baiduspider. If you want to block POST requests from Baiduspider, use the configuration described in Configuration Example - Search Engine. |
| Scanner | This rule is used to block scanners, such as OpenVAS and Nmap. A scanner scans for vulnerabilities, viruses, and other security issues. | After you enable this rule, WAF detects and blocks scanner crawlers. |
| Script Tool | This rule is used to block script tools. A script tool is often used to execute automatic tasks and program scripts, such as HttpClient, OkHttp, and Python programs. | If you enable this rule, WAF detects and blocks the execution of automatic tasks and program scripts (see the example after this table). NOTE: If your application uses scripts such as HttpClient, OkHttp, or Python, disable Script Tool. Otherwise, WAF will identify such script tools as crawlers and block the application. |
| Other | This rule is used to block crawlers used for other purposes, such as site monitoring, access proxies, and web page analysis. NOTE: To avoid being blocked by WAF, crawlers may use a large number of IP address proxies. | If you enable this rule, WAF detects and blocks crawlers that are used for various purposes. |
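Automation clients are commonly recognizable by characteristics such as the User-Agent header, which is why the Script Tool note above recommends disabling the rule if your own application relies on such libraries. The sketch below is an illustration only, not documented WAF behavior: the domain is a placeholder, the browser User-Agent string is an assumed value, and the returned status codes depend on your WAF configuration.

```python
import requests

url = "https://www.example.com/"  # placeholder domain protected by WAF

# The default User-Agent of the requests library (e.g. "python-requests/2.x")
# is the kind of script-tool signature the Script Tool rule is meant to catch.
default_resp = requests.get(url, timeout=10)
print("Default UA:", default_resp.request.headers.get("User-Agent"),
      "->", default_resp.status_code)

# A browser-style User-Agent for comparison (assumed value, for illustration).
browser_headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36"
}
browser_resp = requests.get(url, headers=browser_headers, timeout=10)
print("Browser UA ->", browser_resp.status_code)
```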
JavaScript anti-crawler is disabled by default. To enable it, click the toggle switch and click Confirm in the displayed dialog box. When the toggle switch is turned on, JavaScript anti-crawler is enabled.
CDN caching may impact JS anti-crawler performance and page accessibility.
Two protective actions are provided: Protect all paths and Protect a specified path.
Select Protect a specified path. In the upper left corner of the page, click Add Path. In the displayed dialog box, configure required parameters and click OK.
| Parameter | Description | Example Value |
|---|---|---|
| Rule Name | Name of the rule. | wafjs |
| Path | Part of the URL, not including the domain name. A URL defines the address of a web page. The basic URL format is: Protocol name://Domain name or IP address[:Port]/[Path/.../File name]. For example, if the URL is http://www.example.com/admin, set Path to /admin (see the parsing example after this table). | /admin |
| Logic | Select a logical relationship from the drop-down list. | Include |
| Rule Description | A brief description of the rule. | None |
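The Path value covers only the part of the URL after the domain name (and optional port), as described above. A quick way to confirm which part of a URL belongs in the Path field is to parse it; the sketch below uses Python's standard library and the example URL from the table.

```python
from urllib.parse import urlparse

# The Path field takes only the part after the domain name (and optional port).
url = "http://www.example.com/admin"
parsed = urlparse(url)

print("Protocol:", parsed.scheme)   # http
print("Domain:  ", parsed.netloc)   # www.example.com
print("Path:    ", parsed.path)     # /admin  <- value to enter in the Path field
```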
To verify that the anti-crawler rule is protecting domain name www.example.com:
The following example shows how to allow search engine requests from Baidu or Google while blocking POST requests from Baiduspider.
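Once a configuration like the one described above is in place (Baiduspider and Googlebot GET traffic allowed, Baiduspider POST requests blocked), you can simulate both request types to check the behavior. This is a hedged sketch: the domain is a placeholder, the User-Agent string only mimics Baiduspider, and a blocked request is assumed to return a non-200 status, which depends on your WAF block page settings.

```python
import requests

URL = "https://www.example.com/"  # placeholder protected domain
BAIDU_UA = ("Mozilla/5.0 (compatible; Baiduspider/2.0; "
            "+http://www.baidu.com/search/spider.html)")
HEADERS = {"User-Agent": BAIDU_UA}

# GET request with the Baiduspider User-Agent: expected to be allowed.
get_resp = requests.get(URL, headers=HEADERS, timeout=10)
print("GET as Baiduspider ->", get_resp.status_code)

# POST request with the same User-Agent: expected to be blocked
# by the configuration described above.
post_resp = requests.post(URL, headers=HEADERS, data={"test": "1"}, timeout=10)
print("POST as Baiduspider ->", post_resp.status_code)
```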