How to Build a Custom Bot Blocklist Using Apache Rewrite Rules

Last Reviewed and Updated on September 27, 2024

Introduction

Blocking bot traffic is an essential task for maintaining the security, performance, and reliability of websites hosted on Apache-powered servers. Malicious bots can scrape content, spam forms, and even attempt attacks on your server. In this article, we will guide you through the process of building, maintaining, and deploying a custom bot blocklist using Apache’s .htaccess file and regex-based rules.

What is a Bot Blocklist and Why is It Important?

A bot blocklist is a set of rules or filters used to identify and block unwanted bots from accessing your website. These bots may engage in various malicious activities like scraping content, brute-forcing login pages, or performing denial-of-service (DoS) attacks. By blocking such bots, you can:

  • Protect your server’s resources
  • Prevent content scraping and data theft
  • Secure sensitive areas such as login pages

Prerequisites

Before diving into the setup, ensure the following:

  • You have access to cPanel and the ability to edit .htaccess files on your server.
  • You have a basic understanding of Apache configuration, particularly the mod_rewrite module.
  • You are familiar with regular expressions (regex) for matching bot traffic patterns.

Understanding Apache Rewrite Rules

Apache’s mod_rewrite module is a powerful tool for URL manipulation and conditional routing. With mod_rewrite, you can write rules that inspect incoming requests and take action based on conditions like IP address, User-Agent string, or referrer.

How Rewrite Rules Work:

  1. The RewriteCond directive defines the conditions that must be met for a rule to apply.
  2. The RewriteRule directive specifies the action to take when the conditions are met.

For example, to block a specific IP address, you can use a rule like:

RewriteCond %{REMOTE_ADDR} ^203\.0\.113\.45$
RewriteRule ^.* - [F]

This rule will return a 403 Forbidden response for any request coming from the IP 203.0.113.45. Note that the dots in the pattern are escaped with backslashes; an unescaped dot is a regex wildcard that matches any single character.

Step 1: Creating the Bot Blocklist File

  1. Locate or create the .htaccess file: The .htaccess file is typically located in the root directory of your website. If it doesn’t exist, you can create one.
  2. Permissions: Ensure that the .htaccess file is readable by Apache and writable by your own user account. (It should not be writable by the web server itself, as that is a security risk.)

Example of a basic .htaccess file:

# .htaccess file for blocking bots
RewriteEngine On

The RewriteEngine On directive must come before any rewrite rules; without it, every rule that follows is silently ignored.
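
A slightly fuller skeleton wraps everything in an <IfModule> guard so the site does not fail with a 500 error if mod_rewrite happens to be unavailable; this is a common convention, not a requirement:

# .htaccess bot blocklist skeleton
<IfModule mod_rewrite.c>
    RewriteEngine On
    # Bot block rules from Step 2 go here
</IfModule>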

Step 2: Adding Regex-Based Bot Block Rules

Blocking by User-Agent

You can block bots based on their User-Agent string, the header a client sends to identify itself. For example:

RewriteCond %{HTTP_USER_AGENT} (bot|crawler|spider) [NC]
RewriteRule ^.* - [F,L]

This rule will block any request whose User-Agent string contains "bot", "crawler", or "spider"; the [NC] flag makes the match case-insensitive. Be aware that a pattern this broad also matches legitimate crawlers such as Googlebot and Bingbot, which can hurt your search visibility, so prefer naming specific offenders, as in the sketch below.
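
A safer pattern targets specific crawlers by name. The bots below are common examples only; substitute the agents you actually see misbehaving in your access logs:

# Block specific unwanted crawlers by name (case-insensitive)
RewriteCond %{HTTP_USER_AGENT} (AhrefsBot|SemrushBot|MJ12bot|DotBot) [NC]
RewriteRule ^.* - [F,L]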

Blocking by IP Range

Blocking entire IP ranges can be useful when dealing with known malicious sources:

RewriteCond %{REMOTE_ADDR} ^203\.0\.113\.
RewriteRule ^.* - [F,L]

This blocks any IP address starting with 203.0.113. (again, note the escaped dots, which match literal periods).
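
On Apache 2.4 and later, mod_rewrite's expression syntax can match a whole CIDR subnet directly, which is less error-prone than hand-built octet patterns. A minimal sketch, using the documentation subnet 203.0.113.0/24 as a placeholder:

# Apache 2.4+: block an entire subnet by CIDR notation
RewriteCond expr "-R '203.0.113.0/24'"
RewriteRule ^.* - [F,L]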

Blocking by Referrer

You can also block requests based on the Referer header, for example to reject traffic referred from suspicious sites:

RewriteCond %{HTTP_REFERER} suspicious-site\.com [NC]
RewriteRule ^.* - [F,L]

Keep in mind that the Referer header is trivially spoofed, so treat referrer blocking as a nuisance filter rather than a security control.
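
To block several referrer-spam domains at once, chain conditions with the [OR] flag; the domain names here are placeholders:

# Block requests referred by known spam domains
RewriteCond %{HTTP_REFERER} spam-domain-one\.example [NC,OR]
RewriteCond %{HTTP_REFERER} spam-domain-two\.example [NC]
RewriteRule ^.* - [F,L]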

Step 3: Testing the Blocklist

Once you’ve added your block rules, it’s important to test them to ensure they’re working as expected.

Testing Methods:

  • cURL Command: Use the curl command to simulate requests with specific headers (like User-Agent) and check for the expected 403 Forbidden response. The -I flag prints only the status line and headers; a referrer test follows this list.
    curl -I -A "BadBot" http://yourdomain.com
  • Browser Testing: Use your browser's developer tools or a User-Agent switcher extension to modify your User-Agent string and visit the site.
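
The referrer rule can be exercised the same way; curl's -e (alias --referer) flag sets the Referer header:

# Expect a 403 Forbidden status in the output
curl -I -e "http://suspicious-site.com/page" http://yourdomain.com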

Troubleshooting:

  • 500 Errors: A syntax error in .htaccess can lead to a 500 Internal Server Error. Check for any typos in your rules.
  • Regex Issues: Ensure that the regular expressions are correctly formatted. Use online tools to test your regex patterns.

Step 4: Automating and Maintaining the Blocklist

Keeping the Blocklist Updated:

  • Automated Scripts: Use cron jobs to automate the process of updating the blocklist. For example, a script can fetch known bad IP addresses or User-Agents from an external list and regenerate the relevant section of your .htaccess file (see the sketch after this list).
  • Logs: Analyze Apache access logs to identify new bot traffic patterns and update your blocklist accordingly.
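
A minimal sketch of such an updater, assuming a denylist published as one User-Agent keyword per line; the URL and paths are placeholders, and any fetched list should be reviewed before it goes live:

#!/bin/bash
# Hypothetical updater sketch: regenerate User-Agent block rules from an
# external denylist. Review the fetched list before deploying it, since
# one bad pattern can lock out real users.
BLOCKLIST_URL="https://example.com/bad-bots.txt"   # placeholder source
RULES_FILE="$HOME/bot-block.rules"                 # generated rule fragment

curl -fsS "$BLOCKLIST_URL" | {
    echo "# Auto-generated bot block rules; do not edit by hand"
    while IFS= read -r agent; do
        # Skip blank lines and comment lines in the source list
        case "$agent" in ""|"#"*) continue ;; esac
        # Each RewriteCond applies only to the RewriteRule that follows,
        # so emit one Cond/Rule pair per pattern
        echo "RewriteCond %{HTTP_USER_AGENT} \"$agent\" [NC]"
        echo "RewriteRule ^.* - [F,L]"
    done
} > "$RULES_FILE"

Because .htaccess files cannot include other files, the generated fragment still has to be merged into .htaccess itself, for example between marker comments with a small sed step in the same cron job.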

Best Practices:

  • Periodically review and clean your blocklist to avoid blocking legitimate traffic.
  • Keep your regex patterns specific to avoid false positives (blocking legitimate users).
  • Monitor your server performance to ensure that blocking bots does not impact overall performance.

Step 5: Deploying the Blocklist to Production

Once your blocklist is ready and tested, it’s time to deploy it to the live server. Make sure to:

  • Backup your .htaccess file before making changes (a one-line example follows this list).
  • Deploy the updated blocklist during low-traffic periods to avoid disrupting users.
  • Monitor the impact of the blocklist on server performance and bot traffic.
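
For the backup step, a timestamped copy is enough; the path below is a placeholder for your document root:

cp /home/user/public_html/.htaccess /home/user/public_html/.htaccess.bak-$(date +%Y%m%d)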

Conclusion

In this guide, we have covered how to build, maintain, and deploy a custom bot blocklist using Apache’s .htaccess file and regex-based rules. Blocking bots is an ongoing process, and regularly updating your blocklist will help keep your server secure from malicious traffic.
