mod_antispam-1.0 for Apache-2.0 / Apache-2.1

Copyright 2005 by Hideo NAKAMITSU. All rights reserved

Hideo NAKAMITSU <nomo@bluecoara.net>
http://bluecoara.net/

--------------
WHAT IS THIS ?
--------------

  This module lets you control referer spam.

  As you may know, referer spam sometimes shows up in your log files.
  Its purpose is to lead you to a spam website by getting that site's address
  recorded in your log files.

  For background on referer spam, see http://www.spywareinfo.com/articles/referer_spam/

  Spammers typically use bots or tools to connect to your website with an invalid referer.

  When the HTTP server receives an HTTP_REFERER from a client, mod_antispam connects to
  that website and looks for a link to your website on the referring page.

  If no link to your site is found, the module automatically adds the address to the
  blacklist file so that it does not need to connect there again.
  If a link to your site is found, it automatically adds the address to the whitelist,
  again so that it does not need to connect there again.

  You can also edit the white/black lists by hand using regular expressions.
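
  The check described above can be sketched in Python. This is only an
  illustration of the flow, not the module's C implementation: the function
  names, the order in which the lists are consulted, and the choice to treat
  an unreachable page as spam are all assumptions made for this sketch.

```python
import re
from typing import Callable, Iterable, List


def matches_any(patterns: Iterable[str], referer: str) -> bool:
    """Return True if any list entry, read as a regular expression, matches."""
    return any(re.search(p, referer) for p in patterns)


def classify_referer(
    referer: str,
    my_site: str,
    whitelist: List[str],
    blacklist: List[str],
    fetch: Callable[[str], str],
) -> str:
    """Classify a referer as 'ham' or 'spam', updating the lists in place."""
    if matches_any(whitelist, referer):
        return "ham"             # already known good: no connection needed
    if matches_any(blacklist, referer):
        return "spam"            # already known bad: no connection needed
    try:
        page = fetch(referer)    # download the referring page
    except OSError:
        page = ""                # assumption: unreachable pages count as spam
    # re.escape keeps the stored URI from acting as a looser pattern.
    if my_site in page:
        whitelist.append(re.escape(referer))
        return "ham"
    blacklist.append(re.escape(referer))
    return "spam"
```

  Because every verdict is stored, a second request with the same referer is
  answered from the lists without another outbound connection.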

--------------
REFERER SPAM MECHANISM
--------------

  The key point is that the HTTP_REFERER in your log files is supplied by the client's web browser.
  Therefore, anyone who understands the referer mechanism can fake the HTTP_REFERER, using a tool or by hand.

  Here is an example.

  -------------------------------------
  % telnet your.website.example.com 80
  GET / HTTP/1.1
  Host: your.website.example.com
  Referer: http://www.google.com/
  Connection: close

  (contents will be displayed here)
  -------------------------------------

  http://www.google.com/ is then recorded in your access log, even though http://www.google.com/ does not
  contain any link to your website.
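
  The same request can be built programmatically, which is roughly what a
  spam bot does. The sketch below only constructs the request bytes; the
  function name build_request is hypothetical and nothing is sent over the
  network.

```python
def build_request(host: str, referer: str, path: str = "/") -> bytes:
    """Build a raw HTTP/1.1 GET request carrying an arbitrary Referer header."""
    lines = [
        f"GET {path} HTTP/1.1",
        f"Host: {host}",
        f"Referer: {referer}",  # chosen freely by the client, never verified
        "Connection: close",
        "",                     # the two empty items produce the blank line
        "",                     # that terminates the header block
    ]
    return "\r\n".join(lines).encode("ascii")


request = build_request("your.website.example.com", "http://www.google.com/")
print(request.decode("ascii"))
```

  The server has no way to tell this header apart from one sent by a browser
  that really followed a link, which is why the module checks the referring
  page itself.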

-------
mod_antispam ACTION
-------

  When this module detects a spam URI, it takes one of the following actions,
  which you select in the configuration.

    (1) [Test]
        Record the spam address in the blacklist and allow the access (test mode).
    (2) [Replace]
        Record the spam address in the blacklist, rewrite HTTP_REFERER to none, and allow the access.
        With this method, the access is allowed and the spam address does not appear in your log file.
    (3) [Reject]
        Record the spam address in the blacklist and return HTTP_FORBIDDEN (access denied).
    (4) [ReplaceReject]
        Record the spam address in the blacklist, rewrite HTTP_REFERER to none, and deny the access.
        With this method, the access is denied and the spam address does not appear in your log file.

  In some cases (3) and (4) are dangerous, because a legitimate referer cannot always be
  verified: some websites require cookies to display their pages, some are protected by
  authentication (e.g. a BBS in a groupware system), and some HTTP_REFERER values are
  intranet addresses (e.g. http://127.0.0.1/bookmark.html, http://intranet/bookmarks.html).

  This module does not support cookies and cannot connect to websites that require
  authentication, because it does not know the username or password.

  Start with Test or Replace mode, and switch to another mode only if you need to,
  once you have had a chance to analyze the recorded spam URIs.
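
  The four actions differ in only two respects: whether the referer is blanked
  before logging, and whether the request is denied. The table below is an
  illustrative sketch of that, not the module's code. The names Action,
  ACTIONS and apply_action are hypothetical, and 200 is just a placeholder for
  "request proceeds normally".

```python
from typing import NamedTuple, Optional, Tuple


class Action(NamedTuple):
    rewrite_referer: bool  # blank the Referer so it is not logged
    deny: bool             # answer with 403 Forbidden


ACTIONS = {
    "Test":          Action(rewrite_referer=False, deny=False),
    "Replace":       Action(rewrite_referer=True,  deny=False),
    "Reject":        Action(rewrite_referer=False, deny=True),
    "ReplaceReject": Action(rewrite_referer=True,  deny=True),
}


def apply_action(name: str, referer: str) -> Tuple[Optional[str], int]:
    """Return (referer as it would be logged, HTTP status) for a spam request."""
    action = ACTIONS[name]
    logged = None if action.rewrite_referer else referer
    status = 403 if action.deny else 200  # 200 is a placeholder for "allowed"
    return logged, status
```

  In all four modes the spam address is recorded in the blacklist; only the
  logging and the response differ.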

-------
INSTALL
-------

  If your Apache supports shared modules, installation is very easy.

  # /usr/local/apache2/bin/apxs -a -i -c mod_antispam.c

-------
CONFIGURATION
-------

  ------------------
   required section
  ------------------

  AntispamEnable (on/off, default=off)
    Enables or disables this module.

  AntispamWhiteList (filename, default=none)
    Path to the whitelist file. You can edit it by hand, using regular expressions.
    This file is not created automatically. You have to create it
    and set proper permissions (writable by the http user) before running Apache.

  AntispamBlackList (filename, default=none)
    Path to the blacklist file. You can edit it by hand, using regular expressions.
    This file is not created automatically. You have to create it
    and set proper permissions (writable by the http user) before running Apache.

  AntispamAutoWhiteList (filename, default=none)
    Path to the whitelist file that the module updates automatically. You should not edit it by hand.
    The file itself is not created automatically. You have to create it
    and set proper permissions (writable by the http user) before running Apache.

  AntispamAutoBlackList (filename, default=none)
    Path to the blacklist file that the module updates automatically. You should not edit it by hand.
    The file itself is not created automatically. You have to create it
    and set proper permissions (writable by the http user) before running Apache.

  ------------------
   optional section
  ------------------

  AntispamAction (Test/Replace/Reject/ReplaceReject, default=Test)
    Defines the action taken when spam is detected.
    Test: update the white/black lists. All accesses are allowed.
    Replace: update the white/black lists and replace the spam referer with none. All accesses are allowed.
    Reject: update the white/black lists and deny referer spam with HTTP_FORBIDDEN. The spam URI is still recorded in the log files.
    ReplaceReject: update the white/black lists, replace the spam referer with none, and deny referer spam with HTTP_FORBIDDEN.

  AntispamTarget (FQDN/FULL, default=FULL)
    mod_antispam updates the white/black lists automatically by adding
    spam/ham URIs to the files. If this setting is FQDN, only the FQDN part of the HTTP_REFERER
    is saved in the data file. If it is FULL, the full URI is saved.
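
    The difference between the two settings can be illustrated as follows.
    This is a sketch under the assumption that "FQDN part" means the host
    name alone, without scheme or path; the exact form the module writes to
    its data files is not specified here, and list_entry is a hypothetical
    name.

```python
from urllib.parse import urlsplit


def list_entry(referer: str, target: str = "FULL") -> str:
    """Return the string that would be stored for a referer (illustrative)."""
    if target == "FULL":
        return referer                     # keep the complete URI
    if target == "FQDN":
        host = urlsplit(referer).hostname  # host name only, lowercased
        if host is None:
            raise ValueError(f"no host name in referer: {referer!r}")
        return host
    raise ValueError(f"unknown AntispamTarget value: {target!r}")


print(list_entry("http://www.example.net/this/is/not/spam.html", "FULL"))
print(list_entry("http://www.example.net/this/is/not/spam.html", "FQDN"))
```

    FQDN keeps the lists shorter, because many spam URIs on one host collapse
    into a single entry, but it also judges a whole host by one of its pages.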

  AntispamSizeLimit (integer: bytes, default=100000)
    When this module receives an HTTP_REFERER from a client, it connects to that target
    and downloads its contents. This setting limits the download size.

  AntispamTimeout (integer: seconds, default=5)
    Timeout for the connection.

  AntispamRetry (integer, default=3)
    Retry count for connection errors. If errors persist after this many retries,
    the black list is updated.
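
    How the size limit, timeout and retry count interact can be sketched as
    follows. This is not the module's code. The function name fetch_limited
    and its opener parameter are hypothetical, and the sketch assumes the
    retry count means the total number of attempts.

```python
import urllib.request
from typing import Callable, Optional


def fetch_limited(
    url: str,
    size_limit: int = 100000,   # mirrors the AntispamSizeLimit default
    timeout: float = 5,         # mirrors the AntispamTimeout default
    retries: int = 3,           # mirrors the AntispamRetry default
    opener: Callable = urllib.request.urlopen,
) -> Optional[bytes]:
    """Download at most size_limit bytes; return None if every attempt fails."""
    for _ in range(retries):
        try:
            with opener(url, timeout=timeout) as response:
                return response.read(size_limit)  # never read past the limit
        except OSError:
            continue  # connection error or timeout: try again
    return None       # the caller would now add the URI to the blacklist
```

    The opener parameter exists only so that the function can be exercised
    without network access. urllib's URLError and socket timeouts are both
    subclasses of OSError, so a single except clause covers them.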

--------------
STEP BY STEP
--------------

  When you first install this module, the following configuration is recommended.
  As explained above, you have to create the black/white list files and set
  permissions so that the http user can update them.

  ---------------------------------------------------
  AntispamEnable on
  AntispamAction Test
  AntispamWhiteList logs/antispam.white
  AntispamBlackList logs/antispam.black
  AntispamAutoWhiteList logs/antispam.white.auto
  AntispamAutoBlackList logs/antispam.black.auto
  ---------------------------------------------------

  After some days or months, you will find many spam accesses in
  antispam.black.auto. You should then copy the spam URIs and paste them into antispam.black
  by hand. If you also find non-spam URIs in antispam.black.auto,
  copy them and paste them into antispam.white.
  Of course, you can also define entries as regular expressions.

  Here is an example.

  After some weeks, you might have addresses like these.
  (Note: a non-spam URI has been recorded in the blacklist in this case.)

    - logs/antispam.black.auto
        http://www.discount-drugs.name
        http://www.public-sex.name
        http://www.big-tits.name
        http://www.example.net/this/is/not/spam.html
        http://animal-sex.horse-sex.ws
        http://www.glory-hole.name
        http://www.group-sex.name

    - logs/antispam.white.auto
        http://bluecoara.net/
        http://foo.bar.example.org/foo/bar.html

    - logs/antispam.black
        (empty unless you edit it by hand)

    - logs/antispam.white
        (empty unless you edit it by hand)

  You should then edit these files by hand. This is not required,
  but it is recommended as a way to manage and understand spam.

    - logs/antispam.black.auto
        (empty)

    - logs/antispam.white.auto
        (empty)

    - logs/antispam.black
        http://www.discount-drugs.name
        http://www.public-sex.name
        http://www.big-tits.name
        http://animal-sex.horse-sex.ws
        http://www.glory-hole.name
        http://www.group-sex.name

    - logs/antispam.white
        http://bluecoara.net/
        http://foo.bar.example.org/foo/bar.html
        http://www.example.net/this/is/not/spam.html

  You can also add entries as regular expressions.

    - logs/antispam.white
        ^http://[^/]+\.jp
        ^http://[^\.]*\.google\.[^\.]+$

  After editing, modify httpd.conf and change AntispamAction to Replace,
  Reject, or ReplaceReject.
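
  Before relying on regular-expression entries, it helps to check what they
  actually match. The sketch below tests the two whitelist patterns from this
  section with Python's re module. It assumes the module's regular-expression
  engine treats these particular patterns the same way, and the function name
  is_whitelisted is hypothetical.

```python
import re

# The two whitelist entries shown in this section.
WHITELIST = [
    r"^http://[^/]+\.jp",
    r"^http://[^\.]*\.google\.[^\.]+$",
]


def is_whitelisted(referer: str, patterns=WHITELIST) -> bool:
    """Return True if any whitelist pattern matches the referer."""
    return any(re.search(pattern, referer) for pattern in patterns)


for url in (
    "http://www.example.jp/index.html",
    "http://www.google.com/",
    "http://www.discount-drugs.name",
    "http://www.jp-spam.example.com/",
):
    print(url, is_whitelisted(url))
```

  Note that the first pattern is not anchored at the end of the host name, so
  it also matches a host such as www.jp-spam.example.com, where ".jp" merely
  appears inside the name. A tighter pattern would be needed if the whitelist
  should admit only hosts that really end in .jp.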

--------------
LOOP ?
--------------

  If you are using this module on "http://www.example.com/" and someone
  connects to your website with HTTP_REFERER forged as "http://www.example.com/",
  mod_antispam will connect to your own website.

  However, once this module has connected to a website, the white/black lists are
  updated, and if an address is already in your lists, the module never connects
  to that website again, provided your settings are correct. Therefore you do not
  need to worry about a connection loop.

--------------
USER-AGENT
--------------
  When mod_antispam connects to the target, it sends "User-Agent: mod_antispam"
  by default. You can modify the source to change the User-Agent.

--------------
PERFORMANCE
--------------

  When a client connects to Apache, this module connects to the HTTP_REFERER target,
  which takes some seconds the first time.

  Once mod_antispam has connected to the target, it updates the white/black lists.
  After that, the module no longer connects to that server and only consults the white/black lists.
  However, reading the lists and comparing the referer against them still takes time.
  Therefore, if the white/black lists are too large, Apache performance will suffer.

  Here is some performance data.

  - apache default

    Concurrency Level:      10
    Time taken for tests:   0.267426 seconds
    Complete requests:      1000
    Failed requests:        0
    Write errors:           0
    Total transferred:      271000 bytes
    HTML transferred:       27000 bytes
    Requests per second:    3739.35 [#/sec] (mean)
    Time per request:       2.674 [ms] (mean)
    Time per request:       0.267 [ms] (mean, across all concurrent requests)
    Transfer rate:          987.19 [Kbytes/sec] received

  - mod_antispam enabled (1000 lines per list)
    The white/black/autowhite/autoblack lists each contained 1000 lines, and
    the target URI was added at the bottom of the black list.

    Concurrency Level:      10
    Time taken for tests:   41.905376 seconds
    Complete requests:      1000
    Failed requests:        0
    Write errors:           0
    Total transferred:      271000 bytes
    HTML transferred:       27000 bytes
    Requests per second:    23.86 [#/sec] (mean)
    Time per request:       419.054 [ms] (mean)
    Time per request:       41.905 [ms] (mean, across all concurrent requests)
    Transfer rate:          6.30 [Kbytes/sec] received

  - mod_antispam enabled (100 lines per list)
    The white/black/autowhite/autoblack lists each contained 100 lines, and
    the target URI was added at the bottom of the black list.

    Concurrency Level:      10
    Time taken for tests:   4.387564 seconds
    Complete requests:      1000
    Failed requests:        0
    Write errors:           0
    Total transferred:      272084 bytes
    HTML transferred:       27108 bytes
    Requests per second:    227.92 [#/sec] (mean)
    Time per request:       43.876 [ms] (mean)
    Time per request:       4.388 [ms] (mean, across all concurrent requests)
    Transfer rate:          60.40 [Kbytes/sec] received

  You should write rules as regular expressions rather than build large white/black lists.
  I plan to support BerkeleyDB in the future to improve performance.
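
  The figures above can be compared directly. The short calculation below
  uses only the "Time per request (mean, across all concurrent requests)"
  values from the three runs; the variable and function names are chosen for
  this sketch.

```python
BASELINE_MS = 0.267     # apache default
LISTS_100_MS = 4.388    # mod_antispam enabled, 100 lines per list
LISTS_1000_MS = 41.905  # mod_antispam enabled, 1000 lines per list


def overhead_ms(measured: float, baseline: float = BASELINE_MS) -> float:
    """Extra time per request attributable to the module, in milliseconds."""
    return measured - baseline


small = overhead_ms(LISTS_100_MS)
large = overhead_ms(LISTS_1000_MS)
ratio = large / small

print(f"overhead with 100-line lists:  {small:.3f} ms per request")
print(f"overhead with 1000-line lists: {large:.3f} ms per request")
print(f"ratio: {ratio:.1f}x for 10x as many lines")
```

  Ten times as many lines costs roughly ten times the overhead, which
  suggests the cost grows about linearly with list length. That is why a few
  regular expressions are preferable to thousands of literal entries.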

--------------
LICENSE
--------------

  ASL-2.0 (Apache License, Version 2.0)
