Middleman filtering proxy server
(c)2002 Jason McLaughlin <jasonmc@sympatico.ca>
http://www.sourceforge.net/projects/middle-man



Introduction


Middleman is a powerful proxy server with many features designed to make browsing the Internet a more pleasant experience. It can do much more than just proxying though; it can be used as a layer between any web server and client to filter HTTP requests, or act as a portal between an internal network and the Internet. It has an intuitive Web interface that provides an easy way of accessing and changing the proxy's configuration, so for most settings there's no need to dig through any complicated configuration files.

Installation
Installing Middleman should be straightforward. After extracting the archive, type "./configure && make". If you're using a BSD operating system you will need to use "gmake" rather than make; if that's unavailable, as a last resort you can use BSD's make and then enter the "gcc -o mman *.o -pthread" command afterwards. There are several compile-time options available for the configure script; type "./configure --help" to see a complete list.
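
The build steps above can be sketched as a small script. This is only an illustration: the pick_make and build_middleman helper names are made up for this example, and the script assumes it is run from the extracted source directory.

```shell
#!/bin/sh
# Prefer GNU make under its BSD name "gmake" when it is installed;
# otherwise fall back to plain "make".
pick_make() {
    if command -v gmake >/dev/null 2>&1; then
        echo gmake
    else
        echo make
    fi
}

# Configure and build; call this from the extracted source directory.
build_middleman() {
    ./configure && "$(pick_make)"
}
```

Calling build_middleman runs the same two steps as typing "./configure && make" (or gmake) by hand.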

If you wish to have the proxy server loaded at boot time, there is a script in the "scripts" directory called mman.init to assist you with that. Simply edit the paths at the top, then copy it to the "/etc/rc.#" directory, where # is your current runlevel (if you're unsure what it is, use the "runlevel" command). You may need to rename the script: if you're using a Debian-based distribution, the naming scheme for init.d scripts is in the form "S##program", where ## is the order in which the script is loaded and "program" is the program's name.

There are several command line options you may use when loading the proxy server; at the very least you will need to use the -c option followed by the path to the configuration file. The -p option makes Middleman check (and create) a file containing the PID of the proxy server; this can be used to prevent multiple instances of the proxy server from running concurrently. The -l option specifies the path to the logfile if the --enable-syslog option wasn't used during compilation, and the -d option specifies the level of detail which should be logged; use -h for a complete list of loglevels.
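
As a rough sketch, a start-up wrapper using these options could look like the following. The binary location and the PID and log file paths are assumptions made for the example, not defaults shipped with Middleman; the -d option is left out because the valid loglevels are only listed by -h.

```shell
#!/bin/sh
# Hypothetical locations; adjust them to your installation.
MMAN_BIN=${MMAN_BIN:-mman}
MMAN_PID=${MMAN_PID:-/var/run/mman.pid}
MMAN_LOG=${MMAN_LOG:-/var/log/mman.log}

# start_middleman CONFIG
# Refuses to start when the configuration file cannot be read, then runs
# the proxy with the -c, -p and -l options described above.
start_middleman() {
    config=$1
    if [ ! -r "$config" ]; then
        echo "cannot read configuration file: $config" >&2
        return 1
    fi
    "$MMAN_BIN" -c "$config" -p "$MMAN_PID" -l "$MMAN_LOG"
}
```

For example, "start_middleman /etc/mman/config.xml" would load the proxy with that (hypothetical) configuration path.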

Using

Once the proxy server is running, you'll then need to configure your web browser to use it.

If you're using Mozilla, open up the edit menu and click on preferences. Expand the "advanced" options then click on "Proxies". Click on the "Manual proxy configuration" radio button then fill in the HTTP and HTTPS fields with the IP address and port of the proxy; if you're using the default configuration, the port will be 8080.

If you're using Konqueror, open up the settings menu and click on "Configure Konqueror". Click on the icon labeled "Proxy" in the left pane, click the "Use proxy" checkbox and then the "Manual proxy configuration" checkbox. Click on setup to the right of that then fill in the HTTP and HTTPS fields with the IP address and port of the proxy.

URL commands

Any feature can be temporarily bypassed by adding a special prefix to the URL. To bypass everything, use "bypass..website.com", or "/bypass../file" for HTTP requests. To bypass only some features, use "bypass[<features>]..website.com", or "/bypass[<features>]../file" for HTTP requests, where <features> is any combination of the letters 'f' (URL filter), 'r' (redirecting), 'w' (rewrite), 'h' (header filter), 'm' (MIME filter), 'c' (cookie filter), 'e' (external parser), 'p' (forwarding), and 'k' (keyword filtering).

The '+' (plus) and '-' (minus) symbols switch between bypassing and unbypassing the features that follow them; unbypassing is useful when the access rule already bypasses a feature. For example, "bypass[fw-ph]..website.com" will bypass the URL filtering and rewrite features, and unbypass the forwarding and header filter features.

The "filter.." and "mime.." URL commands will respectively show which URL or MIME filter entry matches the requested URL, if any, and provide a link to edit or delete it. The "score.." URL command will show the keyword score for a page.

URL commands are also taken from the Referer header sent by your browser, making them apply to images, links, or files loaded from a webpage a URL command was used on. Any URL command used when making a request which results in a server sending back a 302 redirect is added to the Location header as well.

More than one URL command can be used at the same time; simply append each one to the end of the previous one. For example, if you want to check if something matches a MIME filter entry but the site is blocked by a URL filter entry, you can use "bypass[f]..mime..www.somenastysite.com".
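
The prefix syntax can be illustrated with a small helper that builds a bypass URL. The bypass_url function is a made-up name for this sketch; it only assembles the string and does not contact the proxy.

```shell
#!/bin/sh
# bypass_url FEATURES TARGET
# Prints TARGET with a bypass URL command in front of it. An empty
# FEATURES argument produces the "bypass everything" form.
bypass_url() {
    features=$1
    target=$2
    if [ -n "$features" ]; then
        printf 'bypass[%s]..%s\n' "$features" "$target"
    else
        printf 'bypass..%s\n' "$target"
    fi
}

bypass_url ""      website.com    # prints: bypass..website.com
bypass_url "fw-ph" website.com    # prints: bypass[fw-ph]..website.com
```

Because URL commands chain, the target may itself start with another command, as in bypass_url f mime..www.somenastysite.com.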


Configuration

Most of the configuration is made easy by the Web interface; however, configuring the network settings will need to be done by manually editing the configuration file. At the top of the included config.xml you will see a section that looks similar to the following:

<network>
    <listen>
        <ip>127.0.0.1</ip>
        <port>8080</port>
    </listen>
</network>

Each <listen> section inside the <network> section has an <ip> and <port> option, which should contain after them the IP address and port number to listen on, respectively. You may leave out the <ip> option to have Middleman listen on all interfaces. Middleman, by default, can have up to 20 <listen> sections.
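
To show what more than one <listen> section looks like, the following sketch prints a <network> section with two listeners, the second of which leaves out <ip> so that it listens on all interfaces. The listen_section helper and port 3128 are assumptions for the example; the output still has to be pasted into config.xml by hand.

```shell
#!/bin/sh
# listen_section PORT [IP]
# Prints one <listen> section; the <ip> option is left out when no IP is given.
listen_section() {
    port=$1
    ip=${2:-}
    echo "    <listen>"
    if [ -n "$ip" ]; then
        echo "        <ip>$ip</ip>"
    fi
    echo "        <port>$port</port>"
    echo "    </listen>"
}

echo "<network>"
listen_section 8080 127.0.0.1
listen_section 3128
echo "</network>"
```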

As mentioned above, all other configuration settings can be modified through the Web interface. To access it while using the proxy, load "http://mman" in your browser; when not using the proxy, the Web interface is accessible by making a regular HTTP request for /mman to the proxy's IP address and port.

Once you've loaded the Web interface, you will see a page with several links available at the top.

The "Active connections" link will display a page showing all connections currently being handled by the proxy.

The "DNS cache" link is for debugging purposes only, and will display entries in the DNS cache.

The "Show headers" link will bring you to a page showing all the HTTP headers your browser sends, and what they look like after being filtered. Note: headers handled by Middleman aren't shown; this is to avoid confusion.

The "Save settings" link will bring you to a page with a Filename dialog where you can save all current settings; by default it will be filled with the path to the configuration file given when the proxy server was loaded.

The "Load settings" link will also bring you to a page with a Filename dialog, as well as an "Overwrite" option. The overwrite option can be used to select whether the settings contained in the configuration file will overwrite all current settings or simply be added to them.

The "View log entries" link will bring you to a page showing recent entries made to the logfile, and will allow you to search through them using regular expressions. The log buffer can also be cleared from here, and its size can be adjusted. The level of logging detail available through the web interface is unaffected by the options given in the command line, and will always be all log entries with the exception of debug messages.

The "Config" link will bring you to a page where all configuration settings can be accessed. On the main page you will see a dialog with a drop down list containing the name of each section, as well as a table with a list of each section and an enable/disable radio button beside it; this can be used to quickly enable/disable a feature if it's causing problems with a website.

When you select an item in the drop down list and click on the submit button, you will be brought to a page containing a dialog at the top as well as a list of entries for that section below, with the exception of the network section, which is read-only. The dialog at the top will always contain an "add" link, which can be used to add an additional entry to the section, and in some cases will have several other options which will be explained below. Each entry at the bottom has an "Edit", "Delete", "Up", "Down", "Top", and "Bottom" link. The edit link will bring you to a dialog where you can edit that specific entry; the delete link will remove it from the section. The "Up" and "Down" links allow you to change the order of the entries; this is important in cases where more than one entry can match the same thing. The "Top" and "Bottom" links can be used to move the entry to the very top or bottom of the list.

All entries for all sections have an "Enabled" option which allows you to disable a specific entry, as well as a "Comment" field that can be used to store a description of what the entry's purpose is.

Several sections follow an allow/deny/policy model; for these sections, each entry has an action option which will specify what happens when it is found to match. If no matching entry is found, the action the policy is set to will be taken. It is important to remember that all entries with an action opposite to the policy are searched first, and if nothing is found the entries with an action the same as the policy are not searched. So, for example, if the policy for the access section is set to "allow", and no entries with a "deny" action are found matching the connection, none of the entries with an "allow" action are looked at, so any access limitations specified in the allow entry are ignored.
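
The search order described above can be modelled with a few lines of shell. This is an illustration of the rule only, not Middleman's own code: the decide function and the "<action> <regex>" entry format are invented for the sketch, and grep's extended regular expressions may differ from the dialect Middleman uses.

```shell
#!/bin/sh
# decide POLICY VALUE
# Reads entries from standard input, one per line, as "<action> <regex>".
# Only entries whose action is the opposite of the policy are searched;
# if none of them match, the policy is the result.
decide() {
    policy=$1
    value=$2
    if [ "$policy" = allow ]; then
        opposite=deny
    else
        opposite=allow
    fi
    while read -r action pattern; do
        [ "$action" = "$opposite" ] || continue
        if printf '%s\n' "$value" | grep -Eq -- "$pattern"; then
            echo "$opposite"
            return 0
        fi
    done
    echo "$policy"
}

# With an "allow" policy only the deny entry is examined.
decide allow 10.0.0.5 <<'EOF'
deny ^10\.
allow ^192\.168\.
EOF
# prints: deny
```

In this model a client at 192.168.1.1 is allowed by the policy itself; the "allow" entry is never looked at, which is why any limitations attached to it would be ignored.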

The tables below will describe all the options available in each section and the entries within them.

--- Access section ---

Purpose
The access feature is used to control who can access the proxy server, and to what extent.
Global options
Policy
Default action to take when no matching entry is found.
Entry options
IP Address
A regular expression matching the IP addresses this entry applies to; leaving this blank will cause the entry to match everything.
Username
If this field is not empty, clients matching this entry will be required to authenticate with the proxy server. There can be more than one entry matching the same IP address, in which case the one matching the username/password sent by the browser is used.
Password
The client's password if the username field is used.
Access
A list of features connections matching this entry are allowed to access, the options are:
Web interface - Access to all of the web interface (access to /mman/template/<template name> is always allowed regardless of this)
Proxy requests - Allowed to make regular proxy requests
CONNECT Requests - Allowed to make CONNECT requests
Transparent proxying - Allowed to make transparent proxy requests (must be allowed to make HTTP requests as well)
HTTP Requests - Allowed to make regular HTTP requests to proxy (for Web interface and redirected requests)
Allow bypassing - Allows features to be bypassed by prefixing with URL command

Bypass
A list of features which will by default be bypassed when making requests.



--- Templates section ---

Purpose
Templates are used throughout Middleman as a replacement for pages which can't be displayed due to filtering, error, or other conditions.
Global options
Path
Location to look for templates in if no absolute path is given.
Entry options
Name
The name of the template; this is used in other sections to reference it. It may also be one of the following to replace internal error messages:
blocked - Page blocked
nodns - DNS lookup failed
badrequest - Malformed HTTP header from client
badresponse - Malformed HTTP header from server
nofile - File not found
noconnect - Connection failed
noaccess - Access denied
badprotocol - Protocol not implemented
badauth - Authorization failed (when forwarding through SOCKS4)

There are 3 built-in templates that can be used: tinygif (a 1x1 transparent gif image), checkedgif (a 4x4 grey and transparent checkered pattern), and tinyswf (an empty flash animation).

You can override the content sent by a website for certain response codes by making a template with a numerical name the same as the response code.


There are several variables that can be used in templates which will be replaced with information about the request currently being handled, they are:
$HTTP_METHOD - Method used to request file
$HTTP_HOST - Host HTTP request was made to.
$HTTP_FILE - File HTTP request was made for.
$HTTP_PORT - Port HTTP request was made to.
$IP - IP address of client making request.

Templates can be accessed directly by loading "http://mman/template/<template name>".

File
The filename of the template
Mimetype
The MIME-type of the template. When using an executable, this can be set to STDIN to have the MIME-type extracted from a "Content-type" header sent by the program; this will be explained in greater depth below.
Response code
The response code to use when sending the template, leave blank to use internal default.
Type
Template type, either File or Executable. If Executable is chosen, the file is executed and whatever it writes on STDOUT is sent as the template. Several environment variables are set for the executable to use; they will be explained further below in the external section.
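
A minimal sketch of an executable template follows. It assumes the entry has Type set to Executable and Mimetype set to STDIN, that the "Content-type" header is followed by a blank line in the same way as for external parsers, and that HTTP_HOST and HTTP_FILE are available in the environment as listed in the external section. The render_template name and the page wording are made up for the example.

```shell
#!/bin/sh
# Writes the MIME-type header, a blank line, then the page itself.
# HTTP_HOST and HTTP_FILE are assumed to be set by the proxy; fallbacks
# keep the script usable when it is run by hand.
render_template() {
    echo "Content-type: text/html"
    echo ""
    echo "<html><head><title>Blocked</title></head>"
    echo "<body>Access to ${HTTP_HOST:-unknown host}${HTTP_FILE:-} was blocked.</body>"
    echo "</html>"
}

render_template
```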

--- MIME section ---

Purpose
The mime feature allows you to filter content based on its MIME-type.
Global options
Policy
The action to take when no matching entry is found.
Default template
The template to send for blocked MIME-types if the template option is left blank for the matching entry, or if no matching entry is found but the policy is deny.
Entry options
Host
A regular expression matching the hosts this entry applies to, leave blank to match everything.
File
A regular expression matching the files this entry applies to, leave blank to match everything.
Mimetype
A regular expression matching the MIME-types this entry applies to, leave blank to match everything.
Template
The template to send when an entry matches, this has no purpose in entries with the action set to allow.

--- Redirect section ---

Purpose
The redirect feature allows you to redirect requests.
Entry options
URL
A regular expression matching the URLs you wish to redirect; the URL will always be in the form "host/file" or "/file" for HTTP requests.
Redirect
The URL to redirect to; it may contain backreferences to strings captured using parentheses in the URL pattern. This can be in the form "host/file", or "/file" if you wish to send a relative URL when redirecting a URL in the Location: header. If this option is left blank, no action will be taken against requests matching the URL.

See the rewrite section for additional notes on using regular expressions with backreferences.
Port
The port to redirect to; if left blank the same port the original request was made to is used.
302 Redirect
If yes, a 302 redirect is issued; otherwise the new host is connected to directly and the new file is requested. A 302 redirect should always be used when possible to ensure relative links and images are correct.
Options
Several options are available to control how the URL should be handled, they are:
Encode URL - Encode the new URL.
Decode URL before - Decode the URL before attempting to match it with the regular expression
Decode URL after - Decode the new URL after matching.

Applies to
This option is to choose whether the redirection applies to requested URLs, the Location header when a remote site sends a 302 redirect, or both.

--- Forward section ---

Purpose
The forward feature allows you to selectively forward requests through another proxy or SOCKS4 firewall based on their URL.
Entry options
Host
A regular expression matching the hosts you wish to have requests forwarded for, leave blank to match everything.
File
A regular expression matching the files you wish to have requests forwarded for, leave blank to match everything.
Proxy
The hostname or IP address of the proxy to forward through; if this is left blank, and the host or file options aren't, no action will be taken for requests matching the host and file.
Username
The username to use if the proxy requires authentication.
Password
The password to use if the proxy requires authentication.
Domain
The NT domain when using the NTLM authentication protocol.
Port
The port number of the proxy to forward through.
Type
What type of proxy to forward through; can be HTTP or SOCKS4
Applies to
What type of requests are forwarded; can be HTTP and/or CONNECT (HTTPS)

--- Header section ---

Purpose
The header feature allows you to control what headers are passed from your browser to websites. In addition to the allow and deny actions found in some other sections, there is an insert action which will add a new header onto the ones sent by your browser; for these entries, the Host and Type options are plaintext.
Global options
Policy
The action to take when no matching entries are found.
Entry options
Host
A regular expression matching the hosts this entry applies to; leave blank to match everything.
Type
A regular expression matching the header types this entry applies to; leave blank to match everything (headers are in the form "Type: value").
Value
A regular expression matching the header values this entry applies to; leave blank to match everything.

--- Rewrite section ---

Purpose
The rewrite feature allows you to use regular expressions to modify the contents of webpages, files, the client header, and server header.
Entry options
Host
A regular expression matching the hosts this entry applies to; leave blank to match everything.
File
A regular expression matching the files this entry applies to; leave blank to match everything.
Mimetype
A regular expression matching the MIME-types this entry applies to. This must be filled with something, otherwise the rewrite rule will be applied to every downloaded file, which is almost certainly not what you want. To have it applied to webpages, fill this field with "text/html".
Pattern
A regular expression pattern matching the area of text inside the file to modify; if this is left blank, and the host, file, or mimetype options aren't, this will be the last entry matched for sites matching the host, file, and mimetype.
Replace
The replacement text to use in place of the area of text matching the pattern; it may contain backreferences to strings captured using parentheses in the pattern.

A backreference to a captured string is in the form "$#", where # is a number from 1-9; "$0" will be replaced with the entire area of text matching the regular expression.

Escape sequences may be used to represent unprintable characters; they are "\n" (newline), "\r" (carriage return), and "\t" (tab). To use a backslash as part of the replacement text, precede it with another backslash.
Applies to
This option is to select what the rewrite rule applies to; the options are:
Client header - rewrite the client header; this happens before Middleman parses it so be careful not to remove any headers needed to handle the request properly. The Mimetype option serves no purpose for this.
Server header - rewrite the header from the remote web server; same conditions from client header apply.
Body - rewrite the body of the webpage or file.
POST data - rewrite POST/PUT data sent when submitting a form or uploading a file.
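
The way a captured string is carried into the replacement text can be tried out with sed. This is only an analogy: sed writes a backreference as "\1" where the Replace option above would use "$1", and sed's extended regular expressions may not match Middleman's dialect exactly. The retitle function name is made up for the example.

```shell
#!/bin/sh
# Pattern:  <title>(.*)</title>
# Replace:  <title>[filtered] \1</title>   ("$1" in Middleman's notation)
retitle() {
    sed -E 's#<title>(.*)</title>#<title>[filtered] \1</title>#'
}

echo '<title>News</title>' | retitle
# prints: <title>[filtered] News</title>
```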


--- Cookies section ---

Purpose
The cookies feature allows you to choose which hosts your browser is allowed to send and receive cookies to and from.
Global options
Policy
The action to take when no matching entry is found.
Entry options
Host
A regular expression matching the hosts this entry applies to.
Direction
The direction of the cookie this entry applies to; can be either in (Set-cookie sent by website), out (Cookie sent by browser), or both.

--- External section ---

Purpose
The external feature allows you to use any program or script to parse the contents of a file.
Entry options
Executable
The path to the executable; if no absolute path is given, the directories listed in the PATH environment variable are searched.

Any number of arguments can be passed by separating them with spaces; if you're using a temporary file as the method to pass the contents of the file, its path will be the last argument.

When the program is executed, several environment variables are set to reflect the properties of the file being handled, they are:
HTTP_METHOD - Method used to request the file.
HTTP_HOST - Host HTTP request was made to.
HTTP_FILE - File HTTP request was made for.
HTTP_PORT - Port HTTP request was made to.
IP - IP address of client making request.

Additionally, for every header received from the remote website and sent by the client, an environment variable is set. All the environment variables for the server's headers start with SERVER_, and the client's start with CLIENT_; all '-' (dashes) in the header type are converted to '_' (underscores), and all characters are in uppercase.

If an executable returns with a non-zero status code, the original content is returned.
Host
A regular expression matching the hosts this entry applies to, leave blank to match everything.
File
A regular expression matching the files this entry applies to, leave blank to match everything.
Mimetype
A regular expression matching the MIME-types this entry applies to, leave blank to match everything.
Newmime
The MIME-type of the content returned from the external program, leave blank to have the original MIME-type preserved.

If this is set to STDIN, the external program is expected to write "Content-type: <mimetype>" followed by 2 newlines as its first output, where <mimetype> is the new MIME-type.
Type
The method by which content is passed to the external program; if set to Pipe the content is piped to the program's STDIN, if set to File the content is stored in a temporary file and its path is passed as the last argument.
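
The naming rule for the header environment variables described under the Executable option can be reproduced with tr. The header_env_name function is a made-up helper for this sketch; it only shows which variable name to expect for a given header.

```shell
#!/bin/sh
# header_env_name PREFIX HEADER
# PREFIX is SERVER or CLIENT; dashes become underscores and letters
# are converted to uppercase.
header_env_name() {
    prefix=$1
    header=$2
    printf '%s_%s\n' "$prefix" "$header" | tr 'a-z-' 'A-Z_'
}

header_env_name SERVER Content-Type    # prints: SERVER_CONTENT_TYPE
header_env_name CLIENT User-Agent      # prints: CLIENT_USER_AGENT
```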

--- Keyword filtering ---

Purpose
The keyword filtering feature allows you to block pages which may contain inappropriate content using a scoring system. When the host, file, mimetype, and keyword in an entry match a file, the entry's score is added to the total score; when that total score exceeds the threshold, the page is deemed inappropriate and blocked.
Global options
Template
The template to send when a page exceeds the threshold.
Threshold
The number the total score must equal or exceed before the page is blocked.
Entry options
Host
A regular expression matching the hosts this entry applies to, leave blank to match everything.
File
A regular expression matching the files this entry applies to, leave blank to match everything.
Mimetype
A regular expression matching the MIME-types this entry applies to; it is highly advisable that you set this to something, otherwise all files will be checked; if you're unsure, set this to "text/".
Keyword
A regular expression matching anything in the body of the document considered inappropriate, leave blank to match everything.
Score
The score this entry adds to the total score when it matches; this may be a positive or negative integer.
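
The scoring idea can be modelled as follows. This is a simplified illustration rather than Middleman's implementation: it ignores the Host, File and Mimetype options, assumes each matching entry adds its score once, and uses an invented "<score> <regex>" entry format together with the hypothetical keyword_score and is_blocked helpers.

```shell
#!/bin/sh
# keyword_score FILE
# Reads entries from standard input, one per line, as "<score> <regex>",
# and prints the sum of the scores of all entries whose regex matches FILE.
keyword_score() {
    file=$1
    total=0
    while read -r score pattern; do
        if grep -Eq -- "$pattern" "$file"; then
            total=$((total + score))
        fi
    done
    echo "$total"
}

# is_blocked TOTAL THRESHOLD
# Succeeds when the total equals or exceeds the threshold.
is_blocked() {
    [ "$1" -ge "$2" ]
}
```

Negative scores lower the total, so an entry matching a harmless context can offset an entry that would otherwise push a page over the threshold.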


Transparent proxying


Middleman can be used to transparently proxy requests; to make use of this feature, you will need to use firewall software capable of forwarding connections. Configure the firewall to forward connections destined for port 80 to the proxy server; the proxy server will look at the Host header sent by the browser and use that to determine what host the request was originally intended for. This feature may not work for all browsers: sending the Host header is only required for HTTP 1.1, although most HTTP 1.0 clients send it anyway.

If you're using iptables under Linux, the following command should do the job (replace the interface and port to match your setup):
iptables -t nat -A PREROUTING -i eth0 -p tcp --dport 80 -j REDIRECT --to-port 8080

Example external parser


This is a trivial example of how to write an external parser; this will replace any page with the word "sex" in it with a warning message (shrug).
This should be used with the type set to "File", Mimetype set to "text/html", and Newmime set to "STDIN"

--- SNIP ---
#!/bin/sh
# With no extra arguments configured, the temporary file holding the page
# content is passed as $1.
if grep -qi sex "$1"; then
    echo "Content-type: text/html"
    echo ""
    echo "<html><head><title>Inappropriate content</title></head>"
    echo "<body><font size=6>$HTTP_HOST$HTTP_FILE contains inappropriate content</font></body>"
    echo "</html>"
    exit 0
fi

# A non-zero exit status returns the original content.
exit 1

# Alternatively (slower), replace "exit 1" above with the lines below to send
# a Content-type header with the same MIME-type as the original document and
# then the file itself:
# echo "Content-type: $SERVER_CONTENT_TYPE"
# echo ""
# cat "$1"

--- SNIP ---


Frequently asked questions


Q: I setup middleman to use an external parser, but it doesn't always work.
A: Middleman will refuse to buffer files in memory that exceed 512KB; this is to prevent unreasonable download times and memory exhaustion when downloading an extremely large file. This limit can be changed or removed by editing the BUFFERMAX setting in include/settings.h before compiling Middleman.

Q: Some pages show strange numbers throughout the document, and it hangs when loading a page.
A: Middleman is an HTTP 1.1 proxy; some older browsers (such as Netscape 4.x) will not work correctly with the proxy. The only solution is to upgrade your browser.

Q: I get a "Bad response" error when loading a webpage that works without the proxy.
A: This probably means the webserver is using the old HTTP/0.9 protocol which doesn't require the webserver to send a header; there's no sane way to support this with the way Middleman is designed.

Q: I keep getting "URL redirection limit exceeded" errors for a page while using the proxy.
A: The default configuration includes a redirect entry which bypasses link tracking scripts by redirecting any request which has a URL within the URL directly to that URL; for example, requesting "http://www.somesite.com/redirect.pl?http://someothersite.com" will cause the proxy to send back a 302 redirect for "http://someothersite.com". In most cases this works as expected; however, on some sites, such as ones that make you go through a login process and have the URL you originally requested within the URL, this will not work. You can temporarily bypass this by prefixing "bypass[r].." to the URL, or permanently bypass it by adding a redirect entry above the link bypassing one with a URL pattern matching the host and no Redirect field.


Reporting bugs

If you encounter any problems while using Middleman, please contact me. If the problem results in a crash, please follow these steps to help me debug the problem:
1) Recompile middleman using the --enable-debug option in the configure script
2) Type "ulimit -c unlimited" in your shell before running the proxy; this will cause middleman to dump a core file when it crashes.
3) Email me the compiled binary, core file, and configuration file you were using at the time. The last few log entries would also be helpful.

Feature requests

If you have any ideas on how Middleman could be improved, please email me (address at top)... I'll do my best to respond.