Re: PHP security (or the lack thereof)
Crispin Cowan wrote:
> > Trying to make the language 'safe' won't fix it because the
> > language is not the problem. The real problem is the way PHP is
> > presented to most new developers.
> >
> > PHP has been introduced as a tool for the web developer. As a
> > language its goal is "to allow web developers to write dynamically
> > generated pages quickly." (
> > http://www.php.net/manual/en/faq.general.php ). The focus then is to
> > enable the web developer by giving him the tools he needs to create
> > dynamic content, with as little hassle as possible. The web
> > developer need only read a short tutorial (
> > http://www.php.net/manual/en/tutorial.php ) and he is ready to read,
> > understand and implement the ideas presented in the various example
> > scripts on PHP.net. Unfortunately this situation leaves the web
> > developer uninformed and unprepared to face the hostile environment
> > that is the net.
>
> That is a fascinating perspective.
>
> Web developers who work with static content (HTML and images, etc.) is
> pretty secure: the security threat amounts to Apache configuration
> (directory browsing and htpasswd stuff) and it is pretty difficult for
> an attacker to corrupt static content by way of the content.
>
> Dynamic content, while not inherently dangerous, becomes dangerous when
> you hand the web developer a Turing-complete language. Suddenly the
> exact behavior of the web site under arbitrary input becomes
> undecidable. Programmers (mostly) know this. Security developers
> (should) know this. Web artists may have just been introduced to
> programming to get their web site to be dynamic.
>
> There are two possible approaches to fixing this. One, as nabiy
> suggests, is to change how PHP is presented to web developers. Label it
> as a chain saw, and point out that chain saws don't know the difference
> between "log" and "leg" :)
>
> The other is to contrive a language that is both sufficient for dynamic
> web content development, and also *not* Turing-complete. I have no idea
> what such a language might look like, or even whether the intersection
> of these two requirements is the null set.

Eliminating Turing-completeness would be fairly straightforward:
prohibit unbounded recursion and iteration (i.e. no "while" loops). It
probably wouldn't have much impact upon the usability of the language
either; the kind of processing performed by most web applications
doesn't require anything beyond simple iteration over finite
lists/strings/arrays.

Unfortunately, it wouldn't have much impact upon the security of the
language either; you don't need anything beyond string concatenation
to fall vulnerable to XSS, SQL-injection or shell-injection attacks.
And you don't need unbounded iteration to make exhaustive analysis
impractical. Just because you can /theoretically/ determine something,
that doesn't mean that you can make the determination using existing
hardware in a reasonable time-frame.
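To illustrate the point about string concatenation, here is a minimal
sketch in Python using the standard sqlite3 module. The table and the
function names are made up for the example. Nothing in it needs an
unbounded loop, yet the first function is injectable.

```python
import sqlite3

def find_user_unsafe(conn, name):
    # Vulnerable: the caller's data is spliced into the SQL text itself,
    # so a quote character in `name` changes the statement's structure.
    query = "SELECT name FROM users WHERE name = '" + name + "'"
    return [row[0] for row in conn.execute(query)]

def find_user_safe(conn, name):
    # The statement and the data travel separately (a bound parameter),
    # so no quoting or escaping is involved at all.
    rows = conn.execute("SELECT name FROM users WHERE name = ?", (name,))
    return [row[0] for row in rows]

# Hypothetical in-memory table, just for the demonstration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (name TEXT)")
conn.executemany("INSERT INTO users VALUES (?)", [("alice",), ("bob",)])

payload = "x' OR '1'='1"
leaked = sorted(find_user_unsafe(conn, payload))   # every row comes back
matched = find_user_safe(conn, payload)            # no user has that name
```

With the payload above, the concatenated query matches every row,
while the parameterised one matches none.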
So far as writing secure web applications is concerned, it's likely to
be more fruitful to stop using a common "string" type for raw text,
HTML, URLs, URL-encoded form data, SQL statements, shell commands,
regexps, printf-style format strings, HTTP headers and so on. IOW,
stop using "in-band signalling".
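A rough sketch of what separate types might look like, in Python. The
Html class and the text() helper are hypothetical names invented for
this example; only html.escape comes from the standard library. The
idea is that raw text cannot be joined to markup without going through
an explicit conversion.

```python
import html
from dataclasses import dataclass

@dataclass(frozen=True)
class Html:
    """A string that is already valid HTML markup (hypothetical type)."""
    markup: str

    def __add__(self, other):
        if not isinstance(other, Html):
            # Refuse to mix a plain str into markup implicitly.
            return NotImplemented
        return Html(self.markup + other.markup)

def text(raw: str) -> Html:
    """The only route from raw text to Html: escape it."""
    return Html(html.escape(raw))

user_input = "<script>alert(1)</script>"
greeting = Html("<p>Hello, ") + text(user_input) + Html("</p>")

try:
    Html("<p>Hello, ") + user_input   # plain str: rejected
    mixed_ok = True
except TypeError:
    mixed_ok = False
```

Here greeting.markup contains the user input in escaped form, and the
attempt to concatenate the unescaped str raises TypeError instead of
silently producing broken markup.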
The problems aren't limited to web applications; I wouldn't be able to
count the number of times I've seen shell scripts (or C programs using
the printf/system idiom) which fail on filenames or other strings
which contain shell metacharacters (or begin with a leading hyphen).
Web applications are just a more extreme case, due to a combination
of:
a) relatively inexperienced programmers
b) having a whole bunch of extra syntaxes thrown in
c) the fact that the very nature of a web application means that
anyone, anywhere can throw malicious data at it.
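For the shell case mentioned above, the usual way out is to pass an
argument vector rather than a formatted command line. A minimal sketch
in Python follows; ECHO_ARGS is a stand-in for some external tool (a
child Python process that prints its arguments), and run_with_argv is
a name made up for the example.

```python
import subprocess
import sys

# Stand-in for an external tool: prints each argument on its own line.
ECHO_ARGS = [sys.executable, "-c",
             "import sys; print('\\n'.join(sys.argv[1:]))"]

def run_with_argv(filename):
    # No shell is involved: the filename is a single element of the
    # argument vector, so metacharacters in it are never parsed.
    # The "--" marks the end of options for tools that follow that
    # convention, so a name such as "-rf" is not taken for a flag.
    result = subprocess.run(ECHO_ARGS + ["--", filename],
                            capture_output=True, text=True, check=True)
    return result.stdout.splitlines()

awkward = "my file; echo pwned $(id).txt"
received = run_with_argv(awkward)
```

The child receives the awkward filename as one unmodified argument.
The printf/system idiom would instead have handed the shell a command
containing ";" and "$(...)".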
So far as designing a language which accounts for these issues is
concerned: IMHO, the most feasible solution is to stop passing
structured data around as "formatted" strings and to use data
structures (e.g. a parse tree) instead.

If you want to construct HTML, you construct the parse tree by
creating leaf nodes from strings and higher-level nodes from a tag
name, a list of attribute name/value nodes, and a list of child nodes.
Each constructor would validate its input according to the allowed
syntax. The process of generating HTML from the parse tree would
perform any necessary conversions (e.g. "<" -> "&lt;" within leaf
nodes).
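A much-simplified sketch of that scheme in Python. Text, Element and
render are hypothetical names, the name check is far looser than the
real HTML syntax, and void elements such as <br> are ignored. It is
only meant to show where the validation and the escaping would live.

```python
import html
import re
from dataclasses import dataclass
from typing import Tuple, Union

# Deliberately simple rule for tag and attribute names (an assumption,
# not the full HTML grammar).
_NAME = re.compile(r"[A-Za-z][A-Za-z0-9-]*")

@dataclass(frozen=True)
class Text:
    """Leaf node: raw text containing no HTML syntax."""
    value: str

@dataclass(frozen=True)
class Element:
    tag: str
    attrs: Tuple[Tuple[str, str], ...] = ()
    children: Tuple[Union["Element", Text], ...] = ()

    def __post_init__(self):
        # The constructor validates its input against the allowed syntax.
        if not _NAME.fullmatch(self.tag):
            raise ValueError(f"bad tag name: {self.tag!r}")
        for name, _ in self.attrs:
            if not _NAME.fullmatch(name):
                raise ValueError(f"bad attribute name: {name!r}")
        for child in self.children:
            if not isinstance(child, (Element, Text)):
                raise TypeError("children must be Element or Text nodes")

def render(node):
    """Generate HTML from the tree, escaping at the leaves."""
    if isinstance(node, Text):
        return html.escape(node.value, quote=False)
    attrs = "".join(f' {name}="{html.escape(value)}"'
                    for name, value in node.attrs)
    inner = "".join(render(child) for child in node.children)
    return f"<{node.tag}{attrs}>{inner}</{node.tag}>"

page = Element("p", (("title", 'a "quoted" title'),),
               (Text("1 < 2 & 3"), Element("b", (), (Text("bold"),))))
output = render(page)
```

The application code never writes "&lt;" or "&quot;" itself; the
conversions happen in render(), and a tag name such as "script><x" is
rejected when the node is constructed.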
For every formal language which is likely to be useful, the
development language would provide a parser, generator, and a library
of useful operations on the structured representation (find, add,
delete, modify nodes, etc). Any data entering or leaving the
structured representation as a string would be represented in its
"natural" form, i.e. it wouldn't contain any language "syntax".
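Some existing libraries already work this way for individual
languages. As one example, Python's urllib.parse provides a generator
and a parser for URL-encoded form data; the parameter names and values
below are made up for the illustration.

```python
from urllib.parse import parse_qs, urlencode

# Values in their "natural" form: no percent-escapes, no special
# treatment of "&", "=", "?" or spaces.
params = {"q": "fish & chips", "next": "/a?b=c"}

# Generator: structured data -> URL-encoded string.
encoded = urlencode(params)

# Parser: URL-encoded string -> structured data, natural form again.
decoded = parse_qs(encoded)
```

The encoded string is "q=fish+%26+chips&next=%2Fa%3Fb%3Dc", and
parsing it gives back the original values, each wrapped in a list
because a key may appear more than once.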
Apart from being more robust, such a language should also make life
easier for the application developer, as they don't have to implement
their own equivalents. Even programmers who don't understand the
security issues will typically have to deal with many of these issues
in order to get their code to simply work.
--
Glynn Clements <glynn@xxxxxxxxxxxxxxxxxx>