Re: Sun M-class hardware denial of service
* Theo de Raadt:
> Oh I get it. You can use a "trust relationship with your
> administrators" to get around the fact that Sun sold a piece of
> hardware which does not provide the isolation they promised in their
> white papers and documentation.
Quoting from
<http://www.sun.com/servers/sparcenterprise/SPARCEnt-ResMan-Final.pdf>:
| Fault isolation and error management
|
| Domains are protected against software or hardware failures in other
| domains. Failures in hardware shared between domains cause failures only
| in the domains that share the hardware. When a domain encounters a fatal
| error, a domainstop operation occurs that cleanly and quickly shuts down
| only the domain with the error. Domainstop operates by shutting down the
| paths in and out of the system address controller and the system data
| interface application-specific integrated circuits (ASICs). The shutdown
| is intended to prevent further corruption of data, and to facilitate
| debugging by not allowing the failure to be masked by continued
| operation.
|
| When certain hardware errors occur in a Sun SPARC Enterprise system, the
| system controller performs specific diagnosis and domain recovery
| steps. The following automatic diagnosis engines identify and diagnose
| hardware errors that affect the availability of the system and its
| domains:
|
| • XSCF diagnosis engine — diagnoses hardware errors associated with
| domain stops
|
| • Solaris Operating System diagnosis engine — identifies non-fatal
| domain hardware errors and reports them to the system controller
|
| • POST diagnosis engine — identifies any hardware test failures that
| occur when the power-on self-test is run
|
| In most situations, hardware failures that cause a domain crash are
| detected and eliminated from the domain configuration either by power-on
| self test (POST) or OpenBoot PROM during the subsequent automatic
| recovery boot of the domain. However, there can be situations where
| failures are intermittent or the boot-time tests are inadequate to
| detect failures that cause repeated domain failures and reboots. In
| those situations, XSCF uses configurations or configuration policies
| supplied by the domain administrator to eliminate hardware from the
| domain configuration in an attempt to get a stable domain environment
| running.
The final paragraph suggests that the behavior is configurable, but
according to
<http://docs.sun.com/source/819-6202-13/21ch4p.html#0_pgfId-146307>,
the recorded fault cannot simply be cleared by the administrator:
| 4.6.2 Clearing the Fault/Degradation Information
|
| The information on a faulty or degraded component is cleared when the
| component is replaced. For a component replacement, please contact a
| field engineer.
All in all, the behavior in this case is mostly as promised in the
documentation.
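To make the documented recovery behavior concrete, here is a minimal
sketch of the policy the quoted paragraph describes. This is Python
with entirely invented names (recover_domain, post_failures, boots,
policy_suspects); the actual XSCF firmware logic is not public:

  # Hypothetical sketch of the recovery logic quoted above.  All names
  # are invented for illustration; the real XSCF interface is not public.

  def recover_domain(active, post_failures, boots, policy_suspects):
      """Try to bring a crashed domain back to a stable configuration."""
      # Normal case: POST or OpenBoot PROM flags the failed hardware
      # during the automatic recovery boot, and it is deconfigured.
      active -= post_failures(active)
      if boots(active):
          return active
      # Intermittent case: boot-time tests were inadequate, so fall
      # back to the administrator-supplied configuration policy and
      # eliminate suspect hardware until a stable domain comes up.
      for suspect in policy_suspects(active):
          active.discard(suspect)
          if boots(active):
              return active
      return None  # no stable configuration found

  # Example: "cpu2" passes POST but keeps crashing the domain.
  stable = recover_domain(
      {"cpu0", "cpu1", "cpu2", "mem0"},
      post_failures=lambda a: set(),        # boot-time tests find nothing
      boots=lambda a: "cpu2" not in a,      # stable only without cpu2
      policy_suspects=lambda a: sorted(a),  # policy: drop parts one by one
  )
  print(stable)  # {'mem0'}: cpu2 gone, but healthy parts eliminated too

Note that in the intermittent case, such a policy can end up
eliminating perfectly healthy components along the way, which is
precisely why one would want the behavior to be configurable.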
One plausible explanation for the fault is that part of the POST is
executed by the booting kernel. This design has the disadvantage that
it's possible to fault the domain from within, without any actual
hardware failure (it might suffice to run a kernel image that passes
the checksum verification, but does not signal boot success to the
XSCF). The advantage is that this part of the POST is easier to
upgrade (and potentially to validate as well), and it also makes sure
that system administrators do not paper over severe hardware issues
that would lead to support nightmares later on.
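Assuming this explanation is correct, the XSCF-side decision could
look roughly like the following sketch. Again, this is Python with
invented names only; nothing below reflects the actual firmware
interface:

  # Hypothetical decision logic for the speculated handshake.  The
  # function and parameter names are invented for illustration only.

  def xscf_verdict(image_checksum_ok, boot_success_signalled):
      """How the XSCF might classify a domain that fails to come up."""
      if not image_checksum_ok:
          return "bad boot image: retry, hardware not blamed"
      if boot_success_signalled:
          return "domain up: no action"
      # The image verified, but the kernel never reported success.  The
      # XSCF cannot distinguish flaky hardware from a kernel that simply
      # does not implement the handshake, so it blames the hardware.
      return "mark hardware faulty and deconfigure on recovery boot"

  # A foreign kernel that checksums fine but stays silent takes the
  # fault path, even on perfectly healthy hardware:
  print(xscf_verdict(image_checksum_ok=True, boot_success_signalled=False))

Under this model, a kernel that was never taught to perform the
handshake is indistinguishable from failing hardware.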
On the other hand, I generally prefer a "trust me, I know what I'm
doing" switch on the systems I deal with. It's really frustrating when
a system tries to protect itself from me and, as in this situation,
ends up failing to meet the actual requirements.