Re: Sun single-CPU DOS
On Mon, 22 May 2006, Mike O'Connor wrote:
> Doug,
>
> :> :ping another device with interpacket delay of 0 and a count
> ...
> :> Define what you mean by "interpacket delay". Are you referring to an
> ...
> :cisco router. extending ping. 0 delay.
> :I was speaking of cisco ping.
> :I should have said 'timeout'. mea culpa.
>
> Ahhhh.... between your using the term "interpacket" and your saying
> that Sun was talking about "jabber", I had assumed you were talking
> about the ethernet IPG / IFG. Ignore my "don't complain about your
> ethernet being DoS-ed if it's out-of-spec" remarks. :)
>
> :> For that matter, define "ping". You're certainly not talking about
> :running ping on the cisco to another device (preferably a fast
> :cisco as the source and a nice fast interface like a gige or
> :a IP/sonet)
>
> Cisco extended ping where you answer the prompts in a way to perform
> a flood ping... gotcha, makes somewhat more sense now.
>
> :dedicated, switched Ethernet here.
> :it seems to mostly overwhelm the sun's interrupt processing, but
> :that's just a theory since Sun has decided that the solution is to
> :unplug the machine on the other end.
> :
> :We're only talking about 14000 packets per second to kill a netra
> :T1. I've been able to drive one faster than that via other means
> :without causing a 'jabber effect'.
>
> How are you concluding that this is "jabber"? Does "netstat -k" show
> the jabber and relevant physical-layer counters incrementing on the
> ethernet interface in question, or is Sun just labelling that kind of
> traffic flow as jabber in some generic sense? The term "jabber" has
> specific meaning in some network contexts, and you should be able to
> determine if it really is or isn't physical-layer jabber if relevant
> netstat -k counters are incrementing.
>
Sun says it is jabber, which is why I put it in quotes. Since they have not
replicated it in the lab, they are jumping to conclusions. Yes, I agree,
the term is very specific, and the backline engineer's usage appears to be
'stretching things'.
> ...
> :> Now, that doesn't mean the -console-
> :> should go out to lunch (sounds like you're getting a little too much
> :> "The Network Is The Computer" :) ),
> ...
> :indeed. that's my issue, the console should not be hung. The machine
> :should not require a hard reset. And, I do not believe there is
> :an electrical problem. I'm not doing anything down that low, It's
> :just a TCP/IP stream, and, a not outrageous one at that.
>
> Well, it's an IP stream, not necessarily TCP/IP -- more like ICMP/IP.
> I don't see an option for Cisco's extended ping to do a "TCP" ping:
>
In this case it's tcp/ip.
step 1) telnet to router
step 2) ping some remote device on a fast link (like 2GB IP/Sonet)
step 3) watch as returning tcp/ip telnet stream DOS's the sun.
it is not the cisco ping that is DOS'ing the sun, it is the return stream
of !!..!.!!!....!!!..!!!... (ad infinitum)
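
For a sense of scale, here is a rough back-of-the-envelope sketch of what
the roughly 14000 packets per second mentioned earlier would amount to on
the wire. It assumes each ! or . arrives in its own TCP segment, with no
TCP or IP options, over untagged Ethernet. Those are my assumptions, not
measurements, and the function names are made up for illustration.

```python
# Rough wire-load estimate for a stream of tiny TCP segments.
# Assumptions (not measured): no TCP/IP options, untagged Ethernet.

ETH_HEADER = 14      # destination MAC + source MAC + EtherType
ETH_FCS = 4          # frame check sequence
ETH_MIN_FRAME = 64   # minimum Ethernet frame size, header and FCS included
PREAMBLE_SFD = 8     # preamble plus start-of-frame delimiter
INTERFRAME_GAP = 12  # minimum idle time between frames, in byte times
IP_HEADER = 20       # IPv4 header without options
TCP_HEADER = 20      # TCP header without options


def wire_bytes_per_frame(payload_bytes: int) -> int:
    """Bytes of wire time consumed by one TCP segment with this payload."""
    frame = ETH_HEADER + IP_HEADER + TCP_HEADER + payload_bytes + ETH_FCS
    frame = max(frame, ETH_MIN_FRAME)  # short frames are padded to the minimum
    return frame + PREAMBLE_SFD + INTERFRAME_GAP


def wire_bits_per_second(packets_per_second: int, payload_bytes: int) -> int:
    """Total wire load in bits per second at the given packet rate."""
    return packets_per_second * wire_bytes_per_frame(payload_bytes) * 8


if __name__ == "__main__":
    mbit = wire_bits_per_second(14000, 1) / 1e6
    print(f"{mbit:.2f} Mbit/s")
```

Under those assumptions it works out to about 9.4 Mbit/s, so the link is
nowhere near full. That is consistent with the cost being per packet
rather than per byte.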
> http://www.cisco.com/en/US/tech/tk365/technologies_tech_note09186a0080093f22.shtml
>
> (though I haven't had a need to memorize every nook and cranny of IOS
> in pursuit of some certification -- "I just make the stuff work...")
> Your talking about "disabling Nagle" in your initial post made it sound
> like a TCP stream, but if all you meant was setting the timeout to 0 on
> a typical ping to do a flood, then TCP and TCP-specific mechanisms like
> the Nagle algorithm aren't in play.
>
nagle comes into play because, with it disabled, the tcp stream does not
coalesce the single-char tcp/ip packets, each with a single ! or . in it.
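
For anyone who has not run into it, this is roughly what "disabling
Nagle" looks like on an ordinary sockets API. It is only a sketch in
Python using the standard TCP_NODELAY option. It is not how IOS does it
internally, and the helper names are hypothetical.

```python
import socket


def make_stream_socket(disable_nagle: bool) -> socket.socket:
    """Create a TCP socket, optionally with the Nagle algorithm disabled."""
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    # TCP_NODELAY=1 turns Nagle off: each small write may go out as its own
    # segment instead of being held back and coalesced with later writes.
    s.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY,
                 1 if disable_nagle else 0)
    return s


def nagle_disabled(s: socket.socket) -> bool:
    """Report whether TCP_NODELAY is currently set on the socket."""
    return s.getsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY) != 0


if __name__ == "__main__":
    sock = make_stream_socket(disable_nagle=True)
    print("Nagle disabled:", nagle_disabled(sock))
    sock.close()
```

With the option set, a sender that writes one character at a time can put
one packet on the wire per character, which is the pattern the ! and .
stream produces here.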
> :> My -suspicion- here is that it's the interrupts that the "stream of
> :> small TCP packets" generates that is leading to the system hang, but
> :> it'd take some kernel profiling to understand the specific impact.
> :> If the only way to generate the particular concentration of network
> :> interrupts along that ethernet interface involves outright breaking
> :> the ethernet spec, I can see where Sun rejects this as bogus from a
> :> -security- perspective.
> :>
> :See, that's where I have trouble. From a Security perspective, you'd
> :want to avoid the DOS via some kind of drop or disable mechanism
> :in the first place... IMHO.
>
> Well, I was talking about out-of-spec stuff when I wrote the above,
> but a similar thing would apply to network traffic that totally fills
> the pipe. I can drop/block/disable all the traffic in the world, but
> if more and more comes, my network is dead. Depending on the type of
> traffic, there may be absolutely nothing you can do to prevent the
> traffic from filling your network pipe and DoSing the interface. Of
> course, it shouldn't cause your console to silently hang up.
>
right. totally agreed. it should not cause the machine to totally lock up.
(I specified wrong earlier, btw. Break still works, just nothing else does)
> In this particular case, if you're talking about ICMP, and there
> really isn't a "jabber"/physical layer issue afoot, the idea is for
> some combination of you and|or Sun to:
>
> a) Find out if it's really a function of interrupt load.
>
> Someone with expertise with lockstat or other Sun kernel
> profiling tools should be able to discern that. Look at
> the lockstat output for your system during a flood ping
> vs. lockstat output from the system sitting there under
> a "normal" network load and see if something stands out.
>
> It could be the case that the problem isn't necessarily
> interrupts, but the system being stuck in a particular
> lock or codepath related to the traffic. lockstat being
> driven by someone with clue should shake that out.
>
getting that someone to not slap a 'jabber' label on things and
dismiss it out of hand is where I am currently frustrated beyond
belief.
> Depending on the level of interrupt starvation, it may
> be the case that lockstat won't log anything useful during
> the flood ping. But, there's a whole lot of options and
> I'm not any particular expert in Solaris kernel internals.
>
> b) If so, mitigate the interrupt load. Assuming you isolate
> as an interrupt problem, the first step is to see if you
> can make the Sun do less in the face of the interrupts.
> There's basically two big things that generate interrupts:
>
> 1) the interrupts generated from the inbound ping traffic
> coming to your Sun from the Cisco.
> 2) the interrupts generated by your Sun's attempts to
> do outbound 'replies' to the Cisco.
>
> There may be some ndd tunables or kernel packet filtering
> that can be enabled or tweaked to limit the interrupt load
> that the inbound and|or outbound ICMP traffic generates and
> that may be sufficient to get the console to stay working.
>
well, yes, this was all quite accidental in the first place.
The solution is really quite easy, don't disable nagle on the
cisco in the first place. However, I'm much more concerned about
the implications of a normal user being able to DOS the machine and
Sun not caring enough to do due diligence to address the issue.
> Alternatively, you will want to pursue "how do I mitigate
> interrupt load" with Sun. Perhaps they can monitor counters
> over time and not go into a console-stunning tizzy in the
> face of network interrupt floods. I dunno. Selectively
> throttling interrupts isn't the easiest dance in the world
> to do if you actually care about performance, and the last
> thing you want to do is introduce a DoS vector that might
> not have existed before. You may want to present this as
> more a "network" problem than "security" problem.
>
> Hope this helps... good luck!
>