Re: Sun single-CPU DOS



On Wed, 24 May 2006, Mike O'Connor wrote:

> :Sun says it is jabber, which is why I put it in quotes. Since they have not
> :replicated it in the lab, they are jumping to conclusions. Yes, I agree,
> :it is very specific and the backline engineer's usage appears to be
> :'stretching things'.
> Most Sun adapters have an actual jabber counter that netstat -k will
> spew out for you.  You can eliminate ambiguity easily enough.  Here's
> an example I Google'd for:
>
indeed, and kstat shows a jabber count of 0. more ammo in my favor, which
I have presented back to the irritating backline.

> netstat -k eri0
> eri0:
> ipackets 525571 ierrors 365 opackets 8446 oerrors 0 collisions 85
> ifspeed 10000000 rbytes 73324309 obytes 1118022 multircv 99205
> multixmt 6 brdcstrcv 415863 brdcstxmt 10 norcvbuf 0 noxmtbuf 0
> inits 4 rx_inits 8 tx_inits 1
> nocarrier 1 nocanput 0 allocbfail 0 drop 321 pasue_rcv_cnt 0
> pasue_on_cnt 0 pasue_off_cnt 0 pasue_time_cnt 0 txmac_urun 0
> txmac_maxpkt_err 0 excessive_coll 0 late_coll 0 first_coll 35
> defer_timer_exp 0 peak_attempt_cnt 0 jabber 0 no_tmds 0
>
> (see, "jabber")
>
> tx_hang 0 rx_corr 0 no_free_rx_desc 0 rx_overflow 0 rx_hang 0
> rx_align_err 64 rx_crc_err 19 rx_length_err 0 rx_code_viol_err 0
> bad_pkts 321 runt 40 toolong_pkts 279 rxtag_error 0 parity_error 0
> pci_error_interrupt 0 unknown_fatal 0 pci_data_parity_err 0
> pci_signal_target_abort 0 pci_rcvd_target_abort 0 pci_rcvd_master_abort 0
> pci_signal_system_err 0 pci_det_parity_err 0 ipackets64 525571
> opackets64 8446 rbytes64 73324309 obytes64 1118022 pmcap 4
>
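
For anyone who wants to pull a single counter out of a dump like the one
quoted above, here is a rough sketch. It assumes the output is plain
"name value" pairs separated by whitespace, as in the sample; the helper
name parse_counters is made up for illustration and is not part of any
Sun tool.

```python
import re

# A few lines copied from the netstat -k sample quoted above.
SAMPLE = """eri0:
ipackets 525571 ierrors 365 opackets 8446 oerrors 0 collisions 85
defer_timer_exp 0 peak_attempt_cnt 0 jabber 0 no_tmds 0
rx_align_err 64 rx_crc_err 19 rx_length_err 0 rx_code_viol_err 0
bad_pkts 321 runt 40 toolong_pkts 279 rxtag_error 0 parity_error 0
"""

def parse_counters(text):
    """Turn 'name value name value ...' output into a dict of ints.

    Hypothetical helper: it only assumes whitespace-separated pairs, so a
    header such as 'eri0:' (no value after it) is skipped.
    """
    pairs = re.findall(r"\b([A-Za-z_]\w*)\s+(\d+)\b", text)
    return {name: int(value) for name, value in pairs}

counters = parse_counters(SAMPLE)

# Show the counter under dispute next to the receive-side error counters.
for name in ("jabber", "rx_align_err", "rx_crc_err", "runt", "toolong_pkts"):
    print(f"{name:>14} {counters[name]}")
```

A jabber value of 0 alongside non-zero receive errors is the kind of
evidence that can be handed back to support.
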
> :In this case it's tcp/ip.
> :
> :step 1) telnet to router
> :step 2) ping some remote device on a fast link (like  2GB IP/Sonet)
> :step 3) watch as returning tcp/ip telnet stream DOS's the sun.
> :
> :it is not the cisco ping that is DOS'ing the sun, it is the return stream
> :of !!..!.!!!....!!!..!!!...  (ad infinitum)
>
> Ahhh, so it's just the return traffic from the Cisco printing out all
> those !!..!.!!! stuff (corresponding to whatever it is the Cisco is
> pinging) that causes all this?  Nifty!  I didn't think that the Cisco
> could print that fast!  I'm fairly certain it should rate-limit/sample
> that output (unless some automated thingy actually cares about that
> output coming from the Cisco).
>

you'd be surprised how fast a gsr can spit out streams of !.!..!..!
(30,000 pps before sun craps out. ;)
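
To put that packet rate in perspective, here is a back-of-the-envelope
sketch. It assumes the Sun side is plain Ethernet, that each ! or .
arrives in its own TCP segment, and that the IP and TCP headers carry no
options; none of that is measured, it is just arithmetic on standard
header and minimum-frame sizes.

```python
# Standard Ethernet and TCP/IP sizes, in bytes.
ETH_HEADER = 14      # destination MAC, source MAC, EtherType
ETH_FCS = 4          # frame check sequence
ETH_MIN_FRAME = 64   # shorter frames are padded up to this
PREAMBLE_SFD = 8     # preamble plus start-of-frame delimiter
INTERFRAME_GAP = 12  # mandatory idle time between frames
IP_HEADER = 20       # assumption: no IP options
TCP_HEADER = 20      # assumption: no TCP options

def wire_bits_per_second(pps, payload_bytes):
    """Bits per second on the wire for pps packets of payload_bytes each."""
    frame = ETH_HEADER + IP_HEADER + TCP_HEADER + payload_bytes + ETH_FCS
    frame = max(frame, ETH_MIN_FRAME)  # pad short frames to the minimum
    return pps * (frame + PREAMBLE_SFD + INTERFRAME_GAP) * 8

PPS = 30_000                            # rate quoted above
wire_rate = wire_bits_per_second(PPS, 1)
payload_rate = PPS * 1 * 8              # one character per packet

print(f"on the wire: {wire_rate / 1e6:.2f} Mbit/s")
print(f"payload:     {payload_rate / 1e3:.0f} kbit/s")
```

Under those assumptions the stream is roughly 20 Mbit/s on the wire while
carrying only about 30 KB/s of actual characters, so the load on the
receiver is about packet count rather than byte count.
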

> :the nagle comes into play in the tcp-stream not coalescing all the
> :single char tcp/ip packets each with a single ! or . in it.
>
> Makes perfect sense now that I get what the traffic is.  As an aside,
> the Nagle algorithm was designed with telnet explicitly in mind, per
> RFC 896.  But, a lot of folks these days use telnet for stuff apart
> from interactive use, and I could see someone wanting to disable it
> for performance' sake.  For bare-bones stack implementations, Nagle
> may not be there at all.
>
yep. It's just not turned on on routers by default, so this one caught
us a little bit by surprise when engineers were running a burn-in
test in the lab on an OC-192 card.
(_usually_ you don't cream a router with lots of little packets via
telnet)
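
For reference, on an ordinary host "disabling Nagle" is a per-socket
option, TCP_NODELAY. The sketch below only shows that generic socket
option; it says nothing about how the router is configured, and the
function name make_tcp_socket is made up for illustration.

```python
import socket

def make_tcp_socket(disable_nagle):
    """Create an unconnected TCP socket, optionally with Nagle turned off.

    With TCP_NODELAY set, small writes are sent as soon as possible
    instead of being held back and coalesced while earlier data is still
    unacknowledged, so a stream of single characters can turn into a
    stream of single-character segments.
    """
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    if disable_nagle:
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)
    return sock

if __name__ == "__main__":
    s = make_tcp_socket(disable_nagle=True)
    # A non-zero value means Nagle is disabled on this socket.
    print(s.getsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY))
    s.close()
```
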

> :right. totally agreed. it should not cause the machine to totally lock up.
> :(I specified wrong earlier, btw. Break still works, just nothing else does)
>
> That makes it sound even more like an interrupt issue rather than some
> overall system lock.
>
also to me.

> :> In this particular case, if you're talking about ICMP, and there
> :> really isn't a "jabber"/physical layer issue afoot, the idea is for
> ...
> :getting that someone to not slap a 'jabber' label on things and
> :dismiss it out of hand is where I am currently frustrated beyond
> :belief.
>
> Beyond netstat -k, you can probably use lockstat or other kernel
> profiling tools as I mentioned in my earlier post to give them a
> good idea of where the bug really is.  Interrupt issues aren't
> always going to be cut and dried.  There could be some particular
> flavor of IOS, network adapter, media type, CPU, OS, etc. that
> is more prone or less prone to the problem.
>
> :well, yes, this was all quite accidental in the first place.
> :The solution is really quite easy, don't disable nagle on the
> :cisco in the first place. However, I'm much more concerned about
> :the implications of a normal user being able to DOS the machine and
> :Sun not caring enough to do due diligence to address the issue.
>
> Judging from the number of times we've exchanged emails (I should
> have asked for a network diagram sooner to help visualize this :) ),
> sometimes it's not so easy.  And "what is or isn't a DoS" can be a
> grey line where reasonable people may differ.  I could readily see
> someone saying "if you point a stupid amount of traffic at something
> it dies, have you considered just not doing that?".
>

yup. I've got plenty of ammo to throw at irritating and dubiously
self-righteous backline, but sometimes the only way to raise matters
above somebody who doesn't want to admit there is a problem, is
to provide a little community pressure to fix it. (even if it
isn't critical or may be hard to reproduce without appreciably
fast equipment on hand).

A DOS that makes a machine unusable is a DOS. Mis-categorizing it
(on their part) as jabber is wrong as well as condescending (left
that part out) and just plain irritating from a company that usually
takes operating system availability much more seriously.

        Doug