Updated: 10/2/04; 11:45:28 AM.
cremes' blog
An online journal covering my experiences with I/OKit, CoreAudio and OpenDarwin.
        

Friday, February 27, 2004

And that's where it has ended. The driver definitely performs better on the UDP_STREAM test. It doesn't drop any packets until the packet size falls to around 227 bytes. Above that value the machine can process all the packets and respond. Below it the CPU maxes out and packets get dropped (netperf -- -m 185 and smaller). The 'netstat -m' output also shows that a lot of memory requests were denied.

At this point I have shelved the performance tuning.

10:26:32 PM



    From:       Andrew Gallatin
    Subject:     Re: RX performance needs fixing!  Re: ethernet driver: kIOReturnOutputStall responsibility
    Date:     February 27, 2004 11:38:50 AM CST
    To:       Chuck Remes
    Cc:       darwin-drivers@lists.apple.com


Andrew Gallatin writes:
result in multiple stalls.  I think a G4 has a huge cache line size

Typo.  The G4's cache line is 32 bytes; the G5's is 128.

Drew




10:22:33 PM



    From:       Andrew Gallatin
    Subject:     Re: RX performance needs fixing!  Re: ethernet driver: kIOReturnOutputStall responsibility
    Date:     February 27, 2004 10:57:15 AM CST
    To:       Chuck Remes
    Cc:       darwin-drivers@lists.apple.com

chuck remes writes:
##  61.1%   ml_set_interrupts_enabled ---- mach_kernel
     61.1%     wait_queue_wakeup_all ----- mach_kernel
     60.8%       m_clalloc ----- mach_kernel
     49.9%         m_getpackets ----- mach_kernel
     49.9%           getPacket(unsigned long, unsigned long, unsigned long, unsigned long) ---- com.apple.iokit.IONetworkingFamily
     49.9%             IONetworkController::replaceOrCopyPacket(mbuf**, unsigned long, bool*) ---- com.apple.iokit.IONetworkingFamily
     49.9%               darwin_tulip::_handleRxInterrupt() ---- darwin.tulip

If I'm reading this right (and I'm not at all sure I am; I'm not
familiar with Shark output), it looks like m_clalloc() is called with
the mbuf cluster pool exhausted, and is spending a lot of time
waking a thread to allocate more mbuf clusters, and a lot of time
adding those new clusters into the IOMMU page tables.

m_clalloc() is expensive only when you are out of mbuf clusters;
if you never exhaust the pool, it never does anything expensive.

Remind me: are you leaking on receive?  There's been so much
going on in this thread that I'm getting lost ;)

Drew
_______________________________________________
darwin-drivers mailing list | darwin-drivers@lists.apple.com
Help/Unsubscribe/Archives: http://www.lists.apple.com/mailman/listinfo/darwin-drivers
Do not post admin requests to the list. They will be ignored.



10:21:43 PM



    From:       Andrew Gallatin
    Subject:     Re: RX performance needs fixing!  Re: ethernet driver: kIOReturnOutputStall responsibility
    Date:     February 27, 2004 10:37:33 AM CST
    To:      Chuck Remes
    Cc:       darwin-drivers@lists.apple.com


chuck remes writes:

Not as simple a picture, really. It appears 23.3% of the time is spent
directly in the _handleRxInterrupt() method. Zeroing in on the
source, it is spending the majority of its time on the test I have at
the top of my 'for' loop, which looks like this:

     for ( ;
           !( ( status = rx_desc_ring[ i ].status  ) & rx_STAT_OWN );
           processed++, RXINC( i ) )


I can't believe it really takes that much time to AND a 32-bit value
against a single bit and test to see if it is zero or not. This leads
me to conclude that the reason this routine is showing up at the top of ...

One comment on just this: you said in a subsequent post that your
device DMAs the status of the receive back up to the host.  If so, that
will invalidate the cache, and cause a huge stall.  So I expect most
of what you are seeing is the penalty for this cache miss.

Also, the device may be DMA'ing a new event which *could be in the
same cache line* as the one you are currently reading.  This could
result in multiple stalls.  I think a G4 has a huge cache line size
(128 bytes), so this could be a real problem.

Can you try to align your descriptors on cache-line or better
boundaries, so that a single descriptor does not straddle a cache
line?  Also, using more descriptors might allow the device to get
"further ahead" so that you'd reduce the likelyhood that you were
reading from the same cache line it was DMA'ing to.


Drew
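Drew's cache-alignment suggestion can be sketched in plain user-space C. This is a minimal sketch, not the driver's actual definitions: the 128-byte line size, the ring size, and the descriptor field names are all assumptions here.

```c
#define _POSIX_C_SOURCE 200112L
#include <stdint.h>
#include <stdlib.h>

#define CACHE_LINE   128   /* assumed G5 line size; a G4 would be 32 */
#define RX_RING_SIZE 64

/* A tulip-style descriptor is ~16 bytes; pad it out to a full cache
 * line so the NIC's DMA write to one descriptor never dirties the line
 * the CPU is polling for its neighbor. Field names are assumptions. */
typedef struct {
    volatile uint32_t status;      /* the OWN bit lives here */
    uint32_t control;
    uint32_t buf_addr;
    uint32_t next_addr;
    uint8_t  pad[CACHE_LINE - 4 * sizeof(uint32_t)];
} rx_desc_t;

/* Allocate the ring so its base address sits on a cache-line boundary. */
rx_desc_t *alloc_rx_ring(void)
{
    void *ring = NULL;
    if (posix_memalign(&ring, CACHE_LINE, RX_RING_SIZE * sizeof(rx_desc_t)) != 0)
        return NULL;
    return (rx_desc_t *)ring;
}
```

Padding each descriptor to a full line wastes some memory, but it guarantees a single descriptor never straddles a line and no two descriptors share one.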



10:20:38 PM



    From:       Andrew Gallatin
    Subject:     Re: TX performance fixed!  Re: ethernet driver:  kIOReturnOutputStall responsibility
    Date:     February 27, 2004 8:52:58 AM CST
    To:       Chuck Remes
    Cc:       darwin-drivers@lists.apple.com

chuck remes writes:

cremes% ./netperf -H 192.168.2.38 -t UDP_STREAM -- -m 1024
UDP UNIDIRECTIONAL SEND TEST to 192.168.2.38
Socket  Message  Elapsed      Messages
Size    Size     Time         Okay Errors   Throughput
bytes   bytes    secs            #      #   10^6bits/sec

   9216    1024   10.00     1834805      0    1503.06
  42080           10.00      115053             94.25


Ick.  Nothing is returning ENOBUFS, so the application has no idea
that the packets are flying into the bit bucket.  Apple's drivers
behave exactly the same, so it's not your fault.


As a counter example, here is what my driver does (2xG5 sending to a
P4 running FreeBSD):

% netperf -Hscream-my -tUDP_STREAM -- -m 8192
UDP UNIDIRECTIONAL SEND TEST to scream-my
Socket  Message  Elapsed      Messages               
Size    Size     Time         Okay Errors   Throughput
bytes   bytes    secs            #      #   10^6bits/sec

  9216    8192   10.00      302587 804033    1983.02
 41600           10.00      302428           1981.98


This is a 2 Gb/sec link with a 9K mtu.  So ~1980 is a decent number.

Do you notice the 804033 errors?

This is the difference between an IOKit driver, and a BSD driver.
IOKit seems to have re-implemented the if_snd queue, and hidden it
from the stack.  Because my driver is using the if_snd queue for its
queuing, ip_output() notices when the queue fills:

        /*
         * Verify that we have any chance at all of being able to queue
         *      the packet or packet fragments
         */
        if ((ifp->if_snd.ifq_len + ip->ip_len / ifp->if_mtu + 1) >=
                ifp->if_snd.ifq_maxlen) {
                        error = ENOBUFS;
                        goto bad;
        }


This is arguably better than a silent drop at the driver or IOKit
level, as it avoids a big trip through the lower levels of the stack.
(dlil processing, arp lookups, etc).

There seem to be at least 3 different behaviours for
sending datagrams faster than the link can handle:

1) ENOBUFS -- from all BSDs, and MacOSX with a BSD network driver.
2) Silent drops -- MacOSX with an IOKit driver
3) Blocking -- Linux

I personally like ENOBUFS best.  At least the app has some
clue that there is a problem.  Blocking and silent drops
just seem wrong to me.  But I'm an old BSD hack, so take
my opinion with a grain of salt ;)

Drew
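The ENOBUFS behaviour Drew prefers is easy for an application to cooperate with. Here is a minimal sketch of the retry pattern, with the send operation injected as a function pointer so the logic can be exercised without a real socket; all names here are illustrative, not from any of the drivers in this thread.

```c
#include <errno.h>
#include <stddef.h>

typedef long (*send_fn)(const void *buf, size_t len);

/* Retry while the interface queue is full; give up after max_tries.
 * Returns the byte count on success, or -1 with errno set. */
long send_with_retry(send_fn send, const void *buf, size_t len, int max_tries)
{
    for (int i = 0; i < max_tries; i++) {
        long n = send(buf, len);
        if (n >= 0 || errno != ENOBUFS)
            return n;              /* success, or a non-queue error */
        /* queue full: a real app would nanosleep() or poll() here */
    }
    errno = ENOBUFS;
    return -1;
}
```

The point is simply that the application gets a hook: with silent drops (the IOKit behaviour above) there is nothing to branch on.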



10:19:24 PM



    From:       Chuck Remes
    Subject:     Re: RX performance needs fixing!  Re: ethernet driver: kIOReturnOutputStall responsibility
    Date:     February 26, 2004 11:04:57 PM CST
    To:       darwin-drivers@lists.apple.com

On Feb 26, 2004, at 9:53 PM, Justin Walker wrote:

On Thursday, February 26, 2004, at 06:54 PM, chuck remes wrote:
    for ( ;
          !( ( status = rx_desc_ring[ i ].status  ) & rx_STAT_OWN );
          processed++, RXINC( i ) )

I can't believe it really takes that much time to AND a 32-bit value against a single bit and test to see if it is zero or not.

If this is actually reading device registers, then expect it to take a while.  You are crossing bus boundaries, and that is a non-trivial expense (especially on RISC-style systems).

No, this isn't reading any device register. When the hardware DMAs the packet into the allocated buffer, it clears its "ownership" bit on the descriptor associated with the buffer. The descriptor lives in main memory, so this read cost should be low.

This leads me to conclude that the reason this routine is showing up at the top of the list is that the system is being overrun with RX interrupts. Sound reasonable? A fix for this would be to use a hardware clock that generated a single RX interrupt every X packets or Y milliseconds, but the ADMtek doesn't have that facility (though some other tulips do). However, I could set up a new timer function and have it fire every 10 ms or so.

I'm not the expert on your code (:-}), so this is just shooting in the dark.
 - if you are really overrun by interrupts, then the check for
   the OWN bit should succeed (or fail; don't know its definition)
   fairly often.  When you find a descriptor you don't own, do
   you bail or wait/spin?

When that test fails, I bail and wait for the next interrupt. While it remains true, I loop through all the RX descriptors and call inputPacket() on them.

 - I've lost track - what's the speed?  If it's gigabit, it may
   be that polling will give you an improvement (a la FreeBSD)
   if done right.

It's for a line of 10/100 cards. I'll work on a gigabit card when I can afford a switch and some cards to test with. :-)

[snip]
I've not directly dealt with gigabit engines, but with lower-speed devices, you should be able to have a fairly efficient receive process by emptying the receive queue on each interrupt.  Are you sure that each interrupt is supplying you with a newly received frame?  It might be instructive to look at a histogram of the number of frames you take off the receive queue on each receive interrupt.  I've done that in the past, and it helped to home in on (some of) the performance problems.

The tulip chipsets generate an interrupt for each received packet. I empty the queue/list each time I receive one. This is serialized through the workloop construct. The secondary interrupt is scheduled by a primary interrupt filter (necessary for multi-port cards).
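Justin's histogram idea is cheap to instrument. A user-space sketch of the bookkeeping a driver could keep at the end of each RX interrupt (the bucket count is an arbitrary choice, not anything from my driver):

```c
#define HIST_BUCKETS 16   /* arbitrary; the last bucket catches big batches */

static unsigned long rx_hist[HIST_BUCKETS];

/* Call once per RX interrupt with the number of frames just drained. */
void record_rx_batch(unsigned frames)
{
    unsigned bucket = frames < HIST_BUCKETS ? frames : HIST_BUCKETS - 1;
    rx_hist[bucket]++;
}
```

If the counts pile up in bucket 1, each interrupt really is delivering exactly one frame and interrupt rate is the problem; a spread toward higher buckets would mean the hardware is already coalescing.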

Thanks for your input. I need to get a little distance from this code and just let my subconscious mull it over.

cr



10:17:51 PM



    From:       Justin Walker
    Subject:     Re: RX performance needs fixing!  Re: ethernet driver: kIOReturnOutputStall responsibility
    Date:     February 26, 2004 9:53:15 PM CST
    To:       darwin-drivers@lists.apple.com

On Thursday, February 26, 2004, at 06:54 PM, chuck remes wrote:
[snip]
I traced it out through the source code, and I see that replaceOrCopyPacket() internally calls getPacket() with M_NOWAIT. When netperf sends a message of size 185 bytes, adding in the headers gives a total of 227 bytes. This is the length passed in to replaceOrCopyPacket(), which exceeds the MHLEN threshold. It therefore tries to get a packet of length m->m_pkthdr.len, which in this case is 1518 bytes.


Run the test again, and this time set the message size to be 64 bytes. Adding in the 42-byte header gives a total of 106 bytes. This falls beneath the MHLEN test, so it should do m_gethdr() and place the payload inside it. So I ran Shark again and this is the result it gave:
[snip]
Not as simple a picture, really. It appears 23.3% of the time is spent directly in the _handleRxInterrupt() method. Zeroing in on the source, it is spending the majority of its time on the test I have at the top of my 'for' loop, which looks like this:

    for ( ;
          !( ( status = rx_desc_ring[ i ].status  ) & rx_STAT_OWN );
          processed++, RXINC( i ) )

I can't believe it really takes that much time to AND a 32-bit value against a single bit and test to see if it is zero or not.

If this is actually reading device registers, then expect it to take a while.  You are crossing bus boundaries, and that is a non-trivial expense (especially on RISC-style systems).

This leads me to conclude that the reason this routine is showing up at the top of the list is that the system is being overrun with RX interrupts. Sound reasonable? A fix for this would be to use a hardware clock that generated a single RX interrupt every X packets or Y milliseconds, but the ADMtek doesn't have that facility (though some other tulips do). However, I could set up a new timer function and have it fire every 10 ms or so.

I'm not the expert on your code (:-}), so this is just shooting in the dark.
 - if you are really overrun by interrupts, then the check for
   the OWN bit should succeed (or fail; don't know its definition)
   fairly often.  When you find a descriptor you don't own, do
   you bail or wait/spin?
 - I've lost track - what's the speed?  If it's gigabit, it may
   be that polling will give you an improvement (a la FreeBSD)
   if done right.
 - I would not bother with the hardware clock trick, at least
   until you understand what the real problem is.  You are just
   introducing more moving parts into an already complex state machine.

I've not directly dealt with gigabit engines, but with lower-speed devices, you should be able to have a fairly efficient receive process by emptying the receive queue on each interrupt.  Are you sure that each interrupt is supplying you with a newly received frame?  It might be instructive to look at a histogram of the number of frames you take off the receive queue on each receive interrupt.  I've done that in the past, and it helped to home in on (some of) the performance problems.

Moving further down, we see ppc_usimple_unlock_rwmb() is taking up tons of time being called from m_gethdr(). Likewise, m_retry() is contributing a lot of time to the overall 15.7% for that section of code.

Next, _enable_preemption is chewing up a respectable 10.7% of the sampled time being called by getPacket() and m_retryhdr().

Lastly, ppc_usimple_lock() gets its shot at the big time being called by m_gethdr(), m_retry(), and dlil_input().

What does this all add up to? If I knew, I wouldn't have spammed the list.

I think the above is just a side-effect of something else.  Of course, it doesn't hurt to verify that there is not some odd side-effect of your code causing the system to go bonkers...

Regards,

Justin

--
Justin C. Walker, Curmudgeon-At-Large  *
Institute for General Semantics        |    Men are from Earth.
                                       |    Women are from Earth.
                                       |       Deal with it.
*--------------------------------------*-------------------------------*



10:16:00 PM



    From:       Chuck Remes
    Subject:     RX performance needs fixing!  Re: ethernet driver: kIOReturnOutputStall responsibility
    Date:     February 26, 2004 8:54:16 PM CST
    To:       darwin-drivers@lists.apple.com

Okay, now that TX performance is out of the way (see previous message), time to get back to the RX problem.

Summary:
The driver can barely handle receiving UDP packets when the following command is issued from another computer on the network:

netperf -H <addr of OSX box> -t UDP_STREAM -- -m 185 -s 2560 -S 2560

It falls over completely (i.e. kernel_task goes to 102% and 'netstat -m' reports tons of denied memory requests) when the message size goes to 184 bytes ('-m 184'). Very few transmits occur during this time because the system is apparently out of mbufs.

New Information:
I've got lots of fancy debugging tools on the machine courtesy of Apple Computer. I set up my Interrupt method to set/unset the "marked bit" as the routine starts & finishes. I then fire up Shark to sample the marked bits during some heavy load. What do I see?  (I hope the formatting sticks after it goes through the mailer.) The Shark files that generated this data are available if anyone wants to take a look (with the source embedded for easy reference).

##  61.1%   ml_set_interrupts_enabled ---- mach_kernel
    61.1%     wait_queue_wakeup_all ----- mach_kernel
    60.8%       m_clalloc ----- mach_kernel   
    49.9%         m_getpackets ----- mach_kernel
    49.9%           getPacket(unsigned long, unsigned long, unsigned long, unsigned long) ---- com.apple.iokit.IONetworkingFamily
    49.9%             IONetworkController::replaceOrCopyPacket(mbuf**, unsigned long, bool*) ---- com.apple.iokit.IONetworkingFamily
    49.9%               darwin_tulip::_handleRxInterrupt() ---- darwin.tulip
    49.9%                 darwin_tulip::_interruptOccurred(IOInterruptEventSource*, long) ---- darwin.tulip
    49.9%                   darwin_tulip::_wrapInterruptMethod(OSObject*, IOInterruptEventSource*, int) ---- darwin.tulip
    49.9%                     IOInterruptEventSource::checkForWork() ----- mach_kernel
    49.9%                       IOWorkLoop::threadMain() ---- mach_kernel
    10.9%         m_mclalloc ---- mach_kernel
    10.2%       dlil_input ----- mach_kernel   
    0.0%      thread_continue ----- mach_kernel
    0.0%      IOWorkLoop::threadMain() ----- mach_kernel


##  11.3%   ppc_usimple_unlock_rwmb --- mach_kernel
    4.7%      m_gethdr --- mach_kernel
    4.7%        m_getpackets --- mach_kernel
    4.7%          getPacket(unsigned long, unsigned long, unsigned long, unsigned long) ---- com.apple.iokit.IONetworkingFamily
    4.7%            IONetworkController::replaceOrCopyPacket(mbuf**, unsigned long, bool*) ---- com.apple.iokit.IONetworkingFamily
    4.3%      m_getpackets    ---- mach_kernel
    4.3%        getPacket(unsigned long, unsigned long, unsigned long, unsigned long) ---- com.apple.iokit.IONetworkingFamily
    4.3%          IONetworkController::replaceOrCopyPacket(mbuf**, unsigned long, bool*) ----- com.apple.iokit.IONetworkingFamily   

I'm listing the top 2 lines expanded out to show the stack. As expected, the m_getpackets() function is where all the time is being spent.

I traced it out through the source code, and I see that replaceOrCopyPacket() internally calls getPacket() with M_NOWAIT. When netperf sends a message of size 185 bytes, adding in the headers gives a total of 227 bytes. This is the length passed in to replaceOrCopyPacket(), which exceeds the MHLEN threshold. It therefore tries to get a packet of length m->m_pkthdr.len, which in this case is 1518 bytes.


Run the test again, and this time set the message size to be 64 bytes. Adding in the 42 byte header gives a total of 106 bytes. This falls beneath the MHLEN test, so it should do m_gethdr() and place the payload inside it. So I ran Shark again and this is the result it gave:

    23.3%   darwin_tulip::_handleRxInterrupt() ---- darwin.tulip
    22.8%     darwin_tulip::_interruptOccurred(IOInterruptEventSource*, long) --- darwin.tulip
    22.8%       darwin_tulip::_wrapInterruptMethod(OSObject*, IOInterruptEventSource*, int) --- darwin.tulip
    22.8%         IOInterruptEventSource::checkForWork() --- mach_kernel
    22.8%           IOWorkLoop::threadMain() --- mach_kernel
    0.5%      darwin_tulip::_handleRxInterrupt() --- darwin.tulip

    15.7%   ppc_usimple_unlock_rwmb --- mach_kernel
    7.9%      m_gethdr --- mach_kernel
    7.9%        getPacket(unsigned long, unsigned long, unsigned long, unsigned long) --- com.apple.iokit.IONetworkingFamily
    7.9%          IONetworkController::replaceOrCopyPacket(mbuf**, unsigned long, bool*) --- com.apple.iokit.IONetworkingFamily
    7.9%            darwin_tulip::_handleRxInterrupt() --- darwin.tulip
    5.9%      m_retry --- mach_kernel
    2.0%      dlil_input --- mach_kernel
    0.0%      IOBasicOutputQueue::service(unsigned long) --- com.apple.iokit.IONetworkingFamily

    10.7%   _enable_preemption --- mach_kernel
    5.2%      getPacket(unsigned long, unsigned long, unsigned long, unsigned long) --- com.apple.iokit.IONetworkingFamily
    3.9%      m_retryhdr --- mach_kernel
    1.2%      IONetworkInterface::inputPacket(mbuf*, unsigned long, unsigned long, void*) --- com.apple.iokit.IONetworkingFamily
    0.2%      m_gethdr --- mach_kernel
    0.2%      m_retry --- mach_kernel
    0.0%      dlil_input --- mach_kernel

    10.0%   ppc_usimple_lock --- mach_kernel
    5.2%      m_gethdr --- mach_kernel
    2.7%      m_retry --- mach_kernel
    2.1%      dlil_input --- mach_kernel

Not as simple a picture, really. It appears 23.3% of the time is spent directly in the _handleRxInterrupt() method. Zeroing in on the source, it is spending the majority of its time on the test I have at the top of my 'for' loop, which looks like this:

    for ( ;
          !( ( status = rx_desc_ring[ i ].status  ) & rx_STAT_OWN );
          processed++, RXINC( i ) )


I can't believe it really takes that much time to AND a 32-bit value against a single bit and test to see if it is zero or not. This leads me to conclude that the reason this routine is showing up at the top of the list is that the system is being overrun with RX interrupts. Sound reasonable? A fix for this would be to use a hardware clock that generated a single RX interrupt every X packets or Y milliseconds, but the ADMtek doesn't have that facility (though some other tulips do). However, I could set up a new timer function and have it fire every 10 ms or so.
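For reference, the OWN-bit scan in that loop behaves like this user-space mock. The ring size, field names, and bit value are assumptions reconstructed from the snippet, not the real driver's definitions:

```c
#include <stdint.h>

#define RX_RING_SIZE 8
#define rx_STAT_OWN  0x80000000u   /* set => hardware still owns the descriptor */
#define RXINC(i)     ((i) = ((i) + 1) % RX_RING_SIZE)

struct rx_desc { volatile uint32_t status; };

static struct rx_desc rx_desc_ring[RX_RING_SIZE];

/* Walk the ring from `start`, counting completed descriptors, and stop
 * at the first one the hardware still owns (or after one full wrap). */
unsigned drain_rx_ring(unsigned start)
{
    unsigned i = start, processed = 0;
    uint32_t status;

    for ( ;
          !( ( status = rx_desc_ring[ i ].status ) & rx_STAT_OWN );
          processed++, RXINC( i ) ) {
        if (processed == RX_RING_SIZE)
            break;                 /* guard against an endless wrap */
        (void)status;              /* real driver: inputPacket() + re-arm here */
    }
    return processed;
}
```

The AND itself is trivial; it's the load of a line the hardware may be DMA'ing into that hurts.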

Moving further down, we see ppc_usimple_unlock_rwmb() is taking up tons of time being called from m_gethdr(). Likewise, m_retry() is contributing a lot of time to the overall 15.7% for that section of code.

Next, _enable_preemption is chewing up a respectable 10.7% of the sampled time being called by getPacket() and m_retryhdr().

Lastly, ppc_usimple_lock() gets its shot at the big time being called by m_gethdr(), m_retry(), and dlil_input().

What does this all add up to? If I knew, I wouldn't have spammed the list.

cr



10:14:13 PM



    From:       Chuck Remes
    Subject:     TX performance fixed!  Re: ethernet driver:  kIOReturnOutputStall responsibility
    Date:     February 26, 2004 6:29:18 PM CST
    To:       darwin-drivers@lists.apple.com

I took a break from working on the RX performance problems and took a look at the TX stalls and how I was handling them. I learned a few things that the list may want to add to its collective wisdom.

1. Do not call releaseFreePackets() from the client thread AND the driver workloop thread.

Earlier in this thread I posted all the code from my outputPacket() method. One of the things I tried to alleviate the stalling condition was to call my TX ring cleanup routine directly from within outputPacket(). I theorized that the sooner I could free up those resources, the better off I'd be. In the cleanup routine, I call freePacket( mbuf, kDelayFree ) from within a cleanup loop, and then call releaseFreePackets() at the end of it. It was possible (hell, probable) that _handleTxCleanup() would be preempted by outputPacket() trying to run the same routine.

Panic city! The error was "panic(cpu 0): freeing free mbuf" in the panic.log. Don't do this.

2. Use IOBasicOutputQueue::service( IOBasicOutputQueue::kServiceAsync ) when making this call within the workloop context.

This call clues in the upper layers that the hardware is now ready to begin sending new packets. I discovered I had commented it out during a debug session about a week ago. This resulted in service() only being called by the watchdog timer routine. Also, be sure to call it with the kServiceAsync option.

Performance is better. There was a note posted by an Apple engineer about 5 months ago that gave this clue about the async option. He was so right it hurts.


These two changes made 'netperf' perform significantly better. Here's the output:

cremes% ./netperf -H 192.168.2.38 -t UDP_STREAM -- -m 1024
UDP UNIDIRECTIONAL SEND TEST to 192.168.2.38
Socket  Message  Elapsed      Messages
Size    Size     Time         Okay Errors   Throughput
bytes   bytes    secs            #      #   10^6bits/sec

  9216    1024   10.00     1834805      0    1503.06
 42080           10.00      115053             94.25

Before the change, the send throughput was in the neighborhood of 95 Mbps and the rx throughput was about 1 Mbps.

Many thanks to Andrew Gallatin for entertaining my questions today.

Now all I have left to do is squash the UDP receive performance problems... and those may be intractable due to the hardware requirements for buffer alignment.

cr



10:12:46 PM



    From:       Chuck Remes
    Subject:     Re: ethernet driver: kIOReturnOutputStall responsibility
    Date:     February 26, 2004 5:18:40 PM CST
    To:       Steve Modica
    Cc:       darwin-drivers@lists.apple.com

On Feb 26, 2004, at 6:32 AM, Steve Modica wrote:

darwin-drivers-request@lists.apple.com wrote:


Oh, and one more thing. I only get the copyPacket error when netperf is doing its small, 64-byte UDP test. When the packets get bigger (1024+), then I never run out of this resource. It looks like a timing thing. I process every receive interrupt I get, so it's not like I'm sitting on my hands waiting to clean up received packets.
I know there aren't too many clues here, so if you want to see some code let me know.
cr

I'm not sure if it's relevant, but the IP input queue in OS X is only 50 packets by default. When receiving so many tiny packets (and assuming there's some kind of coalescing going on), are you simply overrunning it?  If you look in sysctl, do you see this value incrementing:

net.inet.ip.intr_queue_drops: 0

I've never seen this value change to !0.

cr


10:12:08 PM



    From:       Chuck Remes
    Subject:     Re: ethernet driver: kIOReturnOutputStall responsibility
    Date:     February 26, 2004 5:08:04 PM CST
    To:       darwin-drivers@lists.apple.com

The 'netstat -m' output looks like this *after* the program finishes
executing:

286 mbufs in use:
         250 mbufs allocated to data
         35 mbufs allocated to socket names and addresses
         1 mbufs allocated to Appletalk data blocks
15361/16384 mbuf clusters in use

That seems a wee bit high.  Are you sure you don't
have a leak?  Does that increase each time you test?

If you're referring to the 15361/16384 values, then the answer is yes. When running the test, that value is usually 211/1428, but as it gets closer to the magic value of 185 it shoots upwards and stays there.

If this is a leak, then it would be on the receive side, right? According to netperf on the FreeBSD side, it's getting very few responses so it can't be a leak from transmission (or can it?). When I receive a packet, there aren't any conditions where I call freePacket(). I'm passing it up the stack, and somewhere above the driver they ought to be calling freePacket().

Looks like I'm getting some resolution on the receive end of things,
but transmission still has issues. I'll tackle that AFTER I tune this
receive process as much as possible.

Question:  Is this one of those sick tulips which requires
that all DMA addresses be 32-bit aligned?

According to the docs I have for the DEC 21143, ADMtek 985, and Lite-On PNIC, they *all* require the descriptors, receive buffers and transmit buffers to be longword aligned. I guess that makes the entire line of chipsets sick.

IOKit hands you mbufs based upon some information passed in through getPacketBufferConstraints(). In there I specify longword alignment like so:

void darwin_tulip::getPacketBufferConstraints( IOPacketBufferConstraints *constraints ) const
{
    constraints->alignStart = kIOPacketBufferAlign4; // longword aligned.
    constraints->alignLength = kIOPacketBufferAlign1; // no restriction.
}

The second constraint tells the system it doesn't have to pad the end of it to end up on any particular address boundary. In comparison, the only public Apple driver that sets any kind of constraint is the AppleIntel8255x project. It requires even word alignment for packets.

Is it possible this requirement puts a throttle on the overall throughput allowed? For kicks, I did an allocatePacket/freePacket loop during my driver load to force the system to create a bunch of mbufs with the alignment I need. After running the test again, the results were a lot better, but not perfect. The plot thickens.

BTW, here are the results of a bunch of netperf runs. The system config is the same as I listed in an earlier email (FreeBSD 4.9 sending to OS X 10.3.2 / Darwin 7.2).

netperf -H 192.168.2.1 -t UDP_STREAM -- -m 190

150 mbufs in use:
        142 mbufs allocated to data
        7 mbufs allocated to socket names and addresses
        1 mbufs allocated to Appletalk data blocks
211/1428 mbuf clusters in use
2893 Kbytes allocated to network (15% in use)
0 requests for memory denied
0 requests for memory delayed
0 calls to protocol drain routines


netperf -H 192.168.2.1 -t UDP_STREAM -- -m 185

150 mbufs in use:
        142 mbufs allocated to data
        7 mbufs allocated to socket names and addresses
        1 mbufs allocated to Appletalk data blocks
1781/16384 mbuf clusters in use
32805 Kbytes allocated to network (10% in use)
0 requests for memory denied
0 requests for memory delayed
0 calls to protocol drain routines


netperf -H 192.168.2.1 -t UDP_STREAM -- -m 184

150 mbufs in use:
        142 mbufs allocated to data
        7 mbufs allocated to socket names and addresses
        1 mbufs allocated to Appletalk data blocks
1851/16384 mbuf clusters in use
32805 Kbytes allocated to network (11% in use)
50109 requests for memory denied
0 requests for memory delayed
0 calls to protocol drain routines


netperf -H 192.168.2.1 -t UDP_STREAM -- -m 64

151 mbufs in use:
        143 mbufs allocated to data
        7 mbufs allocated to socket names and addresses
        1 mbufs allocated to Appletalk data blocks
15361/16384 mbuf clusters in use
32805 Kbytes allocated to network (-34% in use)
1144315 requests for memory denied
0 requests for memory delayed
0 calls to protocol drain routines



10:10:59 PM



    From:       Justin Walker
    Subject:     Re: ethernet driver: kIOReturnOutputStall responsibility
    Date:     February 26, 2004 2:36:34 PM CST
    To:       darwin-drivers@lists.apple.com

You are in good hands for most of this discussion, so I just have a minor comment on one part of the thread:

On Thursday, February 26, 2004, at 11:34 AM, chuck remes wrote:

[snip]
I already responded to this, but I made a driver change which improved the receive process (as documented above in this message). I commented out the copyPacket() stuff and now use replaceOrCopyPacket(). It performs MUCH MUCH better when the UDP_STREAM is using large packets (thanks for the idea!). As a matter of fact, I don't get a single allocation failure when message size is greater than 184 bytes. Any idea why this might be such a magic value? Is this approximately the size of MHLEN?

On PowerPC, these are the values:

Sizeof pkthdr: 32
MSIZE: 256
MHLEN: 204
MLEN: 236
MINCLSIZE: 440

I believe the values are now the same for x86, but for a long time, MSIZE was 128 on that platform.

Note that the MINCLSIZE value is the cutover from 'mbuf chain' to 'cluster'.  This means that if you have more than two mbufs of data to go, the system will (generally) allocate a single cluster, rather than a chain of mbufs for the data.  I've done tests with netperf, and you can see the effect of MINCLSIZE as the size of the transmission increases (as a 'cusp' in the curve).

Note that the MLEN/MHLEN values must include protocol (IP, TCP/UDP, ...) headers as well.  Sometimes these are added as additional mbufs, and sometimes space is left at the beginning of the mbuf; this depends on the source of the data being transmitted, and the (tortuous) path through the socket and protocol layers.
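Justin's cutover rule can be sketched in a few lines of C. The constants are the PowerPC values he lists above; `uses_cluster()` is a hypothetical helper for illustration, not a real kernel function. Note that MINCLSIZE is exactly MHLEN + MLEN, i.e. the point where a header mbuf plus one plain mbuf no longer hold the data:

```c
#include <assert.h>

/* PowerPC mbuf constants quoted above (other platforms may differ) */
#define MSIZE     256   /* size of one mbuf */
#define MHLEN     204   /* data bytes in an mbuf carrying a packet header */
#define MLEN      236   /* data bytes in a plain mbuf */
#define MINCLSIZE 440   /* cutover from mbuf chain to cluster */

/* Hypothetical helper: a payload needing more than two mbufs
 * (one with a header, one without) gets a single cluster instead. */
static int uses_cluster(int len)
{
    return len >= MINCLSIZE;
}
```

So a 184-byte UDP message stays well inside a single header mbuf, while anything at or above 440 bytes goes straight to a cluster.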

Regards,

Justin

--
Justin C. Walker, Curmudgeon-At-Large  *
Institute for General Semantics        | Some people have a mental
                                       |  horizon of radius zero, and
                                       |  call it their point of view.
                                       |     -- David Hilbert
*--------------------------------------*-------------------------------*



10:09:21 PM    comment []



    From:       Andrew Gallatin
    Subject:     Re: ethernet driver: kIOReturnOutputStall responsibility
    Date:     February 26, 2004 2:11:12 PM CST
    To:       Chuck Remes
    Cc:       Andrew Gallatin, darwin-drivers@lists.apple.com

chuck remes writes:

The 'netstat -m' output looks like this *after* the program finishes
executing:

286 mbufs in use:
         250 mbufs allocated to data
         35 mbufs allocated to socket names and addresses
         1 mbufs allocated to Appletalk data blocks
15361/16384 mbuf clusters in use

That seems a wee bit high.  Are you sure you don't
have a leak?  Does that increase each time you test?

32839 Kbytes allocated to network (-33% in use)
1822557 requests for memory denied
0 requests for memory delayed
0 calls to protocol drain routines

During execution, the "286 mbufs in use" value is a bit higher.


After unloading my driver after similar tests, I see:

77 mbufs in use:
        66 mbufs allocated to data
        5 mbufs allocated to packet headers
        4 mbufs allocated to socket names and addresses
        2 mbufs allocated to Appletalk data blocks
635/7606 mbuf clusters in use
15231 Kbytes allocated to network (8% in use)




If you want to get a lot of mbufs ahead of time to see if this
improves things, you can allocate them in a context where you can
sleep (ie when loading driver), and then free them.  My driver uses
1280 mbuf clusters per nic (256 receives * 5 clusters/recv).  When I
unload the driver, the amount of network memory in use does not
decrease (but the number of mbufs does .. I'm not leaking).

Explain how I can do this. Do you mean I should call allocatePacket()
in a tight loop during driver load to force the system to create lots
of mbufs and then call freePacket() on them before any transmissions
start?

#define LOTS_OF_MBUFS 512
struct mbuf *m;
m = m_getpackets(LOTS_OF_MBUFS, M_WAIT);
m_freem_list(m);



You should never, ever see this fail on mbufs you allocate via
something like MCLGET() which allocates system mbuf clusters.  The
system mbuf clusters are always pre-loaded into the iommu and have the
DMA addresses pre-calculated.  Getting their DMA address should be
almost free.

The only time I've ever seen it fail was on my hand-crafted
9000 byte virtually contiguous jumbo frames.  (I consider this a bug,
btw, read the archives..).

Bear in mind that there IS some overhead, and it may be cheaper
to copy really small (<MHLEN) packets.  I think IONetwork* has
a copy_or_replace thing for this.

I already responded to this, but I made a driver change which improved
the receive process (as documented above in this message). I commented
out the copyPacket() stuff and now use replaceOrCopyPacket(). It
performs MUCH MUCH better when the UDP_STREAM is using large packets
(thanks for the idea!). As a matter of fact, I don't get a single

That's probably because you're no longer copying large packets.

allocation failure when message size is greater than 184 bytes. Any
idea why this might be such a magic value? Is this approximately the
size of MHLEN?


I'd expect the cutoff between copying and replacing to happen sooner.
The headers are 14 (ether) + 20 (IP) + 8 (UDP) = 42 bytes.
If I'm reading mbuf.h right, MHLEN is 256 - 20 - 32 == 204.
So it seems like the cutoff would be at 204 - 42 = 162 bytes.
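Drew's arithmetic can be written out as compile-time constants. The header sizes assume a plain Ethernet/IP/UDP frame, as in his message; the names here are illustrative, not from mbuf.h:

```c
#include <assert.h>

/* From mbuf.h on PowerPC, per the numbers in this thread:
 * MSIZE minus the mbuf header (20) minus the packet header (32). */
enum { MSIZE = 256, M_HDR = 20, PKTHDR = 32, MHLEN = MSIZE - M_HDR - PKTHDR };

/* Protocol headers in each received frame */
enum { ETHER_HLEN = 14, IP_HLEN = 20, UDP_HLEN = 8,
       HDRS = ETHER_HLEN + IP_HLEN + UDP_HLEN };

/* Largest UDP payload that still fits in one header mbuf:
 * the expected copy-vs-replace cutoff */
enum { CUTOFF = MHLEN - HDRS };
```

Which puts the expected cutoff at 162 bytes, not the observed 184.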

That's where I see my cutoff (based on MHLEN) kicking in.

Looks like I'm getting some resolution on the receive end of things,
but transmission still has issues. I'll tackle that AFTER I tune this
receive process as much as possible.

Question:  Is this one of those sick tulips which requires
that all DMA addresses be 32-bit aligned?

Drew



10:08:08 PM    comment []



    From:       Chuck Remes
    Subject:     Re: ethernet driver: kIOReturnOutputStall responsibility
    Date:     February 26, 2004 1:34:24 PM CST
    To:       Andrew Gallatin
    Cc:       darwin-drivers@lists.apple.com

We have a couple of message threads going simultaneously here, so I'm going to try and pull little bits from each one back into one response.


On Feb 26, 2004, at 11:47 AM, Andrew Gallatin wrote:

chuck remes writes:

On the receiver end it shows:

cremes% netstat -m
258 mbufs in use:
         226 mbufs allocated to data
         31 mbufs allocated to socket names and addresses
         1 mbufs allocated to Appletalk data blocks
202/608 mbuf clusters in use
1280 Kbytes allocated to network (36% in use)
0 requests for memory denied
0 requests for memory delayed
0 calls to protocol drain routines

This doesn't look so bad. That's probably because the machine
transmitting (a darwin-x86 box running the same version of the driver)

Are you sure it's actually failing an allocation and not failing
in some other way?

I'm pretty sure it's failing on allocation. Here are my observations:

Sender: athlon something, single CPU, freebsd 4.9 with the command "netperf -H 192.168.2.1 -t UDP_STREAM -- -m 185"
Receiver: G5 2.0, dual CPU, OSX 10.3.2 (darwin 7.2.0) is running the "netserver" program

During this 10 second test, on the G5 running "top -s3 -u" I see the kernel_task ratchet up towards 100%. It usually tops out around 99.8%. On this test, there are no allocation failures.

Same setup, but now the Sender is executing "netperf -H 192.168.2.1 -t UDP_STREAM -- -m 184" and the world comes crashing down. Notice that the message size is only different by a single byte. This is reproducible. I get allocation failures and the kernel_task goes to 102%. This tells me the kernel_task doesn't have any subthreads to take advantage of other CPUs to do cleanup tasks and this may result in the allocation failures.

The 'netstat -m' output looks like this *after* the program finishes executing:

286 mbufs in use:
        250 mbufs allocated to data
        35 mbufs allocated to socket names and addresses
        1 mbufs allocated to Appletalk data blocks
15361/16384 mbuf clusters in use
32839 Kbytes allocated to network (-33% in use)
1822557 requests for memory denied
0 requests for memory delayed
0 calls to protocol drain routines

During execution, the "286 mbufs in use" value is a bit higher.


If you want to get a lot of mbufs ahead of time to see if this
improves things, you can allocate them in a context where you can
sleep (ie when loading driver), and then free them.  My driver uses
1280 mbuf clusters per nic (256 receives * 5 clusters/recv).  When I
unload the driver, the amount of network memory in use does not
decrease (but the number of mbufs does .. I'm not leaking).

Explain how I can do this. Do you mean I should call allocatePacket() in a tight loop during driver load to force the system to create lots of mbufs and then call freePacket() on them before any transmissions start?

You should never, ever see this fail on mbufs you allocate via
something like MCLGET() which allocates system mbuf clusters.  The
system mbuf clusters are always pre-loaded into the iommu and have the
DMA addresses pre-calculated.  Getting their DMA address should be
almost free.

The only time I've ever seen it fail was on my hand-crafted
9000 byte virtually contiguous jumbo frames.  (I consider this a bug,
btw, read the archives..).

Bear in mind that there IS some overhead, and it may be cheaper
to copy really small (<MHLEN) packets.  I think IONetwork* has
a copy_or_replace thing for this.

I already responded to this, but I made a driver change which improved the receive process (as documented above in this message). I commented out the copyPacket() stuff and now use replaceOrCopyPacket(). It performs MUCH MUCH better when the UDP_STREAM is using large packets (thanks for the idea!). As a matter of fact, I don't get a single allocation failure when message size is greater than 184 bytes. Any idea why this might be such a magic value? Is this approximately the size of MHLEN?

Looks like I'm getting some resolution on the receive end of things, but transmission still has issues. I'll tackle that AFTER I tune this receive process as much as possible.

cr



10:05:24 PM    comment []



From: Andrew Gallatin
Subject: Re: ethernet driver: kIOReturnOutputStall responsibility
Date: February 26, 2004 11:47:38 AM CST
To: Chuck Remes
Cc: darwin-drivers@lists.apple.com

chuck remes writes:

On the receiver end it shows:

cremes% netstat -m
258 mbufs in use:
226 mbufs allocated to data
31 mbufs allocated to socket names and addresses
1 mbufs allocated to Appletalk data blocks
202/608 mbuf clusters in use
1280 Kbytes allocated to network (36% in use)
0 requests for memory denied
0 requests for memory delayed
0 calls to protocol drain routines

This doesn't look so bad. That's probably because the machine
transmitting (a darwin-x86 box running the same version of the driver)

Are you sure it's actually failing an allocation and not failing
in some other way?

If you want to get a lot of mbufs ahead of time to see if this
improves things, you can allocate them in a context where you can
sleep (ie when loading driver), and then free them. My driver uses
1280 mbuf clusters per nic (256 receives * 5 clusters/recv). When I
unload the driver, the amount of network memory in use does not
decrease (but the number of mbufs does .. I'm not leaking).



Drew



10:03:18 PM    comment []



    From:       Chuck Remes
    Subject:     Re: ethernet driver: kIOReturnOutputStall responsibility
    Date:     February 26, 2004 11:32:15 AM CST
    To:       Andrew Gallatin
    Cc:       darwin-drivers@lists.apple.com

On Feb 26, 2004, at 11:00 AM, Andrew Gallatin wrote:
chuck remes writes:

On transmit I don't do this. The call listed above to
getPhysicalSegmentsWithCoalesce() returns the physical address of each
buffer segment in the vector variable (of type IOPhysicalSegment). This
call is necessary even if I don't need to do any coalescing because I
always need to get the physical addresses.

How do you call it?

When pmap_extract() was exiled to Siberia, I started using an
iombufcursor thing.  I create mine when loading the driver like this
(#defines expanded)

my_mbufCursor = IOMbufNaturalMemoryCursor::withSpecification(9014, 5);

Yep, that maps closely to my initialization:

    rxMbufCursor = IOMbufNaturalMemoryCursor::withSpecification( 1518, 1 );
    txMbufCursor = IOMbufNaturalMemoryCursor::withSpecification( 1518, 1 );



Then I get my DMA addresses like this (I hate it when people call them
physical addresses, they are not..):

Since you're helping me out, I will endeavor to call them DMA addresses. No sense in jumping on one of your pet peeves... ;)

int
my_encap (struct my_sc *sc,
          struct mbuf *m,
          struct my_dma_info *dma)
{
  struct IOPhysicalSegment segments[5];
  int count, i;

  count = my_macos_mbufCursor->getPhysicalSegments(m, segments, 5);
  for (i = 0; i < count; i++)
    {
      dma->desc[i].ptr = segments[i].location;
      dma->desc[i].len = segments[i].length;
    }
  dma->nsegs = count;

  return count;
}


That's very similar to what I've got, but I use many more methods since I did it in C++ and I wanted to be able to override this stuff. This is my entire transmit routine. The entry point is outputPacket(). Go to the bottom of the code listing and read upward, from the last method to the first:

void darwin_tulip_admtek985::_assignDescriptors( struct mbuf *m,
                                                 struct IOPhysicalSegment *vector, SInt32 segments )
{
    UInt32 curIndex, prevIndex;
    tulip_descriptor_t *d;

    curIndex = txHead;
    d = &tx_desc_ring[ curIndex ];
    d->control = _bitOrder( vector[ 0 ].length ) | tx_CTL_TLINK | tx_CTL_FIRSTFRAG;
    d->status = 0;
    d->buffer1 = _bitOrder( vector[ 0 ].location );
    prevIndex = txHead;

    TXINC( curIndex );

    for ( int i = 1; i < segments; i++, TXINC( curIndex ) )
    {
        d = &tx_desc_ring[ curIndex ];
        d->control = _bitOrder( vector[ i ].length ) | tx_CTL_TLINK;
        // set subsequent descriptors in the chain to be owned, set first descriptor last
        d->status = tx_STAT_OWN;
        d->buffer1 = _bitOrder( vector[ i ].location );

        prevIndex = curIndex;
    }
    // save the virtual address so we can safely call freePacket()
    tx_mbuf_ring[ prevIndex ] = m;
    d->control |= tx_CTL_LASTFRAG;

    // request interrupt periodically to allow for housekeeping tasks
    if ( 0 == ( curIndex & ( TULIP_DESC_INT >> TULIP_MAX_TX_SEGMENTS ) ) )
        d->control |= tx_CTL_FINT;

    tx_desc_ring[ txHead ].status = tx_STAT_OWN; // set OWN on FIRSTFRAG so transmitter will detect it and send
    txHead = curIndex;

    _writeRegister( TULIP_TXSTART, 0xbaadf00d ); // kick the transmitter

    netStats->outputPackets += segments;
}


UInt32 darwin_tulip_admtek985::_computeDescriptors( struct mbuf *m,
                                          struct IOPhysicalSegment *vector, SInt32 *segments )
{
    SInt32 available = TULIP_MAX_TX_SEGMENTS, a, b;
    UInt32 ret = kIOReturnOutputSuccess;

    a = TULIP_TX_RING_LENGTH - txHead;
    b = TULIP_MAX_TX_SEGMENTS;
    if ( a > b ) available = b;
    if ( b >= a ) available = a;

    *segments = txMbufCursor->getPhysicalSegmentsWithCoalesce( m, vector, available );

    if ( ! *segments )
    {
        // record error stats
        IOLog( "%s, getPhysicalSegmentsWithCoalesce returned zero\n", __FUNCTION__ );
        freePacket( m );
        netStats->outputErrors++;
        ret = kIOReturnOutputDropped;
    }
    return ret;
}


UInt32 darwin_tulip::outputPacket( struct mbuf *m, void *param )
{
    UInt32 ret = kIOReturnOutputSuccess;

    do
    {
        if ( !enabledNetif )
        {
            IOLog( "%s, interface not enabled\n", __FUNCTION__ );
            freePacket( m ); // drop the packet
            ret = kIOReturnOutputDropped;
            break;
        }

        if ( txActiveCount > TRANSMIT_QUEUE_LENGTH )
        {
            // first thing to try is free up some resources by cleaning up the TX descriptor list
            // NOTE: possible race condition since outputPacket is called from the client's thread
            // instead of the driver's workloop. Look at using an atomic TEST&SET to guard it.
            _handleTxCleanup();

            // if after the cleanup efforts we still don't have any resources, stall the transmission
            if ( txActiveCount > TRANSMIT_QUEUE_LENGTH )
            {
                IOLog( "%s, kIOReturnOutputStall, txActiveCount = %d, txHead = %d, txTail = %d\n", __FUNCTION__, txActiveCount, txHead, txTail );
                netStats->outputErrors++;

                ret = kIOReturnOutputStall;
                transmitterStalled = true;
                break;
            }
        }

        struct IOPhysicalSegment vector[ TULIP_MAX_TX_SEGMENTS ];
        SInt32 segments;

        ret = _computeDescriptors( m, vector, &segments );
        if ( kIOReturnOutputSuccess != ret )
            break;

        OSAddAtomic( segments, (SInt32*) &txActiveCount ); // update counter tracking active tx descriptors

        _assignDescriptors( m, vector, segments );
        packetsTransmitted = true;
    }
    while ( false );

    return ret;
}



You should never, ever see this fail on mbufs you allocate via
something like MCLGET() which allocates system mbuf clusters.  The
system mbuf clusters are always pre-loaded into the iommu and have the
DMA addresses pre-calculated.  Getting their DMA address should be
almost free.

The only time I've ever seen it fail was on my hand-crafted
9000 byte virtually contiguous jumbo frames.  (I consider this a bug,
btw, read the archives..).

Bear in mind that there IS some overhead, and it may be cheaper
to copy really small (<MHLEN) packets.  I think IONetwork* has
a copy_or_replace thing for this.

It does. It's called replaceOrCopyPacket(). I haven't been using it because I didn't want to get the DMA address from a replaced mbuf and update my RX descriptor (as I mentioned in a prior post).

If it's really as cheap as you say, I'll give that a try and see how it goes.

cr



10:02:06 PM    comment []



    From:       Chuck Remes
    Subject:     Re: ethernet driver: kIOReturnOutputStall responsibility
    Date:     February 26, 2004 11:10:52 AM CST
    To:       Andrew Gallatin
    Cc:       darwin-drivers@lists.apple.com

On Feb 26, 2004, at 8:24 AM, Andrew Gallatin wrote:


chuck remes writes:
Oh, and one more thing. I only get the copyPacket error when netperf is
doing its small, 64-byte UDP test. When the packets get bigger (1024+),
then I never run out of this resource. It looks like a timing thing. I
process every receive interrupt I get, so it's not like I'm sitting on
my hands waiting to clean up received packets.


I assume you are calling copyPacket() with size == the amount  of
actual received data, not the size of the entire buffer?

I've tried it both ways.

    newM = copyPacket( m, length ); // do a copy to avoid having to get new Physical Address (lazy programmer!)
AND
    newM = copyPacket( m, 0 ); // do a copy to avoid having to get new Physical Address (lazy programmer!)

When 0 gets passed in, copyPacket() checks for it and uses m->m_pkthdr.len as the copy length.


One thing to consider is that there are a limited number of mbufs and
mbuf clusters in the system.  If you are blasting small packets at a
receiver with the default 41600 byte UDP receive socket buffer size,
then you can have 41600/SIZE (==650 for 64 byte messages) mbufs
piling up on the receive socket buffer queue, plus another 50 or so
waiting in the ip intr_queue.   So that's 700 mbufs gone.

When you increase the size, you decrease the number of mbufs (and mbuf
clusters) you are using.  41600/1024 == 40.  Plus 50 waiting
is ~90 mbufs + clusters.
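Drew's pileup arithmetic, as a tiny C sketch (the 41600-byte default socket buffer and the ~50-mbuf intr queue figure are taken from his message; `mbufs_pinned` is an illustrative name, not a real function):

```c
#include <assert.h>

/* Roughly how many mbufs a UDP receiver can pin at once:
 * sobuf / message_size on the socket buffer queue,
 * plus some number waiting in the ip intr queue. */
static int mbufs_pinned(int sobuf, int msg_size, int intrq)
{
    return sobuf / msg_size + intrq;
}
```

With the default 41600-byte buffer, 64-byte messages pin about 700 mbufs while 1024-byte messages pin about 90, which is why the small-packet test exhausts the pool first.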

I didn't know any of this, but it all makes sense. I know this would just "hide" the problem, but is there a way to use "sysctl" to increase the number of mbufs and mbuf clusters system-wide?

Does shrinking the receiver socket buffer queue for the small packet
case improve things (netperf -tUDP_STREAM .... -- -m 64 -S2560)

Nope, see below.

What does netstat -m show during your test?

On the receiver end it shows:

cremes% netstat -m
258 mbufs in use:
        226 mbufs allocated to data
        31 mbufs allocated to socket names and addresses
        1 mbufs allocated to Appletalk data blocks
202/608 mbuf clusters in use
1280 Kbytes allocated to network (36% in use)
0 requests for memory denied
0 requests for memory delayed
0 calls to protocol drain routines

This doesn't look so bad. That's probably because the machine transmitting (a darwin-x86 box running the same version of the driver) keeps stalling its output queue. When I boot under freebsd 4.9 and do the same test, it gets a fatal kernel error whenever I use the "-S2560" option. It works fine with the larger values like 32768, but it doesn't like that small value.

BTW, what sort of hardware are you dealing with?  Is this a Gig or
10Gig nic?  I don't think you should be able to livelock a modern
system like this with a 10 or 10/100 nic, unless there's something
really expensive going on in your driver.

It's a 10/100 NIC (DEC tulip and clones as I responded earlier). I'm surprised too, which is why I'm convinced I'm doing something wrong.

If nothing else, I am learning a LOT about the internals of the system. Thanks for sharing your expertise.

cr



9:58:22 PM    comment []



    From:       Andrew Gallatin
    Subject:     Re: ethernet driver: kIOReturnOutputStall responsibility
    Date:     February 26, 2004 11:00:18 AM CST
    To:       Chuck Remes
    Cc:       darwin-drivers@lists.apple.com

chuck remes writes:

This way you at least avoid an allocation/free/mcl_to_paddr() for each
packet, and maybe you avoid updating some descriptor state in your
hardware.

I do this on the receive descriptor list. Getting the physical address
of an mbuf segment appears to be somewhat expensive, so I always call
copyPacket to get a new mbuf to pass up to the stack. So yes, I do
avoid updating the descriptor state for the hardware.

Ugh. mcl_to_paddr() should be very cheap.  You should not need to copy
in this case.


On transmit I don't do this. The call listed above to
getPhysicalSegmentsWithCoalesce() returns the physical address of each
buffer segment in the vector variable (of type IOPhysicalSegment). This
call is necessary even if I don't need to do any coalescing because I
always need to get the physical addresses.

How do you call it?

When pmap_extract() was exiled to Siberia, I started using an
iombufcursor thing.  I create mine when loading the driver like this
(#defines expanded)

my_mbufCursor = IOMbufNaturalMemoryCursor::withSpecification(9014, 5);

Then I get my DMA addresses like this (I hate it when people call them
physical addresses, they are not..):

int
my_encap (struct my_sc *sc,
          struct mbuf *m,
          struct my_dma_info *dma)
{
  struct IOPhysicalSegment segments[5];
  int count, i;

  count = my_macos_mbufCursor->getPhysicalSegments(m, segments, 5); 
  for (i = 0; i < count; i++)
    {
      dma->desc[i].ptr = segments[i].location;
      dma->desc[i].len = segments[i].length;
    }
  dma->nsegs = count;

  return count;
}

You should never, ever see this fail on mbufs you allocate via
something like MCLGET() which allocates system mbuf clusters.  The
system mbuf clusters are always pre-loaded into the iommu and have the
DMA addresses pre-calculated.  Getting their DMA address should be
almost free.

The only time I've ever seen it fail was on my hand-crafted
9000 byte virtually contiguous jumbo frames.  (I consider this a bug,
btw, read the archives..).

Bear in mind that there IS some overhead, and it may be cheaper
to copy really small (<MHLEN) packets.  I think IONetwork* has
a copy_or_replace thing for this. 

Drew



9:56:29 PM    comment []



From: Chuck Remes
Subject: Re: ethernet driver: kIOReturnOutputStall responsibility
Date: February 26, 2004 10:39:39 AM CST
To: Andrew Gallatin
Cc: Chuck Remes, darwin-drivers@lists.apple.com

On Feb 26, 2004, at 7:52 AM, Andrew Gallatin wrote:


chuck remes writes:

The UDP_STREAM test was more interesting. Immediately after firing up
the test, the driver started kicking out errors about stalling the
output queue (kIOReturnOutputStall). The netperf application was firing
off packets as fast as it could, and this overran the transmit
descriptor rings in the driver. The error handling detected this
exhaustion of resources, returned kIOReturnOutputStall to the "upper"
layers of the stack, and waited for the watchdog timer to go off to
call transmitQueue->service( IOBasicOutputQueue::kServiceAsync ). The
service call notifies the stack that everything is peachy again and
transmission can restart.


Is there any way to clear the "stall" condition from your
transmit complete interrupt handler without having to wait for
the watchdog to do it? If so, then I think you probably
want to "stall" the queue, and clear the stall when more
space becomes available.

I did make a minor change which appeared to cure stall conditions I saw when running the TCP_STREAM test. Here's the code:

    if ( txActiveCount > TRANSMIT_QUEUE_LENGTH )
    {
        // first thing to try is free up some resources by cleaning up the TX descriptor list
        // NOTE: possible race condition since outputPacket is called from the client's thread
        // instead of the driver's workloop. Look at using an atomic TEST&SET to guard it.
        _handleTxCleanup();

        // if after the cleanup efforts we still don't have any resources, stall the transmission
        if ( txActiveCount > TRANSMIT_QUEUE_LENGTH )
        {
            IOLog( "%s, kIOReturnOutputStall, txActiveCount = %d, txHead = %d, txTail = %d\n", __FUNCTION__, txActiveCount, txHead, txTail );
            netStats->outputErrors++;
            ret = kIOReturnOutputStall;
            transmitterStalled = true;
            break;
        }
    }

The code is pretty self-explanatory. This does NOT help when running the UDP_STREAM test.

I don't have a lot of experience with IOKit ethernet drivers, since I
just use the raw BSD interface. Here's some background on how a
traditional BSD driver deals with this situation:

My transmit routine appends an mbuf chain to my ifp->if_snd
queue (or drops it if the queue is full). If IFF_OACTIVE is
not set, then it transmits the first chain in the queue.
If IFF_OACTIVE is set, then it returns and the chain waits
in the queue.

When the ifp->if_snd queue fills up, logic in ip_output()
short circuits the call down into the network stack and returns
ENOBUFS until the driver manages to clear the backlog.

I assume IOKit internally issues ENOBUFS when it receives kIOReturnOutputStall or kIOReturnOutputDropped. That's some of the code I haven't examined yet...

When the hardware runs out of resources, my driver sets the
IFF_OACTIVE bit in its ifp->if_flags. Then when a transmit complete
interrupt arrives, the interrupt handler clears the IFF_OACTIVE bit
and calls the transmit routine immediately.
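The IFF_OACTIVE handshake Drew describes can be modeled in a self-contained toy, where the struct, flag value, and function names are all stand-ins (not the real `struct ifnet` or BSD API):

```c
#include <assert.h>

#define IFF_OACTIVE 0x1     /* stand-in for the real if_flags bit */

/* Toy model of the BSD transmit/stall pattern described above */
struct toy_ifnet {
    int if_flags;
    int hw_slots;           /* free transmit descriptors */
};

/* Transmit path: when hardware slots run out, mark the
 * interface active-and-full so the stack stops calling down. */
static int toy_start(struct toy_ifnet *ifp)
{
    if (ifp->hw_slots == 0) {
        ifp->if_flags |= IFF_OACTIVE;
        return -1;          /* would-be ENOBUFS */
    }
    ifp->hw_slots--;
    return 0;
}

/* Tx-complete interrupt: reclaim a slot and clear the stall
 * immediately, rather than waiting for a watchdog timer. */
static void toy_txintr(struct toy_ifnet *ifp)
{
    ifp->hw_slots++;
    ifp->if_flags &= ~IFF_OACTIVE;
}
```

The key point for the IOKit driver is the same: the stall should be cleared from the transmit-complete interrupt path, not from a periodic watchdog.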

The HW resources are easy to keep "plentiful." It's the mbufs where I'm having trouble...

Thanks a lot for your answers. They have been very helpful.

cr



9:54:01 PM    comment []



    From:      Chuck Remes
    Subject:     Re: ethernet driver: kIOReturnOutputStall responsibility
    Date:     February 26, 2004 10:32:38 AM CST
    To:       Andrew Gallatin
    Cc:       darwin-drivers@lists.apple.com, hackers@opendarwin.org

On Feb 26, 2004, at 7:10 AM, Andrew Gallatin wrote:


chuck remes writes:

Mmm, no. Let me explain again. When I am preparing a packet for
transmission, I coalesce whatever the stack handed to me into a single
mbuf. I call getPhysicalblahblah to do this work AND to retrieve the

I don't know what your hardware is like, but if you can use multiple
DMA descriptors per transmit, you should.  Coalescing IOKit's way
seems to always involve the allocation of a cluster mbuf, and the copy
of the full packet, even when it may not be required.  That's
expensive.  And it will happen all the time, because most packets
from a higher protocol will be at least 2 mbufs.

This is for the DEC tulip and its clones. They all do allow multiple descriptors per transmit. After I sent my note last night, I went back and changed my descriptor allocation routine to use up to 3 mbufs on transmit.

In various places, the code looks like:

    // allocate the Mbuf cursors, set max packet size, and max number of allowable segments
    rxMbufCursor = IOMbufNaturalMemoryCursor::withSpecification( kIOEthernetMaxPacketSize, 1 );
    txMbufCursor = IOMbufNaturalMemoryCursor::withSpecification( kIOEthernetMaxPacketSize, 3 );

and

    /* first, check to see if enough buffers are left to finish request.
     * Coalesce the mbuf segs to make the packet fit into the available resources for
     * transmission.
     */
    a = ( TULIP_TX_RING_LENGTH - txHead ) + txTail;
    b = TULIP_MAX_TX_SEGMENTS;
    if ( a > b ) available = b;
    if ( b >= a ) available = a;

    *segments = txMbufCursor->getPhysicalSegmentsWithCoalesce( m, vector, available );

This resulted in better throughput, but more stalls. I cured the stall condition by making the transmit completion interrupt occur more often so it could do its internal housekeeping and cleanup tx/rx descriptors.

If you really have to coalesce and you have only a "few" transmit
descriptors, it might actually be cheaper to keep a pre-allocated,
pre-pinned buffer for each descriptor that you copy each mbuf chain into.
This way you at least avoid an allocation/free/mcl_to_paddr() for each
packet, and maybe you avoid updating some descriptor state in your
hardware.

I do this on the receive descriptor list. Getting the physical address of an mbuf segment appears to be somewhat expensive, so I always call copyPacket to get a new mbuf to pass up to the stack. So yes, I do avoid updating the descriptor state for the hardware.

On transmit I don't do this. The call listed above to getPhysicalSegmentsWithCoalesce() returns the physical address of each buffer segment in the vector variable (of type IOPhysicalSegment). This call is necessary even if I don't need to do any coalescing because I always need to get the physical addresses.

I looked at calling mcl_to_paddr() directly, but it looks like it might be unsafe. It sometimes returns 0, in which case a further call to pmap_extract() is necessary. There was a thread here over a year ago saying that calling pmap_extract() directly was a no-no since the interface may change. I think it did when the G5s shipped and all this 64-bit stuff got in the way.

If you know a computationally cheap and safe way to get the physical addresses that you know won't break, let me know! I'll avoid the IOKit API for that operation.

BTW, I'm glad you like netperf ;)

Are you kidding me? I *hate* it! It dashed my belief that the driver was complete and perfect. :-)

cr
_______________________________________________
darwin-drivers mailing list | darwin-drivers@lists.apple.com
Help/Unsubscribe/Archives: http://www.lists.apple.com/mailman/listinfo/darwin-drivers
Do not post admin requests to the list. They will be ignored.



9:52:20 PM    comment []



    From:       Andrew Gallatin
    Subject:     Re: ethernet driver: kIOReturnOutputStall responsibility
    Date:     February 26, 2004 8:24:22 AM CST
    To:       Chuck Remes
    Cc:       darwin-drivers@lists.apple.com

chuck remes writes:
Oh, and one more thing. I only get the copyPacket error when netperf is
doing its small, 64-byte UDP test. When the packets get bigger (1024+),
then I never run out of this resource. It looks like a timing thing. I
process every receive interrupt I get, so it's not like I'm sitting on
my hands waiting to clean up received packets.


I assume you are calling copyPacket() with size == the amount  of
actual received data, not the size of the entire buffer?

One thing to consider is that there are a limited number of mbufs and
mbuf clusters in the system.  If you are blasting small packets at a
receiver with the default 41600 byte UDP receive socket buffer size,
then you can have 41600/SIZE (==650 for 64 byte messages) mbufs
piling up on the receive socket buffer queue, plus another 50 or so
waiting in the ip intr_queue.   So that's 700 mbufs gone.

When you increase the size, you decrease the number of mbufs (and mbuf
clusters) you are using.  41600/1024 == 40.  Plus 50 waiting
is ~90 mbufs + clusters.
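Drew's back-of-envelope arithmetic made checkable; the 41600-byte socket buffer and ~50-entry intr_queue figures are taken from his message, nothing else is assumed:

```cpp
// mbufs tied up = (socket buffer bytes / message size) rounded down,
// plus roughly 50 waiting in the ip intr_queue.
int mbufsInFlight(int sockBufBytes, int messageSize) {
    return sockBufBytes / messageSize + 50;
}
```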

Does shrinking the receiver socket buffer queue for the small packet
case improve things (netperf -tUDP_STREAM .... -- -m 64 -S2560)?

What does netstat -m show during your test?

BTW, what sort of hardware are you dealing with?  Is this a Gig or
10Gig nic?  I don't think you should be able to livelock a modern
system like this with a 10 or 10/100 nic, unless there's something
really expensive going on in your driver.

Drew
_______________________________________________
darwin-drivers mailing list | darwin-drivers@lists.apple.com
Help/Unsubscribe/Archives: http://www.lists.apple.com/mailman/listinfo/darwin-drivers
Do not post admin requests to the list. They will be ignored.



9:50:49 PM    comment []



    From:       Andrew Gallatin
    Subject:     Re: ethernet driver: kIOReturnOutputStall responsibility
    Date:     February 26, 2004 7:52:29 AM CST
    To:       Chuck Remes
    Cc:       darwin-drivers@lists.apple.com

chuck remes writes:

The UDP_STREAM test was more interesting. Immediately after firing up
the test, the driver started kicking out errors about stalling the
output queue (kIOReturnOutputStall). The netperf application was firing
off packets as fast as it could, and this overran the transmit
descriptor rings in the driver. The error handling detected this
exhaustion of resources, returned kIOReturnOutputStall to the "upper"
layers of the stack, and waited for the watchdog timer to go off to
call transmitQueue->service( IOBasicOutputQueue::kServiceAsync ). The
service call notifies the stack that everything is peachy again and
transmission can restart.


Is there any way to clear the "stall" condition from your
transmit complete interrupt handler without having to wait for
the watchdog to do it?  If so, then I think you probably
want to "stall" the queue, and clear the stall when more
space becomes available.
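A heavily simplified model of that handshake: stall when the ring is exhausted, and have the tx-complete interrupt reclaim descriptors and restart the queue immediately rather than waiting up to a second for the watchdog. This only loosely mirrors IOBasicOutputQueue semantics; all names and types here are stand-ins:

```cpp
#include <queue>

struct StallModel {
    int freeSlots;
    bool stalled = false;
    std::queue<int> pending;                 // packets the queue holds
    explicit StallModel(int n) : freeSlots(n) {}

    bool outputPacket(int pkt) {
        if (freeSlots == 0) {
            stalled = true;                  // ~ kIOReturnOutputStall
            pending.push(pkt);               // queue retries it later
            return false;
        }
        --freeSlots;
        return true;
    }
    void service() {                         // ~ transmitQueue->service()
        stalled = false;
        while (!pending.empty() && freeSlots > 0) {
            --freeSlots;
            pending.pop();
        }
    }
    void txCompleteInterrupt(int reclaimed) {
        freeSlots += reclaimed;              // descriptor housekeeping
        if (stalled) service();              // clear the stall right away
    }
};
```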

I don't have a lot of experience with IOKit ethernet drivers, since I
just use the raw BSD interface.  Here's some background on how a
traditional BSD driver deals with this situation:

My transmit routine appends an mbuf chain to my ifp->if_snd
queue (or drops it if the queue is full).  If IFF_OACTIVE is
not set, then it transmits the first chain in the queue.
If IFF_OACTIVE is set, then it returns and the chain waits
in the queue.

When the ifp->if_snd queue fills up, logic in ip_output()
short circuits the call down into the network stack and returns
ENOBUFS until the driver manages to clear the backlog.

When the hardware runs out of resources, my driver sets the
IFF_OACTIVE bit in its ifp->if_flags.   Then when a transmit complete
interrupt arrives, the interrupt handler clears the IFF_OACTIVE bit
and calls the transmit routine immediately.
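The BSD pattern Drew describes, as a toy. IFF_OACTIVE is the real flag name; the queue, slot count, and method names below are stand-ins for a real driver's ifnet plumbing:

```cpp
#include <queue>

enum { IFF_OACTIVE = 0x1 };

struct BsdNic {
    int if_flags = 0;
    int hwSlots;                             // free transmit descriptors
    std::queue<int> if_snd;                  // ~ ifp->if_snd
    int sent = 0;
    explicit BsdNic(int slots) : hwSlots(slots) {}

    void transmit(int pkt) {                 // append, then kick if idle
        if_snd.push(pkt);
        if (!(if_flags & IFF_OACTIVE)) start();
    }
    void start() {                           // drain queue into hardware
        while (!if_snd.empty()) {
            if (hwSlots == 0) { if_flags |= IFF_OACTIVE; return; }
            --hwSlots; ++sent;
            if_snd.pop();
        }
    }
    void txCompleteInterrupt(int reclaimed) {
        hwSlots += reclaimed;
        if_flags &= ~IFF_OACTIVE;            // clear and resume at once
        start();
    }
};
```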

Drew
_______________________________________________
darwin-drivers mailing list | darwin-drivers@lists.apple.com
Help/Unsubscribe/Archives: http://www.lists.apple.com/mailman/listinfo/darwin-drivers
Do not post admin requests to the list. They will be ignored.



9:49:03 PM    comment []



    From:       Andrew Gallatin
    Subject:     Re: ethernet driver: kIOReturnOutputStall responsibility
    Date:     February 26, 2004 7:10:03 AM CST
    To:       Chuck Remes
    Cc:       darwin-drivers@lists.apple.com, hackers@opendarwin.org

chuck remes writes:

Mmm, no. Let me explain again. When I am preparing a packet for
transmission, I coalesce whatever the stack handed to me into a single
mbuf. I call getPhysicalblahblah to do this work AND to retrieve the

I don't know what your hardware is like, but if you can use multiple
DMA descriptors per transmit, you should.  Coalescing IOKit's way
seems to always involve the allocation of a cluster mbuf, and the copy
of the full packet, even when it may not be required.  That's
expensive.  And it will happen all the time, because most packets
from a higher protocol will be at least 2 mbufs.

If you really have to coalesce and you have only a "few" transmit
descriptors, it might actually be cheaper to keep a pre-allocated,
pre-pinned buffer for each descriptor that you copy each mbuf chain into.
This way you at least avoid an allocation/free/mcl_to_paddr() for each
packet, and maybe you avoid updating some descriptor state in your
hardware.
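The pre-pinned bounce buffer idea can be sketched like this; the Mbuf struct is a stand-in for the kernel's far more involved struct mbuf, and in a real driver the buffer would be allocated and its physical address computed once at setup:

```cpp
#include <cstring>
#include <cstddef>

// Stand-in mbuf chain link.
struct Mbuf { const char* data; size_t len; Mbuf* next; };

// Copy an mbuf chain into a pre-allocated (in a driver: pre-pinned)
// bounce buffer, avoiding per-packet allocation and address
// translation. Returns the packet length, or 0 if it won't fit.
size_t copyChainToBounce(const Mbuf* m, char* bounce, size_t cap) {
    size_t off = 0;
    for (; m != nullptr; m = m->next) {
        if (off + m->len > cap) return 0;   // packet too large: drop it
        std::memcpy(bounce + off, m->data, m->len);
        off += m->len;
    }
    return off;                             // total packet length
}
```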

BTW, I'm glad you like netperf ;)

Drew
_______________________________________________
darwin-drivers mailing list | darwin-drivers@lists.apple.com
Help/Unsubscribe/Archives: http://www.lists.apple.com/mailman/listinfo/darwin-drivers
Do not post admin requests to the list. They will be ignored.



9:48:14 PM    comment []



    From:       Steve Modica
    Subject:     Re: ethernet driver: kIOReturnOutputStall responsibility
    Date:     February 26, 2004 6:32:53 AM CST
    To:       darwin-drivers@lists.apple.com

darwin-drivers-request@lists.apple.com wrote:


Oh, and one more thing. I only get the copyPacket error when netperf is doing its small, 64-byte UDP test. When the packets get bigger (1024+), then I never run out of this resource. It looks like a timing thing. I process every receive interrupt I get, so it's not like I'm sitting on my hands waiting to clean up received packets.
I know there aren't too many clues here, so if you want to see some code let me know.
cr

I'm not sure if it's relevant, but the IP input queue in OS X is only 50 packets by default. When receiving so many tiny packets (and assuming there's some kind of coalescing going on), are you simply overrunning it?  If you look in sysctl, do you see this value incrementing:

net.inet.ip.intr_queue_drops: 0


--
Steve Modica
_______________________________________________
darwin-drivers mailing list | darwin-drivers@lists.apple.com
Help/Unsubscribe/Archives: http://www.lists.apple.com/mailman/listinfo/darwin-drivers
Do not post admin requests to the list. They will be ignored.



9:47:19 PM    comment []



    From:       Chuck Remes
    Subject:     Re: ethernet driver: kIOReturnOutputStall responsibility
    Date:     February 25, 2004 10:24:13 PM CST
    To:      Justin Walker
    Cc:       darwin-drivers@lists.apple.com, hackers@opendarwin.org

On Feb 25, 2004, at 10:11 PM, chuck remes wrote:

[SNIP]
I now have to investigate why sending and receiving are failing under extreme stress, so if anyone knows why getPhysicalSegmentsWithCoalesce or copyPacket would fail, I'd love to hear it.

Both of these are in the IONetworkingFamily code (which you say you've looked at):
  copyPacket fails when there are no mbufs available
   or when the source mbuf is malformed;
  the Newton-John call (getPhysical :-}) fails for
   a variety of strange and wondrous reasons, mostly
   having to do with malformed mbufs and resource runout.

Oh yes, I've looked at that code. I sometimes dream about it. :-)

I guess the system is running out of mbufs. When I switch to using the built-in ethernet (GMAC driver), it doesn't print any of the errors I see listed in the source. I'll have to double-check, but I don't recall the error counters incrementing from dropped packets either (though I may be wrong on that score). However, I'm left with the impression that it is handling the stress better than my driver and that I cannot abide. I'll go nuts tuning this thing and it's probably due to lame-o hardware. :-)

Oh, and one more thing. I only get the copyPacket error when netperf is doing its small, 64-byte UDP test. When the packets get bigger (1024+), then I never run out of this resource. It looks like a timing thing. I process every receive interrupt I get, so it's not like I'm sitting on my hands waiting to clean up received packets.

I know there aren't too many clues here, so if you want to see some code let me know.

cr
_______________________________________________
darwin-drivers mailing list | darwin-drivers@lists.apple.com
Help/Unsubscribe/Archives: http://www.lists.apple.com/mailman/listinfo/darwin-drivers
Do not post admin requests to the list. They will be ignored.



9:46:27 PM    comment []



    From:       Chuck Remes
    Subject:     Re: ethernet driver: kIOReturnOutputStall responsibility
    Date:     February 25, 2004 10:11:44 PM CST
    To:       Justin Walker
    Cc:       darwin-drivers@lists.apple.com, hackers@opendarwin.org

On Feb 25, 2004, at 8:51 PM, Justin Walker wrote:

Thanks for following up; it's always good to complete a discussion for the sake of the archives ;=}

On Wednesday, February 25, 2004, at 06:09 PM, chuck remes wrote:

The UDP_STREAM test was more interesting. Immediately after firing up the test, the driver started kicking out errors about stalling the output queue (kIOReturnOutputStall). The netperf application was firing off packets as fast as it could, and this overran the transmit descriptor rings in the driver. The error handling detected this exhaustion of resources, returned kIOReturnOutputStall to the "upper" layers of the stack, and waited for the watchdog timer to go off to call transmitQueue->service( IOBasicOutputQueue::kServiceAsync ). The service call notifies the stack that everything is peachy again and transmission can restart.

When you say "the stack", are you referring to the networking stack, or code below that stack but above this part of your driver?  AFAIK, the networking stack doesn't do timeouts or other recovery, nor does it keep state about recent attempts to transmit.  Here, I mean UDP, of course; TCP is a different kettle of code.

I'm referring to anything above the driver.


So I modified the code to return false from outputPacket if there weren't any available resources to transmit another packet. This put the onus for error recovery back on the stack (or application). Rerunning the test resulted in a maximum usage of resources at all times within the driver and then some. Honestly, it exposed some problems with performance when servicing interrupts, so I'm glad I ran it. It caused my call to IOMbufCursor::getPhysicalSegmentsWithCoalesce to return zero a bunch of times.

To make sure I understand: does this mean that you are just passing an error condition upstream and not attempting to do any recovery on your own (which would not be A Good Thing)?

Mmm, no. Let me explain again. When I am preparing a packet for transmission, I coalesce whatever the stack handed to me into a single mbuf. I call getPhysicalblahblah to do this work AND to retrieve the physical address of the buffer. That method (and its non-coalescing brethren) are the only "safe" ways to get physical addresses unless you want to use the sekret function calls buried within the bowels of the IONetworkingFamily. I want this code to work on new releases of darwin/OSX, so I don't mess around and just use the nice IOKit api call.

Under extreme stress (netperf is trying to send a bajillion 64 byte UDP packets as fast as it can), the getPhysical* call will return 0. There is no way to recover from this since the resource is exhausted except for waiting/sleeping and trying again. Instead of retrying, I return "false" from my outputPacket method. Is this not A Good Thing?

I now have to investigate why sending and receiving are failing under extreme stress, so if anyone knows why getPhysicalSegmentsWithCoalesce or copyPacket would fail, I'd love to hear it.

Both of these are in the IONetworkingFamily code (which you say you've looked at):
  copyPacket fails when there are no mbufs available
   or when the source mbuf is malformed;
  the Newton-John call (getPhysical :-}) fails for
   a variety of strange and wondrous reasons, mostly
   having to do with malformed mbufs and resource runout.

Oh yes, I've looked at that code. I sometimes dream about it. :-)

I guess the system is running out of mbufs. When I switch to using the built-in ethernet (GMAC driver), it doesn't print any of the errors I see listed in the source. I'll have to double-check, but I don't recall the error counters incrementing from dropped packets either (though I may be wrong on that score). However, I'm left with the impression that it is handling the stress better than my driver and that I cannot abide. I'll go nuts tuning this thing and it's probably due to lame-o hardware. :-)

cr
_______________________________________________
darwin-drivers mailing list | darwin-drivers@lists.apple.com
Help/Unsubscribe/Archives: http://www.lists.apple.com/mailman/listinfo/darwin-drivers
Do not post admin requests to the list. They will be ignored.



9:44:41 PM    comment []



    From:       Justin Walker
    Subject:     Re: ethernet driver: kIOReturnOutputStall responsibility
    Date:     February 25, 2004 8:51:35 PM CST
    To:       Chuck Remes
    Cc:       darwin-drivers@lists.apple.com, hackers@opendarwin.org

Thanks for following up; it's always good to complete a discussion for the sake of the archives ;=}

On Wednesday, February 25, 2004, at 06:09 PM, chuck remes wrote:

The UDP_STREAM test was more interesting. Immediately after firing up the test, the driver started kicking out errors about stalling the output queue (kIOReturnOutputStall). The netperf application was firing off packets as fast as it could, and this overran the transmit descriptor rings in the driver. The error handling detected this exhaustion of resources, returned kIOReturnOutputStall to the "upper" layers of the stack, and waited for the watchdog timer to go off to call transmitQueue->service( IOBasicOutputQueue::kServiceAsync ). The service call notifies the stack that everything is peachy again and transmission can restart.

When you say "the stack", are you referring to the networking stack, or code below that stack but above this part of your driver? AFAIK, the networking stack doesn't do timeouts or other recovery, nor does it keep state about recent attempts to transmit. Here, I mean UDP, of course; TCP is a different kettle of code.

So I modified the code to return false from outputPacket if there weren't any available resources to transmit another packet. This put the onus for error recovery back on the stack (or application). Rerunning the test resulted in a maximum usage of resources at all times within the driver and then some. Honestly, it exposed some problems with performance when servicing interrupts, so I'm glad I ran it. It caused my call to IOMbufCursor::getPhysicalSegmentsWithCoalesce to return zero a bunch of times.

To make sure I understand: does this mean that you are just passing an error condition upstream and not attempting to do any recovery on your own (which would not be A Good Thing)?

I now have to investigate why sending and receiving are failing under extreme stress, so if anyone knows why getPhysicalSegmentsWithCoalesce or copyPacket would fail, I'd love to hear it.

Both of these are in the IONetworkingFamily code (which you say you've looked at):
copyPacket fails when there are no mbufs available
or when the source mbuf is malformed;
the Newton-John call (getPhysical :-}) fails for
a variety of strange and wondrous reasons, mostly
having to do with malformed mbufs and resource runout.

Cheers,

Justin

--
Justin C. Walker, Curmudgeon-At-Large  *
Institute for General Semantics        |   When LuteFisk is outlawed
                                       |   Only outlaws will have
                                       |   LuteFisk
*--------------------------------------*-------------------------------*



9:41:26 PM    comment []



    From:       Chuck Remes
    Subject:     Re: ethernet driver: kIOReturnOutputStall responsibility
    Date:     February 25, 2004 8:09:41 PM CST
    To:       darwin-drivers@lists.apple.com
    Cc:       hackers@opendarwin.org

I'm dredging up an email from a LONG time ago cuz I promised I'd let the list know what I found.

There was a thread on here a few weeks back about stress testing a NIC driver. One of the responses suggested using the 'netperf' utility (available at http://www.netperf.org/netperf/NetperfPage.html). I downloaded that utility and started doing some testing.

The TCP_STREAM test was pretty ordinary. From a darwin-ppc box (G5 dual 2.0) to a darwin-x86 box (Athlon 1800), each test resulted in about 94 Mbps throughput.

The UDP_STREAM test was more interesting. Immediately after firing up the test, the driver started kicking out errors about stalling the output queue (kIOReturnOutputStall). The netperf application was firing off packets as fast as it could, and this overran the transmit descriptor rings in the driver. The error handling detected this exhaustion of resources, returned kIOReturnOutputStall to the "upper" layers of the stack, and waited for the watchdog timer to go off to call transmitQueue->service( IOBasicOutputQueue::kServiceAsync ). The service call notifies the stack that everything is peachy again and transmission can restart.

This meant there could be up to a full second of delay while the output queue was stalled. The test returned a dismal score in the neighborhood of 3 Mbps. Not good...

So I modified the code to return false from outputPacket if there weren't any available resources to transmit another packet. This put the onus for error recovery back on the stack (or application). Rerunning the test resulted in a maximum usage of resources at all times within the driver and then some. Honestly, it exposed some problems with performance when servicing interrupts, so I'm glad I ran it. It caused my call to IOMbufCursor::getPhysicalSegmentsWithCoalesce to return zero a bunch of times.

Swapping the effort around so the PC was sending to the Mac, it exposed a problem in my receive handling. The first UDP_STREAM test uses a packet size of 64 bytes which causes copyPacket to fail a LOT.

I now have to investigate why sending and receiving are failing under extreme stress, so if anyone knows why getPhysicalSegmentsWithCoalesce or copyPacket would fail, I'd love to hear it.

cr

On May 4, 2002, at 3:49 PM, chuck remes wrote:

On Saturday, May 4, 2002, at 03:31  PM, Justin C. Walker wrote:

This is more opinion than fact, but it is based on a lot of experience in this area.

I don't think that trying to exert "back pressure" from an ethernet (or similar hardware type) driver is a good idea.  The networking layers are conditioned to expect failures, and the protocols are designed to work reasonably well in the face of resource problems in the network.

Trying to exert back pressure is (for the current software base) not useful, and in fact, it may exacerbate a resource exhaustion condition if intervening layers try to hold on to packets to be attempted later.

I would personally rather have you drop the packet (and bump the appropriate counters in the 'ifnet' structure), and let Higher Authority deal with it as a normal "congestion" problem.  There may be an error condition you can propagate back upstream, but this one doesn't sound like the right one.

On Saturday, May 4, 2002, at 10:28 AM, chuck remes wrote:

In an ethernet driver, you are supposed to return kIOReturnOutputStall when there aren't any available resources to send a packet from your outputPacket() method.

Whose responsibility is it to restart the output queue?  If I look at the doc in IOOutputQueue.h, the headerdoc specifies:
<snip>

Justin,

thanks for the response.  You raise a very good point.  I did a lot more digging in the IONetworkingFamily code and discovered that the only way to clear a stall condition (which is held by your IOOutputQueue, BTW) is to call its start() or service() methods.  I looked at *all* of the drivers in the darwin cvs repository and all of them just return kIOReturnOutputStall without setting a timer or anything to make sure that condition is eventually cleared.  This is probably a bug, so I'll probably file something on it after I've completed my research.

In the meantime, I think I agree with you.  Provided the driver has allocated "reasonable" resources for packet transmission, if these resources are overrun the driver should probably just drop the packet and internally start freeing up some structures.

I'll code up a couple of different things and see how it all behaves.  I'll let the list know what I find.

cr
_______________________________________________
darwin-drivers mailing list | darwin-drivers@lists.apple.com
Help/Unsubscribe/Archives: http://www.lists.apple.com/mailman/listinfo/darwin-drivers
Do not post admin requests to the list. They will be ignored.



9:40:06 PM    comment []



    From:       Justin Walker
    Subject:     Re: ethernet driver: kIOReturnOutputStall responsibility
    Date:     May 4, 2002 5:38:55 PM CDT
    To:       darwin-drivers@lists.apple.com

On Saturday, May 4, 2002, at 01:49 PM, chuck remes wrote:

On Saturday, May 4, 2002, at 03:31  PM, Justin C. Walker wrote:

This is more opinion than fact, but it is based on a lot of experience in this area.

[snip]
Justin,

thanks for the response.  You raise a very good point.  I did a lot more digging in the IONetworkingFamily code and discovered that the only way to clear a stall condition (which is held by your IOOutputQueue, BTW) is to call its start() or service() methods.  I looked at *all* of the drivers in the darwin cvs repository and all of them just return kIOReturnOutputStall without setting a timer or anything to make sure that condition is eventually cleared.  This is probably a bug, so I'll probably file something on it after I've completed my research.

Yuck.  I agree, pending clarification from someone who may understand this better than I.  Odd that this hasn't shown up in any obvious way yet.

A couple of notes: there was a 'start' field in the "ifnet" structure at one point; we removed it because its use seemed to create blocking points where none need be, and because of potential races.  It seemed better to let the driver itself handle all of this, rather than have the users of the driver try to figure out what to do.

Also, while the "stall" response doesn't make sense to me for ethernet-like devices, it may make sense for circuit-oriented devices like ATM.  In this case, back-pressure does make some sense, and dropping frames does not.

In the meantime, I think I agree with you.  Provided the driver has allocated "reasonable" resources for packet transmission, if these resources are overrun the driver should probably just drop the packet and internally start freeing up some structures.

I'm not sure it's a driver-specific issue.  The driver can do everything "right", but if the system mbuf pool runs dry, there's nothing it can do.  Trying to plan for this type of eventuality just introduces needless complexity (IMHO).  But dropping the packet without doing much else seems right to me.

I'll code up a couple of different things and see how it all behaves.  I'll let the list know what I find.

Thanks.

Regards,

Justin

--
Justin C. Walker, Curmudgeon-At-Large  *
Institute for General Semantics        |   If you're not confused,
                                       |   You're not paying attention
*--------------------------------------*-------------------------------*
_______________________________________________
darwin-drivers mailing list | darwin-drivers@lists.apple.com
Help/Unsubscribe/Archives: http://www.lists.apple.com/mailman/listinfo/darwin-drivers
Do not post admin requests to the list. They will be ignored.



9:38:51 PM    comment []



    From:       Chuck Remes
    Subject:     Re: ethernet driver: kIOReturnOutputStall responsibility
    Date:     May 4, 2002 3:49:54 PM CDT
    To:       darwin-drivers@lists.apple.com

On Saturday, May 4, 2002, at 03:31  PM, Justin C. Walker wrote:

This is more opinion than fact, but it is based on a lot of experience in this area.

I don't think that trying to exert "back pressure" from an ethernet (or similar hardware type) driver is a good idea.  The networking layers are conditioned to expect failures, and the protocols are designed to work reasonably well in the face of resource problems in the network.

Trying to exert back pressure is (for the current software base) not useful, and in fact, it may exacerbate a resource exhaustion condition if intervening layers try to hold on to packets to be attempted later.

I would personally rather have you drop the packet (and bump the appropriate counters in the 'ifnet' structure), and let Higher Authority deal with it as a normal "congestion" problem.  There may be an error condition you can propagate back upstream, but this one doesn't sound like the right one.

On Saturday, May 4, 2002, at 10:28 AM, chuck remes wrote:

In an ethernet driver, you are supposed to return kIOReturnOutputStall when there aren't any available resources to send a packet from your outputPacket() method.

Whose responsibility is it to restart the output queue?  If I look at the doc in IOOutputQueue.h, the headerdoc specifies:
<snip>

Justin,

thanks for the response.  You raise a very good point.  I did a lot more digging in the IONetworkingFamily code and discovered that the only way to clear a stall condition (which is held by your IOOutputQueue, BTW) is to call its start() or service() methods.  I looked at *all* of the drivers in the darwin cvs repository and all of them just return kIOReturnOutputStall without setting a timer or anything to make sure that condition is eventually cleared.  This is probably a bug, so I'll probably file something on it after I've completed my research.

In the meantime, I think I agree with you.  Provided the driver has allocated "reasonable" resources for packet transmission, if these resources are overrun the driver should probably just drop the packet and internally start freeing up some structures.
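That drop-and-count policy can be sketched in a few lines. The counter name below parallels BSD's ifp->if_oerrors; in an IOKit driver the equivalent counters live in the statistics the networking family maintains. Everything here is a stand-in, not this driver's code:

```cpp
struct DropPolicy {
    int freeSlots;
    unsigned outputErrors = 0;               // ~ ifp->if_oerrors
    explicit DropPolicy(int n) : freeSlots(n) {}

    bool outputPacket(int /*pkt*/) {
        if (freeSlots == 0) {
            ++outputErrors;                  // congestion: count and drop
            return false;                    // higher layers see the loss
        }
        --freeSlots;
        return true;
    }
};
```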

I'll code up a couple of different things and see how it all behaves.  I'll let the list know what I find.

cr
_______________________________________________
darwin-drivers mailing list | darwin-drivers@lists.apple.com
Help/Unsubscribe/Archives: http://www.lists.apple.com/mailman/listinfo/darwin-drivers
Do not post admin requests to the list. They will be ignored.



9:35:52 PM    comment []



    From:       Justin Walker
    Subject:     Re: ethernet driver: kIOReturnOutputStall responsibility
    Date:     May 4, 2002 3:31:46 PM CDT
    To:       darwin-drivers@lists.apple.com

This is more opinion than fact, but it is based on a lot of experience in this area.

I don't think that trying to exert "back pressure" from an ethernet (or similar hardware type) driver is a good idea.  The networking layers are conditioned to expect failures, and the protocols are designed to work reasonably well in the face of resource problems in the network.

Trying to exert back pressure is (for the current software base) not useful, and in fact, it may exacerbate a resource exhaustion condition if intervening layers try to hold on to packets to be attempted later.

I would personally rather have you drop the packet (and bump the appropriate counters in the 'ifnet' structure), and let Higher Authority deal with it as a normal "congestion" problem.  There may be an error condition you can propagate back upstream, but this one doesn't sound like the right one.

Regards,

Justin

On Saturday, May 4, 2002, at 10:28 AM, chuck remes wrote:

In an ethernet driver, you are supposed to return kIOReturnOutputStall when there aren't any available resources to send a packet from your outputPacket() method.

Whose responsibility is it to restart the output queue?  If I look at the doc in IOOutputQueue.h, the headerdoc specifies:

@constant kIOReturnOutputStall   Stall the queue and retry the same packet
              when the queue is restarted. */

I searched through the rest of the code in the IONetworkingFamily but I didn't see anything that ever checked a return code for kIOReturnOutputStall.

When I hit this condition in my driver, the interface is effectively dead until I do an "ifconfig up/down" sequence which resets everything.

cr
_______________________________________________
darwin-drivers mailing list | darwin-drivers@lists.apple.com
Help/Unsubscribe/Archives: http://www.lists.apple.com/mailman/listinfo/darwin-drivers
Do not post admin requests to the list. They will be ignored.


--
Justin C. Walker, Curmudgeon-At-Large  *
Institute for General Semantics        |    Men are from Earth.
                                       |    Women are from Earth.
                                       |       Deal with it.
*--------------------------------------*-------------------------------*
_______________________________________________
darwin-drivers mailing list | darwin-drivers@lists.apple.com
Help/Unsubscribe/Archives: http://www.lists.apple.com/mailman/listinfo/darwin-drivers
Do not post admin requests to the list. They will be ignored.



9:34:16 PM    comment []



The Apple mailing lists aren't open to 'bots to crawl them, so googling for information on those lists is impossible. In the next few posts, I'll reproduce the thread as it occurred on the darwin-drivers list the last few days. The information there is pretty useful.

The first post was actually made back in 2002. I ran across it while searching through my personal mail archives and decided to close the loop on something said in it.

9:31:31 PM    comment []



Got a lot of help on the <a href="http://lists.apple.com/mailman/listinfo">darwin-drivers</a> mailing list the last few days. I was home from work sick with the flu but started going stir-crazy about 3 hours into my first day. I started doing some performance testing with a tool called <a href="http://www.netperf.org/netperf/NetperfPage.html">netperf</a>, which had been suggested on the drivers list as a good stress-test tool.

I began testing the driver using a G5 (darwin 7.2.0) as the sender and an Athlon 1400 (?) (darwin 7.0.1) as the receiver. The command I used was:

netperf -H 192.168.2.5 -t TCP_STREAM -- -m 1024

The driver performed very nicely. Right away I started getting back 94 Mbps throughput from both sides. Excellent, I thought.

Next, I tried the UDP test. It's a real ball-buster because UDP, unlike TCP, has no flow control. This test streams packets just as fast as your driver and hardware can push them out the door.

The G5 started stalling on transmission almost immediately, and the x86 box just sat there twiddling its thumbs. Switching things up so the x86 box was the sender and the G5 the receiver resulted in the same situation (x86 stalling, G5 waiting).

Obviously a driver problem, but where to start? Since transmit was stalling so quickly, it made sense to tackle that first.

I noticed that after the stall, it would be a "long" time before packets started being sent again. I thought (erroneously, as it turned out) that the box was busy cleaning up after itself and the interrupt delivery was slow. So I added a descriptor cleanup method call directly into my outputPacket() method. This didn't do much. I stumped myself right out of the gate, so I did what any programmer would do in the same situation... I forgot to record my changes and started touching code all over the place.

One of the things I did was reboot the x86 box into freebsd 4.9 and ran the test from there. Running the UDP_STREAM test from the freebsd box towards darwin caused the IONetworkController::copyPacket() method to fail a lot. So at this point I got distracted by receive performance and started working on it. I posted some notes to darwin-drivers and waited. Within 15 minutes or so I got a response; a very detailed response.

Fast forward here... lots of emails went back and forth that day and the next. I got a lot of good hints and information from another programmer who had "been there, done that."  I'll post the entire email thread in subsequent posts, but for now I'll just post what I learned.

1. Do not call a method from outputPacket() that is also being called from your workloop context.

outputPacket() runs on the client's thread, not your driver thread. It can, and does, preempt any work being done by your workloop, so there is a possibility of a race condition. As the system gets more stressed, this possibility becomes a certainty. I panic'ed the machine a few times trying to release an already free mbuf (calling releaseFreePackets()).
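To make the split in #1 concrete, here's a minimal sketch of how I now divide the work. All the names (darwin_tulip, txDescriptorsAvailable, freeTxDescriptors) are placeholders, not code from the real driver; the point is only which context touches what.

```cpp
// Hypothetical sketch: outputPacket() runs on the client's thread, so it
// must not call the descriptor-cleanup code that the workloop also calls.
UInt32 darwin_tulip::outputPacket(struct mbuf *m, void *param)
{
    if (!txDescriptorsAvailable()) {
        // Don't reclaim descriptors or call releaseFreePackets() here;
        // that races the workloop. Just tell the stack to back off.
        return kIOReturnOutputStall;
    }
    // ... program a TX descriptor for 'm' and kick the hardware ...
    return kIOReturnOutputSuccess;
}

// Cleanup lives only in workloop context (interrupt/timer event sources),
// where it is serialized against itself and cannot double-free an mbuf.
void darwin_tulip::interruptOccurred(IOInterruptEventSource *src, int count)
{
    freeTxDescriptors();
}
```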

2. After completing your housekeeping tasks in your interrupt method, call IOBasicOutputQueue::service() and pass in the async option.

The service() call informs the stack that your hardware is ready to begin processing packets again. The weird delay I'd seen between a stall and packets transmitting again was caused by only calling service() from the timer routine, which runs once per second. I did call service() at the end of my interrupt routine, but I had commented it out while chasing down the RX problems and never made a note to undo it.
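As a rough illustration, here's what the end of the interrupt handler looks like with the service() call restored. This is a hypothetical sketch, not my actual driver code: darwin_tulip's freeTxDescriptors() and transmitQueue are placeholder names, though IOBasicOutputQueue::service() and its kServiceAsync option are the real IOKit API as I understand it.

```cpp
// Hypothetical sketch. After reclaiming TX descriptors in the interrupt
// handler (workloop context), poke the output queue so the stack resumes
// sending immediately instead of waiting for the once-per-second timer.
void darwin_tulip::interruptOccurred(IOInterruptEventSource *src, int count)
{
    freeTxDescriptors();   // housekeeping: reclaim completed TX descriptors

    // kServiceAsync asks the queue to restart output on its own thread
    // rather than re-entering outputPacket() recursively on this one.
    if (transmitQueue)
        transmitQueue->service(IOBasicOutputQueue::kServiceAsync);
}
```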

3. If any variables are shared between the outputPacket() method and your interrupt or timer methods, make sure operations on them are atomic.

This is because they run in different contexts, as described in #1. I use temporary variables to track activity in each context and then use OSAddAtomic() or a similar routine to modify the shared variable in one atomic action. This fixed a couple of bizarro problems that I had seen in earlier testing that refused to be reliably duplicated.

4. Shark is your friend.

The great engineers at Apple have provided a wonderful set of tools they call <a href="http://developer.apple.com/tools/performance/">CHUD Tools</a> which is an acronym for Cannibalistic Humanoid Underground Dweller Tools. In some circles it means Computer Hardware Understanding Development Tools, but we don't associate with those people. Anyway, I used Shark to sample the kernel activity when running the netperf tests. It gave a lot of good hints about where, deep in the bowels of the system, the code was choking. Shark, and a side comment made in the mailing list thread, led me to my next discovery.

5. Always use replaceOrCopyPacket() instead of copyPacket() or replacePacket() unless you really, really, really know how to tune performance better than years of measurement and diagnosis by the BSD programmer community.

I was shy about giving up the RX mbufs I allocated during driver startup. It was a pain in the ass, and in my mind, an expensive operation to get the DMA address of each mbuf instead of just reusing the same packets over and over via copyPacket(). I was copying packets ranging from the minimum size all the way up to max (1500 for ethernet). It turns out bcopy() is even more expensive than figuring out the DMA address of a new mbuf and storing it.

6. In reference to #5, IOMbufCursor::getPhysicalSegments() is not that scary of a routine. Just use it.
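Putting #5 and #6 together, the RX path ends up looking roughly like this. This is a hypothetical sketch: rxMbuf, rxDesc, rxHead, descriptorIsComplete(), and nextRxIndex() are invented placeholders for whatever your ring bookkeeping looks like, while replaceOrCopyPacket(), getPhysicalSegments() (on the mbuf cursor, IOMbufMemoryCursor in the headers of that era), and inputPacket() are my reading of the real IOKit calls.

```cpp
// Hypothetical RX sketch. replaceOrCopyPacket() lets IOKit decide: small
// frames get copied into a fresh mbuf, large ones get a replacement mbuf
// swapped into the ring so we skip the expensive bcopy().
void darwin_tulip::_handleRxInterrupt()
{
    UInt32 i = rxHead;                  // next descriptor owned by software
    while (descriptorIsComplete(i)) {
        bool replaced;
        struct mbuf *pkt =
            replaceOrCopyPacket(&rxMbuf[i], rxDesc[i].length, &replaced);
        if (!pkt)
            break;                      // allocation failed; recycle buffer

        if (replaced) {
            // A new mbuf was swapped into the ring, so find its DMA
            // address with the cursor and reprogram the descriptor.
            IOPhysicalSegment seg;
            if (rxCursor->getPhysicalSegments(rxMbuf[i], &seg, 1) == 1)
                rxDesc[i].bufferAddress = seg.location;
        }
        networkInterface->inputPacket(pkt, rxDesc[i].length);
        i = nextRxIndex(i);
    }
    rxHead = i;
}
```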

I probably had a few more insights, but they're lost to memory now.

9:30:15 PM    comment []



© Copyright 2004 Chuck Remes.