 |
Friday, February 27, 2004 |
And that's where it has ended. The driver definitely performs better on
the UDP_STREAM test. It doesn't drop any packets until the packet size
falls to around 227 bytes: above that value the machine can process all
the packets and respond, while below it the CPU maxes out and packets
get dropped (netperf -- -m 185 and smaller). The 'netstat -m' output
also shows that a lot of memory requests were denied.
At this point I have shelved the performance tuning.
10:26:32 PM
|
|
From: Andrew Gallatin
Subject: Re: RX performance needs fixing! Re: ethernet driver: kIOReturnOutputStall responsibility
Date: February 27, 2004 11:38:50 AM CST
To: Chuck Remes
Cc: darwin-drivers@lists.apple.com
Andrew Gallatin writes:
result in multiple stalls. I think a G4 has a huge cache line size
Typo. G4's cache line is 32 bytes, G5's is 128.
Drew
10:22:33 PM
|
|
From: Andrew Gallatin
Subject: Re: RX performance needs fixing! Re: ethernet driver: kIOReturnOutputStall responsibility
Date: February 27, 2004 10:57:15 AM CST
To: Chuck Remes
Cc: darwin-drivers@lists.apple.com
chuck remes writes:
## 61.1% ml_set_interrupts_enabled ---- mach_kernel
   61.1% wait_queue_wakeup_all ----- mach_kernel
   60.8% m_clalloc ----- mach_kernel
   49.9% m_getpackets ----- mach_kernel
   49.9% getPacket(unsigned long, unsigned long, unsigned long, unsigned long) ---- com.apple.iokit.IONetworkingFamily
   49.9% IONetworkController::replaceOrCopyPacket(mbuf**, unsigned long, bool*) ---- com.apple.iokit.IONetworkingFamily
   49.9% darwin_tulip::_handleRxInterrupt() ---- darwin.tulip
If I'm reading this right (and I'm not at all sure I am; I'm not
familiar with Shark output), it looks like m_clalloc() is called with
the mbuf cluster pool exhausted, and is spending a lot of time
waking a thread to allocate more mbuf clusters, and a lot of time
adding those new clusters into the IOMMU page tables.
If you don't run out of mbuf clusters, m_clalloc will never
do anything expensive. m_clalloc() is expensive only when
you are out of mbuf clusters.
Remind me: are you leaking on receive? There's been so much
going on in this thread that I'm getting lost ;)
Drew
10:21:43 PM
|
|
From: Andrew Gallatin
Subject: Re: RX performance needs fixing! Re: ethernet driver: kIOReturnOutputStall responsibility
Date: February 27, 2004 10:37:33 AM CST
To: Chuck Remes
Cc: darwin-drivers@lists.apple.com
chuck remes writes:
Not as simple a picture, really. It appears 23.3% of the time is spent
directly in the _handleRxInterrupt() method. Zeroing down to the
source, it is spending the majority of its time on the test I have at
the top of my 'for' loop which looks like this:
for ( ;
!( ( status = rx_desc_ring[ i ].status ) & rx_STAT_OWN );
processed++, RXINC( i ) )
I can't believe it really takes that much time to AND a 32-bit value
against a single bit and test to see if it is zero or not. This leads
me to conclude that the reason this routine is showing up at the top of
One comment on just this: you said in a subsequent post that your
device DMAs the status of the receive back up to the host. If so, that
will invalidate the cache, and cause a huge stall. So I expect most
of what you are seeing is the penalty for this cache miss.
Also, the device may be DMA'ing a new event which *could be in the
same cache line* as the one you are currently reading. This could
result in multiple stalls. I think a G4 has a huge cache line size
(128 bytes), so this could be a real problem.
Can you try to align your descriptors on cache-line or better
boundaries, so that a single descriptor does not straddle a cache
line? Also, using more descriptors might allow the device to get
"further ahead" so that you'd reduce the likelyhood that you were
reading from the same cache line it was DMA'ing to.
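(Editor's sketch of this alignment idea: pad each descriptor out to a
full line and allocate the ring line-aligned. The padded struct and the
IOMallocContiguous() call are illustrative, not from the actual driver.)

#define CACHE_LINE_SIZE 128   // G5 line size; 32 on a G4

// one descriptor per cache line, so the chip never DMAs status into a
// line the CPU is concurrently reading for a different slot
typedef struct
{
    tulip_descriptor_t desc;
    UInt8 pad[ CACHE_LINE_SIZE - sizeof( tulip_descriptor_t ) ];
} padded_desc_t;

// allocate the ring physically contiguous and line-aligned
IOPhysicalAddress ringPhys;
padded_desc_t *rx_ring = (padded_desc_t *) IOMallocContiguous(
    TULIP_RX_RING_LENGTH * sizeof( padded_desc_t ),
    CACHE_LINE_SIZE,        // requested alignment
    &ringPhys );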
Drew
10:20:38 PM
|
|
From: Andrew Gallatin
Subject: Re: TX performance fixed! Re: ethernet driver: kIOReturnOutputStall responsibility
Date: February 27, 2004 8:52:58 AM CST
To: Chuck Remes
Cc: darwin-drivers@lists.apple.com
chuck remes writes:
cremes% ./netperf -H 192.168.2.38 -t UDP_STREAM -- -m 1024
UDP UNIDIRECTIONAL SEND TEST to 192.168.2.38
Socket  Message  Elapsed      Messages
Size    Size     Time         Okay Errors   Throughput
bytes   bytes    secs            #      #   10^6bits/sec

  9216    1024   10.00     1834805      0      1503.06
 42080           10.00      115053               94.25
Ick. Nothing is returning ENOBUFS, so the application has no idea
that the packets are flying into the bit bucket. Apple's drivers
behave exactly the same, so it's not your fault.
As a counter example, here is what my driver does (2xG5 sending to a
P4 running FreeBSD):
% netperf -Hscream-my -tUDP_STREAM -- -m 8192
UDP UNIDIRECTIONAL SEND TEST to scream-my
Socket  Message  Elapsed      Messages
Size    Size     Time         Okay Errors   Throughput
bytes   bytes    secs            #      #   10^6bits/sec

  9216    8192   10.00      302587 804033      1983.02
 41600           10.00      302428              1981.98
This is a 2 Gb/sec link with a 9K mtu. So ~1980 is a decent number.
Do you notice the 804033 errors?
This is the difference between an IOKit driver, and a BSD driver.
IOKit seems to have re-implemented the if_snd queue, and hidden it
from the stack. Because my driver is using the if_snd queue for its
queuing, ip_output() notices when the queue fills:
/*
* Verify that we have any chance at all of being able to queue
* the packet or packet fragments
*/
if ((ifp->if_snd.ifq_len + ip->ip_len / ifp->if_mtu + 1) >=
ifp->if_snd.ifq_maxlen) {
error = ENOBUFS;
goto bad;
}
This is arguably better than a silent drop at the driver or IOKit
level, as it avoids a big trip through the lower levels of the stack.
(dlil processing, arp lookups, etc).
There seem to be at least 3 different behaviours for
sending datagrams faster than the link can handle:
1) ENOBUFS -- from all BSDs, and MacOSX with a BSD network driver.
2) Silent drops -- MacOSX with an IOKit driver
3) Blocking -- Linux
I personally like ENOBUFS best. At least the app has some
clue that there is a problem. Blocking and silent drops
just seem wrong to me. But I'm an old BSD hack, so take
my opinion with a grain of salt ;)
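(Editor's aside: a tiny userland sketch of why behaviour (1) is the
friendly one -- the sender can see ENOBUFS and back off instead of
losing data silently. send_with_backoff() is a hypothetical helper.)

#include <errno.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/socket.h>

ssize_t send_with_backoff( int s, const void *buf, size_t len )
{
    ssize_t n;
    // ENOBUFS means the driver's queue is full: yield briefly and retry
    // rather than letting the datagram vanish
    while ( ( n = send( s, buf, len, 0 ) ) < 0 && errno == ENOBUFS )
        usleep( 1000 );
    return n;
}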
Drew
10:19:24 PM
|
|
From: Chuck Remes
Subject: Re: RX performance needs fixing! Re: ethernet driver: kIOReturnOutputStall responsibility
Date: February 26, 2004 11:04:57 PM CST
To: darwin-drivers@lists.apple.com
On Feb 26, 2004, at 9:53 PM, Justin Walker wrote:
On Thursday, February 26, 2004, at 06:54 PM, chuck remes wrote:
for ( ;
!( ( status = rx_desc_ring[ i ].status ) & rx_STAT_OWN );
processed++, RXINC( i ) )
I can't believe it really takes that much time to AND a 32-bit value against a single bit and test to see if it is zero or not.
If
this is actually reading device registers, then expect it to take a
while. You are crossing bus boundaries, and that is a non-trivial
expense (especially on RISC-style systems).
No, this isn't reading any device register. When the hardware DMAs the
packet into the allocated buffer, it clears its "ownership" bit on the
descriptor associated with the buffer. The descriptor lives in main
memory, so this read cost should be low.
This leads me to conclude that the
reason this routine is showing up at the top of the list is because the
system is being overrun with RX Interrupts. Sound reasonable? A fix for
this would be to use a hardware clock that generated a single RX
interrupt every X packets or Y milliseconds, but the ADMtek doesn't
have that facility (though some other tulips do). However, I could
setup a new timer function and have it fire every 10 ms or so.
I'm not the expert on your code (:-}), so this is just shooting in the dark.
- if you are really overrun by interrupts, then the check for
the OWN bit should succeed (or fail; don't know its definition)
fairly often. When you find a descriptor you don't own, do
you bail or wait/spin?
When that test fails, I bail and wait for the next interrupt. While it
remains true, I loop through all the RX descriptors and call
inputPacket() on them.
- I've lost track - what's the speed? If it's gigabit, it may
be that polling will give you an improvement (a la FreeBSD)
if done right.
It's for a line of 10/100 cards. I'll work on a gigabit card when I can afford a switch and some cards to test with. :-)
[snip]
I've not directly dealt with gigabit engines, but with lower-speed
devices, you should be able to have a fairly efficient receive process by
emptying the receive queue on each interrupt. Are you sure that
each interrupt is supplying you with a newly received frame? It
might be instructive to look at a histogram of the number of frames you
take off the receive queue on each receive interrupt. I've done
that in the past, and it helped me home in on some of the performance
problems.
The tulip chipsets generate an interrupt for each received packet. I
empty the queue/list each time I receive one. This is serialized
through the workloop construct. The secondary interrupt is scheduled by
a primary interrupt filter (necessary for multi-port cards).
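(Editor's sketch of that filter/secondary split. _wrapInterruptMethod()
is the wrapper visible in the Shark traces; _wrapFilterMethod() and
_checkIsOurInterrupt() are hypothetical names.)

bool darwin_tulip::_wrapFilterMethod( OSObject *owner,
                                      IOFilterInterruptEventSource * )
{
    // primary filter: runs at interrupt context, so do the bare minimum --
    // decide whether this (possibly shared, multi-port) interrupt is ours
    return ( (darwin_tulip *) owner )->_checkIsOurInterrupt();
}

// during start(): a 'true' from the filter schedules the secondary
// interrupt, which runs _wrapInterruptMethod() serialized on the workloop
interruptSource = IOFilterInterruptEventSource::filterInterruptEventSource(
    this, _wrapInterruptMethod, _wrapFilterMethod, provider );
getWorkLoop()->addEventSource( interruptSource );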
Thanks for your input. I need to get a little distance from this code and just let my subconscious mull it over.
cr
10:17:51 PM
|
|
From: Justin Walker
Subject: Re: RX performance needs fixing! Re: ethernet driver: kIOReturnOutputStall responsibility
Date: February 26, 2004 9:53:15 PM CST
To: darwin-drivers@lists.apple.com
On Thursday, February 26, 2004, at 06:54 PM, chuck remes wrote:
[snip]
I traced it out through the source code, and I see that
replaceOrCopyPacket() internally calls getPacket() with M_NOWAIT. When
netperf sends a message of size 185 bytes, adding in the headers gives
a total of 227 bytes. This is the length passed in to
replaceOrCopyPacket() which exceeds the threshold for MHLEN. It
therefore tries to get a packet of length m->m_pkthdr.len which in
this case is 1518 bytes.
Run the test again, and this time set the message size to be 64 bytes.
Adding in the 42 byte header gives a total of 106 bytes. This falls
beneath the MHLEN test, so it should do m_gethdr() and place the
payload inside it. So I ran Shark again and this is the result it gave:
[snip]
Not as simple a picture, really. It appears 23.3% of the time is spent
directly in the _handleRxInterrupt() method. Zeroing down to the
source, it is spending the majority of its time on the test I have at
the top of my 'for' loop which looks like this:
for ( ;
!( ( status = rx_desc_ring[ i ].status ) & rx_STAT_OWN );
processed++, RXINC( i ) )
I can't believe it really takes that much time to AND a 32-bit value against a single bit and test to see if it is zero or not.
If this is actually reading device registers, then expect it to take a
while. You are crossing bus boundaries, and that is a non-trivial
expense (especially on RISC-style systems).
This
leads me to conclude that the reason this routine is showing up at the
top of the list is because the system is being overrun with RX
Interrupts. Sound reasonable? A fix for this would be to use a hardware
clock that generated a single RX interrupt every X packets or Y
milliseconds, but the ADMtek doesn't have that facility (though some
other tulips do). However, I could setup a new timer function and have
it fire every 10 ms or so.
I'm not the expert on your code (:-}), so this is just shooting in the dark.
- if you are really overrun by interrupts, then the check for
the OWN bit should succeed (or fail; don't know its definition)
fairly often. When you find a descriptor you don't own, do
you bail or wait/spin?
- I've lost track - what's the speed? If it's gigabit, it may
be that polling will give you an improvement (a la FreeBSD)
if done right.
- I would not bother with the hardware clock trick, at least
until you understand what the real problem is. You are just
introducing more moving parts into an already complex state machine.
I've not directly dealt with gigabit engines, but with lower-speed
devices, you should be able to have a fairly efficient receive process by
emptying the receive queue on each interrupt. Are you sure that
each interrupt is supplying you with a newly received frame? It
might be instructive to look at a histogram of the number of frames you
take off the receive queue on each receive interrupt. I've done
that in the past, and it helped me home in on some of the performance
problems.
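(Editor's sketch of that instrumentation; rxHistogram, the bucket count,
and the dump point are hypothetical. 'processed' is the counter from
Chuck's drain loop.)

#define HIST_BUCKETS 16
static UInt32 rxHistogram[ HIST_BUCKETS ];

// at the bottom of _handleRxInterrupt(), after the drain loop finishes,
// 'processed' holds the number of frames taken off the ring this pass
rxHistogram[ processed < HIST_BUCKETS ? processed : HIST_BUCKETS - 1 ]++;

// dump it occasionally (say, from the watchdog) to see the distribution
for ( int i = 0; i < HIST_BUCKETS; i++ )
    IOLog( "rx frames/intr %d: %u\n", i, (unsigned int) rxHistogram[ i ] );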
Moving
further down, we see ppc_usimple_unlock_rwmb() is taking up tons of
time being called from m_gethdr(). Likewise, m_retry() is contributing
a lot of time to the overall 15.7% for that section of code.
Next, _enable_preemption is chewing up a respectable 10.7% of the sampled time being called by getPacket() and m_retryhdr().
Lastly, ppc_usimple_lock() gets its shot at the big time being called by m_gethdr(), m_retry(), and dlil_input().
What does this all add up to? If I knew, I wouldn't have spammed the list.
I think the above is just a side-effect of something else. Of
course, it doesn't hurt to verify that there is not some odd
side-effect of your code causing the system to go bonkers...
Regards,
Justin
--
Justin C. Walker, Curmudgeon-At-Large  *
Institute for General Semantics        | Men are from Earth.
                                       | Women are from Earth.
                                       | Deal with it.
*--------------------------------------*-------------------------------*
10:16:00 PM
|
|
From: Chuck Remes
Subject: RX performance needs fixing! Re: ethernet driver: kIOReturnOutputStall responsibility
Date: February 26, 2004 8:54:16 PM CST
To: darwin-drivers@lists.apple.com
Okay, now that TX performance is out of the way (see previous message), time to get back to the RX problem.
Summary:
The driver can barely handle receiving UDP packets when the following command is issued from another computer on the network:
netperf -H <addr of OSX box> -t UDP_STREAM -- -m 185 -s 2560 -S 2560
It falls over completely (i.e. kernel_task goes to 102% and 'netstat
-m' reports tons of denied memory requests) when the message size goes
to 184 bytes ('-m 184'). Very few transmits occur during this time
because the system is apparently out of mbufs.
New Information:
I've got lots of fancy debugging tools on the machine courtesy of Apple
Computer. I set up my Interrupt method to set/unset the "marked bit" as
the routine starts & finishes. I then fire up Shark to sample the
marked bits during some heavy load. What do I see? (I hope the
formatting sticks after it goes through the mailer.) The Shark files
that generated this data are available if anyone wants to take a look
(with the source embedded for easy reference).
## 61.1% ml_set_interrupts_enabled ---- mach_kernel
   61.1% wait_queue_wakeup_all ----- mach_kernel
   60.8% m_clalloc ----- mach_kernel
   49.9% m_getpackets ----- mach_kernel
   49.9% getPacket(unsigned long, unsigned long, unsigned long, unsigned long) ---- com.apple.iokit.IONetworkingFamily
   49.9% IONetworkController::replaceOrCopyPacket(mbuf**, unsigned long, bool*) ---- com.apple.iokit.IONetworkingFamily
   49.9% darwin_tulip::_handleRxInterrupt() ---- darwin.tulip
   49.9% darwin_tulip::_interruptOccurred(IOInterruptEventSource*, long) ---- darwin.tulip
   49.9% darwin_tulip::_wrapInterruptMethod(OSObject*, IOInterruptEventSource*, int) ---- darwin.tulip
   49.9% IOInterruptEventSource::checkForWork() ----- mach_kernel
   49.9% IOWorkLoop::threadMain() ---- mach_kernel
   10.9% m_mclalloc ---- mach_kernel
   10.2% dlil_input ----- mach_kernel
    0.0% thread_continue ----- mach_kernel
    0.0% IOWorkLoop::threadMain() ----- mach_kernel
## 11.3% ppc_usimple_unlock_rwmb --- mach_kernel
    4.7% m_gethdr --- mach_kernel
    4.7% m_getpackets --- mach_kernel
    4.7% getPacket(unsigned long, unsigned long, unsigned long, unsigned long) ---- com.apple.iokit.IONetworkingFamily
    4.7% IONetworkController::replaceOrCopyPacket(mbuf**, unsigned long, bool*) ---- com.apple.iokit.IONetworkingFamily
    4.3% m_getpackets ---- mach_kernel
    4.3% getPacket(unsigned long, unsigned long, unsigned long, unsigned long) ---- com.apple.iokit.IONetworkingFamily
    4.3% IONetworkController::replaceOrCopyPacket(mbuf**, unsigned long, bool*) ----- com.apple.iokit.IONetworkingFamily
I'm listing the top 2 lines expanded out to show the stack. As
expected, the m_getpackets() function is where all the time is being
spent.
I traced it out through the source code, and I see that
replaceOrCopyPacket() internally calls getPacket() with M_NOWAIT. When
netperf sends a message of size 185 bytes, adding in the headers gives
a total of 227 bytes. This is the length passed in to
replaceOrCopyPacket() which exceeds the threshold for MHLEN. It
therefore tries to get a packet of length m->m_pkthdr.len which in
this case is 1518 bytes.
Run the test again, and this time set the message size to be 64 bytes.
Adding in the 42 byte header gives a total of 106 bytes. This falls
beneath the MHLEN test, so it should do m_gethdr() and place the
payload inside it. So I ran Shark again and this is the result it gave:
23.3% darwin_tulip::_handleRxInterrupt() ---- darwin.tulip
   22.8% darwin_tulip::_interruptOccurred(IOInterruptEventSource*, long) --- darwin.tulip
   22.8% darwin_tulip::_wrapInterruptMethod(OSObject*, IOInterruptEventSource*, int) --- darwin.tulip
   22.8% IOInterruptEventSource::checkForWork() --- mach_kernel
   22.8% IOWorkLoop::threadMain() --- mach_kernel
    0.5% darwin_tulip::_handleRxInterrupt() --- darwin.tulip
15.7% ppc_usimple_unlock_rwmb --- mach_kernel
    7.9% m_gethdr --- mach_kernel
    7.9% getPacket(unsigned long, unsigned long, unsigned long, unsigned long) --- com.apple.iokit.IONetworkingFamily
    7.9% IONetworkController::replaceOrCopyPacket(mbuf**, unsigned long, bool*) --- com.apple.iokit.IONetworkingFamily
    7.9% darwin_tulip::_handleRxInterrupt() --- darwin.tulip
    5.9% m_retry --- mach_kernel
    2.0% dlil_input --- mach_kernel
    0.0% IOBasicOutputQueue::service(unsigned long) --- com.apple.iokit.IONetworkingFamily
10.7% _enable_preemption --- mach_kernel
    5.2% getPacket(unsigned long, unsigned long, unsigned long, unsigned long) --- com.apple.iokit.IONetworkingFamily
    3.9% m_retryhdr --- mach_kernel
    1.2% IONetworkInterface::inputPacket(mbuf*, unsigned long, unsigned long, void*) --- com.apple.iokit.IONetworkingFamily
    0.2% m_gethdr --- mach_kernel
    0.2% m_retry --- mach_kernel
    0.0% dlil_input --- mach_kernel
10.0% ppc_usimple_lock --- mach_kernel
    5.2% m_gethdr --- mach_kernel
    2.7% m_retry --- mach_kernel
    2.1% dlil_input --- mach_kernel
Not as simple a picture, really. It appears 23.3% of the time is spent
directly in the _handleRxInterrupt() method. Zeroing down to the
source, it is spending the majority of its time on the test I have at
the top of my 'for' loop which looks like this:
for ( ;
!( ( status = rx_desc_ring[ i ].status ) & rx_STAT_OWN );
processed++, RXINC( i ) )
I can't believe it really takes that much time to AND a 32-bit value
against a single bit and test to see if it is zero or not. This leads
me to conclude that the reason this routine is showing up at the top of
the list is because the system is being overrun with RX Interrupts.
Sound reasonable? A fix for this would be to use a hardware clock that
generated a single RX interrupt every X packets or Y milliseconds, but
the ADMtek doesn't have that facility (though some other tulips do).
However, I could setup a new timer function and have it fire every 10
ms or so.
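(Editor's sketch of that software-timer fallback, using an
IOTimerEventSource on the existing workloop; the member and wrapper names
are hypothetical.)

// during start(): create the timer on the driver's workloop
timerSource = IOTimerEventSource::timerEventSource( this, _wrapTimerMethod );
getWorkLoop()->addEventSource( timerSource );
timerSource->setTimeoutMS( 10 );

// static wrapper, same pattern as _wrapInterruptMethod()
void darwin_tulip::_wrapTimerMethod( OSObject *owner, IOTimerEventSource * )
{
    darwin_tulip *me = (darwin_tulip *) owner;
    me->_handleRxInterrupt();             // same drain loop as the interrupt path
    me->timerSource->setTimeoutMS( 10 );  // re-arm for the next poll
}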
Moving further down, we see ppc_usimple_unlock_rwmb() is taking up tons
of time being called from m_gethdr(). Likewise, m_retry() is
contributing a lot of time to the overall 15.7% for that section of
code.
Next, _enable_preemption is chewing up a respectable 10.7% of the sampled time being called by getPacket() and m_retryhdr().
Lastly, ppc_usimple_lock() gets its shot at the big time being called by m_gethdr(), m_retry(), and dlil_input().
What does this all add up to? If I knew, I wouldn't have spammed the list.
cr
10:14:13 PM
|
|
From: Chuck Remes
Subject: TX performance fixed! Re: ethernet driver: kIOReturnOutputStall responsibility
Date: February 26, 2004 6:29:18 PM CST
To: darwin-drivers@lists.apple.com
I took a break from working on the RX performance problems and took a
look at the TX stalls and how I was handling it. I learned a few things
that the list may want to add to its collective wisdom.
1. Do not call releaseFreePackets() from the client thread AND the driver workloop thread.
Earlier in this thread I posted all the code from my outputPacket()
method. One of the things I tried to alleviate the stalling condition
was to call my TX ring cleanup routine directly from within
outputPacket(). I theorized that the sooner I could free up those
resources, the better off I'd be. In the cleanup routine, I call
freePacket( mbuf, kDelayFree ) from within a cleanup loop, and then
call releaseFreePackets() at the end of it. It was possible (hell,
probable) that _handleTxCleanup() would be preempted by outputPacket()
trying to run the same routine.
Panic city! The error was "panic(cpu 0): freeing free mbuf" in the panic.log. Don't do this.
2. Use IOBasicOutputQueue::service( IOBasicOutputQueue::kServiceAsync ) when making this call within the workloop context.
This call clues in the upper layers that the hardware is now ready to
begin sending new packets. I discovered I had commented it out during a
debug session about a week ago. This resulted in service() only being
called by the watchdog timer routine. Also, be sure to call it with the
kServiceAsync option.
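(Editor's sketch of how the two rules combine in the TX-complete path, so
a stall clears as soon as descriptors free up instead of waiting for the
watchdog. _handleTxInterrupt() is a hypothetical hook; the other members
appear elsewhere in this thread.)

void darwin_tulip::_handleTxInterrupt()
{
    _handleTxCleanup();                     // reclaim finished descriptors
    if ( transmitterStalled && txActiveCount <= TRANSMIT_QUEUE_LENGTH )
    {
        transmitterStalled = false;
        // we are already on the workloop here, so ask for async service
        // to avoid re-entering outputPacket() on this thread
        transmitQueue->service( IOBasicOutputQueue::kServiceAsync );
    }
}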
Performance is better. There was a note posted by an Apple engineer
about 5 months ago that gave this clue about the async option. He was
so right it hurts.
These two changes made 'netperf' perform significantly better. Here's the output:
cremes% ./netperf -H 192.168.2.38 -t UDP_STREAM -- -m 1024
UDP UNIDIRECTIONAL SEND TEST to 192.168.2.38
Socket  Message  Elapsed      Messages
Size    Size     Time         Okay Errors   Throughput
bytes   bytes    secs            #      #   10^6bits/sec

  9216    1024   10.00     1834805      0      1503.06
 42080           10.00      115053               94.25
Before the change, the send throughput was in the neighborhood of 95 Mbps and the rx throughput was about 1 Mbps.
Many thanks to Andrew Gallatin for entertaining my questions today.
Now all I have left to do is squash the UDP receive performance
problems... and those may be intractable due to the hardware
requirements for buffer alignment.
cr
10:12:46 PM
|
|
From: Chuck Remes
Subject: Re: ethernet driver: kIOReturnOutputStall responsibility
Date: February 26, 2004 5:18:40 PM CST
To: Steve Modica
Cc: darwin-drivers@lists.apple.com
On Feb 26, 2004, at 6:32 AM, Steve Modica wrote:
darwin-drivers-request@lists.apple.com wrote:
Oh, and one more thing. I only get the copyPacket error when netperf is
doing its small, 64-byte UDP test. When the packets get bigger (1024+),
then I never run out of this resource. It looks like a timing thing. I
process every receive interrupt I get, so it's not like I'm sitting on
my hands waiting to clean up received packets.
I know there aren't too many clues here, so if you want to see some code let me know.
cr
I'm not sure if it's relevant, but the IP input queue in OS X is only 50
packets by default. When receiving so many tiny packets (and assuming
there's some kind of coalescing going on) are you simply overrunning
it? If you look in sysctl, do you see this value incrementing:
net.inet.ip.intr_queue_drops: 0
I've never seen this value change to !0.
cr
10:12:08 PM
|
|
From: Chuck Remes
Subject: Re: ethernet driver: kIOReturnOutputStall responsibility
Date: February 26, 2004 5:08:04 PM CST
To: darwin-drivers@lists.apple.com
The 'netstat -m' output looks like this *after* the program finishes
executing:
286 mbufs in use:
250 mbufs allocated to data
35 mbufs allocated to socket names and addresses
1 mbufs allocated to Appletalk data blocks
15361/16384 mbuf clusters in use
That seems a wee bit high. Are you sure you don't
have a leak? Does that increase each time you test?
If you're referring to the 15361/16384 values, then the answer is
yes. When running the test, that value is usually 211/1428, but as it
gets closer to the magic value of 185 it shoots upwards and stays there.
If this is a leak, then it would be on the receive side, right?
According to netperf on the freebsd side, it's getting very few
responses so it can't be a leak from transmission (or can it?). When I
receive a packet, there aren't any conditions where I call
freePacket(). I'm passing it up the stack, and somewhere above the
driver they ought to be calling freePacket().
Looks like I'm getting some resolution on the receive end of things,
but transmission still has issues. I'll tackle that AFTER I tune this
receive process as much as possible.
Question: Is this one of those sick tulips which requires
that all DMA addresses be 32-bit aligned?
According to the docs I have for the DEC 21143, ADMtek 985, and Lite-On
PNIC, they *all* require the descriptors, receive buffers and transmit
buffers to be longword aligned. I guess that makes the entire line of
chipsets sick.
IOKit hands you mbufs based upon some information passed in through
getPacketBufferConstraints(). In there I specify longword alignment
like so:
void darwin_tulip::getPacketBufferConstraints( IOPacketBufferConstraints *constraints ) const
{
constraints->alignStart = kIOPacketBufferAlign4; // longword aligned.
constraints->alignLength = kIOPacketBufferAlign1; // no restriction.
}
The second constraint tells the system it doesn't have to pad the end
of it to end up on any particular address boundary. In comparison, the
only public Apple driver that sets any kind of constraint is the
AppleIntel8255x project. It requires even word alignment for packets.
Is it possible this requirement puts a throttle on the overall
throughput allowed? For kicks, I did an allocatePacket/freePacket loop
during my driver load to force the system to create a bunch of mbufs
with the alignment I need. After running the test again, the results
were a lot better, but not perfect. The plot thickens.
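(Editor's sketch of that warm-up loop, run once from start();
POOL_WARMUP is a hypothetical constant.)

#define POOL_WARMUP 512
struct mbuf *warm[ POOL_WARMUP ];

// allocatePacket() may sleep at load time, which is fine; the point is
// to grow the system's mbuf/cluster pools before any traffic arrives
for ( int i = 0; i < POOL_WARMUP; i++ )
    warm[ i ] = allocatePacket( kIOEthernetMaxPacketSize );
for ( int i = 0; i < POOL_WARMUP; i++ )
    if ( warm[ i ] )
        freePacket( warm[ i ] );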
BTW, here are the results of a bunch of netperf runs. The system config
is the same as I listed in an earlier email (freebsd 4.9 sending to OSX
10.3.2/darwin 7.2).
netperf -H 192.168.2.1 -t UDP_STREAM -- -m 190
150 mbufs in use:
142 mbufs allocated to data
7 mbufs allocated to socket names and addresses
1 mbufs allocated to Appletalk data blocks
211/1428 mbuf clusters in use
2893 Kbytes allocated to network (15% in use)
0 requests for memory denied
0 requests for memory delayed
0 calls to protocol drain routines
netperf -H 192.168.2.1 -t UDP_STREAM -- -m 185
150 mbufs in use:
142 mbufs allocated to data
7 mbufs allocated to socket names and addresses
1 mbufs allocated to Appletalk data blocks
1781/16384 mbuf clusters in use
32805 Kbytes allocated to network (10% in use)
0 requests for memory denied
0 requests for memory delayed
0 calls to protocol drain routines
netperf -H 192.168.2.1 -t UDP_STREAM -- -m 184
150 mbufs in use:
142 mbufs allocated to data
7 mbufs allocated to socket names and addresses
1 mbufs allocated to Appletalk data blocks
1851/16384 mbuf clusters in use
32805 Kbytes allocated to network (11% in use)
50109 requests for memory denied
0 requests for memory delayed
0 calls to protocol drain routines
netperf -H 192.168.2.1 -t UDP_STREAM -- -m 64
151 mbufs in use:
143 mbufs allocated to data
7 mbufs allocated to socket names and addresses
1 mbufs allocated to Appletalk data blocks
15361/16384 mbuf clusters in use
32805 Kbytes allocated to network (-34% in use)
1144315 requests for memory denied
0 requests for memory delayed
0 calls to protocol drain routines
10:10:59 PM
|
|
From: Justin Walker
Subject: Re: ethernet driver: kIOReturnOutputStall responsibility
Date: February 26, 2004 2:36:34 PM CST
To: darwin-drivers@lists.apple.com
You are in good hands for most of this discussion, so I just have a minor comment on one part of the thread:
On Thursday, February 26, 2004, at 11:34 AM, chuck remes wrote:
[snip]
I already responded to this, but I
made a driver change which improved the receive process (as documented
above in this message). I commented out the copyPacket() stuff and now
use replaceOrCopyPacket(). It performs MUCH MUCH better when the
UDP_STREAM is using large packets (thanks for the idea!). As a matter
of fact, I don't get a single allocation failure when message size is
greater than 184 bytes. Any idea why this might be such a magic value?
Is this approximately the size of MHLEN?
On PowerPC, these are the values:
Sizeof pkthdr: 32
MSIZE: 256
MHLEN: 204
MLEN: 236
MINCLSIZE: 440
I believe the values are now the same for *86, but for a long time, MSIZE was 128 on that platform.
Note that the MINCLSIZE value is the cutover from 'mbuf chain' to
'cluster'. This means that if you have more than two mbufs of
data to go, the system will (generally) allocate a single cluster,
rather than a chain of mbufs for the data. I've done tests with
netperf, and you can see the effect of MINCLSIZE as the size of the
transmission increases (as a 'cusp' in the curve).
Note that the MLEN/MHLEN values must include protocol (IP, TCP/UDP,
...) headers as well. Sometimes these are added as additional
mbufs, and sometimes space is left at the beginning of the mbuf;
this depends on the source of the data being transmitted, and the
(tortuous) path through the socket and protocol layers.
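(Editor's sketch of the cutover Justin describes, using the PowerPC
constants above. The real decision lives inside the kernel's mbuf
allocation paths; pick_storage() is purely illustrative.)

#define MSIZE     256
#define MHLEN     204   // MSIZE - sizeof(m_hdr) - sizeof(pkthdr)
#define MLEN      236   // MSIZE - sizeof(m_hdr)
#define MINCLSIZE 440   // MHLEN + MLEN: past this, a cluster wins

enum storage { SINGLE_MBUF, MBUF_CHAIN, CLUSTER };

static enum storage pick_storage( unsigned long len )
{
    if ( len <= MHLEN )
        return SINGLE_MBUF;   // payload fits in one header mbuf
    if ( len < MINCLSIZE )
        return MBUF_CHAIN;    // a short chain beats a 2K cluster
    return CLUSTER;           // 440 bytes or more: grab a cluster
}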
Regards,
Justin
--
Justin C. Walker, Curmudgeon-At-Large  *
Institute for General Semantics        | Some people have a mental
                                       | horizon of radius zero, and
                                       | call it their point of view.
                                       |       -- David Hilbert
*--------------------------------------*-------------------------------*
10:09:21 PM
|
|
From: Andrew Gallatin
Subject: Re: ethernet driver: kIOReturnOutputStall responsibility
Date: February 26, 2004 2:11:12 PM CST
To: Chuck Remes
Cc: Andrew Gallatin, darwin-drivers@lists.apple.com
chuck remes writes:
The 'netstat -m' output looks like this *after* the program finishes
executing:
286 mbufs in use:
250 mbufs allocated to data
35 mbufs allocated to socket names and addresses
1 mbufs allocated to Appletalk data blocks
15361/16384 mbuf clusters in use
That seems a wee bit high. Are you sure you don't
have a leak? Does that increase each time you test?
32839 Kbytes allocated to network (-33% in use)
1822557 requests for memory denied
0 requests for memory delayed
0 calls to protocol drain routines
During execution, the "286 mbufs in use" value is a bit higher.
After unloading my driver after similar tests, I see:
77 mbufs in use:
66 mbufs allocated to data
5 mbufs allocated to packet headers
4 mbufs allocated to socket names and addresses
2 mbufs allocated to Appletalk data blocks
635/7606 mbuf clusters in use
15231 Kbytes allocated to network (8% in use)
If you want to get a lot of mbufs ahead of time to see if this
improves things, you can allocate them in a context where you can
sleep (ie when loading driver), and then free them. My driver uses
1280 mbuf clusters per nic (256 receives * 5 clusters/recv). When I
unload the driver, the amount of network memory in use does not
decrease (but the number of mbufs does .. I'm not leaking).
Explain how I can do this. Do you mean I should call allocatePacket()
in a tight loop during driver load to force the system to create lots
of mbufs and then call freePacket() on them before any transmissions
start?
#define LOTS_OF_MBUFS 512
struct mbuf *m;
m = m_getpackets(LOTS_OF_MBUFS, M_WAIT);
m_freem_list(m);
You should never, ever see this fail on mbufs you allocate via
something like MCLGET() which allocates system mbuf clusters. The
system mbuf clusters are always pre-loaded into the iommu and have the
DMA addresses pre-calculated. Getting their DMA address should be
almost free.
The only time I've ever seen it fail was on my hand-crafted
9000 byte virtually contigous jumbo frames. (I consider this a bug,
btw, read the archives..).
Bear in mind that there IS some overhead, and it may be cheaper
to copy really small (<MHLEN) packets. I think IONetwork* has
a copy_or_replace thing for this.
I already responded to this, but I made a driver change which improved
the receive process (as documented above in this message). I commented
out the copyPacket() stuff and now use replaceOrCopyPacket(). It
performs MUCH MUCH better when the UDP_STREAM is using large packets
(thanks for the idea!). As a matter of fact, I don't get a single
That's probably because you're no longer copying large packets.
allocation failure when message size is greater than 184 bytes. Any
idea why this might be such a magic value? Is this approximately the
size of MHLEN?
I'd expect the cutoff between copying and replacing to happen sooner.
The headers are 14 (ether) + 20 (IP) + 8 (UDP) = 42 bytes.
If I'm reading mbuf.h right, MHLEN is 256 - 20 - 32 == 204.
So it seems like the cutoff would be at 204 - 42 = 162 bytes.
That's where I see my cutoff (based on MHLEN) kicking in.
Looks like I'm getting some resolution on the receive end of things,
but transmission still has issues. I'll tackle that AFTER I tune this
receive process as much as possible.
Question: Is this one of those sick tulips which requires
that all DMA addresses be 32-bit aligned?
Drew
10:08:08 PM
|
|
From: Chuck Remes
Subject: Re: ethernet driver: kIOReturnOutputStall responsibility
Date: February 26, 2004 1:34:24 PM CST
To: Andrew Gallatin
Cc: darwin-drivers@lists.apple.com
We have a couple of message threads going simultaneously here, so I'm
going to try and pull little bits from each one back into one response.
On Feb 26, 2004, at 11:47 AM, Andrew Gallatin wrote:
chuck remes writes:
On the receiver end it shows:
cremes% netstat -m
258 mbufs in use:
226 mbufs allocated to data
31 mbufs allocated to socket names and addresses
1 mbufs allocated to Appletalk data blocks
202/608 mbuf clusters in use
1280 Kbytes allocated to network (36% in use)
0 requests for memory denied
0 requests for memory delayed
0 calls to protocol drain routines
This doesn't look so bad. That's probably because the machine
transmitting (a darwin-x86 box running the same version of the driver)
Are you sure it's actually failing an allocation and not failing
in some other way?
I'm pretty sure it's failing on allocation. Here are my observations:
Sender: athlon something, single CPU, freebsd 4.9 with the command "netperf -H 192.168.2.1 -t UDP_STREAM -- -m 185"
Receiver: G5 2.0, dual CPU, OSX 10.3.2 (darwin 7.2.0) is running the "netserver" program
During this 10 second test, on the G5 running "top -s3 -u" I see the
kernel_task ratchet up towards 100%. It usually tops out around 99.8%
On this test, there are no allocation failures.
Same setup, but now the Sender is executing "netperf -H 192.168.2.1 -t
UDP_STREAM -- -m 184" and the world comes crashing down. Notice that
the message size is only different by a single byte. This is
reproducible. I get allocation failures and the kernel_task goes to
102%. This tells me the kernel_task doesn't have any subthreads to take
advantage of other CPUs to do cleanup tasks and this may result in the
allocation failures.
The 'netstat -m' output looks like this *after* the program finishes executing:
286 mbufs in use:
250 mbufs allocated to data
35 mbufs allocated to socket names and addresses
1 mbufs allocated to Appletalk data blocks
15361/16384 mbuf clusters in use
32839 Kbytes allocated to network (-33% in use)
1822557 requests for memory denied
0 requests for memory delayed
0 calls to protocol drain routines
During execution, the "286 mbufs in use" value is a bit higher.
If you want to get a lot of mbufs ahead of time to see if this
improves things, you can allocate them in a context where you can
sleep (ie when loading driver), and then free them. My driver uses
1280 mbuf clusters per nic (256 receives * 5 clusters/recv). When I
unload the driver, the amount of network memory in use does not
decrease (but the number of mbufs does .. I'm not leaking).
Explain how I can do this. Do you mean I should call allocatePacket()
in a tight loop during driver load to force the system to create lots
of mbufs and then call freePacket() on them before any transmissions
start?
You should never, ever see this fail on mbufs you allocate via
something like MCLGET() which allocates system mbuf clusters. The
system mbuf clusters are always pre-loaded into the iommu and have the
DMA addresses pre-calculated. Getting their DMA address should be
almost free.
The only time I've ever seen it fail was on my hand-crafted
9000 byte virtually contiguous jumbo frames. (I consider this a bug,
btw, read the archives..).
Bear in mind that there IS some overhead, and it may be cheaper
to copy really small (<MHLEN) packets. I think IONetwork* has
a copy_or_replace thing for this.
I already responded to this, but I made a driver change which improved
the receive process (as documented above in this message). I commented
out the copyPacket() stuff and now use replaceOrCopyPacket(). It
performs MUCH MUCH better when the UDP_STREAM is using large packets
(thanks for the idea!). As a matter of fact, I don't get a single
allocation failure when message size is greater than 184 bytes. Any
idea why this might be such a magic value? Is this approximately the
size of MHLEN?
Looks like I'm getting some resolution on the receive end of things,
but transmission still has issues. I'll tackle that AFTER I tune this
receive process as much as possible.
cr
10:05:24 PM
|
|
From: Andrew Gallatin
Subject: Re: ethernet driver: kIOReturnOutputStall responsibility
Date: February 26, 2004 11:47:38 AM CST
To: Chuck Remes
Cc: darwin-drivers@lists.apple.com
chuck remes writes:
On the receiver end it shows:
cremes% netstat -m
258 mbufs in use:
226 mbufs allocated to data
31 mbufs allocated to socket names and addresses
1 mbufs allocated to Appletalk data blocks
202/608 mbuf clusters in use
1280 Kbytes allocated to network (36% in use)
0 requests for memory denied
0 requests for memory delayed
0 calls to protocol drain routines
This doesn't look so bad. That's probably because the machine
transmitting (a darwin-x86 box running the same version of the driver)
Are you sure it's actually failing an allocation and not failing
in some other way?
If you want to get a lot of mbufs ahead of time to see if this
improves things, you can allocate them in a context where you can
sleep (ie when loading driver), and then free them. My driver uses
1280 mbuf clusters per nic (256 receives * 5 clusters/recv). When I
unload the driver, the amount of network memory in use does not
decrease (but the number of mbufs does .. I'm not leaking).
Drew
10:03:18 PM
|
|
From: Chuck Remes
Subject: Re: ethernet driver: kIOReturnOutputStall responsibility
Date: February 26, 2004 11:32:15 AM CST
To: Andrew Gallatin
Cc: darwin-drivers@lists.apple.com
On Feb 26, 2004, at 11:00 AM, Andrew Gallatin wrote:
chuck remes writes:
On transmit I don't do this. The call listed above to
getPhysicalSegmentsWithCoalesce() returns the physical address of each
buffer segment in the vector variable (of type IOPhysicalSegment). This
call is necessary even if I don't need to do any coalescing because I
always need to get the physical addresses.
How do you call it?
When pmap_extract() was exiled to siberia, I started using an
iombufcursor thing. I create mine when loading the driver like this
(#defines expanded)
my_mbufCursor = IOMbufNaturalMemoryCursor::withSpecification(9014, 5);
Yep, that maps closely to my initialization:
rxMbufCursor = IOMbufNaturalMemoryCursor::withSpecification( 1518, 1 );
txMbufCursor = IOMbufNaturalMemoryCursor::withSpecification( 1518, 1 );
Then I get my DMA addresses like this (I hate it when people call them
physical addresses, they are not...):
Since you're helping me out, I will endeavor to call them DMA addresses. No sense in jumping on one of your pet peeves... ;)
int
my_encap (struct my_sc *sc,
struct mbuf *m,
struct my_dma_info *dma)
{
struct IOPhysicalSegment segments[5];
int count, i;
count = my_macos_mbufCursor->getPhysicalSegments(m, segments, 5);
for (i = 0; i < count; i++)
{
dma->desc[i].ptr = segments[i].location;
dma->desc[i].len = segments[i].length;
}
dma->nsegs = count;
return count;
}
That's very similar to what I've got, but I use many more methods since
I did it in C++ and I wanted to be able to override this stuff. This is
my entire transmit routine. The entry point is outputPacket(). Go to
the bottom of the code listing and read upwards from the last method to
the top one here:
void darwin_tulip_admtek985::_assignDescriptors( struct mbuf *m,
struct IOPhysicalSegment *vector, SInt32 segments )
{
UInt32 curIndex, prevIndex;
tulip_descriptor_t *d;
curIndex = txHead;
d = &tx_desc_ring[ curIndex ];
d->control = _bitOrder( vector[ 0 ].length ) | tx_CTL_TLINK | tx_CTL_FIRSTFRAG;
d->status = 0;
d->buffer1 = _bitOrder( vector[ 0 ].location );
prevIndex = txHead;
TXINC( curIndex );
for ( int i = 1; i < segments; i++, TXINC( curIndex ) )
{
d = &tx_desc_ring[ curIndex ];
d->control = _bitOrder( vector[ i ].length ) | tx_CTL_TLINK;
// set subsequent descriptors in the chain to be owned, set first descriptor last
d->status = tx_STAT_OWN;
d->buffer1 = _bitOrder( vector[ i ].location );
prevIndex = curIndex;
}
// save the virtual address so we can safely call freePacket()
tx_mbuf_ring[ prevIndex ] = m;
d->control |= tx_CTL_LASTFRAG;
// request interrupt periodically to allow for housekeeping tasks
if ( 0 == ( curIndex & ( TULIP_DESC_INT >> TULIP_MAX_TX_SEGMENTS ) ) )
d->control |= tx_CTL_FINT;
tx_desc_ring[ txHead ].status = tx_STAT_OWN; // set OWN on FIRSTFRAG so transmitter will detect it and send
txHead = curIndex;
_writeRegister( TULIP_TXSTART, 0xbaadf00d ); // kick the transmitter
netStats->outputPackets += segments;
}
UInt32 darwin_tulip_admtek985::_computeDescriptors( struct mbuf *m,
struct IOPhysicalSegment *vector, SInt32 *segments )
{
SInt32 available = TULIP_MAX_TX_SEGMENTS, a, b;
UInt32 ret = kIOReturnOutputSuccess;
a = TULIP_TX_RING_LENGTH - txHead;
b = TULIP_MAX_TX_SEGMENTS;
if ( a > b ) available = b;
if ( b >= a ) available = a;
*segments = txMbufCursor->getPhysicalSegmentsWithCoalesce( m, vector, available );
if ( ! *segments )
{
// record error stats
IOLog( "%s, getPhysicalSegmentswithCoalesce returned zero\n", __FUNCTION__ );
freePacket( m );
netStats->outputErrors++;
ret = kIOReturnOutputDropped;
}
return ret;
}
UInt32 darwin_tulip::outputPacket( struct mbuf *m, void *param )
{
UInt32 ret = kIOReturnOutputSuccess;
do
{
if ( !enabledNetif )
{
IOLog( "%s, interface not enabled\n", __FUNCTION__ );
freePacket( m ); // drop the packet
ret = kIOReturnOutputDropped;
break;
}
if ( txActiveCount > TRANSMIT_QUEUE_LENGTH )
{
// first thing to try is free up some resources by cleaning up the TX descriptor list
// NOTE: possible race condition since outputPacket is called from the client's thread
// instead of the driver's workloop. Look at using an atomic TEST&SET to guard it.
_handleTxCleanup();
// if after the cleanup efforts we still don't have any resources, stall the transmission
if ( txActiveCount > TRANSMIT_QUEUE_LENGTH )
{
IOLog( "%s, kIOReturnOutputStall, txActiveCount = %d, txHead = %d,
txTail = %dn", __FUNCTION__, txActiveCount, txHead, txTail );
netStats->outputErrors++;
ret = kIOReturnOutputStall;
transmitterStalled = true;
break;
}
}
struct IOPhysicalSegment vector[ TULIP_MAX_TX_SEGMENTS ];
SInt32 segments;
ret = _computeDescriptors( m, vector, &segments );
if ( kIOReturnOutputSuccess != ret )
break;
OSAddAtomic( segments, (SInt32*) &txActiveCount ); // update counter tracking active tx descriptors
_assignDescriptors( m, vector, segments );
packetsTransmitted = true;
}
while ( false );
return ret;
}
You should never, ever see this fail on mbufs you allocate via
something like MCLGET() which allocates system mbuf clusters. The
system mbuf clusters are always pre-loaded into the iommu and have the
DMA addresses pre-calculated. Getting their DMA address should be
almost free.
The only time I've ever seen it fail was on my hand-crafted
9000 byte virtually contiguous jumbo frames. (I consider this a bug,
btw, read the archives..).
Bear in mind that there IS some overhead, and it may be cheaper
to copy really small (<MHLEN) packets. I think IONetwork* has
a copy_or_replace thing for this.
It does. It's called replaceOrCopyPacket(). I haven't been using it
because I didn't want to get the DMA address from a replaced mbuf and
update my RX descriptor (as I mentioned in a prior post).
If it's really as cheap as you say, I'll give that a try and see how it goes.
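(Editor's sketch of that replace path, including the descriptor refresh
Chuck was avoiding. rx_mbuf_ring and netif are hypothetical names; the
rest appear in his posted code.)

bool replaced;
struct mbuf *pkt = replaceOrCopyPacket( &rx_mbuf_ring[ i ], length, &replaced );
if ( pkt )
{
    if ( replaced )
    {
        // the ring slot now holds a fresh mbuf: refresh its DMA address
        struct IOPhysicalSegment seg;
        rxMbufCursor->getPhysicalSegments( rx_mbuf_ring[ i ], &seg, 1 );
        rx_desc_ring[ i ].buffer1 = _bitOrder( seg.location );
    }
    netif->inputPacket( pkt, length );  // hand the full buffer up the stack
}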
cr
10:02:06 PM
|
|
From: Chuck Remes
Subject: Re: ethernet driver: kIOReturnOutputStall responsibility
Date: February 26, 2004 11:10:52 AM CST
To: Andrew Gallatin
Cc: darwin-drivers@lists.apple.com
On Feb 26, 2004, at 8:24 AM, Andrew Gallatin wrote:
chuck remes writes:
Oh, and one more thing. I only get the copyPacket error when netperf is
doing its small, 64-byte UDP test. When the packets get bigger (1024+),
then I never run out of this resource. It looks like a timing thing. I
process every receive interrupt I get, so it's not like I'm sitting on
my hands waiting to clean up received packets.
I assume you are calling copyPacket() with size == the amount of
actual received data, not the size of the entire buffer?
I've tried it both ways.
newM = copyPacket( m, length ); // do a copy to avoid having to get new Physical Address (lazy programmer!)
AND
newM = copyPacket( m, 0 ); // do a copy to avoid having to get new Physical Address (lazy programmer!)
When 0 gets passed in, copyPacket() checks for it and uses m->m_pkthdr.len as the copy length.
One thing to consider is that there are a limited number of mbufs and
mbuf clusters in the system. If you are blasting small packets at a
receiver with the default 41600 byte UDP receive socket buffer size,
then you can have 41600/SIZE (==650 for 64 byte messages) mbufs
piling up on the receive socket buffer queue, plus another 50 or so
waiting in the ip intr_queue. So that's 700 mbufs gone.
When you increase the size, you decrease the number of mbufs (and mbuf
clusters) you are using. 41600/1024 == 40. Plus 50 waiting
is ~90 mbufs + clusters.
I didn't know any of this, but it all makes sense. I know this would
just "hide" the problem, but is there a way to use "sysctl" to increase
the number of mbufs and mbuf clusters system-wide?
Does shrinking the receiver socket buffer queue for the small packet
case improve things (netperf -tUDP_STREAM .... -- -m 64 -S2560)
Nope, see below.
What does netstat -m show during your test?
On the receiver end it shows:
cremes% netstat -m
258 mbufs in use:
226 mbufs allocated to data
31 mbufs allocated to socket names and addresses
1 mbufs allocated to Appletalk data blocks
202/608 mbuf clusters in use
1280 Kbytes allocated to network (36% in use)
0 requests for memory denied
0 requests for memory delayed
0 calls to protocol drain routines
This doesn't look so bad. That's probably because the machine
transmitting (a darwin-x86 box running the same version of the driver)
keeps stalling its output queue. When I boot under freebsd 4.9 and do
the same test, it gets a fatal kernel error whenever I use the "-S2560"
option. It works fine with the larger values like 32768, but it doesn't
like that small value.
BTW, what sort of hardware are you dealing with? Is this a Gig or
10Gig nic? I don't think you should be able to livelock a modern
system like this with a 10 or 10/100 nic, unless there's something
really expensive going on in your driver.
It's a 10/100 NIC (DEC tulip and clones as I responded earlier). I'm
surprised too, which is why I'm convinced I'm doing something wrong.
If nothing else, I am learning a LOT about the internals of the system. Thanks for sharing your expertise.
cr
9:58:22 PM
|
|
From: Andrew Gallatin
Subject: Re: ethernet driver: kIOReturnOutputStall responsibility
Date: February 26, 2004 11:00:18 AM CST
To: Chuck Remes
Cc: darwin-drivers@lists.apple.com
chuck remes writes:
This way you at least avoid an allocation/free/mcl_to_paddr() for each
packet, and maybe you avoid updating some descriptor state in your
hardware.
I do this on the receive descriptor list. Getting the physical address
of an mbuf segment appears to be somewhat expensive, so I always call
copyPacket to get a new mbuf to pass up to the stack. So yes, I do
avoid updating the descriptor state for the hardware.
Ugh. mcl_to_paddr() should be very cheap. You should not need to copy
in this case.
On transmit I don't do this. The call listed above to
getPhysicalSegmentsWithCoalesce() returns the physical address of each
buffer segment in the vector variable (of type IOPhysicalSegment). This
call is necessary even if I don't need to do any coalescing because I
always need to get the physical addresses.
How do you call it?
When pmap_extract() was exiled to siberia, I started using an
iombufcursor thing. I create mine when loading the driver like this
(#defines expanded)
my_mbufCursor = IOMbufNaturalMemoryCursor::withSpecification(9014, 5);
Then I get my DMA addresses like this (I hate it when people call them
physical addresses, they are not...):
int
my_encap (struct my_sc *sc,
struct mbuf *m,
struct my_dma_info *dma)
{
struct IOPhysicalSegment segments[5];
int count, i;
count = my_macos_mbufCursor->getPhysicalSegments(m, segments, 5);
for (i = 0; i < count; i++)
{
dma->desc[i].ptr = segments[i].location;
dma->desc[i].len = segments[i].length;
}
dma->nsegs = count;
return count;
}
You should never, ever see this fail on mbufs you allocate via
something like MCLGET() which allocates system mbuf clusters. The
system mbuf clusters are always pre-loaded into the iommu and have the
DMA addresses pre-calculated. Getting their DMA address should be
almost free.
The only time I've ever seen it fail was on my hand-crafted
9000 byte virtually contiguous jumbo frames. (I consider this a bug,
btw, read the archives..).
Bear in mind that there IS some overhead, and it may be cheaper
to copy really small (<MHLEN) packets. I think IONetwork* has
a copy_or_replace thing for this.
Drew
9:56:29 PM
|
|
From: Chuck Remes
Subject: Re: ethernet driver: kIOReturnOutputStall responsibility
Date: February 26, 2004 10:39:39 AM CST
To: Andrew Gallatin
Cc: Chuck Remes, darwin-drivers@lists.apple.com
On Feb 26, 2004, at 7:52 AM, Andrew Gallatin wrote:
chuck remes writes:
The UDP_STREAM test was more interesting. Immediately after firing up
the test, the driver started kicking out errors about stalling the
output queue (kIOReturnOutputStall). The netperf application was firing
off packets as fast as it could, and this overran the transmit
descriptor rings in the driver. The error handling detected this
exhaustion of resources, returned kIOReturnOutputStall to the "upper"
layers of the stack, and waited for the watchdog timer to go off to
call transmitQueue->service( IOBasicOutputQueue::kServiceAsync ). The
service call notifies the stack that everything is peachy again and
transmission can restart.
Is there any way to clear the "stall" condition from your
transmit complete interrupt handler without having to wait for
the watchdog to do it? If so, then I think you probably
want to "stall" the queue, and clear the stall when more
space becomes available.
I did make a minor change which appeared to cure stall conditions I saw when running the TCP_STREAM test. Here's the code:
if ( txActiveCount > TRANSMIT_QUEUE_LENGTH )
{
// first thing to try is free up some resources by cleaning up the TX descriptor list
// NOTE: possible race condition since outputPacket is called from the client's thread
// instead of the driver's workloop. Look at using an atomic TEST&SET to guard it.
_handleTxCleanup();
// if after the cleanup efforts we still don't have any resources, stall the transmission
if ( txActiveCount > TRANSMIT_QUEUE_LENGTH )
{
IOLog( "%s, kIOReturnOutputStall, txActiveCount = %d, txHead = %d, txTail = %d\n", __FUNCTION__, txActiveCount, txHead, txTail );
netStats->outputErrors++;
ret = kIOReturnOutputStall;
transmitterStalled = true;
break;
}
}
The code is pretty self-explanatory. This does NOT help when running the UDP_STREAM test.
I don't have a lot of experience with IOKit ethernet drivers, since I
just use the raw BSD interface. Here's some background on how a
traditional BSD driver deals with this situation:
My transmit routine appends an mbuf chain to my ifp->if_snd
queue (or drops it if the queue is full). If IFF_OACTIVE is
not set, then it transmits the first chain in the queue.
If IFF_OACTIVE is set, then it returns and the chain waits
in the queue.
When the ifp->if_snd queue fills up, logic in ip_output()
short circuits the call down into the network stack and returns
ENOBUFS until the driver manages to clear the backlog.
I assume IOKit internally issues ENOBUFS when it receives
kIOReturnOutputStall or kIOReturnOutputDropped. That's some of the code
I haven't examined yet...
When the hardware runs out of resources, my driver sets the
IFF_OACTIVE bit in its ifp->if_flags. Then when a transmit complete
interrupt arrives, the interrupt handler clears the IFF_OACTIVE bit
and calls the transmit routine immediately.
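(Editor's sketch of the classic 4.x BSD pattern Drew describes;
my_start() and my_txeof() are hypothetical, and my_encap() echoes his
earlier snippet, returning 0 when no descriptors are left.)

static void my_start( struct ifnet *ifp )
{
    struct my_sc *sc = (struct my_sc *) ifp->if_softc;
    struct mbuf *m;

    if ( ifp->if_flags & IFF_OACTIVE )    // hardware ring already full
        return;
    for ( ;; )
    {
        IF_DEQUEUE( &ifp->if_snd, m );
        if ( m == NULL )
            break;
        if ( my_encap( sc, m, &sc->dma ) == 0 )
        {
            // out of descriptors: requeue and stall until tx-complete
            IF_PREPEND( &ifp->if_snd, m );
            ifp->if_flags |= IFF_OACTIVE;
            break;
        }
    }
}

// transmit-complete interrupt: reclaim descriptors, clear the stall,
// and restart the queue immediately
static void my_txeof( struct my_sc *sc )
{
    /* ... free completed mbuf chains, advance the ring tail ... */
    sc->ifp->if_flags &= ~IFF_OACTIVE;
    my_start( sc->ifp );
}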
The HW resources are easy to keep "plentiful." It's the mbufs where I'm having trouble...
Thanks a lot for your answers. They have been very helpful.
cr
9:54:01 PM
|
|
From: Chuck Remes
Subject: Re: ethernet driver: kIOReturnOutputStall responsibility
Date: February 26, 2004 10:32:38 AM CST
To: Andrew Gallatin
Cc: darwin-drivers@lists.apple.com, hackers@opendarwin.org
On Feb 26, 2004, at 7:10 AM, Andrew Gallatin wrote:
chuck remes writes:
Mmm, no. Let me explain again. When I am preparing a packet for
transmission, I coalesce whatever the stack handed to me into a single
mbuf. I call getPhysicalblahblah to do this work AND to retrieve the
I don't know what your hardware is like, but if you can use multiple
DMA descriptors per transmit, you should. Coalescing IOKit's way
seems to always involve the allocation of a cluster mbuf, and the copy
of the full packet, even when it may not be required. That's
expensive. And it will happen all the time, because most packets
from a higher protocol will be at least 2 mbufs.
This is for the DEC tulip and its clones. They all do allow multiple
descriptors per transmit. After I sent my note last night, I went back
and changed my descriptor allocation routine to use up to 3 mbufs on
transmit.
In various places, the code looks like:
// allocate the Mbuf cursors, set max packet size, and max number of allowable segments
rxMbufCursor = IOMbufNaturalMemoryCursor::withSpecification( kIOEthernetMaxPacketSize, 1 );
txMbufCursor = IOMbufNaturalMemoryCursor::withSpecification( kIOEthernetMaxPacketSize, 3 );
and
/* first, check to see if enough buffers are left to finish request.
* Coalesce the mbuf segs to make the packet fit into the available resources for
* transmission.
*/
a = ( TULIP_TX_RING_LENGTH - txHead ) + txTail;
b = TULIP_MAX_TX_SEGMENTS;
available = ( a > b ) ? b : a;
*segments = txMbufCursor->getPhysicalSegmentsWithCoalesce( m, vector, available );
This resulted in better throughput, but more stalls. I cured the stall
condition by making the transmit completion interrupt occur more often
so it could do its internal housekeeping and cleanup tx/rx descriptors.
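[To connect the dots: the vector that getPhysicalSegmentsWithCoalesce() fills in is just an array of IOPhysicalSegment structures, each holding a physical location and a length, so loading the hardware ring is a loop over it. A sketch follows; the tx_desc_ring field names and TX_STAT_* bits are invented, since the real tulip descriptor layout differs.]
UInt32 numSegs = *segments;
for ( UInt32 seg = 0; seg < numSegs; seg++ )
{
    tx_desc_ring[ txTail ].bufferAddress = vector[ seg ].location;  // physical address
    tx_desc_ring[ txTail ].bufferLength  = vector[ seg ].length;
    // mark frame boundaries, and ask for an interrupt on the last segment
    UInt32 flags = ( seg == 0 ) ? TX_STAT_FIRST : 0;
    if ( seg == numSegs - 1 )
        flags |= TX_STAT_LAST | TX_STAT_INTERRUPT;
    tx_desc_ring[ txTail ].status = flags | TX_STAT_OWN;            // hand it to the hardware
    txTail = ( txTail + 1 ) % TULIP_TX_RING_LENGTH;
}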
If you really have to coalesce and you have only a "few" transmit
descriptors, it might actually be cheaper to keep a pre-allocated,
pre-pinned buffer for each descriptor that you copy each mbuf chain into.
This way you at least avoid an allocation/free/mcl_to_paddr() for each
packet, and maybe you avoid updating some descriptor state in your
hardware.
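[A sketch of that idea, with every name invented: one wired buffer per TX descriptor slot, allocated via IOBufferMemoryDescriptor once at start time, with the mbuf chain copied in the old-school BSD way (mtod/m_next/m_len). "slot" points at the TxBounce entry for the descriptor being filled.]
struct TxBounce
{
    IOBufferMemoryDescriptor *md;    // wired, one max-size frame
    void                     *virt;  // md->getBytesNoCopy()
    IOPhysicalAddress         phys;  // md->getPhysicalAddress(), cached once
};

// at transmit time, copy the whole chain into the slot's bounce buffer
UInt32 len = 0;
for ( struct mbuf *n = m; n != NULL; n = n->m_next )
{
    bcopy( mtod( n, void * ), (UInt8 *) slot->virt + len, n->m_len );
    len += n->m_len;
}
// program one descriptor with slot->phys and len, then free the chain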
I do this on the receive descriptor list. Getting the physical address
of an mbuf segment appears to be somewhat expensive, so I always call
copyPacket to get a new mbuf to pass up to the stack. So yes, I do
avoid updating the descriptor state for the hardware.
On transmit I don't do this. The call listed above to
getPhysicalSegmentsWithCoalesce() returns the physical address of each
buffer segment in the vector variable (of type IOPhysicalSegment). This
call is necessary even if I don't need to do any coalescing because I
always need to get the physical addresses.
I looked at calling mcl_to_paddr() directly, but looks like it might be
unsafe. It sometimes returns 0 in which case a further call to
pmap_extract() is necessary. There was a thread here over a year ago
that said calling pmap_extract directly was a no-no since the interface
may change. I think it did when the G5s shipped and all this 64-bit
stuff got in the way.
If you know a computationally cheap and safe way to get the physical
addresses that you know won't break, let me know! I'll avoid the IOKit
API for that operation.
BTW, I'm glad you like netperf ;)
Are you kidding me? I *hate* it! It dashed my belief that the driver was complete and perfect. :-)
cr
_______________________________________________
darwin-drivers mailing list | darwin-drivers@lists.apple.com
Help/Unsubscribe/Archives: http://www.lists.apple.com/mailman/listinfo/darwin-drivers
Do not post admin requests to the list. They will be ignored.
9:52:20 PM
|
|
From: Andrew Gallatin
Subject: Re: ethernet driver: kIOReturnOutputStall responsibility
Date: February 26, 2004 8:24:22 AM CST
To: Chuck Remes
Cc: darwin-drivers@lists.apple.com
chuck remes writes:
Oh, and one more thing. I only get the copyPacket error when netperf is
doing its small, 64-byte UDP test. When the packets get bigger (1024+),
then I never run out of this resource. It looks like a timing thing. I
process every receive interrupt I get, so it's not like I'm sitting on
my hands waiting to clean up received packets.
I assume you are calling copyPacket() with size == the amount of
actual received data, not the size of the entire buffer?
One thing to consider is that there are a limited number of mbufs and
mbuf clusters in the system. If you are blasting small packets at a
receiver with the default 41600 byte UDP receive socket buffer size,
then you can have 41600/SIZE (==650 for 64 byte messages) mbufs
piling up on the receive socket buffer queue, plus another 50 or so
waiting in the ip intr_queue. So that's 700 mbufs gone.
When you increase the size, you decrease the number of mbufs (and mbuf
clusters) you are using. 41600/1024 == 40. Plus 50 waiting
is ~90 mbufs + clusters.
Does shrinking the receiver socket buffer queue for the small packet
case improve things (netperf -tUDP_STREAM .... -- -m 64 -S2560)?
What does netstat -m show during your test?
BTW, what sort of hardware are you dealing with? Is this a Gig or
10Gig nic? I don't think you should be able to livelock a modern
system like this with a 10 or 10/100 nic, unless there's something
really expensive going on in your driver.
Drew
_______________________________________________
darwin-drivers mailing list | darwin-drivers@lists.apple.com
Help/Unsubscribe/Archives: http://www.lists.apple.com/mailman/listinfo/darwin-drivers
Do not post admin requests to the list. They will be ignored.
9:50:49 PM
|
|
From: Andrew Gallatin
Subject: Re: ethernet driver: kIOReturnOutputStall responsibility
Date: February 26, 2004 7:52:29 AM CST
To: Chuck Remes
Cc: darwin-drivers@lists.apple.com
chuck remes writes:
The UDP_STREAM test was more interesting. Immediately after firing up
the test, the driver started kicking out errors about stalling the
output queue (kIOReturnOutputStall). The netperf application was firing
off packets as fast as it could, and this overran the transmit
descriptor rings in the driver. The error handling detected this
exhaustion of resources, returned kIOReturnOutputStall to the "upper"
layers of the stack, and waited for the watchdog timer to go off to
call transmitQueue->service( IOBasicOutputQueue::kServiceAsync ). The
service call notifies the stack that everything is peachy again and
transmission can restart.
Is there any way to clear the "stall" condition from your
transmit complete interrupt handler without having to wait for
the watchdog to do it? If so, then I think you probably
want to "stall" the queue, and clear the stall when more
space becomes available.
I don't have a lot of experience with IOKit ethernet drivers, since I
just use the raw BSD interface. Here's some background on how a
traditional BSD driver deals with this situation:
My transmit routine appends an mbuf chain to my ifp->if_snd
queue (or drops it if the queue is full). If IFF_OACTIVE is
not set, then it transmits the first chain in the queue.
If IFF_OACTIVE is set, then it returns and the chain waits
in the queue.
When the ifp->if_snd queue fills up, logic in ip_output()
short circuits the call down into the network stack and returns
ENOBUFS until the driver manages to clear the backlog.
When the hardware runs out of resources, my driver sets the
IFF_OACTIVE bit in its ifp->if_flags. Then when a transmit complete
interrupt arrives, the interrupt handler clears the IFF_OACTIVE bit
and calls the transmit routine immediately.
Drew
_______________________________________________
darwin-drivers mailing list | darwin-drivers@lists.apple.com
Help/Unsubscribe/Archives: http://www.lists.apple.com/mailman/listinfo/darwin-drivers
Do not post admin requests to the list. They will be ignored.
9:49:03 PM
|
|
From: Andrew Gallatin
Subject: Re: ethernet driver: kIOReturnOutputStall responsibility
Date: February 26, 2004 7:10:03 AM CST
To: Chuck Remes
Cc: darwin-drivers@lists.apple.com, hackers@opendarwin.org
chuck remes writes:
Mmm, no. Let me explain again. When I am preparing a packet for
transmission, I coalesce whatever the stack handed to me into a single
mbuf. I call getPhysicalblahblah to do this work AND to retrieve the
I don't know what your hardware is like, but if you can use multiple
DMA descriptors per transmit, you should. Coalescing IOKit's way
seems to always involve the allocation of a cluster mbuf, and the copy
of the full packet, even when it may not be required. That's
expensive. And it will happen all the time, because most packets
from a higher protocol will be at least 2 mbufs.
If you really have to coalesce and you have only a "few" transmit
descriptors, it might actually be cheaper to keep a pre-allocated,
pre-pinned buffer for each descriptor that you copy each mbuf chain into.
This way you at least avoid an allocation/free/mcl_to_paddr() for each
packet, and maybe you avoid updating some descriptor state in your
hardware.
BTW, I'm glad you like netperf ;)
Drew
_______________________________________________
darwin-drivers mailing list | darwin-drivers@lists.apple.com
Help/Unsubscribe/Archives: http://www.lists.apple.com/mailman/listinfo/darwin-drivers
Do not post admin requests to the list. They will be ignored.
9:48:14 PM
|
|
From: Steve Modica
Subject: Re: ethernet driver: kIOReturnOutputStall responsibility
Date: February 26, 2004 6:32:53 AM CST
To: darwin-drivers@lists.apple.com
darwin-drivers-request@lists.apple.com wrote:
Oh,
and one more thing. I only get the copyPacket error when netperf is
doing its small, 64-byte UDP test. When the packets get bigger (1024+),
then I never run out of this resource. It looks like a timing thing. I
process every receive interrupt I get, so it's not like I'm sitting on
my hands waiting to clean up received packets.
I know there aren't too many clues here, so if you want to see some code let me know.
cr
I'm not sure if it's relevant but the IP input queue in os x is only 50
packets by default. When receiving so many tiny packets (and assuming
there's some kind of coalescing going on) are you simply overrunning
it? If you look in sysctl, do you see this value incrementing:
net.inet.ip.intr_queue_drops: 0
--
Steve Modica
_______________________________________________
darwin-drivers mailing list | darwin-drivers@lists.apple.com
Help/Unsubscribe/Archives: http://www.lists.apple.com/mailman/listinfo/darwin-drivers
Do not post admin requests to the list. They will be ignored.
9:47:19 PM
|
|
From: Chuck Remes
Subject: Re: ethernet driver: kIOReturnOutputStall responsibility
Date: February 25, 2004 10:24:13 PM CST
To: Justin Walker
Cc: darwin-drivers@lists.apple.com, hackers@opendarwin.org
On Feb 25, 2004, at 10:11 PM, chuck remes wrote:
[SNIP]
I now have to investigate why sending
and receiving are failing under extreme stress, so if anyone knows why
getPhysicalSegmentsWithCoalesce or copyPacket would fail, I'd love to
hear it.
Both of these are in the IONetworkingFamily code (which you say you've looked at):
copyPacket fails when there are no mbufs available
or when the source mbuf is malformed;
the Newton-John call (getPhysical :-}) fails for
a variety of strange and wondrous reasons, mostly
having to do with malformed mbufs and resource runout.
Oh yes, I've looked at that code. I sometimes dream about it. :-)
I guess the system is running out of
mbufs. When I switch to using the built-in ethernet (GMAC driver), it
doesn't print any of the errors I see listed in the source. I'll have
to double-check, but I don't recall the error counters incrementing
from dropped packets either (though I may be wrong on that score).
However, I'm left with the impression that it is handling the stress
better than my driver and that I cannot abide. I'll go nuts tuning this
thing and it's probably due to lame-o hardware. :-)
Oh, and one more thing. I only get the copyPacket error when netperf is
doing its small, 64-byte UDP test. When the packets get bigger (1024+),
then I never run out of this resource. It looks like a timing thing. I
process every receive interrupt I get, so it's not like I'm sitting on
my hands waiting to clean up received packets.
I know there aren't too many clues here, so if you want to see some code let me know.
cr
_______________________________________________
darwin-drivers mailing list | darwin-drivers@lists.apple.com
Help/Unsubscribe/Archives: http://www.lists.apple.com/mailman/listinfo/darwin-drivers
Do not post admin requests to the list. They will be ignored.
9:46:27 PM
|
|
From: Chuck Remes
Subject: Re: ethernet driver: kIOReturnOutputStall responsibility
Date: February 25, 2004 10:11:44 PM CST
To: Justin Walker
Cc: darwin-drivers@lists.apple.com, hackers@opendarwin.org
On Feb 25, 2004, at 8:51 PM, Justin Walker wrote:
Thanks for following up; it's always good to complete a discussion for the sake of the archives ;=}
On Wednesday, February 25, 2004, at 06:09 PM, chuck remes wrote:
The UDP_STREAM test was more interesting. Immediately after firing up
the test, the driver started kicking out errors about stalling the
output queue (kIOReturnOutputStall). The netperf application was firing
off packets as fast as it could, and this overran the transmit
descriptor rings in the driver. The error handling detected this
exhaustion of resources, returned kIOReturnOutputStall to the "upper"
layers of the stack, and waited for the watchdog timer to go off to
call transmitQueue->service( IOBasicOutputQueue::kServiceAsync ).
The service call notifies the stack that everything is peachy again and
transmission can restart.
When
you say "the stack", are you referring to the networking stack, or code
below that stack but above this part of your driver? AFAIK, the
networking stack doesn't do timeouts or other recovery, nor does it
keep state about recent attempts to transmit. Here, I mean UDP,
of course; TCP is a different kettle of code.
I'm referring to anything above the driver.
So I modified the code to return false
from outputPacket if there weren't any available resources to transmit
another packet. This put the onus for error recovery back on the stack
(or application). Rerunning the test resulted in a maximum usage of
resources at all times within the driver and then some. Honestly, it
exposed some problems with performance when servicing interrupts, so
I'm glad I ran it. It caused my call to
IOMbufCursor::getPhysicalSegmentsWithCoalesce to return zero a bunch of
times.
To
make sure I understand: does this mean that you are just passing an
error condition upstream and not attempting to do any recovery on your
own (which would not be A Good Thing)?
Mmm, no. Let me explain again. When I am preparing a packet for
transmission, I coalesce whatever the stack handed to me into a single
mbuf. I call getPhysicalblahblah to do this work AND to retrieve the
physical address of the buffer. That method (and its non-coalescing
brethren) are the only "safe" ways to get physical addresses unless you
want to use the sekret function calls buried within the bowels of the
IONetworkingFamily. I want this code to work on new releases of
darwin/OSX, so I don't mess around and just use the nice IOKit api call.
Under extreme stress (netperf is trying to send a bajillion 64 byte UDP
packets as fast as it can), the getPhysical* call will return 0. There
is no way to recover from this since the resource is exhausted except
for waiting/sleeping and trying again. Instead of retrying, I return
"false" from my outputPacket method. Is this not A Good Thing?
I now have to investigate why sending
and receiving are failing under extreme stress, so if anyone knows why
getPhysicalSegmentsWithCoalesce or copyPacket would fail, I'd love to
hear it.
Both of these are in the IONetworkingFamily code (which you say you've looked at):
copyPacket fails when there are no mbufs available
or when the source mbuf is malformed;
the Newton-John call (getPhysical :-}) fails for
a variety of strange and wondrous reasons, mostly
having to do with malformed mbufs and resource runout.
Oh yes, I've looked at that code. I sometimes dream about it. :-)
I guess the system is running out of mbufs. When I switch to using the
built-in ethernet (GMAC driver), it doesn't print any of the errors I
see listed in the source. I'll have to double-check, but I don't recall
the error counters incrementing from dropped packets either (though I
may be wrong on that score). However, I'm left with the impression that
it is handling the stress better than my driver and that I cannot
abide. I'll go nuts tuning this thing and it's probably due to lame-o
hardware. :-)
cr
_______________________________________________
darwin-drivers mailing list | darwin-drivers@lists.apple.com
Help/Unsubscribe/Archives: http://www.lists.apple.com/mailman/listinfo/darwin-drivers
Do not post admin requests to the list. They will be ignored.
9:44:41 PM
|
|
From: Justin Walker
Subject: Re: ethernet driver: kIOReturnOutputStall responsibility
Date: February 25, 2004 8:51:35 PM CST
To: Chuck Remes
Cc: darwin-drivers@lists.apple.com, hackers@opendarwin.org
Thanks for following up; it's always good to complete a discussion for the sake of the archives ;=}
On Wednesday, February 25, 2004, at 06:09 PM, chuck remes wrote:
The
UDP_STREAM test was more interesting. Immediately after firing up the
test, the driver started kicking out errors about stalling the output
queue (kIOReturnOutputStall). The netperf application was firing off
packets as fast as it could, and this overran the transmit descriptor
rings in the driver. The error handling detected this exhaustion of
resources, returned kIOReturnOutputStall to the "upper" layers of the
stack, and waited for the watchdog timer to go off to call
transmitQueue->service( IOBasicOutputQueue::kServiceAsync ). The
service call notifies the stack that everything is peachy again and
transmission can restart.
When you say "the stack", are you referring to the networking stack, or
code below that stack but above this part of your driver? AFAIK,
the networking stack doesn't do timeouts or other recovery, nor does it
keep state about recent attempts to transmit. Here, I mean UDP,
of course; TCP is a different kettle of code.
So I modified the
code to return false from outputPacket if there weren't any available
resources to transmit another packet. This put the onus for error
recovery back on the stack (or application). Rerunning the test
resulted in a maximum usage of resources at all times within the driver
and then some. Honestly, it exposed some problems with performance when
servicing interrupts, so I'm glad I ran it. It caused my call to
IOMbufCursor::getPhysicalSegmentsWithCoalesce to return zero a bunch of
times.
To make sure I understand: does this mean that you are just passing an
error condition upstream and not attempting to do any recovery on your
own (which would not be A Good Thing)?
I now
have to investigate why sending and receiving are failing under extreme
stress, so if anyone knows why getPhysicalSegmentsWithCoalesce or
copyPacket would fail, I'd love to hear it.
Both of these are in the IONetworkingFamily code (which you say you've looked at):
copyPacket fails when there are no mbufs available
or when the source mbuf is malformed;
the Newton-John call (getPhysical :-}) fails for
a variety of strange and wondrous reasons, mostly
having to do with malformed mbufs and resource runout.
Cheers,
Justin
--
Justin C. Walker, Curmudgeon-At-Large *
Institute for General Semantics | When LuteFisk is outlawed
| Only outlaws will have
| LuteFisk
*--------------------------------------*-------------------------------*
9:41:26 PM
|
|
From: Chuck Remes
Subject: Re: ethernet driver: kIOReturnOutputStall responsibility
Date: February 25, 2004 8:09:41 PM CST
To: darwin-drivers@lists.apple.com
Cc: hackers@opendarwin.org
I'm dredging up an email from a LONG time ago cuz I promised I'd let the list know what I found.
There was a thread on here a few weeks back about stress testing a NIC
driver. One of the responses suggested using the 'netperf' utility
(available at http://www.netperf.org/netperf/NetperfPage.html). I
downloaded that utility and started doing some testing.
The TCP_STREAM test was pretty ordinary. From a darwin-ppc box (G5 dual
2.0) to a darwin-x86 box (Athlon 1800), each test resulted in about 94
Mbps throughput.
The UDP_STREAM test was more interesting. Immediately after firing up
the test, the driver started kicking out errors about stalling the
output queue (kIOReturnOutputStall). The netperf application was firing
off packets as fast as it could, and this overran the transmit
descriptor rings in the driver. The error handling detected this
exhaustion of resources, returned kIOReturnOutputStall to the "upper"
layers of the stack, and waited for the watchdog timer to go off to
call transmitQueue->service( IOBasicOutputQueue::kServiceAsync ).
The service call notifies the stack that everything is peachy again and
transmission can restart.
This meant there could be up to a full second of delay while the output
queue was stalled. The test returned a dismal score in the neighborhood
of 3 Mbps. Not good...
So I modified the code to return false from outputPacket if there
weren't any available resources to transmit another packet. This put
the onus for error recovery back on the stack (or application).
Rerunning the test resulted in a maximum usage of resources at all
times within the driver and then some. Honestly, it exposed some
problems with performance when servicing interrupts, so I'm glad I ran
it. It caused my call to IOMbufCursor::getPhysicalSegmentsWithCoalesce
to return zero a bunch of times.
Swapping the effort around so the PC was sending to the Mac, it exposed
a problem in my receive handling. The first UDP_STREAM test uses a
packet size of 64 bytes which causes copyPacket to fail a LOT.
I now have to investigate why sending and receiving are failing under
extreme stress, so if anyone knows why getPhysicalSegmentsWithCoalesce
or copyPacket would fail, I'd love to hear it.
cr
On May 4, 2002, at 3:49 PM, chuck remes wrote:
On Saturday, May 4, 2002, at 03:31 PM, Justin C. Walker wrote:
This is more opinion than fact, but it is based on a lot of experience in this area.
I don't think that trying to exert "back pressure" from an ethernet (or
similar hardware type) driver is a good idea. The networking
layers are conditioned to expect failures, and the protocols are
designed to work reasonably well in the face of resource problems in
the network.
Trying to exert back pressure is (for the current software base) not
useful, and in fact, it may exacerbate a resource exhaustion condition
if intervening layers try to hold on to packets to be attempted later.
I would personally rather have you drop the packet (and bump the
appropriate counters in the 'ifnet' structure), and let Higher
Authority deal with it as a normal "congestion" problem. There
may be an error condition you can propagate back upstream, but this one
doesn't sound like the right one.
On Saturday, May 4, 2002, at 10:28 AM, chuck remes wrote:
In an ethernet driver, you are supposed to return kIOReturnOutputStall
when there aren't any available resources to send a packet from your
outputPacket() method.
Whose responsibility is it to restart the output queue? If I look at the doc in IOOutputQueue.h, the headerdoc specifies:
<snip>
Justin,
thanks for the response. You raise a very good point. I did
a lot more digging in the IONetworkingFamily code and discovered that
the only way to clear a stall condition (which is held by your
IOOutputQueue, BTW) is to call its start() or service() methods.
I looked at *all* of the drivers in the darwin cvs repository and all
of them just return kIOReturnOutputStall without setting a timer or
anything to make sure that condition is eventually cleared. This
is probably a bug, so I'll probably file something on it after I've
completed my research.
In the meantime, I think I agree with you. Provided the driver
has allocated "reasonable" resources for packet transmission, if these
resources are overrun the driver should probably just drop the packet
and internally start freeing up some structures.
I'll code up a couple of different things and see how it all behaves. I'll let the list know what I find.
cr
_______________________________________________
darwin-drivers mailing list | darwin-drivers@lists.apple.com
Help/Unsubscribe/Archives: http://www.lists.apple.com/mailman/listinfo/darwin-drivers
Do not post admin requests to the list. They will be ignored.
9:40:06 PM
|
|
From: Justin Walker
Subject: Re: ethernet driver: kIOReturnOutputStall responsibility
Date: May 4, 2002 5:38:55 PM CDT
To: darwin-drivers@lists.apple.com
On Saturday, May 4, 2002, at 01:49 PM, chuck remes wrote:
On Saturday, May 4, 2002, at 03:31 PM, Justin C. Walker wrote:
This is more opinion than fact, but it is based on a lot of experience in this area.
[snip]
Justin,
thanks for the response. You
raise a very good point. I did a lot more digging in the
IONetworkingFamily code and discovered that the only way to clear a
stall condition (which is held by your IOOutputQueue, BTW) is to call
its start() or service() methods. I looked at *all* of the
drivers in the darwin cvs repository and all of them just return
kIOReturnOutputStall without setting a timer or anything to make sure
that condition is eventually cleared. This is probably a bug, so
I'll probably file something on it after I've completed my research.
Yuck. I agree, pending clarification from someone who may
understand this better than I. Odd that this hasn't shown up in
any obvious way yet.
A couple of notes: there was a 'start' field in the "ifnet" structure
at one point; we removed it because its use seemed to create blocking
points where none need be, and because of potential races. It
seemed better to let the driver itself handle all of this, rather than
have the users of the driver try to figure out what to do.
Also, while the "stall" response doesn't make sense to me for
ethernet-like devices, it may make sense for circuit-oriented devices
like ATM. In this case, back-pressure does make some sense, and
dropping frames does not.
In
the meantime, I think I agree with you. Provided the driver has
allocated "reasonable" resources for packet transmission, if these
resources are overrun the driver should probably just drop the packet
and internally start freeing up some structures.
I'm not sure it's a driver-specific issue. The driver can do
everything "right", but if the system mbuf pool runs dry, there's
nothing it can do. Trying to plan for this type of eventuality
just introduces needless complexity (IMHO). But dropping the
packet without doing much else seems right to me.
I'll code up a couple of different things and see how it all behaves. I'll let the list know what I find.
Thanks.
Regards,
Justin
--
Justin C. Walker, Curmudgeon-At-Large *
Institute for General Semantics | If you're not confused,
| You're not paying attention
*--------------------------------------*-------------------------------*
_______________________________________________
darwin-drivers mailing list | darwin-drivers@lists.apple.com
Help/Unsubscribe/Archives: http://www.lists.apple.com/mailman/listinfo/darwin-drivers
Do not post admin requests to the list. They will be ignored.
9:38:51 PM
|
|
From: Chuck Remes
Subject: Re: ethernet driver: kIOReturnOutputStall responsibility
Date: May 4, 2002 3:49:54 PM CDT
To: darwin-drivers@lists.apple.com
On Saturday, May 4, 2002, at 03:31 PM, Justin C. Walker wrote:
This is more opinion than fact, but it is based on a lot of experience in this area.
I don't think that trying to exert
"back pressure" from an ethernet (or similar hardware type) driver is a
good idea. The networking layers are conditioned to expect
failures, and the protocols are designed to work reasonably well in the
face of resource problems in the network.
Trying to exert back pressure is (for
the current software base) not useful, and in fact, it may exacerbate a
resource exhaustion condition if intervening layers try to hold on to
packets to be attempted later.
I would personally rather have you
drop the packet (and bump the appropriate counters in the 'ifnet'
structure), and let Higher Authority deal with it as a normal
"congestion" problem. There may be an error condition you can
propagate back upstream, but this one doesn't sound like the right one.
On Saturday, May 4, 2002, at 10:28 AM, chuck remes wrote:
In an ethernet driver, you are supposed to return kIOReturnOutputStall
when there aren't any available resources to send a packet from your
outputPacket() method.
Whose responsibility is it to restart the output queue? If I look at the doc in IOOutputQueue.h, the headerdoc specifies:
<snip>
Justin,
thanks for the response. You raise a very good point. I did
a lot more digging in the IONetworkingFamily code and discovered that
the only way to clear a stall condition (which is held by your
IOOutputQueue, BTW) is to call its start() or service() methods.
I looked at *all* of the drivers in the darwin cvs repository and all
of them just return kIOReturnOutputStall without setting a timer or
anything to make sure that condition is eventually cleared. This
is probably a bug, so I'll probably file something on it after I've
completed my research.
In the meantime, I think I agree with you. Provided the driver
has allocated "reasonable" resources for packet transmission, if these
resources are overrun the driver should probably just drop the packet
and internally start freeing up some structures.
I'll code up a couple of different things and see how it all behaves. I'll let the list know what I find.
cr
_______________________________________________
darwin-drivers mailing list | darwin-drivers@lists.apple.com
Help/Unsubscribe/Archives: http://www.lists.apple.com/mailman/listinfo/darwin-drivers
Do not post admin requests to the list. They will be ignored.
9:35:52 PM
|
|
From: Justin Walker
Subject: Re: ethernet driver: kIOReturnOutputStall responsibility
Date: May 4, 2002 3:31:46 PM CDT
To: darwin-drivers@lists.apple.com
This is more opinion than fact, but it is based on a lot of experience in this area.
I don't think that trying to exert "back pressure" from an ethernet (or
similar hardware type) driver is a good idea. The networking
layers are conditioned to expect failures, and the protocols are
designed to work reasonably well in the face of resource problems in
the network.
Trying to exert back pressure is (for the current software base) not
useful, and in fact, it may exacerbate a resource exhaustion condition
if intervening layers try to hold on to packets to be attempted later.
I would personally rather have you drop the packet (and bump the
appropriate counters in the 'ifnet' structure), and let Higher
Authority deal with it as a normal "congestion" problem. There
may be an error condition you can propagate back upstream, but this one
doesn't sound like the right one.
Regards,
Justin
On Saturday, May 4, 2002, at 10:28 AM, chuck remes wrote:
In an
ethernet driver, you are supposed to return kIOReturnOutputStall when
there aren't any available resources to send a packet from your
outputPacket() method.
Whose responsibility is it to restart the output queue? If I look at the doc in IOOutputQueue.h, the headerdoc specifies:
@constant kIOReturnOutputStall Stall the queue and retry the same packet
when the queue is restarted. */
I searched through the rest of the
code in the IONetworkingFamily but I didn't see anything that ever
checked a return code for kIOReturnOutputStall.
When I hit this condition in my
driver, the interface is effectively dead until I do an "ifconfig
up/down" sequence which resets everything.
cr
_______________________________________________
darwin-drivers mailing list | darwin-drivers@lists.apple.com
Help/Unsubscribe/Archives: http://www.lists.apple.com/mailman/listinfo/darwin-drivers
Do not post admin requests to the list. They will be ignored.
--
Justin C. Walker, Curmudgeon-At-Large *
Institute for General Semantics | Men are from Earth.
| Women are from Earth.
| Deal with it.
*--------------------------------------*-------------------------------*
_______________________________________________
darwin-drivers mailing list | darwin-drivers@lists.apple.com
Help/Unsubscribe/Archives: http://www.lists.apple.com/mailman/listinfo/darwin-drivers
Do not post admin requests to the list. They will be ignored.
9:34:16 PM
|
|
The Apple mailing lists aren't open for 'bots to crawl, so googling
for information on those lists is impossible. In the next few posts,
I'll reproduce the thread as it occurred on the darwin-drivers list the
last few days. The information there is pretty useful.
The first post was actually made back in 2002. I ran across it while
searching through my personal mail archives and decided to close the
loop on something said in it.
9:31:31 PM
|
|
Got a lot of help on the <a
href="http://lists.apple.com/mailman/listinfo">darwin-drivers</a>
mailing list the last few days. I was home from work sick with the flu
but started going stir-crazy about 3 hours into my first day. I started
doing some performance testing with a tool called <a
href="http://www.netperf.org/netperf/NetperfPage.html">netperf</a>
which had been suggested on the driver's list as a good stress-test
tool.
I began testing the driver using a G5 (darwin 7.2.0) as the sender and
an Athlon 1400 (?) (darwin 7.01) as the receiver. The command I used
was:
netperf -H 192.168.2.5 -t TCP_STREAM -- -m 1024
The driver performed very nicely. Right away I started getting back 94 Mbps throughput from both sides. Excellent, I thought.
Next, I tried the UDP test. It's a real ball-buster because UDP, unlike
TCP, has no flow control. This test streams packets just as fast as
your driver and hardware can push them out the door.
The G5 started stalling on transmission almost immediately, and the x86
box just sat there twiddling its thumbs. Switching things up so the x86
box was the sender and the G5 the receiver resulted in the same
situation (x86 stalling, G5 waiting).
Obviously a driver problem, but where to start? Since transmit was stalling so quickly, it made sense to tackle that first.
I noticed that after the stall, it would be a "long" time before
packets started being sent again. I thought (erroneously, as it turned
out) that the box was busy cleaning up after itself and the interrupt
delivery was slow. So I added a descriptor cleanup method call directly
into my outputPacket() method. This didn't do much. I stumped myself
right out of the gate, so I did what any programmer would do in the
same situation... I forgot to record my changes and started touching
code all over the place.
One of the things I did was to reboot the x86 box into freebsd 4.9 and run
the test from there. Running the UDP_STREAM test from the freebsd box
towards darwin caused the IONetworkController::copyPacket() method to
fail a lot. So at this point
I got distracted by receive performance and started working on it. I
posted some notes to darwin-drivers and waited. Within 15 minutes or so
I got a response; a very detailed response.
Fast forward here... lots of emails went back and forth that day and
the next. I got a lot of good hints and information from another
programmer who had "been there, done that." I'll post the entire
email thread in subsequent posts, but for now I'll just post what I
learned.
1. Do not call a method from outputPacket() that is also being called from your workloop context.
outputPacket() runs on the client's thread, not your driver thread. It
can, and does, preempt any work being done by your workloop, so there
is a possibility of a race condition. As the system gets more stressed,
this possibility becomes a certainty. I panic'ed the machine a few
times trying to release an already free mbuf (calling
releaseFreePackets()).
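[A sketch of that guard, using OSCompareAndSwap() from libkern. txCleanupInProgress is an invented UInt32 flag, so treat this as an illustration rather than the driver's actual fix.]
// only one context at a time gets to run the cleanup
if ( OSCompareAndSwap( 0, 1, &txCleanupInProgress ) )
{
    _handleTxCleanup();                              // we won the race; safe to clean
    OSCompareAndSwap( 1, 0, &txCleanupInProgress );  // release the guard
}
// else: the other context is already cleaning up, so skip it this time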
2. After completing your housekeeping tasks in your interrupt method,
call IOBasicOutputQueue::service() and pass in the async option.
The service() call informs the stack that your hardware is now ready to
begin processing packets again. The weird delay I'd seen when stalling
before packets would transmit again was caused by only calling
service() from the timer routine which runs once per second. I did
call service() at the end of my interrupt routine, but I had commented
it out when chasing down the RX problems and never made a note to undo
it.
3. If any variables are shared between the outputPacket() method and
your interrupt or timer methods, make sure operations on them are
atomic.
This is due to them running in different contexts as described in #1. I
use temporary variables to track activity in each context and then use
OSAddAtomic() or a similar routine to modify the variable in one atomic
action. This fixed a couple of bizarro problems that I had seen in
earlier testing that refused to be reliably duplicated.
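[Concretely, the pattern looks something like this. A sketch: completedThisPass and txActiveCount stand in for whatever the real counters are, and txDescriptorComplete() is a placeholder for the driver's own test.]
SInt32 completedThisPass = 0;
while ( txDescriptorComplete( txHead ) )    // placeholder test
{
    // ... reclaim the descriptor, advance txHead ...
    completedThisPass++;
}
// publish the whole batch with a single atomic update
OSAddAtomic( -completedThisPass, (SInt32 *) &txActiveCount );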
4. Shark is your friend.
The great engineers at Apple have provided a wonderful set of tools
they call <a
href="http://developer.apple.com/tools/performance/">CHUD
Tools</a> which is an acronym for Cannibalistic Humanoid
Underground Dweller Tools. In some circles it means Computer Hardware
Understanding Development Tools, but we don't associate with those
people. Anyway, I used Shark to sample the kernel activity when running
the netperf tests. It gave a lot of good hints about where, deep in the
bowels of the system, the code was choking. Shark, and a side comment
made in the mailing list thread, led me to my next discovery.
5. Always use replaceOrCopyPacket() instead of copyPacket() or
replacePacket() unless you really really really know how to tune
performance better than years of measurement and diagnosis from the BSD
programmer community.
I was shy about giving up the RX mbufs I allocated during driver
startup. It was a pain in the ass, and in my mind, an expensive
operation to get the DMA address of each mbuf instead of just reusing
the same packets over and over via copyPacket(). I was copying packets
ranging from the minimum size all the way up to max (1500 for
ethernet). It turns out bcopy() is even more expensive than figuring
out the DMA address of a new mbuf and storing it.
6. In reference to #5, IOMbufCursor::getPhysicalSegments() is not that scary of a routine. Just use it.
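[Putting #5 and #6 together, the receive loop body ends up looking roughly like the sketch below. The ring-slot fields are simplified stand-ins, and networkInterface and frameLength are assumed to exist in the driver.]
bool replaced;
struct mbuf *pkt = replaceOrCopyPacket( &rx_desc_ring[ i ].mbuf, frameLength, &replaced );
if ( pkt )
{
    if ( replaced )
    {
        // a fresh mbuf is now in the ring slot: look up its DMA address once
        IOPhysicalSegment seg;
        UInt32 n = rxMbufCursor->getPhysicalSegments( rx_desc_ring[ i ].mbuf, &seg, 1 );
        if ( n == 1 )
            rx_desc_ring[ i ].bufferAddress = seg.location;
    }
    networkInterface->inputPacket( pkt, frameLength,
                                   IONetworkInterface::kInputOptionQueuePacket );
}
else
    netStats->inputErrors++;    // out of mbufs: recycle the old buffer, count a drop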
I probably had a few more insights, but they're lost to memory now.
9:30:15 PM
|
|
© Copyright 2004 Chuck Remes.
|
|
|