Sounds like progress being made!
Responses to your questions:
> Does your WSJS event always result in the cascade of usb errors and the
> webservice exiting?
It's just a matter of definition: for the sake of strict categorization, I've defined the "WSJS" fault mode as always being accompanied by a cascade of USB errors, and resulting in either 0 or 1 threads left standing. But there are many variations on this basic theme that I don't classify as WSJS (just in order to keep them straight and categorized nicely). For example, some similar fault modes that have been seen:
- A cascade of USB errors that leaves not 0 or 1 but "a few" processes remaining. WS is still effectively dead (i.e., the client cannot connect to it).
- A cascade of USB errors after which WS eventually (several seconds later) winds up cleaning itself up, and the client can connect again.
> Does the webservice crash with a fault code?
If by "fault code" you mean the exit code returned to the shell environment, then the answer is that it has always been zero on the occasions I've checked it (which is not often).
> Does it even stay running after the usb errors, but in a bad un-connectable state
> in the case of a WSJS?
See the first response above. There are many variations on the basic theme of "WS gets wedged". I have classified only the main ones so far (WSJS, TGW, SBO), plus a few others that I didn't put into that "events.html" description.
> Are you able to get WSJS errors if you use ethernet on the SBC - or is it necessary
> to use WiFi? What about using ethernet vs. WiFi on the client-side as the interface
> that is brought up/down?
I tried copper only once, briefly. (It's somewhat painful for me to do, because I have to take the SBC out of service, remove it from its normal mounting location, put it on the bench, etc.) So I was highly motivated not to play with it for very long.
Nevertheless, during that brief trial I did not see any WSJS events while simulating link outages by plugging/unplugging the RJ-45. So it's inconclusive, really.
But... there is a very important difference in how link up/down/up events appear between copper and wireless. With copper, if you bring the interface down (either via "ifconfig" in software or by unplugging the RJ-45), both sides -- and importantly, the server side -- know immediately that the link is down, because carrier ceases and the driver does "something" to handle it cleanly as soon as it occurs. I'm not sure exactly what that "something" is, but whatever it is must make its way up to the TCP logic in a graceful way, since it's a more or less standard type of event. On the other hand, with a WiFi up/down/up initiated via ifconfig at the client side (as done in the diagnostic suite with "intf_cycle"), the server side has no clue that the link just went down. It just stops getting data, and then starts again once the link is restored.

Another important difference is that link bobbling (on the down->up transition) is extremely unlikely over copper, yet is probably fairly common with WiFi.

So, given those contextual differences, my thinking was that the way the server-side TCP handles copper vs. WiFi drops is going to be different enough that I wouldn't want to draw conclusions from the comparison. It didn't seem worthwhile to continue probing with copper.
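To make the WiFi case concrete: a blocking recv() on the server side will sit forever on a connection whose peer silently vanished, unless TCP keepalive probes are enabled. This is only a sketch (assuming a Linux server; the option names and the parameter values are illustrative, not anything from the actual webservice), showing how a server-side socket could be made to notice a dead peer on its own:

```python
import socket

def enable_keepalive(sock, idle=5, interval=2, count=3):
    # Linux-specific TCP keepalive tuning (sketch; values are illustrative).
    # After `idle` seconds of silence, probe every `interval` seconds,
    # and declare the peer dead after `count` unanswered probes -- so a
    # recv() on a silently-dropped WiFi peer eventually errors out instead
    # of blocking forever.
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, idle)
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPINTVL, interval)
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPCNT, count)

s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
enable_keepalive(s)
assert s.getsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE) != 0
s.close()
```

With copper, the carrier-loss event reaches TCP without any of this; the keepalive machinery is one way to approximate that signal for the WiFi case.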
My working hypothesis here, which the evidence so far seems to support, is that the main source of trouble is the heartbeat logic: upon detecting link loss and subsequent restoral, it immediately fires off a close() to the server, and then shortly thereafter attempts to connect() again. So upon restoral of link connectivity, the client's TCP winds up throwing two conflicting sets of traffic back at the server. The first set comprises all the un-ACKed segments in its transmit window from the original connection, followed by the FIN for the close(); but also -- and asynchronously with that leftover Tx-window traffic -- comes the SYN (possibly more than one!) for the new connection. Because the window clear-out and the SYNs are asynchronous, the order in which the server receives these events is totally arbitrary. And in some cases, the threads being started/stopped by these two streams of traffic can get into conflict when exercising the USB, and cause "some sort of" problem that in turn sometimes leads to the USB error cascade. (And sometimes to other fault modes, as you've seen with TGW.)
I'm trying to work up a nice, relatively repeatable smoking-gun example, and I have some ideas about how to work around it, if the hypothesis turns out to be the case.
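For what it's worth, one mitigation that would be consistent with this hypothesis (a sketch only -- not necessarily any of the ideas alluded to above) is an abortive close on the client: SO_LINGER with a zero timeout discards the un-ACKed transmit window and sends a RST instead of a FIN, so a later reconnect doesn't race against leftover segments from the old connection.

```python
import socket
import struct

s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
# struct linger { int l_onoff; int l_linger; } -- onoff=1, linger=0
# turns close() into an abortive close (RST, transmit queue discarded).
s.setsockopt(socket.SOL_SOCKET, socket.SO_LINGER, struct.pack("ii", 1, 0))
onoff, linger = struct.unpack(
    "ii", s.getsockopt(socket.SOL_SOCKET, socket.SO_LINGER, 8))
assert (onoff, linger) == (1, 0)
s.close()  # now abortive: no queued FIN trailing the stale window
```

The trade-off is that any genuinely in-flight heartbeat data is thrown away, which may or may not matter for this protocol.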
Let me know if you have any other questions; I'll respond right away. Really glad that you're working on it.