Unfortunately, I've not gotten around to much more work on this webservice-just-stops ("WSJS") issue. Was planning to spend most of this weekend on it but got caught up in something else. Still actively looking into it though, just not this weekend. But the code is plump with logging printfs and various hooks for getting glimpses into what's going on, and decent though slow progress is being made.
Anyway, in the meantime, I had a question for you guys which may help to guide further investigations. It requires a little background though, so bear with me here.
Just to review what I mentioned earlier: On my setup, it seems like WSJS is always associated with the concurrence of two events (in this order):
- Loss of link on the WiFi
- A near-simultaneous cascade of USB errors on (at least one of) the phidgets. (In my setup, that would be either the SBC1 on-board IK 8/8/8 or the external IK 0/0/4 of the 1014 relay set. Not sure which, and it may perhaps be both.)
I've established pretty clearly that the webservice is capable of recovering gracefully from WiFi link drops, as long as it doesn't get into this secondary horsehockey with the USB, and I am able to induce such graceful drop/recoveries easily, using WiFi interference or just flood pinging the SBC. The WSJS events occur only (so far in my experiments) when the WiFi link drop is "accompanied" somehow by these USB events.
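For anyone wanting to check the same pattern in their own logs, here is a minimal sketch of the correlation step. It is only an illustration: the function name `drops_with_usb_errors`, the 2-second window, and the timestamps are all made up, and it assumes the link-drop and USB-error times have already been pulled out of the logs somehow.

```python
from bisect import bisect_left


def drops_with_usb_errors(link_drops, usb_errors, window_s=2.0):
    """Split link-drop times by whether a USB error follows within window_s.

    link_drops, usb_errors: lists of event times in seconds (hypothetical
    values extracted from the log output by whatever means is handy).
    """
    usb_sorted = sorted(usb_errors)
    with_usb, without_usb = [], []
    for t in sorted(link_drops):
        # Index of the first USB error at or after this link drop.
        i = bisect_left(usb_sorted, t)
        if i < len(usb_sorted) and usb_sorted[i] - t <= window_s:
            with_usb.append(t)
        else:
            without_usb.append(t)
    return with_usb, without_usb


# Made-up timestamps, purely for illustration.
drops = [10.0, 55.0, 120.0]
usb = [10.4, 10.5, 10.9, 300.0]
accompanied, clean = drops_with_usb_errors(drops, usb)
print("drops followed by USB errors:", accompanied)
print("drops that recovered cleanly:", clean)
```

Only USB errors at or after the drop are counted, to match the "in this order" observation above; the window size is a guess and would need tuning against real logs.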
So the two obvious hypotheses in play -- and neither of these is necessarily correct, could be something entirely different -- are as follows:
- Supply sag: Bringing the WiFi link back up after a link loss may include re-initializing the driver, which in turn causes the WiFi transceiver to suck heavier current for a few hundred ms during its renegotiation with the access point. (In a previous life, I did design work on various 802.x devices, and they do go thru a startup phase which generally draws quite a bit more current than during steady-state.)
So maybe this hypothetical current spike is pulling down the supply just enough to get the USB devices right on the hairy edge of misbehaving. So, during recovery from WiFi drops sometimes they fail, sometimes not.
Bolstering this hypothesis is that the little wall-wart power supply for the SBC1 is pretty wussy, though I have not scoped it yet (which I should do.) I may also try comparing fail rates when the SBC is powered from the wall-wart vs. running it off a bench supply.
- Inter-thread interference/confusion: During the WiFi up/down/up cycle, the various read/write threads come up and down indirectly in response to the WiFi link state as they reconnect with the control app. Perhaps they sometimes become "confused" about the desired state of the USB bus or the individual devices, with maybe one thread trying to bring a device down while another is trying to bring it up. Or some sort of confusion like this, where the threads are acting at cross purposes due to [astonishingly!] poor software design, including mutex use/abuse, etc.
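To make the second hypothesis concrete, here is a minimal sketch of the "cross purposes" scenario and one common way of guarding against it. This is not the webservice's actual code or design, which is not shown here: the `DeviceState` class, its methods, and the generation counter are all hypothetical, and Python stands in for whatever the real threads are written in.

```python
import threading


class DeviceState:
    """Hypothetical stand-in for one USB device's up/down state."""

    def __init__(self):
        self._lock = threading.Lock()
        self._generation = 0  # bumped on every WiFi link transition
        self.is_up = False
        self.history = []

    def link_changed(self):
        """Record a link transition and return the new generation number."""
        with self._lock:
            self._generation += 1
            return self._generation

    def request(self, generation, want_up):
        """Apply an up/down request unless a newer link event superseded it."""
        with self._lock:
            if generation != self._generation:
                return False  # stale request left over from an older cycle
            self.is_up = want_up
            self.history.append((generation, want_up))
            return True


dev = DeviceState()
old = dev.link_changed()  # link went down; a teardown gets scheduled
new = dev.link_changed()  # link came back before that teardown ran

up = threading.Thread(target=dev.request, args=(new, True))
down = threading.Thread(target=dev.request, args=(old, False))
up.start()
down.start()
up.join()
down.join()
print(dev.is_up, dev.history)
```

Without the generation check, the late teardown could land after the bring-up and leave the device down while everything else believes it is up, which is the kind of confusion the hypothesis describes. With the check, the stale request is dropped whichever thread runs first.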
So that's the background.
Here's my question: It would be interesting to know if you guys have heard of this WSJS problem occurring when the WS (and its associated local phidgets) are running not on an SBC but on a laptop or a desktop machine, which is then connected to the controlling app (on some other machine) via WiFi?
The reason for asking is simply that a laptop/desktop would have a heftier power supply than the SBC, hence more margin against current spikes. So for example, if this problem has never been observed on a laptop, that lends more credence to hypothesis #1. It would be nicer though if this problem has been seen on a laptop, because then that strongly suggests #2.
On the other hand, a confounding factor is that the SW execution environment on a laptop/desktop is also obviously somewhat different than on the SBC, and so hypothesis #2 would be affected by this too (and it's certainly not clear whether the laptop environment would be more or less prone to the thread confusion). Still, if WSJS has been seen on a WiFi-connected laptop, then I would probably concentrate on #2.
So any thoughts you guys (or anyone) have on this, please post them here.