SBC webservice stops

General PhidgetSBC Discussion.
AndreGermain
Phidgetly
Posts: 41
Joined: Fri Sep 30, 2011 7:09 am

Re: SBC webservice stops

Postby AndreGermain » Wed Jul 17, 2013 7:40 pm

Patrick,

I swapped the SBC1 with the SBC2 [thus changing roles] so that the SBC2 is on the WiFi, which is where I have the PhidgetWebService issue. I did the steps you provided to add "supervise" to the PhidgetWebService startup [darn nice online file editor!] and it booted fine.

I'll keep you posted as to how well it works out.

Cheers & thank you

AndreGermain
Phidgetly
Posts: 41
Joined: Fri Sep 30, 2011 7:09 am

Re: SBC webservice stops

Postby AndreGermain » Mon Jul 22, 2013 8:17 pm

Patrick,

the supervise seems to correct the web service stopping, but it doesn't resolve the second problem, which has the same final outcome: the Phidgets detach inexplicably and remain detached until a reboot (over WiFi from the SBC2). I don't have more details about this remaining problem. I do note that the average in/out rate is 40 kb/s (i.e., low) and continues so despite the detachment.

Interestingly, this problem used to be visible by the fact that one could not stop the service, but now I can't tell because of course supervise re-starts it.

Cheers

glenn
Phidgeteer!
Posts: 93
Joined: Sun Sep 05, 2010 4:42 pm

Re: SBC webservice stops

Postby glenn » Tue Jul 23, 2013 7:55 am

Patrick/Andre,

Last weekend I built a mildly instrumented custom version of both webservice and libphidget for my SBC1 and started trying to gather some failure mode data on this "webservice just stops" problem.

Since my setup fails pretty consistently (a few times/day), I have been able to see several events so far. I'll post more detail next weekend, but so far it seems pretty clear -- at least on my setup -- that these "webservice stops" events are associated with USB faults, which are in turn being tickled by transient link-drop events on the WiFi that cause the webservice session threads to exit and then respawn (i.e., to re-associate with the client).

Anyway, will post more info next weekend. Just thought I'd update you on current findings. A little progress being made anyway.

Regards,

Glenn

AndreGermain
Phidgetly
Posts: 41
Joined: Fri Sep 30, 2011 7:09 am

Re: SBC webservice stops

Postby AndreGermain » Tue Jul 23, 2013 8:23 pm

Glenn,

that makes a lot of sense! My SBC1 (now SBC2) on WiFi had for the longest time only the board's 8-8-8 and a 0-0-4 relay module, and was fine. Later I added an IR 1045, and it began to show signs of trouble (USB and network issues); later still I added a Spatial 3/3/3 (gen 2) and it got worse. Oddly, it's the IR that goes offline with these 4 Phidgets. For sure, reducing the network activity helped a lot (larger resolution threshold), but it is rather sad to lose out on resolution for the sake of stability.

Thank you much for working out a test fixture to help us all out! :)

Cheers

glenn
Phidgeteer!
Posts: 93
Joined: Sun Sep 05, 2010 4:42 pm

Re: SBC webservice stops

Postby glenn » Sun Jul 28, 2013 3:06 pm

Patrick/Andre,

Unfortunately, I've not gotten around to much more work on this webservice-just-stops ("WSJS") issue. Was planning to spend most of this weekend on it but got caught up in something else. Still actively looking into it though, just not this weekend. But the code is plump with logging printfs and various hooks for getting glimpses into what's going on, and decent though slow progress is being made.

Anyway, in the mean time, I had a question for you guys which may help to guide further investigations. It requires a little background though, so bear with me here.

Just to review what I mentioned earlier: On my setup, it seems like WSJS is always associated with the concurrence of two events (in this order):

  1. Loss of link on the WiFi
  2. A near simultaneous cascade of USB errors on (at least one of) the phidgets. (In my setup, that would be either the SBC1 on-board IK 8/8/8 or the external IK 0/0/4 of the 1014 relay set. Not sure which, and it may perhaps be both.)

I've established pretty clearly that the webservice is capable of recovering gracefully from WiFi link drops, as long as it doesn't get into this secondary horsehockey with the USB, and am able to induce such graceful drop/recoveries easily, using WiFi interference or just flood pinging the SBC. The WSJS events occur only (and exclusively, so far in my experiments) when the WiFi link drop is "accompanied" somehow by these USB events.
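
A minimal sketch of how that "accompanied" relationship could be checked offline, assuming the WiFi drop times and USB error times have already been pulled out of the logs as datetimes. The function name, the 5-second window and the timestamps are all hypothetical; nothing here comes from the instrumented build.

```python
from datetime import datetime, timedelta

def drops_with_usb_errors(wifi_drops, usb_errors, window_s=5.0):
    """Split WiFi drop times into those with and without a USB error nearby."""
    window = timedelta(seconds=window_s)
    accompanied, clean = [], []
    for drop in wifi_drops:
        # "nearby" means within window_s seconds before or after the drop
        near = any(abs(err - drop) <= window for err in usb_errors)
        (accompanied if near else clean).append(drop)
    return accompanied, clean

# Hypothetical timestamps, for illustration only (not taken from a real log).
t0 = datetime(2013, 7, 28, 12, 0, 0)
wifi_drops = [t0, t0 + timedelta(minutes=30), t0 + timedelta(minutes=60)]
usb_errors = [t0 + timedelta(minutes=30, seconds=2)]

accompanied, clean = drops_with_usb_errors(wifi_drops, usb_errors)
print(len(accompanied), "accompanied;", len(clean), "clean")
```

If the WSJS events line up only with the "accompanied" drops and never with the "clean" ones, that would be consistent with the pattern described above, though it would not by itself say which event causes which.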

So the two obvious hypotheses in play -- and neither of these is necessarily correct, could be something entirely different -- are as follows:

  1. Supply sag: Bringing the WiFi link back up after a link loss may include re-initializing the driver, which in turn causes the WiFi transceiver to suck heavier current for a few hundred ms during its renegotiation with the access point. (In a previous life, I did design work on various 802.x devices, and they do go thru a startup phase which generally draws quite a bit more current than during steady-state.)

    So maybe this hypothetical current spike is pulling down the supply just enough to get the USB devices right on the hairy edge of misbehaving. So, during recovery from WiFi drops sometimes they fail, sometimes not.

    Bolstering this hypothesis is that the little wall-wart power supply for the SBC1 is pretty wussy, though I have not scoped it yet (which I should do.) I may also try comparing fail rates when the SBC is powered from the wall-wart vs. running it off a bench supply.

  2. Inter-thread interference/confusion: During the WiFi up/down/up cycle, the various read/write threads come up and down indirectly in response to the WiFi link state as they reconnect with the control app. Perhaps they sometimes become "confused" about the desired state of the USB bus or the individual devices, with maybe one thread trying to bring a device down while another is trying to bring it up. Or some sort of confusion like this, where the threads are acting at cross purposes due to [astonishingly!] poor software design, including mutex use/abuse, etc.

So that's the background.

Here's my question: It would be interesting to know if you guys have heard of this WSJS problem occurring when the WS (and its associated local phidgets) are running not on an SBC but on a laptop or a desktop machine, which is then connected to the controlling app (on some other machine) via WiFi?

The reason for asking is simply that a laptop/desktop would have a heftier power supply than the SBC, hence more margin against current spikes. So for example, if this problem has never been observed on a laptop, that lends more credence to hypothesis #1. It would be nicer though if this problem has been seen on a laptop, because then that strongly suggests #2.

On the other hand, a confounding factor is that the SW execution environment on a laptop/desktop is also obviously somewhat different than on the SBC, and so hypothesis #2 would be affected by this too (and certainly not clear whether the laptop environment would be more or less prone to the thread confusion). Still, if WSJS has been seen on a WiFi-connected laptop, then I would probably concentrate on #2.

So any thoughts you guys (or anyone) have on this, please post them here.

Thx!

Glenn

AndreGermain
Phidgetly
Posts: 41
Joined: Fri Sep 30, 2011 7:09 am

Re: SBC webservice stops

Postby AndreGermain » Sun Jul 28, 2013 4:02 pm

Patrick, Glenn,

as most problems I've had with Phidgets are USB related, I swapped out the extender cable for the IR sensor (overall length was less than 5 m) for a powered hub cable, and it seems to have gotten rid of the connectivity issue with the IR 1045_1. However, a few hours later there was a WiFi hiccup; the Phidget SysLog is below. Oddly, the Phidgets attached remotely at 15:59:36, far later than when the event had resolved. Are there any wireless settings on the SBC, access point or main router that could be responsible? The SBC links wirelessly through a WRT54G, which in turn is wired to the main router/internet router in the house, and via this the wired PC makes connection requests for the Phidgets.

Cheers

Jul 28 15:42:15 PhidgetSBC2 kernel: No probe response from AP 00:14:bf:7c:46:77 after 500ms, disconnecting.
Jul 28 15:42:15 PhidgetSBC2 kernel: cfg80211: Calling CRDA to update world regulatory domain
Jul 28 15:42:15 PhidgetSBC2 wpa_action: WPA_IFACE=wlan0 WPA_ACTION=DISCONNECTED
Jul 28 15:42:15 PhidgetSBC2 wpa_action: WPA_ID=0 WPA_ID_STR= WPA_CTRL_DIR=/var/run/wpa_supplicant
Jul 28 15:42:15 PhidgetSBC2 wpa_action: ifdown wlan0
Jul 28 15:42:16 PhidgetSBC2 avahi-daemon[911]: Interface wlan0.IPv4 no longer relevant for mDNS.
Jul 28 15:42:16 PhidgetSBC2 avahi-daemon[911]: Leaving mDNS multicast group on interface wlan0.IPv4 with address 192.168.2.122.
Jul 28 15:42:16 PhidgetSBC2 avahi-daemon[911]: Withdrawing address record for 192.168.2.122 on wlan0.
Jul 28 15:42:16 PhidgetSBC2 wpa_action: removing sendsigs omission pidfile: /lib/init/rw/sendsigs.omit.d/wpasupplicant.wpa_supplicant.wlan0.pid
Jul 28 15:42:52 PhidgetSBC2 kernel: wlan0: authenticate with 00:14:bf:7c:46:77 (try 1)
Jul 28 15:42:52 PhidgetSBC2 kernel: wlan0: authenticated
Jul 28 15:42:52 PhidgetSBC2 kernel: wlan0: associate with 00:14:bf:7c:46:77 (try 1)
Jul 28 15:42:52 PhidgetSBC2 kernel: wlan0: RX AssocResp from 00:14:bf:7c:46:77 (capab=0x31 status=0 aid=1)
Jul 28 15:42:52 PhidgetSBC2 kernel: wlan0: associated
Jul 28 15:42:53 PhidgetSBC2 wpa_action: WPA_IFACE=wlan0 WPA_ACTION=CONNECTED
Jul 28 15:42:53 PhidgetSBC2 wpa_action: WPA_ID=0 WPA_ID_STR= WPA_CTRL_DIR=/var/run/wpa_supplicant
Jul 28 15:42:53 PhidgetSBC2 wpa_action: ifup wlan0=default
Jul 28 15:42:53 PhidgetSBC2 avahi-daemon[911]: Joining mDNS multicast group on interface wlan0.IPv4 with address 192.168.2.122.
Jul 28 15:42:53 PhidgetSBC2 avahi-daemon[911]: New relevant interface wlan0.IPv4 for mDNS.
Jul 28 15:42:53 PhidgetSBC2 avahi-daemon[911]: Registering new address record for 192.168.2.122 on wlan0.IPv4.
Jul 28 15:43:01 PhidgetSBC2 wpa_action: creating sendsigs omission pidfile: /lib/init/rw/sendsigs.omit.d/wpasupplicant.wpa_supplicant.wlan0.pid
Jul 28 15:43:02 PhidgetSBC2 wpa_action: bssid=00:14:bf:7c:46:77
Jul 28 15:43:02 PhidgetSBC2 wpa_action: ssid=DomeObs
Jul 28 15:43:02 PhidgetSBC2 wpa_action: id=0
Jul 28 15:43:02 PhidgetSBC2 wpa_action: pairwise_cipher=TKIP
Jul 28 15:43:02 PhidgetSBC2 wpa_action: group_cipher=TKIP
Jul 28 15:43:02 PhidgetSBC2 wpa_action: key_mgmt=WPA-PSK
Jul 28 15:43:02 PhidgetSBC2 wpa_action: wpa_state=COMPLETED
Jul 28 15:43:02 PhidgetSBC2 wpa_action: ip_address=192.168.2.122

glenn
Phidgeteer!
Posts: 93
Joined: Sun Sep 05, 2010 4:42 pm

Re: SBC webservice stops

Postby glenn » Sun Jul 28, 2013 4:22 pm

Andre,

Just fyi, perhaps of interest, perhaps not: This WiFi hiccup in your log may simply be a normal periodic WPA re-key event. Your AP/Router probably sets the interval for these. If you're interested in distinguishing these normal re-keys from random or interference-induced events, just keep an eye on the log for a few hours, and you should see the periodicity of the re-keys. Most APs I've had over the years use a default interval of 15 or 30 minutes, something like that. Might want to check yours and see what it is, and see if these hiccups have the same periodicity.
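
A small sketch of that periodicity check, assuming the hiccup times have already been collected from the log as datetimes. The function name, the 10% tolerance and the sample times are hypothetical; they only illustrate the idea of comparing the spread of the gaps with their mean.

```python
from datetime import datetime, timedelta
from statistics import mean, pstdev

def looks_periodic(event_times, tolerance=0.1):
    """Return (is_periodic, mean_interval_s) for a list of event datetimes."""
    times = sorted(event_times)
    gaps = [(b - a).total_seconds() for a, b in zip(times, times[1:])]
    if len(gaps) < 2:
        return False, None  # too few events to judge
    avg = mean(gaps)
    # periodic if the spread of the gaps is small relative to their mean
    return pstdev(gaps) <= tolerance * avg, avg

# Hypothetical event times, for illustration only.
t0 = datetime(2013, 7, 28, 12, 0, 0)
regular = [t0,
           t0 + timedelta(minutes=30, seconds=2),
           t0 + timedelta(minutes=60, seconds=1),
           t0 + timedelta(minutes=89, seconds=59)]
random_ish = [t0,
              t0 + timedelta(minutes=7),
              t0 + timedelta(minutes=52),
              t0 + timedelta(minutes=61)]

periodic = looks_periodic(regular)
irregular = looks_periodic(random_ish)
print("regular:", periodic[0], "| random-ish:", irregular[0])
```

A mean interval close to the AP's configured re-key interval would point toward ordinary re-keys; widely scattered gaps would point toward interference or something else.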

AndreGermain
Phidgetly
Posts: 41
Joined: Fri Sep 30, 2011 7:09 am

Re: SBC webservice stops

Postby AndreGermain » Sun Jul 28, 2013 4:26 pm

Glenn,

I had not seen your latest message before writing my last one. You are onto something about WS failing to recover when there is both a USB and a WiFi hiccup: since I added the powered hub cable, the IR 1045_1 no longer drops like it used to (many times in a day) and I've not had to reboot the SBC2 to get the PhidgetWebService back. Without the USB issues, Patrick's supervise fix can finally work as expected.

My SBC is on a 60 ampere 12 V deep cycle battery with proper gauge wire. I doubt there is a voltage drop, but I could put monitoring on it (in fact, I already have through the SBC's 8/8/8, but that is not so useful without code on the SBC). If there is a drop, it would have to be on the SBC PCB traces... and perhaps additionally in the USB cables (e.g., my previous thin USB cabling).

After nearly a decade working with Phidgets, it's clear that the USB needs to be rock solid - shielded cables, proper length, proper gauge, ESD/EMI protected, etc. Once I had this all done right, wired PhidgetService is A-1, but throw in wireless and it is not so reliable. Patrick mentions his systems don't have issues, but perhaps his environments both at work and at home are ideal, which is not always possible.

Cheers

AndreGermain
Phidgetly
Posts: 41
Joined: Fri Sep 30, 2011 7:09 am

Re: SBC webservice stops

Postby AndreGermain » Sun Jul 28, 2013 4:34 pm

Glenn,

indeed I had thought of that and other potential WiFi settings, but not being a guru at networks, I was reading up the WRT54G help (from the device itself) and Googling... Inexplicably, the WRT's web pages go all funny and no longer respond to input, so I need to reboot it. It is after all 10+ years old and living outside year round. I'll look into it some more, but I haven't noticed a periodicity - in fact, the system was fine until I added the IR and spatial.

Cheers

Patrick
Lead Developer
Posts: 3177
Joined: Mon Jun 20, 2005 8:46 am
Location: Canada

Re: SBC webservice stops

Postby Patrick » Mon Jul 29, 2013 9:39 am

Glenn,

You mention USB errors - are you seeing these in the system/kernel logs? Do you see errors on the whole bus at hub level? The WiFi dongle is of course also on USB, so the USB errors could be causing the WiFi dongle to reset, or vice-versa, the Hub could be resetting, causing the dongle to reset. Can you post logs of the event?

Also, Phidget webservice logs would be helpful. The webservice can be made to output logs with the -v flag. Maybe you could output webservice logs to a file, and post a lockup event?

The webservice definitely has a threading-deadlock condition that it can get into - where the webservice is still running, but not accepting new connections. Already-established connections may still be active. If a WiFi drop happens here, then the connection cannot be reestablished, but supervise will not restart the webservice because it's still 'running'. This is the biggest problem right now, but I have a lot of trouble creating this condition to test. We see this very occasionally on an SBC we keep running 24 hours a day in the office, generally when there are many clients connected - I don't know if it's related to USB problems. The SBC is on wired Ethernet, so it's not WiFi-specific.
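
A sketch of an external liveness probe for a "still running but not answering" service, under stated assumptions. The function name and the `b"ping\n"` request are placeholders, not the webservice protocol (which is not shown in this thread); a real probe would need a request the service is known to answer. A bare TCP connect is not enough on its own, because the kernel can complete the handshake for a listening socket even when the process never calls accept(), so the sketch waits for a reply.

```python
import socket

def responds(host, port, request, timeout=3.0):
    """Return True only if the peer sends at least one byte back after `request`."""
    try:
        with socket.create_connection((host, port), timeout=timeout) as conn:
            conn.sendall(request)
            return len(conn.recv(1)) > 0
    except OSError:  # covers refused connections and timeouts
        return False

# Demo: a listener that never calls accept() stands in for a wedged service.
stuck = socket.socket()
stuck.bind(("127.0.0.1", 0))   # port 0: let the OS pick a free port
stuck.listen(1)
port = stuck.getsockname()[1]
alive = responds("127.0.0.1", port, b"ping\n", timeout=0.5)
print("service answered:", alive)  # the connect succeeds, but no reply arrives
stuck.close()
```

A watchdog could run a check like this periodically and kill the process when it fails, at which point supervise would restart it as usual.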

-Patrick

