SBC webservice stops

General PhidgetSBC Discussion.
glenn
Phidgeteer!
Posts: 93
Joined: Sun Sep 05, 2010 4:42 pm
Contact:

Re: SBC webservice stops

Postby glenn » Mon Jul 29, 2013 11:59 am

Hi Patrick,

Quickie/partial response just for now, will post more when I've more time.

Re: WS logs. Yes, I've got lots of them. The WS logging mechanism -- by now, heavily modified for format and content -- has been my primary source of insight into the confluence of the WiFi link restart and the USB errors. I've added numerous of pu_log()s and DPRINT()s to capture additional info as well. But I've not looked at the system logs to correlate them with the USB errors, that's a very good suggestion (and I'm duly embarassed for not having thought of it.) :)

I've got an experiment running right now, so hopefully it will fail soon and I'll capture the system logs at that time.

Anyway, can't write more now, but later or tmw will post a collection of the WS logs, system logs, and other info..

Regards,
Glenn

glenn
Phidgeteer!
Posts: 93
Joined: Sun Sep 05, 2010 4:42 pm
Contact:

Re: SBC webservice stops

Postby glenn » Mon Jul 29, 2013 6:44 pm

Patrick/Andre,

OK.... so here's a big fat post on what I've been doing so far on this WSJS issue, and a big fat tarball too, with logfiles and other goodies. So grab yerself a cup o' java and prepare to be bored.

First some comments going back to the 07-July posts, just to backfill some gaps in the thread.

On TCP/UDP: Yeah, I'm not sure how I got the wrong idea that WS was UDP-based. It was a few years back that I was originally dorfing around with it and by now have totally forgotten what led me to that wrong conclusion. Maybe I was looking at the link traffic and noticed the mdns UDP stuff and thought it was webservice traffic. I dunno.

In any case, sorry for the confusion on that, but glad that it is in fact TCP. This makes it less likely that the problem is an app-level comm deadlock as I was conjecturing under the assumption of UDP.

About the outsourced webservice code: Yeaaaahhhhhhhh... hooo baybay. It's a shame you guys got sold a bill o' goods on that. I've had my head in it now for a week or so and it is truly gawd-awful. Essentially uncommented, undocumented, syntactically ugly and disorganized, poorly structured... violates just about every style and best-practices coding recommendation ever made. I used to teach internal "C for DSP" classes at Bell Labs many years ago, and sometimes I would try to find real-world code like this to show as "don't do" examples, make people's eyes water. Honestly, it's irritating even just to look through it.

Anyway, for your sake, I hope it was like a Midas Muffler and you didn't pay a lot for it. :)

A little about how the experiments were performed that led to the attached logs and other detritus:

The "code base" for the experiments is phidgetwebservice-2.1.8.20130618 and libphidget-2.1.8.20130320. I built each on the SBC1 from the tarballs, then made sure they worked (and failed) just like the installed stock versions. Yep.

Then made some minor mods to improve the logging output format, force timestamps under all circumstances, and some other convenience stuff like that. (The stock log output is some silly-ass csv format with no whitespace that gives me a headache after about 10 minutes.)

Then I began instrumenting it by adding pu_log() and DPRINT()s to various points of interest in the code as I slogged along learning (with great difficulty) what the heck the code was doing or trying to do. Honestly, even after more than a week, I still have only a very vague understanding of the thread structure and functionality. It's that hard to follow.

I have made a few benign "actual code" mods too, but just stuff like improving or rewriting the way that some routines handle errors so that more details are available in the log output if such errors occur. (For example, stream_server_accept(): It dutifully checks for about half a dozen error conditions, and then immediately throws what it found on the floor by always returning 0 to the caller. Very useful.)

A few words about execution environment: I noticed that the threads implementation on the SBC1 appears to be based on the older so-called "LinuxThreads" lib (which implements each thread as a distinct process) rather than PosixThreads (which implements threads within a single process). I'm by no means an expert in either one (in fact barely familiar with both) but just to mention a few items: On the plus side, since each SBC1 thread is a process, the threads can be easily observed using the BusyBox ps(1) that's installed on the stock SBC. So that's a major convenience. (This would not have been true for PosixThreads because that BusyBox ps is wuss and doesn't have thread inspection.)

On the minus side -- and this is something I've only read about, not something I have any direct experience with -- it seeems that LinuxTheads have some peculiarities with regard to thread creation and teardown. I thought possibly this might be implicated in the WSJS issue. Do you know anything about this? I don't, but have been reading up on it a little.

OK, the WS logs themselves: In the tarball are ten logfiles, named as 201307dd.hhmm, which is the starting date/time. Each is briefly documented within. (Btw, as you go thru the logs in date order, you'll notice a progression of formats as I tweaked around with the logging printfs, but I think all the mods are pretty self-explanatory.)

These are by no means all of the WSJS events that have occurred seen since I began instrumenting the setup, but most others were qualitatively similar enough to the first two (20130721.0956 and 20130721.2157) that they didn't seem worth keeping and documenting.

There are (at least) two post-mortem modes:

PM0: No WS processes are running after WSJS observed.

PM1: One WS process is still running after the WSJS.

In PM1, I'm not yet sure which thread function is the one left standing, or even whether it is consistent from WSJS to WSJS. It might be a different one each time. Unfortunately I've not kept track of which of the logged WSJS events resulted in which PM. PM1 is definitely more frequent though.

About the system logfile that you asked about: Fortunately, since I've not rebooted the SBC1 since I began the experiments, the entire syslog (/var/log/messages) is integral back to the start of the experiments. So it covers all of the logged WSJS events seen so far. It's in the tarball as systemlog_0721-0729. I looked thru it briefly and there are some USB-related warnings, but didn't see anything nasty, and did not try to correlate those warnings with events from the WS logs. (And just to be clear, I have *zero* under-the-hood knowledge of USB, so I really don't know what to make of these warnings.) Hopefully they will be useful to you and maybe you can deduce something from them.

OK, that's it! Have fun, and certainly shoot any questions back.

Regards,

Glenn
Attachments
goodies.tgz
See post for details
(23.34 KiB) Downloaded 159 times
Last edited by glenn on Mon Jul 29, 2013 7:51 pm, edited 1 time in total.

AndreGermain
Phidgetly
Posts: 41
Joined: Fri Sep 30, 2011 7:09 am
Contact:

Re: SBC webservice stops

Postby AndreGermain » Mon Jul 29, 2013 7:48 pm

Glenn,

if your work doesn't get you a few free Phidgets and a thank you card, I'll buy you some myself! Keep up the good work. I wish I could help, but I know little on Linux, as the company I am employed at switched to Windoz & RTX back in 2000.

Since I've added the power USB cable, I've had ZERO hiccups for 26 hours straight, and counting, as opposed to one easily every hour or less. As I stated earlier, every single issue I've with Phidgets over 10+ years, and there were so many I lost count (and countless hours), has invariably been related to USB, usually power and EMI/ESD. Whether the WiFi issue is a cascade effect of the USB issue, it is nonetheless probably *corrected* by solving the USB issue, at least in my case. Perhaps USB is not entirely robust enough for I/O outside of home/office controlled environments?

Cheers

glenn
Phidgeteer!
Posts: 93
Joined: Sun Sep 05, 2010 4:42 pm
Contact:

Re: SBC webservice stops

Postby glenn » Mon Jul 29, 2013 8:42 pm

Hey Andre,

Thanks for the kudos, and hat tip back at ya for the good info on the powered USB results and other observations. Hopefully we're on the road to a better understanding of the problem and maybe even an interim SW fix. Lots better than waiting 1-2 years for a rewrite (as Patrick had mentioned.)

Couple items of backfill on your earlier posts which I didn't get a chance to reply to:

60A car battery... yowza. OK, I definitely agree then that Hypothesis #1 is pretty much deep-sixed by that. I won't even bother doing my expt with the bench supply. Your car battery can beat up my bench supply. (We used to keep an old car battery in the back of the lab to use for fixing internal trace shorts on research prototype boards. The procedure became known as Delco Surgery. Very effective.)

The powered USB stuff.... I was wondering if you have time, could you post just a little hand-drawn diagram of the "before" and "after" configs? Y'know, just showing the SBC, USB hub, all attached Phidgets, etc. I'm kinda losing track of your setup changes across the postings. We could all refer to it then too, and make sure we understand what changed from your 1 hour MTBF to 26+ hours.

Rats, there was one more thing too, but I can't remember it now. Anyway... will post it later if I recall.

Regards,
Glenn

glenn
Phidgeteer!
Posts: 93
Joined: Sun Sep 05, 2010 4:42 pm
Contact:

Re: SBC webservice stops

Postby glenn » Tue Jul 30, 2013 5:20 am

Patrick,

Attached is one additional piece of WS log barfage that may be useful for "mental calibration" purposes as you're looking over the set of logs from the big tarball I posted last night. It shows the logging output emitted by the instrumented WS/libphidgets for the following three canonical events, in the absence of any error conditions or other confounding horsehockey:

  1. WS startup while my app is not running. This shows what the message sequence looks like in the absence of logging noise from connection attempts or errors of any kind as the threads come up. Ain't nothing else goin' on, just WS starting. At the end of this logging subsequence, three threads/processes are running: The original WS process (i.e. main()), a "mystery" process (i.e., I haven't yet figured out what thread it corresponds to) and the CentralThreadFunction() process.

  2. A few minutes later, the app is started on the laptop (the name of the app is "solwatd", as referred to in the comments). You can then see the log activity due to this normal connection sequence. After this subsequence completes, nine threads are running: The three from above, plus six more. I'm not yet 100% positive that I understand the identities of the six additional threads but I think it's one set of three threads for each of the two attached phidget (IK 8/8/8 on the SBC1, IK 0/0/4 on the 1014): Each per-phidget 3-tuple of threads comprises a WriteThread(), a ReadThread() and a mystery thread (probably a control thread or so, but not sure yet.)

  3. A minute later, I kill the app, and you can then see the logging activity for that. When the thread teardown activity completes, only the three original threads/processes remain running.

Hope this is useful.

Regards,
Glenn
Attachments
log_canonical_example.txt
(3.88 KiB) Downloaded 161 times

User avatar
Patrick
Lead Developer
Posts: 3176
Joined: Mon Jun 20, 2005 8:46 am
Location: Canada
Contact:

Re: SBC webservice stops

Postby Patrick » Tue Jul 30, 2013 9:36 am

Hi,

I'd be interested in a before/after diagram as well. I'm not very surprised that you were having issues with the USB extender. We have generally found these products to be hit-and-miss. The USB spec allows for 5 m cables - going beyond this violates the spec, even with extenders. They do some clever things to make this work, but the spec is based on timing, and it just doesn't always work. Also, Phidgets are very susceptible to low USB voltage. This is why using thicker (higher-guage) cables can help a lot - because the voltage drop is less then a thinner cable - especially for long cables, or Phidgets which use a lots of power. USB devices in general can suffer from EMI. We see this more often with Phidgets mostly because Phidgets tend to be using the EMI-noisier environments then normal USB devices. This can usually be addressed by using shorted cables / cables with ferrites.

The power supply the comes with the SBC should be more then enough to provide power to all 4 or 6 USB ports. The 12 V 2 A supply needs to provide up to 6 x 500mA at 5 V - this is if every USB device is drawing the maximum allowed 500 mA - this would be 5 V 3 A -> 15 W. The SBC itself takes ~100 mA, and the built-in 8/8/8 can takes up to 500 mA, so we're up to 18 W. At 12 V, this is 1.5 A. Even with the power regulator loss, there is lots of power to go around.

As for the webservice threads:
phidgetwebservice/libphidget21.so threads:
-main() thread
--basically just waits for new connections

-pds_session_serve() thread
--1 per client connection
--started in accept_cb after a new connection is established
--this basically just calls read() in a loop and proccesses each command in turn

-report() thread
--1 per client connection
--started after a 'report' command from client
--handles sending periodic data to the client

-CentralThreadFunction()
--this runs in libphidget21.so
--Keeps track of Phidget attach/detach
--associates attached Phidgets with opne requests made by the clients
--starts/stops the per-phidget read/write threads

-ReadThreadFunction()
--in ibphidget21.so
--1 per Phidget
--calls usbRead() in a loop and sends out event callbacks to the client

WriteThreadFunction()
--in libphidet21.so
--1 per Phidget
--sends out data queued up by the _set functions from the client

1 client-server connection is shared by all of the open Phidgets, so the 6 threads you a re seeing are 2 read threads,
2 write threads and the report and pds_session_serve threads

Concerning USB disconnect at Jul 24 19:55:03 - did you pull/plug the WiFi adapter here yourself?

I can see in the logs the webservice gets into a 'stuck' state after seeing some USB errors. If I can repeat this, perhaps I can track it down.

Glenn: You have an SBC1 with an attached 0/0/4 and WiFi dongle right? Anything else?

I assume that your program is running and the connect/disconnect cycles I see in the log are... loss of WiFi perhaps? You are not actually closing/opening the connection on the client side?

-Patrick

glenn
Phidgeteer!
Posts: 93
Joined: Sun Sep 05, 2010 4:42 pm
Contact:

Re: SBC webservice stops

Postby glenn » Tue Jul 30, 2013 10:20 am

Patrick,

On the thread ID's/functions: OK, thanks. I looked it over only briefly now, but will read it more carefully later.

On the USB disconnect at Jul 24 19:55:03: Yes, that was a manual pull/reinsert of the WiFi dongle. What happened was that I had been slogin-ed to the SBC from my laptop and had accidentally disabled the WiFi NIC via an errant ifconfig command (which of course dropped my slogin session). So I just pulled the dongle and reinserted. This was not associated [to my knowledge] with any WSJS event. It's a red herring.

>
> You have an SBC1 with an attached 0/0/4 and WiFi dongle right? Anything else?
>
Nope, nothing else. The little ASCII diagram on the 07-Jul post shows the whole config.

FYI, the cable from the SBC to the dongle is about 6", and from the 1014 to the SBC is about 18". The SBC is inside a steel box, which probably should be a reasonable Faraday cage up to a few GHz; the largest opening in the box is a cutout on the front which is about 10 x 3 cm.

>
> I assume that your program is running and the connect/disconnect cycles I see in the log are...
> loss of WiFi perhaps? You are not actually closing/opening the connection on the client side?
>

It depends exactly where in the logs you're talking about. Here's a more detailed explanation though, at the risk of putting you to sleep. :)

I do periodically start and stop my app, so there are numerous legitimate "logical" disconnects/reconnects in the logs. (By "logical" I mean not associated with a WiFi PHY layer link drop.) However, these "app stop" events are not performed gracefully from a software perspective, i.e. not by calling PhidgetClose() [or whatever the function is, I don't have the API in front of me but you know what I mean] but are always done by just killing the app process. So from the WS's perspective, the disconnect will be sensed as a TCP stream EOF, not via the libphidgets API function. Not sure if that's relevant or not.

But ALL (100%) of the WSJS events are associated with WiFi link drops, and NOT associated with logical app starts or kills. They "just happen" from time to time. Some of the WiFi link drops are probably due to normal WPA rekeys, some are random (i.e. I have no idea why they happened), and some occur when I intentionally torture the thing on purpose, as described earlier. But almost all of the WSJSs in the logs that I sent last night are NOT of the latter variety, i.e. they were either random or WPA rekeys.

Is that comprehensible? If not let me know what is unclear and I'll provide more detail or just send you the code. It's important for you to understand this exactly so you don't misinterpret the logs.

User avatar
Patrick
Lead Developer
Posts: 3176
Joined: Mon Jun 20, 2005 8:46 am
Location: Canada
Contact:

Re: SBC webservice stops

Postby Patrick » Tue Jul 30, 2013 11:29 am

In the logs, what exactly are you associating with a WiFi link drop? How do you know that these are WiFi link drops?

-Patrick

glenn
Phidgeteer!
Posts: 93
Joined: Sun Sep 05, 2010 4:42 pm
Contact:

Re: SBC webservice stops

Postby glenn » Tue Jul 30, 2013 12:27 pm

>
> In the logs, what exactly are you associating with a WiFi link drop?
>

The "marker" is an accept() completion from the main thread ("Accepted new connection") after a silent period of minutes or hours.

For example, in log_20130721.2157, at line 90, at 0525.56. This accept() event occured "out of the blue" 15 minutes or so after the prior event at 0511 (the latter which did not result in a WSJS.)

I should point out for completeness, that all of those logfiles are integral from start to finish. There are no gaps or any missing lines. (The comments that I inserted inline there, separated by whitespace, that's just for human readability. Nothing was edited out or deleted.)



>
> How do you know that these are WiFi link drops?
>

Fair question. Hmm. Honestly I can't say with 100% certainty that these are PHY layer drops (i.e. a drop from which the the device has to go thru a true PHY startup to recover). They could be MAC layer events that happen to be long enough to time out the TCP stream. But if the accept() thread was invoked, it implies that the TCP stream must have EOFed. It can't be my app doing a SW disconnect/reconnect because there's no path in the app that ever returns to the initialization clause once it's inside the polling loop (see 07-Jul pseudocode).

One time -- exactly once, so I can't vouch for this, may have been hallucinating again -- I ran a quickie test with a parallel ping session from the laptop to the SBC, pinging 2x/second, and logging the output. Ran for several hours and finally a WSJS occurred. The ping log showed an outage of a few seconds just before my app noticed the connection loss. But, to be complete about it, I should also say that the ping log also showed several 1-2 second outages prior to the WSJS. Could have been coincidence.

I can do this again much more carefully to be sure, if you are suspecting that there's really no link loss happening and maybe it's something else. In fact, I have an expt running now, I'll just start up a ping session alongside and see what shows up next time it WSJSes. Will let you know.

Also, as I mentioned, WSJSs occur more often when I'm torturing the WiFi. But there too, it's not an absolute 100% correlation. I've sometimes tortured the thing for 1-2 hours without inducing a WSJS.

What leads to your suspicion that maybe these re-accept()s are not link loss events?

glenn
Phidgeteer!
Posts: 93
Joined: Sun Sep 05, 2010 4:42 pm
Contact:

Re: SBC webservice stops

Postby glenn » Tue Jul 30, 2013 1:17 pm

Patrick, btw, if you think it would be worthwhile, you're welcome to gimme a call on landline, or skype, whatever. Just let me know, give ya my contact info. Happy to interact that way if you think it might be faster.


Return to “General”

Who is online

Users browsing this forum: temushin and 3 guests