SBC webservice stops

General PhidgetSBC Discussion.
AndreGermain
Phidgetly
Posts: 41
Joined: Fri Sep 30, 2011 7:09 am
Contact:

Re: SBC webservice stops

Postby AndreGermain » Mon Aug 12, 2013 8:29 am

Gents,

I've been following this thread quietly on the side - most interesting. I'm quite resourceful in programming and hardware, but one of my shortcomings is Ethernet [but learning with this thread!].

Since I solidified the USB portion of my SBC2, there have been no webservice issues. Mind you, losing the connection for up to a few minutes due to key renewal or the like isn't conducive to semi-real-time control, so WiFi certainly has to be made more reliable. It also argues for running code on the SBC itself, so I'll have to get around to porting a portion of the PC code there, as recommended by Patrick.

Keep up the great work - and if I can help in any way, would be a pleasure to assist.

Cheers

glenn
Phidgeteer!
Posts: 93
Joined: Sun Sep 05, 2010 4:42 pm
Contact:

Re: SBC webservice stops

Postby glenn » Mon Aug 12, 2013 1:03 pm

Hey Andre,

Welcome back from vacation. We know you were thinking about Phidgets and webservice the whole time. :)

Yeeaaahhh, so... anyway.. In the next day or so, I'll post a little collection of diagnostic tools genned up for poking around this issue. It includes a minimal-example client specifically set up for investigating WSJS, plus a bunch of little scripts and stuff for exercising it and observing/logging the results. On my setup, using that client, I've been able to get WSJSing (and some other interesting but less serious failure modes) to occur on a pretty regular basis by simply cycling the laptop WiFi interface up and down to simulate link outages. And from what I can see from the associated logs, I'm starting to get a decent feel for some likely problem sources. Hypothesis #2 is still front runner there.

It does look as though one issue -- perhaps not causative to WSJS, but interesting nonetheless -- is that the heartbeat logic seems prone to becoming confused, which leads to other client-server interlock issues which I won't blab about in detail here. (The posted package will have some detailed blabbing included therein.)

Also included will be the instrumented WS executable and libphidgets too. So certainly if you (or anyone else reading this thread) wants to get involved in seeing if you can recreate the problem on your setup, that would be great and would certainly add to the database of info needed to engineer an eventual fix. The toolkit should give you all you need to whack the thing upside the head pretty good.

But since Patrick is the primary audience I want to be sure the tools work for him out of the box, so he can avoid spending too much time setting it up. When he has time to get back to me with the question I posted yesterday about the ss(1) tool on his Debian laptop, I'll be able to make any necessary adjustments to the socket monitor tool (which uses ss) and then get the package posted here. So I'm waiting on that right now.

It would be most excellent if Patrick were able to cause and observe WSJSing in his lab. If so, that will demonstrate that the failure modes I'm seeing are not unique to my setup but more general, and probably closely related to WiFi outage probability and duration. If it does seem like a general problem, that would motivate getting to a solution that should be applicable to everyone. If my (still vague and very immature) guesses about the sources of the problem are correct, I suspect there's a good chance of a not too complicated fix either in WS itself or in the client side API libphidgets "persistent open" logic. Just guessing at this point, but that's the way it looks now.

Just my 2c.

Regards,
Glenn

Patrick
Lead Developer
Posts: 3038
Joined: Mon Jun 20, 2005 8:46 am
Location: Canada
Contact:

Re: SBC webservice stops

Postby Patrick » Mon Aug 12, 2013 1:57 pm

Code: Select all

patrick@debian-mac:~$ ss -t -n
State      Recv-Q Send-Q        Local Address:Port          Peer Address:Port
ESTAB      0      0                 10.0.2.15:44890        69.58.183.142:80   
ESTAB      0      0                 10.0.2.15:54068          31.13.76.49:80   
ESTAB      0      0                 10.0.2.15:42981        69.192.95.139:80   
ESTAB      0      0                 10.0.2.15:40271      206.251.255.126:80   
ESTAB      0      0                 10.0.2.15:57785         173.194.33.4:80   
ESTAB      0      0                 10.0.2.15:40270      206.251.255.126:80   
ESTAB      0      0                 10.0.2.15:43885          199.7.59.72:80   
ESTAB      0      0                 10.0.2.15:40272      206.251.255.126:80

glenn
Phidgeteer!
Posts: 93
Joined: Sun Sep 05, 2010 4:42 pm
Contact:

Re: SBC webservice stops

Postby glenn » Tue Aug 13, 2013 5:24 am

Thanks for [ code ]-ifying that, that's what I needed. :) Looks the same as Arch, modulo some unimportant whitespace, so should be fine as-is.
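
For readers wondering why the exact `ss -t -n` layout matters to a socket monitor: a minimal sketch of one way such output could be parsed, assuming the column layout shown in Patrick's post. The names `TcpSocket` and `parse_ss_output` are hypothetical and are not taken from the wsdiag package. Splitting on runs of whitespace is what makes the Debian/Arch spacing differences irrelevant.

```python
from typing import List, NamedTuple


class TcpSocket(NamedTuple):
    state: str
    recv_q: int
    send_q: int
    local: str
    peer: str


def parse_ss_output(text: str) -> List[TcpSocket]:
    """Parse the table printed by `ss -t -n` into one record per socket."""
    sockets = []
    for line in text.splitlines():
        fields = line.split()  # any amount of whitespace separates columns
        # Skip blank lines and the header row, which starts with "State".
        if len(fields) < 5 or fields[0] == "State":
            continue
        state, recv_q, send_q, local, peer = fields[:5]
        sockets.append(TcpSocket(state, int(recv_q), int(send_q), local, peer))
    return sockets


# Two rows copied from the output posted above.
sample = """\
State      Recv-Q Send-Q        Local Address:Port          Peer Address:Port
ESTAB      0      0                 10.0.2.15:44890        69.58.183.142:80
ESTAB      0      0                 10.0.2.15:54068          31.13.76.49:80
"""

for sock in parse_ss_output(sample):
    print(sock.state, sock.local, "->", sock.peer)
```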

I'll put the diagnostic kit up later today or tomorrow. Hopefully it will work for you out of the box.

glenn
Phidgeteer!
Posts: 93
Joined: Sun Sep 05, 2010 4:42 pm
Contact:

Re: SBC webservice stops

Postby glenn » Fri Aug 16, 2013 2:22 pm

Oops, looks like my "later today or tomorrow" is turning into "this weekend". :) Should definitely have time to get to it by then.

Question regarding the persistent-open heartbeat logic: Is it home-grown at the app level, or based on goosing the timeout parms for SO_KEEPALIVE? From what you said earlier, I've been assuming it was the former, but the reason for the question is that I see some usage of SO_KEEPALIVE in the code. Otoh, the default parms for KEEPALIVE are usually on the order of hours (or at least "many" minutes), so it seemed unlikely it would be used for a few-second heartbeat (as you had described), since there doesn't appear to be any playing around with the timeout parms. But just wanted to ask to be sure.
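
For context on what "goosing the timeout parms" would look like: on Linux the stock keepalive settings are 7200 seconds of idle time, then up to 9 probes 75 seconds apart, which is why plain SO_KEEPALIVE is no use as a few-second heartbeat. The sketch below shows how an application could shorten those timers per socket. It is only an illustration and makes no claim about what libphidgets or the webservice actually do; the function name and the timer values are assumptions.

```python
import socket


def enable_fast_keepalive(sock, idle_s=5, interval_s=2, probes=3):
    """Turn on TCP keepalive and, where the platform allows, shorten its timers."""
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    # The TCP_KEEP* options are platform-specific (present on Linux);
    # silently skip any that this Python build does not expose.
    for name, value in (("TCP_KEEPIDLE", idle_s),
                        ("TCP_KEEPINTVL", interval_s),
                        ("TCP_KEEPCNT", probes)):
        opt = getattr(socket, name, None)
        if opt is not None:
            sock.setsockopt(socket.IPPROTO_TCP, opt, value)


sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
enable_fast_keepalive(sock)
enabled = sock.getsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE) != 0
print("keepalive enabled:", enabled)
sock.close()
```

With the illustrative values above, a silent peer would be declared dead roughly `idle_s + interval_s * probes` seconds after the last traffic, i.e. on the order of ten seconds rather than hours.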

In general, there are maaaaaany things about WS-client socket operation I don't understand when looking at the socket behavior in the face of link outages. I'll have lots more questions for you next week.

Good weekend to you,
Glenn

glenn
Phidgeteer!
Posts: 93
Joined: Sun Sep 05, 2010 4:42 pm
Contact:

Re: SBC webservice stops

Postby glenn » Sun Aug 18, 2013 2:40 pm

Patrick, Andre, et al:

Here's a tarball of the diagnostic utilities mentioned earlier:

http://misc.postpro.net/phidgets/wsdiag-1.1.tgz

(It was too big to post here as an attachment).

Just unpack it and point your browser at the README file, wsdiag-1.1/doc/README.html. That should give you all you need to get going with it.

LATER NOTE 20130820.2205UT: Erratum in the Build/Install section of the README file: The instruction for editing wsdiag.c should be in the "laptop" section, not in the "SBC" section. Apologies for any confusion.

I think it should take you only a few minutes to have it all up and running, it's pretty straightforward. Should be interesting to see what you get with it. Hopefully you'll be able to reproduce some of the events (including WSJS) that I've been seeing.

Let me know if you have any questions on it.

Regards,
Glenn

P.S. Btw, disregard my earlier question about the heartbeat logic; I have since discovered it in csocketopen.c.
Last edited by glenn on Tue Aug 20, 2013 4:07 pm, edited 1 time in total.

glenn
Phidgeteer!
Posts: 93
Joined: Sun Sep 05, 2010 4:42 pm
Contact:

Re: SBC webservice stops

Postby glenn » Thu Aug 22, 2013 9:03 am

Squeak-squeak... :)

Hey Patrick, I'm sure you're busy with other stuff, but just wondering if you might be able to give this wsdiag thing a quick try before the weekend and let me know whether you're able to reproduce any WSJSing. If so, and if I know by Friday, I'll be able to do some work on potential fixes/workarounds over the weekend. I'm getting an improved idea of what all is going on and perhaps might be able to try out some things. (Otoh, if you can't reproduce it, then I need to investigate further... it obviously makes it much more likely the problem is setup-specific somehow.)

Honestly, I think it should only take about 5-10 minutes to get the stuff built and installed. The instructions are straightforward, it's all described in the README. (But note the erratum mentioned in the previous message.) You can do it in "toaster mode". :) Once you have it installed, you can run failure-mode experiments completely unattended between the laptop and an SBC, and it will log everything nicely, and no need for any further attention, just check every few hours if you see a failure.

Anyway, just a plug for moving it along. :) If you're too busy for it now, I totally understand, no problem. We'll get to it eventually.

Thx!
Glenn

Patrick
Lead Developer
Posts: 3038
Joined: Mon Jun 20, 2005 8:46 am
Location: Canada
Contact:

Re: SBC webservice stops

Postby Patrick » Thu Aug 22, 2013 11:51 am

Hi,

Yes, I've been looking into it over the last few days. Still gathering data.

One question: does your WSJS event always result in the cascade of usb errors and the webservice exiting? Does the webservice crash with a fault code? Does it even stay running after the usb errors, but in a bad un-connectable state in the case of a WSJS?

Are you able to get WSJS errors if you use ethernet on the SBC - or is it necessary to use WiFi? What about using ethernet vs. WiFi on the client-side as the interface that is brought up/down?

I have not yet been able to re-create the USB error cascade; however, I can create the 'Threads Gone Wild' easily, so I should be able to fix that.

Unfortunately, GDB on the SBC1 is broken for thread debugging, or I would get you to send me the backtrace of each thread when the webservice crashes.

-Patrick

glenn
Phidgeteer!
Posts: 93
Joined: Sun Sep 05, 2010 4:42 pm
Contact:

Re: SBC webservice stops

Postby glenn » Thu Aug 22, 2013 12:49 pm

Hi Patrick,

Sounds like progress being made!

Responses to your questions:

>
> Does your WSJS event always result in the cascade of usb errors and the
> webservice exiting?
>

It's just a matter of definition: For the sake of anal categorization, I've defined the "WSJS" fault mode as always being accompanied by a cascade of USB errors, and resulting in either 0 or 1 threads left standing. But there are many variations on this basic theme that I don't classify as WSJS (just in order to keep them straight and categorize them nicely). For example, some similar fault modes which have been seen:

  • A cascade of USB errors, but it doesn't result in 0 or 1 process remaining, but "a few". But WS is still effectively dead (i.e. client cannot connect to it).
  • Cascade of USB errors but WS eventually (several seconds later) winds up cleaning itself up and client can again connect.
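
The definition above can be restated as a small decision rule. This is only a sketch of the taxonomy as described in this post: "WSJS" is the thread's own term, while the function name and the other labels ("cascade-wedged", "cascade-recovered", "other") are hypothetical names invented here for the two bulleted variants and for everything else.

```python
def classify_fault(usb_error_cascade: bool, threads_left: int,
                   client_can_connect: bool) -> str:
    """Map observed symptoms to a fault label, following the definitions above."""
    if not usb_error_cascade:
        # TGW, SBO and the other modes are not covered by this sketch.
        return "other"
    if threads_left <= 1:
        # USB error cascade with 0 or 1 threads left standing.
        return "WSJS"
    if client_can_connect:
        # Cascade, but WS cleaned itself up and accepts clients again.
        return "cascade-recovered"
    # Cascade, "a few" threads remain, and WS is effectively dead.
    return "cascade-wedged"


print(classify_fault(True, 1, False))
print(classify_fault(True, 4, False))
print(classify_fault(True, 4, True))
```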

>
> Does the webservice crash with a fault code?
>

If by "fault code" you mean the exit code returned to the shell environment, then the answer is that it has always been zero on the occasions I've checked it (which I don't do very often).

>
> Does it even stay running after the usb errors, but in a bad un-connectable state
> in the case of a WSJS?
>

See first response above. There are many variations on the basic theme of "WS gets wedged". I have classified only the main ones so far: WSJS, TGW, SBO, and a few others which I didn't put into that "events.html" description.

>
> Are you able to get WSJS errors if you use ethernet on the SBC - or is it necessary
> to use WiFi? What about using ethernet vs. WiFi on the client-side as the interface
> that is brought up/down?
>

I tried copper only once, briefly. (It's somewhat painful for me to do, because I have to take the SBC out of service and remove it from its normal mounting location and put it on the bench, etc etc.). So I was highly motivated not to play with it for very long. :) Nevertheless, during that brief trial, I did not see any WSJS's when simulating link outages by plugging/unplugging the RJ-45. So it's inconclusive, really.

But... there is a very important difference between the way that link up/down/up events appear on copper and on wireless: With copper, if you bring the interface down (either via SW "ifconfig" or by unplugging the RJ-45), both sides -- and importantly, the server side -- know immediately that the link is down because carrier ceases, and the driver does "something" to cleanly handle this as soon as it occurs. I'm not sure exactly what "something" is, but whatever it is must make its way to the TCP logic in a graceful way, since it's a more or less standard type of event. Otoh, with a WiFi up/down/up initiated via ifconfig at the client side (as done in the diagnostic suite with "intf_cycle"), the server side has no clue that the link just came down. It just stops getting data, and then starts again once the link is restored. Another important difference is that link bobbling (on the down->up transition) is extremely unlikely over copper, yet is probably fairly common with WiFi. So, given the above contextual differences, my thinking was that the way the server-side TCP handles copper vs. WiFi drops is going to be different enough that I would not want to draw any conclusions based on comparisons. So it didn't seem worthwhile to continue probing with copper.

My working hypothesis here, which seems to be supported by the evidence so far, is that the main source of trouble is that the heartbeat logic, upon detection of link loss and subsequent restoral, immediately fires off a close() to the server, and then shortly thereafter attempts to connect() again. Thus, upon restoral of link connectivity, the client's TCP winds up barfing two sets of conflicting stuff back at the server: The first set of stuff comprises all the un-ACKed segments in its transmit window from the original connection, followed by the FIN for the close, BUT also -- and asynchronously with that leftover Tx window stuff -- the SYN (possibly more than one!) for the new connection. Because the window-clearout and the SYNs are asynchronous, the time ordering at the server side of receipt of these events is totally arbitrary. And in some cases, the threads that are being started/stopped by these two different sets of stuff can get into conflict when exercising the USB, and cause "some sort of" problem that in turn sometimes leads to the USB error cascade. (And sometimes to other fault modes, as you've seen with TGW.)
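
To make the ordering argument concrete, here is a toy model (not a packet-level simulation) of the hypothesis above. It assumes only that the old connection's events stay in order relative to each other, while the SYN for the new connection can land anywhere among them. The event labels and the `interleavings` helper are hypothetical.

```python
from itertools import combinations


def interleavings(old_events, new_events):
    """Yield every merge of two event lists that keeps each list's own order."""
    total = len(old_events) + len(new_events)
    # Choose which positions in the merged sequence the new events occupy.
    for slots in combinations(range(total), len(new_events)):
        merged = []
        old_iter, new_iter = iter(old_events), iter(new_events)
        for i in range(total):
            merged.append(next(new_iter) if i in slots else next(old_iter))
        yield merged


old_conn = ["old:DATA", "old:FIN"]  # leftover transmit window, then the close
new_conn = ["new:SYN"]              # the reconnect attempt

for order in interleavings(old_conn, new_conn):
    # "overlap" means the server sees the new connection start
    # before the old one has been closed.
    overlap = order.index("new:SYN") < order.index("old:FIN")
    print(" -> ".join(order), "(overlap)" if overlap else "(clean handoff)")
```

Even with this smallest possible case, only one of the possible arrival orders has the old connection fully closed before the new one begins; in the others, server-side setup and teardown would be in flight at the same time.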

I'm trying to work up a nice relatively repeatable smoking gun example. And have some ideas as to how to go about working around it, if it turns out to be the case.

Let me know if you have any other questions, I'll respond right away. Really glad that you're working on it.

Regards,
Glenn

AndreGermain
Phidgetly
Posts: 41
Joined: Fri Sep 30, 2011 7:09 am
Contact:

Re: SBC webservice stops

Postby AndreGermain » Thu Aug 22, 2013 7:25 pm

Glenn,

I downloaded your tarball, but am too busy integrating software on a simulator at work, and yet another vacation in a week! Give you news when I get around to it.

Cheers

