SBC webservice stops

General PhidgetSBC Discussion.
glenn
Phidgeteer!
Posts: 93
Joined: Sun Sep 05, 2010 4:42 pm
Contact:

Re: SBC webservice stops

Postby glenn » Thu Aug 08, 2013 9:41 am

Thanks, Patrick, got it.

Really appreciate your patience on this. Lots to learn and precious little doc at the necessary level of detail.

Working up a nice minimal example now. Will probably have it to you by Monday.

Re: SBC webservice stops

Postby glenn » Thu Aug 08, 2013 11:15 am

Next question :) You knew this was coming... BWAAAhahahaha...

Just kidding. Really appreciate your time and expertise, Patrick. Starting to feel a bit guilty about peppering you with so many questions... (but I will anyway... :) )

Soo... I'm seeing some evidence that suggests there may be a LinkStatus variable per-phidget, rather than one "global" LinkStatus variable that applies to the entire connection between the client and a given WS host.

The evidence is as follows:

The minimal example client has a status polling loop that looks like this (pseudocode):

Code: Select all

       
int junk;   /* value is ignored; only the return codes matter */
while (1)
{
    int r1 = CPhidgetInterfaceKit_getSensorValue(ifkit1, 0, &junk);
    int r2 = CPhidgetInterfaceKit_getOutputState(ifkit2, 0, &junk);
    sleep(2);
    /* examine r1 and r2 and report status */
}


where ifkit1 is the PH1070 IK8/8/8, and ifkit2 is the PH1014 IK0/0/4. The value of the sensor or output state is ignored; the calls are being used only to assess the client's "persistent open" idea of whether the link is currently experiencing an outage or not, via the return values.

So, while the client-WS link is being intentionally cycled up and down, the code sits in this loop. Most of the time, r1 and r2 wind up having the same value, but something like 2% - 5% of the time they do not, with one indicating 0 and the other indicating NOTATTACHED.
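A minimal sketch of the "examine r1 and r2 and report status" step, assuming only the two return codes seen in the log in this thread (0 and 5). The names `rc_name` and `format_status` are hypothetical helpers, not part of the Phidget library; the `IK1070`/`IK1014` labels follow the log format.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Return codes as they appear in the log in this thread. */
enum { RC_OK = 0, RC_NOTATTACHED = 5 };

/* Map a return code to the label used in the log (hypothetical helper). */
const char *rc_name(int rc)
{
    switch (rc) {
    case RC_OK:          return "normal";
    case RC_NOTATTACHED: return "NOTATTACHED";
    default:             return "other";
    }
}

/* Format one status line into buf.
 * Returns 1 if the two codes disagree (the interesting case), else 0. */
int format_status(int r1, int r2, char *buf, size_t len)
{
    assert(buf != NULL && len > 0);
    if (r1 == r2) {
        snprintf(buf, len, "%d (%s)", r1, rc_name(r1));
        return 0;
    }
    snprintf(buf, len, "IK1070 = %d, IK1014 = %d", r1, r2);
    return 1;
}
```

Summing the return value over all polls gives the mismatch count, which is where a figure like "2% - 5%" would come from.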

Considering that the two calls are separated in time by probably a few microseconds, it seemed difficult to explain this as simple timing misfortune under the assumption of a single global LinkStatus variable. (By "misfortune" I mean a race condition in which the global LinkStatus was 0 when the first call was made but then became set to 1 just before the second call.) With the calls separated by microseconds, it would be very unlikely for this type of misfortune to occur 2% of the time.
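To put a rough number on that intuition: suppose a single global flag flips at a moment uniformly distributed over one polling period $T$, and the two calls are separated by a gap $\delta$. Here $T = 2$ s comes from the `sleep(2)` in the loop; $\delta \approx 5\,\mu$s and the uniform-timing model are assumptions for illustration, not measurements.

```latex
P(r_1 \neq r_2 \mid \text{one transition})
  \approx \frac{\delta}{T}
  = \frac{5 \times 10^{-6}\,\text{s}}{2\,\text{s}}
  = 2.5 \times 10^{-6}
```

Under those assumptions the expected mismatch rate is roughly four orders of magnitude below the observed 2% - 5%, however that fraction is counted.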

I suspected that a more likely explanation might be that there is a LinkStatus state variable per phidget (and an associated heartbeat process per phidget), so that the transition from the NOTATTACHED condition to attached is determined asynchronously for each phidget. With this assumption, it would not be surprising that the link statuses are sometimes skewed by a few seconds or so (since "a few seconds" is the granularity of the attached-or-not heartbeat determination).

Do you have any insight on this?


EDIT @ 17:30 UTC: Here's an example. The link was brought down at 1127.56 and back up 10 seconds later.

Code: Select all

$ minex2

  openRemoteIP(PH1070): r =   0, et = 0.0001
  waitForAttachment():  r =   0, et = 0.2653
  Type:                 PhidgetInterfaceKit
  SN:                   45919
  Version:              826


  openRemoteIP(PH1014): r =   0, et = 0.0000
  waitForAttachment():  r =   0, et = 0.4537
  Type:                 PhidgetInterfaceKit
  SN:                   109223
  Version:              707

20130808.1127.51:  0 (normal)
20130808.1127.53:  0 (normal)
20130808.1127.55:  0 (normal)
20130808.1127.57:  0 (normal)
20130808.1127.59:  0 (normal)
20130808.1128.01:  5 (NOTATTACHED)           <--- indicates r1 == r2 == 5
20130808.1128.03:  5 (NOTATTACHED)
20130808.1128.05:  5 (NOTATTACHED)
20130808.1128.07:  5 (NOTATTACHED)
20130808.1128.09:  5 (NOTATTACHED)
20130808.1128.11:  IK1070 = 0, IK1014 = 5    <--- indicates r1 == 0, r2 == 5
20130808.1128.13:  0 (normal)                <--- indicates r1 == r2 == 0
20130808.1128.15:  0 (normal)
20130808.1128.17:  0 (normal)
20130808.1128.19:  0 (normal)
20130808.1128.21:  0 (normal)

Patrick
Lead Developer
Posts: 3039
Joined: Mon Jun 20, 2005 8:46 am
Location: Canada
Contact:

Re: SBC webservice stops

Postby Patrick » Thu Aug 08, 2013 1:37 pm

There is just one heartbeat per connection. When it notices a too-long heartbeat delay, it just closes the socket. The Phidgets notice this when they try to read data again, and then process a detach. This works the same as if the server had actually closed the socket. I would expect the attached->detached transition to happen very close together. The client then goes into a mode where it starts trying to communicate with the server again. Once it connects, it will queue up open messages for each phidget - but there could be a delay in the amount of time it takes for each Phidget to attach - as the server actually re-opens these devices and transfers the full state to the client before attached reads as true.

At the same time as this re-attach, there will be the old server-side socket, now noticing that it was closed by the client during an outage. Not sure how long it will take to notice this. But eventually, it gets cleaned up.

-Patrick

Re: SBC webservice stops

Postby glenn » Thu Aug 08, 2013 2:50 pm

OK, thanks Patrick. Let me mull this as usual.

Unrelated question: As mentioned, I'm preparing a set of diagnostic code for you, a minimal example client and some ancillary utilities, so you can (hopefully!) recreate WSJSes at will in your lab there. To increase the likelihood that it will work properly on your setup out of the box (i.e. so you don't have to spend much time fussing with it to make it work), I just want to ask about your setup. The ideal "system requirements" would be a laptop running some reasonably up-to-date Linux distro, and of course having a WiFi interface. Arch would be nice (since it's what I use and am familiar with), but any reasonably recent distro should probably be OK as long as it has a full set of development header files. Also, it would be very nice if it had gnuplot >= 4.3. This will allow you to run a little network performance characterization tool that I'll also include with the minimal example package.

Let me know what distro you're running so I can perhaps guess correctly at how to set up the example code to work on your system.

Thx,
Glenn
Last edited by glenn on Thu Aug 08, 2013 3:57 pm, edited 1 time in total.

Re: SBC webservice stops

Postby Patrick » Thu Aug 08, 2013 3:46 pm

I generally run Debian, but I can install whatever.

-Patrick

Re: SBC webservice stops

Postby glenn » Thu Aug 08, 2013 3:57 pm

gnuplot >= 4.3?

Re: SBC webservice stops

Postby Patrick » Thu Aug 08, 2013 4:13 pm

4.6

Re: SBC webservice stops

Postby glenn » Fri Aug 09, 2013 11:50 am

Patrick,

Here's a tarball with a little network diagnostic tool (tcpping) along with a few support programs, including a companion plotting tool (tcpping_plot).

If you have time, maybe you can try running it from your designated laptop to your target SBC1 for a few hours, or even overnight, see what your outage stats look like.

From what I've seen, the outage duration is critical to how WS fails, in particular whether it WSJSes and dies or just winds up gracefully restarting the report() thread. On my setup, the threshold for obtaining fairly "reliable" WSJSing is around 6-7 seconds. Less, and I typically see graceful recovery via report() thread restart. More than that, and about 10% of the time I see WSJSes. I'll have much more to say on this when I send you the minimal example client and other stuff next week.

The header doc in tcpping has mondo details on what it measures and how to run it. Usage is simple, should be no more than 2-3 minutes to get it going. If you prefer the quick-start approach:

Code: Select all

  1. On SBC:
        (copy yoo.c from the tarball to SBC:/tmp)
        $ cd /tmp
        $ gcc yoo.c -o yoo

  2. On laptop:
        Copy parse_integer_spec.pm into some directory in your PERLLIB,
        so tcpping and tcpping_plot can find it.

  3. On laptop:
        $ tcpping -i=1 sbchostname /tmp/yoo


Very simple, you'll get the idea very quickly.

Once you collect some data, you can plot it with tcpping_plot. Again, very straightforward, instructions in the header doc.

I'll also post some of my results a little later, so we can compare notes.

Regards,
Glenn
Attachments
tcpping.tgz (11.63 KiB)

Re: SBC webservice stops

Postby glenn » Fri Aug 09, 2013 12:16 pm

Patrick et al.,

fyi: Attached are a couple of plots showing the TCP delays that my setup experiences over the WiFi link. The first one (tcpping_huh-ga_20130809.png) is from my client laptop to another machine on my network, just for reference. The second one (tcpping_huh-phid1_20130807.png) is from my client laptop to my SBC1.

There's really not a lot of difference in the distributions between the two. Both are more or less bimodal, with the bulk of the delays only a few tens of ms. But you can see that there is also quite a decent sprinkling of long events, many of which are related to periodic WPA rekeys. (The solid line of delays around 200 ms is interesting. I've not figured out what that is yet, but it probably doesn't matter, so I haven't looked into it.)

Again, will say more on this when I send you a big package of stuff with the minimal example code.
Attachments
tcpping_huh-phid1_20130807.png (5.28 KiB)
tcpping_huh-ga_20130809.png (4.33 KiB)

Re: SBC webservice stops

Postby glenn » Sat Aug 10, 2013 11:01 am

Patrick... does your Debian laptop have the ss(1) network utility? If not, when you have time, could you please install the "iproute" package, and then post the output from this command:

Code: Select all

    $ ss -t -n


This is for the minimal example: I need to know how to parse the output of ss(1), and just want to see if the output format from the Debian version might be different from the version on my Arch distro.
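For reference, a minimal sketch of the kind of parsing in question, assuming each connection row of `ss -t -n` has five whitespace-separated fields in the order State, Recv-Q, Send-Q, local address:port, peer address:port. That layout is an assumption and is exactly what may differ between versions. The struct and function names are hypothetical.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* One parsed connection row (hypothetical layout, see assumptions above). */
struct ss_row {
    char state[16];
    int  recvq, sendq;
    char local[64];
    char peer[64];
};

/* Returns 1 if line parsed as a connection row, 0 otherwise.
 * The header line fails at the first %d and is rejected. */
int parse_ss_line(const char *line, struct ss_row *row)
{
    assert(line != NULL && row != NULL);
    int n = sscanf(line, "%15s %d %d %63s %63s",
                   row->state, &row->recvq, &row->sendq,
                   row->local, row->peer);
    return n == 5;
}

/* Peer port is whatever follows the last ':' in the peer field,
 * so addresses that themselves contain colons still work. */
int ss_peer_port(const struct ss_row *row)
{
    const char *colon = strrchr(row->peer, ':');
    return colon ? atoi(colon + 1) : -1;
}
```

Feeding each line of the command's output through `parse_ss_line` and keeping only rows where it returns 1 skips the header without special-casing it. Any extra trailing columns a given version prints are simply ignored by this format string.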

Thanks,

Glenn

