SBC webservice stops

General PhidgetSBC Discussion.
glenn
Phidgeteer!
Posts: 93
Joined: Sun Sep 05, 2010 4:42 pm
Contact:

Re: SBC webservice stops

Postby glenn » Fri Aug 23, 2013 8:03 am

Andre: Look forward to hearing your results. Will probably post an updated version of the wsdiag pkg this weekend, so check out the forum posts before devoting time to 1.1. Enjoy your vacation!

Patrick: If you're not seeing WSJSes, one thing to check (after enabling client-side libphidget logging at level >= 4) is whether the client side heartbeat listener is timing out, e.g.

Code: Select all

 pdc_listen: cmd: listen "/PCK/Heartbeat/192.168.1.18/46999" lid13
                            Timed out after 10000ms.


I'm not pretending to fully understand what's causing this, but what I've observed is that once it occurs, WSJSing will no longer happen, because the client will no longer be attempting to close() the current socket (since it no longer detects loss of heartbeat). Afaict, this heartbeat-induced close() is a key underlying factor leading to WSJSing (as described in previous post).

(Btw, this is the event that underlies "SBO" (silently bridging outages) fault mode. I'll write up more on this over the weekend and post it in an updated version of the wsdiag package.)

glenn
Phidgeteer!
Posts: 93
Joined: Sun Sep 05, 2010 4:42 pm
Contact:

Re: SBC webservice stops

Postby glenn » Fri Aug 23, 2013 9:06 am

Patrick: For your convenience, here's an updated wsdiag.c which has libphidget logging {enable, filename, log-level} selectable from cmdline.
Attachments
wsdiag.c
(14.87 KiB) Downloaded 186 times

glenn
Phidgeteer!
Posts: 93
Joined: Sun Sep 05, 2010 4:42 pm
Contact:

Re: SBC webservice stops

Postby glenn » Sun Aug 25, 2013 8:56 am

Hi Patrick,

Webservice smoking gun safari proceeds apace. Haven't found anything that definitely qualifies yet, but wanted to ask about a few things.

Been concentrating mainly on the fine-grained timing relationship between the phidget open/close requests from the client which occur when the link is restored. These requests arrive at WS closely bunched due to the way the heartbeat logic works, as mentioned earlier. To look more closely at this, I bumped the WS logging up to usec resolution and added/improved the debug logging in and around the phidget open/close functions.

I first looked for obvious stuff like non-atomic open_phidget() or close_phidget() on the same device, i.e. indicative of a mutex bug. Didn't see anything like that, but fairly often see {close_phidget(), open_phidget()} pairs on the same device with only a few ms spacing between. Some of these are associated with WSJSes, some not. Just wanted to ask if this kind of time spacing worries you at all.

Here's an example, extracted from the full log of a WSJS event. (The full heavily annotated log of that event is attached)

Code: Select all

20130824.1505.21.289690  INFO    close_phidget(SN= 45919): starting
20130824.1505.21.298336  INFO    close_phidget(SN= 45919): actually closing
20130824.1505.21.596110  INFO    close_phidget(SN= 45919): returning OK
                        ^
                        |

                       7 ms

                        |
                        v
20130824.1505.21.603351  INFO    open_phidget (SN= 45919): starting
20130824.1505.21.615708  INFO    open_phidget (SN= 45919): returning OK



20130824.1505.21.623090  INFO    close_phidget(SN=109223): starting
20130824.1505.21.624598  INFO    close_phidget(SN=109223): actually closing
20130824.1505.21.816828  INFO    close_phidget(SN=109223): returning OK
                        ^
                        |

                       12 ms

                        |
                        v
20130824.1505.21.828157  INFO    open_phidget (SN=109223): starting
20130824.1505.21.836378  INFO    open_phidget (SN=109223): returning OK


I did not attempt to investigate what (if anything) goes on at the driver level as a result of these open/close_phidget() calls. If these calls only manipulate local data structures then of course the time spacing is irrelevant. But if they call USB driver functions, there's perhaps potential for some timing race mischief.

As I said, I know jack shit about USB or the libusb API, so this concern may be entirely misplaced. But in many years of general device driver experience, intervals like this tend to make me uneasy. Just because a driver API doc may not explicitly state a minimum interval between calls doesn't necessarily mean that you can boxcar 'em and expect it to work reliably in practice either. In this case, all it would take is some asynchronous event below the API level (or even at the HW level) that has a settling time on the order of 10 ms or so, and there's potential for trouble.

What do you think?
Attachments
full_event.txt
(8.83 KiB) Downloaded 183 times

User avatar
Patrick
Lead Developer
Posts: 3038
Joined: Mon Jun 20, 2005 8:46 am
Location: Canada
Contact:

Re: SBC webservice stops

Postby Patrick » Mon Aug 26, 2013 9:26 am

I an mostly concerned about the flood of USB errors - this should never happen, and could potentially be caused by some back locking where the USB device is accessed as it's being closed.. I don't really think there is a logic problem with the USB code, because we have never seen issues like this other then with the webservice, but maybe.

There is a bug the the socket send code, which can cause a deadlock, and stop further connections to the webservice, which I'm working on - but this isn't related to the USB errors.

-Patrick

glenn
Phidgeteer!
Posts: 93
Joined: Sun Sep 05, 2010 4:42 pm
Contact:

Re: SBC webservice stops

Postby glenn » Mon Aug 26, 2013 1:55 pm

Have you been able to reproduce the USB error cascades there using the wsdiag stuff? If not, have you verified that the client heartbeat listener has not died (see message from 23-Aug)? Once the HB listener dies -- which is not that rare -- then you won't get these closely-spaced open/close events which are what seem to trigger the USB error cascades, at least on my setup. The WS behaves very stably once the HB monitor wedges. I've never see a long USB error cascade once that happens. So it's critical to monitor this.

Question: When you say
I don't really think there is a logic problem with the USB code

do you mean the Phidgets code, or the libusb code? The reason I ask is that I want to try to write up a little "photon torpedo" which will really torture the crap outta the USB with open/close sequences. I want to understand whether it would be preferable to do it at the driver (libusb) level or at the Phidgets API level.

glenn
Phidgeteer!
Posts: 93
Joined: Sun Sep 05, 2010 4:42 pm
Contact:

Re: SBC webservice stops

Postby glenn » Thu Aug 29, 2013 9:19 am

Hi Patrick,

Found what appears to be a smoking gun for the USB errors, at least in the sense of a necessary (though not sufficient) condition.

Ran a long experiment series that seems to confirm what I mentioned last week (22-Aug forum post) as a possible causative factor: The conflict between the queued-up client TCP socket close() and socket open() that occurs during a link outage, due to the HB logic attempting an immediate socket close and reconnect upon detecting loss of heartbeat. When the link comes back up, the associated FIN and SYN segments wind up getting flushed asynchronously (i.e. they arrive at WS with arbitrary ordering).

The experiment shows pretty convincingly -- no exceptions in 66 runs -- that when the FIN [i.e. the socket close() for the original session] flushes first, nothing untoward happens: No USB errors, no WSJS, no unusual behavior at all. In every observed case, the WS recovers gracefully from the outage and returns to normal operation.

But if the SYN [i.e. the socket connect() for the new session] flushes first, then there's about 50% probability that USB errors and other fault modes (including WSJS) occur.

This suggests a mutex or other resource locking issue in which the WS thread sequence started in motion by the stream_socket_accept() for the new session is being interfered with by the socket close() on the original session that arrives a little while later. Sometimes "something bad" happens, sometimes it doesn't.

I'll continue hunting down the details and try to track down the exact source of the conflict.

Certainly one obvious workaround though is not to have the HB monitor behaving as it presently does, i.e. firing off a socket close() as soon as loss of HB is detected. Even if the ordering of the open() and close() were completely benign (i.e. not causing any USB errors or any other problems) I would still argue that the present behavior doesn't seem to accomplish anything useful and could probably be improved with some rethinking.

Anyway, more on this later.

I'll post the experiment results and logfiles later today.

Regards,
Glenn

User avatar
Patrick
Lead Developer
Posts: 3038
Joined: Mon Jun 20, 2005 8:46 am
Location: Canada
Contact:

Re: SBC webservice stops

Postby Patrick » Thu Aug 29, 2013 11:46 am

The reason it closes the socket right away, is because we don't know if the webservice will be coming back up at the same IP address; in which case the socket could have been just left opened, or at a different IP address. We also close the socket if the Phidget is physically unplugged. I suppose we could leave the socket opened, until we decide to open a new socket.

At any rate, the behaviour needs to be made safe regardless.

I haven't had time to look into this much more lately.

-Patrick

glenn
Phidgeteer!
Posts: 93
Joined: Sun Sep 05, 2010 4:42 pm
Contact:

Re: SBC webservice stops

Postby glenn » Thu Aug 29, 2013 3:28 pm

Agree 100% that the primary task is to avoid USB errors and WS crashing given HB monitor's present behavior. Regarding the why & wherefore of that behavior... let's save that for a later debate. :)

Here's a description and summary of the experiment, and a tarball with the set of logfiles for each run.


Code: Select all

Experiment 03: 28,29 Aug. 2013
==============================

Overview
========

 Each run consisted of the following:

    1.  SBC: Kill existing WS threads
    2.  SBC: Start WS, dumping log to "logNN", where NN is the run number.
    3.  Laptop: Start wsdiag in polling mode, verify connection to WS and correct
        polling data results for ~15-20 seconds.
    4.  Laptop: Set WiFi link loss rate to 100% for 8 seconds. [NOTE 1]
    5.  When link comes back up, record the ordering of {close,open} in WS log
        and post-outage polling behavior of wsdiag. [NOTE 2]

  NOTE 1:
        Link loss was introduced via tc(1) rather than by bringing the interface
        down.  Cleaner approach: Neither the application nor TCP are aware that
        the link is down, but all outbound data is dropped on the floor.

  NOTE 2:
        The {open,close} ordering was determined by examining the first few log
        lines after the link was restored: "Accepted new connection" from the
        listener thread indicates the ordering was {open,close}. "pd_close()"
        (or its immediate predecessors e.g. "readfunc() returned 0") indicates
        ordering was {close,open}.


Summary
=======

 Of 66 total runs:

    31 runs had {open,close} ordering:
       * 19 had USB errors after link restored; of those, 5 had WSJS.
       * 12 returned gracefully to normal operation with no faults observed.

    35 runs had {close,open} ordering:
       * All recovered gracefully to normal operation with no faults observed.


Per-run details
===============

 In the table below, the following shorthand is used in the "Notes" column:

  RNO:  Resumed normal operation: When the link was restored, wsdiag's polling
        resumed normally: On each read cycle, the values read back from each
        Phidget agreed with the value written on the prior write cycle.

  SRB:  Stuck read bit: When the link was restored, wsdiag's getXxx() calls
        returned "normal" (zero) but the data value read was stuck in one state
        or the other. (This usually happened only on one of the Phidgets, but
        occasionally on both. When it was just one, it could be either; there
        didn't seem to be any obvious pattern.)
 
  NT=n: Number of WS threads (n) still running after the link was restored and
        the thread activity had time to settle.  For this codeset (my instrumented
        code, not stock) the number of threads during normal operation (with
        client connected) is 9:

                stream_server_thread();
                "mystery" thread();
                CentralThreadFunction();

                pds_session_serve();
                report();

                ReadThreadFunction(PH1070);
                WriteThreadFunction(PH1070);

                ReadThreadFunction(PH1014);
                WriteThreadFunction(PH1014);

----------------------------------------------------------------------------------

Logfile {open,close}    USB err
name      ordering      cascade WSJS    Notes
=====   ============    ======= ====    =====

log01   open,close      No      ----
log02   open,close      Yes     Yes     (Forgot to record NT)
log03   open,close      Yes     No
log04   open,close      Yes     No
log05   close,open      No      ----    RNO
log06   close,open      No      ----    RNO
log07   open,close      No      ----    RNO
log08   open,close      Yes     No      Many "Error in set function..."
log09   close,open      No      ----    RNO
log10   close,open      No      ----    RNO
log11   close,open      No      ----    RNO
log12   open,close      No              RNO
log13   open,close      Yes     No      SRB
log14   open,close      No      ----    RNO
log15   open,close      Yes     No      SRB
log16   close,open      No      ----    RNO
log17   open,close      No      ----    RNO
log18   open,close      Yes     No      SRB
log19   open,close      Yes     No      SRB, many "Error in set function..."
log20   close,open      No      ----    RNO
log21   close,open      No      ----    RNO
log22   open,close      Yes     No      SRB, NT=8, Many "Error in set function..."
log23   open,close      Yes     Yes     NT=0
log24   close,open      No      ----    RNO
log25   close,open      No      ----    RNO
log26   open,close      Yes     No      SRB, NT=8
log27   open,close      No      ----    RNO
log28   close,open      No      ----    RNO
log29   open,close      No      ----    RNO
log30   close,open      No      ----    RNO (long time to RNO)
log31   open,close      No      ----    RNO
log32   close,open      No      ----    RNO
log33   open,close      Yes     Yes     Initially SRB with NT=9, then WSJS to NT=0
log34   open,close      No      ----    RNO
log35   close,open      No      ----    RNO
log36   close,open      No      ----    RNO
log37   close,open      No      ----    RNO
log38   open,close      No      ----    RNO
log39   open,close      Yes     No      SRB, NT=9
log40   close,open      No      ----    RNO
log41   open,close      Yes     No      SRB, NT=8, "Error in set function..."
log42   close,open      No      ----    RNO
log43   close,open      No      ----    RNO
log44   close,open      No      ----    RNO
log45   close,open      No      ----    RNO
log46   open,close      Yes     No      SRB, NT=9
log47   close,open      No      ----    RNO
log48   close,open      No      ----    RNO
log49   open,close      No      ----    RNO
log50   close,open      No      ----    RNO
log51   close,open      No      ----    RNO
log52   close,open      No      ----    RNO
log53   close,open      No      ----    RNO
log54   close,open      No      ----    RNO
log55   close,open      No      ----    RNO
log56   close,open      No      ----    RNO
log57   open,close      Yes     No      SRB, NT=9
log58   close,open      No      ----    RNO
log59   open,close      Yes     Yes     NT=0. Only 1 USB error was reported.
log60   close,open      No      ----    RNO
log61   close,open      No      ----    RNO
log62   open,close      No      ----    RNO
log63   open,close      Yes     No      SRB, NT=9
log64   close,open      No      ----    RNO
log65   open,close      Yes     Yes     NT=0
log66   close,open      No      ----    RNO

Attachments
expt03_logs.tgz
(70.59 KiB) Downloaded 178 times

glenn
Phidgeteer!
Posts: 93
Joined: Sun Sep 05, 2010 4:42 pm
Contact:

Re: SBC webservice stops

Postby glenn » Fri Aug 30, 2013 11:49 am

Patrick,

If you have time before the weekend please try to answer briefly the question from 2013.08.26 13:55 post (time = MDT). This will help me understand what direction I should be heading in doing diagnostics over the weekend.

If you're not able to reproduce any of the USB errors that are so readily reproducible here, then it points to an issue with my setup, in which case I need to rethink my approach here.

Thanks, and good weekend to you,
Glenn

User avatar
Patrick
Lead Developer
Posts: 3038
Joined: Mon Jun 20, 2005 8:46 am
Location: Canada
Contact:

Re: SBC webservice stops

Postby Patrick » Fri Aug 30, 2013 12:15 pm

I don't believe there is a problem in the USB code itself. This code has been little changed in many years, and we haven't seen these kinds of issues in other situations. I really don't know what could be causing the USB error cascade, but it's probably some bad locking logic exposed by an open and close being called close together, from different threads. I'm not sure why this would cause USB errors, but it's not that hard to imagine. This probably leads to a deadlocking of two or more threads, because of bad locking code, where each thread is trying to lock a mutex which the other owns. Any new connections will also deadlock waiting for one of these mutexes. Or, we get a seg fault because one thread frees some memory that another is still using, etc etc..

-Patrick


Return to “General”

Who is online

Users browsing this forum: No registered users and 1 guest