[Discussion] Re: Reduce polling costs

To: l.andrew@xxxxxxxx
Subject: [Discussion] Re: Reduce polling costs
From: John Heffner <jheffner@xxxxxxx>
Date: Wed, 22 Aug 2007 18:45:06 -0700

Lachlan Andrew wrote:
Greetings John,

On 14/08/07, John Heffner <jheffner@xxxxxxx> wrote:
I didn't see the original message on the list. Was there more to it?

Yes, it continued:


1. Do you know why the overhead would be so high?  Is that normal?
2. Do you think caching  fp  would help?
3. I assume I'd need to  frewind(fp)  each call to  snap,  but is
there anything else that would be needed?
4. Is there a lower-overhead way to read the web100 variables
frequently, down to intervals of 1ms?

Hm, I had forgotten how unoptimized some of the library was. ;) This is probably unrelated to your problem, though, at polling intervals of >=1ms. BTW, I'm just starting a complete API redesign, so if you have suggestions/complaints, now's the time...



We've also noticed that our polling process occasionally freezes for
several seconds when we poll web100 too quickly, but doesn't if we
poll slower.  This may be a bug in our experimental kernel, but the
interaction with web100 seems odd.  Has anyone seen anything like that
before?

Interesting, I haven't heard of that one before. If you can help track that down, I'd be curious to know what might be causing the problem.
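
One way to start tracking down such freezes is to timestamp every sample and flag the gaps that are much longer than the intended polling interval. The following is a minimal sketch, not part of web100; the function name, the millisecond units, and the factor of 3 are all assumptions chosen for illustration.

```python
def find_stalls(timestamps_ms, interval_ms, factor=3):
    """Return (start, gap) pairs for gaps longer than factor * interval_ms.

    timestamps_ms: sample times in milliseconds, in the order they were taken.
    interval_ms:   the intended polling interval (hypothetical parameter).
    factor:        how many intervals may pass before a gap counts as a stall.
    """
    stalls = []
    for prev, cur in zip(timestamps_ms, timestamps_ms[1:]):
        gap = cur - prev
        if gap > factor * interval_ms:
            stalls.append((prev, gap))
    return stalls
```

Correlating the flagged start times with kernel events (loss episodes, retransmissions) would show whether the stalls line up with recovery.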



Tom Quetchenbach wrote:

How are you polling?  50ms should be pretty easy, but you should not be
doing an fopen for each read, just a seek and read.  Using libweb100,
that's just web100_raw_read().
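
The open-once, seek-and-read pattern described here can be sketched as follows. This is written in Python against an ordinary file rather than against libweb100, and the path, length, and offset are placeholders, not the actual web100 file layout.

```python
import os

def open_stats(path):
    """Open the stats file once; the descriptor is reused for every poll."""
    return os.open(path, os.O_RDONLY)

def poll_once(fd, length, offset=0):
    """Read `length` bytes at `offset` without reopening the file.

    os.pread performs the seek and the read in a single system call
    and leaves the descriptor's file position unchanged.
    """
    return os.pread(fd, length, offset)
```

The per-poll cost is then one system call instead of an open, a read, and a close.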

We want to capture all the variables, and so were using web100_snap. Is there an equivalent to web100_snap which doesn't open the file each time, or do we have to loop through the variables ourselves?

The problem seems to have to do with the fact that web100 locks the
socket in connection_file_rw before reading. If we remove the calls to
lock_sock and release_sock, the problem goes away.

So, is it dangerous not to lock the socket on a read?

Hm, at a 50ms polling interval? You mean 50us?

No, we mean 50ms. (I've since heard of another group who stopped using web100 because of the same issue of web100 "freezing" when interesting things happen. I really like web100, and so would rather help improve it than abandon it.)

I know Baruch Even at Hamilton had this issue, and rolled his own instrumentation of the things he was interested in using queues of events rather than a polling approach.



One problem seems to be that, at times of heavy loss, Linux can hold
the lock for a second or more going through all the retransmissions
etc.  That is exactly the sort of Linux problem we would like to
debug, but we need to see what is happening during those times.

Ah, I understand your issue now. This is kind of a tough case for Web100 -- it's not really what it was designed to do. Baruch's approach for this type of thing is more appropriate. Taking out the lock_sock() will definitely help you out. (I could add a switch for this pretty easily.) However, I think if you're running on a uni-processor system you may have some problems no matter what. The softirq processing in 2.6 is better than it used to be in 2.4 in terms of not starving out user processes, but I think it will still be an issue if you're trying to get fine-grained data to see what's happening during recovery.



The lock_sock() is there to ensure the correctness of the stats.
Also, though unlikely, some of the 64-bit values could be incorrect if
read at the wrong time.
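
The thread does not spell out how a 64-bit value goes wrong. One common cause, assumed here, is that the counter is stored as two 32-bit words that are not updated atomically, so an unlocked reader can pick up one half from before an update and the other half from after it. The helper names below are hypothetical and only simulate that interleaving.

```python
MASK32 = 0xFFFFFFFF

def split64(value):
    """Return the (high, low) 32-bit halves of a 64-bit counter."""
    return (value >> 32) & MASK32, value & MASK32

def join64(high, low):
    """Reassemble a 64-bit value from its 32-bit halves."""
    return (high << 32) | low

def torn_read(old, new):
    """Simulate a reader that sees the high half of `old` and the low half of `new`."""
    old_high, _ = split64(old)
    _, new_low = split64(new)
    return join64(old_high, new_low)
```

The error only appears when the low word wraps during the update: a counter going from 0xFFFFFFFF to 0x100000000 can be read as 0, which is off by about 2^32.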

Outliers like that can be filtered out, but a 1s gap in readings is fatal. At other times, web100 would freeze for 10s of seconds, but this would not happen when we increased the polling interval to 0.5s (which is really too slow to be useful). Again, this time was spent in the locking code.
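
A minimal sketch of such filtering, assuming the variable is a counter that should never decrease and that the caller can name a largest plausible change between samples (`max_step`, a hypothetical parameter). This is one simple approach, not something web100 provides.

```python
def filter_outliers(samples, max_step):
    """Keep samples whose change from the last kept sample is within [0, max_step].

    Assumes the first sample is good; if it is not, later good samples
    will be rejected against it.
    """
    kept = []
    for s in samples:
        if not kept or 0 <= s - kept[-1] <= max_step:
            kept.append(s)
    return kept
```

A reading that is off by about 2^32 fails the `max_step` check for any sensible bound, so it is dropped while the neighbouring samples survive.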

It spent 10s in lock_sock()? Or was this some other lock?



-John

