help understanding mutex_unlock logic
Hi all,
I'm trying to debug a problem with a custom scheduler (https://xgitlab.cels.anl.gov/sds/abt-snoozer/issues/8). To make a long story short, we can sometimes trigger a scenario where unlocking a mutex does not wake up any of the ULTs that are blocked on it.
I've traced something suspicious in the ABTI_mutex_wake_de() path leading up to the deadlock, but I'm not sure if this is an Argobots bug or if I just don't understand the logic. Can someone help sanity check this?
In the ABTI_mutex_wake_de() invocation just before the hang, the value of num_elem is 1 at this line:
https://github.com/pmodels/argobots/blob/master/src/mutex.c#L769
... but then, after checking the high and low priority lists, it falls through to here, having not found anything to wake up:
https://github.com/pmodels/argobots/blob/master/src/mutex.c#L802
Is that code path supposed to be possible?
It seems counterintuitive that the count could be > 0 while no thread can be found to wake up, but there might be a more subtle meaning to num_elem. The particular scheduler that I am debugging blocks (sleeps) until it finds work to do, so it's important that this path ultimately triggers a pool push, or else it won't make progress. That may not be an issue with other schedulers.
thanks!
-Phil
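The scenario in the message above can be modeled with a small sketch. This is not the Argobots code: wait_table, waiter, pool, list_pop, and unlock_and_wake are hypothetical names, and the pool push is reduced to a counter. It only illustrates why a count of 1 combined with two empty priority lists leaves a blocking scheduler with nothing to run.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for illustration; not the Argobots structures. */
typedef struct waiter {
    struct waiter *next;
} waiter;

typedef struct {
    int num_elem;   /* how many waiters the table claims to hold */
    waiter *high;   /* high-priority waiters */
    waiter *low;    /* low-priority waiters */
} wait_table;

typedef struct {
    int pushes;     /* counts simulated pool pushes */
} pool;

/* Pop the first waiter from a singly linked list, or return NULL. */
static waiter *list_pop(waiter **list)
{
    waiter *w = *list;
    if (w != NULL) {
        *list = w->next;
        w->next = NULL;
    }
    return w;
}

/* Returns 1 if a waiter was woken (pushed to the pool), 0 otherwise. */
static int unlock_and_wake(wait_table *t, pool *p)
{
    waiter *w;

    assert(t->num_elem >= 0);
    if (t->num_elem == 0)
        return 0;               /* nobody is waiting */

    w = list_pop(&t->high);     /* check the high-priority list first */
    if (w == NULL)
        w = list_pop(&t->low);  /* then the low-priority list */

    if (w == NULL)
        return 0;               /* count > 0 but nothing found: no push */

    t->num_elem--;
    p->pushes++;                /* stand-in for pushing the ULT to its pool */
    return 1;
}
```

With a consistent table the call returns 1 and records one push. With num_elem set to 1 and both lists empty it returns 0 without pushing anything, which is the silent fall-through that a scheduler sleeping on its pool would never notice.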
I have a little bit more detail, but no root cause yet, on what's going wrong here. The ABTI_thread_queue data structure is definitely corrupted in my case. I've traced a call to ABTI_thread_htable_pop() where (going into the function) p_queue->num_threads == 1, p_queue->head == NULL, and p_queue->tail != NULL.
There were 3 pushes and 2 pops before that point, so I think the num_threads and tail variables are correct, but the head variable is wrong. The head and tail should both point to the same thing.
Possibly a memory corruption in my own code, though. I'll keep digging.
thanks,
-Phil
[In reply to the message from Phil Carns of 08/11/2017 05:14 PM]
_______________________________________________ discuss mailing list [email protected] https://lists.argobots.org/mailman/listinfo/discuss
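The inconsistency reported in the follow-up (num_threads == 1, head == NULL, tail != NULL) is the kind of state a small invariant check can catch early. The sketch below assumes a simple linear queue with a count, a head, and a tail; the names queue, node, queue_is_consistent, and queue_check are hypothetical, and the real ABTI_thread_queue layout may differ.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical queue layout for illustration only. */
typedef struct node {
    struct node *next;
} node;

typedef struct {
    size_t num_threads;
    node *head;
    node *tail;
} queue;

/* Returns 1 if the count, head, and tail agree with each other. */
static int queue_is_consistent(const queue *q)
{
    if (q->num_threads == 0)
        return q->head == NULL && q->tail == NULL;
    if (q->head == NULL || q->tail == NULL)
        return 0;                       /* non-empty queue needs both ends */
    if (q->num_threads == 1)
        return q->head == q->tail;      /* one element: both ends are it */
    return q->head != q->tail;          /* several elements: ends differ */
}

/* Debug helper: abort as soon as the queue is seen in a bad state. */
static void queue_check(const queue *q)
{
    assert(queue_is_consistent(q));
}
```

Calling a check like queue_check() at the start and end of every push and pop in a debug build would flag the corruption at the operation that introduces it, rather than at the later pop that trips over it.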
I just opened a pull request that fixes the problem at https://github.com/pmodels/argobots/pull/22. The issue ended up being a field that's not initialized properly when a thread is pushed onto a queue, but is needed by the pop logic later if there is more than one item in the queue.
I still feel like probably https://github.com/pmodels/argobots/blob/master/src/mutex.c#L802 should be an assertion rather than silently returning (a lock is held all the way from checking the element count to failing to find a thread, so it seems like an inconsistent scenario to hit that code path), but I did not address that in the pull request. I'd like someone to sanity check that part :)
thanks,
-Phil
[In reply to the message from Phil Carns of 08/14/2017 04:49 PM]
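The two points in the message above (initialize every link field on push, and assert rather than return silently when the count and the list disagree) can be sketched as follows. The thread does not name the field that the pull request fixes, so this does not reproduce that change; ult, ult_queue, ult_queue_push, and ult_queue_pop are hypothetical names for a generic linked queue.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical types for illustration; not the Argobots definitions. */
typedef struct ult {
    struct ult *next;   /* may hold stale data if the object is reused */
} ult;

typedef struct {
    size_t num;
    ult *head;
    ult *tail;
} ult_queue;

static void ult_queue_push(ult_queue *q, ult *u)
{
    u->next = NULL;             /* always reset: never trust old contents */
    if (q->tail == NULL)
        q->head = u;            /* first element becomes the head */
    else
        q->tail->next = u;      /* otherwise link behind the old tail */
    q->tail = u;
    q->num++;
}

static ult *ult_queue_pop(ult_queue *q)
{
    ult *u;

    if (q->num == 0)
        return NULL;            /* genuinely empty: nothing to pop */

    u = q->head;
    assert(u != NULL);          /* count > 0 must imply a non-empty list */

    q->head = u->next;
    if (q->head == NULL)
        q->tail = NULL;         /* popped the last element */
    q->num--;
    return u;
}
```

Replaying the sequence from the thread (3 pushes, then 2 pops) on this sketch leaves one element with head and tail pointing at the same object. If a link were left uninitialized, the assertion would stop at the inconsistent pop instead of returning as though the queue were empty.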
Hi Phil,
Thank you for the detailed report and for the PR. I will review the problem and the PR promptly and get back to you.
Best,
Halim
www.mcs.anl.gov/~aamer
[In reply to the message from Phil Carns of 8/14/17 8:42 PM]
participants (2)
- Halim Amer
- Phil Carns