constraining overall stack memory allocation
Hi all,
I was rummaging around in the code looking for ideas just now and figured I might save myself some time by asking on the list to see if anyone else has encountered this.
A quick review of the use case: we are using large stack sizes (2 MiB right now; we could probably go lower, but it would still be much larger than the ABT default). We also create, execute, and complete a large number of detached ULTs. Only a very few are intentionally long lived.
Our current strategy is that a central producer (which drives network progress) creates ULTs that may be placed on other pools/ESs depending on configuration.
I had *thought* that the ULT stacks were not allocated until the ULT was selected for execution by a scheduler, but I see now that's not the case. The stack is allocated up front at ABT_thread_create() time. I'm kicking myself for not understanding that sooner. It didn't matter so much when we used small stack sizes.
At any rate, this strategy has a few implications. First, if the ES schedulers don't retire old ULTs fast enough (even if they are very "close" to completion), then memory consumption can balloon even if our actual concurrency doesn't look all that high, simply because we are greedily taking more memory for stacks without regard to ULT completion. Second, the one producer is always paying the allocation cost, and the memory is always local to that one core.
What would be ideal for me would be if ABT_thread_create() would defer stack allocation somehow, ideally not consuming so much memory for a thread until a) it can really be executed and b) the scheduler thinks it is a good idea to do so. Even better if the allocation were in the context of the ES that popped the thread, rather than the ES that spawned the thread.
Is this possible?
It would be neat if this could be done internal to Argobots somehow for generality for my use case, but walking through the code I have the sinking feeling that we need to do this above Argobots (explicitly queueing up work and letting the "worker" execution streams create their own ULTs to perform that work as needed, rather than letting the ULT pools within Argobots serve double duty as our work queue).
I'm comfortable with custom pools and schedulers, but it looks like the key step is already out of our hands at ULT creation time, so there isn't much a custom pool or scheduler could do.
Thanks for hearing me out, and thanks in advance for feedback (even if it takes the form of "that's a silly idea" :) ).
thanks,
-Phil
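The "do this above Argobots" fallback Phil describes can be sketched as a queue of small work descriptors that the worker execution streams drain, creating a ULT only when they pick an item up. This is a minimal single-threaded sketch, not Argobots code: `work_item`, `work_queue`, `workq_push`, `workq_pop`, and `increment_counter` are hypothetical names, and a real producer/worker setup would need locking around the queue.

```c
#include <assert.h>
#include <stddef.h>

#define WORKQ_CAPACITY 1024

/* Hypothetical work descriptor: a few bytes instead of a multi-MiB stack. */
typedef struct {
    void (*fn)(void *);
    void *arg;
} work_item;

/* Fixed-size FIFO ring buffer. Not thread-safe: a real producer/worker
 * setup would need a mutex or a lock-free queue around these calls. */
typedef struct {
    work_item items[WORKQ_CAPACITY];
    size_t head;  /* index of the next item to pop */
    size_t count; /* number of queued items */
} work_queue;

void workq_init(work_queue *q)
{
    assert(q != NULL);
    q->head = 0;
    q->count = 0;
}

/* Producer side: returns 0 on success, -1 if the queue is full. */
int workq_push(work_queue *q, void (*fn)(void *), void *arg)
{
    assert(q != NULL && fn != NULL);
    if (q->count == WORKQ_CAPACITY)
        return -1;
    size_t tail = (q->head + q->count) % WORKQ_CAPACITY;
    q->items[tail].fn = fn;
    q->items[tail].arg = arg;
    q->count++;
    return 0;
}

/* Worker side: returns 0 and fills *out, or -1 if the queue is empty.
 * A worker would create (or directly run) a ULT for the item at this
 * point, so the stack is allocated by the consumer, not the producer. */
int workq_pop(work_queue *q, work_item *out)
{
    assert(q != NULL && out != NULL);
    if (q->count == 0)
        return -1;
    *out = q->items[q->head];
    q->head = (q->head + 1) % WORKQ_CAPACITY;
    q->count--;
    return 0;
}

/* Example work function used to exercise the queue. */
void increment_counter(void *arg)
{
    ++*(int *)arg;
}
```

With this layout, a backlog of pending requests costs one `work_item` each rather than one full stack each, at the price of maintaining a second queue next to the Argobots pools.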
Hi Phil,
Thanks for using Argobots! I believe this is about memory consumption by ULT stacks.
> What would be ideal for me would be if ABT_thread_create() would defer stack allocation somehow.
I believe https://github.com/pmodels/argobots/pull/356 (merged) does exactly this. This configuration is disabled by default, so please set --enable-lazy-stack-alloc at configure time.
[Background]
Argobots needs to keep one full stack [*1] (in this case, 2 MB) per "active" (i.e., executing or suspended) ULT. Intuitively, Argobots must have a full ULT stack to save the intermediate execution state of each suspended ULT, in addition to the stack space for each currently executing ULT. This is the minimum stack requirement for Argobots.
[*1] There was a long discussion in https://github.com/pmodels/argobots/issues/274, but basically it is not possible to allocate a small stack first and expand it later within Argobots.
[Ideas]
A ULT stack is assigned when a ULT is executed (not when it is created). The stack is reclaimed when a ULT is finished (not when it is freed). This achieves the minimum stack use described in [Background]. See the PR for details; it explains the idea with some figures.
[Reduce More]
1. The minimum above does not include the ULT stack pool (i.e., the stack cache), so if you want to reduce memory usage further, please shrink the stack pool size. The cache only adds a constant amount of memory, so I believe it won't affect the memory footprint much. Shrinking it can hurt performance.
2. Even with lazy allocation, you still need 2 MB per suspended ULT. If most of the ULTs launch and then immediately yield, the "enable-lazy-stack-alloc" method does not reduce memory consumption. If a ULT needs to yield immediately, please create a new ULT for the continuation and exit the current ULT instead of yielding; Argobots then does not need to keep a full stack per yielded ULT. (A newly created ULT does not have a stack because it has not started yet.)
---
I might not fully understand the use case, but hopefully this flag helps. Please let me know if you have any questions or suggestions.
Thanks,
Shintaro
On Thu, Jun 16, 2022 at 1:57 PM Phil Carns via discuss <[email protected]> wrote:
_______________________________________________ discuss mailing list [email protected] https://lists.argobots.org/mailman/listinfo/discuss
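The stack-ownership rules in Shintaro's reply can be written down as a small state model. This is a toy reading of the [Background] and [Ideas] notes, not Argobots code: the `ult_state` enum, `holds_stack`, and `stack_bytes_in_use` are hypothetical names, and the assumption that a finished but not yet freed ULT still holds its stack without lazy allocation is inferred from the "(not freed)" remark.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical life-cycle states for one ULT in this toy model. */
typedef enum {
    ULT_CREATED,   /* created, never scheduled */
    ULT_RUNNING,   /* currently executing */
    ULT_SUSPENDED, /* yielded or blocked mid-execution */
    ULT_FINISHED   /* terminated but not yet freed */
} ult_state;

/* Returns 1 if a ULT in state s holds a full stack under the model. */
int holds_stack(ult_state s, int lazy)
{
    switch (s) {
    case ULT_RUNNING:
    case ULT_SUSPENDED:
        /* "Active" ULTs need a full stack in either mode. */
        return 1;
    case ULT_CREATED:
    case ULT_FINISHED:
        /* Lazy mode assigns the stack at first execution and reclaims it
         * at termination; the default mode holds it from create to free. */
        return lazy ? 0 : 1;
    }
    return 0;
}

/* Total stack bytes held by n ULTs that all use the same stack size. */
size_t stack_bytes_in_use(const ult_state *ults, size_t n,
                          size_t stack_size, int lazy)
{
    assert(ults != NULL || n == 0);
    size_t held = 0;
    for (size_t i = 0; i < n; i++)
        held += (size_t)holds_stack(ults[i], lazy);
    return held * stack_size;
}
```

In this model a backlog of created-but-unstarted ULTs costs nothing in lazy mode, while a population of suspended ULTs costs the same in both modes, which is the caveat in point 2 of [Reduce More].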
This is fantastic, Shintaro. That sounds like exactly what I was hoping for :)
We have a benchmark that I think will work well for isolating this behavior; we'll try some experiments and let you know what we find. I'll modify our benchmark to track memory consumption first so that we have the metric ready when we do parameter sweeps.
We do actually already constrain the stack cache size (this was necessary early on for us because of the large stacks we use), so we should be all set there. We also have a custom pool that prioritizes completing existing ULTs before presenting new ones to the scheduler. I think that might help us get a little more benefit out of the lazy stack allocation as well.
If this proves to be helpful for our workload, is this something that could plausibly be a run-time rather than compile-time option?
thank you!
-Phil
On 6/16/22 7:43 PM, Shintaro Iwasaki wrote:
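The pool policy Phil mentions (finish existing ULTs before handing new ones to the scheduler) can be sketched with two FIFOs. This is a toy model over integer ULT ids, not the Argobots custom pool interface and not Phil's actual pool; `prio_pool`, `prio_pool_push`, and `prio_pool_pop` are hypothetical names, and the code is not thread-safe.

```c
#include <assert.h>
#include <stddef.h>

#define POOL_CAPACITY 256

/* Simple FIFO of ULT ids (stand-ins for real ULT handles). */
typedef struct {
    int ids[POOL_CAPACITY];
    size_t head;  /* index of the next id to pop */
    size_t count; /* number of queued ids */
} id_fifo;

/* Toy pool with two classes of work. */
typedef struct {
    id_fifo resumed; /* ULTs that already ran and therefore hold a stack */
    id_fifo fresh;   /* ULTs that never ran (no stack yet in lazy mode) */
} prio_pool;

static int fifo_push(id_fifo *f, int id)
{
    if (f->count == POOL_CAPACITY)
        return -1;
    f->ids[(f->head + f->count) % POOL_CAPACITY] = id;
    f->count++;
    return 0;
}

static int fifo_pop(id_fifo *f, int *id)
{
    if (f->count == 0)
        return -1;
    *id = f->ids[f->head];
    f->head = (f->head + 1) % POOL_CAPACITY;
    f->count--;
    return 0;
}

void prio_pool_init(prio_pool *p)
{
    assert(p != NULL);
    p->resumed.head = p->resumed.count = 0;
    p->fresh.head = p->fresh.count = 0;
}

/* Returns 0 on success, -1 if the chosen FIFO is full. */
int prio_pool_push(prio_pool *p, int id, int already_started)
{
    assert(p != NULL);
    return fifo_push(already_started ? &p->resumed : &p->fresh, id);
}

/* Pops a previously started ULT if one exists, otherwise a fresh one.
 * Returns 0 and fills *id, or -1 if the pool is empty. */
int prio_pool_pop(prio_pool *p, int *id)
{
    assert(p != NULL && id != NULL);
    if (fifo_pop(&p->resumed, id) == 0)
        return 0;
    return fifo_pop(&p->fresh, id);
}
```

Draining the `resumed` queue first means a stack is retired before a new one is requested, which is why such a policy could plausibly compound the benefit of lazy allocation.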
Hi Phil,
Thanks! I am looking forward to the results.
> If this proves to be helpful for our workload, is this something that could plausibly be a run-time rather than compile-time option?
1. It is technically difficult to make it a run-time option because it would make the ULT logic very complicated.
2. There is no performance concern with this lazy stack allocation option, and it should be beneficial for most use cases, so I originally planned to turn it on by default. The only problem is OOM handling.
https://github.com/pmodels/argobots/pull/356/commits/c4f01cf46ba2d9adbb4ae16...
Argobots now strictly checks resource-related errors; all functions cleanly return an error upon any resource allocation failure (e.g., malloc(), pthread_create(), ...). There are dedicated tests for this. This applies to ABT_thread_create(), but the resource check works only if the stack is allocated at creation. In other words, when lazy stack allocation is enabled, ABT_thread_create() cannot report a stack allocation failure. It is not trivial to fix this issue.
We'd like to first confirm its merits in real workloads. If it proves beneficial, we'd like to turn it on by default. (If so, I'd appreciate it if you could create a PR that changes the default value below.)
https://github.com/pmodels/argobots/pull/356/commits/f0b22d480c05866d0f3b801...
Shintaro
On Mon, Jun 20, 2022 at 6:27 AM Phil Carns <[email protected]> wrote:
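The OOM concern can be illustrated with a toy allocator that can hand out only a limited number of stacks. Nothing here is Argobots code: `stack_budget`, `toy_ult`, `toy_create`, `toy_first_run`, and the `TOY_*` codes are hypothetical, and the model only shows where the failure surfaces in each mode.

```c
#include <assert.h>
#include <stddef.h>

enum { TOY_SUCCESS = 0, TOY_ERR_MEM = -1 };

/* How many more stacks the (pretend) system can still provide. */
typedef struct {
    size_t free_stacks;
} stack_budget;

typedef struct {
    int has_stack;
} toy_ult;

static int take_stack(stack_budget *b)
{
    if (b->free_stacks == 0)
        return TOY_ERR_MEM;
    b->free_stacks--;
    return TOY_SUCCESS;
}

/* Creation. Eager mode allocates here, so the caller sees the failure.
 * Lazy mode allocates nothing here, so creation cannot fail for lack
 * of stack memory. */
int toy_create(stack_budget *b, toy_ult *t, int lazy)
{
    assert(b != NULL && t != NULL);
    t->has_stack = 0;
    if (lazy)
        return TOY_SUCCESS;
    if (take_stack(b) != TOY_SUCCESS)
        return TOY_ERR_MEM;
    t->has_stack = 1;
    return TOY_SUCCESS;
}

/* First execution. In lazy mode the allocation happens here, inside the
 * scheduler, where there is no creating caller left to return an error to. */
int toy_first_run(stack_budget *b, toy_ult *t)
{
    assert(b != NULL && t != NULL);
    if (t->has_stack)
        return TOY_SUCCESS;
    if (take_stack(b) != TOY_SUCCESS)
        return TOY_ERR_MEM;
    t->has_stack = 1;
    return TOY_SUCCESS;
}
```

In eager mode the second `toy_create` call on a one-stack budget fails immediately; in lazy mode both creations succeed and the shortage only appears at the second `toy_first_run`, which mirrors why the resource check cannot live in the creation call once allocation is deferred.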
participants (2): Phil Carns, Shintaro Iwasaki