
LAPT: A locality-aware page table for thread and data mapping

Bibliographic Details
Published in:Parallel computing 2016-05, Vol.54, p.59-71
Main Authors: Cruz, Eduardo H.M., Diener, Matthias, Alves, Marco A.Z., Pilla, Laércio L., Navaux, Philippe O.A.
Format: Article
Language:English
Description
Summary:
•We detect the memory access patterns in shared memory applications.
•Using the detected access patterns, we map threads and data to improve performance.
•We provide better usage of hardware resources.
•We reduce execution time, cache misses, and traffic on interconnections.
•No need to modify applications or the runtime environment.

The performance and energy efficiency of current systems are influenced by accesses to the memory hierarchy. One important aspect of memory hierarchies is the introduction of different memory access times, depending on the core that requested the transaction and on which cache or main memory bank responded to it. In this context, the locality of memory accesses plays a key role in the performance and energy efficiency of parallel applications. Accesses to remote caches and NUMA nodes are more expensive than accesses to local ones. With information about the memory access pattern, pages can be migrated to the NUMA nodes that access them (data mapping), and threads that communicate can be migrated to the same node (thread mapping). In this paper, we present LAPT, a hardware-based mechanism to store the memory access pattern of parallel applications in the page table. The operating system uses the detected memory access pattern to perform an optimized thread and data mapping during the execution of the parallel application. Experiments with a wide range of parallel applications (from the NAS and PARSEC benchmark suites) on a NUMA machine showed significant performance and energy efficiency improvements of up to 19.2% and 15.7%, respectively (6.7% and 5.3% on average).
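The two mapping policies named in the abstract can be sketched in simplified form. This is a hypothetical illustration, not the paper's mechanism: LAPT itself collects access patterns in hardware via the page table, while here the per-page access counts and the thread communication matrix are simply assumed as inputs. The greedy pairing heuristic for thread mapping is also an assumption for illustration only.

```python
def map_pages(page_access):
    """Data mapping sketch: place each page on the NUMA node that
    accesses it most often.
    page_access: {page_id: [count_node0, count_node1, ...]}
    Returns {page_id: node_id}."""
    return {page: counts.index(max(counts))
            for page, counts in page_access.items()}

def map_threads(comm, n_nodes):
    """Thread mapping sketch (greedy): repeatedly co-locate the pair of
    unplaced threads with the highest communication volume on one node.
    comm: symmetric matrix, comm[i][j] = communication between threads i, j.
    Returns {thread_id: node_id}."""
    n = len(comm)
    placement = {}
    next_node = 0
    # Consider thread pairs in decreasing order of communication volume.
    pairs = sorted(((comm[i][j], i, j)
                    for i in range(n) for j in range(i + 1, n)),
                   reverse=True)
    for _, i, j in pairs:
        if i not in placement and j not in placement:
            placement[i] = placement[j] = next_node % n_nodes
            next_node += 1
    # Any thread left over (odd counts) is placed round-robin.
    for t in range(n):
        if t not in placement:
            placement[t] = next_node % n_nodes
            next_node += 1
    return placement
```

For example, with `map_pages({0: [10, 2], 1: [1, 8]})` page 0 lands on node 0 and page 1 on node 1; with a communication matrix in which threads 0 and 1 (and likewise 2 and 3) exchange the most data, `map_threads` places each heavily communicating pair on the same node, which is the locality effect the paper exploits.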
ISSN:0167-8191
1872-7336
DOI:10.1016/j.parco.2015.12.001