c++ - Performance decreases with a higher number of threads (no synchronization)
I have a data structure (a vector) whose elements have to be parsed by a function, and the elements can be parsed by different threads.
The following is the parsing method:
void consumerpool::parse(size_t n_threads, size_t id)
{
    for (size_t idx = id; idx < nodes.size(); idx += n_threads)
    {
        // parse node
        //parse(nodes[idx]);
        parse(idx);
    }
}
where:

n_threads is the total number of threads
id is the (unique) index of the current thread
and the threads are created as follows:
std::vector<std::thread> threads;
for (size_t i = 0; i < n_threads; i++)
    threads.emplace_back(&consumerpool::parse, this, n_threads, i);
Unfortunately, even though this method works, the performance of the application decreases if the number of threads is too high. I would like to understand why the performance decreases even though there is no synchronization between these threads.
The following are the elapsed times (between the start of the threads and the return of the last join()) according to the number of threads used:
- 2 threads: 500 ms
- 3 threads: 385 ms
- 4 threads: 360 ms
- 5 threads: 475 ms
- 6 threads: 580 ms
- 7 threads: 635 ms
- 8 threads: 660 ms
The time necessary for thread creation is between 1 and 2 ms. The software has been tested using a release build. The following is my configuration:
2x Intel(R) Xeon(R) CPU E5507 @ 2.27GHz
Maximum speed: 2.26 GHz
Sockets: 2
Cores: 8
Logical processors: 8
Virtualization: Enabled
L1 cache: 512 KB
L2 cache: 2.0 MB
L3 cache: 8.0 MB
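For reference, the elapsed times above were taken between the launch of the threads and the return of the last join(). A minimal sketch of such a measurement (assuming pool is an existing consumerpool instance):

#include <chrono>
#include <cstddef>
#include <iostream>
#include <thread>
#include <vector>

void run_and_time(consumerpool& pool, size_t n_threads)
{
    auto start = std::chrono::steady_clock::now();

    std::vector<std::thread> threads;
    for (size_t i = 0; i < n_threads; i++)
        threads.emplace_back(&consumerpool::parse, &pool, n_threads, i);
    for (auto& t : threads)
        t.join();

    auto stop = std::chrono::steady_clock::now();
    std::cout << std::chrono::duration_cast<std::chrono::milliseconds>(stop - start).count()
              << " ms\n";
}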
Edit:

What the parse() function does is the following:
// Data shared between the threads (around 300k elements)
std::vector<std::unique_ptr<foo>> vfoo;
std::vector<rapidxml::xml_node<>*> nodes;
std::vector<std::string> layers;

void parse(int idx)
{
    auto& p = vfoo[idx];   // take a reference: a unique_ptr cannot be copied

    // p->parse() allocates memory according to the content of the XML node
    if (!p->parse(nodes[idx], layers))
        vfoo[idx].reset();
}
Update:

We still don't have a lot of info about the memory access patterns of parse(), and how much time it spends reading input data from memory vs. how much time is spent writing/reading private scratch memory.
You say p->parse() "allocates memory according to the content of the XML node". If it frees it again, you may see a big speedup from keeping a big-enough scratch buffer allocated in each thread. Memory allocation/deallocation is a "global" thing that requires synchronization between threads. A thread-aware allocator can hopefully handle an allocate/free/allocate/free pattern by satisfying allocations from memory just freed by that thread, so it is probably still hot in the private L1 or L2 cache on that core.
Use some kind of profiling to find the real hotspots. It might be memory allocation/deallocation, or it might be the code that reads memory.
Your dual-socket Nehalem Xeon doesn't have hyperthreading, so you can't be running into issues with threads slowing each other down because a non-HT-aware OS schedules two of them on the two logical cores of the same physical core.
You should investigate with performance counters (e.g. Linux perf stat, or Intel's VTune) whether you're getting more cache misses per thread once you pass 4 threads. Nehalem uses large shared (for the whole socket) L3 (aka last-level) caches, so more threads running on the same socket create more pressure on that. The relevant perf events will be something like LLC_something, IIRC.
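For example, a possible invocation using perf's generic cache events (./your_program is just a placeholder; the Nehalem-specific last-level-cache event names can be looked up with perf list):

perf stat -e LLC-loads,LLC-load-misses,L1-dcache-load-misses ./your_program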
You should also look at L1/L2 misses, and see how they scale with the number of threads, and how that changes with strided vs. contiguous access to node[].
There are other perf counters you can check for false sharing (one thread's private variable sharing a cache line with another thread's private variable, so the cache line bounces between cores). Really, look for any perf events that change with the number of threads; that could point the way towards an explanation.
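As an illustration of what false sharing means (not taken from your code), padding per-thread data out to a cache line removes the interference. A minimal sketch, assuming C++17 so std::vector's default allocator honours the 64-byte alignment:

#include <cstddef>
#include <vector>

// Illustration only, not from the original code.
struct counter_packed { size_t n; };                // 8 of these share one 64-byte cache line
struct alignas(64) counter_padded { size_t n; };    // each counter owns a whole cache line

// Each thread increments only its own element. With counter_packed the shared
// cache line bounces between cores (false sharing); with counter_padded it does not.
void count_up(std::vector<counter_padded>& counters, size_t id, size_t iters)
{
    for (size_t i = 0; i < iters; ++i)
        counters[id].n++;
}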
A multi-socket system like your 2-socket Nehalem will have NUMA (Non-Uniform Memory Access). A NUMA-aware OS will try to allocate memory that's fast for the core doing the allocation.
So presumably your buffer has all of its physical pages in memory attached to one of your two sockets. In this case it's probably not something you can or should avoid, since I assume you're filling the array in a single-threaded way before handing it off to multiple threads for parsing. In general, though, try to allocate memory (especially scratch buffers) in the thread that will use it most, when that's convenient.
This may partially explain the less-than-perfect scaling with the number of threads, although more likely it has nothing to do with it, if @antonmalyshev's answer didn't help. Having each thread work on a contiguous range, instead of striding through the array with a stride of n_threads, should be better for L2/L1 cache efficiency (see the sketch below).
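A minimal sketch of that contiguous-range version (parse_chunked is a hypothetical name; parse(idx) is the same per-node function as in the question):

#include <algorithm>

void consumerpool::parse_chunked(size_t n_threads, size_t id)
{
    const size_t total = nodes.size();
    const size_t chunk = (total + n_threads - 1) / n_threads;   // round up
    const size_t begin = id * chunk;
    const size_t end   = std::min(total, begin + chunk);

    // Thread 'id' parses one contiguous [begin, end) range instead of striding.
    for (size_t idx = begin; idx < end; ++idx)
        parse(idx);
}

If the per-node cost varies a lot, handing out smaller blocks from a shared std::atomic<size_t> index keeps the threads balanced while still giving mostly contiguous access.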
node[] is a vector of pointers (so with 8 threads, each thread uses only 8 bytes of each 64-byte cache line it touches in node[]). However, each thread presumably touches far more memory in the pointed-to data structures and strings. If the node entries point to monotonically increasing positions in the other data structures and the strings, then strided access to node[] creates non-contiguous access patterns into most of the memory touched by a thread.
One possible benefit of the strided access pattern: strided means that if all threads run at more or less the same speed, they're all looking at the same part of memory at the same time. Threads that get ahead will slow down from L3 misses, while the other threads catch up because they see L3 hits. (Unless something happens that lets one thread fall too far behind, like the OS de-scheduling it for a time slice.)
So maybe L3 vs. RAM bandwidth/latency is more of an issue than efficient use of per-core L2/L1. Maybe with more threads, L3 bandwidth can't keep up with all the requests for the same cache lines from the L2 caches of multiple cores. (L3 isn't fast enough to satisfy constant L2 misses from all the cores at once, even if they all hit in L3.)
This argument applies to everything pointed to by node[] only if contiguous ranges of node[] point to contiguous ranges of other memory.