c++ - Performance decreases with a higher number of threads (no synchronization)
I have a data structure (a vector) whose elements have to be parsed by a function, and the elements can be parsed by different threads.
The following is the parsing method:
void consumerpool::parse(size_t n_threads, size_t id)
{
    for (size_t idx = id; idx < nodes.size(); idx += n_threads)
    {
        // parse node
        //parse(nodes[idx]);
        parse(idx);
    }
}
where:

n_threads is the total number of threads
id is the (unique) index of the current thread
and the threads are created as follows:
std::vector<std::thread> threads;
for (size_t i = 0; i < n_threads; i++)
    threads.emplace_back(&consumerpool::parse, this, n_threads, i);
Unfortunately, even though this method works, the performance of the application decreases if the number of threads is too high. I would like to understand why the performance decreases even though there is no synchronization between these threads.
The following are the elapsed times (between the start of the threads and the return of the last join()) according to the number of threads used:
- 2 threads: 500 ms
- 3 threads: 385 ms
- 4 threads: 360 ms
- 5 threads: 475 ms
- 6 threads: 580 ms
- 7 threads: 635 ms
- 8 threads: 660 ms
The time necessary for thread creation is between 1 and 2 ms. The software has been tested using a release build. The following is my configuration:
2x Intel(R) Xeon(R) CPU E5507 @ 2.27GHz
Maximum speed: 2.26 GHz
Sockets: 2
Cores: 8
Logical processors: 8
Virtualization: Enabled
L1 cache: 512 KB
L2 cache: 2.0 MB
L3 cache: 8.0 MB
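For reference, the elapsed times above were taken between the launch of the threads and the return of the last join(). A minimal sketch of such a measurement (assuming pool is an existing consumerpool instance):

#include <chrono>
#include <cstddef>
#include <iostream>
#include <thread>
#include <vector>

void run_and_time(consumerpool& pool, size_t n_threads)
{
    auto start = std::chrono::steady_clock::now();

    std::vector<std::thread> threads;
    for (size_t i = 0; i < n_threads; i++)
        threads.emplace_back(&consumerpool::parse, &pool, n_threads, i);
    for (auto& t : threads)
        t.join();

    auto stop = std::chrono::steady_clock::now();
    std::cout << std::chrono::duration_cast<std::chrono::milliseconds>(stop - start).count()
              << " ms\n";
}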
Edit:

What the parse() function does is the following:
// Data shared between the threads (around 300k elements)
std::vector<std::unique_ptr<foo>> vfoo;
std::vector<rapidxml::xml_node<>*> nodes;
std::vector<std::string> layers;

void parse(int idx)
{
    auto& p = vfoo[idx];   // take a reference: a unique_ptr cannot be copied

    // p->parse() allocates memory according to the content of the XML node
    if (!p->parse(nodes[idx], layers))
        vfoo[idx].reset();
}
Update:

We still don't have a lot of info about the memory access patterns of parse(), and how much time it spends reading input data from memory vs. how much time is spent writing/reading private scratch memory.
You say p->parse() "allocates memory according to the content of the XML node". If it frees it again, you may see a big speedup from keeping a big-enough scratch buffer allocated in each thread. Memory allocation/deallocation is a "global" thing that requires synchronization between threads. A thread-aware allocator can hopefully handle an allocate/free/allocate/free pattern by satisfying allocations from memory just freed by that thread, so it is probably still hot in the private L1 or L2 cache on that core.
Use some kind of profiling to find the real hotspots. It might be memory allocation/deallocation, or it might be the code that reads memory.
Your dual-socket Nehalem Xeon doesn't have hyperthreading, so you can't be running into issues with threads slowing each other down because a non-HT-aware OS schedules two of them on the two logical cores of the same physical core.
You should investigate with performance counters (e.g. Linux perf stat, or Intel's VTune) whether you're getting more cache misses per thread once you pass 4 threads. Nehalem uses large shared (for the whole socket) L3 (aka last-level) caches, so more threads running on the same socket create more pressure on that. The relevant perf events will be something like LLC_something, IIRC.
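For example, a possible invocation using perf's generic cache events (./your_program is just a placeholder; the Nehalem-specific last-level-cache event names can be looked up with perf list):

perf stat -e LLC-loads,LLC-load-misses,L1-dcache-load-misses ./your_program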
You should also look at L1/L2 misses, and see how they scale with the number of threads, and how that changes with strided vs. contiguous access to node[].
There are other perf counters you can check for false sharing (one thread's private variable sharing a cache line with another thread's private variable, so the cache line bounces between cores). Really, look for any perf events that change with the number of threads; that could point the way towards an explanation.
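As an illustration of what false sharing means (not taken from your code), padding per-thread data out to a cache line removes the interference. A minimal sketch, assuming C++17 so std::vector's default allocator honours the 64-byte alignment:

#include <cstddef>
#include <vector>

// Illustration only, not from the original code.
struct counter_packed { size_t n; };                // 8 of these share one 64-byte cache line
struct alignas(64) counter_padded { size_t n; };    // each counter owns a whole cache line

// Each thread increments only its own element. With counter_packed the shared
// cache line bounces between cores (false sharing); with counter_padded it does not.
void count_up(std::vector<counter_padded>& counters, size_t id, size_t iters)
{
    for (size_t i = 0; i < iters; ++i)
        counters[id].n++;
}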
A multi-socket system like your 2-socket Nehalem will have NUMA (Non-Uniform Memory Access). A NUMA-aware OS will try to allocate memory that's fast for the core doing the allocation.
So presumably your buffer has all of its physical pages in memory attached to one of your two sockets. In this case it's probably not something you can or should avoid, since I assume you're filling the array in a single-threaded way before handing it off to multiple threads for parsing. In general, though, try to allocate memory (especially scratch buffers) in the thread that will use it most, when that's convenient.
This may partially explain the less-than-perfect scaling with the number of threads, although more likely it has nothing to do with it, if @antonmalyshev's answer didn't help. Having each thread work on a contiguous range, instead of striding through the array with a stride of n_threads, should be better for L2/L1 cache efficiency (see the sketch below).
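A minimal sketch of that contiguous-range version (parse_chunked is a hypothetical name; parse(idx) is the same per-node function as in the question):

#include <algorithm>

void consumerpool::parse_chunked(size_t n_threads, size_t id)
{
    const size_t total = nodes.size();
    const size_t chunk = (total + n_threads - 1) / n_threads;   // round up
    const size_t begin = id * chunk;
    const size_t end   = std::min(total, begin + chunk);

    // Thread 'id' parses one contiguous [begin, end) range instead of striding.
    for (size_t idx = begin; idx < end; ++idx)
        parse(idx);
}

If the per-node cost varies a lot, handing out smaller blocks from a shared std::atomic<size_t> index keeps the threads balanced while still giving mostly contiguous access.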
node[] is a vector of pointers (so with 8 threads, each thread uses only 8 bytes of each 64-byte cache line it touches in node[]). However, each thread presumably touches far more memory in the pointed-to data structures and strings. If the node entries point to monotonically increasing positions in the other data structures and the strings, then strided access to node[] creates non-contiguous access patterns into most of the memory touched by a thread.
One possible benefit of the strided access pattern: strided means that if all threads run at more or less the same speed, they're all looking at the same part of memory at the same time. Threads that get ahead will slow down from L3 misses, while the other threads catch up because they see L3 hits. (Unless something happens that lets one thread fall too far behind, like the OS de-scheduling it for a time slice.)
So maybe L3 vs. RAM bandwidth/latency is more of an issue than efficient use of per-core L2/L1. Maybe with more threads, L3 bandwidth can't keep up with all the requests for the same cache lines from the L2 caches of multiple cores. (L3 isn't fast enough to satisfy constant L2 misses from all the cores at once, even if they all hit in L3.)
This argument applies to everything pointed to by node[] only if contiguous ranges of node[] point to contiguous ranges of other memory.