As programmers, we are often led to use libraries or even middleware. They are useful, and sometimes for a good amount of money: they do what you need and they do it right!
This time, I simply chose the STL; okay, right: the STL in Visual Studio 8 (Dinkumware, I think).
Please consider this simple yet useless code snippet:
#pragma omp parallel
{
    #pragma omp for
    for( int i = 0; i < 10; ++i )
    {
        for( int j = 0; j < 10000; ++j )
        {
            std::stringstream K;
            for( int k = 0; k < 25; ++k )
                K << fibo(i) << std::endl;
            totalcount += K.str().size();
        }
    }
}
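(fibo() is not shown above; for completeness, here is a hedged sketch of the kind of helper the snippet assumes, a naive recursive Fibonacci, though any small CPU-bound integer function would do.)

// Hypothetical helper assumed by the snippet: a naive recursive
// Fibonacci, just to have an integer value to stream.
int fibo(int n)
{
    if( n < 2 )
        return n;
    return fibo(n - 1) + fibo(n - 2);
}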
The reader will notice the subtle #pragma omp, which means the code uses OpenMP and which means: the code is PARALLEL (hurrah!)
Actually, just
#pragma omp for
for( int i = 0; i < 10; ++i )
is parallelized: OpenMP creates 4 threads (on my machine), and each of them processes its share of what is inside the loop.
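(The 4 is just the default on my machine; if you want to see or change the team size yourself, the OpenMP runtime exposes it. A quick sketch:)

#include <omp.h>
#include <cstdio>

int main()
{
    // Default team size, usually the number of logical cores.
    printf("max threads: %d\n", omp_get_max_threads());

    // omp_set_num_threads(2); // uncomment to force a smaller team

    #pragma omp parallel
    {
        // Each thread of the team reports its own id.
        printf("hello from thread %d\n", omp_get_thread_num());
    }
    return 0;
}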
And thus my goal is reached: a lot of calls are made in parallel to std::stringstream.
How does this behave on a Q6600?
Single threaded: 2.312s taken
Multi threaded: 69.609s taken
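(For reference, here is one way to take such wall-clock timings with the OpenMP runtime; the numbers above were not necessarily measured exactly this way.)

#include <omp.h>
#include <cstdio>

// Hedged timing sketch: wrap the benchmark in a function and time it
// with the OpenMP wall clock.
void timed_run(void (*benchmark)())
{
    double t0 = omp_get_wtime();
    benchmark();                      // e.g. the stringstream loop above
    double t1 = omp_get_wtime();
    printf("%.3fs taken\n", t1 - t0);
}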
WTF?
My first reaction was: yes, but those stringstreams are making allocations all over the place, and allocations kill performance.
Right, allocation is the root of all evil… But there's still hope… The STL offers a proper way to define custom allocators, which in turn can get rid of the underlying calls to malloc/free.
You can consider a pool allocator such as this:
template< typename T, size_t defaultSize >
class pool_allocator : public std::allocator<T>
{
    T vector[defaultSize];   // fixed-size storage, carved out linearly
    size_t pos;              // next free slot in the pool

public:
    typedef size_t size_type;
    typedef ptrdiff_t difference_type;
    typedef T* pointer;
    typedef const T* const_pointer;
    typedef T& reference;
    typedef const T& const_reference;
    typedef T value_type;

    pool_allocator() { pos = 0; }

    // Nothing is ever returned to the pool: deallocate is a no-op.
    void deallocate(pointer p, size_type n) {}

    size_type max_size() const throw() { return defaultSize - pos; }

    pointer allocate(size_type amount)
    {
        if( pos + amount <= defaultSize )
        {
            size_t oldpos = pos;
            pos += amount;
            return &vector[oldpos];
        }
        return 0; // should really throw std::bad_alloc here
    }

    void construct(pointer p, const value_type& x) { new(p) value_type(x); }
    void destroy(pointer p) { p->~value_type(); }

    template <class U>
    struct rebind
    {
        typedef pool_allocator< U, defaultSize > other;
    };
};
Sorry, I think this code is a) full of mistakes and b) incompatible with GCC.
Then, if you replace std::stringstream in the first code snippet with this:
typedef std::basic_stringstream< char, std::char_traits< char >, pool_allocator< char, 1024 > > customstringstream;
customstringstream K;
You end up with code that does no heap allocation at all…
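One nice side effect of the typedef: str() now returns a basic_string parameterized on the same allocator, so the surrounding code still compiles unchanged. A hedged sketch of the inner loop body with the new type:

// Inner loop body from the first snippet, using the pooled stream.
customstringstream K;
for( int k = 0; k < 25; ++k )
    K << fibo(i) << std::endl;
// str() returns a basic_string that uses pool_allocator as well,
// so .size() works exactly as before.
totalcount += K.str().size();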
But still, it takes 69s to perform everything
A quick break into the debugger during the 69s of processing showed exactly where the issue is:
_BEGIN_LOCK(_LOCK_LOCALE)
xlocale Line 497 ( std::use_facet<std::numpunct<char> >(const std::locale & _Loc={…}) )
The STL has locks internally, which is normal for a thread-safe library, but sometimes they can prevent you from being thread-efficient.
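That use_facet call is nothing the snippet does explicitly; it is roughly what the stream performs internally every time it formats a number. As a rough illustration (not the exact library code):

// Roughly what operator<< ends up doing for each number it formats:
// fetch the numeric punctuation facet from the stream's locale.
// In this implementation, std::use_facet first takes a global lock
// (the _LOCK_LOCALE seen in the call stack above).
const std::numpunct<char>& punct =
    std::use_facet< std::numpunct<char> >(K.getloc());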
Bottom Line:
As programmers, we are often led to use various tools with varying efficiency. Some of them are gracefully thread-safe; the STL is a good example.
Depending on the implementation, that thread safety might imply locks and critical sections. The abuse of locks and critical sections may lead to a "trashing" of performance.
In our particular case, if for any reason two or more threads start using std::stringstream very intensively, it will lead to stalling.
As an alternative, you can consider using snprintf, which does scale as you add threads.
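Here is a hedged sketch of what that could look like for the inner loop of the first snippet (same fibo() helper assumed; each thread formats into its own stack buffer, so no stringstream and none of the locale facet lookups above are involved; on older MSVC runtimes the function may be spelled _snprintf):

#include <cstdio>

// Sketch of the snprintf alternative for the inner loop body.
char buffer[64];
size_t localcount = 0;
for( int k = 0; k < 25; ++k )
{
    // Format fibo(i) followed by a newline into a stack buffer.
    int written = snprintf(buffer, sizeof(buffer), "%d\n", fibo(i));
    if( written > 0 )
        localcount += (size_t)written;
}
totalcount += localcount;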