Little Concurrency Fun With STL

As a programmer, we are often lead to use libraries or even Middleware. They are useful; and sometime, for a big amount of money: they do what you need and they do it right!

This time, I simply choose the STL so; okay, right: the STL in Visual Studio 8 (dinkumware I think)

please consider this simple yet useless code snippet

#pragma omp parallel      
    #pragma omp for
    for(int i = 0; i < 10; ++i )
        for(int j = 0; j < 10000; ++j )
            std::stringstream K;
            for(int k = 0; k < 25; ++k ) K << fibo(i) << std::endl;
            totalcount += K.str().size();

The reader will notice the subtle #pragma omp, which means the code uses OpenMP and which means: the code is PARALLEL (hurrah!)

Actually, just

#pragma omp for
for(int i = 0; i < 10; ++i )

is parallelized, OpenMP creates 4 threads (on my machine), each of them will process what is inside

And then, my goal is reached: A lot of call are made in parallel to std::stringstream

How does this behave on a Q6600 ?

Single threaded: 2.312s taken
Multi threaded: 69.609s taken


My first reaction was: yes but those stringstream are making allocations all over the place, allocations clutter performances
Right, allocation is the root of all evil… But there's still hope… The STL offers a proper way to define custom allocations, which in turn can dump sub-calls to malloc/free

You can consider such pool allocator :

template< typename T, size_t defaultSize >
class pool_allocator : public std::allocator<T>
    T vector[defaultSize];
    size_t pos;
    typedef size_t    size_type;
    typedef ptrdiff_t difference_type;
    typedef T*        pointer;
    typedef const T*  const_pointer;
    typedef T &        reference;
    typedef const T &  const_reference;
    typedef T         value_type;
        pos = 0;
    void deallocate(pointer p, size_type n) {}
    size_type max_size() const throw() { return defaultSize - pos; }
    pointer allocate(size_type amount)
        if( pos + amount <= defaultSize )
            size_t oldpos = pos;
            pos += amount;
            return &vector[oldpos];
        return 0;        //            throw();
    void construct(pointer p, const value_type& x) { 
        new(p) value_type(x); 
    void destroy(pointer p) { p->~value_type(); }
    template <class U> 
    struct rebind { typedef pool_allocator< U, defaultSize> other; };

Sorry, I think this code is a) full of mistake b) incompatible with GCC

Then if you remplace std::stringstream in the first code snippet by this :

typedef std::basic_stringstream< char, std::char_traits< char >, pool_allocator< char, 1024 > > customstringstream;
customstringstream K;

You end up with a code that do no allocation at all…

But still, it takes 69s to perform everything

A quick break during the 69s of processing showed exactly where the issue is:


xlocale Line 497 ( std::use_facet<std::numpunct<char> >(const std::locale & _Loc={…}) )

the STL has locks internally, which is normal for a thread-safe library but sometime can prevent you from being thread-efficient

Bottom Line:

As programmers, we are often leaded to use various tool with various efficiency. Some of them are gracefully thread safe, the STL is a good example.

Depending the implementation, that thread safety might imply locks and critical sections. The abuse of locks and critical sections may lead to a "trashing" of performances.

In our particular case, if for any reason, two or more threads start using std::stringstream very intensively, it will lead to stalling

As an alternative, you can consider using snprintf which matches the performance improvement scaling.

Unless otherwise stated, the content of this page is licensed under Creative Commons Attribution-ShareAlike 3.0 License