Manoj Rao bio photo

Manoj Rao

Your Average Common Man

Email Twitter Github

Edit:

I published this post a couple of days ago. I have figured since then the content of this post has flaws. So I am making a follow up post to this instead of simply changing this as a reminder to future self of why I went wrong here.

tl;dr:

  • I didn’t understand / use the Google Benchmark correctly.
  • This flaw in the logic led to a few serial errors.
  • I tried to take a shorter route to benchmarking with Google Benchmark. Specifically, I did not realize that the benchmark loops for several thousands of iterations until confidence values are achieved. helper_() and clear_up_() were bad shortcuts.
  • My original intent was muddled as I tried to quickfix “minor” errors.

Firstly, this post is a short walk through in C++ code, but feel free to jump to the conclusion for my New Year’s Resolution (Spoiler: It’s not an “Aha!” moment)

Finding Unique elements in you vector

So, here I was minding my business on a nice weekend. I just wanted to find the unique elements in my vector of integers. So I started reaching for the nearest std::set data structure. Turns out, this is likely implemented using RB Trees according to CPP Reference

What If … ?

I was wondering if this is in critical path of my code, then should I blindly rely on it? So I just looked up the sample impl on wikipedia linked above. My head spun for a few seconds to see that the impl invokes every family member for each node including GetUncle(Node* n). This is usually a bad sign, it smells.

So, instead of looking at the gargantuan implementation details in the STDLIB codebase, I decided to simply measure.

Setup Benchmark

So I setup a quick Google Benchmark harness:

cmake_minimum_required(VERSION 3.16)

# Compiler and Standard
set(CMAKE_C_STANDARD 99)
set(CMAKE_CXX_STANDARD 17)
set(CMAKE_C_COMPILER clang-8)
set(CMAKE_CXX_COMPILER clang++-8)

# set the project name
set(this_project project_name)

# RUST, Erlang, what else
project(${this_project} C CXX)

add_subdirectory(src)

find_package(benchmark REQUIRED)
target_link_libraries(${this_project} benchmark::benchmark)

set(CMAKE_BUILD_TYPE Release)
set(CMAKE_ENABLE_EXPORTS ON)
set(CMAKE_EXPORT_COMPILE_COMMANDS ON)

Measure the Truth

So I got the following code cranked up in no time.

Setup:

  • Get a few (1000) random numbers
  • Insert numbers into a set (RB Tree)
std::vector<int> nums_;
std::vector<int> nums_cp_;
auto helper_(int n) -> std::vector<int> {
    auto gen_node_ = [=]() -> int { return rand() % 100; };
    while (n--) { nums_.emplace_back(gen_node_()); }
    std::copy(nums_.begin(), nums_.end(), std::back_inserter(nums_cp_));
    return nums_;
}

auto clear_up_() -> void {
    nums_.clear();
    nums_cp_.clear();
}

static void BM_Overhead(benchmark::State& state) {
    for (auto _ : state) { 
        helper_(1000);
        clear_up_();
    }
}
BENCHMARK(BM_Overhead);

static void BM_Set(benchmark::State& state) {
    for (auto _ : state) {
        helper_(1000);
        std::set<int> s;
        for (const auto& e_ : nums_) { s.insert(e_); }
        clear_up_();
    }
}
BENCHMARK(BM_Set);

BENCHMARK_MAIN();

Result:

2021-01-04 20:57:17
Running ./src/project_name
Run on (16 X 3700 MHz CPU s)
CPU Caches:
  L1 Data 32 KiB (x8)
  L1 Instruction 64 KiB (x8)
  L2 Unified 512 KiB (x8)
  L3 Unified 8192 KiB (x2)
Load Average: 0.48, 0.75, 0.79
------------------------------------------------------------
Benchmark                  Time             CPU   Iterations
------------------------------------------------------------
BM_StringCreation       32.7 ns         32.6 ns     21474767
BM_StringCopy           79.9 ns         79.9 ns      8852194
BM_Overhead      45110 ns        45102 ns        15699
BM_Set                310303 ns       310241 ns         2283

If a string copy takes under 80ns a RB Tree insertion of 1000 numbers takes tens of thousands of times longer? My intuition was that something was not right. But it’s the hallowed stdlib itself! You know, it comes with all that claim of:

"It's been optimized and tested over decades, so you dummies cannot outdo it in reasonable time"

To be fair, I also remember the famous:

"It's not magic, you could muck it up as easily as another language"

If there’s one skill I have built over time, it’s running into situations that could only be described as: “you are using it akk wrong, it won’t work well in this case”

I’m somewhat of an expert in this myself :D

Am I doing something wrong?

There’s only one way to find out, what if I could swap all the unique elements to the front side of the vector and leave all the duplicates to the back? There’s an algorithm just for this in the standard library - it’s called std::unique()

Allegedly, one such impl of this algo is:

template <class ForwardIterator>
  ForwardIterator unique (ForwardIterator first, ForwardIterator last)
{
  if (first==last) return last;

  ForwardIterator result = first;
  while (++first != last)
  {
    if (!(*result == *first))  // or: if (!pred(*result,*first)) for version (2)
      *(++result)=*first;
  }
  return ++result;
}

Now, further if the array is already sorted then it must be very easy.

std::vector<int> nums_;
std::vector<int> nums_sorted;
std::vector<int> nums_cp_;
auto helper_(int n) -> std::vector<int> {
    auto gen_node_ = [=]() -> int { return rand() % 100; };
    while (n--) { nums_.emplace_back(gen_node_()); }
    std::copy(nums_.begin(), nums_.end(), std::back_inserter(nums_sorted));
    std::sort(nums_sorted.begin(), nums_sorted.end());
    std::copy(nums_.begin(), nums_.end(), std::back_inserter(nums_cp_));
    return nums_;
}

auto clear_up_() -> void {
    nums_.clear();
    nums_cp_.clear();
	nums_sorted.clear();
}

static void BM_Overhead(benchmark::State& state) {
    for (auto _ : state) { 
        helper_(1000);
        clear_up_();
    }
}
BENCHMARK(BM_Overhead);

static void BM_Set(benchmark::State& state) {
    for (auto _ : state) {
        helper_(1000);
        std::set<int> s;
        for (const auto& e_ : nums_) { s.insert(e_); }
        clear_up_();
    }
}
BENCHMARK(BM_Set);

static void BM_UnorderedSet(benchmark::State& state) {
    for (auto _ : state) {
        helper_(1000);
        std::unordered_set<int> s;
        for (const auto& e_ : nums_) { s.insert(e_); }
        clear_up_();
    }
}
BENCHMARK(BM_UnorderedSet);

static void BM_UniqueUnSort(benchmark::State& state) {
    for (auto _ : state) {
        helper_(1000);
        auto uniq_iter = std::unique(nums_.begin(), nums_.end());
        nums_.erase(uniq_iter, nums_.end());
        clear_up_();
    }
}
BENCHMARK(BM_UniqueUnSort);

static void BM_UniqueSort(benchmark::State& state) {
    for (auto _ : state) {
        helper_(1000);
        auto uniq_iter = std::unique(nums_.begin(), nums_.end());
        nums_.erase(uniq_iter, nums_.end());
        clear_up_();
    }
}
BENCHMARK(BM_UniqueSort);

BENCHMARK_MAIN();

Results:

So the simple stupid algo is close to 17x faster. Further, if the array is already sorted then it’s way faster! To be fair, the sorting is a significant overhead so unless your data is already sorted simply sticking with the swapping algo gives a significant speedup.

2021-01-04 20:57:17
Running ./src/project_name
Run on (16 X 3700 MHz CPU s)
CPU Caches:
  L1 Data 32 KiB (x8)
  L1 Instruction 64 KiB (x8)
  L2 Unified 512 KiB (x8)
  L3 Unified 8192 KiB (x2)
Load Average: 0.48, 0.75, 0.79
------------------------------------------------------------
Benchmark                  Time             CPU   Iterations
------------------------------------------------------------
BM_StringCreation       32.7 ns         32.6 ns     21474767
BM_StringCopy           79.9 ns         79.9 ns      8852194
BM_Overhead      45110 ns        45102 ns        15699
BM_Set                310303 ns       310241 ns         2283
BM_UnorderedSet       146815 ns       146788 ns         4702
BM_UniqueUnSort        60466 ns        60454 ns        12010

Benchmark

  • Overhead is present in all functions:

Perf Counters (For Fun):

sudo ~/bin/perf stat -e "sched:sched_process*,task:*,L1-dcache-loads,L1-dcache-load-misses,cycles,cs,faults,migrations" -d -d -d ./src/project_name 
<-dcache-loads,L1-dcache-load-misses,cycles,cs,faults,migrations" -d -d -d ./src/project_name 
2021-01-04 21:05:17
Running ./src/project_name
Run on (16 X 3700 MHz CPU s)
CPU Caches:
  L1 Data 32 KiB (x8)
  L1 Instruction 64 KiB (x8)
  L2 Unified 512 KiB (x8)
  L3 Unified 8192 KiB (x2)
Load Average: 0.69, 0.62, 0.68
------------------------------------------------------------
Benchmark                  Time             CPU   Iterations
------------------------------------------------------------
BM_StringCreation       31.8 ns         31.8 ns     22014537
BM_StringCopy           80.3 ns         80.2 ns      8700024
BM_Overhead      44357 ns        44334 ns        15554
BM_Set                315421 ns       315218 ns         2200
BM_UnorderedSet       151866 ns       151778 ns         4507
BM_UniqueUnSort        60268 ns        60228 ns        11901

 Performance counter stats for './src/project_name':

                 0      sched:sched_process_free                                    
                 1      sched:sched_process_exit                                    
                 0      sched:sched_process_wait                                    
                 0      sched:sched_process_fork                                    
                 1      sched:sched_process_exec                                    
                 0      sched:sched_process_hang                                    
                 0      task:task_newtask                                           
                 1      task:task_rename                                            
    30,338,551,811      L1-dcache-loads                                               (41.57%)
         2,017,065      L1-dcache-load-misses     #    0.01% of all L1-dcache hits    (41.64%)
    24,102,837,079      cycles                                                        (41.71%)
               564      cs                                                          
               146      faults                                                      
                 3      migrations                                                  
    30,422,089,436      L1-dcache-loads                                               (41.78%)
         1,842,248      L1-dcache-load-misses     #    0.01% of all L1-dcache hits    (41.85%)
   <not supported>      LLC-loads                                                   
   <not supported>      LLC-load-misses                                             
     2,444,321,168      L1-icache-loads                                               (41.82%)
         1,575,062      L1-icache-load-misses     #    0.06% of all L1-icache hits    (41.77%)
           245,849      dTLB-loads                                                    (41.70%)
            88,316      dTLB-load-misses          #   35.92% of all dTLB cache hits   (41.64%)
            70,125      iTLB-loads                                                    (41.57%)
            11,027      iTLB-load-misses          #   15.72% of all iTLB cache hits   (41.50%)
         1,039,464      L1-dcache-prefetches                                          (41.47%)
   <not supported>      L1-dcache-prefetch-misses                                   

       5.756089511 seconds time elapsed

       5.735565000 seconds user
       0.000000000 seconds sys

Conclusions:

  • Question everything in your workload,
  • Experiment for your usecase
  • Have fun. Sounds like a good New Year’s Resolution anyway!

Appendix:

This generated quite a few comments on Reddit So I will try to address a few valid points that were raised.

Quick Benchmark is a nice tool!

https://quick-bench.com/q/ekKw1U0yXiRF2-dYFmr8l_8naYc

“Do not use Debug Build”

This is valid in general while measuring performance. But in this case, my cmake is applicable only to the code I write! Remember, you won’t be building a debug version of the standard library just because you built your application with Debug flags. If you see the code, there’s hardly any code that’s not from the stdlib here.

I use Debug builds for experiments because it is helpful in profiling etc. But here, we are simply trying to illustrate that using simpler algos can be more useful. Unless, you are profiling a piece of algo/code you have come up with the compiler won’t affect the behavior of standard library code in any significant way based on the flags set in your code.

Nonetheless, this is a fair point, so I have modified accordingly and shared the numbers below.

“Why use a sorted structure”

Because it makes it easier to find duplicates so you would expect the final algo in our case Unique, to run faster.

Why set vs map?

Because intuitively you associate set as a container unique keys.

“I have a problem with the title?”

  • The Question is "What If?"
  • New Year’s Resolution is copied over from above:
Question everything in your workload,
Experiment for your usecase
Have fun.

Code Changes:

Thanks to many redditors for useful feedback: u/pi_stuff, u/FutureChrome, u/shimdar, and u/lostera


Bring Your Own Cause

If you think any info here has remotely helped you consider dropping a penny for this cause, just click me . You can visit https://www.bbc.com/news/world-asia-india-52672764 Unfortunately, there are plenty of sad things happening all over the world, if you have a different cause or charity you'd rather support please do. And if you did make a donation, please drop a note to me (annotated) or leave a comment here (anonymous is OK!) and I will use that as motivation to write more useful content here.

If you like topics such as this then please consider subscribing to my podcast. I talk to some of the stalwarts in tech and ask them what their favorite productivity hacks are:

Available on iTunes Podcast

Visit Void Star Podcast’s page on iTunes Podcast Portal. Please Click ‘Subscribe’, leave a comment.

Get it iTunes