
In an earlier post, I wrote about attempting to debug C++ applications and find the stack traces of threads blocked off-CPU. However, if you followed through to the end of that post, you could clearly see that nothing I had tried worked. I concluded by saying there would be a follow-up post. So here it is…

Writing BPF Programs

In an earlier post, I discussed, with an example, writing useful programs in C++ that let us gather important low-level system information. We are going to continue that trend here and write a simple BPF program compiled with BCC.

Why Off-CPU Stats?

A majority of system latencies that do not come directly from the program itself, and cannot be attributed to inefficient language semantics, arise from waiting on some event. If the wait is deliberate, such as waiting for an event, the kernel puts the task into an interruptible wait; if the wait is involuntary, i.e. an I/O wait, the task enters an uninterruptible wait. In either state the task cannot make progress until the action completes, and the scheduler moves the CPU on to other work in the meantime. This results in higher and more variable latency for the system, since the wait times are not under the control of the waiting party.
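To make that concrete, here is a trivial, hypothetical example of a deliberate wait: a thread that calls read() on an empty pipe goes off-CPU into an interruptible sleep until data arrives. It is exactly this kind of blocked stack that the tool in this post captures.

    #include <unistd.h>

    int main() {
        int fds[2];
        char buf[1];
        if (pipe(fds) != 0)
            return 1;
        // Nothing ever writes to fds[1], so read() blocks indefinitely:
        // the kernel parks this task in an interruptible sleep and the
        // CPU is free to run other work while we wait off-CPU.
        read(fds[0], buf, 1);
        return 0;
    }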

What Is Off-CPU Analysis?

For a detailed treatment of off-CPU analysis, read Brendan Gregg’s article. In this entry, we will simply set up the initial BPF program so we can collect data from the system for both on- and off-CPU stats.

BPF Code

const std::string BPF_PROGRAM = R"(
#include <uapi/linux/ptrace.h>
#include <linux/sched.h>

struct key_t {
    u32 pid;
    u32 tgid;
    int user_stack_id;
    int kernel_stack_id;
    char name[TASK_COMM_LEN];
};
BPF_HASH(counts, struct key_t, u64);
BPF_HASH(start, u32, u64);
BPF_STACK_TRACE(stack_traces, 16384);

int my_on_cpu(struct pt_regs *ctx, struct task_struct *prev) {
    u32 pid = prev->pid;
    u32 tgid = prev->tgid;
    u64 ts, *tsp;

    // record previous thread sleep time
    ts = bpf_ktime_get_ns();
    start.update(&pid, &ts);

    // get the current thread's start time
    pid = bpf_get_current_pid_tgid();
    tgid = bpf_get_current_pid_tgid() >> 32;
    tsp = start.lookup(&pid);
    // missed start or filtered
    if (tsp == 0) {
        return 0;
    }

    // calculate current thread's delta time
    u64 delta = bpf_ktime_get_ns() - *tsp;
    start.delete(&pid);
    delta = delta / 1000;

    // create map key
    struct key_t key = {};

    key.pid = pid;
    key.tgid = tgid;
    key.user_stack_id = stack_traces.get_stackid(ctx, BPF_F_USER_STACK);
    key.kernel_stack_id = stack_traces.get_stackid(ctx, 0);
    bpf_get_current_comm(&key.name, sizeof(key.name));

    counts.update(&key, &delta);

    return 0;
})";

What Are Kprobes?

According to the Linux kernel documentation, kprobe-based event tracing uses the standard event-tracing infrastructure but hooks into kprobes instead of tracepoints. Here’s a blurb from the documentation that introduces these events:

These events are similar to tracepoint based events. Instead of Tracepoint, this is based on kprobes (kprobe and kretprobe). So it can probe wherever kprobes can probe (this means, all functions except those with __kprobes/nokprobe_inline annotation and those marked NOKPROBE_SYMBOL). Unlike the Tracepoint based event, this can be added and removed dynamically, on the fly.

Important Pieces of Code

The section above defined the commonly used structures and captured a couple of important timestamps.

Compute the Delta Time

We attach our kprobe to finish_task_switch(), the function in the scheduler code that is called to finish switching the current task into the next one.

    // record previous thread sleep time
    ts = bpf_ktime_get_ns();
    start.update(&pid, &ts);

    // get the current thread's start time
    pid = bpf_get_current_pid_tgid();
    tgid = bpf_get_current_pid_tgid() >> 32;
    tsp = start.lookup(&pid);
    // missed start or filtered
    if (tsp == 0) {
        return 0;
    }

    // calculate current thread's delta time
    u64 delta = bpf_ktime_get_ns() - *tsp;
    start.delete(&pid);
    delta = delta / 1000;
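For completeness, here is a minimal sketch of the user-space side that compiles the program and attaches the probe through BCC’s C++ API. The structure is my assumption about the harness (error handling trimmed), not code lifted from the original program:

    #include <iostream>
    #include "BPF.h"  // BCC's C++ API

    int main() {
        ebpf::BPF bpf;
        auto init_res = bpf.init(BPF_PROGRAM);
        if (init_res.code() != 0) {
            std::cerr << init_res.msg() << std::endl;
            return 1;
        }
        // Fire my_on_cpu every time a context switch completes.
        auto attach_res = bpf.attach_kprobe("finish_task_switch", "my_on_cpu");
        if (attach_res.code() != 0) {
            std::cerr << attach_res.msg() << std::endl;
            return 1;
        }
        // ... let the workload run, then read back the tables ...
        return 0;
    }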

Map for BCC Program

Now we store a map whose key contains the PID and TGID. We also save the user stack ID and kernel stack ID. This combination is crucial for obtaining a stack trace that can tell us whether a thread is waiting off-CPU.

    // create map key
    struct key_t key = {};

    key.pid = pid;
    key.tgid = tgid;
    key.user_stack_id = stack_traces.get_stackid(ctx, BPF_F_USER_STACK);
    key.kernel_stack_id = stack_traces.get_stackid(ctx, 0);
    bpf_get_current_comm(&key.name, sizeof(key.name));

    counts.update(&key, &delta);

Post-Processing Data

Here you can see how the data obtained from these BPF tables is post-processed.
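For context, counts_tbl_offline and stack_stuff below are user-space handles to the BPF-side tables. Here is a sketch of how they might be set up, assuming the ebpf::BPF object from the earlier sketch; note that the C++-side key_t must mirror the BPF-side struct layout exactly, and lost_stacks is just a plain counter:

    // User-space mirror of the BPF-side key (layouts must match).
    struct key_t {
        uint32_t pid;
        uint32_t tgid;
        int user_stack_id;
        int kernel_stack_id;
        char name[16];  // TASK_COMM_LEN
    };

    int lost_stacks = 0;
    auto counts_tbl = bpf.get_hash_table<key_t, uint64_t>("counts");
    auto counts_tbl_offline = counts_tbl.get_table_offline();
    auto stack_stuff = bpf.get_stack_table("stack_traces");

With those handles in place, the loop walks each entry and symbolizes its stacks: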

    for (auto it : counts_tbl_offline) {
        std::cout << "TGID: " << it.first.tgid << " (" << it.first.name << ") "
                  << "made " << it.second
                  << " was off CPU on the following stack: " << std::endl;
        if (it.first.kernel_stack_id >= 0) {
            std::cout << "  Kernel Stack:" << std::endl;
            auto syms = stack_stuff.get_stack_symbol(it.first.kernel_stack_id, -1);
            for (auto sym : syms)
                std::cout << "    " << sym << std::endl;
        } else {
            // -EFAULT normally means the stack is not available and not an error
            if (it.first.kernel_stack_id != -EFAULT) {
                lost_stacks++;
                std::cout << "    [Lost Kernel Stack " << it.first.kernel_stack_id << "]"
                          << std::endl;
            }
        }
        if (it.first.user_stack_id >= 0) {
            std::cout << "  User Stack:" << std::endl;
            auto syms = stack_stuff.get_stack_symbol(it.first.user_stack_id, it.first.pid);
            for (auto sym : syms)
                std::cout << "    " << sym << std::endl;
        } else {
            if (it.first.user_stack_id != -EFAULT) {
                lost_stacks++;
                std::cout << "    [Lost User Stack " << it.first.user_stack_id << "]"
                          << std::endl;
            }
        }
    }

Demo

I compiled the program against the BCC engine, producing an ELF binary.
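For reference, a build line along these lines should work, assuming libbcc and its headers are installed in the default search paths (exact flags vary by distro):

    g++ -std=c++11 -o MethodOffCPU MethodOffCPU.cc -lbcc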

sudo ./MethodOffCPU

The following stacks are a tiny snippet of what I obtained.

TGID: 10 (rcu_sched) made 7998 was off CPU on the following stack:
  Kernel Stack:
    finish_task_switch
    schedule
    schedule_timeout
    rcu_gp_kthread
    kthread
    ret_from_fork
TGID: 27947 (python3) made 6 was off CPU on the following stack:
  Kernel Stack:
    finish_task_switch
    schedule
    exit_to_usermode_loop
    prepare_exit_to_usermode
    swapgs_restore_regs_and_return_to_usermode
  User Stack:
    [UNKNOWN]
    [UNKNOWN]
TGID: 17976 (gnome-terminal-) made 59 was off CPU on the following stack:
  Kernel Stack:
    finish_task_switch
    schedule
    schedule_hrtimeout_range_clock
    schedule_hrtimeout_range
    poll_schedule_timeout.constprop.11
    do_sys_poll
    __x64_sys_poll
    do_syscall_64
    entry_SYSCALL_64_after_hwframe
  User Stack:
    poll
	.....
	.....

Conclusion

We have overcome all the obstacles to getting off-CPU stack traces along with the time each stack spent blocked. I think this is a highly useful tool. BPF allows you to generate this kind of low-level information without much experience.

Source

  • Brendan Gregg’s website: http://www.brendangregg.com
  • IOVISOR/BCC Open Source Project on GitHub

My Podcast!

If you like topics such as this, then please consider subscribing to my podcast. I talk to some of the stalwarts in tech and ask them what their favorite productivity hacks are.


Visit the Void Star Podcast page on the iTunes Podcast Portal. Please click ‘Subscribe’ and leave a comment.
