Why BCC?
Writing BPF programs can be hard, so we need a toolkit that makes writing them easier, at least to a certain extent.
What is BCC?
BCC is a toolkit for creating front-end programs that can efficiently perform kernel- and user-level tracing. It comes with several useful tools and examples. BCC makes writing BPF programs less painful: it includes a C wrapper around LLVM, and front-ends in Python, Lua, and C++.
But … Perf?
With perf you can debug applications or capture events for the fun of it. However, it is a sampling profiler, which collects events periodically. It then “estimates” the system performance statistics based on the collected samples and hardware-based performance monitoring counters. Perf (and any sampling profiler) can add non-trivial amounts of overhead, and we are not even talking about the compute spent post-processing the captured samples.
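To make the “estimates” part concrete, here is a toy model of a sampling counter. It is only a sketch under my own assumptions (the names SampledCounter and estimate_events are made up, and this is not how perf is implemented internally): a sample fires once every N events and is credited with N events, so the estimate can lag the true count by up to N - 1.

```cpp
#include <cassert>
#include <cstdint>

// Toy model of a sampling counter (hypothetical, not perf's real code):
// one sample is taken every `period` events and credited with `period` events.
struct SampledCounter {
    std::uint64_t period;        // events per sample, must be > 0
    std::uint64_t pending = 0;   // events seen since the last sample
    std::uint64_t estimate = 0;  // sum of the events credited so far

    explicit SampledCounter(std::uint64_t p) : period(p) { assert(p > 0); }

    // Record one real event; every `period`-th event produces a sample.
    void on_event() {
        if (++pending == period) {
            estimate += period;
            pending = 0;
        }
    }
};

// Feed `true_events` events through a counter and return the estimated count.
inline std::uint64_t estimate_events(std::uint64_t true_events,
                                     std::uint64_t period) {
    SampledCounter counter(period);
    for (std::uint64_t i = 0; i < true_events; ++i) {
        counter.on_event();
    }
    return counter.estimate;
}
```

With a period of 100, 20450 real events are reported as 20400, and anything that produces fewer than 100 events is not seen at all.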
Pre-requisites
To walk through the examples, you will need to install the BCC tools. To find out the list of hardware events supported on your machine, run perf list; you should see something like the output at the end of this page, copied verbatim from my machine (search for ‘COLLAPSE’).
To run the Branch Prediction example illustrated here, make sure you see support for the following two hardware events:
branch-instructions OR branches [Hardware event]
branch-misses [Hardware event]
I am running a spanking new Linux Kernel I built and you can too, if you have an hour.
╭─ ~/s/bcc/bcc/build master ● ? ⍟1 ✔ 26.99G RAM 0.20 L
╰─ cat /proc/version
Linux version 5.5.0-rc4+ (manoj@manoj-desktop) (gcc version 7.4.0 (Ubuntu 7.4.0-1ubuntu1~18.04.1)) #1 SMP Fri Jan 3 00:01:03 PST 2020
I also compiled the perf binary from the same Kernel Tree since perf requires you to install the Kernel headers for the same version.
Show Me The Code!
A BCC program tends to have a common structure which is mostly along the lines of:
- Define the BPF Program which is going to be run by eBPF in the Kernel
- Attach the Perf Event or Probe in the Kernel
- Run whatever code you wanted to profile
- Detach the Perf Event or Probe
- Process the Result
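The steps above can be sketched without any BPF at all. The snippet below is a hypothetical illustration, not code from the BCC project: ScopedProbe and run_profile_session are names I made up, and the attach and detach callbacks only record that they were called. It uses an RAII guard so that detach always runs once the profiled region ends, which is one way to keep steps 2 and 4 paired (the walkthrough itself calls detach explicitly).

```cpp
#include <cassert>
#include <functional>
#include <string>
#include <utility>
#include <vector>

// Hypothetical helper: runs `attach` on construction and `detach` on destruction.
class ScopedProbe {
public:
    ScopedProbe(std::function<void()> attach, std::function<void()> detach)
        : detach_(std::move(detach)) {
        assert(attach && detach_);  // both callbacks must be set
        attach();
    }
    ~ScopedProbe() { detach_(); }
    ScopedProbe(const ScopedProbe&) = delete;
    ScopedProbe& operator=(const ScopedProbe&) = delete;

private:
    std::function<void()> detach_;
};

// Walks through the five steps and returns the order in which they happened.
inline std::vector<std::string> run_profile_session() {
    std::vector<std::string> log;
    log.push_back("define");  // 1. define the BPF program
    {
        ScopedProbe probe([&log] { log.push_back("attach"); },   // 2. attach
                          [&log] { log.push_back("detach"); });  // 4. detach
        log.push_back("run");  // 3. run whatever we want to profile
    }                          // leaving the scope triggers the detach
    log.push_back("process");  // 5. process the result
    return log;
}
```

Calling run_profile_session() returns the steps in the order define, attach, run, detach, process.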
The program we are going to walk through is similar to the LLCStat example, which is present in the BCC project but does not seem to run on any hardware that I tried it on. However, I thought it would still be interesting to profile the branch prediction (or misprediction) rate on all the cores and all PIDs running during a specified period of time.
Define BPF Program
const std::string BPF_PROGRAM = R"(
#include <linux/ptrace.h>
#include <uapi/linux/bpf_perf_event.h>

struct event_t {
    int cpu;
    int pid;
    char name[16];
};

BPF_HASH(ref_count, struct event_t);
BPF_HASH(miss_count, struct event_t);

static inline __attribute__((always_inline)) void get_key(struct event_t* key) {
    key->cpu = bpf_get_smp_processor_id();
    key->pid = bpf_get_current_pid_tgid();
    bpf_get_current_comm(&(key->name), sizeof(key->name));
}

int on_branch_miss(struct bpf_perf_event_data *ctx) {
    struct event_t key = {};
    get_key(&key);
    u64 zero = 0, *val;
    val = miss_count.lookup_or_try_init(&key, &zero);
    if (val) {
        (*val) += ctx->sample_period;
    }
    return 0;
}

int on_branch_ref(struct bpf_perf_event_data *ctx) {
    struct event_t key = {};
    get_key(&key);
    u64 zero = 0, *val;
    val = ref_count.lookup_or_try_init(&key, &zero);
    if (val) {
        (*val) += ctx->sample_period;
    }
    return 0;
}
)";
The gist of the above code is that we define two data structures, both BPF hash tables. They are hash tables in the kernel; however, it is not important to know exactly how they work internally. We just need to know that they basically function like hash tables. Their key is of type struct event_t and they store a value of type uint64_t. In addition, there are a couple of functions defined which look up a key of type struct event_t and add the event's sample period to the value stored for it. You have probably connected the dots here :)
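If the kernel-side tables feel abstract, here is a user-space analogue of what the two handlers do to them. This is only a sketch: std::map stands in for BPF_HASH, and EventLess, CountTable, make_key and add_sample are names I introduced for illustration. The idea is the same though: find the entry for the key, start it at zero if it is missing, and add the sample period.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>
#include <map>

// Same fields as the key struct in the BPF program.
struct event_t {
    int cpu;
    int pid;
    char name[16];
};

// Ordering so that event_t can be a std::map key (user-space stand-in only).
struct EventLess {
    bool operator()(const event_t& a, const event_t& b) const {
        if (a.cpu != b.cpu) return a.cpu < b.cpu;
        if (a.pid != b.pid) return a.pid < b.pid;
        return std::strncmp(a.name, b.name, sizeof(a.name)) < 0;
    }
};

using CountTable = std::map<event_t, std::uint64_t, EventLess>;

// Build a key from caller-supplied values (the BPF code asks the kernel instead).
inline event_t make_key(int cpu, int pid, const char* comm) {
    event_t key = {};  // zero-fill first, as the BPF handlers do
    key.cpu = cpu;
    key.pid = pid;
    std::strncpy(key.name, comm, sizeof(key.name) - 1);  // keep the trailing NUL
    return key;
}

// Look the key up, start at zero if it is absent, then add the sample period.
inline void add_sample(CountTable& table, const event_t& key,
                       std::uint64_t sample_period) {
    assert(sample_period > 0);
    table[key] += sample_period;  // operator[] creates missing entries as 0
}
```

Two samples for the same (cpu, pid, name) triple land in one entry, while a sample from another CPU or process gets its own entry.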
But who’s calling these functions? And why are they enclosed in double quotes and stored like a string object?
Init and Attach Events:
ebpf::BPF bpf;
auto init_res = bpf.init(BPF_PROGRAM);
// ...
auto attach_ref_res =
    bpf.attach_perf_event(PERF_TYPE_HARDWARE, PERF_COUNT_HW_BRANCH_INSTRUCTIONS,
                          "on_branch_ref", 100, 0);
// ...
auto attach_miss_res =
    bpf.attach_perf_event(PERF_TYPE_HARDWARE, PERF_COUNT_HW_BRANCH_MISSES,
                          "on_branch_miss", 100, 0);
// ...
BPF is the class defined by the BCC toolkit which provides a lot of utilities for completing BPF-related tasks effectively and efficiently. This is the biggest value of the BCC framework, in my opinion. We initialize the bpf object with the BPF program which we wrote above. This code is going to be run in the kernel’s BPF JIT VM, and it is injected by the BPF object. “How exactly?” you ask: through the attach_perf_event() API. Notice that the string literals on_branch_ref and on_branch_miss passed as an argument to this API are the same as the names of the two functions within the BPF program. So, in effect, we are attaching two perf events and passing as callbacks the functions defined within the BPF program (which, again, run in the kernel when the event fires). The 100 passed after the handler name is the sample period, so each handler runs roughly once per 100 events, which is why the handlers add ctx->sample_period to the counter rather than 1. What are these events that we talk about? They are PERF_COUNT_HW_BRANCH_INSTRUCTIONS and PERF_COUNT_HW_BRANCH_MISSES; both are defined in the linux/perf_event.h header file included in this program. Note that we specify that these two events are of the type PERF_TYPE_HARDWARE, so the kernel knows to fetch the values for these requests from the PMC registers, just like it does when the perf utility requests these numbers. So we have registered our handlers in the BPF program for the events we are interested in.
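The “attach by name” idea can be mimicked in plain C++. The class below is a toy stand-in that I made up (ToyProgram and its methods are not part of BCC, and nothing here touches the kernel). It only shows the shape of the contract: handlers are known by name, an event is bound to a handler by passing that name as a string, and an unknown name is rejected.

```cpp
#include <cassert>
#include <cstdint>
#include <functional>
#include <map>
#include <string>
#include <utility>

// Toy stand-in for "program with named functions + event attachment". Not BCC.
class ToyProgram {
public:
    using Handler = std::function<void(std::uint64_t sample_period)>;

    // Give a handler a name, like a function defined in the BPF program text.
    void define(const std::string& name, Handler handler) {
        handlers_[name] = std::move(handler);
    }

    // Bind an event id to a handler by name; fails if no such handler exists.
    bool attach(int event_id, const std::string& name,
                std::uint64_t sample_period) {
        assert(sample_period > 0);
        auto it = handlers_.find(name);
        if (it == handlers_.end()) return false;
        attached_[event_id] = std::make_pair(it->second, sample_period);
        return true;
    }

    // Remove the binding; later samples for this event are ignored.
    void detach(int event_id) { attached_.erase(event_id); }

    // Deliver one sample for the event to whichever handler is attached.
    void deliver_sample(int event_id) {
        auto it = attached_.find(event_id);
        if (it != attached_.end()) it->second.first(it->second.second);
    }

private:
    std::map<std::string, Handler> handlers_;
    std::map<int, std::pair<Handler, std::uint64_t>> attached_;
};
```

Define two handlers called on_branch_ref and on_branch_miss, attach them to two event ids with a period of 100, and each delivered sample bumps the matching counter by 100. After detach, samples for that event no longer reach the handler.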
Run the code to profile
In this case, we are simply capturing all the branch instructions and branch misses over a specified duration (note the sleep() call, which waits for that amount of time).
Detach our handlers
After the specified amount of time has elapsed, we detach our handlers by calling:
bpf.detach_perf_event(PERF_TYPE_HARDWARE, PERF_COUNT_HW_BRANCH_INSTRUCTIONS);
bpf.detach_perf_event(PERF_TYPE_HARDWARE, PERF_COUNT_HW_BRANCH_MISSES);
Process the output
If our program has reached this point, most of the hurdles associated with running BPF programs have been crossed. We only need to make sure the output is there and is meaningful. In this example, we don’t automate the processing of this output; instead, we simply pretty-print the stats on stdout.
auto instrns = bpf.get_hash_table<event_t, uint64_t>("ref_count");
auto misses = bpf.get_hash_table<event_t, uint64_t>("miss_count");
for (auto it : instrns.get_table_offline()) {
    uint64_t hit;
    try {
        auto miss = misses[it.first];
        hit = miss <= it.second ? it.second - miss : 0;
    } catch (...) {
        hit = it.second;
    }
    double ratio = (double(hit) / double(it.second)) * 100.0;
    // ...
    // pretty-print the results
In the above code, we retrieve the hash tables via BPF’s get_hash_table() API. Note that the argument passed to this API is a string literal naming one of the hash tables in the BPF program we attached earlier. We then iterate through the hash tables and calculate the Branch Hit Ratio and Branch Miss Ratio from the given counters. Refer to this post for a better understanding of the iteration.
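The arithmetic in that loop can be pulled out into a small standalone function. This is a sketch with names of my own (BranchStats and branch_stats are not in the original program); it follows the same rule as the loop above, clamping hits to zero when the miss counter is at least as large as the reference counter, which is the case that shows up as rows like 0/20400 in the output.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical result type for one (cpu, pid, name) entry.
struct BranchStats {
    std::uint64_t hits;    // refs - misses, clamped at 0
    double hit_rate_pct;   // hits / refs * 100
    double miss_rate_pct;  // 100 - hit_rate_pct
};

// Compute hit and miss percentages from the two sampled counters.
inline BranchStats branch_stats(std::uint64_t refs, std::uint64_t misses) {
    assert(refs > 0);  // an entry in ref_count has at least one sample
    const std::uint64_t hits = misses <= refs ? refs - misses : 0;
    const double hit_rate =
        100.0 * static_cast<double>(hits) / static_cast<double>(refs);
    return BranchStats{hits, hit_rate, 100.0 - hit_rate};
}
```

For example, 20400 sampled branch instructions with 8400 sampled misses gives 12000 hits, a hit rate of about 58.82%.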
The complete code can be downloaded from here
The output from the above program on my machine, as-is:
╭─ ~/s/bcc/bcc/build master ● ? ⍟1 ✔ 26.89G RAM 0.26 L
╰─ sudo ./examples/cpp/BranchPrediction 1
Probing for 1 seconds
PID 27835 (tmux: server) on CPU 4 Hit Rate 58.82% (12000/20400)
PID 0 (swapper/6) on CPU 6 Hit Rate 96.15% (274600/285600)
PID 4935 (tmux: server) on CPU 4 Hit Rate 0% (0/20400)
PID 4935 (tmux: server) on CPU 15 Hit Rate 0% (0/20400)
PID 27840 (python3) on CPU 2 Hit Rate 0.4902% (400/81600)
PID 0 (swapper/2) on CPU 2 Hit Rate 95.93% (450100/469200)
PID 27980 (kworker/4:5) on CPU 4 Hit Rate 88.73% (18100/20400)
PID 0 (swapper/15) on CPU 15 Hit Rate 96.28% (432100/448800)
PID 4711 (gnome-terminal-) on CPU 15 Hit Rate 2.451% (500/20400)
PID 0 (swapper/13) on CPU 13 Hit Rate 95.68% (527000/550800)
PID 20674 (firefox) on CPU 8 Hit Rate 83.82% (17100/20400)
PID 20856 (WebExtensions) on CPU 4 Hit Rate 55.39% (11300/20400)
PID 0 (swapper/4) on CPU 4 Hit Rate 95.48% (448000/469200)
PID 27836 (byobu-status) on CPU 6 Hit Rate 0% (0/20400)
PID 27850 (mv) on CPU 0 Hit Rate 5.392% (1100/20400)
PID 0 (swapper/8) on CPU 8 Hit Rate 89.12% (90900/102000)
PID 20680 (Gecko_IOThread) on CPU 1 Hit Rate 14.71% (3000/20400)
PID 28632 (kworker/5:1) on CPU 5 Hit Rate 88.97% (36300/40800)
PID 14423 (Web Content) on CPU 8 Hit Rate 0% (0/20400)
PID 0 (swapper/3) on CPU 3 Hit Rate 89.57% (127900/142800)
PID 27817 (BranchPredictio) on CPU 12 Hit Rate 94.12% (38400/40800)
PID 0 (swapper/14) on CPU 14 Hit Rate 91.93% (806400/877200)
PID 0 (swapper/1) on CPU 1 Hit Rate 93.34% (457000/489600)
PID 0 (swapper/7) on CPU 7 Hit Rate 95.59% (468000/489600)
PID 27836 (byobu-status) on CPU 1 Hit Rate 0% (0/20400)
#...
Happy Whatever!
Source:
- http://www.brendangregg.com
- https://github.com/iovisor/bcc
- http://www.mycpu.org/kernel-n00b-howto/
- http://www.mycpu.org/flamegraphs-on-c++/
- http://www.mycpu.org/perf-events/
Disclaimer: [Brendan Gregg](http://www.brendangregg.com) is a leading expert on
this topic. This post is an attempt to try and simplify things for a mortal like
myself. If you want authoritative content on anything BCC, BPF, or tracing
please visit Brendan's site.
CLICK COLLAPSE/UNCOLLAPSE FULL PERF LIST
branch-instructions OR branches [Hardware event]
branch-misses [Hardware event]
cache-references [Hardware event]
cpu-cycles OR cycles [Hardware event]
instructions [Hardware event]
stalled-cycles-backend OR idle-cycles-backend [Hardware event]
stalled-cycles-frontend OR idle-cycles-frontend [Hardware event]
alignment-faults [Software event]
bpf-output [Software event]
context-switches OR cs [Software event]
cpu-clock [Software event]
cpu-migrations OR migrations [Software event]
dummy [Software event]
emulation-faults [Software event]
major-faults [Software event]
minor-faults [Software event]
page-faults OR faults [Software event]
task-clock [Software event]
duration_time [Tool event]
L1-dcache-load-misses [Hardware cache event]
L1-dcache-loads [Hardware cache event]
L1-dcache-prefetches [Hardware cache event]
L1-icache-load-misses [Hardware cache event]
L1-icache-loads [Hardware cache event]
branch-load-misses [Hardware cache event]
branch-loads [Hardware cache event]
dTLB-load-misses [Hardware cache event]
dTLB-loads [Hardware cache event]
iTLB-load-misses [Hardware cache event]
iTLB-loads [Hardware cache event]
amd_iommu_0/cmd_processed/ [Kernel PMU event]
amd_iommu_0/cmd_processed_inv/ [Kernel PMU event]
amd_iommu_0/ign_rd_wr_mmio_1ff8h/ [Kernel PMU event]
amd_iommu_0/int_dte_hit/ [Kernel PMU event]
amd_iommu_0/int_dte_mis/ [Kernel PMU event]
amd_iommu_0/mem_dte_hit/ [Kernel PMU event]
amd_iommu_0/mem_dte_mis/ [Kernel PMU event]
amd_iommu_0/mem_iommu_tlb_pde_hit/ [Kernel PMU event]
amd_iommu_0/mem_iommu_tlb_pde_mis/ [Kernel PMU event]
amd_iommu_0/mem_iommu_tlb_pte_hit/ [Kernel PMU event]
amd_iommu_0/mem_iommu_tlb_pte_mis/ [Kernel PMU event]
amd_iommu_0/mem_pass_excl/ [Kernel PMU event]
amd_iommu_0/mem_pass_pretrans/ [Kernel PMU event]
amd_iommu_0/mem_pass_untrans/ [Kernel PMU event]
amd_iommu_0/mem_target_abort/ [Kernel PMU event]
amd_iommu_0/mem_trans_total/ [Kernel PMU event]
amd_iommu_0/page_tbl_read_gst/ [Kernel PMU event]
amd_iommu_0/page_tbl_read_nst/ [Kernel PMU event]
amd_iommu_0/page_tbl_read_tot/ [Kernel PMU event]
amd_iommu_0/smi_blk/ [Kernel PMU event]
amd_iommu_0/smi_recv/ [Kernel PMU event]
amd_iommu_0/tlb_inv/ [Kernel PMU event]
amd_iommu_0/vapic_int_guest/ [Kernel PMU event]
amd_iommu_0/vapic_int_non_guest/ [Kernel PMU event]
branch-instructions OR cpu/branch-instructions/ [Kernel PMU event]
branch-misses OR cpu/branch-misses/ [Kernel PMU event]
cache-references OR cpu/cache-references/ [Kernel PMU event]
cpu-cycles OR cpu/cpu-cycles/ [Kernel PMU event]
instructions OR cpu/instructions/ [Kernel PMU event]
msr/aperf/ [Kernel PMU event]
msr/irperf/ [Kernel PMU event]
msr/mperf/ [Kernel PMU event]
msr/tsc/ [Kernel PMU event]
stalled-cycles-backend OR cpu/stalled-cycles-backend/ [Kernel PMU event]
stalled-cycles-frontend OR cpu/stalled-cycles-frontend/ [Kernel PMU event]
branch:
bp_l1_btb_correct
[L1 BTB Correction]
bp_l2_btb_correct
[L2 BTB Correction]
cache:
bp_l1_tlb_miss_l2_hit
[The number of instruction fetches that miss in the L1 ITLB but hit in
the L2 ITLB]
bp_l1_tlb_miss_l2_miss
[The number of instruction fetches that miss in both the L1 and L2 TLBs]
bp_snp_re_sync
[The number of pipeline restarts caused by invalidating probes that hit
on the instruction stream currently being executed. This would happen
if the active instruction stream was being modified by another
processor in an MP system - typically a highly unlikely event]
bp_tlb_rel
[The number of ITLB reload requests]
ic_cache_fill_l2
[The number of 64 byte instruction cache line was fulfilled from the L2
cache]
ic_cache_fill_sys
[The number of 64 byte instruction cache line fulfilled from system
memory or another cache]
ic_cache_inval.fill_invalidated
[IC line invalidated due to overwriting fill response]
ic_cache_inval.l2_invalidating_probe
[IC line invalidated due to L2 invalidating probe (external or LS)]
ic_fetch_stall.ic_stall_any
[IC pipe was stalled during this clock cycle for any reason (nothing
valid in pipe ICM1)]
ic_fetch_stall.ic_stall_back_pressure
[IC pipe was stalled during this clock cycle (including IC to OC
fetches) due to back-pressure]
ic_fetch_stall.ic_stall_dq_empty
[IC pipe was stalled during this clock cycle (including IC to OC
fetches) due to DQ empty]
ic_fw32
[The number of 32B fetch windows transferred from IC pipe to DE
instruction decoder (includes non-cacheable and cacheable fill
responses)]
ic_fw32_miss
[The number of 32B fetch windows tried to read the L1 IC and missed in
the full tag]
l2_cache_req_stat.ic_fill_hit_s
[IC Fill Hit Shared]
l2_cache_req_stat.ic_fill_hit_x
[IC Fill Hit Exclusive Stale]
l2_cache_req_stat.ic_fill_miss
[IC Fill Miss]
l2_cache_req_stat.ls_rd_blk_c
[LS Read Block C S L X Change to X Miss]
l2_cache_req_stat.ls_rd_blk_cs
[LS ReadBlock C/S Hit]
l2_cache_req_stat.ls_rd_blk_l_hit_s
[LsRdBlkL Hit Shared]
l2_cache_req_stat.ls_rd_blk_l_hit_x
[LS Read Block L Hit X]
l2_cache_req_stat.ls_rd_blk_x
[LsRdBlkX/ChgToX Hit X. Count RdBlkX finding Shared as a Miss]
l2_fill_pending.l2_fill_busy
[Total cycles spent with one or more fill requests in flight from L2]
l2_latency.l2_cycles_waiting_on_fills
[Total cycles spent waiting for L2 fills to complete from L3 or memory,
divided by four. Event counts are for both threads. To calculate
average latency, the number of fills from both threads must be used]
l2_request_g1.cacheable_ic_read
[Requests to L2 Group1]
l2_request_g1.change_to_x
[Requests to L2 Group1]
l2_request_g1.l2_hw_pf
[Requests to L2 Group1]
l2_request_g1.ls_rd_blk_c_s
[Requests to L2 Group1]
l2_request_g1.other_requests
[Events covered by l2_request_g2]
l2_request_g1.prefetch_l2
[Requests to L2 Group1]
l2_request_g1.rd_blk_l
[Requests to L2 Group1]
l2_request_g1.rd_blk_x
[Requests to L2 Group1]
l2_request_g2.bus_locks_originator
[Multi-events in that LS and IF requests can be received simultaneous]
l2_request_g2.bus_locks_responses
[Multi-events in that LS and IF requests can be received simultaneous]
l2_request_g2.group1
[All Group 1 commands not in unit0]
l2_request_g2.ic_rd_sized
[Multi-events in that LS and IF requests can be received simultaneous]
l2_request_g2.ic_rd_sized_nc
[Multi-events in that LS and IF requests can be received simultaneous]
l2_request_g2.ls_rd_sized
[RdSized, RdSized32, RdSized64]
l2_request_g2.ls_rd_sized_nc
[RdSizedNC, RdSized32NC, RdSized64NC]
l2_request_g2.smc_inval
[Multi-events in that LS and IF requests can be received simultaneous]
l2_wcb_req.cl_zero
[LS (Load/Store unit) to L2 WCB (Write Combining Buffer) cache line
zeroing requests]
l2_wcb_req.wcb_close
[LS to L2 WCB close requests]
l2_wcb_req.wcb_write
[LS to L2 WCB write requests]
l2_wcb_req.zero_byte_store
[LS to L2 WCB zero byte store requests]
l3_comb_clstr_state.other_l3_miss_typs
[Other L3 Miss Request Types. Unit: amd_l3]
l3_comb_clstr_state.request_miss
[L3 cache misses. Unit: amd_l3]
l3_lookup_state.all_l3_req_typs
[All L3 Request Types. Unit: amd_l3]
l3_request_g1.caching_l3_cache_accesses
[Caching: L3 cache accesses. Unit: amd_l3]
xi_ccx_sdp_req1.all_l3_miss_req_typs
[All L3 Miss Request Types. Ignores SliceMask and ThreadMask. Unit:
amd_l3]
xi_sys_fill_latency
[L3 Cache Miss Latency. Total cycles for all transactions divided by
16. Ignores SliceMask and ThreadMask. Unit: amd_l3]
core:
ex_div_busy
[Div Cycles Busy count]
ex_div_count
[Div Op Count]
ex_ret_brn
[Retired Branch Instructions]
ex_ret_brn_far
[Retired Far Control Transfers]
ex_ret_brn_ind_misp
[Retired Indirect Branch Instructions Mispredicted]
ex_ret_brn_misp
[Retired Branch Instructions Mispredicted]
ex_ret_brn_resync
[Retired Branch Resyncs]
ex_ret_brn_tkn
[Retired Taken Branch Instructions]
ex_ret_brn_tkn_misp
[Retired Taken Branch Instructions Mispredicted]
ex_ret_cond
[Retired Conditional Branch Instructions]
ex_ret_cond_misp
[Retired Conditional Branch Instructions Mispredicted]
ex_ret_cops
[Retired Uops]
ex_ret_fus_brnch_inst
[The number of fused retired branch instructions retired per cycle. The
number of events logged per cycle can vary from 0 to 3]
ex_ret_instr
[Retired Instructions]
ex_ret_mmx_fp_instr.mmx_instr
[MMX instructions]
ex_ret_mmx_fp_instr.sse_instr
[SSE instructions (SSE, SSE2, SSE3, SSSE3, SSE4A, SSE41, SSE42, AVX)]
ex_ret_mmx_fp_instr.x87_instr
[x87 instructions]
ex_ret_near_ret
[Retired Near Returns]
ex_ret_near_ret_mispred
[Retired Near Returns Mispredicted]
ex_tagged_ibs_ops.ibs_count_rollover
[Number of times an op could not be tagged by IBS because of a previous
tagged op that has not retired]
ex_tagged_ibs_ops.ibs_tagged_ops
[Number of Ops tagged by IBS]
ex_tagged_ibs_ops.ibs_tagged_ops_ret
[Number of Ops tagged by IBS that retired]
floating point:
fp_num_mov_elim_scal_op.opt_potential
[Number of Ops that are candidates for optimization (have Z-bit either
set or pass)]
fp_num_mov_elim_scal_op.optimized
[Number of Scalar Ops optimized]
fp_num_mov_elim_scal_op.sse_mov_ops
[Number of SSE Move Ops]
fp_num_mov_elim_scal_op.sse_mov_ops_elim
[Number of SSE Move Ops eliminated]
fp_ret_sse_avx_ops.all
[All FLOPS]
fp_ret_sse_avx_ops.dp_add_sub_flops
[Double precision add/subtract FLOPS]
fp_ret_sse_avx_ops.dp_div_flops
[Double precision divide/square root FLOPS]
fp_ret_sse_avx_ops.dp_mult_add_flops
[Double precision multiply-add FLOPS. Multiply-add counts as 2 FLOPS]
fp_ret_sse_avx_ops.dp_mult_flops
[Double precision multiply FLOPS]
fp_ret_sse_avx_ops.sp_add_sub_flops
[Single-precision add/subtract FLOPS]
fp_ret_sse_avx_ops.sp_div_flops
[Single-precision divide/square root FLOPS]
fp_ret_sse_avx_ops.sp_mult_add_flops
[Single precision multiply-add FLOPS. Multiply-add counts as 2 FLOPS]
fp_ret_sse_avx_ops.sp_mult_flops
[Single-precision multiply FLOPS]
fp_retired_ser_ops.sse_bot_ret
[SSE bottom-executing uOps retired]
fp_retired_ser_ops.sse_ctrl_ret
[SSE control word mispredict traps due to mispredictions in RC, FTZ or
DAZ, or changes in mask bits]
fp_retired_ser_ops.x87_bot_ret
[x87 bottom-executing uOps retired]
fp_retired_ser_ops.x87_ctrl_ret
[x87 control word mispredict traps due to mispredictions in RC or PC,
or changes in mask bits]
fp_retx87_fp_ops.add_sub_ops
[Add/subtract Ops]
fp_retx87_fp_ops.all
[All Ops]
fp_retx87_fp_ops.div_sqr_r_ops
[Divide and square root Ops]
fp_retx87_fp_ops.mul_ops
[Multiply Ops]
fp_sched_empty
[This is a speculative event. The number of cycles in which the FPU
scheduler is empty. Note that some Ops like FP loads bypass the
scheduler]
fpu_pipe_assignment.dual
[Total number multi-pipe uOps]
fpu_pipe_assignment.total
[Total number uOps]
memory:
ls_dc_accesses
[The number of accesses to the data cache for load and store
references. This may include certain microcode scratchpad accesses,
although these are generally rare. Each increment represents an
eight-byte access, although the instruction may only be accessing a
portion of that. This event is a speculative event]
ls_dispatch.ld_dispatch
[Counts the number of operations dispatched to the LS unit. Unit Masks
ADDed]
ls_dispatch.ld_st_dispatch
[Load-op-Stores]
ls_dispatch.store_dispatch
[Counts the number of operations dispatched to the LS unit. Unit Masks
ADDed]
ls_inef_sw_pref.data_pipe_sw_pf_dc_hit
[The number of software prefetches that did not fetch data outside of
the processor core]
ls_inef_sw_pref.mab_mch_cnt
[The number of software prefetches that did not fetch data outside of
the processor core]
ls_l1_d_tlb_miss.all
[L1 DTLB Miss or Reload off all sizes]
ls_l1_d_tlb_miss.tlb_reload_1g_l2_hit
[L1 DTLB Reload of a page of 1G size]
ls_l1_d_tlb_miss.tlb_reload_1g_l2_miss
[L1 DTLB Miss of a page of 1G size]
ls_l1_d_tlb_miss.tlb_reload_2m_l2_hit
[L1 DTLB Reload of a page of 2M size]
ls_l1_d_tlb_miss.tlb_reload_2m_l2_miss
[L1 DTLB Miss of a page of 2M size]
ls_l1_d_tlb_miss.tlb_reload_32k_l2_hit
[L1 DTLB Reload of a page of 32K size]
ls_l1_d_tlb_miss.tlb_reload_32k_l2_miss
[L1 DTLB Miss of a page of 32K size]
ls_l1_d_tlb_miss.tlb_reload_4k_l2_hit
[L1 DTLB Reload of a page of 4K size]
ls_l1_d_tlb_miss.tlb_reload_4k_l2_miss
[L1 DTLB Miss of a page of 4K size]
ls_locks.bus_lock
[Bus lock when a locked operations crosses a cache boundary or is done
on an uncacheable memory type]
ls_misal_accesses
[Misaligned loads]
ls_not_halted_cyc
[Cycles not in Halt]
ls_pref_instr_disp.load_prefetch_w
[Prefetch, Prefetch_T0_T1_T2]
ls_pref_instr_disp.prefetch_nta
[Software Prefetch Instructions (PREFETCHNTA instruction) Dispatched]
ls_pref_instr_disp.store_prefetch_w
[Software Prefetch Instructions (3DNow PREFETCHW instruction)
Dispatched]
ls_stlf
[Number of STLF hits]
ls_tablewalker.perf_mon_tablewalk_alloc_dside
[Tablewalker allocation]
ls_tablewalker.perf_mon_tablewalk_alloc_iside
[Tablewalker allocation]
other:
de_dis_dispatch_token_stalls0.agsq_token_stall
[AGSQ Tokens unavailable]
de_dis_dispatch_token_stalls0.alsq1_token_stall
[ALSQ 1 Tokens unavailable]
de_dis_dispatch_token_stalls0.alsq2_token_stall
[ALSQ 2 Tokens unavailable]
de_dis_dispatch_token_stalls0.alsq3_0_token_stall
[Cycles where a dispatch group is valid but does not get dispatched due
to a token stall]
de_dis_dispatch_token_stalls0.alsq3_token_stall
[ALSQ 3 Tokens unavailable]
de_dis_dispatch_token_stalls0.alu_token_stall
[ALU tokens total unavailable]
de_dis_dispatch_token_stalls0.retire_token_stall
[RETIRE Tokens unavailable]
ic_oc_mode_switch.ic_oc_mode_switch
[IC to OC mode switch]
ic_oc_mode_switch.oc_ic_mode_switch
[OC to IC mode switch]
rNNN [Raw hardware event descriptor]
cpu/t1=v1[,t2=v2,t3 ...]/modifier [Raw hardware event descriptor]
mem: