Skip to main content

Command Palette

Search for a command to run...

Don't vibe-code, embrace learning primitives

Implementing various tricks and performance optimizations and learning how the things are laid out in the lowest level possible

Updated
8 min read
Don't vibe-code, embrace learning primitives
K
I love Comp Sci, I think its one of the coolest subjects out there, That's exactly why i wanna explore all the domains and subdomains out there, and learn new things, keep getting better and overall become a better software engineer.

Disclaimer

These are all the things i am still currently learning and exploring, Although i am sure most of the things addressed here are true, I am bound to make mistakes, that's why having a community is so amazing because i can stand corrected, learn new things, better things and we all can grow. Feel free to trash me in the comments.

Additionally i know compilers can optimize a lot of these “tricks” I mention but, I am here to learn “what” problems exist and “how” does a compiler solve them for us.

I plan to run all of these things with no compiler optimization, because i wanted to see how the kernel / hardware / memory is laid out in bare metal.

Concepts

My plan is to learn new concepts, tricks / optimizations but mainly how things are laid out in memory / hardware level as much as possible.

End goal would be to develop the intuition to be strong enough so that when i code stuff even in other languages i keep things like this in mind and overall become a better developer all together.

Cache lines

  • So one thing i learnt is that whenever you fetch something that is not in the cache, you get the sequential elements (the entire cache line) for free (it doesn’t fetch one value it fetches the entire cache line), so the follow up elements access would be really fast.

  • That's why something like an array would be way faster than linked list. (But i could think of a linked list which is light enough to be cache friendly and the addresses are somewhat stored contiguous?)(Basically arena allocated linked lists.)

  • As shown in the diagram, when i fetch A[0], i will get up to A[5] and even the garbage values (which may or may not be used) for free up to 16 elements (64 bytes assuming 4 byte ints) (Again assuming modern systems).

  • That's why when you access things on a stride (that is you exactly miss by the cache line or more) you pay the cost of fetching from the DRAM (or slower cache).

  • That's why Spatial Locality is very important, because if your data that you are accessing live close to each other, then you get the neighboring elements for essentially free.

Row order vs Col order access

#include "stdlib.h"
#include <stdio.h>

#define N 10000
int matrix[N][N] = {15};

int main(int argc, char *argv[]) {
  long long sum = 0;

  // row order access => 0% cache miss
  for (int i = 0; i < N; i++) {
    for (int j = 0; j < N; j++) {
      sum += matrix[i][j];
    }
  }

  // col order access => messes up the cache, but its not 100% cache miss (or something close to that? why?)
  // for (int j = 0; j < N; j++) {
  //   for (int i = 0; i < N; i++) {
  //     sum += matrix[i][j];
  //   }
  // }
  printf("[%lld]\n", sum);

  return EXIT_SUCCESS;
}

column order

row order

Observation

  • This was mostly in order and according to theory, row order matrix access is way faster than the column order.

  • But the interesting thing was that the L1 cache miss was around 12%, hm why is that though? if i am not wrong, with the size of the matrix the cache line should have been evicted for every single access.

Similar observation when using matrix transposition. If you do the correct set of loop ordering, you get the best read/write cache utilization.

wrong loop ordering

correct loop ordering

Loop ordering

Basically if you keep your loops in the correct order, you can make such that your element access are sequential with respect to how the memory is laid out and get very high cache hit %.

Additional Things

  • Here i also learnt about dirty cache writes.

  • If you want to write the things (modify) to memory. If its not there in the cache, you fetch the ENTIRE CACHE LINE?? and then update only the VALUE YOU WANT TO UPDATE. and then write the *ENTIRE CACHE LINE BACK???? (You don't write it immediately but you mark it as dirty and then later it can be flushed whenever, I am the stupid one xD).

  • That is so expensive. maybe there are some optimizations like batching these writes (Just mentioned). but i have to look at how it does these things.

  • Feel free to write out things you know about this.

  • Tiling will help with the cache. Because when you tile things you can limit the traversal to only certain BYTES between each other (working set), and the cache wont really be evicted by then and the jump is shorter.

  • Even though the code gets uglier with extra FOR loops. Maybe i am just assuming, the compiler does some of these optimizations for us? or maybe something even better!

Branch Prediction

  • This is one of the most interesting aspects i have seen the CPU does.

  • Predicting what part of the branch will be executed and then performing speculative execution is very interesting.

  • Basically the CPU will pre-execute the branch that is most likely to be executed before the condition has been evaluated.

  • If your branch predictions are deterministic for example (T, F, T, F, T, F). Then the CPU can predict which branch will be executed based on this pattern and perform speculative execution.

  • I also clearly see the cost, if the branch prediction is wrong, then i have to throw away all the speculative execution i have done till now and then go back to the branch and execute the other.

  • If misses like this happen very often it will be very expensive.

Branched AND random branch conditions (50% probability of T / F)

Branch less version of the same code

#include <stdio.h>
#include <stdlib.h>

#define SIZE 65536
#define ITERS 20000

int data[SIZE];

void runBranched(void) {
  long long sum = 0;
  for (int step = 0; step < ITERS; step++) {
    for (int i = 0; i < SIZE; i++) {
      if (data[i] >= 128) {
        sum += data[i];
      } else {
        sum -= data[i];
      }
    }
  }
  printf("Branched Sum: %lld\n", sum);
}

  /* 
   Basically convert the values of 0 = 255 => 0 = 1 => -1 = +1
   So that if the value is >= 128 we get +1, < 128 we get -1
   then we can just * data[i] and add it to sum
  */
void runBranchless(void) {
  long long sum = 0;
  for (int step = 0; step < ITERS; step++) {
    for (int i = 0; i < SIZE; i++) {
      int value = (data[i] / 128) * 2 - 1;
      sum += data[i] * value;
    }
  }
  printf("Branchless Sum: %lld\n", sum);
}

int main(int argc, char *argv[]) {
  /* 
  fixed seed
  */
  srand(1337);
  /* 
  generate numbers between the range 0 - 255 
  */
  for (int i = 0; i < SIZE; i++) {
    data[i] = rand() % 256;
  }

  // runBranched();
  runBranchless();

  return 0;

This is not to suggest don't use branches at all, (attributed to the entire article not just to this topic) quite the opposite, well predicted branches run way faster than branch less code. If your code has well predicted branches which the CPU can predict, go for it.

Hardware Pre-fetching

  • I think one of the reason as to why we did not hit 100% cache miss on our col order access of matrix, was because the striding pattern is “guessable?”. LOL I AM STUPID AF, THIS WAS NOTHING RELATED TO HARDWARE PREFETCHERS LOL.

  • I think the CPU learnt from the pattern and did pre-fetching hardware level and hence the cache was still being hit for the majority of the run. (IT WAS NOT,(maybe not the whole picture but yeah) ALSO I ALLOCATED EVERYTHING IN THE STACK, SO ITS PROBABLY SOMETHING RELATED TO THAT?)

  • That's very neat, This shows how far the modern CPU’s have come. You think AI is cool? the CPU can literally learn the access pattern and do things auto-magically to speed up the program, “AGI HAS BEEN ACHIEVED” “Neural Networks inside CPU confirmed???”.

  • Pre-fetchers are special circuits, that are present near the cache that keeps seeing the memory access pattern, it can recognize patterns like [ADDRESS X] => [ADDRESS X + 40KB] => [ADDRESS X + 80KB]. and this sends prefetch signals to the memory to keep those addresses ready for next access.

Conclusion

  • My next steps would be to convert them to ASM so that i can see the exact instructions being generated. I also know compiler does so many other cool optimizations, i will look into those as well.

  • I have just started, and i think you should too. This is the best and the easiest time to learn how to code / understand fundamentals, So that we can produce some amazing software. Why not learn now and become a better engineer later?

  • I would also greatly appreciate any nice books i could read regarding these topics. Thanks for reading.

Systems Programming

Part 1 of 1

Systems programming, Learning about kernel, operating systems and performance optimizations. Seeing how things are structured in the lowest level possible.