The problem is with high-level languages, but I wonder how low-level, intermediate LLVM code would do in this case (if we wrote this directly in LLVM code). Are there special opcodes for this? Is it flexible enough, and can LLVM optimize it for the processor type?