"James Van Buskirk" wrote in message
> > In your loop the 4th xorps is dependent on the 1st xorps, etc.
> > To measure throughput they should probably all be independent,
> > but I don't know how to do that.
> > It's probably somewhere in Agner Fog's documents.
>
> Nope. Both AF and Intel say latency 1 cycle, throughput 3 per cycle.
> By those figures we should in principle be able to issue three
> mutually independent xorps instructions in cycle 1, then three mutually
> independent instructions in cycle 2 (each dependent on an operation
> from cycle 1, whose latency period has elapsed), then 3 in cycle 3,
> and so on. If the docs explained why this doesn't happen, I wouldn't
> be asking here. Section 3.5.2.1 of the Intel 64 and IA-32 Architectures
> Optimization Reference Manual discusses ROB read port stalls, but
> they aren't relevant here: every operand is still in flight, having
> been modified within the last two instructions.
>
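The issue schedule argued above can be checked with a toy model. This is a minimal sketch that assumes only the two quoted figures (latency 1 cycle, 3 xorps issued per cycle) plus an unlimited scheduling window. It deliberately ignores port assignment, register/ROB read limits and front-end bandwidth, which is exactly where real hardware may differ. The function names `schedule_cycles` and `interleaved_chains` are hypothetical, made up for illustration.

```python
def schedule_cycles(deps, latency=1, width=3):
    """Idealized issue model (hypothetical helper, not from any real tool).

    deps[i] lists the indices of earlier instructions whose results
    instruction i reads. Each instruction is placed in the earliest cycle
    in which all of its sources are ready (issue cycle + latency) and
    fewer than `width` instructions have already been issued.
    Returns the cycle in which the last result becomes available.
    """
    ready = []              # ready[i]: cycle in which result of i is available
    issued_per_cycle = {}   # cycle -> number of instructions issued in it
    for srcs in deps:
        cycle = max((ready[s] for s in srcs), default=0)
        # Respect the assumed throughput limit of `width` per cycle.
        while issued_per_cycle.get(cycle, 0) >= width:
            cycle += 1
        issued_per_cycle[cycle] = issued_per_cycle.get(cycle, 0) + 1
        ready.append(cycle + latency)
    return max(ready, default=0)


def interleaved_chains(n, k):
    """n instructions forming k interleaved dependency chains:
    instruction i reads the result of instruction i - k, if it exists.
    k = 3 matches "the 4th xorps is dependent on the 1st"."""
    return [[i - k] if i >= k else [] for i in range(n)]
```

Under these assumptions, 300 instructions in a single chain (`k=1`) take 300 cycles, while three interleaved chains (`k=3`) take 100 cycles, the same as 300 fully independent ones. So by the published figures the loop's dependencies alone should not limit it below 3 per cycle. With a hypothetical latency of 3, three chains fall back to 300 cycles and about nine chains are needed to stay near 3 per cycle.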
That they should be independent is stated in Agner Fog's optimizing_assembly.pdf.
But as I mentioned, I don't know how to measure it; perhaps ask Agner
himself.

