*Bounty: 250*

*Bounty: 250*

I recently found the GitHub repository used to make the measurements in the AES-GCM-SIV paper where they implement polynomial hashing using POLYVAL.

This means in this context to compute the usual $tau=sum_{i=0}^nm_iH^{n-i}$ for 16-byte message blocks $m_i$ and a 128-bit key $H$ in the usual way as $H_{i+1}=(m_i+H_i)cdot H$ with messages and $H$ being interpreted as polynomials over $mathbb F_2[x]/(x^{128}+x^{127}+x^{126}+x^{121}+1)$.

Now the actual computation uses x86 hardware intrinsics (which can be looked up here).

The code in question is `Polyval_Horner`

(`polyval.c`

, line 137) in particular the following extract (adopted and commented from the repo):

```
__m128i TMP0, TMP1, TMP2, TMP3, TMP4, T, POLY, H;
H = _mm_loadu_si128(((__m128i*)pH));
T = _mm_loadu_si128(((__m128i*)TAG));
// ordering of the inputs is reversed, last is most significant
// 0xc2000000 corresponds to the top 3 POLYVAL coefficients
POLY = _mm_setr_epi32(0x1,0,0,0xc2000000);
T = _mm_xor_si128(T, _mm_loadu_si128((__m128i*)inp));
// This instruction takes two 64-bit halves and carrylessly multiplies them
// If the lower nibble is 0, take the lower half of the first input, else the upper half
// likewise with the upper nibble for the second input
TMP1 = _mm_clmulepi64_si128(T, H, 0x00);
TMP4 = _mm_clmulepi64_si128(T, H, 0x11);
TMP2 = _mm_clmulepi64_si128(T, H, 0x10);
TMP3 = _mm_clmulepi64_si128(T, H, 0x01);
// TMP2 and 3 contain the range of coefficients from 64 to 191, add them
TMP2 = _mm_xor_si128(TMP2, TMP3);
// now extract the upper and lower halves of these coefficients and add them
// into either TMP1 or 4 depending on whether they are the lower or the upper coefficients
TMP3 = _mm_slli_si128(TMP2, 8);
TMP2 = _mm_srli_si128(TMP2, 8);
TMP1 = _mm_xor_si128(TMP3, TMP1);
TMP4 = _mm_xor_si128(TMP4, TMP2);
// reduction starts here
// multiply the lower half of TMP1 with the upper half of POLY
TMP2 = _mm_clmulepi64_si128(TMP1, POLY, 0x10);
// This re-orders the 32-bit subwords
// 78 should exactly swap the 64-bit halves
TMP3 = _mm_shuffle_epi32(TMP1, 78);
TMP1 = _mm_xor_si128(TMP3, TMP2);
TMP2 = _mm_clmulepi64_si128(TMP1, POLY, 0x10);
TMP3 = _mm_shuffle_epi32(TMP1, 78);
TMP1 = _mm_xor_si128(TMP3, TMP2);
T = _mm_xor_si128(TMP4, TMP1);
```

(I have not converted to pseudo-code as to preserve any specific instruction behavior I may get wrong).

This code loads a key `pH`

and a previous iteration `TAG`

adds it with some 16-byte input `inp`

and multiplies it with `pH`

in a carryless way and then reduces it modulo the Polyval polynomial into the new value of `T`

.

My reading of the above code is that

- Before the reduction
`TMP4`

holds the 128 most-significant polynomial coefficients of the multiplication result - Before the reduction
`TMP1`

holds the 128 least-significant polynomial coefficient of the multiplication result

Now my question is:

**How does this reduction algorithm work?**

Because for me, if I try it on paper with $x^{127}$ and $x$ I *should* get back $x^{127}+x^{126}+x^{121}+1$ but instead I *think* the algorithm returns me $1$.

Here’s my interpretation of how to read the reduction intrinsics:

- Take the lower 64-bit of the lower 128-bit of the multiplication result, multiply them with the upper 64-bit of the polynomial (in my example that’s $0$ times the upper bits), call it
`TMP2`

- Swap the 64-bit halves of the original 128-bit result, call it
`TMP3`

($0$ in my example because`TMP1`

would be 0) - Add
`TMP2`

and`TMP3`

, call the result`TMP1`

($0+0$) - Repeat the previous three steps once
- Return the addition of the current
`TMP1`

and`TMP4`

(the upper 128-bit) which in my case would be $0+1=1$