# #StackBounty: #implementation #universal-hash How does this POLYVAL modular reduction algorithm work?

### Bounty: 250

I recently found the GitHub repository used to make the measurements in the AES-GCM-SIV paper where they implement polynomial hashing using POLYVAL.

This means in this context to compute the usual $$tau=sum_{i=0}^nm_iH^{n-i}$$ for 16-byte message blocks $$m_i$$ and a 128-bit key $$H$$ in the usual way as $$H_{i+1}=(m_i+H_i)cdot H$$ with messages and $$H$$ being interpreted as polynomials over $$mathbb F_2[x]/(x^{128}+x^{127}+x^{126}+x^{121}+1)$$.

Now the actual computation uses x86 hardware intrinsics (which can be looked up here).
The code in question is `Polyval_Horner` (`polyval.c`, line 137) in particular the following extract (adopted and commented from the repo):

``````__m128i TMP0, TMP1, TMP2, TMP3, TMP4, T, POLY, H;
// ordering of the inputs is reversed, last is most significant
// 0xc2000000 corresponds to the top 3 POLYVAL coefficients
POLY = _mm_setr_epi32(0x1,0,0,0xc2000000);

// This instruction takes two 64-bit halves and carrylessly multiplies them
// If the lower nibble is 0, take the lower half of the first input, else the upper half
// likewise with the upper nibble for the second input
TMP1 = _mm_clmulepi64_si128(T, H, 0x00);
TMP4 = _mm_clmulepi64_si128(T, H, 0x11);
TMP2 = _mm_clmulepi64_si128(T, H, 0x10);
TMP3 = _mm_clmulepi64_si128(T, H, 0x01);
// TMP2 and 3 contain the range of coefficients from 64 to 191, add them
TMP2 = _mm_xor_si128(TMP2, TMP3);
// now extract the upper and lower halves of these coefficients and add them
// into either TMP1 or 4 depending on whether they are the lower or the upper coefficients
TMP3 = _mm_slli_si128(TMP2, 8);
TMP2 = _mm_srli_si128(TMP2, 8);
TMP1 = _mm_xor_si128(TMP3, TMP1);
TMP4 = _mm_xor_si128(TMP4, TMP2);
// reduction starts here
// multiply the lower half of TMP1 with the upper half of POLY
TMP2 = _mm_clmulepi64_si128(TMP1, POLY, 0x10);
// This re-orders the 32-bit subwords
// 78 should exactly swap the 64-bit halves
TMP3 = _mm_shuffle_epi32(TMP1, 78);
TMP1 = _mm_xor_si128(TMP3, TMP2);
TMP2 = _mm_clmulepi64_si128(TMP1, POLY, 0x10);
TMP3 = _mm_shuffle_epi32(TMP1, 78);
TMP1 = _mm_xor_si128(TMP3, TMP2);
T = _mm_xor_si128(TMP4, TMP1);
``````

(I have not converted to pseudo-code as to preserve any specific instruction behavior I may get wrong).
This code loads a key `pH` and a previous iteration `TAG` adds it with some 16-byte input `inp`and multiplies it with `pH` in a carryless way and then reduces it modulo the Polyval polynomial into the new value of `T`.

My reading of the above code is that

• Before the reduction `TMP4` holds the 128 most-significant polynomial coefficients of the multiplication result
• Before the reduction `TMP1` holds the 128 least-significant polynomial coefficient of the multiplication result

Now my question is:

How does this reduction algorithm work?

Because for me, if I try it on paper with $$x^{127}$$ and $$x$$ I should get back $$x^{127}+x^{126}+x^{121}+1$$ but instead I think the algorithm returns me $$1$$.

Here’s my interpretation of how to read the reduction intrinsics:

1. Take the lower 64-bit of the lower 128-bit of the multiplication result, multiply them with the upper 64-bit of the polynomial (in my example that’s $$0$$ times the upper bits), call it `TMP2`
2. Swap the 64-bit halves of the original 128-bit result, call it `TMP3` ($$0$$ in my example because `TMP1` would be 0)
3. Add `TMP2` and `TMP3`, call the result `TMP1` ($$0+0$$)
4. Repeat the previous three steps once
5. Return the addition of the current `TMP1` and `TMP4` (the upper 128-bit) which in my case would be $$0+1=1$$

Get this bounty!!!

This site uses Akismet to reduce spam. Learn how your comment data is processed.