#StackBounty: #implementation #universal-hash How does this POLYVAL modular reduction algorithm work?

Bounty: 250

I recently found the GitHub repository used to make the measurements in the AES-GCM-SIV paper where they implement polynomial hashing using POLYVAL.

This means in this context to compute the usual $tau=sum_{i=0}^nm_iH^{n-i}$ for 16-byte message blocks $m_i$ and a 128-bit key $H$ in the usual way as $H_{i+1}=(m_i+H_i)cdot H$ with messages and $H$ being interpreted as polynomials over $mathbb F_2[x]/(x^{128}+x^{127}+x^{126}+x^{121}+1)$.

Now the actual computation uses x86 hardware intrinsics (which can be looked up here).
The code in question is Polyval_Horner (polyval.c, line 137) in particular the following extract (adopted and commented from the repo):

__m128i TMP0, TMP1, TMP2, TMP3, TMP4, T, POLY, H;
H = _mm_loadu_si128(((__m128i*)pH));
T = _mm_loadu_si128(((__m128i*)TAG));
// ordering of the inputs is reversed, last is most significant
// 0xc2000000 corresponds to the top 3 POLYVAL coefficients
POLY = _mm_setr_epi32(0x1,0,0,0xc2000000);

T = _mm_xor_si128(T, _mm_loadu_si128((__m128i*)inp));
// This instruction takes two 64-bit halves and carrylessly multiplies them
// If the lower nibble is 0, take the lower half of the first input, else the upper half
// likewise with the upper nibble for the second input
TMP1 = _mm_clmulepi64_si128(T, H, 0x00);
TMP4 = _mm_clmulepi64_si128(T, H, 0x11);
TMP2 = _mm_clmulepi64_si128(T, H, 0x10);
TMP3 = _mm_clmulepi64_si128(T, H, 0x01);
// TMP2 and 3 contain the range of coefficients from 64 to 191, add them
TMP2 = _mm_xor_si128(TMP2, TMP3);
// now extract the upper and lower halves of these coefficients and add them
// into either TMP1 or 4 depending on whether they are the lower or the upper coefficients
TMP3 = _mm_slli_si128(TMP2, 8);
TMP2 = _mm_srli_si128(TMP2, 8);
TMP1 = _mm_xor_si128(TMP3, TMP1);
TMP4 = _mm_xor_si128(TMP4, TMP2);
// reduction starts here
// multiply the lower half of TMP1 with the upper half of POLY
TMP2 = _mm_clmulepi64_si128(TMP1, POLY, 0x10);
// This re-orders the 32-bit subwords
// 78 should exactly swap the 64-bit halves
TMP3 = _mm_shuffle_epi32(TMP1, 78);
TMP1 = _mm_xor_si128(TMP3, TMP2);
TMP2 = _mm_clmulepi64_si128(TMP1, POLY, 0x10);
TMP3 = _mm_shuffle_epi32(TMP1, 78);
TMP1 = _mm_xor_si128(TMP3, TMP2);
T = _mm_xor_si128(TMP4, TMP1);

(I have not converted to pseudo-code as to preserve any specific instruction behavior I may get wrong).
This code loads a key pH and a previous iteration TAG adds it with some 16-byte input inpand multiplies it with pH in a carryless way and then reduces it modulo the Polyval polynomial into the new value of T.

My reading of the above code is that

  • Before the reduction TMP4 holds the 128 most-significant polynomial coefficients of the multiplication result
  • Before the reduction TMP1 holds the 128 least-significant polynomial coefficient of the multiplication result

Now my question is:

How does this reduction algorithm work?

Because for me, if I try it on paper with $x^{127}$ and $x$ I should get back $x^{127}+x^{126}+x^{121}+1$ but instead I think the algorithm returns me $1$.

Here’s my interpretation of how to read the reduction intrinsics:

  1. Take the lower 64-bit of the lower 128-bit of the multiplication result, multiply them with the upper 64-bit of the polynomial (in my example that’s $0$ times the upper bits), call it TMP2
  2. Swap the 64-bit halves of the original 128-bit result, call it TMP3 ($0$ in my example because TMP1 would be 0)
  3. Add TMP2 and TMP3, call the result TMP1 ($0+0$)
  4. Repeat the previous three steps once
  5. Return the addition of the current TMP1 and TMP4 (the upper 128-bit) which in my case would be $0+1=1$

Get this bounty!!!

Leave a Reply

Your email address will not be published. Required fields are marked *

This site uses Akismet to reduce spam. Learn how your comment data is processed.