Undocumented Intrinsics

The following intrinsics are undocumented: neither the compiler nor the assembler supports them. To use them, emit the raw 32-bit instruction encoding with the .word directive in assembly.

__m256 __lasx_xvfscaleb_s (__m256 a, __m256i b)

Synopsis

__m256 __lasx_xvfscaleb_s (__m256 a, __m256i b)
#include <lasxintrin.h>
Instruction: xvfscaleb.s xr, xr, xr
CPU Flags: LASX

Description

Compute the IEEE 754 scaleB operation: scale each single precision floating point element in a by 2 raised to the power of the corresponding 32-bit integer element in b. Currently undocumented.

Operation

for (int i = 0; i < 8; i++) {
  dst.fp32[i] = __builtin_scalbnf(a.fp32[i], b.word[i]);
}

Tested on a real machine.
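Since the compiler does not provide this intrinsic, a portable scalar model can be handy for cross-checking results obtained through .word. The following is a minimal sketch, not the intrinsic itself: the name ref_xvfscaleb_s and its array-based signature are hypothetical, and it assumes the hardware matches the C library's scalbnf, which the text above only confirms for the general scaleB operation.

```c
#include <assert.h>
#include <math.h>
#include <stdint.h>

/* Hypothetical scalar model of xvfscaleb.s (not a real intrinsic):
 * dst[i] = a[i] * 2^b[i] for each of the 8 single precision lanes.
 * Assumes scalbnf matches the hardware result. */
static void ref_xvfscaleb_s(float dst[8], const float a[8],
                            const int32_t b[8]) {
  assert(dst && a && b);
  for (int i = 0; i < 8; i++) {
    dst[i] = scalbnf(a[i], (int)b[i]);
  }
}
```

Scaling by a power of two only changes the exponent field, so results are exact unless they overflow or fall into the subnormal range.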

Latency and Throughput

CPU Latency Throughput (IPC)
3C6000 4 2

__m256d __lasx_xvfscaleb_d (__m256d a, __m256i b)

Synopsis

__m256d __lasx_xvfscaleb_d (__m256d a, __m256i b)
#include <lasxintrin.h>
Instruction: xvfscaleb.d xr, xr, xr
CPU Flags: LASX

Description

Compute the IEEE 754 scaleB operation: scale each double precision floating point element in a by 2 raised to the power of the corresponding 64-bit integer element in b. Currently undocumented.

Operation

for (int i = 0; i < 4; i++) {
  dst.fp64[i] = __builtin_scalbn(a.fp64[i], b.dword[i]);
}

Tested on a real machine.
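The double precision variant takes 64-bit exponents, while the C scalbn function takes an int. The sketch below shows one way a scalar model might bridge that gap by clamping the exponent first. The name ref_xvfscaleb_d is hypothetical, and the behaviour for out-of-range exponents is an assumption borrowed from scalbn rather than something the text above confirms for the hardware.

```c
#include <assert.h>
#include <limits.h>
#include <math.h>
#include <stdint.h>

/* Hypothetical scalar model of xvfscaleb.d (not a real intrinsic) for the
 * 4 double precision lanes. The 64-bit exponent is clamped to the int range
 * before calling scalbn. Clamping does not change the scalbn result: for
 * finite nonzero inputs an exponent that large already overflows to
 * infinity or underflows to zero, and zero, infinity and NaN inputs do not
 * depend on the exponent. */
static void ref_xvfscaleb_d(double dst[4], const double a[4],
                            const int64_t b[4]) {
  assert(dst && a && b);
  for (int i = 0; i < 4; i++) {
    int64_t e = b[i];
    if (e > INT_MAX) e = INT_MAX;
    if (e < INT_MIN) e = INT_MIN;
    dst[i] = scalbn(a[i], (int)e);
  }
}
```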

Latency and Throughput

CPU Latency Throughput (IPC)
3C6000 4 2

__m256i __lasx_xvmepatmsk_v (int mode, int uimm5)

Synopsis

__m256i __lasx_xvmepatmsk_v (int mode, int uimm5)
#include <lasxintrin.h>
Instruction: xvmepatmsk.v xr, mode, uimm5
CPU Flags: LASX

Description

Generate a 16-byte index pattern selected by mode (0 to 3), add uimm5 to each byte, and write the result to both 128-bit lanes of the destination.

Operation

if (mode == 0b00) {
  for (int i = 0; i < 16; i++) {
    dst.byte[i + 16] = dst.byte[i] =
        uimm5 + (i % 4); // [0 1 2 3 0 1 2 3 0 1 2 3 0 1 2 3]
  }
} else if (mode == 0b01) {
  for (int i = 0; i < 16; i++) {
    dst.byte[i + 16] = dst.byte[i] =
        uimm5 + (i / 4) + (i % 4); // [0 1 2 3 1 2 3 4 2 3 4 5 3 4 5 6]
  }
} else if (mode == 0b10) {
  for (int i = 0; i < 16; i++) {
    dst.byte[i + 16] = dst.byte[i] =
        uimm5 + (i / 4) + (i % 4) + 4; // [4 5 6 7 5 6 7 8 6 7 8 9 7 8 9 10]
  }
} else if (mode == 0b11) {
  for (int i = 0; i < 16; i++) {
    dst.byte[i + 16] = dst.byte[i] =
        uimm5 + i; // [0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15]
  }
} else {
  // illegal instruction
}

Tested on a real machine.
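To make the four patterns easier to experiment with on any machine, the operation above can be mirrored in portable C. This is a sketch only: ref_xvmepatmsk_v and its return-code convention are hypothetical, and truncating each sum to a byte is an assumption.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical scalar model of xvmepatmsk.v (not a real intrinsic).
 * Fills dst with the 32-byte result and returns 0, or returns -1 for a
 * mode outside 0..3, which the operation treats as an illegal instruction. */
static int ref_xvmepatmsk_v(uint8_t dst[32], int mode, int uimm5) {
  assert(dst);
  if (mode < 0 || mode > 3) return -1;
  for (int i = 0; i < 16; i++) {
    int v;
    switch (mode) {
      case 0: v = i % 4; break;             /* 0 1 2 3 repeated */
      case 1: v = i / 4 + i % 4; break;     /* sliding window from 0 */
      case 2: v = i / 4 + i % 4 + 4; break; /* sliding window from 4 */
      default: v = i; break;                /* mode 3: 0..15 */
    }
    /* Both 128-bit lanes receive the same 16 bytes. */
    dst[i] = dst[i + 16] = (uint8_t)(uimm5 + v);
  }
  return 0;
}
```

For example, mode 1 with uimm5 = 0 yields 0 1 2 3 1 2 3 4 2 3 4 5 3 4 5 6 in each lane, matching the comment in the operation above.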

Latency and Throughput

CPU Latency Throughput (IPC)
3C6000 N/A 4