x86 FPU
x86mftop
Synopsis
Instruction: x86mftop rd
CPU Flags: LBT
Description
Move from x87 FPU stack top pointer: read the current x87 TOP (stack top index, 3-bit) into rd.
Latency and Throughput
| CPU | µarch | Latency | Throughput (IPC) |
|---|---|---|---|
| 3C5000 | LA464 | N/A | 4 |
| 3C6000 | LA664 | N/A | 4 |
x86mttop
Synopsis
Instruction: x86mttop imm
CPU Flags: LBT
Description
Move to x87 FPU stack top pointer: set the x87 TOP (stack top index, 3-bit) to imm.
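TOP is a 3-bit field, so `x86mftop` and `x86mttop` can be modeled as reads and writes of a value in 0..7. The sketch below is a hypothetical software model: the `x87_top_state` struct and the `model_*` names are illustrative and not an LBT API. It assumes that `x86mftop` zero-extends TOP into rd and that `x86mttop` uses only the low 3 bits of imm.

```c
#include <stdint.h>

/* Hypothetical model of the x87 TOP field; names are illustrative only. */
typedef struct {
    unsigned top; /* stack top index, kept in 0..7 */
} x87_top_state;

/* x86mftop rd: rd receives TOP (assumed zero-extended to 64 bits). */
static uint64_t model_mftop(const x87_top_state *s) {
    return (uint64_t)(s->top & 7u);
}

/* x86mttop imm: TOP := imm (assumption: only the low 3 bits are used). */
static void model_mttop(x87_top_state *s, unsigned imm) {
    s->top = imm & 7u;
}
```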
Latency and Throughput
| CPU | µarch | Latency | Throughput (IPC) |
|---|---|---|---|
| 3C5000 | LA464 | N/A | 4 |
| 3C6000 | LA664 | N/A | 1 |
x86inctop
Synopsis
Instruction: x86inctop
CPU Flags: LBT
Description
Increment x87 FPU stack top pointer (FINCSTP): increment TOP by 1, wrapping modulo 8.
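The modulo-8 wrap reduces to a mask. This is a minimal sketch that assumes TOP is held as an unsigned value in 0..7; `model_inctop` is a hypothetical helper. On x86, FINCSTP changes only TOP and leaves register contents and tags untouched; the sketch assumes x86inctop does the same.

```c
/* Hypothetical model: TOP := (TOP + 1) mod 8, nothing else changes. */
static unsigned model_inctop(unsigned top) {
    return (top + 1u) & 7u; /* 7 wraps to 0 */
}
```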
Latency and Throughput
| CPU | µarch | Latency | Throughput (IPC) |
|---|---|---|---|
| 3C5000 | LA464 | N/A | 4 |
| 3C6000 | LA664 | N/A | 1 |
x86dectop
Synopsis
Instruction: x86dectop
CPU Flags: LBT
Description
Decrement x87 FPU stack top pointer (FDECSTP): decrement TOP by 1, wrapping modulo 8.
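Subtracting 1 modulo 8 can be written as adding 7 and masking, which avoids unsigned underflow. For context, an x87 push (for example FLD) also decrements TOP before writing the new ST(0), so FDECSTP moves TOP the way a push would, without writing a register or tag. `model_dectop` is a hypothetical helper in a minimal sketch, not an LBT API.

```c
/* Hypothetical model: TOP := (TOP - 1) mod 8, computed without underflow. */
static unsigned model_dectop(unsigned top) {
    return (top + 7u) & 7u; /* 0 wraps to 7 */
}
```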
Latency and Throughput
| CPU | µarch | Latency | Throughput (IPC) |
|---|---|---|---|
| 3C5000 | LA464 | N/A | 4 |
| 3C6000 | LA664 | N/A | 1 |
x86settm
Synopsis
Instruction: x86settm
CPU Flags: LBT
Description
Set x87 FPU stack translation mode (TM): enable TOP-based remapping of floating-point registers. While TM is set, register numbers 0-7 ($f0-$f7) are offset by TOP.
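A sketch of the remapping, assuming it follows the usual x87 rule that ST(i) names physical register (TOP + i) mod 8, applied only to register numbers 0-7. The mapping direction and the modulo-8 wrap are assumptions, not taken from this page, and `model_fpr_index` is a hypothetical helper.

```c
#include <stdbool.h>

/* Hypothetical model of TM remapping: with TM set, FPR numbers 0-7 are
   rotated by TOP (assumed to wrap modulo 8); all other accesses, and all
   accesses with TM clear, use the register number unchanged. */
static unsigned model_fpr_index(unsigned reg, unsigned top, bool tm) {
    if (tm && reg < 8)
        return (reg + top) & 7u;
    return reg;
}
```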
Latency and Throughput
| CPU | µarch | Latency | Throughput (IPC) |
|---|---|---|---|
| 3C5000 | LA464 | N/A | 0.14 (1/7) |
| 3C6000 | LA664 | N/A | 0.14 (1/7) |
x86clrtm
Synopsis
Instruction: x86clrtm
CPU Flags: LBT
Description
Clear x87 FPU stack translation mode (TM): disable TOP-based remapping of floating-point registers.
Latency and Throughput
| CPU | µarch | Latency | Throughput (IPC) |
|---|---|---|---|
| 3C5000 | LA464 | N/A | 0.14 (1/7) |
| 3C6000 | LA664 | N/A | 0.14 (1/7) |
x86settag
Synopsis
Instruction: x86settag rd, imm1, imm2
CPU Flags: LBT
Description
Update an x87 tag-word bit held in rd, as selected by imm1 and imm2. imm2 selects the target bit position; imm1 mod 8 selects the operation: 0 sets the bit (only a 0→1 transition is valid), 1 clears the bit (only a 1→0 transition is valid), and 2-4 check the tag byte and then modify it. An invalid state transition raises a BTE (Binary Translation Exception).
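A partial sketch of operations 0 and 1 only, under the reading that "0→1 only" and "1→0 only" mean any other starting state raises a BTE. Operations 2-4 are not modeled because the tag-byte check is not specified here. The result codes, the `model_settag` name, and treating imm2 as a bit index into rd (masked to 0..63) are all hypothetical.

```c
#include <stdint.h>

/* Hypothetical result codes used only by this sketch. */
enum settag_result { SETTAG_OK, SETTAG_BTE, SETTAG_UNMODELED };

static enum settag_result model_settag(uint64_t *rd, unsigned imm1, unsigned imm2) {
    uint64_t mask = 1ULL << (imm2 & 63); /* assumption: imm2 indexes a bit of rd */
    switch (imm1 % 8) {
    case 0: /* set: only a 0 -> 1 transition is legal */
        if (*rd & mask) return SETTAG_BTE;
        *rd |= mask;
        return SETTAG_OK;
    case 1: /* clear: only a 1 -> 0 transition is legal */
        if (!(*rd & mask)) return SETTAG_BTE;
        *rd &= ~mask;
        return SETTAG_OK;
    default: /* 2-4: check tag byte then modify; not modeled here */
        return SETTAG_UNMODELED;
    }
}
```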
fcvt.ud.d
Synopsis
Instruction: fcvt.ud.d fd, fj
CPU Flags: LBT
Description
Convert the double-precision floating-point value in fj to x87 80-bit extended precision and store the upper 16 bits (sign and exponent) of the result in fd.
Operation
uint64_t d = a;
unsigned sign = (d >> 63) & 1;
unsigned exp_d = (d >> 52) & 0x7FF;
uint64_t man_d = d & 0x000FFFFFFFFFFFFFULL;
// Re-bias exponent from double (1023) to x87 extended (16383).
// Double uses an implied integer bit (J=1 for normal numbers).
// x87 extended uses an explicit integer bit.
int64_t exp_x87;
if (exp_d == 0x7FF) {
// Infinity or NaN: x87 exponent is also all-ones.
exp_x87 = 0x7FFF;
} else if (exp_d == 0 && man_d == 0) {
// Signed zero: x87 exponent 0, integer bit 0.
exp_x87 = 0;
} else if (exp_d == 0) {
// Denormal double: value = man_d * 2^(-1074).
// Normalize by finding leading bit position p (0-indexed from LSB).
// Resulting x87 exponent: (-1074 + p) biased by 16383.
int p = 63 - __builtin_clzll(man_d);
exp_x87 = -1074 + p + 16383;
} else {
// Normal double → normal x87 extended.
exp_x87 = (int64_t)exp_d - 1023 + 16383;
}
// 16 bits: sign[15] | exponent[14:0]
dst = ((uint64_t)sign << 15) | (exp_x87 & 0x7FFF);
Tested on real machine.
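A standalone C version of the Operation above can be handy for checking expected results on a host. This sketch reads the double's bits with `memcpy` and, like the Operation, relies on the GCC/Clang builtin `__builtin_clzll`; the name `x87_sign_exponent` is hypothetical. As a worked value, 1.0 has unbiased exponent 0, so its x87 exponent field is 0 + 16383 = 0x3FFF.

```c
#include <stdint.h>
#include <string.h>

/* Returns sign[15] | exponent[14:0] of the x87 80-bit encoding of x. */
static uint16_t x87_sign_exponent(double x) {
    uint64_t d;
    memcpy(&d, &x, sizeof d); /* reinterpret the double's bit pattern */
    unsigned sign = (unsigned)(d >> 63) & 1u;
    unsigned exp_d = (unsigned)(d >> 52) & 0x7FFu;
    uint64_t man_d = d & 0x000FFFFFFFFFFFFFULL;
    int64_t exp_x87;
    if (exp_d == 0x7FF)
        exp_x87 = 0x7FFF;                     /* infinity or NaN */
    else if (exp_d == 0 && man_d == 0)
        exp_x87 = 0;                          /* signed zero */
    else if (exp_d == 0)                      /* denormal: man_d * 2^-1074 */
        exp_x87 = -1074 + (63 - __builtin_clzll(man_d)) + 16383;
    else                                      /* normal: re-bias 1023 -> 16383 */
        exp_x87 = (int64_t)exp_d - 1023 + 16383;
    return (uint16_t)((sign << 15) | (unsigned)(exp_x87 & 0x7FFF));
}
```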
Latency and Throughput
| CPU | µarch | Latency | Throughput (IPC) |
|---|---|---|---|
| 3C5000 | LA464 | 2 | 2 |
| 3C6000 | LA664 | 2 | 2 |
fcvt.ld.d
Synopsis
Instruction: fcvt.ld.d fd, fj
CPU Flags: LBT
Description
Convert the double-precision floating-point value in fj to x87 80-bit extended precision and store the lower 64 bits (integer bit and fraction) of the result in fd.
Operation
uint64_t d = a;
unsigned exp_d = (d >> 52) & 0x7FF;
uint64_t man_d = d & 0x000FFFFFFFFFFFFFULL;
// Integer bit (J): 1 for normal, denormal (after normalization), infinity
// and NaN; 0 only for zero (exponent and fraction both zero).
unsigned j_bit;
uint64_t man_x87;
if (exp_d == 0 && man_d == 0) {
j_bit = 0;
man_x87 = 0;
} else if (exp_d == 0) {
// Denormal double: normalize mantissa so leading 1 becomes J bit.
// Find leading bit position p, shift mantissa left by (63 - p).
int p = 63 - __builtin_clzll(man_d);
j_bit = 1;
man_x87 = (man_d << (63 - p)) & 0x7FFFFFFFFFFFFFFFULL;
} else if (exp_d == 0x7FF && man_d == 0) {
// Infinity: J=1, fraction=0.
j_bit = 1;
man_x87 = 0;
} else if (exp_d == 0x7FF) {
// NaN: map 52-bit double mantissa to 63-bit x87 fraction.
// Hardware converts SNaN to QNaN on format conversion: forces both
// J_bit=1 and fraction bit 62 to 1.
j_bit = 1;
man_x87 = (man_d << 11) | (1ULL << 62);
} else {
// Normal: explicit J=1, shift 52-bit fraction to 63-bit.
j_bit = 1;
man_x87 = man_d << 11;
}
// Lower 64 bits: J[63] | fraction[62:0]
dst = ((uint64_t)j_bit << 63) | man_x87;
Tested on real machine.
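A compact, host-runnable C restatement of the Operation above. It uses the equivalent shortcut `man_d << clz(man_d)` for denormals: this places the leading 1 directly in bit 63, which is the same as setting J and keeping `man_d << (63 - p)` masked to the fraction. `x87_significand` is a hypothetical name, and `__builtin_clzll` is a GCC/Clang builtin.

```c
#include <stdint.h>
#include <string.h>

/* Returns J[63] | fraction[62:0] of the x87 80-bit encoding of x. */
static uint64_t x87_significand(double x) {
    uint64_t d;
    memcpy(&d, &x, sizeof d);
    unsigned exp_d = (unsigned)(d >> 52) & 0x7FFu;
    uint64_t man_d = d & 0x000FFFFFFFFFFFFFULL;
    const uint64_t J = 1ULL << 63;               /* explicit integer bit */
    if (exp_d == 0 && man_d == 0)
        return 0;                                /* zero: J = 0, fraction = 0 */
    if (exp_d == 0)                              /* denormal: leading 1 -> bit 63 */
        return man_d << __builtin_clzll(man_d);
    if (exp_d == 0x7FF && man_d != 0)            /* NaN: quieted, bit 62 forced */
        return J | (1ULL << 62) | (man_d << 11);
    return J | (man_d << 11);                    /* normal or infinity */
}
```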
Latency and Throughput
| CPU | µarch | Latency | Throughput (IPC) |
|---|---|---|---|
| 3C5000 | LA464 | 2 | 2 |
| 3C6000 | LA664 | 2 | 2 |
fcvt.d.ld
Synopsis
Instruction: fcvt.d.ld fd, fj, fk
CPU Flags: LBT
Description
Convert an x87 80-bit extended precision value (lower 64 bits, J bit and fraction, in fj; upper 16 bits, sign and exponent, in fk) to double precision and store the result in fd.
Operation
uint64_t lo = a; // J[63] | fraction[62:0]
uint64_t hi = b; // sign[15] | exponent[14:0]
unsigned sign = (hi >> 15) & 1;
int64_t exp_x87 = hi & 0x7FFF;
unsigned j_bit = (lo >> 63) & 1;
uint64_t man_x87 = lo & 0x7FFFFFFFFFFFFFFFULL;
// Build the full 64-bit significand: J_bit.fraction
uint64_t full = ((uint64_t)j_bit << 63) | man_x87;
// Re-bias exponent from x87 extended (16383) to double (1023), then round
// the 64-bit significand to 53 bits (52 stored) with round-to-nearest-even.
unsigned exp_d;
uint64_t man_d;
if (exp_x87 == 0x7FFF && j_bit == 1) {
// True infinity or NaN with J=1. Hardware always produces a
// quiet NaN by setting bit 51 of the mantissa.
man_d = man_x87 >> 11;
man_d |= (1ULL << 51); // ensure QNaN
if (man_x87 == 0) {
// x87 infinity (fraction is zero) → double infinity (mantissa zero)
man_d = 0;
}
exp_d = 0x7FF;
} else if (exp_x87 == 0 && j_bit == 0) {
// x87 denormal or zero: J=0 with zero exponent. Such values are below
// 2^-16382, far under the smallest double denormal, so hardware returns
// a zero with the input's sign.
exp_d = 0;
man_d = 0;
} else if (j_bit == 0) {
// Unnormal or pseudo-NaN: J=0 with non-zero exponent is an
// invalid x87 encoding. Hardware produces the default quiet
// NaN (sign=0, mantissa=0x8000000000000, exponent=0x7FF).
sign = 0;
exp_d = 0x7FF;
man_d = 0x8000000000000ULL;
} else {
int64_t rebias = exp_x87 - 16383 + 1023;
if (rebias >= 0x7FF) {
// Overflow to infinity.
exp_d = 0x7FF;
man_d = 0;
} else if (rebias <= 0) {
// Underflow: produce double denormal.
// Right-shift the full 64-bit significand, then RNE to 52 bits.
int shift = 1 - (int)rebias;
exp_d = 0;
int rshift = 11 + shift; // total right-shift: 64-bit significand -> denormal mantissa
// Handle rshift >= 64 separately: all bits discarded.
if (rshift >= 64) {
// Result before rounding is zero. RNE rounds up to the smallest
// denormal (man_d=1) only when rshift == 64 and full > 2^63
// (guard=1, sticky=1 -> round up). Ties (full == 2^63, guard=1,
// sticky=0, LSB=0) and all rshift > 64 cases round down to 0.
man_d = (rshift == 64 && full > (1ULL << 63)) ? 1 : 0;
} else {
uint64_t result = full >> rshift;
uint64_t lost = full & ((1ULL << rshift) - 1);
uint64_t half = 1ULL << (rshift - 1);
// Round-to-nearest-even on the discarded bits
if (lost > half || (lost == half && (result & 1))) {
result++;
}
// If rounding overflowed past the denormal boundary, it becomes
// the smallest normal number (biased exponent 1, mantissa 0).
if (result & ((uint64_t)1 << 52)) {
// overflow to smallest normal
exp_d = 1;
man_d = 0;
} else {
man_d = result & 0x000FFFFFFFFFFFFFULL;
}
}
} else {
// Normal case: round 64-bit significand to 53 bits, then strip
// the implicit leading 1 to get the 52-bit double mantissa.
exp_d = (unsigned)rebias;
uint64_t result53 = full >> 11; // top 53 bits (bit 52 = implicit 1)
uint64_t lost = full & 0x7FFULL; // bottom 11 bits
uint64_t half = 0x400ULL;
// Round-to-nearest-even
if (lost > half || (lost == half && (result53 & 1))) {
result53++;
if (result53 & (1ULL << 53)) {
// Carry into bit 53 → significand wrapped to 2.0.
// Renormalize and adjust exponent.
result53 >>= 1;
exp_d++;
if (exp_d >= 0x7FF) {
// Overflowed to infinity.
exp_d = 0x7FF;
result53 = 0;
}
}
}
man_d = result53 & 0x000FFFFFFFFFFFFFULL;
}
}
// Recombine: sign[63] | exponent[62:52] | fraction[51:0]
dst = ((uint64_t)sign << 63) | ((uint64_t)exp_d << 52) | man_d;
Tested on real machine.
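The same conversion can be written more compactly by folding the normal and denormal rounding paths into one round-to-nearest-even shift helper. This is an alternative formulation for experimenting on a host, not the hardware definition; the names `shift_rne` and `x87_to_double_bits` are hypothetical, and the function takes the two halves as explicit `hi`/`lo` arguments rather than assuming a register mapping. The expected bit patterns in the checks follow from the Operation above, including the ties-to-even cases.

```c
#include <stdint.h>

/* Round v >> s to nearest, ties to even (s >= 11 in this use). */
static uint64_t shift_rne(uint64_t v, int s) {
    if (s > 64) return 0;                        /* below half an ulp */
    if (s == 64) return v > (1ULL << 63);        /* only guard/sticky survive */
    uint64_t q = v >> s;
    uint64_t lost = v & ((1ULL << s) - 1);
    uint64_t half = 1ULL << (s - 1);
    if (lost > half || (lost == half && (q & 1))) q++;
    return q;
}

/* hi = sign[15] | exponent[14:0], lo = J[63] | fraction[62:0]. */
static uint64_t x87_to_double_bits(uint16_t hi, uint64_t lo) {
    uint64_t sign = (uint64_t)(hi >> 15) << 63;
    int64_t e = hi & 0x7FFF;
    uint64_t frac = lo & 0x7FFFFFFFFFFFFFFFULL;
    if (!(lo >> 63))                             /* J = 0 */
        return e == 0 ? sign                     /* zero / x87 denormal -> +-0 */
                      : 0x7FF8000000000000ULL;   /* unnormal etc. -> default NaN */
    if (e == 0x7FFF)                             /* infinity or NaN */
        return frac == 0 ? sign | 0x7FF0000000000000ULL
                         : sign | 0x7FF8000000000000ULL | (frac >> 11);
    int64_t r = e - 16383 + 1023;                /* re-biased exponent */
    if (r >= 0x7FF)
        return sign | 0x7FF0000000000000ULL;     /* overflow to infinity */
    if (r <= 0)                                  /* double denormal; a carry */
        return sign | shift_rne(lo, (int)(12 - r)); /* to 2^52 gives exponent 1 */
    /* q keeps the implicit 1 at bit 52; adding it to (r - 1) << 52 also
       carries a round-up to 2^53 into the exponent (possibly to infinity). */
    return sign | (((uint64_t)(r - 1) << 52) + shift_rne(lo, 11));
}
```

Adding the rounded significand to the exponent field lets a single carry handle both "significand rounds up to 2.0" and "largest finite rounds to infinity", which is why there is no separate renormalization step.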
Latency and Throughput
| CPU | µarch | Latency | Throughput (IPC) |
|---|---|---|---|
| 3C5000 | LA464 | 4 | 2 |
| 3C6000 | LA664 | 4 | 4 |