Streaming SIMD Extensions (SSE)

SSE — An Overview

SSE is a newer SIMD extension for the Intel Pentium III and AMD Athlon XP microprocessors. Unlike the MMX and 3DNow! extensions, which occupy the same register space as the normal FPU registers, SSE adds a separate register space to the microprocessor. Because the operating system has to save and restore these extra registers, SSE can only be used on operating systems that support it. Fortunately, most recent operating systems have support built in. All versions of Windows since Windows 98 support SSE, as do Linux kernels since 2.2.

SSE was introduced in 1999, and was also known as "Katmai New Instructions" (or KNI) after the Pentium III's core codename.

SSE adds 8 new 128-bit registers, each divided into 4 32-bit (single-precision) floating-point values. These registers are called XMM0 - XMM7. An additional control register, MXCSR, is also available to control and check the status of SSE instructions.

SSE gives us access to 70 new instructions that operate on these 128-bit registers, MMX registers, and sometimes even regular 32-bit registers.

SSE — MXCSR

The MXCSR register is a 32-bit register containing flags for control and status information regarding SSE instructions. As of SSE3, only bits 0-15 have been defined.

Mnemonic	Bit Location	Description
FZ	bit 15	Flush To Zero
R+	bit 14	Round Positive
R-	bit 13	Round Negative
RZ	bits 13 and 14 are both 1	Round To Zero
RN	bits 13 and 14 are both 0	Round To Nearest
PM	bit 12	Precision Mask
UM	bit 11	Underflow Mask
OM	bit 10	Overflow Mask
ZM	bit 9	Divide By Zero Mask
DM	bit 8	Denormal Mask
IM	bit 7	Invalid Operation Mask
DAZ	bit 6	Denormals Are Zero
PE	bit 5	Precision Flag
UE	bit 4	Underflow Flag
OE	bit 3	Overflow Flag
ZE	bit 2	Divide By Zero Flag
DE	bit 1	Denormal Flag
IE	bit 0	Invalid Operation Flag
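As a rough illustration of the layout in the table above, the following C sketch gives each bit a name and pulls the rounding field out of an MXCSR value. The macro, enum, and function names are made up for this example and are not part of any official header. The value 0x1F80 is the power-on default of MXCSR (all exceptions masked, round to nearest, no flags set).

```c
#include <assert.h>
#include <stdint.h>

/* Bit masks for the MXCSR fields in the table above (illustrative names). */
#define MXCSR_IE  (1u << 0)   /* Invalid Operation Flag  */
#define MXCSR_DE  (1u << 1)   /* Denormal Flag           */
#define MXCSR_ZE  (1u << 2)   /* Divide By Zero Flag     */
#define MXCSR_OE  (1u << 3)   /* Overflow Flag           */
#define MXCSR_UE  (1u << 4)   /* Underflow Flag          */
#define MXCSR_PE  (1u << 5)   /* Precision Flag          */
#define MXCSR_DAZ (1u << 6)   /* Denormals Are Zero      */
#define MXCSR_IM  (1u << 7)   /* Invalid Operation Mask  */
#define MXCSR_DM  (1u << 8)   /* Denormal Mask           */
#define MXCSR_ZM  (1u << 9)   /* Divide By Zero Mask     */
#define MXCSR_OM  (1u << 10)  /* Overflow Mask           */
#define MXCSR_UM  (1u << 11)  /* Underflow Mask          */
#define MXCSR_PM  (1u << 12)  /* Precision Mask          */
#define MXCSR_RC  (3u << 13)  /* Rounding field, bits 13 and 14 */
#define MXCSR_FZ  (1u << 15)  /* Flush To Zero           */

#define MXCSR_ALL_FLAGS (0x3Fu)       /* bits 0-5  */
#define MXCSR_ALL_MASKS (0x3Fu << 7)  /* bits 7-12 */

/* Values of the two-bit rounding field: bit 13 is R-, bit 14 is R+. */
enum mxcsr_rounding {
    ROUND_NEAREST  = 0,  /* RN: both bits clear */
    ROUND_NEGATIVE = 1,  /* R-: bit 13 only     */
    ROUND_POSITIVE = 2,  /* R+: bit 14 only     */
    ROUND_ZERO     = 3   /* RZ: both bits set   */
};

/* Extracts the rounding mode from an MXCSR value. */
static enum mxcsr_rounding mxcsr_rounding_mode(uint32_t mxcsr)
{
    return (enum mxcsr_rounding)((mxcsr & MXCSR_RC) >> 13);
}

/* Returns a copy of mxcsr with the rounding field replaced. */
static uint32_t mxcsr_with_rounding(uint32_t mxcsr, enum mxcsr_rounding mode)
{
    return (mxcsr & ~MXCSR_RC) | ((uint32_t)mode << 13);
}

static void mxcsr_layout_example(void)
{
    const uint32_t power_on = 0x1F80u;  /* default MXCSR value */
    assert((power_on & MXCSR_ALL_MASKS) == MXCSR_ALL_MASKS);
    assert((power_on & MXCSR_ALL_FLAGS) == 0);
    assert(mxcsr_rounding_mode(power_on) == ROUND_NEAREST);
    assert(mxcsr_with_rounding(power_on, ROUND_ZERO) == 0x7F80u);
}
```

The helpers only work on plain integers, so they run on any machine. Getting the value into and out of the real register is the job of ldmxcsr and stmxcsr.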

FZ mode causes all underflowing operations to simply go to zero. This saves some processing time, but loses precision.

The R+, R-, RN, and RZ rounding modes determine how the lowest bit is generated. Normally, RN is used.
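To see what the rounding modes do in practice, here is a small sketch in C using the compiler intrinsics from `<xmmintrin.h>` (an assumption of this example; the text itself only talks about the raw instructions). It converts a float to an integer with cvtss2si under a chosen mode and then restores the previous mode. The function name `convert_with_rounding` is made up for illustration.

```c
#include <assert.h>
#include <xmmintrin.h>

/* Converts x to an integer with cvtss2si under the given rounding mode
   (one of the _MM_ROUND_* constants), then restores the previous mode. */
static int convert_with_rounding(float x, unsigned int mode)
{
    assert((mode & ~(unsigned int)_MM_ROUND_MASK) == 0);

    const unsigned int saved = _MM_GET_ROUNDING_MODE();
    _MM_SET_ROUNDING_MODE(mode);

    /* volatile keeps the conversion between the two mode changes
       and stops the compiler from evaluating it at compile time. */
    volatile float input = x;
    volatile int result = _mm_cvtss_si32(_mm_set_ss(input));

    _MM_SET_ROUNDING_MODE(saved);
    return result;
}

static void rounding_example(void)
{
    assert(convert_with_rounding(2.5f, _MM_ROUND_NEAREST) == 2);      /* RN: ties go to even */
    assert(convert_with_rounding(2.5f, _MM_ROUND_DOWN) == 2);         /* R- */
    assert(convert_with_rounding(2.5f, _MM_ROUND_UP) == 3);           /* R+ */
    assert(convert_with_rounding(2.5f, _MM_ROUND_TOWARD_ZERO) == 2);  /* RZ */
    assert(convert_with_rounding(-2.5f, _MM_ROUND_DOWN) == -3);
    assert(convert_with_rounding(-2.5f, _MM_ROUND_TOWARD_ZERO) == -2);
}
```

Restoring the old mode matters because MXCSR is shared by all SSE code in the thread, so a forgotten mode change quietly affects every later calculation.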

PM, UM, OM, ZM, DM, and IM are masks that tell the processor to ignore the corresponding exceptions if they happen. This keeps the program from having to deal with problems, but might cause invalid results.

DAZ tells the CPU to force all denormals to zero. A denormal is a number that is so small that the FPU can't normalize it because of the limited exponent range. Denormals behave just like normal numbers, but they take considerably longer to process. Note that not all processors support DAZ.

PE, UE, OE, ZE, DE, and IE are the exception flags, which are set when the corresponding exceptions happen and aren't unmasked. Programs can check these to see if something interesting happened. These bits are "sticky", which means that once they're set, they stay set until the program clears them. This means that the indicated exception could have happened several operations ago, but nobody bothered to clear it.
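The "sticky" behaviour is easy to watch from C. The sketch below uses the `_mm_getcsr` and `_mm_setcsr` intrinsics (which correspond to stmxcsr and ldmxcsr) and assumes the divide-by-zero exception is masked, which is the default. The helper names are invented for this example.

```c
#include <assert.h>
#include <xmmintrin.h>

#define MXCSR_ZE (1u << 2)  /* Divide By Zero Flag, bit 2 */

/* Returns nonzero if the ZE flag is currently set in MXCSR. */
static int ze_is_set(void)
{
    return (_mm_getcsr() & MXCSR_ZE) != 0;
}

/* Clears only the ZE flag and leaves every other MXCSR bit alone. */
static void clear_ze(void)
{
    _mm_setcsr(_mm_getcsr() & ~MXCSR_ZE);
}

/* Computes 1.0f / divisor with divss. The volatile variables keep the
   division at run time and in program order. */
static float divide_one_by(float divisor)
{
    volatile float d = divisor;
    volatile float q = _mm_cvtss_f32(_mm_div_ss(_mm_set_ss(1.0f), _mm_set_ss(d)));
    return q;
}

static void sticky_flag_example(void)
{
    clear_ze();
    assert(!ze_is_set());

    (void)divide_one_by(0.0f);  /* masked divide by zero: result is infinity */
    assert(ze_is_set());

    (void)divide_one_by(4.0f);  /* a clean division does not clear the flag */
    assert(ze_is_set());

    clear_ze();                 /* only the program clears it */
    assert(!ze_is_set());
}
```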

DAZ wasn't available in the first version of SSE. Since setting a reserved bit in MXCSR causes a general protection fault, we need to be able to check the availability of this feature without causing problems. To do this, one needs to set up a 512-byte, 16-byte-aligned area of memory, save the SSE state to it using fxsave, and then inspect bytes 28 through 31 for the MXCSR_MASK value. If bit 6 is set, DAZ is supported; otherwise, it isn't.
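The check described above can be sketched in C as follows. To keep the example portable, it only inspects a buffer that is assumed to already hold an fxsave image; actually filling the buffer (with the fxsave instruction or a compiler intrinsic for it) is left to the caller. All names here are made up for illustration. One extra detail from Intel's documentation: if the stored MXCSR_MASK is 0, the processor is treated as having the default mask 0xFFBF, which has bit 6 clear.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define FXSAVE_AREA_SIZE   512u         /* size of the fxsave image in bytes  */
#define MXCSR_MASK_OFFSET  28u          /* MXCSR_MASK lives in bytes 28-31    */
#define MXCSR_DAZ          (1u << 6)    /* Denormals Are Zero, bit 6          */
#define MXCSR_MASK_DEFAULT 0x0000FFBFu  /* used when the stored mask is 0     */

/* Reads MXCSR_MASK (little-endian) from an fxsave image. */
static uint32_t mxcsr_mask_from_fxsave(const unsigned char *area)
{
    assert(area != NULL);
    const uint32_t mask = (uint32_t)area[MXCSR_MASK_OFFSET]
                        | (uint32_t)area[MXCSR_MASK_OFFSET + 1] << 8
                        | (uint32_t)area[MXCSR_MASK_OFFSET + 2] << 16
                        | (uint32_t)area[MXCSR_MASK_OFFSET + 3] << 24;
    return mask != 0 ? mask : MXCSR_MASK_DEFAULT;
}

/* Returns nonzero if the image says the DAZ bit may be set in MXCSR. */
static int daz_supported(const unsigned char *area)
{
    return (mxcsr_mask_from_fxsave(area) & MXCSR_DAZ) != 0;
}

static void daz_check_example(void)
{
    /* Stand-in for a real fxsave image: all zero, so MXCSR_MASK reads as 0. */
    unsigned char area[FXSAVE_AREA_SIZE] = {0};
    assert(mxcsr_mask_from_fxsave(area) == 0xFFBFu);
    assert(!daz_supported(area));

    /* Pretend the processor reported a mask of 0xFFFF (bit 6 set). */
    area[28] = 0xFF;
    area[29] = 0xFF;
    assert(daz_supported(area));
}
```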

SSE — OpCode List

(Still under construction. "Lowest" means bits 0-31, not the smallest value of the set. The byte/word and 8-bit/16-bit terminology still needs to be regularized.)

Arithmetic:

addps - Adds 4 single-precision (32-bit) floating-point values to 4 other single-precision floating-point values.

addss - Adds the lowest single-precision values, top 3 remain unchanged.

subps - Subtracts 4 single-precision floating-point values from 4 other single-precision floating-point values.

subss - Subtracts the lowest single-precision values, top 3 remain unchanged.

mulps - Multiplies 4 single-precision floating-point values with 4 other single-precision values.

mulss - Multiplies the lowest single-precision values, top 3 remain unchanged.

divps - Divides 4 single-precision floating-point values by 4 other single-precision floating-point values.

divss - Divides the lowest single-precision values, top 3 remain unchanged.

rcpps - Reciprocates (1/x) 4 single-precision floating-point values.

rcpss - Reciprocates the lowest single-precision values, top 3 remain unchanged.

sqrtps - Square root of 4 single-precision values.

sqrtss - Square root of lowest value, top 3 remain unchanged.

rsqrtps - Reciprocal square root of 4 single-precision floating-point values.

rsqrtss - Reciprocal square root of lowest single-precision value, top 3 remain unchanged.

maxps - Returns maximum of 2 values in each of 4 single-precision values.

maxss - Returns maximum of 2 values in the lowest single-precision value, top 3 remain unchanged.

minps - Returns minimum of 2 values in each of 4 single-precision values.

minss - Returns minimum of 2 values in the lowest single-precision value, top 3 remain unchanged.

pavgb - Returns average of 2 values in each of 8 bytes.

pavgw - Returns average of 2 values in each of 4 words.

psadbw - Returns sum of absolute differences of 8 8-bit values. Result in bottom 16 bits.

pextrw - Extracts 1 of 4 words.

pinsrw - Inserts 1 of 4 words.

pmaxsw - Returns maximum of 2 values in each of 4 signed word values.

pmaxub - Returns maximum of 2 values in each of 8 unsigned byte values.

pminsw - Returns minimum of 2 values in each of 4 signed word values.

pminub - Returns minimum of 2 values in each of 8 unsigned byte values.

pmovmskb - Builds mask byte from top bit of 8 byte values.

pmulhuw - Multiplies 4 unsigned word values and stores the high 16-bit result.

pshufw - Shuffles 4 word values. Complex.
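The difference between the packed (ps) and scalar (ss) forms in the list above is easiest to see side by side. This is a minimal sketch using the `_mm_add_ps` and `_mm_add_ss` intrinsics from `<xmmintrin.h>`, which compile to addps and addss; the wrapper function names are made up for the example.

```c
#include <assert.h>
#include <xmmintrin.h>

/* out[i] = a[i] + b[i] for all 4 values (addps). */
static void add_packed(const float a[4], const float b[4], float out[4])
{
    _mm_storeu_ps(out, _mm_add_ps(_mm_loadu_ps(a), _mm_loadu_ps(b)));
}

/* out[0] = a[0] + b[0]; out[1..3] are copied from a unchanged (addss). */
static void add_scalar(const float a[4], const float b[4], float out[4])
{
    _mm_storeu_ps(out, _mm_add_ss(_mm_loadu_ps(a), _mm_loadu_ps(b)));
}

static void packed_vs_scalar_example(void)
{
    const float a[4] = { 1.0f, 2.0f, 3.0f, 4.0f };     /* a[0] is the lowest value */
    const float b[4] = { 10.0f, 20.0f, 30.0f, 40.0f };
    float out[4];

    add_packed(a, b, out);
    assert(out[0] == 11.0f && out[1] == 22.0f && out[2] == 33.0f && out[3] == 44.0f);

    add_scalar(a, b, out);
    assert(out[0] == 11.0f && out[1] == 2.0f && out[2] == 3.0f && out[3] == 4.0f);
}
```

The same pattern holds for the sub, mul, div, sqrt, max, and min pairs: the ps form works on all 4 values, the ss form only on the lowest one.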

Logic:

andnps - Logically ANDs 4 single-precision values with the logical inverse (NOT) of 4 other single-precision values.

andps - Logically ANDs 4 single-precision values with 4 other single-precision values.

orps - Logically ORs 4 single-precision values with 4 other single-precision values.

xorps - Logically XORs 4 single-precision values with 4 other single-precision values.
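Logic operations on floating-point values may look odd, but they are handy for working on the sign bit. The sketch below is one common use, not something the list above prescribes: andnps with a sign-bit mask gives the absolute value, and xorps with the same mask flips the sign. It uses the `_mm_andnot_ps` and `_mm_xor_ps` intrinsics, and the function names are invented for the example.

```c
#include <assert.h>
#include <xmmintrin.h>

/* Absolute value of 4 floats: clears bit 31 of each value with andnps.
   _mm_andnot_ps(m, x) computes (NOT m) AND x. */
static __m128 abs_ps(__m128 x)
{
    const __m128 sign_mask = _mm_set1_ps(-0.0f);  /* only the sign bit set */
    return _mm_andnot_ps(sign_mask, x);
}

/* Negates 4 floats: flips bit 31 of each value with xorps. */
static __m128 negate_ps(__m128 x)
{
    const __m128 sign_mask = _mm_set1_ps(-0.0f);
    return _mm_xor_ps(x, sign_mask);
}

/* Helper for the checks below: compares all 4 values of v, lowest first. */
static int lanes_equal(__m128 v, float e0, float e1, float e2, float e3)
{
    float out[4];
    _mm_storeu_ps(out, v);
    return out[0] == e0 && out[1] == e1 && out[2] == e2 && out[3] == e3;
}

static void logic_example(void)
{
    const __m128 x = _mm_setr_ps(-1.5f, 2.0f, -0.0f, 3.25f);  /* lowest value first */
    assert(lanes_equal(abs_ps(x), 1.5f, 2.0f, 0.0f, 3.25f));
    assert(lanes_equal(negate_ps(x), 1.5f, -2.0f, 0.0f, -3.25f));
}
```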

Compare:

cmpxxps - Compares 4 single-precision values.

cmpxxss - Compares lowest 2 single-precision values.

comiss - Compares lowest 2 single-precision values and stores result in EFLAGS.

ucomiss - Compares lowest 2 single-precision values and stores result in EFLAGS. (QNaNs don't throw exceptions with ucomiss, unlike comiss.)

Compare Codes (the xx parts above):

eq - Equal to.

lt - Less than.

le - Less than or equal to.

neq - Not equal.

nlt - Not less than.

nle - Not less than or equal to.

ord - Ordered.

unord - Unordered.
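The compare codes above can be tried out with the matching intrinsics. In this sketch, `_mm_cmplt_ps`, `_mm_cmpnlt_ps`, and `_mm_cmpunord_ps` stand for cmpltps, cmpnltps, and cmpunordps, and `_mm_movemask_ps` (movmskps) turns the 4 results into a 4-bit number so they are easy to check. The wrapper names are made up. Note how a NaN is not "less than" anything, so it shows up under nlt.

```c
#include <assert.h>
#include <math.h>
#include <xmmintrin.h>

/* Each function returns a 4-bit mask; bit i is set if the comparison
   is true for value i (bit 0 is the lowest value). */
static int mask_lt(__m128 a, __m128 b)
{
    return _mm_movemask_ps(_mm_cmplt_ps(a, b));
}

static int mask_nlt(__m128 a, __m128 b)
{
    return _mm_movemask_ps(_mm_cmpnlt_ps(a, b));
}

static int mask_unord(__m128 a, __m128 b)
{
    return _mm_movemask_ps(_mm_cmpunord_ps(a, b));
}

static void compare_example(void)
{
    const __m128 a = _mm_setr_ps(1.0f, 2.0f, NAN, 4.0f);  /* lowest value first */
    const __m128 b = _mm_set1_ps(2.5f);

    assert(mask_lt(a, b) == 0x3);     /* 1.0 and 2.0 are less than 2.5      */
    assert(mask_nlt(a, b) == 0xC);    /* NaN and 4.0 are "not less than"    */
    assert(mask_unord(a, b) == 0x4);  /* only the NaN compares as unordered */
}
```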

Conversion:

cvtpi2ps - Converts 2 32-bit integers to 32-bit floating-point values. Top 2 values remain unchanged.

cvtps2pi - Converts 2 32-bit floating-point values to 32-bit integers.

cvtsi2ss - Converts 1 32-bit integer to 32-bit floating-point value. Top 3 values remain unchanged.

cvtss2si - Converts 1 32-bit floating-point value to 32-bit integer.

cvttps2pi - Converts 2 32-bit floating-point values to 32-bit integers using truncation.

cvttss2si - Converts 1 32-bit floating-point value to 32-bit integer using truncation.
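The only difference between the plain and the "t" conversions is how the fraction is dropped. Here is a short sketch with the `_mm_cvtss_si32` (cvtss2si) and `_mm_cvttss_si32` (cvttss2si) intrinsics. It assumes MXCSR is still in its default round-to-nearest mode, and the function names are made up for the example.

```c
#include <assert.h>
#include <xmmintrin.h>

/* cvtss2si: rounds according to the current MXCSR rounding mode. */
static int to_int_rounded(float x)
{
    volatile float v = x;  /* volatile: do the conversion at run time */
    return _mm_cvtss_si32(_mm_set_ss(v));
}

/* cvttss2si: always truncates toward zero, whatever MXCSR says. */
static int to_int_truncated(float x)
{
    volatile float v = x;
    return _mm_cvttss_si32(_mm_set_ss(v));
}

static void conversion_example(void)
{
    assert(to_int_rounded(2.7f) == 3);
    assert(to_int_truncated(2.7f) == 2);
    assert(to_int_rounded(-2.7f) == -3);
    assert(to_int_truncated(-2.7f) == -2);
    assert(to_int_rounded(2.5f) == 2);  /* round to nearest picks the even neighbour */
}
```

Truncation is what a C cast from float to int does, which is why compilers tend to use the "t" forms for casts.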

State:

fxrstor - Restores FP and SSE State.

fxsave - Stores FP and SSE State.

ldmxcsr - Loads the MXCSR register.

stmxcsr - Stores the MXCSR register.

Load/Store:

movaps - Moves a 128-bit value. Memory address must be 16-byte aligned.

movhlps - Moves the high half of the source to the low half of the destination.

movlhps - Moves the low half of the source to the high half of the destination.

movhps - Moves 64-bit value into top half of an xmm register.

movlps - Moves 64-bit value into bottom half of an xmm register.

movmskps - Moves top bits of single-precision values into bottom 4 bits of a 32-bit register.

movss - Moves the bottom single-precision value, top 3 remain unchanged (when loading from memory, the top 3 are set to zero instead).

movups - Moves a 128-bit value. Address can be unaligned.

maskmovq - Moves a 64-bit value according to a mask.

movntps - Moves a 128-bit value directly to memory, skipping the cache. (NT stands for "Non Temporal".)

movntq - Moves a 64-bit value directly to memory, skipping the cache.

Shuffling:

shufps - Shuffles 4 single-precision values. Complex.

unpckhps - Unpacks single-precision values from high halves.

unpcklps - Unpacks single-precision values from low halves.
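Since shufps is marked "Complex" above, a small sketch may help. With the `_mm_shuffle_ps(a, b, imm)` intrinsic, the two lowest results are picked from a and the two highest from b, and the `_MM_SHUFFLE(z, y, x, w)` macro builds the selector (w picks the lowest result, z the highest). The unpack intrinsics interleave values from the two sources. The wrapper names are made up for this example.

```c
#include <assert.h>
#include <xmmintrin.h>

/* Reverses the order of the 4 values (shufps with both sources the same). */
static __m128 reverse_ps(__m128 v)
{
    return _mm_shuffle_ps(v, v, _MM_SHUFFLE(0, 1, 2, 3));
}

/* Copies value 2 into all 4 positions. */
static __m128 broadcast_lane2(__m128 v)
{
    return _mm_shuffle_ps(v, v, _MM_SHUFFLE(2, 2, 2, 2));
}

/* Helper for the checks below: compares all 4 values of v, lowest first. */
static int lanes_equal(__m128 v, float e0, float e1, float e2, float e3)
{
    float out[4];
    _mm_storeu_ps(out, v);
    return out[0] == e0 && out[1] == e1 && out[2] == e2 && out[3] == e3;
}

static void shuffle_example(void)
{
    const __m128 a = _mm_setr_ps(1.0f, 2.0f, 3.0f, 4.0f);      /* lowest value first */
    const __m128 b = _mm_setr_ps(10.0f, 20.0f, 30.0f, 40.0f);

    assert(lanes_equal(reverse_ps(a), 4.0f, 3.0f, 2.0f, 1.0f));
    assert(lanes_equal(broadcast_lane2(a), 3.0f, 3.0f, 3.0f, 3.0f));

    /* Low two results come from a, high two from b. */
    assert(lanes_equal(_mm_shuffle_ps(a, b, _MM_SHUFFLE(3, 2, 1, 0)),
                       1.0f, 2.0f, 30.0f, 40.0f));

    /* unpcklps interleaves the low halves, unpckhps the high halves. */
    assert(lanes_equal(_mm_unpacklo_ps(a, b), 1.0f, 10.0f, 2.0f, 20.0f));
    assert(lanes_equal(_mm_unpackhi_ps(a, b), 3.0f, 30.0f, 4.0f, 40.0f));
}
```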

Cache Control:

prefetchT0 - Fetches a cache-line of data into all levels of cache.

prefetchT1 - Fetches a cache-line of data into all but the highest levels of cache.

prefetchT2 - Fetches a cache-line of data into all but the two highest levels of cache.

prefetchNTA - Fetches data into only the highest level of cache, not the lower levels.

sfence - Guarantees that all memory writes issued before the sfence instruction are completed before any writes after the sfence instruction.
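As a sketch of how the non-temporal stores and sfence fit together, the function below fills a buffer with movntps (through the `_mm_stream_ps` intrinsic) and then issues an sfence (`_mm_sfence`) so that the streamed writes are completed before any later writes. The name `stream_fill` and its requirements (16-byte-aligned destination, count a multiple of 4) are choices made for this example. It uses the C11 `_Alignas` keyword to get an aligned test buffer.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <xmmintrin.h>

/* Fills dst[0..count-1] with value using cache-skipping stores.
   dst must be 16-byte aligned and count must be a multiple of 4. */
static void stream_fill(float *dst, size_t count, float value)
{
    assert(((uintptr_t)dst % 16u) == 0);
    assert(count % 4u == 0);

    const __m128 v = _mm_set1_ps(value);
    for (size_t i = 0; i < count; i += 4)
        _mm_stream_ps(dst + i, v);  /* movntps: write around the cache */

    _mm_sfence();  /* streamed writes complete before any later writes */
}

static void stream_fill_example(void)
{
    _Alignas(16) float buffer[8];
    stream_fill(buffer, 8, 1.5f);
    for (int i = 0; i < 8; ++i)
        assert(buffer[i] == 1.5f);
}
```

Streaming stores tend to pay off only for large buffers that won't be read again soon; for small, frequently reused data the normal movaps path through the cache is usually the better choice.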
