Data Dependencies and wmb()
Version 1.0
Goal
Support lock-free algorithms without inefficient and ugly read-side code.
Obstacle
Some CPUs do not support synchronous invalidation in hardware.
Example Code
Insertion into an unordered lock-free circular singly linked list,
while allowing concurrent searches.
Data Structures
The data structures used in all these examples are
a list element, a header element, and a lock.
struct el {
    struct el *next;
    long key;
    long data;
};
struct el head;
spinlock_t mutex;
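The header element doubles as a sentinel: the search loops below terminate when they wrap back around to &head, so the list must start out circular, with head.next pointing at the header itself. A minimal initialization sketch (the list_init() helper is hypothetical, not part of the original example):

void list_init(void)
{
    head.next = &head;    /* empty circular list: the header points to itself */
    spin_lock_init(&mutex);
}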
Search and Insert Using Global Locking
The familiar globally locked implementations of search() and insert() are as follows:
struct el *insert(long key, long data)
{
    struct el *p;
    p = kmalloc(sizeof(*p), GFP_ATOMIC);
    spin_lock(&mutex);
    p->next = head.next;
    p->key = key;
    p->data = data;
    head.next = p;
    spin_unlock(&mutex);
    return (p);
}
struct el *search(long key)
{
    struct el *p;
    p = head.next;
    while (p != &head) {
        if (p->key == key) {
            return (p);
        }
        p = p->next;
    }
    return (NULL);
}
/* Example use. */
spin_lock(&mutex);
p = search(key);
if (p != NULL) {
    /* do stuff with p */
}
spin_unlock(&mutex);
These implementations are quite straightforward, but are subject to locking bottlenecks.
Search and Insert Using wmb() and rmb()
The existing wmb() and rmb() primitives can be used to do lock-free insertion. The
searching task will either see the new element or not, depending on the exact timing,
just like the locked case. In any case, we are guaranteed not to see an invalid pointer,
regardless of timing, again, just like the locked case. The problem is that wmb() is
guaranteed to enforce ordering only on the writing CPU --
the reading CPU must use rmb() to keep the ordering.
struct el *insert(long key, long data)
{
    struct el *p;
    p = kmalloc(sizeof(*p), GFP_ATOMIC);
    spin_lock(&mutex);
    p->next = head.next;
    p->key = key;
    p->data = data;
    wmb();
    head.next = p;
    spin_unlock(&mutex);
    return (p);
}
struct el *search(long key)
{
    struct el *p;
    p = head.next;
    while (p != &head) {
        rmb();
        if (p->key == key) {
            return (p);
        }
        p = p->next;
    }
    return (NULL);
}
(Note: see read-copy update for information on how to delete elements from this list
while still permitting lock-free searches.)
The rmb()s in search() cause unnecessary performance degradation on CPUs (such as the
i386, IA64, PPC, and SPARC) where data dependencies result in an implied rmb(). In
addition, code that traverses a chain of pointers would have to be broken up in order to
insert the needed rmb()s. For example:
d = p->next->data;
would have to be rewritten as:
q = p->next;
rmb();
d = q->data;
One could imagine much uglier code expansion where there are more dereferences in a
single expression. The inefficiencies and code bloat could be avoided if there were a
primitive like wmb() that allowed read-side data dependencies to act as implicit rmb()
invocations.
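To make the expansion concrete, a hypothetical chain with one more level of dereference, d = p->next->next->data, would need an rmb() after each pointer load (q and r are temporaries introduced only for this illustration):

q = p->next;
rmb();
r = q->next;
rmb();
d = r->data;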
Why Do You Need the rmb()?
Many CPUs have single instructions that cause other CPUs to see preceding stores before
subsequent stores, without the reading CPUs needing an explicit rmb() if a data dependency
forces the ordering.
However, some CPUs have no such instruction, the Alpha being a case in point. On these
CPUs, a wmb() only guarantees that the invalidate requests corresponding to the writes
will be emitted in order. The wmb() does not guarantee that the reading CPU will process
these invalidates in order.
For example, consider a CPU with a partitioned cache, as shown in the following diagram:
[Figure: barrier (wmb, mb, rmb) and cache coherence -- a CPU whose cache is split into two banks]
Here, even-numbered cachelines are maintained in cache bank 0, and odd-numbered cache
lines are maintained in cache bank 1. Suppose that head is maintained in cache bank 0,
and that a newly allocated element is maintained in cache bank 1. The insert() code's
wmb() would guarantee that the invalidates corresponding to the writes to the next, key,
and data fields appear on the bus before the invalidate corresponding to the write to head.next.
But suppose that the reading CPU's cache bank 1 was extremely busy, with lots of pending
invalidates and outstanding accesses, and that the reading CPU's cache bank 0 was idle.
The invalidation corresponding to head.next could then be processed before those of the
three fields. If search() were executing just at that time, it would pick up the
new value of head.next, but, since the invalidates corresponding to the three fields
had not yet been processed, it could pick up the old (garbage!) values of those fields,
possibly resulting in an oops or worse.
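One possible interleaving, sketched here purely for illustration with head in the reading CPU's idle bank 0 and the new element's fields in its busy bank 1, looks like this:

Writing CPU                        Reading CPU
-----------                        -----------
p->next = head.next;
p->key = key;
p->data = data;
wmb();
head.next = p;                     invalidates arrive in order, but...
                                   bank 1 (fields) is busy: invalidates queued
                                   bank 0 (head) is idle: invalidate processed
                                   p = head.next;  (new element seen)
                                   d = p->data;    (stale bank-1 cacheline
                                                    still cached: garbage value)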
Placing an rmb() between the access to head.next and the accesses to the three fields fixes
this problem. The rmb() forces all outstanding invalidates to be processed before any
subsequent reads are allowed to proceed. Since the invalidates corresponding to the three
fields arrived before the one for head.next, this guarantees that if the new value of
head.next was read, then the new values of the three fields will also be read.
No oopses (or worse).
However, all these rmb()s add complexity, are easy to leave out, and hurt performance on
all architectures. And if you forget a needed rmb(), you end up with very intermittent
and difficult-to-diagnose memory-corruption errors. Just what we don't need in Linux!
So, there is strong motivation for a way of eliminating the need for these rmb()s.
Solutions for Lock-Free Search and Insertion
Search and Insert Using wmbdd()
It would be much nicer (and faster, on many architectures) to have a primitive similar to
wmb(), but that allowed read-side data dependencies to substitute for an explicit rmb().
It is possible to do this (see patch). With such a primitive, the code looks as follows:
struct el *insert(long key, long data)
{
    struct el *p;
    p = kmalloc(sizeof(*p), GFP_ATOMIC);
    spin_lock(&mutex);
    p->next = head.next;
    p->key = key;
    p->data = data;
    wmbdd();
    head.next = p;
    spin_unlock(&mutex);
    return (p);
}
struct el *search(long key)
{
    struct el *p;
    p = head.next;
    while (p != &head) {
        if (p->key == key) {
            return (p);
        }
        p = p->next;
    }
    return (NULL);
}
This code is much nicer: no rmb()s are required, searches proceed
fully in parallel with no locks or writes, and there is no intermittent data corruption.
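How might wmbdd() itself be implemented? The following is purely an illustrative sketch, not the patch referenced above: non-Alpha architectures, where a data dependency already implies the needed read ordering, can map wmbdd() straight to wmb(), while an Alpha-style CPU could force every other CPU through a memory barrier, so that all outstanding invalidates are processed before the subsequent store to head.next is performed.

/*
 * Illustrative sketch only -- not the actual patch.  Assumes the
 * kernel's smp_call_function() facility for running a function on
 * all other CPUs and waiting for completion.
 */
static void wmbdd_sync(void *unused)
{
    mb();    /* make this CPU process its outstanding invalidates */
}

#ifdef CONFIG_ALPHA
#define wmbdd() \
    do { wmb(); smp_call_function(wmbdd_sync, NULL, 1, 1); } while (0)
#else
#define wmbdd() wmb()    /* data dependency provides the read-side ordering */
#endif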
Search and Insert Using read_barrier_depends()
Introduce a new primitive read_barrier_depends() that is defined to be an rmb() on
Alpha, and a nop on other architectures. This removes the read-side performance
problem for non-Alpha architectures, but still leaves the read_barrier_depends() calls
in the read-side code. It is almost possible for the compiler to do a good job of
generating these (assuming that a "lockfree" gcc struct-field attribute is created
and used), but, unfortunately, the compiler cannot reliably tell when the relevant lock
is held. (If the lock is held, the read_barrier_depends() calls should not be generated.)
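A sketch of what such a definition might look like, based purely on the description above rather than on the actual patch:

/* Assumed sketch: rmb() on Alpha, a no-op on CPUs where the data
 * dependency already orders the dependent read. */
#ifdef CONFIG_ALPHA
#define read_barrier_depends()    rmb()
#else
#define read_barrier_depends()    do { } while (0)
#endif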
After discussion on lkml, it was decided that placing explicit
read_barrier_depends() calls is the appropriate thing to do in the Linux kernel. Linus also
suggested that the barrier names be made more explicit. With such primitives,
the code looks as follows:
struct el *insert(long key, long data)
{
    struct el *p;
    p = kmalloc(sizeof(*p), GFP_ATOMIC);
    spin_lock(&mutex);
    p->next = head.next;
    p->key = key;
    p->data = data;
    write_barrier();
    head.next = p;
    spin_unlock(&mutex);
    return (p);
}
struct el *search(long key)
{
    struct el *p;
    p = head.next;
    while (p != &head) {
        read_barrier_depends();
        if (p->key == key) {
            return (p);
        }
        p = p->next;
    }
    return (NULL);
}
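For comparison with the locked example use shown earlier, the read side now needs no lock at all; only updaters serialize on the mutex inside insert(). A sketch, safe only because elements are never deleted (see the read-copy update note above):

/* Lock-free read side: no spin_lock() around search(). */
p = search(key);
if (p != NULL) {
    /* do stuff with p */
}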
A preliminary patch for this is barriers-2.5.7-1.patch. Future releases of this
patch can be found, along with the RCU package, here.
Other Approaches Considered
Just make wmb() work like wmbdd(), so that data dependencies act as implied rmb()s.
Although wmbdd()'s semantics are much more intuitive, there are a number of uses of
wmb() in Linux that do not require the stronger semantics of wmbdd(), and strengthening
the semantics would incur unnecessary overhead on many CPUs--or require many changes to
the code, and thus a much larger patch.
Just drop support for Alpha. After all, Compaq seems to be phasing it out, right? There
are nonetheless a number of Alphas out there running Linux, and Compaq (or perhaps HP)
will be manufacturing new Alphas for quite a few years to come. Microsoft would likely
focus quite a bit of negative publicity on Linux's dropping support for anything (never
mind that they currently support only two CPU architectures). And the code to make Alpha
work correctly is not all that complex, and it does not impact performance of other CPUs.
Besides, we cannot be 100% certain that there won't be some other CPU lacking a
synchronous invalidation instruction...