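This post covers the last lab of the 6.S081 operating systems course: writing a driver for the Intel e1000 NIC under xv6. Background worth reviewing: operating system basics, DMA from computer organization, the notion of a circular buffer, and a rough reading of the e1000 spec, in particular its two ring buffers and how it raises interrupts. In principle, the lab can be finished quickly with just the networking basics from the first part of the livelock lecture plus a careful read of the lab handout (including the hints). The notes below record the process and, along the way, look at how NIC drivers are written in real-world Linux (xv6 is only a toy, after all). I hope to come away with some low-level Linux driver knowledge by the end of this post.

PCI standard: driver implementation and specification

PCI (Peripheral Component Interconnect) is a parallel, synchronous system bus with centralized, independent-request arbitration: every device has its own request line and its own grant line. The arbitration priority and algorithm are left to the particular PCI implementation. Bus ownership is negotiated over two signals, REQ# and GNT#: a device asserts REQ# to request the bus, and the arbiter asserts GNT# to grant the bus to that device.

First, look at the model of a NIC that sits on PCI and uses DMA: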

[Figure: model of a PCI NIC with a DMA engine, from "6.S081 lab: networking e1000 NIC driver, with an analysis of writing Linux NIC drivers"]

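Here the Packet Buffer is simply RAM, and the NIC is the TX/RX MAC hardware on the right. The NIC uses its DMA engine to copy the contents of its on-card storage into memory in one go, and PCI is what manages such peripheral cards.

PCI addressing

Different devices are reached through the PCI address encoding.

Looking directly at xv6's pci.c, probing for PCI devices simply iterates over dev, func, and so on. Note that bus is the bus number. The Windows Device Manager holds a pleasant surprise here: the integrated GPU and some on-board peripherals all sit on bus 0, while devices on the motherboard's PCIe slots (a laptop's PCIe NIC, a discrete graphics card) are attached to PCI bus 1, 2, and so on (PCIe and PCI work differently, of course), as shown in the figure below:

We call PCI bus 0 the upstream bus. Bus 1 and above (there are buses numbered up to 20) sit behind lower-level bridges that connect back to bus 0 and are called downstream; I won't dig further. Here we assume Intel's e1000 is attached to bus 0. Strictly speaking we should probe every PCI device, but this lab hard-codes it. The function number exists because one device may expose several functions. I initially assumed that Wi-Fi and Bluetooth on a laptop's PCI or PCIe wireless card were split by function, but in practice WLAN goes over PCIe and Bluetooth over USB (the mini PCIe and NGFF connectors also carry USB), and they are two separate chips. So it is hard to give an everyday example of functions. In industry, a 4-channel sampling card, for example, uses functions to transfer multi-channel data in parallel.

Now to the code: we iterate over Bus, Device, and Function following the PFA above to find the card we want. Probing a card's information relies on a convention: PCI reserves a range of low offsets in each function's address space for registering device information. This is not implemented by the O/S: each device implements its own configuration registers, and the host bridge routes accesses to them. (I originally guessed the south bridge does this and remain unsure of the details.)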

// PCI address:
//   |31 enable bit|30:24 Reserved|-
//  -|23:16 Bus num|15:11 Dev num|10:8 func num|7:2 off|1:0 0|.
uint32 off = (bus << 16) | (dev << 11) | (func << 8) | (offset);
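
The bit layout in the comment above can be packed and unpacked as in this minimal sketch. The helper names (pci_cfg_addr, pci_cfg_bus, pci_cfg_dev, pci_cfg_func) are hypothetical and not part of xv6. Setting bit 31 follows the "enable bit" in the comment; the xv6 line above leaves it clear and uses the value as an index into its memory-mapped configuration window instead.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: pack bus/device/function/register offset into the
 * 32-bit layout described in the comment above. */
static uint32_t pci_cfg_addr(uint32_t bus, uint32_t dev, uint32_t func,
                             uint32_t offset)
{
  assert(bus < 256 && dev < 32 && func < 8);
  assert(offset < 256 && (offset & 3) == 0); /* bits 1:0 must be 0 */
  return (1u << 31)      /* bit 31: enable */
       | (bus << 16)     /* bits 23:16: bus number */
       | (dev << 11)     /* bits 15:11: device number */
       | (func << 8)     /* bits 10:8: function number */
       | offset;         /* bits 7:2: dword-aligned register offset */
}

/* Hypothetical helpers: unpack the individual fields again. */
static uint32_t pci_cfg_bus(uint32_t addr)  { return (addr >> 16) & 0xff; }
static uint32_t pci_cfg_dev(uint32_t addr)  { return (addr >> 11) & 0x1f; }
static uint32_t pci_cfg_func(uint32_t addr) { return (addr >> 8) & 0x07; }
```

For example, pci_cfg_addr(0, 1, 0, 0) yields 0x80000800: the enable bit plus device number 1 shifted into bits 15:11.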

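PCI device metadata layout

Consider the PCI device information given in Intel's developer manual (specific to this e1000 NIC). The header fields such as Device ID are common to all PCI devices, so we can easily read the ID dword at offset 0h to identify the card and load the driver. Keep in mind that PCI's job is to set up the register mapping, so that a C program can access the device's registers by reading and writing memory (vm) and thereby control the device; subsequent data transfer is done by manipulating those registers. In one sentence: PCI builds a control panel in memory, and afterwards the program only has to operate the control panel.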
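
As a small illustration of reading the ID dword at offset 0h, the sketch below splits it into its two 16-bit halves. The struct and function names are hypothetical; the constant 0x100e8086 is the one xv6's pci_init compares against.

```c
#include <assert.h>
#include <stdint.h>

/* The first dword of PCI configuration space:
 * low 16 bits = Vendor ID, high 16 bits = Device ID. */
struct pci_id {
  uint16_t vendor;
  uint16_t device;
};
static_assert(sizeof(struct pci_id) == 4, "two 16-bit IDs fill one dword");

/* Hypothetical helper: split the dword read from offset 0h. */
static struct pci_id pci_split_id(uint32_t dword0)
{
  struct pci_id id;
  id.vendor = (uint16_t)(dword0 & 0xffff);
  id.device = (uint16_t)(dword0 >> 16);
  return id;
}

/* Vendor 0x8086 is Intel; device 0x100e is the e1000 variant matched by
 * xv6's pci_init. */
static int is_e1000(uint32_t dword0)
{
  struct pci_id id = pci_split_id(dword0);
  return id.vendor == 0x8086 && id.device == 0x100e;
}
```

Comparing the whole dword against 0x100e8086, as xv6 does, is equivalent to checking both halves at once.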

Now read this alongside the code:

void
pci_init()
{
  // we'll place the e1000 registers at this address.
  // vm.c maps this range.
  uint64 e1000_regs = 0x40000000L;
  // qemu -machine virt puts PCIe config space here.
  // vm.c maps this range.
  uint32 *ecam = (uint32 *) 0x30000000L;
  // look at each possible PCI device on bus 0.
  for(int dev = 0; dev < 32; dev++){
    int bus = 0;
    int func = 0;
    int offset = 0;
    // PCI address:
    //   |31 enable bit|30:24 Reserved|-
    //  -|23:16 Bus num|15:11 Dev num|10:8 func num|7:2 off|1:0 0|.
    uint32 off = (bus << 16) | (dev << 11) | (func << 8) | (offset);
    volatile uint32 *base = ecam + off;
    // PCI address space header:
    // Byte Off   |   3   |   2   |   1   |   0   |
    //          0h|   Device ID   |   Vendor ID   |
    uint32 id = base[0]; // read the first dword.
    // device id 0x100e : vendor id 0x8086 is an e1000
    if(id == 0x100e8086){
      // PCI address space header:
      // Byte Off   |   3   |   2    |   1     |   0    |
      //         4h |Status register | command register |
      // command and status register.
      // bit 0 : I/O access enable
      // bit 1 : memory access enable
      // bit 2 : enable mastering
      base[1] = 7;
      __sync_synchronize();
      for(int i = 0; i < 6; i++){
        // Byte Off                |   3   |   2    |   1     |   0    |
        // index 4 = 10h/4     10h |           Base Address 0          |
        //       5             14h |           Base Address 1          |
        //       6             18h |           Base Address 2          |
        //       7..9      1ch~24h |          .... 3, 4, 5             |
        uint32 old = base[4+i];
        // writing all 1's to the BAR causes it to be
        // replaced with its size.
        base[4+i] = 0xffffffff;
        __sync_synchronize();
        // for dynamic allocation, read base[4+i] back here, clear the
        // low flag bits, then invert and add 1 to get the size of the
        // memory-mapped region this BAR asks for.
        base[4+i] = old;
      }
      // tell the e1000 to reveal its registers at
      // physical address 0x40000000.
      base[4+0] = e1000_regs;
      e1000_init((uint32*)e1000_regs);
    }
  }
}

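The arithmetic and the trick of reading bytes through a uint32 are clear enough from the code comments, so here are just a few key points. The first is the meaning of 0xffffffff. The region a BAR describes has a power-of-two size and is aligned to that size, so some low address bits must always be 0. No matter whether you write 1 or 0 into those low bits, loading the register back shows them hard-wired to 0b (the b suffix means binary; see the Description column of the table below). The base address here registers a memory address with the PCI device, telling it to map its registers and the NIC's flash at that address. Now let's see how Intel's manual explains the 0xffffffff write.

This is a more detailed version of part of the table above; next, the meaning of each field.

With that, the mechanism is easy to understand. Here is an excerpt for reference: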

The Base Address Registers (or BARs) are used to map the Ethernet controller's register space and flash to system memory space. In PCI-X mode or in PCI mode when the BAR32 bit of the EEPROM is 0b, two registers are used for each of the register space and the flash memory in order to map 64-bit addresses. In PCI mode, if the BAR32 bit in the EEPROM is 1b, one register is used for each to map 32-bit addresses.
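
The sizing step that pci_init performs (write all 1s, then read back) can be turned into a region size as follows. This is a sketch under stated assumptions: it handles a 32-bit memory BAR only, bar_mem_size is a hypothetical helper, and the readback value 0xfffe0000 in the note below is illustrative rather than taken from the post.

```c
#include <assert.h>
#include <stdint.h>

/* Given the value read back from a memory BAR after writing 0xffffffff,
 * return the size in bytes of the region the device asks for.
 * Returns 0 if the BAR is not implemented (reads back as 0). */
static uint32_t bar_mem_size(uint32_t readback)
{
  assert((readback & 1u) == 0);       /* bit 0 = 0: memory BAR, not I/O */
  uint32_t mask = readback & ~0xfu;   /* drop the low type/prefetch bits */
  if (mask == 0)
    return 0;
  /* The hard-wired zero bits mark the alignment; inverting the mask and
   * adding 1 turns it into the region size. */
  return ~mask + 1u;
}
```

With a readback of 0xfffe0000 the low 17 bits are hard-wired to zero, so the function returns 0x20000 (128 KB).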

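Once PCI initialization has set up the vm mapping, registers can be accessed through vm addresses (note in the table above that offset 10h, i.e. base[4] in the code, is where e1000_regs is written). Next we move on to e1000_init() to see how the rest of the initialization is done.

E1000 NIC driver implementation

e1000_init mainly does the following: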

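Reset the NIC and disable its interrupts. (Remember that the e1000 uses an interrupt to tell the operating system that a DMA operation has completed.)
Mark every status in tx_ring as done, lazily: no buffers are allocated yet, and the OS can hand over new tx work right away.
Allocate all rx_mbufs and set the corresponding addr in each rx_ring descriptor.
Program the NIC registers that record rx_ring and tx_ring, along with the control registers related to the circular buffers.
Set the MAC address.
Create an empty multicast table on the NIC.
Start the NIC's Transmit and Receive sides by setting their control bits (power on).
Enable interrupt delivery, i.e. turn NIC interrupts on.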
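
The two ring-related steps in the list above (mark every TX slot as done, pre-allocate every RX buffer) can be sketched in user space as follows. This is not the xv6 code: the descriptors are simplified to the fields these steps touch, malloc stands in for mbufalloc, the variables tdh/tdt/rdh/rdt stand in for the NIC registers, and the buffer size is an assumption.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define TX_RING_SIZE 16
#define RX_RING_SIZE 16
#define RX_BUF_SIZE  2048  /* assumed buffer size for this sketch */
#define TXD_STAT_DD  0x01  /* "descriptor done" status bit */

/* Simplified descriptors: only the fields the init steps touch. */
struct tx_slot { uint64_t addr; uint8_t status; };
struct rx_slot { uint64_t addr; uint8_t status; };

static struct tx_slot tx_ring[TX_RING_SIZE];
static struct rx_slot rx_ring[RX_RING_SIZE];
static void *tx_bufs[TX_RING_SIZE];
static void *rx_bufs[RX_RING_SIZE];
static uint32_t tdh, tdt, rdh, rdt; /* stand-ins for the NIC registers */

static int rings_init(void)
{
  /* TX: nothing is queued, so every slot starts out "done" and no
   * buffer is attached until something is actually sent. */
  memset(tx_ring, 0, sizeof(tx_ring));
  for (int i = 0; i < TX_RING_SIZE; i++) {
    tx_ring[i].status = TXD_STAT_DD;
    tx_bufs[i] = NULL;
  }
  /* RX: the NIC needs somewhere to DMA into, so every slot gets a
   * buffer up front and the descriptor records its address. */
  memset(rx_ring, 0, sizeof(rx_ring));
  for (int i = 0; i < RX_RING_SIZE; i++) {
    rx_bufs[i] = malloc(RX_BUF_SIZE);
    if (rx_bufs[i] == NULL)
      return -1;
    rx_ring[i].addr = (uint64_t)(uintptr_t)rx_bufs[i];
  }
  tdh = tdt = 0;           /* TX ring empty: head == tail */
  rdh = 0;
  rdt = RX_RING_SIZE - 1;  /* hand all but one RX slot to the NIC */
  assert(rdh != rdt);
  return 0;
}
```

The asymmetry is the point of the sketch: the TX side starts with no buffers at all, while the RX side must be fully stocked before the NIC is switched on.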

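How these mechanisms work in detail is left for later. Once the NIC's init is done, the NIC is running. Where, then, are the transmit and recv functions that the lab asks us to write actually called? As noted above, the e1000 DMAs into a buffer and then raises an interrupt, so yes: we need to go back to trap.c.

What e1000_intr has to do is simple: call recv to take the contents out of the buffers, so that the buffers keep flowing. That leaves one question: who calls transmit? This is where our network stack comes in. Let's go top-down from nettests.c, which the lab provides.

Start with the ping function, which makes a syscall to create a file descriptor for reading and writing.

connect(dst, sport, dport)

Following it in, the syscall simply calls sockalloc to create the socket. Later, read and write call sockread and sockwrite (in sysnet.c). sockwrite calls net_tx_udp to write out one buffer of data. The upshot is that xv6's write on a socket file only supports UDP (UDP is the only transport net.c implements). net_tx_udp just encapsulates a UDP header and hands off to net_tx_ip for the layer below, which in turn calls net_tx_eth to wrap an ethernet frame, and inside net_tx_eth we finally see the call to e1000_transmit().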
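
The encapsulation chain described above can be sketched as successive header pushes into the headroom in front of the payload. This is a user-space illustration, not the xv6 code: struct pkt, pkt_init, pkt_push and the tx_* functions are hypothetical stand-ins for mbuf and the net_tx_* functions, the buffer and headroom sizes are assumptions, and the header sizes are the standard minimum lengths (UDP 8, IPv4 20, Ethernet 14 bytes).

```c
#include <assert.h>
#include <stddef.h>

#define PKT_BUF_SIZE 2048 /* assumed buffer size */
#define PKT_HEADROOM 128  /* assumed space reserved for headers */
#define UDP_HDR_LEN  8
#define IP_HDR_LEN   20
#define ETH_HDR_LEN  14

struct pkt {
  char *head;        /* start of the valid bytes */
  unsigned int len;  /* number of valid bytes */
  char buf[PKT_BUF_SIZE];
};

/* Start with only the payload, placed after the headroom. */
static void pkt_init(struct pkt *p, unsigned int payload_len)
{
  assert(payload_len <= PKT_BUF_SIZE - PKT_HEADROOM);
  p->head = p->buf + PKT_HEADROOM;
  p->len = payload_len;
}

/* Prepend n bytes by moving head backwards into the headroom. */
static char *pkt_push(struct pkt *p, unsigned int n)
{
  assert((size_t)(p->head - p->buf) >= n);
  p->head -= n;
  p->len += n;
  return p->head;
}

/* Each layer prepends its own header and calls the layer below. */
static void tx_eth(struct pkt *p)
{
  pkt_push(p, ETH_HDR_LEN);
  /* at this point the driver's transmit function would be called */
}

static void tx_ip(struct pkt *p)
{
  pkt_push(p, IP_HDR_LEN);
  tx_eth(p);
}

static void tx_udp(struct pkt *p)
{
  pkt_push(p, UDP_HDR_LEN);
  tx_ip(p);
}
```

A 10-byte payload passed to tx_udp ends up as a 52-byte frame, since 8 + 20 + 14 = 42 bytes of headers are prepended without copying the payload.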

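Call structure between the xv6 user network stack and the driver

A diagram summarizing the xv6 code structure, for easier understanding:

Network drivers in Linux

One issue deserves a mention here: how can UDP code in the network stack end up calling an e1000 function directly? That is bad for hardware diversity (a system that supports many kinds of NICs should go through an abstract, generic calling convention). In practice, Linux has to use a driver management system. Let's analyze the real-world Linux network device driver.

Since I do not intend to dive deep into the Linux kernel, assume that some infrastructure is already in place and that we need to supply a driver file for the kernel to use. First, a few notes on the APIs that Linux offers to drivers.

A brief introduction to Linux kernel modules

First, the concept of a kernel module. Drivers are loaded as kernel modules, and each driver program is written as one kernel module. For modules built into the kernel, start_kernel eventually reaches do_initcalls, which calls the functions registered through module_init(); for loadable modules, the init_module() entry point runs when the module is loaded. (The difference between the two is set aside here because it involves too much detail; in practice it comes down to a macro versus an entry point.) Here is an example module (from Linux Kernel Development, 3rd edition):

Here module_init(hello_init) registers a function as the module's entry point. One can also write an init_module() function directly as the entry (much like how main becomes the entry point after being wrapped by the C runtime). How kernel modules get loaded is too technical to cover here. This hello module only prints something when it is loaded and unloaded. (As for what learning NIC drivers is good for, think of the benefits of virtual NICs.) A driver needs to provide more registration actions.

Layered structure of Linux NIC drivers

We need to register a net_device structure that records the NIC's information. Its contents vary across Linux kernel versions, so I pick one variant to explain. As a thought experiment: define a struct that stores some NIC information together with function pointers into the module, then use a subscription mechanism to add a NIC to the kernel. Here is some pseudocode: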

struct net { // in kernel.
  struct info some_info;
  struct pointer some_pointer; // function pointers into the module
};
struct net my_net;
struct card *my_card;
void send(){
  do_send();
}
void recv(){
  do_recv();
}
int init_module(){
  // use the PCI API to probe for the NIC's address
  my_card = pci_probe(id, vendor);
  // do the register vm mapping described above
  map_registers(my_card);
  // fill in information such as the MAC address, promiscuous mode,
  // multicast and broadcast settings
  set_info(&my_net);
  // register the event handlers (send and receive)
  my_net.some_pointer.send = send;
  my_net.some_pointer.recv = recv;
  // register the NIC with the kernel
  register_netdev(&my_net);
  return 0;
}
void exit_module(){
  unregister_netdev(&my_net);
}

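Of course, concrete data structures are also involved (like xv6's mbuf), but that programming is too dirty and too spec-heavy. Linux versions differ wildly: you can keep all the info and function pointers in a single net_device or split them out (net_device_ops); the interrupt-driven recv path can use a fixed default entry or, again, a function pointer; and so on. We still have a lab to do, so I will stop here. I went through the Linux approach because xv6's version is so bare that a thought experiment alone is unconvincing, and it also helps in understanding Linux kernel modules.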
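
The idea of splitting out the function pointers can be made concrete with a compilable user-space sketch. Everything here is hypothetical and only mimics the shape of the real thing: struct netdev, struct netdev_ops, netdev_register, netdev_unregister and stack_send are illustrative names, not the Linux API.

```c
#include <assert.h>
#include <stddef.h>

struct netdev;

/* Operations a driver provides; the stack only ever calls through these. */
struct netdev_ops {
  int (*xmit)(struct netdev *dev, const void *frame, size_t len);
};

struct netdev {
  const char *name;
  const struct netdev_ops *ops;
  unsigned long tx_packets;
};

#define MAX_DEVS 4
static struct netdev *devs[MAX_DEVS]; /* the "kernel's" device table */

static int netdev_register(struct netdev *dev)
{
  assert(dev && dev->ops && dev->ops->xmit);
  for (int i = 0; i < MAX_DEVS; i++) {
    if (devs[i] == NULL) {
      devs[i] = dev;
      return 0;
    }
  }
  return -1; /* table full */
}

static void netdev_unregister(struct netdev *dev)
{
  for (int i = 0; i < MAX_DEVS; i++)
    if (devs[i] == dev)
      devs[i] = NULL;
}

/* Stack side: send through the first registered device without knowing
 * which driver is behind it. */
static int stack_send(const void *frame, size_t len)
{
  for (int i = 0; i < MAX_DEVS; i++)
    if (devs[i])
      return devs[i]->ops->xmit(devs[i], frame, len);
  return -1; /* no device registered */
}

/* A toy driver that only counts frames. */
static int toy_xmit(struct netdev *dev, const void *frame, size_t len)
{
  (void)frame;
  (void)len;
  dev->tx_packets++;
  return 0;
}

static const struct netdev_ops toy_ops = { .xmit = toy_xmit };
static struct netdev toy_dev = { .name = "toy0", .ops = &toy_ops };
```

Compared with xv6, where net_tx_eth names e1000_transmit directly, the stack here depends only on the ops table, so a second driver can be added without touching stack_send.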

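Earlier I said the detailed mechanisms would be discussed later, so now let's do the lab. At its core it is still an exercise in programming access to a lock-protected data structure, so the focus returns to the data structure. So far we have only a vague picture of the circular buffer, so we have to analyze and program the two directions separately. Let's start with tx.

Ring buffer data structure analysis

The lecture already covered the network stack, and I don't want to take notes on it again. Below is the structure of the circular buffer together with the macros for the register indices we need (in the figure, the red labels are the macro names that index each register in the regs array).

We use TDT here, because TDT points at the next free slot for a packet to send. Normally software tracks it the whole time: the hardware is the one sending packets out, so the hardware only increments Head, while Tail merely marks where the hardware should stop transmitting. This is why init sets both TDT and TDH to 0.

Then we read the HINTs for this operation.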

First ask the E1000 for the TX ring index at which it's expecting the next packet, by reading the E1000_TDT control register.
Then check if the ring is overflowing. If E1000_TXD_STAT_DD is not set in the descriptor indexed by E1000_TDT, the E1000 hasn't finished the corresponding previous transmission request, so return an error.
Otherwise, use mbuffree() to free the last mbuf that was transmitted from that descriptor (if there was one).
Then fill in the descriptor. m->head points to the packet's content in memory, and m->len is the packet length. Set the necessary cmd flags (look at Section 3.3 in the E1000 manual) and stash away a pointer to the mbuf for later freeing.
Finally, update the ring position by adding one to E1000_TDT modulo TX_RING_SIZE.
If e1000_transmit() added the mbuf successfully to the ring, return 0. On failure (e.g., there is no descriptor available to transmit the mbuf), return -1 so that the caller knows to free the mbuf.

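Now for our data structures. An array of descriptors, each carrying a status field, tracks the circular buffer; it does not hold the packet data itself. To keep track of our mbufs we also need an array of mbuf pointers. Recall that the transmit API receives an mbuf from the upper layer and sends it on its behalf. Placing it in the ring buffer is only a local commit. Only after the NIC updates the corresponding status (a remote push; the spec says you can ask the NIC to update status as soon as the data has been copied onto the card, or only after it has been sent) can we free the original mbuf (destroy the local copy). The status is written by the hardware (in the handout's words, the E1000 sets the E1000_TXD_STAT_DD bit in the descriptor to indicate this). So the concrete data structure is as follows: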
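
A minimal sketch of these two parallel arrays, assuming the 16-byte legacy transmit descriptor layout and a ring of 16 entries. struct mbuf_stub and slot_is_free are hypothetical names introduced for illustration, and fixed-width stdint types replace xv6's uint64/uint16/uint8.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TX_RING_SIZE 16
#define TXD_STAT_DD  0x01 /* set by the NIC when it is done with a slot */

/* Legacy transmit descriptor: it holds a pointer to the packet, not the
 * packet data itself. */
struct tx_desc {
  uint64_t addr;    /* address of the packet bytes */
  uint16_t length;  /* packet length in bytes */
  uint8_t  cso;
  uint8_t  cmd;     /* command flags written by the driver */
  uint8_t  status;  /* status bits written back by the NIC */
  uint8_t  css;
  uint16_t special;
};
static_assert(sizeof(struct tx_desc) == 16, "descriptor must be 16 bytes");
static_assert(offsetof(struct tx_desc, status) == 12, "status at byte 12");

/* Hypothetical stand-in for xv6's struct mbuf. */
struct mbuf_stub {
  char *head;
  unsigned int len;
};

/* The ring the NIC reads, and a parallel array that remembers which
 * buffer each slot points at so it can be freed after sending. */
static struct tx_desc tx_ring[TX_RING_SIZE];
static struct mbuf_stub *tx_mbufs[TX_RING_SIZE];

/* A slot may be reused only once the NIC has set DD in its status. */
static int slot_is_free(unsigned int i)
{
  assert(i < TX_RING_SIZE);
  return (tx_ring[i].status & TXD_STAT_DD) != 0;
}
```

The stash array is needed because the descriptor stores only the address of the packet bytes; the pointer to the owning buffer object would otherwise be lost by the time the NIC reports completion.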

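The mbuf pointer array tx_mbufs does nothing more than what the hint asks: stash away pointers to the mbufs presented in tx_ring. (Perhaps the lab would be easier if none of this were provided and we wrote it all ourselves? Some mbuf fields seem to go unused, and understanding them takes a while; but then there would be more specification reading to do.) The code: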

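int
e1000_transmit(struct mbuf *m)
{
  //
  // Your code here.
  //
  // the mbuf contains an ethernet frame; program it into
  // the TX descriptor ring so that the e1000 sends it. Stash
  // a pointer so that it can be freed after sending.
  //
  acquire(&e1000_lock);
  uint32 tail = regs[E1000_TDT];
  // overflow: the e1000 has not finished with this descriptor yet.
  // test the DD bit only; other status bits may be set as well.
  if ((tx_ring[tail].status & E1000_TXD_STAT_DD) == 0) {
    release(&e1000_lock);
    return -1;
  }
  // free the mbuf that was last transmitted from this slot, if any.
  if(tx_mbufs[tail]){
    mbuffree(tx_mbufs[tail]);
  }
  tx_ring[tail].length = (uint16)m->len;
  tx_ring[tail].addr = (uint64)m->head;
  // EOP (0x01): end of packet; RS (0x08): report status, i.e. set DD when done.
  tx_ring[tail].cmd = E1000_TXD_CMD_EOP | E1000_TXD_CMD_RS;
  // clear the stale DD bit so the slot reads as busy until the e1000 is done.
  tx_ring[tail].status = 0;
  tx_mbufs[tail] = m;
  regs[E1000_TDT] = (tail+1)%TX_RING_SIZE;
  release(&e1000_lock);
  return 0;
}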
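
To see the DD-based reuse rule in action without a NIC, here is a toy user-space model of a 4-slot TX ring. It is a sketch under loud assumptions: the "hardware" is a function that completes one queued descriptor per call and tracks its backlog with an explicit counter (a real NIC compares its head and tail registers instead), and all names (sim_init, sim_transmit, sim_hw_complete_one) are hypothetical.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define RING    4
#define STAT_DD 0x01

struct slot { uint64_t addr; uint16_t length; uint8_t status; };

static struct slot ring[RING];
static const void *stash[RING]; /* buffers still referenced by the ring */
static uint32_t tdt;            /* stand-in for the TDT register */
static unsigned int hw_next;    /* next slot the toy NIC completes */
static unsigned int hw_pending; /* queued but not yet completed */
static int freed;               /* how many old buffers were released */

static void sim_init(void)
{
  memset(ring, 0, sizeof(ring));
  memset(stash, 0, sizeof(stash));
  for (int i = 0; i < RING; i++)
    ring[i].status = STAT_DD;   /* every slot starts out reusable */
  tdt = 0;
  hw_next = 0;
  hw_pending = 0;
  freed = 0;
}

/* Driver side: same order of steps as the lab hints. */
static int sim_transmit(const void *buf, uint16_t len)
{
  uint32_t tail = tdt;
  assert(tail < RING);
  if ((ring[tail].status & STAT_DD) == 0)
    return -1;                  /* slot still owned by the NIC */
  if (stash[tail])
    freed++;                    /* previous buffer is finally released */
  ring[tail].addr = (uint64_t)(uintptr_t)buf;
  ring[tail].length = len;
  ring[tail].status = 0;        /* busy until the NIC sets DD again */
  stash[tail] = buf;
  tdt = (tail + 1) % RING;
  hw_pending++;
  return 0;
}

/* Toy NIC: finish the oldest queued descriptor, if there is one. */
static int sim_hw_complete_one(void)
{
  if (hw_pending == 0)
    return 0;
  ring[hw_next].status |= STAT_DD;
  hw_next = (hw_next + 1) % RING;
  hw_pending--;
  return 1;
}
```

After sim_init, four calls to sim_transmit succeed and a fifth returns -1; once sim_hw_complete_one has run, the fifth succeeds and releases the buffer that previously occupied slot 0. The model also shows why a buffer cannot be freed at submit time: it stays stashed until its slot comes around again with DD set.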

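recv is similar, so I won't repeat the details. Here is the figure:

This figure in the Intel spec left me speechless 😓: the drawing is wrong, and the text below it then adds HARDWARE OWNS ALL DESCRIPTORS BETWEEN [HEAD AND TAIL]. It cost me time, because I thought I had a bug and wrote a socket program in Python to check. I recommend looking up Python socket programming and using it to imitate what nettests does as a test (it gives you a program with known-correct results). For the port number, see the output of make server, the Makefile, and the one mentioned in the handout.

The difference from tx is that we take packets starting from tail + 1 (HARDWARE OWNS ALL DESCRIPTORS BETWEEN [HEAD AND TAIL]) and then give that rx_ring buffer slot back to the hardware. More importantly, we need to take away all of the shaded part. Several gray slots can accumulate because interrupts are batched to reduce their number. (Recall the earlier lecture: receive livelock arises when packets arrive very fast and each one generates an interrupt, until 100% of CPU time goes to handling the NIC's input interrupts and none is left to pass packets up to the upper layers; copying only one packet out of the buffer per interrupt also makes it easy for the NIC to throw away fast-arriving packets.) The spec says: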

The Receive Timer Interrupt is used to signal most packet reception events (the Small Receive Packet Detect interrupt is also used in some cases as described later in this section). In order to minimize the interrupts per work accomplished, the Ethernet controller provides two timers to control how often interrupts are generated.

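However, init sets the timer to 0, so in principle we would not need to drain every gray slot, since the NIC is configured to interrupt packet by packet. For learning purposes we write a while loop anyway, which also covers the case where several packets arrive before the handler gets to run. The HINT says: At some point the total number of packets that have ever arrived will exceed the ring size (16); make sure your code can handle that. I did not understand this at first. The most plausible reading is that ring indices must wrap around: after the 16th packet the next one lands in slot 0 again, so every index computation has to be taken modulo RX_RING_SIZE. If anyone knows of another situation this hint refers to, please let me know. The code: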

static void
e1000_recv(void)
{
  //
  // Your code here.
  //
  // Check for packets that have arrived from the e1000
  // Create and deliver an mbuf for each packet (using net_rx()).
  //
  uint32 tail = regs[E1000_RDT];
  uint32 i = (tail+1)%RX_RING_SIZE; // tail is owned by Hardware!
  while (rx_ring[i].status & E1000_RXD_STAT_DD) {
    rx_mbufs[i]->len = rx_ring[i].length;
    // send mbuf to upper level (the network stack in net.c).
    net_rx(rx_mbufs[i]);
    // get a new buffer for next recv.
    rx_mbufs[i] = mbufalloc(0);
    if (!rx_mbufs[i])
      panic("e1000_recv: mbufalloc");
    rx_ring[i].addr = (uint64)rx_mbufs[i]->head;
    // update status for next recv.
    rx_ring[i].status = 0;
    i = (i + 1) % RX_RING_SIZE;
  }
  // the last processed slot is i - 1; the modulo keeps the index
  // valid when i has wrapped around to 0.
  regs[E1000_RDT] = (i + RX_RING_SIZE - 1) % RX_RING_SIZE;
}
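
The wrap-around behaviour of the receive loop can be checked in user space with a toy 4-slot ring. This is a sketch, not the xv6 code: rx_sim_arrive plays the role of the NIC writing a packet, adding the length to delivered stands in for net_rx(), and all names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define RING    4
#define STAT_DD 0x01

struct rslot { uint16_t length; uint8_t status; };

static struct rslot rx[RING];
static uint32_t rdt;           /* stand-in for the RDT register */
static unsigned int delivered; /* total bytes passed up the stack */

static void rx_sim_init(void)
{
  memset(rx, 0, sizeof(rx));
  rdt = RING - 1;
  delivered = 0;
}

/* Toy NIC: a packet of len bytes lands in slot i. */
static void rx_sim_arrive(unsigned int i, uint16_t len)
{
  assert(i < RING);
  rx[i].length = len;
  rx[i].status = STAT_DD;
}

/* Driver side: drain every completed slot starting at tail + 1.
 * Returns how many packets were taken. */
static unsigned int rx_sim_drain(void)
{
  unsigned int n = 0;
  uint32_t i = (rdt + 1) % RING;
  while (rx[i].status & STAT_DD) {
    delivered += rx[i].length; /* stand-in for net_rx() */
    rx[i].status = 0;          /* slot is ready for the NIC again */
    rdt = i;                   /* record the last slot handed back */
    i = (i + 1) % RING;
    n++;
  }
  return n;
}
```

Recording rdt = i inside the loop avoids computing i - 1 afterwards, which is where an unsigned or negative index slips in when i has just wrapped to 0. The loop cannot spin forever, because each pass clears the DD bit it tested.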

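Finally, a note on locking: You'll need locks to cope with the possibility that xv6 might use the E1000 from more than one process, or might be using the E1000 in a kernel thread when an interrupt arrives. This explains why one function uses a spinlock and the other does not: e1000_transmit can be entered from several processes and also from the interrupt path, whereas e1000_recv is only called from the interrupt handler. In addition, net_rx may itself call e1000_transmit (for example to answer an ARP request), so holding e1000_lock across net_rx would try to acquire a lock that is already held.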
