Open vSwitch Study: vswitchd

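vswitchd is a user-space daemon whose core job is to run the ofproto logic. OVS is implemented according to the OpenFlow switch specification. Take layer-2 forwarding as an example: a traditional switch (including the Linux bridge implementation) looks up its CAM table to find the port that corresponds to the dst MAC, whereas Open vSwitch checks whether a flow matches the incoming skb. If a flow exists, the skb is not the first packet of that flow, and the output port can be found in flow->action. Note that the idea behind SDN is that every packet must map to a flow, and the flow determines the packet's action. Traditional actions are little more than forward, accept, or drop; SDN defines many more: modifying the contents of the skb, changing the packet's path, cloning several copies and sending them along different paths, and so on.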

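If the skb has no matching flow, it is the first packet of a flow and a flow has to be created for it. vswitchd repeatedly checks, inside a while loop, whether any ofproto requests have arrived. They may come from ovs-ofctl, or they may be upcall requests sent by openvswitch.ko over netlink. In most cases they are flow-setup requests caused by a flow miss, and vswitchd then builds the flow and its actions according to the OpenFlow specification. Let's walk through this process: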

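Because Open vSwitch is a layer-2 switch model, every packet is first received on some port, which means ovs_dp_process_received_packet is called. That function first generates a key from the skb through ovs_flow_extract, then calls ovs_flow_tbl_lookup to look up a flow by that key. If no flow is found, it calls ovs_dp_upcall, which sends a dp_upcall_info structure to vswitchd over netlink (by calling genlmsg_unicast).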
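The receive path described above (extract a key, look it up, punt to user space on a miss) can be sketched roughly as follows. This is only an illustration under simplifying assumptions: every `sketch_` name is hypothetical, the key holds just four fields, the table is a linear array rather than the kernel's hash table, and the upcall is reduced to a counter instead of a netlink message.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical, heavily reduced flow key; the real key has many more fields. */
struct sketch_key {
    uint16_t in_port;
    uint8_t  dl_src[6];
    uint8_t  dl_dst[6];
    uint16_t dl_type;
};

struct sketch_flow {
    struct sketch_key key;
    uint16_t out_port;      /* stands in for the flow's action list */
    int in_use;
};

#define SKETCH_TABLE_SIZE 16

struct sketch_datapath {
    struct sketch_flow table[SKETCH_TABLE_SIZE];
    unsigned int n_upcalls; /* packets punted to user space */
};

static int sketch_key_equal(const struct sketch_key *a,
                            const struct sketch_key *b)
{
    return a->in_port == b->in_port && a->dl_type == b->dl_type
        && memcmp(a->dl_src, b->dl_src, 6) == 0
        && memcmp(a->dl_dst, b->dl_dst, 6) == 0;
}

static struct sketch_flow *sketch_lookup(struct sketch_datapath *dp,
                                         const struct sketch_key *key)
{
    for (int i = 0; i < SKETCH_TABLE_SIZE; i++) {
        if (dp->table[i].in_use && sketch_key_equal(&dp->table[i].key, key)) {
            return &dp->table[i];
        }
    }
    return NULL;
}

/* Returns the output port on a hit; on a miss counts an upcall and returns -1. */
static int sketch_receive(struct sketch_datapath *dp,
                          const struct sketch_key *key)
{
    assert(dp != NULL && key != NULL);
    struct sketch_flow *flow = sketch_lookup(dp, key);
    if (!flow) {
        dp->n_upcalls++;    /* the real code sends a netlink message here */
        return -1;
    }
    return flow->out_port;
}

/* What user space does after handling the miss: install a flow. */
static int sketch_install(struct sketch_datapath *dp,
                          const struct sketch_key *key, uint16_t out_port)
{
    for (int i = 0; i < SKETCH_TABLE_SIZE; i++) {
        if (!dp->table[i].in_use) {
            dp->table[i].key = *key;
            dp->table[i].out_port = out_port;
            dp->table[i].in_use = 1;
            return 0;
        }
    }
    return -1;              /* table full */
}
```

In this sketch the first packet of a flow misses and is counted as an upcall; once a flow has been installed, later packets with the same key are forwarded without involving user space.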

vswitchd handles the netlink request above in handle_upcalls. For a miss in the flow table it calls handle_miss_upcalls, which in turn calls handle_flow_miss. Let's look at the implementation of handle_miss_upcalls:

static void

handle_miss_upcalls(struct dpif_backer *backer, struct dpif_upcall *upcalls,

size_t n_upcalls)

{

hmap_init(&todo);

n_misses = 0;

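As the source comments make clear, the loop below iterates over the struct dpif_upcall entries passed up to user space over netlink. Each entry contains the missed packet and the flow key generated from that packet. Packets that share the same flow key are handled together.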

for (upcall = upcalls; upcall < &upcalls[n_upcalls]; upcall++) {

fitness = odp_flow_key_to_flow(upcall->key, upcall->key_len, &flow);

port = odp_port_to_ofport(backer, flow.in_port);

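odp_flow_key_to_flow first calls parse_flow_nlattrs (under lib/) to parse upcall->key and upcall->key_len. The attribute types it finds are recorded in a bitmap, present_attrs, and the struct nlattr for each type is stored in struct nlattr *attrs[]. Then, for each bit set in present_attrs, the corresponding value is taken from upcall->key and stored in flow. VLAN parsing is handled separately by parse_8021q_onward.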

odp_port_to_ofport converts flow.in_port, the datapath port number, into an OpenFlow port, that is, struct ofport_dpif *port.

flow_extract(upcall->packet, flow.skb_priority,

&flow.tunnel, flow.in_port, &miss->flow);

Here the packet is parsed into the flow. This function overlaps with odp_flow_key_to_flow in places.

hash = flow_hash(&miss->flow, 0);

existing_miss = flow_miss_find(&todo, &miss->flow, hash);

if (!existing_miss) {

hmap_insert(&todo, &miss->hmap_node, hash);

miss->ofproto = ofproto;

miss->key = upcall->key;

miss->key_len = upcall->key_len;

miss->upcall_type = upcall->type;

list_init(&miss->packets);

n_misses++;

} else {

miss = existing_miss;

}

list_push_back(&miss->packets, &upcall->packet->list_node);

}

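flow_hash computes the hash of miss->flow, and the todo hmap is then searched for a struct flow_miss * by that hash. If none is found, this is the first flow_miss for that flow, so it is initialized and added to todo. Finally the packet is appended to the flow_miss->packets list. This confirms the earlier conclusion: when several upcalls arrive at once, the packets that belong to the same flow_miss are linked under that flow_miss and processed together.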
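The grouping step can be illustrated with a much smaller sketch. It assumes a flow is identified by a single integer (`flow_id`) and uses a linear search where the real code uses an hmap keyed by flow_hash; all `sketch_` names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_MAX_BATCH 32

struct sketch_upcall {
    uint32_t flow_id;   /* stands in for the full extracted flow */
    int packet_id;
};

struct sketch_miss {
    uint32_t flow_id;
    int packets[SKETCH_MAX_BATCH];
    size_t n_packets;
};

/* Groups upcalls by flow and returns the number of distinct misses.
 * 'misses' must have room for n_upcalls entries, and n_upcalls must not
 * exceed SKETCH_MAX_BATCH. */
static size_t sketch_batch(const struct sketch_upcall *upcalls,
                           size_t n_upcalls, struct sketch_miss *misses)
{
    size_t n_misses = 0;

    assert(n_upcalls <= SKETCH_MAX_BATCH);
    for (size_t i = 0; i < n_upcalls; i++) {
        struct sketch_miss *miss = NULL;

        /* Look for an existing miss for the same flow. */
        for (size_t j = 0; j < n_misses; j++) {
            if (misses[j].flow_id == upcalls[i].flow_id) {
                miss = &misses[j];
                break;
            }
        }
        if (!miss) {
            /* First packet seen for this flow in the batch. */
            miss = &misses[n_misses++];
            miss->flow_id = upcalls[i].flow_id;
            miss->n_packets = 0;
        }
        miss->packets[miss->n_packets++] = upcalls[i].packet_id;
    }
    return n_misses;
}
```

With four upcalls of which three share a flow, the sketch yields two misses, the first carrying three packets.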

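OVS defines the facet to represent the view that a user-space program such as vswitchd has of a matched flow. Kernel space has its own view of the same flow, and the facet represents the part on which the two views agree. The parts that differ are represented by subfacets; struct subfacet holds the actions.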

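If the flow_key computed by the datapath is identical to the flow_key that vswitchd computes from the packet, the facet contains exactly one subfacet. If the datapath's flow_key has more members than the one vswitchd computes from the packet, each extra part becomes a subfacet.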

struct subfacet {

struct hmap_node hmap_node;

struct list list_node;

struct facet *facet;

enum odp_key_fitness key_fitness;

struct nlattr *key;

int key_len;

long long int used;

uint64_t dp_packet_count;

uint64_t dp_byte_count;

size_t actions_len;

struct nlattr *actions;

enum slow_path_reason slow;

enum subfacet_path path;

};

Let's first look at handle_flow_miss:

static void

handle_flow_miss(struct ofproto_dpif *ofproto, struct flow_miss *miss,

struct flow_miss_op *ops, size_t *n_ops)

{

struct facet *facet;

uint32_t hash;

hash = miss->hmap_node.hash;

facet = facet_lookup_valid(ofproto, &miss->flow, hash);

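This looks up the flow in struct ofproto_dpif *ofproto, the data structure that represents the datapath. ofproto->facets is a hash map: the hash of the miss flow is computed first, and the hmap_node list for that hash is then searched for a matching flow. The comparison is rather brute-force: a plain memcmp.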

if (!facet) {

struct rule_dpif *rule = rule_dpif_lookup(ofproto, &miss->flow);

if (!flow_miss_should_make_facet(ofproto, miss, hash)) {

handle_flow_miss_without_facet(miss, rule, ops, n_ops);

At this point creating a flow facet is judged unnecessary. For trivial traffic, creating a facet would cost more in overhead than it saves.

return;

}

facet = facet_create(rule, &miss->flow, hash);

Otherwise, a facet is created for this flow.

}

handle_flow_miss_with_facet(miss, facet, ops, n_ops);

}

struct flow_miss is a wrapper around a flow, used to speed up batch processing of missed flows. In most cases the facet does get created:

2012-10-26T07:15:43Z|22522|ofproto_dpif|INFO|[qinq] miss flow, create facet: vlan_tci 0, proto 0x806, in_port 1, src mac 0:16:3e:83:0:1, dst mac 0:25:9e:5d:62:53

2012-10-26T07:15:43Z|22529|ofproto_dpif|INFO|[qinq] miss flow, create facet: vlan_tci 0, proto 0x806, in_port 2, src mac 0:25:9e:5d:62:53, dst mac 0:16:3e:83:0:1

As the log shows, one two-way exchange creates two flows, and a facet is created for each.

Next comes handle_flow_miss_with_facet, which calls subfacet_make_actions to generate the actions. That function first calls action_xlate_ctx_init to initialize an action_xlate_ctx structure, defined as follows:

struct action_xlate_ctx {

struct ofproto_dpif *ofproto;

struct flow flow;

const struct ofpbuf *packet;

bool may_learn;

struct rule_dpif *rule;

uint8_t tcp_flags;

struct ofpbuf *odp_actions;

tag_type tags;

enum slow_path_reason slow;

bool has_learn;

bool has_normal;

bool has_fin_timeout;

uint16_t nf_output_iface;

mirror_mask_t mirrors;

int recurse;

bool max_resubmit_trigger;

struct flow base_flow;

uint32_t orig_skb_priority;

uint8_t table_id;

uint32_t sflow_n_outputs;

uint16_t sflow_odp_port;

uint16_t user_cookie_offset;

bool exit;

struct flow orig_flow;

};

After that, xlate_actions is called. OpenFlow 1.0 defines the following actions:

enum ofp10_action_type {

OFPAT10_OUTPUT,

OFPAT10_SET_VLAN_VID,

OFPAT10_SET_VLAN_PCP,

OFPAT10_STRIP_VLAN,

OFPAT10_SET_DL_SRC,

OFPAT10_SET_DL_DST,

OFPAT10_SET_NW_SRC,

OFPAT10_SET_NW_DST,

OFPAT10_SET_NW_TOS,

OFPAT10_SET_TP_SRC,

OFPAT10_SET_TP_DST,

OFPAT10_ENQUEUE,

OFPAT10_VENDOR = 0xffff

};

Each action type takes its own data structure as input, e.g.

struct ofp_action_vlan_vid {

ovs_be16 type;

ovs_be16 len;

ovs_be16 vlan_vid;

uint8_t pad[2];

};

struct ofp_action_vlan_pcp {

ovs_be16 type;

ovs_be16 len;

uint8_t vlan_pcp;

uint8_t pad[3];

};

union ofp_action {

ovs_be16 type;

struct ofp_action_header header;

struct ofp_action_vendor_header vendor;

struct ofp_action_output output;

struct ofp_action_vlan_vid vlan_vid;

struct ofp_action_vlan_pcp vlan_pcp;

struct ofp_action_nw_addr nw_addr;

struct ofp_action_nw_tos nw_tos;

struct ofp_action_tp_port tp_port;

};

do_xlate_actions receives an array of union ofp_action and performs a different operation for each element, e.g.

case OFPUTIL_OFPAT10_OUTPUT:

xlate_output_action(ctx, &ia->output);

break;

case OFPUTIL_OFPAT10_SET_VLAN_VID:

ctx->flow.vlan_tci &= ~htons(VLAN_VID_MASK);

ctx->flow.vlan_tci |= ia->vlan_vid.vlan_vid | htons(VLAN_CFI);

break;

case OFPUTIL_OFPAT10_SET_VLAN_PCP:

ctx->flow.vlan_tci &= ~htons(VLAN_PCP_MASK);

ctx->flow.vlan_tci |= htons(

(ia->vlan_pcp.vlan_pcp << VLAN_PCP_SHIFT) | VLAN_CFI);

break;

case OFPUTIL_OFPAT10_STRIP_VLAN:

ctx->flow.vlan_tci = htons(0);

break;
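The VLAN cases above are plain bit manipulation on the 16-bit TCI field: 12 bits of VLAN ID, one CFI bit (which the cases above set whenever a tag is written), and 3 bits of priority. The sketch below shows the same arithmetic in host byte order, without the htons conversions the real code needs because vlan_tci is kept in network byte order. The `SK_`/`sk_` names are hypothetical; the mask values follow the 802.1Q TCI layout.

```c
#include <assert.h>
#include <stdint.h>

#define SK_VLAN_VID_MASK  0x0fff
#define SK_VLAN_CFI       0x1000
#define SK_VLAN_PCP_MASK  0xe000
#define SK_VLAN_PCP_SHIFT 13

/* Replaces the VLAN ID, keeps the priority bits, sets the CFI bit. */
static uint16_t sk_set_vlan_vid(uint16_t tci, uint16_t vid)
{
    assert(vid <= SK_VLAN_VID_MASK);
    tci &= (uint16_t) ~SK_VLAN_VID_MASK;
    tci |= (uint16_t) (vid | SK_VLAN_CFI);
    return tci;
}

/* Replaces the priority, keeps the VLAN ID, sets the CFI bit. */
static uint16_t sk_set_vlan_pcp(uint16_t tci, uint8_t pcp)
{
    assert(pcp <= 7);
    tci &= (uint16_t) ~SK_VLAN_PCP_MASK;
    tci |= (uint16_t) ((pcp << SK_VLAN_PCP_SHIFT) | SK_VLAN_CFI);
    return tci;
}

/* Stripping the tag clears the whole field. */
static uint16_t sk_strip_vlan(uint16_t tci)
{
    (void) tci;
    return 0;
}
```

For example, setting VLAN 100 on an untagged flow gives 0x1064, and then setting priority 5 gives 0xb064.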

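For forwarding packets, the most important function is xlate_output_action, which calls xlate_output_action__. The port passed in is either an ordinary OpenFlow port number or one of the reserved control values, which can be seen in the definition of ofp_port: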

enum ofp_port {

OFPP_MAX = 0xff00,

OFPP_IN_PORT = 0xfff8,

OFPP_TABLE = 0xfff9,

OFPP_NORMAL = 0xfffa,

OFPP_FLOOD = 0xfffb,

OFPP_ALL = 0xfffc,

OFPP_CONTROLLER = 0xfffd,

OFPP_LOCAL = 0xfffe,

OFPP_NONE = 0xffff

};
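One way to picture how an output action is dispatched on these values is a simple classifier. The sketch below copies the numeric values from the enum above under hypothetical `SK_` names, and it assumes that every value below OFPP_MAX is an ordinary numbered port while the unlisted values between OFPP_MAX and OFPP_IN_PORT are not valid targets.

```c
#include <stdint.h>

enum sk_ofp_port {
    SK_OFPP_MAX        = 0xff00,
    SK_OFPP_IN_PORT    = 0xfff8,
    SK_OFPP_TABLE      = 0xfff9,
    SK_OFPP_NORMAL     = 0xfffa,
    SK_OFPP_FLOOD      = 0xfffb,
    SK_OFPP_ALL        = 0xfffc,
    SK_OFPP_CONTROLLER = 0xfffd,
    SK_OFPP_LOCAL      = 0xfffe,
    SK_OFPP_NONE       = 0xffff
};

enum sk_output_kind {
    SK_OUT_PORT,        /* ordinary numbered port */
    SK_OUT_IN_PORT,     /* send back out the ingress port */
    SK_OUT_TABLE,       /* resubmit to the flow table */
    SK_OUT_NORMAL,      /* traditional L2 processing (MAC learning) */
    SK_OUT_FLOOD,
    SK_OUT_ALL,
    SK_OUT_CONTROLLER,
    SK_OUT_LOCAL,
    SK_OUT_NONE,
    SK_OUT_UNKNOWN      /* in the reserved range but not listed above */
};

static enum sk_output_kind sk_classify_port(uint16_t port)
{
    switch (port) {
    case SK_OFPP_IN_PORT:    return SK_OUT_IN_PORT;
    case SK_OFPP_TABLE:      return SK_OUT_TABLE;
    case SK_OFPP_NORMAL:     return SK_OUT_NORMAL;
    case SK_OFPP_FLOOD:      return SK_OUT_FLOOD;
    case SK_OFPP_ALL:        return SK_OUT_ALL;
    case SK_OFPP_CONTROLLER: return SK_OUT_CONTROLLER;
    case SK_OFPP_LOCAL:      return SK_OUT_LOCAL;
    case SK_OFPP_NONE:       return SK_OUT_NONE;
    default:
        return port < SK_OFPP_MAX ? SK_OUT_PORT : SK_OUT_UNKNOWN;
    }
}
```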

Inside xlate_output_action__, most traffic takes the OFPP_NORMAL branch and calls xlate_normal, which calls mac_learning_lookup to find the packet's output port in the MAC table and then calls output_normal. output_normal ultimately calls compose_output_action:

static void

compose_output_action__(struct action_xlate_ctx *ctx, uint16_t ofp_port,

bool check_stp)

{

const struct ofport_dpif *ofport = get_ofp_port(ctx->ofproto, ofp_port);

uint16_t odp_port = ofp_port_to_odp_port(ofp_port);

ovs_be16 flow_vlan_tci = ctx->flow.vlan_tci;

uint8_t flow_nw_tos = ctx->flow.nw_tos;

uint16_t out_port;

...

out_port = vsp_realdev_to_vlandev(ctx->ofproto, odp_port,

ctx->flow.vlan_tci);

if (out_port != odp_port) {

ctx->flow.vlan_tci = htons(0);

}

commit_odp_actions(&ctx->flow, &ctx->base_flow, ctx->odp_actions);

nl_msg_put_u32(ctx->odp_actions, OVS_ACTION_ATTR_OUTPUT, out_port);

ctx->sflow_odp_port = odp_port;

ctx->sflow_n_outputs++;

ctx->nf_output_iface = ofp_port;

ctx->flow.vlan_tci = flow_vlan_tci;

ctx->flow.nw_tos = flow_nw_tos;

}

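commit_odp_actions encodes the pending actions in nlattr format and stores them in ctx->odp_actions. The following nl_msg_put_u32(ctx->odp_actions, OVS_ACTION_ATTR_OUTPUT, out_port) then appends the packet's output port, at which point the actions for the flow are more or less fully composed.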
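To make the nlattr encoding concrete, here is a minimal sketch of appending a 32-bit attribute to a buffer. It follows the standard netlink attribute layout (a 16-bit length that covers header plus payload, a 16-bit type, then the payload padded to a 4-byte boundary, all in host byte order). The `sk_` names and the fixed 64-byte buffer are hypothetical, and `SK_ACTION_OUTPUT` is assumed to mirror the value of OVS_ACTION_ATTR_OUTPUT.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define SK_NLA_HDRLEN 4
#define SK_NLA_ALIGN(len) (((len) + 3) & ~(size_t) 3)
#define SK_ACTION_OUTPUT 1  /* assumed to mirror OVS_ACTION_ATTR_OUTPUT */

struct sk_buf {
    uint8_t data[64];
    size_t size;            /* bytes used so far */
};

/* Appends one attribute; returns 0 on success, -1 if it does not fit. */
static int sk_put_attr(struct sk_buf *b, uint16_t type,
                       const void *payload, uint16_t payload_len)
{
    assert(b != NULL && payload != NULL);
    uint16_t nla_len = (uint16_t) (SK_NLA_HDRLEN + payload_len);
    size_t total = SK_NLA_ALIGN((size_t) nla_len);

    if (b->size + total > sizeof b->data) {
        return -1;
    }
    memset(b->data + b->size, 0, total);                 /* zero the padding */
    memcpy(b->data + b->size, &nla_len, sizeof nla_len); /* length field */
    memcpy(b->data + b->size + 2, &type, sizeof type);   /* type field */
    memcpy(b->data + b->size + SK_NLA_HDRLEN, payload, payload_len);
    b->size += total;
    return 0;
}

static int sk_put_u32(struct sk_buf *b, uint16_t type, uint32_t value)
{
    return sk_put_attr(b, type, &value, sizeof value);
}
```

Calling `sk_put_u32(&buf, SK_ACTION_OUTPUT, out_port)` appends an 8-byte attribute: a 4-byte header followed by the 4-byte port number.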

Now let's discuss the CAM table in vswitchd. The code is in lib/mac-learning.h and lib/mac-learning.c.

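vswitchd maintains an internal MAC/port CAM table whose MAC entries age out after 300 seconds. The CAM table also defines the concept of a flooding VLAN: if a VLAN is marked as flooding, no addresses are learned on it and all forwarding on that VLAN is done by flooding.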

struct mac_entry {

struct hmap_node hmap_node;

struct list lru_node;

time_t expires;

time_t grat_arp_lock;

uint8_t mac[ETH_ADDR_LEN];

uint16_t vlan;

tag_type tag;

union {

void *p;

int i;

} port;

};

struct mac_learning {

struct hmap table; /* hash table of mac_entry; each mac_entry is linked into mac_learning->table through its hmap_node */

struct list lrus; /* LRU list; each mac_entry is linked into mac_learning->lrus through its lru_node */

uint32_t secret;

unsigned long *flood_vlans;

unsigned int idle_time; /* maximum aging time */

};

static uint32_t

mac_table_hash(const struct mac_learning *ml, const uint8_t mac[ETH_ADDR_LEN],

uint16_t vlan)

{

unsigned int mac1 = get_unaligned_u32((uint32_t *) mac);

unsigned int mac2 = get_unaligned_u16((uint16_t *) (mac + 4));

return hash_3words(mac1, mac2 | (vlan << 16), ml->secret);

}

The hash of a mac_entry is computed by hash_3words from mac_learning->secret, the VLAN, and the MAC address together.

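mac_entry_lookup checks, by MAC address and VLAN, whether a matching mac_entry already exists.

get_lru returns the first mac_entry on the LRU list.

mac_learning_create/mac_learning_destroy create and destroy the mac_learning table.

mac_learning_may_learn returns true if the VLAN is not a flooding VLAN and the MAC address is not a multicast address.

mac_learning_insert inserts a mac_entry into mac_learning. It first uses mac_entry_lookup to check whether a mac_entry for the given MAC and VLAN already exists. If it does not, and mac_learning already holds MAC_MAX entries, the oldest entry is evicted; a new mac_entry is then created and inserted into the CAM table.

mac_learning_lookup calls mac_entry_lookup to find the entry for a MAC address on a given VLAN in the CAM table.

mac_learning_run loops over the table and expires every mac_entry that has timed out.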
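The behavior of these functions can be summarized in a small self-contained sketch. It is not the lib/mac-learning.c implementation: it uses a four-slot array instead of an hmap plus LRU list, passes the current time in explicitly, and leaves out flooding VLANs. Since every entry gets the same idle time, evicting the entry with the earliest expiry plays the role of evicting the LRU head. All `sk_` names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define SK_MAC_MAX   4      /* tiny on purpose; stands in for MAC_MAX */
#define SK_IDLE_TIME 300    /* seconds, the aging time mentioned in the text */

struct sk_mac_entry {
    uint8_t mac[6];
    uint16_t vlan;
    int port;
    long expires;
    int in_use;
};

struct sk_mac_table {
    struct sk_mac_entry e[SK_MAC_MAX];
};

static int sk_is_multicast(const uint8_t mac[6])
{
    return mac[0] & 1;
}

/* Returns the live entry for (mac, vlan), or NULL. */
static struct sk_mac_entry *sk_lookup(struct sk_mac_table *t,
                                      const uint8_t mac[6], uint16_t vlan,
                                      long now)
{
    for (int i = 0; i < SK_MAC_MAX; i++) {
        struct sk_mac_entry *e = &t->e[i];
        if (e->in_use && e->vlan == vlan && e->expires > now
            && memcmp(e->mac, mac, 6) == 0) {
            return e;
        }
    }
    return NULL;
}

/* Learns or refreshes (mac, vlan) -> port; evicts the oldest entry if full. */
static struct sk_mac_entry *sk_learn(struct sk_mac_table *t,
                                     const uint8_t mac[6], uint16_t vlan,
                                     int port, long now)
{
    assert(t != NULL);
    if (sk_is_multicast(mac)) {
        return NULL;                    /* never learn multicast sources */
    }

    struct sk_mac_entry *slot = &t->e[0];
    for (int i = 0; i < SK_MAC_MAX; i++) {
        struct sk_mac_entry *e = &t->e[i];
        if (e->in_use && e->vlan == vlan && memcmp(e->mac, mac, 6) == 0) {
            slot = e;                   /* existing entry: refresh it */
            break;
        }
        if (!e->in_use) {
            if (slot->in_use) {
                slot = e;               /* prefer a free slot */
            }
        } else if (slot->in_use && e->expires < slot->expires) {
            slot = e;                   /* otherwise the oldest entry */
        }
    }

    memcpy(slot->mac, mac, 6);
    slot->vlan = vlan;
    slot->port = port;
    slot->expires = now + SK_IDLE_TIME;
    slot->in_use = 1;
    return slot;
}

/* Expires timed-out entries and returns how many were removed. */
static int sk_run(struct sk_mac_table *t, long now)
{
    int n = 0;
    for (int i = 0; i < SK_MAC_MAX; i++) {
        if (t->e[i].in_use && t->e[i].expires <= now) {
            t->e[i].in_use = 0;
            n++;
        }
    }
    return n;
}
```

A bridge using this sketch would call `sk_learn` with the source MAC of every received packet, `sk_lookup` with the destination MAC to pick the output port (flooding when it returns NULL), and `sk_run` periodically.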
