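vswitchd is a user-space daemon whose core job is to run the ofproto logic. OVS is implemented according to the OpenFlow switch specification. Take layer-2 forwarding as an example: a traditional switch (including the Linux bridge implementation) looks up the CAM table to find the port for the dst MAC, whereas Open vSwitch takes the incoming skb and looks for a matching flow. If a flow exists, this skb is not the first packet of the flow, and the forwarding port can be found in flow->action. Note that the idea behind SDN is that every packet must map to a flow, and the flow determines the packet's action. Traditional actions are just forward, accept, or drop; SDN defines more: modifying the skb contents, changing the packet's path, cloning the packet onto several paths, and so on.
If the skb has no matching flow, it is the first packet of a flow and a flow must be created for it. vswitchd runs a while loop that repeatedly checks for incoming ofproto requests. These may come from ovs-ofctl, or be upcall requests that openvswitch.ko sends over netlink. In most cases they are flow-creation requests caused by a flow miss, and vswitchd then builds the flow and its actions according to the OpenFlow specification. Let's walk through that path: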
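Because Open vSwitch is modeled as a layer-2 switch, every packet enters through some port, that is, through ovs_dp_process_received_packet. That function first calls ovs_flow_extract to build a key from the skb, then calls ovs_flow_tbl_lookup to look up a flow by that key. If no flow is found, it calls ovs_dp_upcall, which sends a dp_upcall_info structure to vswitchd over netlink (via genlmsg_unicast).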
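The extract-key, look-up, upcall-on-miss sequence can be sketched in plain user-space C. This is only a minimal sketch, not the openvswitch.ko code: toy_key, toy_flow, toy_datapath and every toy_* function are hypothetical names, the key is reduced to four fields, and a miss merely increments a counter at the point where the real datapath would send an upcall over netlink.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define TOY_TABLE_SIZE 64

/* Hypothetical, heavily reduced flow key. */
struct toy_key {
    uint16_t in_port;
    uint8_t  dl_src[6];
    uint8_t  dl_dst[6];
    uint16_t dl_type;
};

struct toy_flow {
    struct toy_key key;
    uint16_t out_port;       /* the only "action" in this sketch */
    bool in_use;
};

struct toy_datapath {
    struct toy_flow table[TOY_TABLE_SIZE];
    unsigned int n_hit;
    unsigned int n_missed;   /* packets that would go to user space */
};

/* FNV-1a over a byte range; a stand-in for the kernel's hash. */
static uint32_t toy_hash_bytes(uint32_t h, const void *p, size_t n)
{
    const uint8_t *b = p;
    for (size_t i = 0; i < n; i++) {
        h ^= b[i];
        h *= 16777619u;
    }
    return h;
}

static uint32_t toy_key_hash(const struct toy_key *k)
{
    uint32_t h = 2166136261u;
    h = toy_hash_bytes(h, &k->in_port, sizeof k->in_port);
    h = toy_hash_bytes(h, k->dl_src, sizeof k->dl_src);
    h = toy_hash_bytes(h, k->dl_dst, sizeof k->dl_dst);
    return toy_hash_bytes(h, &k->dl_type, sizeof k->dl_type);
}

static bool toy_key_equal(const struct toy_key *a, const struct toy_key *b)
{
    return a->in_port == b->in_port && a->dl_type == b->dl_type
        && !memcmp(a->dl_src, b->dl_src, 6)
        && !memcmp(a->dl_dst, b->dl_dst, 6);
}

/* Linear probing; returns the slot holding 'key', the first free slot,
 * or NULL when the table is full and the key is absent. */
static struct toy_flow *toy_find_slot(struct toy_datapath *dp,
                                      const struct toy_key *key)
{
    uint32_t start = toy_key_hash(key) % TOY_TABLE_SIZE;
    for (uint32_t i = 0; i < TOY_TABLE_SIZE; i++) {
        struct toy_flow *f = &dp->table[(start + i) % TOY_TABLE_SIZE];
        if (!f->in_use || toy_key_equal(&f->key, key)) {
            return f;
        }
    }
    return NULL;
}

/* What user space does after handling a miss: install the flow. */
static bool toy_install(struct toy_datapath *dp, const struct toy_key *key,
                        uint16_t out_port)
{
    struct toy_flow *f = toy_find_slot(dp, key);
    if (!f) {
        return false;
    }
    f->key = *key;
    f->out_port = out_port;
    f->in_use = true;
    return true;
}

/* Returns the output port on a hit, or -1 on a miss. */
static int toy_process_packet(struct toy_datapath *dp,
                              const struct toy_key *key)
{
    struct toy_flow *f = toy_find_slot(dp, key);
    if (!f || !f->in_use) {
        dp->n_missed++;      /* the real code would upcall here */
        return -1;
    }
    dp->n_hit++;
    return f->out_port;
}
```

The first packet of a flow misses; once toy_install has been called for its key, later packets with the same key hit the table and never leave the fast path.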
vswitchd handles these netlink requests in handle_upcalls. For a flow table miss it calls handle_miss_upcalls, which in turn calls handle_flow_miss. Here is handle_miss_upcalls:
static void
handle_miss_upcalls(struct dpif_backer *backer, struct dpif_upcall *upcalls,
                    size_t n_upcalls)
{
    hmap_init(&todo);
    n_misses = 0;
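As the comments in the source make clear, the loop below walks the struct dpif_upcall entries that netlink delivered to user space. Each one carries the missed packet and the flow key generated from it. Packets with the same flow key are handled together.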
    for (upcall = upcalls; upcall < &upcalls[n_upcalls]; upcall++) {
        fitness = odp_flow_key_to_flow(upcall->key, upcall->key_len, &flow);
        port = odp_port_to_ofport(backer, flow.in_port);
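odp_flow_key_to_flow first calls parse_flow_nlattrs (under lib/) to parse upcall->key and upcall->key_len. The attribute types it finds are recorded in the bitmap present_attrs, and the struct nlattr for each type is stored in struct nlattr *attrs[]. Then, for each bit set in present_attrs, the corresponding value is read from upcall->key and stored in flow. VLAN parsing is handled separately by parse_8021q_onward.
odp_port_to_ofport converts flow.in_port, the datapath port number, into the OpenFlow port, a struct ofport_dpif *port.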
        flow_extract(upcall->packet, flow.skb_priority,
                     &flow.tunnel, flow.in_port, &miss->flow);
This parses the packet into a flow. The function partly duplicates what odp_flow_key_to_flow does.
        hash = flow_hash(&miss->flow, 0);
        existing_miss = flow_miss_find(&todo, &miss->flow, hash);
        if (!existing_miss) {
            hmap_insert(&todo, &miss->hmap_node, hash);
            miss->ofproto = ofproto;
            miss->key = upcall->key;
            miss->key_len = upcall->key_len;
            miss->upcall_type = upcall->type;
            list_init(&miss->packets);
            n_misses++;
        } else {
            miss = existing_miss;
        }
        list_push_back(&miss->packets, &upcall->packet->list_node);
    }
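flow_hash computes the hash of miss->flow, and that hash is used to look up a struct flow_miss * in the todo hmap. If none is found, this is the first miss for that flow, so the flow_miss is initialized and inserted into todo. Finally the packet is appended to the flow_miss->packets list. This confirms the earlier point: when several upcalls arrive at once, packets that belong to the same flow_miss are linked under it and processed together.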
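The batching step can be illustrated with a small stand-alone sketch. It is not the OVS code: toy_upcall, toy_miss and toy_batch_upcalls are hypothetical names, a single integer flow_id stands in for the parsed struct flow, and a linear search replaces the hmap lookup by hash.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_MAX_MISSES        16
#define TOY_MAX_PKTS_PER_MISS 8

/* One upcall: flow_id stands in for the flow parsed from the packet. */
struct toy_upcall {
    uint32_t flow_id;
    int packet_id;
};

/* One pending miss: every packet that shares the same flow. */
struct toy_miss {
    uint32_t flow_id;
    int packets[TOY_MAX_PKTS_PER_MISS];
    size_t n_packets;
};

static struct toy_miss *toy_miss_find(struct toy_miss *todo, size_t n_misses,
                                      uint32_t flow_id)
{
    for (size_t i = 0; i < n_misses; i++) {
        if (todo[i].flow_id == flow_id) {
            return &todo[i];
        }
    }
    return NULL;
}

/* Groups upcalls by flow into 'todo' and returns the number of distinct
 * misses. 'todo' must have room for TOY_MAX_MISSES entries. */
static size_t toy_batch_upcalls(const struct toy_upcall *upcalls,
                                size_t n_upcalls, struct toy_miss *todo)
{
    size_t n_misses = 0;

    for (size_t i = 0; i < n_upcalls; i++) {
        const struct toy_upcall *upcall = &upcalls[i];
        struct toy_miss *miss = toy_miss_find(todo, n_misses,
                                              upcall->flow_id);
        if (!miss) {
            /* First packet seen for this flow: start a new miss. */
            assert(n_misses < TOY_MAX_MISSES);
            miss = &todo[n_misses++];
            miss->flow_id = upcall->flow_id;
            miss->n_packets = 0;
        }
        assert(miss->n_packets < TOY_MAX_PKTS_PER_MISS);
        miss->packets[miss->n_packets++] = upcall->packet_id;
    }
    return n_misses;
}
```

With three upcalls for flows A, B, A, the sketch produces two misses, and the miss for A carries both of its packets, so the expensive flow setup runs once per flow rather than once per packet.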
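OVS defines a facet to represent the view that a user-space program such as vswitchd has of a matched flow. Kernel space has its own view of the same flow. The facet represents the part the two views share; the parts that differ are represented by subfacets. struct subfacet holds the actions.
If the flow_key computed by the datapath exactly matches the flow_key that vswitchd computes from the packet, the facet contains a single subfacet. If the datapath's flow_key has more members than the one vswitchd computes from the packet, each extra part becomes its own subfacet.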
struct subfacet {
    struct hmap_node hmap_node;
    struct list list_node;
    struct facet *facet;
    enum odp_key_fitness key_fitness;
    struct nlattr *key;
    int key_len;
    long long int used;
    uint64_t dp_packet_count;
    uint64_t dp_byte_count;
    size_t actions_len;
    struct nlattr *actions;
    enum slow_path_reason slow;
    enum subfacet_path path;
};
Now let's look at handle_flow_miss:
static void
handle_flow_miss(struct ofproto_dpif *ofproto, struct flow_miss *miss,
                 struct flow_miss_op *ops, size_t *n_ops)
{
    struct facet *facet;
    uint32_t hash;
    hash = miss->hmap_node.hash;
    facet = facet_lookup_valid(ofproto, &miss->flow, hash);
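This looks up the flow in struct ofproto_dpif *ofproto, the data structure that represents the datapath. ofproto->facets is a hash map: the hash of the miss flow is computed first, and then the hmap_node list for that hash is searched for a matching flow. The comparison is brute force, a plain memcmp.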
    if (!facet) {
        struct rule_dpif *rule = rule_dpif_lookup(ofproto, &miss->flow);
        if (!flow_miss_should_make_facet(ofproto, miss, hash)) {
            handle_flow_miss_without_facet(miss, rule, ops, n_ops);
Here the code decides that a facet is not worth creating. For trivial traffic, creating a facet would cost more overhead than it saves.
            return;
        }
        facet = facet_create(rule, &miss->flow, hash);
Otherwise, a facet is created for this flow.
    }
    handle_flow_miss_with_facet(miss, facet, ops, n_ops);
}
struct flow_miss wraps a flow to speed up batch handling of missed flows. In most cases the facet does get created, as these log lines show:
2012-10-26T07:15:43Z|22522|ofproto_dpif|INFO|[qinq] miss flow, create facet: vlan_tci 0, proto 0x806, in_port 1, src mac 0:16:3e:83:0:1, dst mac 0:25:9e:5d:62:53
2012-10-26T07:15:43Z|22529|ofproto_dpif|INFO|[qinq] miss flow, create facet: vlan_tci 0, proto 0x806, in_port 2, src mac 0:25:9e:5d:62:53, dst mac 0:16:3e:83:0:1
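As the log shows, one bidirectional exchange creates two flows, one per direction, and a facet is created for each.
Next comes handle_flow_miss_with_facet, which calls subfacet_make_actions to generate the actions. That function first calls action_xlate_ctx_init to initialize an action_xlate_ctx structure, defined as follows: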
struct action_xlate_ctx {
    struct ofproto_dpif *ofproto;
    struct flow flow;
    const struct ofpbuf *packet;
    bool may_learn;
    struct rule_dpif *rule;
    uint8_t tcp_flags;
    struct ofpbuf *odp_actions;
    tag_type tags;
    enum slow_path_reason slow;
    bool has_learn;
    bool has_normal;
    bool has_fin_timeout;
    uint16_t nf_output_iface;
    mirror_mask_t mirrors;
    int recurse;
    bool max_resubmit_trigger;
    struct flow base_flow;
    uint32_t orig_skb_priority;
    uint8_t table_id;
    uint32_t sflow_n_outputs;
    uint16_t sflow_odp_port;
    uint16_t user_cookie_offset;
    bool exit;
    struct flow orig_flow;
};
It then calls xlate_actions. OpenFlow 1.0 defines the following actions:
enum ofp10_action_type {
    OFPAT10_OUTPUT,
    OFPAT10_SET_VLAN_VID,
    OFPAT10_SET_VLAN_PCP,
    OFPAT10_STRIP_VLAN,
    OFPAT10_SET_DL_SRC,
    OFPAT10_SET_DL_DST,
    OFPAT10_SET_NW_SRC,
    OFPAT10_SET_NW_DST,
    OFPAT10_SET_NW_TOS,
    OFPAT10_SET_TP_SRC,
    OFPAT10_SET_TP_DST,
    OFPAT10_ENQUEUE,
    OFPAT10_VENDOR = 0xffff
};
Each action type takes a different data structure, e.g.
struct ofp_action_vlan_vid {
    ovs_be16 type;
    ovs_be16 len;
    ovs_be16 vlan_vid;
    uint8_t pad[2];
};
struct ofp_action_vlan_pcp {
    ovs_be16 type;
    ovs_be16 len;
    uint8_t vlan_pcp;
    uint8_t pad[3];
};
union ofp_action {
    ovs_be16 type;
    struct ofp_action_header header;
    struct ofp_action_vendor_header vendor;
    struct ofp_action_output output;
    struct ofp_action_vlan_vid vlan_vid;
    struct ofp_action_vlan_pcp vlan_pcp;
    struct ofp_action_nw_addr nw_addr;
    struct ofp_action_nw_tos nw_tos;
    struct ofp_action_tp_port tp_port;
};
do_xlate_actions receives an array of ofp_action entries (the union defined above) and performs a different operation for each one, e.g.
    case OFPUTIL_OFPAT10_OUTPUT:
        xlate_output_action(ctx, &ia->output);
        break;
    case OFPUTIL_OFPAT10_SET_VLAN_VID:
        ctx->flow.vlan_tci &= ~htons(VLAN_VID_MASK);
        ctx->flow.vlan_tci |= ia->vlan_vid.vlan_vid | htons(VLAN_CFI);
        break;
    case OFPUTIL_OFPAT10_SET_VLAN_PCP:
        ctx->flow.vlan_tci &= ~htons(VLAN_PCP_MASK);
        ctx->flow.vlan_tci |= htons(
            (ia->vlan_pcp.vlan_pcp << VLAN_PCP_SHIFT) | VLAN_CFI);
        break;
    case OFPUTIL_OFPAT10_STRIP_VLAN:
        ctx->flow.vlan_tci = htons(0);
        break;
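The three VLAN cases are pure bit manipulation on the 16-bit TCI (3 bits of PCP, 1 CFI bit, 12 bits of VID). A minimal worked sketch follows, with two simplifying assumptions: it operates in host byte order, whereas the excerpt keeps vlan_tci in network byte order and therefore wraps the masks in htons, and the toy_tci_* function names are hypothetical. The mask values are the standard 802.1Q layout.

```c
#include <assert.h>
#include <stdint.h>

#define VLAN_VID_MASK  0x0fff   /* low 12 bits: VLAN ID */
#define VLAN_CFI       0x1000   /* bit 12; set here to mean "tag present" */
#define VLAN_PCP_MASK  0xe000   /* top 3 bits: priority */
#define VLAN_PCP_SHIFT 13

/* SET_VLAN_VID: clear the old VID, then OR in the new one plus CFI. */
static uint16_t toy_tci_set_vid(uint16_t tci, uint16_t vid)
{
    assert(vid <= VLAN_VID_MASK);
    tci &= (uint16_t) ~VLAN_VID_MASK;
    tci |= (uint16_t) (vid | VLAN_CFI);
    return tci;
}

/* SET_VLAN_PCP: clear the old priority, then OR in the new one plus CFI. */
static uint16_t toy_tci_set_pcp(uint16_t tci, uint8_t pcp)
{
    assert(pcp <= 7);
    tci &= (uint16_t) ~VLAN_PCP_MASK;
    tci |= (uint16_t) ((pcp << VLAN_PCP_SHIFT) | VLAN_CFI);
    return tci;
}

/* STRIP_VLAN: an all-zero TCI means "no VLAN tag". */
static uint16_t toy_tci_strip(void)
{
    return 0;
}
```

For example, setting VID 100 on an untagged flow gives 0x1064, and then setting PCP 5 gives 0xb064. Changing the VID afterwards leaves the priority bits untouched, which is why each case masks only its own field.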
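For forwarding, the most important case is xlate_output_action, which calls xlate_output_action__. The port passed in is either an OpenFlow port number or one of the control values below, taken from the definition of ofp_port: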
enum ofp_port {
    OFPP_MAX = 0xff00,
    OFPP_IN_PORT = 0xfff8,
    OFPP_TABLE = 0xfff9,
    OFPP_NORMAL = 0xfffa,
    OFPP_FLOOD = 0xfffb,
    OFPP_ALL = 0xfffc,
    OFPP_CONTROLLER = 0xfffd,
    OFPP_LOCAL = 0xfffe,
    OFPP_NONE = 0xffff
};
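How an output port value splits into ordinary ports and reserved control values can be shown with a short sketch. The numeric constants are copied from the enum above; everything named toy_* is hypothetical, and treating every value below OFPP_MAX as an ordinary port is an assumption of this sketch rather than something the excerpt states.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_OFPP_MAX     0xff00
#define TOY_OFPP_IN_PORT 0xfff8   /* lowest reserved value in the enum */

enum toy_port_kind {
    TOY_PORT_PHYSICAL,     /* ordinary switch port */
    TOY_PORT_RESERVED,     /* one of the named control values */
    TOY_PORT_UNASSIGNED    /* between OFPP_MAX and the reserved range */
};

static enum toy_port_kind toy_classify_port(uint16_t port)
{
    if (port < TOY_OFPP_MAX) {
        return TOY_PORT_PHYSICAL;
    }
    if (port >= TOY_OFPP_IN_PORT) {
        return TOY_PORT_RESERVED;
    }
    return TOY_PORT_UNASSIGNED;
}

/* Name of a reserved port, or NULL for anything else. */
static const char *toy_reserved_name(uint16_t port)
{
    switch (port) {
    case 0xfff8: return "IN_PORT";
    case 0xfff9: return "TABLE";
    case 0xfffa: return "NORMAL";
    case 0xfffb: return "FLOOD";
    case 0xfffc: return "ALL";
    case 0xfffd: return "CONTROLLER";
    case 0xfffe: return "LOCAL";
    case 0xffff: return "NONE";
    default:     return NULL;
    }
}

/* Only an ordinary port can be used directly as a port index. */
static unsigned int toy_port_index(uint16_t port)
{
    assert(toy_classify_port(port) == TOY_PORT_PHYSICAL);
    return port;
}
```

A dispatcher shaped like xlate_output_action__ would switch on the reserved values first and fall through to the plain output path for everything else.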
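In xlate_output_action__, most traffic takes the OFPP_NORMAL case and calls xlate_normal. That calls mac_learning_lookup to find the packet's egress port in the MAC table, then calls output_normal, which eventually calls compose_output_action. The function that does the work is compose_output_action__: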
static void
compose_output_action__(struct action_xlate_ctx *ctx, uint16_t ofp_port,
                        bool check_stp)
{
    const struct ofport_dpif *ofport = get_ofp_port(ctx->ofproto, ofp_port);
    uint16_t odp_port = ofp_port_to_odp_port(ofp_port);
    ovs_be16 flow_vlan_tci = ctx->flow.vlan_tci;
    uint8_t flow_nw_tos = ctx->flow.nw_tos;
    uint16_t out_port;
    ...
    out_port = vsp_realdev_to_vlandev(ctx->ofproto, odp_port,
                                      ctx->flow.vlan_tci);
    if (out_port != odp_port) {
        ctx->flow.vlan_tci = htons(0);
    }
    commit_odp_actions(&ctx->flow, &ctx->base_flow, ctx->odp_actions);
    nl_msg_put_u32(ctx->odp_actions, OVS_ACTION_ATTR_OUTPUT, out_port);
    ctx->sflow_odp_port = odp_port;
    ctx->sflow_n_outputs++;
    ctx->nf_output_iface = ofp_port;
    ctx->flow.vlan_tci = flow_vlan_tci;
    ctx->flow.nw_tos = flow_nw_tos;
}
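commit_odp_actions encodes all pending actions in nlattr format and stores them in ctx->odp_actions. The following nl_msg_put_u32(ctx->odp_actions, OVS_ACTION_ATTR_OUTPUT, out_port) then appends the packet's egress port, at which point the flow's actions are essentially complete.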
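To make "nlattr format" concrete: a netlink attribute is a 4-byte header (16-bit total length, 16-bit type, both in host byte order) followed by the payload, padded to a 4-byte boundary. The sketch below appends such attributes to a flat buffer. It is an illustration under stated assumptions, not the OVS ofpbuf or nl_msg_* implementation: toy_buf, toy_put_attr and toy_put_u32 are hypothetical names, and TOY_ACTION_ATTR_OUTPUT is assumed to mirror the value of OVS_ACTION_ATTR_OUTPUT.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define TOY_NLA_HDRLEN  4
#define TOY_NLA_ALIGNTO 4
#define TOY_NLA_ALIGN(len) \
    (((size_t) (len) + TOY_NLA_ALIGNTO - 1) & ~(size_t) (TOY_NLA_ALIGNTO - 1))

#define TOY_ACTION_ATTR_OUTPUT 1   /* assumed to match OVS_ACTION_ATTR_OUTPUT */

/* Hypothetical fixed-size stand-in for the growable action buffer. */
struct toy_buf {
    uint8_t data[256];
    size_t size;
};

/* Appends one attribute: header, payload, then zero padding. */
static void toy_put_attr(struct toy_buf *b, uint16_t type,
                         const void *payload, size_t len)
{
    uint16_t nla_len = (uint16_t) (TOY_NLA_HDRLEN + len); /* excludes padding */
    size_t total = TOY_NLA_ALIGN(nla_len);
    uint8_t *p = b->data + b->size;

    assert(b->size + total <= sizeof b->data);
    memset(p, 0, total);
    memcpy(p, &nla_len, sizeof nla_len);
    memcpy(p + 2, &type, sizeof type);
    memcpy(p + TOY_NLA_HDRLEN, payload, len);
    b->size += total;
}

/* Same shape as the nl_msg_put_u32 call in the excerpt. */
static void toy_put_u32(struct toy_buf *b, uint16_t type, uint32_t value)
{
    toy_put_attr(b, type, &value, sizeof value);
}
```

Appending an output action for port 2 produces exactly 8 bytes: length 8, type 1, and the 32-bit port number. A list of actions is simply these attributes laid end to end, which is what the kernel walks when it executes a flow.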
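Now for the CAM table in vswitchd. The code is in lib/mac-learning.h and lib/mac-learning.c.
vswitchd maintains its own MAC/port CAM table, in which a MAC entry ages out after 300 seconds. The table also has the notion of a flooding VLAN: if a VLAN is marked as flooding, no addresses are learned on it and all forwarding on that VLAN is done by flooding.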
struct mac_entry {
    struct hmap_node hmap_node;
    struct list lru_node;
    time_t expires;
    time_t grat_arp_lock;
    uint8_t mac[ETH_ADDR_LEN];
    uint16_t vlan;
    tag_type tag;
    union {
        void *p;
        int i;
    } port;
};
struct mac_learning {
    struct hmap table;          /* hmap of mac_entry, linked in through mac_entry.hmap_node */
    struct list lrus;           /* LRU list, linked in through mac_entry.lru_node */
    uint32_t secret;
    unsigned long *flood_vlans;
    unsigned int idle_time;     /* maximum aging time */
};
static uint32_t
mac_table_hash(const struct mac_learning *ml, const uint8_t mac[ETH_ADDR_LEN],
               uint16_t vlan)
{
    unsigned int mac1 = get_unaligned_u32((uint32_t *) mac);
    unsigned int mac2 = get_unaligned_u16((uint16_t *) (mac + 4));
    return hash_3words(mac1, mac2 | (vlan << 16), ml->secret);
}
The hash of a mac_entry is computed by hash_3words from mac_learning->secret, the VLAN, and the MAC address.
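The packing is worth spelling out: the 6-byte MAC is split into a 32-bit word and a 16-bit word, the VLAN fills the upper half of the second word, and the per-table secret is the third word. The sketch below reproduces that packing with memcpy for the unaligned reads. toy_hash_3words is a hypothetical FNV-1a based stand-in, not the real hash_3words, so the resulting values differ from what OVS computes.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define ETH_ADDR_LEN 6

/* Stand-in mixer (FNV-1a over the 12 bytes of three words). */
static uint32_t toy_hash_3words(uint32_t a, uint32_t b, uint32_t c)
{
    const uint32_t words[3] = { a, b, c };
    uint32_t h = 2166136261u;

    for (int i = 0; i < 3; i++) {
        for (int shift = 0; shift < 32; shift += 8) {
            h ^= (words[i] >> shift) & 0xff;
            h *= 16777619u;
        }
    }
    return h;
}

static uint32_t toy_mac_table_hash(uint32_t secret,
                                   const uint8_t mac[ETH_ADDR_LEN],
                                   uint16_t vlan)
{
    uint32_t mac1;   /* first four bytes of the MAC */
    uint16_t mac2;   /* last two bytes of the MAC */

    assert(vlan <= 0x0fff);
    /* memcpy is a portable way to do the unaligned loads. */
    memcpy(&mac1, mac, sizeof mac1);
    memcpy(&mac2, mac + 4, sizeof mac2);
    return toy_hash_3words(mac1, mac2 | ((uint32_t) vlan << 16), secret);
}
```

Because the VLAN is part of the key, the same MAC address learned on two VLANs yields two independent entries. Mixing in a secret that differs per table makes bucket placement hard to predict from outside.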
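mac_entry_lookup: checks whether a mac_entry already exists for the given MAC address and VLAN.
get_lru: returns the first mac_entry on the LRU list.
mac_learning_create / mac_learning_destroy: create and destroy the mac_learning table.
mac_learning_may_learn: returns true if the VLAN is not a flooding VLAN and the MAC address is not a multicast address.
mac_learning_insert: inserts a mac_entry into mac_learning. It first uses mac_entry_lookup to check whether an entry for the MAC and VLAN already exists. If it does not, and the table already holds MAC_MAX entries, the oldest entry is evicted; then a new mac_entry is created and inserted into the CAM table.
mac_learning_lookup: calls mac_entry_lookup to find the entry for a MAC address within a given VLAN.
mac_learning_run: loops over the table and expires every mac_entry that has timed out.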
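The learn, look up, evict and age-out behaviour described by these functions can be sketched with a tiny fixed-size table. This is a simplified illustration, not lib/mac-learning.c: all toy_* names are hypothetical, a linear array replaces the hmap and the LRU list, the capacity is 4 instead of MAC_MAX, time is passed in as a plain number of seconds, and flooding VLANs are left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define ETH_ADDR_LEN  6
#define TOY_MAC_MAX   4      /* tiny capacity so eviction is easy to see */
#define TOY_IDLE_TIME 300    /* seconds, matching the default aging time */

struct toy_mac_entry {
    bool in_use;
    uint8_t mac[ETH_ADDR_LEN];
    uint16_t vlan;
    int port;
    long expires;            /* absolute time at which the entry ages out */
};

struct toy_mac_table {
    struct toy_mac_entry e[TOY_MAC_MAX];
};

static bool toy_is_multicast(const uint8_t mac[ETH_ADDR_LEN])
{
    return (mac[0] & 1) != 0;   /* group bit of the first octet */
}

static struct toy_mac_entry *toy_mac_lookup(struct toy_mac_table *t,
                                            const uint8_t mac[ETH_ADDR_LEN],
                                            uint16_t vlan)
{
    for (size_t i = 0; i < TOY_MAC_MAX; i++) {
        struct toy_mac_entry *e = &t->e[i];
        if (e->in_use && e->vlan == vlan
            && !memcmp(e->mac, mac, ETH_ADDR_LEN)) {
            return e;
        }
    }
    return NULL;
}

/* Learns or refreshes (mac, vlan) -> port. Multicast sources are ignored.
 * When the table is full, the entry closest to expiry is evicted. */
static void toy_mac_learn(struct toy_mac_table *t,
                          const uint8_t mac[ETH_ADDR_LEN], uint16_t vlan,
                          int port, long now)
{
    if (toy_is_multicast(mac)) {
        return;
    }
    struct toy_mac_entry *e = toy_mac_lookup(t, mac, vlan);
    if (!e) {
        e = &t->e[0];
        for (size_t i = 0; i < TOY_MAC_MAX; i++) {
            struct toy_mac_entry *c = &t->e[i];
            if (!c->in_use) {
                e = c;               /* a free slot always wins */
                break;
            }
            if (c->expires < e->expires) {
                e = c;               /* otherwise track the oldest */
            }
        }
        e->in_use = true;
        memcpy(e->mac, mac, ETH_ADDR_LEN);
        e->vlan = vlan;
    }
    e->port = port;
    e->expires = now + TOY_IDLE_TIME;
}

/* Expires timed-out entries; returns how many were removed. */
static size_t toy_mac_run(struct toy_mac_table *t, long now)
{
    size_t n = 0;
    for (size_t i = 0; i < TOY_MAC_MAX; i++) {
        if (t->e[i].in_use && t->e[i].expires <= now) {
            t->e[i].in_use = false;
            n++;
        }
    }
    return n;
}
```

A forwarding decision in the style of xlate_normal would call toy_mac_lookup on the destination MAC and VLAN, send the packet to the returned port on a hit, and flood it on a miss.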