Why we must use MPLSoUDP with Contrail

Contrail makes massive use of overlays. Its whole architecture is based on leveraging overlays to provide L2/L3 virtualization and make the underlying IP fabric transparent to virtual workloads.

Contrail supports 3 types of encapsulation:

  1. MPLSoGRE
  2. MPLSoUDP
  3. VXLAN

MPLSoGRE and MPLSoUDP are used for L3 virtualization while VXLAN is used for L2 virtualization.
If we are planning to implement a L2 use-case, there is not much to think about…VXLAN is the way!
With L3 use-cases, instead, a question arises: MPLS over GRE or MPLS over UDP?

As often happens in this industry the answer might be “it depends” 😊 Anyhow, here, the answer is pretty clear: MPLSoUDP!

Before understanding why we choose MPLSoUDP, let’s see when we have to use MPLSoGRE. Again, the answer is pretty self-explanatory. We use MPLSoGRE when we cannot use MPLSoUDP. This might happen because our SDN GW is running a software release that does not support MPLSoUDP.
Apart from this situation, go with MPLSoUDP!

In order to understand why MPLSoUDP is better, we need to recall how an MPLSoUDP packet is built.

The original raw packet first gets an MPLS label. This is the service label: it is how Contrail and the SDN GW associate packets with the right virtual network/VRF.
Next, a UDP (+ IP) header is added. The UDP header includes a source and a destination port. The source port is the result of a hash computed over the inner packet. As a result, this field is extremely variable: the source port brings huge entropy!
This entropy is the reason behind choosing MPLSoUDP!
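
If you want to actually see this entropy on the wire, a quick capture is enough. MPLS-in-UDP uses destination port 6635, so capturing on a transit link or on the fabric interface of a kernel-mode compute node (on a DPDK node the NIC is not visible to the kernel) shows the varying source ports. The interface name below is just an example:

# observe the varying UDP source ports of MPLSoUDP packets (6635 = MPLS-in-UDP)
tcpdump -c 20 -ni eth0 'udp dst port 6635'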

Using MPLSoUDP brings advantages at different levels.

The first benefit is seen at the SDN GW. Imagine you have an MPLSoUDP tunnel between the SDN GW and a compute node. Between the two endpoints there are multiple ECMP paths.

Choosing one ECMP path over another is based on a hash function performed on packets. In order to achieve a better distribution, we need high entropy and, as we have seen, MPLSoUDP provides exactly that!

Let’s see an example on an SDN GW:

user@sdngw> show route table contrail_vrf1.inet.0 100.64.90.0/24 active-path extensive | match "Protocol Next Hop"
                Protocol next hop: 163.162.83.233
                Protocol next hop: 163.162.83.233
                        Protocol next hop: 163.162.83.233
                        Protocol next hop: 163.162.83.233

{master}
user@sdngw> show route table inet.0 163.162.83.233

inet.0: 2498 destinations, 4709 routes (2498 active, 0 holddown, 2 hidden)
+ = Active Route, - = Last Active, * = Both

163.162.83.233/32  *[BGP/170] 8w4d 04:31:10, localpref 75
                      AS path: 64520 65602 65612 I, validation-state: unverified
                      to 172.16.41.19 via ae31.1
                    > to 172.16.41.23 via ae32.1
                    [BGP/170] 8w4d 04:31:10, localpref 75
                      AS path: 64520 65601 65612 I, validation-state: unverified
                    > to 172.16.41.19 via ae31.1

{master}
user@sdngw> show route forwarding-table table default destination 163.162.83.233
Routing table: default.inet
Internet:
Enabled protocols: Bridging,
Destination        Type RtRef Next hop           Type Index    NhRef Netif
163.162.83.233/32  user     0                    ulst  1048601    26
                              172.16.41.19       ucst     1978     6 ae31.1
                              172.16.41.23       ucst     1977     6 ae32.1

As you can see, there are 2 ECMP paths towards the compute node. Using MPLSoUDP allows us to distribute packets between the two paths in a more balanced way.
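
For the gateway to actually install and use both next hops, a per-flow load-balancing policy must be exported to the forwarding table. This is standard Junos configuration; a minimal sketch (the policy name is arbitrary):

set policy-options policy-statement LB then load-balance per-packet
set routing-options forwarding-table export LB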

Another benefit of using MPLSoUDP is seen if we look at packets being sent out of the compute node.

What I’m going to say is true if we consider a setup where interface vhost0 is a bond interface (2 physical NICs bonded together).

In such a scenario, the compute node is multihomed to two leaves (an IP Fabric running EVPN+VXLAN, using ESI to deal with multihomed CEs). As a consequence, when a packet leaves the server, it is sent over one of the 2 links of the bond.
Now, based on the bond configuration, the choice between the two links is made according to a hash. As a hash is involved, once again, relying on MPLSoUDP is better as it brings more entropy, which means better distribution.
Distributing traffic evenly across all the bond members will likely lead to traffic being distributed well across the whole fabric!
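
As a quick sanity check, you can verify which transmit hash policy the bond is actually using. For a kernel bond this is exposed under /proc, while on a DPDK compute the bond is a DPDK vdev and the policy appears in the contrail-vrouter-dpdk command line (bond names here are just examples):

# kernel bond (e.g. kernel-mode vrouter or host bond)
grep "Transmit Hash Policy" /proc/net/bonding/bond0
# dpdk vrouter: l34 means layer3+4 hashing
ps -ef | grep contrail-vrouter-dpdk | grep -o "xmit_policy=[a-z0-9]*"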

The last MPLSoUDP benefit we are going to look at is a performance one on DPDK nodes. To understand it, we need at least a very high-level view of how certain aspects of a DPDK vrouter work.

A DPDK vrouter is assigned a certain number of cores (based on a configuration parameter). Let’s imagine we assign 4 cores to the vrouter. As a consequence, through ethdev, the DPDK vrouter programs 4 queues on the physical NIC (vif0/0); there is then a 1:1 mapping between a vrouter core and a NIC queue.

For packets coming from outside the server (the physical NIC receives packets), vrouter cores behave as polling cores (vrouter cores can take other roles too, such as processing core; anyhow, here we are not interested in a detailed understanding of a DPDK vrouter, so we leave that discussion for another time). What matters here is that each vrouter core, acting as a polling core, continually checks whether its assigned NIC queue has packets to be polled.

Before the polling takes place, the physical NIC first receives the packet on the wire, then “sends” it to one of the queues. To do that, the physical NIC performs a hash on the packet. At this point, things should be clear: as a hash is involved, MPLSoUDP guarantees a better distribution of traffic over the NIC queues. Better distributing packets over NIC queues means better distributing packets across vrouter cores (remember, there is a 1:1 mapping between NIC queues and vrouter cores).

Why is it important to spread traffic as evenly as possible across forwarding cores?
Each forwarding core can process up to X PPS (packets per second). PPS, indirectly, means throughput: normally, the higher the PPS, the higher the throughput.

Let’s take an example. Suppose each forwarding core can process up to 2M PPS; this means the vrouter can process at most 8M PPS.
Now, suppose MPLSoGRE is used. That encapsulation does not guarantee an efficient distribution. This means that, potentially, traffic might be sent to only 2 out of 4 forwarding cores (or at least the majority of traffic might land on 2 out of 4 forwarding cores). If so, vrouter performance would be roughly around 4M PPS (about 50% of the total capacity).
Instead, using MPLSoUDP, traffic will be distributed better across all 4 forwarding cores. This means the vrouter will be able to reach a total of 8M PPS. In other words, performance is way better!

Summing up: better balancing at the gateway, better balancing at the compute nodes, better balancing inside the DPDK vrouter. Unless your SDN GW only supports MPLSoGRE, there is no reason to avoid MPLSoUDP: only advantages!

Ciao
IoSonoUmberto

“Which SDN solution is better?”…what I learned

Recently, I worked on a project where we deployed a virtual mobile chain inside a Contrail cluster. The chain included virtual network functions like the Packet Gateway and TCP optimization.
The same application chain was also deployed and tested using another SDN solution.
Contrail is a L3 SDN solution while the other one was a L2 SDN solution.
Obviously, the big question came up: which one is better?
Normally, we tend to answer that question by looking exclusively, or primarily, at performance; something like “well, solution A can go up to 10M pps while solution B only up to 7M pps, so…”.
That is without doubt important; it has an impact on the overall solution, can’t argue with that. For example, if one solution is twice as fast as the other, you might halve the number of virtual machines (assuming the VM can cope with all that traffic), leading to server savings, a smaller number of devices to manage, smaller licensing costs (depending on the licensing model of course…) and so on…
So, yes, pps matter and will always do…but they are not everything!
In my opinion, at a certain point, pps does not even matter anymore.
Think of a compute node. You normally connect it to the IP fabric through a LAG which is multihomed to two leaves. That LAG gives you a total bandwidth of N*X, where N is the number of members and X the interface speed (typically 10G or 40G). Let’s assume we use LAGs made of 2x10G. Let’s also consider that the average packet size is 700B. Now, solution A goes up to 10M pps; with some simple math we obtain 10^7 pps * 700 B * 8 = 56 Gbps! Very high! Solution B only goes up to 7M pps, which gives around 39 Gbps. In this case, those additional 3M pps do not play a role: I’m able to fill my LAG in both cases and that’s enough for me…at least with average packet size.
If we consider small packets (46B), solution A reaches 3.6 Gbps while solution B only 2.5 Gbps. Here, we are far from 100% LAG utilization and pps for sure matter more. But do we have to think of a network with only small packets?
We should identify our average packet size (our own IMIX) and see how far we can go with the two solutions. If both allow us to reach 100% LAG utilization, then comparing pps does not help that much in identifying the best SDN solution.
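
The arithmetic above is easy to reproduce; a quick sketch with the same illustrative figures:

pps=10000000   # solution A
size=700       # average packet size in bytes
echo "$(( pps * size * 8 / 1000000000 )) Gbps"   # -> 56 Gbps
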
“Looking at raw performance numbers is not enough”: this was my starting point for thinking about what other aspects play a role and can tell us solution A is better than B or vice versa.
To tackle this, I went through different aspects of the architecture and looked at them from a “solution perspective”, instead of a “purely performance perspective”.
The role of overlay
Contrail relies on an overlay mesh running between compute nodes and towards the SDN gateways. MPLSoUDP or VXLAN tunnels are built to carry VM traffic.
These tunnels “hide” the actual traffic sent to/from virtual machines. Inside the vRouter, each virtual network is nothing more than a good old VRF. Exactly like classic VPNs, MPLS labels are assigned to each VRF and VM packets travel encapsulated across the DC fabric (MPLSoUDP is the new MPLSoMPLS while the IP Fabric is the new backbone).
As a result, the IP Fabric does not see the real traffic; it only sees MPLSoUDP (or VXLAN) traffic. Inside those packets, different virtual networks are identified through different MPLS labels, but that is totally transparent to the fabric.
[figure: contrail_vlans]
What does this mean from a fabric point of view? Mainly, you only need to configure a single vlan associated with the Contrail data plane network. Inside that vlan, we have MPLSoUDP which, as we said, hides all the virtual networks.
The other SDN solution, instead, used a different approach. Different virtual networks exit the compute node as different vlans. The compute node’s fabric-facing interface was like a classic trunk port. This means that each virtual network became a vlan on the IP Fabric. There was no overlay hiding this “complexity”: everything was there!
[figure: ovs_vlans]
At the end of the day, end to end communication between VMs and towards SDN GWs worked in both scenarios but at which cost?
Do not think just about whether endpoints can communicate or not. Think of the provisioning effort and service creation.
With Contrail, you created the control/data plane vlan on the fabric at day 0. Then, when the mobile chain was created, no changes were needed on the fabric.
On the other side, without Contrail, every time you needed to create a new service (mobile chain or other…) networking had to be configured on both the SDN solution and the IP fabric.
This means increased complexity, a higher chance of making mistakes, additional tools to automate configurations (unless you are ok with configuring the fabric manually…which you shouldn’t do anyway) and, more in general, a slower time to market to make the service available and start making money out of it. In the end, $$$ is always what matters 😊☹
This was the first aspect that helped understand how two SDN solutions can differ as soon as we look a bit beyond the pure pps numbers.
I said compute nodes use a LAG to connect to the IP fabric.
Ideally, you want traffic to be balanced between the members of the LAG.
With Contrail, this was true and we were able to observe it during tests.
Instead, the other SDN solution used a LAG mode where outgoing traffic was hashed based on the source address. As a result, traffic was not balanced equally, which means it was not possible to fully exploit the LAG bandwidth.
From this perspective, the actual pps numbers mattered even less, as it was not even possible to use the LAG at 100%.
[figure: bonds]
This comparison highlighted that a Contrail solution was able to provide better resource utilization which means a better ROI on the overall DC infrastructure.
As we were in a telco cloud, dealing with mobile core applications, the virtual machines were not the ones we normally find in enterprises: web servers, databases, front-ends, etc… The use-cases are different, and so are the workloads. The virtual machines we worked with needed routing! Routing was crucial to exchange routes and provide fast convergence. For example, the P-Gateway needs routing to advertise address pools.
Contrail natively uses BGP. It uses BGP for internal communications, it uses BGP to communicate with SDN GWs and it uses BGP with virtual machines. This last use-case is commonly known as BGPaaS, where a BGP session is built between the VM and vRouter (contrail).
This is possible as Contrail is a L3 SDN controller so it is L3 capable and L3 aware.
So what changes do we have with a L2 SDN solution?
Let’s consider a use-case: the SGi network. That network is the P-Gateway egress network. This network, in turn, connects to the next element of the mobile chain: in our case, a TCP optimization VNF.
The P-Gateway advertises user pools via BGP…but to whom?
With Contrail, the P-Gateway has a BGP session with Contrail using BGPaaS. BGPaaS was also used to re-advertise those routes towards the TCP optimizer through another BGPaaS session between Contrail and the TCP optimizer.
Everything was dealt with and managed inside Contrail…let’s keep that in mind 😊
How was the same problem faced with the other solution?
Due to VNF constraints, it was not possible to create direct sessions between the P-Gateway and the TCP optimizer. As a consequence, an intermediate peer was needed. As this was a dialogue between VMs, the customer did not want traffic to exit the fabric…so the fabric had to do the routing. This meant configuring BGP on the spine devices. Moreover, to optimize traffic, the P-Gateway was sending thousands of “small” routes which all needed to be stored and managed on the spine.
Moving routing functionalities to the spine meant placing a huge burden on a device that is not meant for that. Spines should be switches that simply forward packets. In this case, the spine became a router…but is it strong enough to be a router? The answer is yes, if you buy a well-equipped device. This means increasing CAPEX and spending more money on more powerful, but more expensive, devices acting as spines. Is this acceptable? There is no absolute answer but I think a “no” would be the right one most of the time.
On the other hand, with Contrail, no routing was seen on the fabric: just packet forwarding on that single vlan carrying the Contrail tunnels. All those “small” routes still exist, but they exist on a vRouter, an L3 element which is a better fit than a switch. Those “small” routes will be further advertised to the SDN GW, which is a router, so that is pretty reasonable, right?
[figure: routing]
As we can see, with Contrail, the fabric is not involved in the control plane. Everything is managed at the Contrail level leveraging Contrail internal signaling mechanisms.
Removing the control plane from the fabric also means easier service creation. As we said before when talking about vlans, no additional configuration is needed on the fabric when creating the service. The whole service is created by simply defining a template listing the needed virtual resources (Heat template 😉). Without Contrail, both a Heat template and additional configuration on the fabric were needed…with all the consequences and considerations we mentioned before.
Similar routing considerations can be made for the SDN GW. Here, we consider the egress side of the TCP optimizer, which announces pools to the SDN GW (here the fabric provides just L2).
Without Contrail, multiple BGP sessions had to be configured (N TCP optimizers and M SDN GWs mean N*M BGP sessions).
With Contrail, instead, we leverage the infrastructure BGP session we already have between the Control node and the SDN GWs. Over this session, Contrail advertises routes for all its virtual networks. The SDN GWs then import them based on the route targets configured in their VRFs (remember? Contrail is just VPNs in a DC…).
The process would look like this: VM advertises routes to Contrail via BGPaaS and Contrail re-advertises them to SDN GW via that single BGP session. If you add more virtual networks or more BGPaaS sessions, the number of sessions with the SDN GW will not change: always one!
In this case, the difference does not lie in the configuration size: in both cases you need to configure something on the SDN GW, be it a bunch of VRFs or a bunch of routing instances and BGP sessions.
The difference here is in the number of BGP sessions running at the SDN GW and this can have scalability impacts that we cannot underestimate.
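On the Junos side, this is the difference between one MP-BGP session carrying inet-vpn routes for all virtual networks plus a VRF per virtual network, and a full mesh of per-VNF sessions. A minimal sketch of the Contrail-facing part (names, addresses and targets are placeholders, and the dynamic tunnel configuration is omitted):

set protocols bgp group contrail type internal
set protocols bgp group contrail local-address 10.0.0.1
set protocols bgp group contrail family inet-vpn unicast
set protocols bgp group contrail neighbor 10.0.0.10
set routing-instances contrail_vrf1 instance-type vrf
set routing-instances contrail_vrf1 vrf-target target:64512:1
set routing-instances contrail_vrf1 vrf-table-label
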
[figure: routing_sdngw]
If you look at the image, there is only one BGP session between Contrail and the SDN GW but multiple XMPP sessions (a BGP-like protocol) between vrouters and the Contrail controller. This is true, but that is part of the Contrail architecture: once Contrail is up and running, it takes care of all those XMPP sessions automatically, totally transparently to users.
From an operator point of view, you configure the Contrail-SDN_GW BGP session once, then you create BGPaaS objects when deploying the applications.
So far, we have seen advantages from a service provisioning and/or configuration point of view. Anyhow, SDN solutions can be compared from an operational point of view as well.
To achieve higher performance, compute nodes were deployed using DPDK.
Without Contrail it was difficult to monitor specific virtual machine interfaces.
Instead, with Contrail, there was a set of CLI tools, including a tcpdump-like one, that allowed us to identify a specific VM port on the dpdk vrouter and sniff the traffic going through it.
This made a solution like Contrail friendlier in a troubleshooting scenario.
Troubleshooting also means mirroring.
Contrail vRouter supports built-in mirroring. You can selectively mirror packets on a given virtual machine interface and have that traffic sent to an external analyzer (DPI).
The same was not possible without Contrail and mirroring had to rely on ad-hoc solutions configured on the IP Fabric. This meant putting another burden on the IP Fabric, increasing its overall complexity.
Operations is not only about troubleshooting but also about monitoring what’s going on inside the cluster.
In addition to all the elements provided natively by Openstack, Contrail provides a component called Analytics which, via REST APIs, gives the operator a huge amount of information and the ability to receive alarms in real time so as to react fast in case of issues.
This helps to have better control over the Contrail infrastructure.
In this case, the Contrail-based solution enriched the set of tools compared to the solution based on the L2 SDN controller.
In conclusion, comparing the 2 solutions from a wider perspective made some interesting points emerge.
Contrail allowed a thinner IP fabric that required fewer configuration changes and could focus more on its main purpose: forwarding packets in a non-blocking, redundant fashion.
Moreover, all the application-related routing was moved away from the fabric, avoiding the need to purchase more expensive switches to cope with the requirements imposed by routing.
This CAPEX saving translates into OPEX savings as a “more stable fabric configuration” requires less time spent working on the fabric.
Services could be implemented faster as the whole provisioning process was limited to Contrail and, if needed, the SDN GW.
At the same time, troubleshooting and monitoring the whole infrastructure turned out to be easier thanks to the additional tools provided by Contrail, making operations’ life easier and driving costs down as it is no longer needed to build/buy custom tools for everything.
This showed me how a short question like “Which solution is better?” hides a large number of aspects and considerations that make just looking at the raw pps numbers pretty misleading.
I highlighted Contrail pros and how a Contrail-driven solution offered advantages. I’m on the Contrail side so…well, I’m not that impartial, I admit 🙂
Anyhow, a key concept I wanted to express is that comparing complex solutions like SDN controllers, which involve the whole DC and not just that, cannot be reduced to a raw comparison of pps numbers.
Ciao
IoSonoUmberto

Pinning vrouter service cores to improve performance

We are constantly looking for better performance. It allows us to better utilize our infrastructure and to provide a faster service to end users.
The same is true for virtual environments like data centers running Openstack.
A first performance boost is given by using DPDK instead of a standard kernel based solution.
Anyhow, this is not enough! DPDK is faster but DPDK itself can be optimized.
I have already talked about setting up the right DPDK environment for Contrail (coremask, bond policy, encapsulation, etc…) in other posts but, recently (end of 2019) a new possibility emerged.
Until now, we were used to dividing our cpu cores as follows:
[figure: old_mask]
Contrail vRouter is assigned some cores so that packet forwarding has dedicated resources that no one else can “touch”. Next, we have cores dedicated to Nova virtual machines; this way we are sure that only virtual machines will use those cores, avoiding possible pollution. Last, we have cores dedicated to standard OS processes (for instance, host OS ntp process).
Actually, here, we also have some cores that are totally unused: no one (vrouter, virtual machines, host OS) will touch them. Is this a waste? Probably yes but let’s accept it 😊
If we connect to a dpdk compute node we can easily identify the PID of the vrouter dpdk process:

[root@cpt7-dpdk-tovb-nbp ~]# ps -ef | grep dpdk
root      481592  481354 99 17:48 ?        01:31:35 /usr/bin/contrail-vrouter-dpdk --no-daemon --socket-mem 1024 1024 --vlan_tci 1074 --vlan_fwd_intf_name bond2 --vdev eth_bond_bond2,mode=4,xmit_policy=l34,socket_id=0,mac=48:df:37:3e:a8:44,lacp_rate=1,slave=0000:04:00.0,slave=0000:05:00.0

Once we have that value, we can check all the processes associated to it:

[root@cpt7-dpdk-tovb-nbp ~]# pidstat -t -p 481592
Linux 3.10.0-1062.4.1.el7.x86_64 (cpt7-dpdk-tovb-nbp)   01/10/2020      _x86_64_        (56 CPU)

06:11:26 PM   UID      TGID       TID    %usr %system  %guest    %CPU   CPU  Command
06:11:26 PM     0    481592         -    0.18    0.11    0.00    0.29    43  contrail-vroute
06:11:26 PM     0         -    481592    0.00    0.00    0.00    0.00    43  |__contrail-vroute
06:11:26 PM     0         -    481605    0.00    0.00    0.00    0.00     3  |__rte_mp_handle
06:11:26 PM     0         -    481606    0.00    0.00    0.00    0.00     3  |__rte_mp_async
06:11:26 PM     0         -    481614    0.00    0.00    0.00    0.00    15  |__eal-intr-thread
06:11:26 PM     0         -    481615    0.00    0.00    0.00    0.00    15  |__lcore-slave-1
06:11:26 PM     0         -    481616    0.00    0.00    0.00    0.00    14  |__lcore-slave-2
06:11:26 PM     0         -    481617    0.00    0.00    0.00    0.00    28  |__lcore-slave-8
06:11:26 PM     0         -    481618    0.00    0.00    0.00    0.00    15  |__lcore-slave-9
06:11:26 PM     0         -    481619    0.04    0.03    0.00    0.07     1  |__lcore-slave-10
06:11:26 PM     0         -    481620    0.04    0.03    0.00    0.07     2  |__lcore-slave-11
06:11:26 PM     0         -    481621    0.04    0.03    0.00    0.07     3  |__lcore-slave-12
06:11:26 PM     0         -    481622    0.04    0.03    0.00    0.07     4  |__lcore-slave-13
06:11:26 PM     0         -    482040    0.00    0.00    0.00    0.00    28  |__lcore-slave-9

Let’s start with lcores 10 to 13. Those are the vrouter forwarding cores and, as you can see, they are pinned to cpus 1-4. This comes from the configured dpdk core mask:

[root@cpt7-dpdk-tovb-nbp ~]# cat /etc/sysconfig/network-scripts/ifcfg-vhost0 | grep CPU
CPU_LIST=0x1E
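
If you want to double-check which cpus a hex mask like 0x1E corresponds to, a tiny helper (mine, purely illustrative) decodes it:

mask=0x1E
for c in $(seq 0 63); do (( (mask >> c) & 1 )) && printf "%s " "$c"; done; echo
# prints: 1 2 3 4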

You may notice that some threads (rte*) are running on a vrouter forwarding core. This comes from a dpdk bug that will be solved in newer releases; I will briefly talk about this later.
Anyhow, we also have lcores 1, 2, 8 and 9. Those are called service cores. This new optimization takes this second set of cores into consideration: simply put, we will pin them as well.
As a consequence, our cpus layout will become:
[figure: new_mask]
Some cores (1 physical core, 2 if considering hyperthreading) will be dedicated to vrouter service processes!
Now, we are going to look at all these aspects in detail and see how to properly configure our setup.
Please be aware that, right now, this is not an official procedure and it has not been productized yet. Only use it in a lab environment for testing purposes! As we will see, the procedure is not automated and is affected by non-contrail bugs.
First, we have to be sure the environment is ready.
We are using a RHEL system and, in order to “isolate” cpus, we use the tuned utility:

[root@cpt7-dpdk-tovb-nbp ~]# tuned-adm active
Current active profile: cpu-partitioning

At the beginning, tuned configuration reflects this setup:
[figure: old_mask]

[root@cpt7-dpdk-tovb-nbp ~]# cat /etc/tuned/cpu-partitioning-variables.conf
isolated_cores=1-13,19-27,29-41,47-55

meaning the host OS has these cores available:

[root@cpt7-dpdk-tovb-nbp ~]# cat /etc/systemd/system.conf | grep Affi
#CPUAffinity=1 2
CPUAffinity=0 14 15 16 17 18 28 42 43 44 45 46

I modify tuned configuration file as follows:

[root@cpt7-dpdk-tovb-nbp ~]# cat /etc/tuned/cpu-partitioning-variables.conf
isolated_cores=1-13,15,19-27,29-41,43,47-55

and re-apply the profile:

[root@cpt7-dpdk-tovb-nbp ~]# vi /etc/tuned/cpu-partitioning-variables.conf
[root@cpt7-dpdk-tovb-nbp ~]# tuned-adm profile cpu-partitioning
CONSOLE  tuned.plugins.plugin_systemd: you may need to manualy run 'dracut -f' to update the systemd configuration in initrd image

As a result, the host OS will no longer use the cores dedicated to vrouter service threads (15, 43):

[root@cpt7-dpdk-tovb-nbp ~]# cat /etc/systemd/system.conf | grep Affi
CPUAffinity=0 14 16 17 18 28 42 44 45 46

This can be confirmed this way:

[root@cpt7-dpdk-tovb-nbp ~]# taskset -pc 1
pid 1's current affinity list: 0,14,16-18,28,42,44-46

We already saw this before, but let’s check that the vrouter forwarding core mask is correct:

[root@cpt7-dpdk-tovb-nbp ~]# cat /etc/sysconfig/network-scripts/ifcfg-vhost0 | grep CPU
CPU_LIST=0x1E

Next, I verify nova pin set is right:

()[nova@cpt7-dpdk-tovb-nbp /]$ cat /etc/nova/nova.conf | grep pin_set
#     vcpu_pin_set = "4-12,^8,15"
vcpu_pin_set=5-13,33-41

At this point we need to modify how the dpdk vrouter agent container is created.
This is managed by a function inside this file:

[root@cpt7-dpdk-tovb-nbp ~]# cat /etc/sysconfig/network-scripts/network-functions-vrouter-dpdk | grep -A 5 "docker run"
    eval "docker run \
        --detach \
        --name ${container_name} \
        --net host --privileged \
        --restart always \
        -v /etc/hosts:/etc/hosts:ro \

We need to add a line setting the “cpuset-cpus” parameter:

[root@cpt7-dpdk-tovb-nbp ~]# cat /etc/sysconfig/network-scripts/network-functions-vrouter-dpdk | grep -A 5 "docker run"
    eval "docker run \
        --detach \
        --cpuset-cpus="0,1,2,3,4,14-18,28-32,42-46"\
        --name ${container_name} \
        --net host --privileged \
        --restart always \
        -v /etc/hosts:/etc/hosts:ro \

To trigger container re-creation with the new option I have to bring down/up vhost0:

[root@cpt7-dpdk-tovb-nbp ~]# ifdown vhost0
INFO: send SIGTERM to the container tim-contrail_containers-vrouter-agent-dpdk
tim-contrail_containers-vrouter-agent-dpdk
INFO: wait container tim-contrail_containers-vrouter-agent-dpdk finishes
0
INFO: send SIGTERM to the container tim-contrail_containers-vrouter-agent-dpdk
Error response from daemon: Cannot kill container tim-contrail_containers-vrouter-agent-dpdk: Container 062d999e9af99d3c3009104166bcb529f991c59e42295c3fc537cbbeef20bb2d is not running
INFO: wait container tim-contrail_containers-vrouter-agent-dpdk finishes
0
INFO: remove the container tim-contrail_containers-vrouter-agent-dpdk
tim-contrail_containers-vrouter-agent-dpdk
INFO: rebind device 0000:04:00.0 from vfio-pci to driver ixgbe
INFO: unbind 0000:04:00.0 from vfio-pci
INFO: bind 0000:04:00.0 to ixgbe
INFO: rebind device 0000:05:00.0 from vfio-pci to driver ixgbe
INFO: unbind 0000:05:00.0 from vfio-pci
INFO: bind 0000:05:00.0 to ixgbe
INFO: restore bind interface bond2
/etc/sysconfig/network-scripts /etc/sysconfig/network-scripts
/etc/sysconfig/network-scripts
[root@cpt7-dpdk-tovb-nbp ~]# ifup vhost0
8892db315c5188785a3500c236812a5636349a59bef20977258a14106b9abc1a
INFO: wait DPDK agent to run... 1
INFO: wait DPDK agent to run... 2
INFO: wait vhost0 to be initilaized... 0/60
INFO: wait vhost0 to be initilaized... 1/60
INFO: wait vhost0 to be initilaized... 2/60
INFO: vhost0 is ready.
RTNETLINK answers: File exists
RTNETLINK answers: File exists
RTNETLINK answers: File exists
RTNETLINK answers: File exists

We check the new property was taken into consideration:

[root@cpt7-dpdk-tovb-nbp ~]# docker inspect 8892db3 | grep CpusetCpus
            "CpusetCpus": "0,1,2,3,4,14-18,28-32,42-46",

Everything looks alright!
Now it is time to pin service cores!
Let’s check again current pinning:

[root@cpt7-dpdk-tovb-nbp ~]# pidstat -t -p 481592
Linux 3.10.0-1062.4.1.el7.x86_64 (cpt7-dpdk-tovb-nbp)   01/10/2020      _x86_64_        (56 CPU)

06:11:26 PM   UID      TGID       TID    %usr %system  %guest    %CPU   CPU  Command
06:11:26 PM     0    481592         -    0.18    0.11    0.00    0.29    43  contrail-vroute
06:11:26 PM     0         -    481592    0.00    0.00    0.00    0.00    43  |__contrail-vroute
06:11:26 PM     0         -    481605    0.00    0.00    0.00    0.00     3  |__rte_mp_handle
06:11:26 PM     0         -    481606    0.00    0.00    0.00    0.00     3  |__rte_mp_async
06:11:26 PM     0         -    481614    0.00    0.00    0.00    0.00    15  |__eal-intr-thread
06:11:26 PM     0         -    481615    0.00    0.00    0.00    0.00    15  |__lcore-slave-1
06:11:26 PM     0         -    481616    0.00    0.00    0.00    0.00    14  |__lcore-slave-2
06:11:26 PM     0         -    481617    0.00    0.00    0.00    0.00    28  |__lcore-slave-8
06:11:26 PM     0         -    481618    0.00    0.00    0.00    0.00    15  |__lcore-slave-9
06:11:26 PM     0         -    481619    0.04    0.03    0.00    0.07     1  |__lcore-slave-10
06:11:26 PM     0         -    481620    0.04    0.03    0.00    0.07     2  |__lcore-slave-11
06:11:26 PM     0         -    481621    0.04    0.03    0.00    0.07     3  |__lcore-slave-12
06:11:26 PM     0         -    481622    0.04    0.03    0.00    0.07     4  |__lcore-slave-13
06:11:26 PM     0         -    482040    0.00    0.00    0.00    0.00    28  |__lcore-slave-9

Right now, what matters is that forwarding cores are consistent with vhost0 coremask.
This is also confirmed via this command:

[root@cpt7-dpdk-tovb-nbp ~]# taskset -cp -a 481592
pid 481592's current affinity list: 0-4,14-18,28-32,42-46
pid 481605's current affinity list: 3,4,14-18,28-32,42-46
pid 481606's current affinity list: 3,4,14-18,28-32,42-46
pid 481614's current affinity list: 3,4,14-18,28-32,42-46
pid 481615's current affinity list: 0-4,14-18,28-32,42-46
pid 481616's current affinity list: 0-4,14-18,28-32,42-46
pid 481617's current affinity list: 0-4,14-18,28-32,42-46
pid 481618's current affinity list: 0-4,14-18,28-32,42-46
pid 481619's current affinity list: 1
pid 481620's current affinity list: 2
pid 481621's current affinity list: 3
pid 481622's current affinity list: 4
pid 482040's current affinity list: 0-4,14-18,28-32,42-46

Only the forwarding threads are pinned to specific cores; service threads are currently not pinned.
Please notice TIDs 481605 and 481606. Those are the rte* threads. As pointed out before, they are using vrouter forwarding cores. Those are standard dpdk control threads and should not use datapath cores. Anyhow, right now, due to a dpdk library bug, we do not have full control over how control threads are pinned.
This Bugzilla entry tracks the issue (https://bugzilla.redhat.com/show_bug.cgi?id=1687316) and it will be fixed in future releases.
Potentially, this represents a pollution that might impact performance. However, “rumors” say this did not happen during tests.
And here comes the most important part: we pin the service cores.
This is done via taskset:

taskset -cp 15 481615
taskset -cp 15 481616
taskset -cp 15 481617
taskset -cp 15 481618
taskset -cp 15 481592
taskset -cp 15 482040

We also pin the main dpdk vrouter PID.
I ran those commands via a shell script:

[root@cpt7-dpdk-tovb-nbp ~]# sh setupsvconly.sh
pid 481615's current affinity list: 0-4,14-18,28-32,42-46
pid 481615's new affinity list: 15
pid 481616's current affinity list: 0-4,14-18,28-32,42-46
pid 481616's new affinity list: 15
pid 481617's current affinity list: 0-4,14-18,28-32,42-46
pid 481617's new affinity list: 15
pid 481618's current affinity list: 0-4,14-18,28-32,42-46
pid 481618's new affinity list: 15
pid 481592's current affinity list: 0-4,14-18,28-32,42-46
pid 481592's new affinity list: 15
pid 482040's current affinity list: 0-4,14-18,28-32,42-46
pid 482040's new affinity list: 15
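
For reference, here is a minimal sketch of what such a script could look like (the structure is mine; instead of hardcoding PIDs it rediscovers the main thread and the lcore-slave-1/2/8/9 service threads and pins them with taskset):

#!/bin/bash
# pin the dpdk vrouter main thread and its service threads to a single cpu
SVC_CPU=15
MAIN_PID=$(pgrep -f /usr/bin/contrail-vrouter-dpdk | head -1)
taskset -cp "$SVC_CPU" "$MAIN_PID"
for tid in $(ps -T -p "$MAIN_PID" -o spid=,comm= | awk '$2 ~ /^lcore-slave-(1|2|8|9)$/ {print $1}'); do
    taskset -cp "$SVC_CPU" "$tid"
done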

As a result, we expect those threads to be pinned to core 15:

[root@cpt7-dpdk-tovb-nbp ~]# pidstat -t -p 481592
Linux 3.10.0-1062.4.1.el7.x86_64 (cpt7-dpdk-tovb-nbp)   01/10/2020      _x86_64_        (56 CPU)

06:18:09 PM   UID      TGID       TID    %usr %system  %guest    %CPU   CPU  Command
06:18:09 PM     0    481592         -    0.23    0.15    0.00    0.37    15  contrail-vroute
06:18:09 PM     0         -    481592    0.00    0.00    0.00    0.00    15  |__contrail-vroute
06:18:09 PM     0         -    481605    0.00    0.00    0.00    0.00     3  |__rte_mp_handle
06:18:09 PM     0         -    481606    0.00    0.00    0.00    0.00     3  |__rte_mp_async
06:18:09 PM     0         -    481614    0.00    0.00    0.00    0.00    15  |__eal-intr-thread
06:18:09 PM     0         -    481615    0.00    0.00    0.00    0.00    15  |__lcore-slave-1
06:18:09 PM     0         -    481616    0.00    0.00    0.00    0.00    15  |__lcore-slave-2
06:18:09 PM     0         -    481617    0.00    0.00    0.00    0.00    15  |__lcore-slave-8
06:18:09 PM     0         -    481618    0.00    0.00    0.00    0.00    15  |__lcore-slave-9
06:18:09 PM     0         -    481619    0.06    0.04    0.00    0.09     1  |__lcore-slave-10
06:18:09 PM     0         -    481620    0.06    0.04    0.00    0.09     2  |__lcore-slave-11
06:18:09 PM     0         -    481621    0.06    0.04    0.00    0.09     3  |__lcore-slave-12
06:18:09 PM     0         -    481622    0.06    0.04    0.00    0.09     4  |__lcore-slave-13
[root@cpt7-dpdk-tovb-nbp ~]# taskset -cp -a 481592
pid 481592's current affinity list: 15
pid 481605's current affinity list: 3,4,14-18,28-32,42-46
pid 481606's current affinity list: 3,4,14-18,28-32,42-46
pid 481614's current affinity list: 3,4,14-18,28-32,42-46
pid 481615's current affinity list: 15
pid 481616's current affinity list: 15
pid 481617's current affinity list: 15
pid 481618's current affinity list: 15
pid 481619's current affinity list: 1
pid 481620's current affinity list: 2
pid 481621's current affinity list: 3
pid 481622's current affinity list: 4
pid 482040's current affinity list: 15

Optionally, we can pin control threads as well (rte* and eal):

[root@cpt7-dpdk-tovb-nbp ~]# sh setupall.sh
pid 481605's current affinity list: 3,4,14-18,28-32,42-46
pid 481605's new affinity list: 0,14,16-18,28,42,44-46
pid 481606's current affinity list: 3,4,14-18,28-32,42-46
pid 481606's new affinity list: 0,14,16-18,28,42,44-46
pid 481614's current affinity list: 3,4,14-18,28-32,42-46
pid 481614's new affinity list: 0,14,16-18,28,42,44-46
pid 481615's current affinity list: 15
pid 481615's new affinity list: 15
pid 481616's current affinity list: 15
pid 481616's new affinity list: 15
pid 481617's current affinity list: 15
pid 481617's new affinity list: 15
pid 481618's current affinity list: 15
pid 481618's new affinity list: 15
pid 481592's current affinity list: 15
pid 481592's new affinity list: 15
pid 482040's current affinity list: 15
pid 482040's new affinity list: 15

but, due to the dpdk bug, result is not as good as expected:

[root@cpt7-dpdk-tovb-nbp ~]# pidstat -t -p 481592
Linux 3.10.0-1062.4.1.el7.x86_64 (cpt7-dpdk-tovb-nbp)   01/10/2020      _x86_64_        (56 CPU)

06:20:04 PM   UID      TGID       TID    %usr %system  %guest    %CPU   CPU  Command
06:20:04 PM     0    481592         -    0.24    0.16    0.00    0.40    15  contrail-vroute
06:20:04 PM     0         -    481592    0.00    0.00    0.00    0.00    15  |__contrail-vroute
06:20:04 PM     0         -    481605    0.00    0.00    0.00    0.00     3  |__rte_mp_handle
06:20:04 PM     0         -    481606    0.00    0.00    0.00    0.00     3  |__rte_mp_async
06:20:04 PM     0         -    481614    0.00    0.00    0.00    0.00    16  |__eal-intr-thread
06:20:04 PM     0         -    481615    0.00    0.00    0.00    0.00    15  |__lcore-slave-1
06:20:04 PM     0         -    481616    0.00    0.00    0.00    0.00    15  |__lcore-slave-2
06:20:04 PM     0         -    481617    0.00    0.00    0.00    0.00    15  |__lcore-slave-8
06:20:04 PM     0         -    481618    0.00    0.00    0.00    0.00    15  |__lcore-slave-9
06:20:04 PM     0         -    481619    0.06    0.04    0.00    0.10     1  |__lcore-slave-10
06:20:04 PM     0         -    481620    0.06    0.04    0.00    0.10     2  |__lcore-slave-11
06:20:04 PM     0         -    481621    0.06    0.04    0.00    0.10     3  |__lcore-slave-12
06:20:04 PM     0         -    481622    0.06    0.04    0.00    0.10     4  |__lcore-slave-13
06:20:04 PM     0         -    482040    0.00    0.00    0.00    0.00    28  |__lcore-slave-9
[root@cpt7-dpdk-tovb-nbp ~]# taskset -cp -a 481592
pid 481592's current affinity list: 15
pid 481605's current affinity list: 0,14,16-18,28,42,44-46
pid 481606's current affinity list: 0,14,16-18,28,42,44-46
pid 481614's current affinity list: 0,14,16-18,28,42,44-46
pid 481615's current affinity list: 15
pid 481616's current affinity list: 15
pid 481617's current affinity list: 15
pid 481618's current affinity list: 15
pid 481619's current affinity list: 1
pid 481620's current affinity list: 2
pid 481621's current affinity list: 3
pid 481622's current affinity list: 4
pid 482040's current affinity list: 15

Some control threads (rte*) run on cpu 3 even if that cpu is not included in the affinity list of those processes.
Anyhow, control threads are not as important as service ones.
Pinning service cores should increase performance.
However, as we have just seen, at the moment, the process is pretty manual and time-consuming.
Moreover, as the taskset commands reference the current PID of a process, if the dpdk vrouter agent is restarted (ifdown/ifup vhost0), service core pinning will need to be re-done as there will be new PIDs in play.
Future contrail releases will overcome this by allowing to set a sort of second coremask used to pin service cores. This will make the whole pinning process automated and consistent across dpdk vrouter agent restarts 😊
Check if your performance actually grows!
Ciao
IoSonoUmberto

Some mixed tips and tricks to deal with a DPDK vRouter

By default, Contrail vRouter runs in kernel mode. That is totally fine as long as you do not care too much about performance: kernel mode means that I/O actions have to go through the kernel, and this limits the overall performance.
A “cloud friendly” solution, as opposed to SRIOV or PCI pass-through (PCIPT), is DPDK. DPDK, simply put, moves packet processing into user space: the vrouter runs in user space, leading to better performance as it no longer has to go through the kernel every time. How DPDK actually works is out of the scope of this document; there is plenty of great documentation out there on the Internet. Here, we want to focus on how DPDK fits into Contrail and how we can configure and monitor it.
From a very high level, this is a dpdk-enabled compute node 🙂
[figure: vrouter_schema]
Using DPDK requires us to know the “internals” of our server.
Modern servers have multiple CPUs (cores), spread across 2 sockets. We talk about NUMA nodes: node0 and node1.
Install “pciutils” (we will need lspci later) and start by having a look at the server NUMA topology:

[root@server-5d ~]# lscpu | grep NUMA
NUMA node(s):          2
NUMA node0 CPU(s):     0-13,28-41
NUMA node1 CPU(s):     14-27,42-55

We have 2 NUMA nodes.
Server has 28 physical cores in total.
Those cores are “hyperthreaded”, meaning that each physical core shows up as 2 logical cores. We say there are 28 physical cores but 56 logical cores (vcpus).
Each vcpu has a sibling. For example, 0/28 are siblings, 1/29 are siblings and so on.
NUMA topology can also be seen by using numactl (must be installed):

[root@server-5d ~]# numactl --hardware | grep cpus
node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 28 29 30 31 32 33 34 35 36 37 38 39 40 41
node 1 cpus: 14 15 16 17 18 19 20 21 22 23 24 25 26 27 42 43 44 45 46 47 48 49 50 51 52 53 54 55

NICs are connected to a NUMA node as well.
It is useful to learn this information.
We list server NICs:

[root@server-5d ~]# lspci -nn | grep thern
01:00.0 Ethernet controller [0200]: Intel Corporation I350 Gigabit Network Connection [8086:1521] (rev 01)
01:00.1 Ethernet controller [0200]: Intel Corporation I350 Gigabit Network Connection [8086:1521] (rev 01)
02:00.0 Ethernet controller [0200]: Intel Corporation 82599ES 10-Gigabit SFI/SFP+ Network Connection [8086:10fb] (rev 01)
02:00.1 Ethernet controller [0200]: Intel Corporation 82599ES 10-Gigabit SFI/SFP+ Network Connection [8086:10fb] (rev 01)
We note down the PCI addresses of the interfaces we will use with DPDK, in this case 02:00.0 and 02:00.1.

We check NUMA node information for those NICs:

[root@server-5d ~]# lspci -vmms 02:00.0 | grep NUMA
NUMANode:       0
[root@server-5d ~]# lspci -vmms 02:00.1 | grep NUMA
NUMANode:       0

Both NICs are connected to NUMA node0.
Alternatively, we can get the same info via lstopo:

[root@server-5d ~]# lstopo
Machine (256GB total)
  NUMANode L#0 (P#0 128GB)
    Package L#0 + L3 L#0 (35MB)
...
    HostBridge L#0
      PCIBridge
        PCI 8086:1521
          Net L#0 "eth0"
        PCI 8086:1521
          Net L#1 "eno2"
      PCIBridge
        2 x { PCI 8086:10fb }

DPDK requires the use of Huge Pages. Huge Pages, as the name suggests, are big memory pages. The bigger page size allows more efficient memory access, as it reduces the number of address translations the system has to perform.
There are two kinds of huge pages: 2M pages and 1G pages.
Contrail works with both (if Ansible deployer is used, please use 2M huge pages)
The number of Huge Pages we have to configure may vary depending on our needs. We might choose it based on the expected number of VMs we are going to run on those compute nodes. Realistically, we might dedicate some servers to DPDK; in that case, the whole server memory can be “converted” into huge pages.
Actually, not the whole memory, as you need to leave some for the host OS.
Moreover, if 1G huge pages are used, remember to leave some 2M hugepages for the vrouter; 128 should be enough.
This is possible as we can have a mix of 1G and 2M hugepages.
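How the huge pages get created depends on the deployment tooling (more on the Ansible deployer below); a common way, outside of any installer, is via kernel boot parameters, for example in /etc/default/grub followed by a grub2-mkconfig and a reboot (values here are purely illustrative):

# illustrative: 60000 x 2M pages (120 GB in total), existing cmdline options omitted
GRUB_CMDLINE_LINUX="default_hugepagesz=2M hugepagesz=2M hugepages=60000 ..."
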
Once Huge Pages are created, we can verify their creation:

[root@server-5d ~]# cat /sys/devices/system/node/node0/hugepages/hugepages-1048576kB/nr_hugepages
0
[root@server-5d ~]# cat /sys/devices/system/node/node0/hugepages/hugepages-2048kB/nr_hugepages
30000
[root@server-5d ~]# cat /sys/devices/system/node/node0/hugepages/hugepages-2048kB/free_hugepages
29439

In this example we only have 2M hugepages, as the number of 1G hugepages is 0.
There are 30000 2M hugepages (60GB) on node0 and 29439 are still available.
Same verification can be done for node1.
Remember, if we configure N hugepages, N/2 will be created on NUMA node0 and N/2 on NUMA node1.
Each installer has its own specific way to configure huge pages. With Contrail ansible deployer, that is specified when defining the dpdk compute node within the instances.yaml file:

  compute_dpdk_1:
    provider: bms
    ip: 172.30.200.46
    roles:
      vrouter:
        AGENT_MODE: dpdk
        HUGE_PAGES: 120
      openstack_compute:

Here, I highlighted only the huge pages related settings.
Inside a server, things happen fast, but even a slight delay can mean performance degradation. We said modern servers have cores spread across different NUMA nodes; memory is on both nodes as well.
NUMA nodes are connected through a high-speed interconnect called QPI. Even if QPI is fast, going through it can lead to worse performance. Why?
This happens when, for example, a process running on node0 has to access memory located on node1. In this case, the process will query the local RAM controller on node0 and from there be redirected, through the QPI path, towards the node1 RAM controller. This longer path means a higher number of interrupts and cpu cycles, which increases the total delay of a single operation. This higher delay, in turn, leads to a “slower” VM, hence worse performance.
In order to avoid this, we need to be sure the QPI path is crossed as little as possible or, better, never.
This means having vrouter, memory and NICs all connected to the same NUMA Node.
We said huge pages are created on both nodes. We verified our NICs are connected to NUMA 0. As a consequence, we will have to pin vrouter to cores belonging to NUMA 0.
Going further, even VMs should be placed on the same NUMA node. This poses an issue: suppose we have N vcpus on our server. Pinning the vrouter to NUMA 0 means excluding the N/2 vcpus belonging to NUMA 1. On NUMA 0 we allocate X vcpus to the vrouter and Y to the host OS, which leaves N/2-X-Y vcpus available to VMs. Even if ideal, this is not a practical choice, as it means you would need two servers to get the same number of usable vcpus. For this reason, we usually end up creating VMs on both NUMA nodes, knowing that we might pay something in terms of performance.
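If, instead, you do want to keep certain VMs on a single NUMA node, Nova can enforce that per flavor. A hedged example (the flavor name is mine) using standard flavor extra specs:

openstack flavor set m1.dpdk --property hw:numa_nodes=1 --property hw:cpu_policy=dedicated
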
Assigning DPDK vrouter to specific server cores requires the definition of a so-called coremask.
Let’s recall NUMA topology:

NUMA node0 CPU(s):     0-13,28-41
NUMA node1 CPU(s):     14-27,42-55

We have to pin vrouter cores to NUMA 0 as explained before.
We have to choose between these cpus:

0  1  2  3  4  5  6  7  8  9  10 11 12 13
28 29 30 31 32 33 34 35 36 37 38 39 40 41

We decide to assign 4 physical cores (8 vcpus) to vrouter.
Core 0 should be avoided and left to host OS.
Sibling pairs should be picked.
As a result, we pin our vrouter to 4, 5, 6, 7, 32, 33, 34, 35.
Next, we write the cpu numbers from the highest assigned vcpu down to 0 and put a 1 under each assigned core:

35 34 33 32 31 30 29 28 27 26 25 24 23 22 21 20 19 18 17 16 15 14 13 12 11 10 9 8 7 6 5 4 3 2 1 0
1  1  1  1  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0 0 1 1 1 1 0 0 0 0

Finally, we read that long binary string and convert it into hexadecimal.
We obtain 0xf000000f0.
This is our coremask!
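Instead of building the binary string by hand, a small helper (mine, purely illustrative) computes the same mask from the chosen cpu list:

cores="4 5 6 7 32 33 34 35"
mask=0
for c in $cores; do mask=$(( mask | (1 << c) )); done
printf "0x%x\n" "$mask"   # -> 0xf000000f0
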
With the Ansible deployer, the core mask is configured as follows:

  compute_dpdk_1:
    provider: bms
    ip: 172.30.200.46
    roles:
      vrouter:
        AGENT_MODE: dpdk
        CPU_CORE_MASK: "0xf000000f0"
      openstack_compute:

Be sure the core mask is enclosed in quotes, otherwise the string will be interpreted as a hex number, converted to decimal, and this will lead to wrong pinning!
After the vrouter has been provisioned, we can check everything was done properly.
The DPDK process is running with PID 3732 (output split across multiple lines for readability):

root      3732  2670 99 15:35 ?        17:40:59 
/usr/bin/contrail-vrouter-dpdk --no-daemon --socket-mem 1024 1024 --vlan_tci 200 --vlan_fwd_intf_name bond1 
--vdev eth_bond_bond1,mode=4,xmit_policy=l23,socket_id=0,mac=0c:c4:7a:59:56:40,lacp_rate=1,slave=0000:02:00.0,slave=0000:02:00.1

We get the PID tree:

[root@server-5d ~]#  pstree -p $(ps -ef | awk '$8=="/usr/bin/contrail-vrouter-dpdk" {print $2}')
contrail-vroute(3732)─┬─{contrail-vroute}(3894)
                      ├─{contrail-vroute}(3895)
                      ├─{contrail-vroute}(3896)
                      ├─{contrail-vroute}(3897)
                      ├─{contrail-vroute}(3898)
                      ├─{contrail-vroute}(3899)
                      ├─{contrail-vroute}(3900)
                      ├─{contrail-vroute}(3901)
                      ├─{contrail-vroute}(3902)
                      ├─{contrail-vroute}(3903)
                      ├─{contrail-vroute}(3904)
                      ├─{contrail-vroute}(3905)
                      └─{contrail-vroute}(3906)

And the assigned cores for each PID:

[root@server-5d ~]# ps -mo pid,tid,comm,psr,pcpu -p $(ps -ef | awk '$8=="/usr/bin/contrail-vrouter-dpdk" {print $2}')
  PID   TID COMMAND         PSR %CPU
 3732     - contrail-vroute   -  802
    -  3732 -                13  4.4
    -  3894 -                 1  0.0
    -  3895 -                11  1.8
    -  3896 -                 9  0.0
    -  3897 -                18  0.0
    -  3898 -                22  0.2
    -  3899 -                 4 99.9
    -  3900 -                 5 99.9
    -  3901 -                 6 99.9
    -  3902 -                 7 99.9
    -  3903 -                32 99.9
    -  3904 -                33 99.9
    -  3905 -                34 99.9
    -  3906 -                35 99.9

As you can see, the last 8 threads are assigned to the cores we specified within the coremask. This confirms that the vrouter was provisioned correctly.
Those 8 vcpus are running at 99.9% CPU. This is normal as DPDK forwarding cores constantly poll the NIC queues to see if there are packets to tx/rx. This leads to the CPU always being around 100%.
But what do those cores represent?
Contrail vRouter cores have a specific meaning defined in the following C enum data structure:

enum {
    VR_DPDK_KNITAP_LCORE_ID = 0,
    VR_DPDK_TIMER_LCORE_ID,        /* 1 */
    VR_DPDK_UVHOST_LCORE_ID,       /* 2 */
    VR_DPDK_IO_LCORE_ID,           /* 3 */
    VR_DPDK_IO_LCORE_ID2,
    VR_DPDK_IO_LCORE_ID3,
    VR_DPDK_IO_LCORE_ID4,
    VR_DPDK_LAST_IO_LCORE_ID,      /* 7 */
    VR_DPDK_PACKET_LCORE_ID,       /* 8 */
    VR_DPDK_NETLINK_LCORE_ID,      /* 9 */
    VR_DPDK_FWD_LCORE_ID,          /* 10 */
};

We can find those names by running the “ps” command with some additional arguments:

[root@server-5b ~]# ps -T -p 54490
  PID  SPID TTY          TIME CMD
54490 54490 ?        02:46:12 contrail-vroute
54490 54611 ?        00:02:33 eal-intr-thread
54490 54612 ?        01:35:26 lcore-slave-1
54490 54613 ?        00:00:00 lcore-slave-2
54490 54614 ?        00:00:17 lcore-slave-8
54490 54615 ?        00:02:14 lcore-slave-9
54490 54616 ?        2-21:44:06 lcore-slave-10
54490 54617 ?        2-21:44:06 lcore-slave-11
54490 54618 ?        2-21:44:06 lcore-slave-12
54490 54619 ?        2-21:44:06 lcore-slave-13
54490 54620 ?        2-21:44:06 lcore-slave-14
54490 54621 ?        2-21:44:06 lcore-slave-15
54490 54622 ?        2-21:44:06 lcore-slave-16
54490 54623 ?        2-21:44:06 lcore-slave-17
54490 54990 ?        00:00:00 lcore-slave-9

– contrail-vroute is the main thread
– lcore-slave-1 is the timer thread
– lcore-slave-2 is the uvhost (for qemu) thread
– lcore-slave-8 is the pkt0 thread
– lcore-slave-9 is the netlink thread (for nh/rt programming)
– lcore-slave-10 onwards are the forwarding threads, the ones running at 100% as they are constantly polling the interfaces
Since Contrail 5.0, Contrail is containerized, meaning services are hosted inside containers.
vRouter configuration parameters can be seen inside the vrouter agent container:

(vrouter-agent)[root@server-5d /]$ cat /etc/contrail/contrail-vrouter-agent.conf
[DEFAULT]
platform=dpdk
physical_interface_mac=0c:c4:7a:59:56:40
physical_interface_address=0000:00:00.0
physical_uio_driver=uio_pci_generic

There we find configuration values like vrouter mode, dpdk in this case, or the uio driver.
Moreover, we have the MAC address of the vhost0 interface.
In this setup, vhost0 sat on a bond interface, meaning that MAC is the bond MAC address.
Last, an all-zero PCI address (0000:00:00.0) is given as the physical interface address.
Let’s get back for a moment to the vrouter dpdk process we can spot using “ps”:

root     44541 44279 99 12:03 ?        18:52:02 /usr/bin/contrail-vrouter-dpdk --no-daemon --socket-mem 1024 1024 --vlan_tci 200 --vlan_fwd_intf_name bond1 --vdev eth_bond_bond1,mode=4,xmit_policy=l23,socket_id=0,mac=0c:c4:7a:59:56:40,lacp_rate=1,slave=0000:02:00.0,slave=0000:02:00.1

– Process ID is 44541
– The socket-mem option tells us that dpdk is using 1GB (1024 MB) of memory on each NUMA node. This happens because, even if, as said, optimal placement only involves NUMA node0, in the real world VMs will be spawned on both nodes
– Vlan 200 is extracted from the physical interface on which vhost0 sits
In this case vhost0 uses interface vlan0200 (tagged with vlan-id 200) as its physical interface which, in turn, sits on the bond
As a consequence, inspecting this process we see some parameters taken from the vlan interface and others from the bond
– Forward interface is bond1
– Bond is configured with mode 4 (recommended) and hash policy “l23” (policy “l34” is recommended)
– Bond1 is connected to socket 0
– Bond1 MAC address is specified
– LACP is used over the bond
– Bond slave interface PCI addresses are listed
Last, let’s sum up how we can configure a dpdk compute with Ansible deployer:

compute_dpdk_1:
    provider: bms
    ip: 172.30.200.46
    roles:
      vrouter:
        PHYSICAL_INTERFACE: vlan0200
        AGENT_MODE: dpdk
        CPU_CORE_MASK: "0xf000000f0"
        DPDK_UIO_DRIVER: uio_pci_generic
        HUGE_PAGES: 60000
      openstack_compute:

We add several parameters:
• uio_pci_generic is a generic DPDK driver which works both on Ubuntu and RHEL (Centos)
• agent mode is dpdk, default is kernel
• 60000 huge pages will be created. Right now (May 2019) the Ansible deployer only works with 2MB hugepages. This means we are allocating 60000*2M = 120 GB
• The coremask tells how many cores must be used for the vrouter and which cores have to be pinned
This should cover the basics about Contrail and DPDK; we should now be able to deploy a dpdk vrouter and verify everything was built properly.
Ciao
IoSonoUmberto