http://vxpertise.net/2012/06/summarizing-numa-scheduling/

Sitting on my sofa this morning watching Scrubs, I was thinking about the NUMA-related considerations in vSphere – yes, I am a nerd. I read about this for the first time back in the days of vSphere 4.0, but it probably existed for much longer. Then it came to my mind that since vSphere 5.0, VMware supports configuring the number of sockets and cores per socket for a Virtual Machine, as well as the 5.0 feature called vNUMA. I googled the topic for a while and found a bit of information here and there, so I figured it was time to write a single article that covers the topic completely.

What is NUMA?

Let’s start with a quick review of NUMA. This is taken from Wikipedia:

Non-Uniform Memory Access (NUMA) is a computer memory design used in multiprocessing, where the memory access time depends on the memory location relative to a processor. Under NUMA, a processor can access its own local memory faster than non-local memory, that is, memory local to another processor or memory shared between processors.

This means that in a physical server with two or more sockets on an Intel Nehalem or AMD Opteron platform, we very often find memory that is local to one socket and memory that is local to another. A socket, its local memory and the bus connecting the two is called a NUMA node. Each socket is also connected to the other sockets’ memory, allowing remote access.

Figure 1: A NUMA system.

Please be aware that an additional socket in a system does NOT necessarily mean an additional NUMA node! Two or more sockets can be connected to memory with no distinction between local and remote. In this case, and in the case where we have only a single socket, we have a UMA (uniform memory access) architecture.

UMA system: one or more sockets connected to the same RAM.
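By the way, you can quickly check from within a Linux system whether the OS actually sees a NUMA or a UMA layout. A minimal sketch using the numactl tool (which appears again later in this article) and lscpu; the output labels vary slightly between distributions:

# A single available node means the OS sees a UMA system
numactl --hardware | grep available

# lscpu also reports the NUMA node count and which CPUs belong to which node
lscpu | grep -i numa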

Scheduling – The Complete Picture

Whenever we virtualize complete operating systems, scheduling takes place on two levels: a VM is provided with vCPUs (virtual CPUs) for execution, and the hypervisor has to schedule those vCPUs across pCPUs (physical CPUs). On top of that, the guest scheduler distributes execution time on the vCPUs to processes and threads.

Figure 2: Two levels of scheduling

So we have to look at scheduling on two different levels to understand what is going on. But before we go into more detail, we have to look at a problem that can arise in NUMA systems.

The Locality Problem

Each NUMA node has its own computing power (the cores on the socket) and a dedicated amount of memory assigned to that node. You can very often even see that by taking a look at your mainboard: you will see two sockets and two separate groups of memory slots.

Figure 3: A dual-socket mainboard.

Those two sockets are connected to their local memory through a memory bus, but they can also access the other socket’s memory via an interconnect. AMD calls that interconnect HyperTransport; Intel’s equivalent is QPI (QuickPath Interconnect). Both names suggest very high throughput and low latency. That’s true, but compared to the local memory bus connection they are still far behind.

What does this mean for us? A process or virtual machine that was started on either of the two nodes should not be moved to a different node by the scheduler. If that happens – and it can happen if the scheduler is NUMA-unaware – the process or VM has to access its memory through the NUMA node interconnect, resulting in higher memory latency. For memory-intensive workloads this can seriously hurt application performance! This is what the term “NUMA locality” refers to.
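You can even make this penalty visible yourself on a Linux box by pinning a memory-hungry workload to one node’s cores while forcing its allocations onto the other node’s memory. A hedged sketch using numactl; ./membench stands for any memory-intensive benchmark of your choice and is purely hypothetical:

# Local case: CPUs and memory on the same NUMA node
numactl --cpunodebind=0 --membind=0 ./membench

# Remote case: CPUs on node 0, memory forced onto node 1 via the interconnect
numactl --cpunodebind=0 --membind=1 ./membench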

Small VMs on ESXi

ESX and ESXi servers have been NUMA-aware for a while now – to be exact, since version 3.5.

NUMA-awareness means the scheduler knows the NUMA topology: the number of NUMA nodes, the number of sockets per node, the number of cores per socket and the amount of memory local to a single NUMA node. The scheduler will try to avoid NUMA locality issues. To do that, ESXi makes an initial placement decision and assigns a starting VM to a NUMA node. From then on, the VM’s vCPUs are load-balanced dynamically across the cores of that same socket.

Figure 4: NUMA-aware scheduling.

In this example, VMs A and B were assigned to NUMA node 1 and have to share the cores on that socket. VM C is scheduled on a different node, so VMs A and B will not have to share cores with VM C. Under very high load on either socket, ESXi can decide to migrate a VM from one NUMA node to another. But that is not done recklessly, because the price is high: to avoid NUMA locality problems after the migration, ESXi migrates the VM’s memory image, too. That puts significant load on the memory bus and the interconnect and could affect the overall performance of the host. But if the perceived benefits outweigh the costs, it will happen.

In the figure above, the VMs are “small”, meaning they have fewer vCPUs than the number of cores per NUMA node and less memory than what is local to a single NUMA node.
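If you want to watch these placement decisions on the host, esxtop exposes per-VM NUMA statistics in its memory view. A hedged sketch from the ESXi shell (the exact field toggles may differ between builds); NHN shows the current home node, NMIG the number of NUMA migrations and N%L the share of the VM’s memory that is node-local:

esxtop
# press 'm' to switch to the memory view
# press 'f' and enable the NUMA statistics field
# watch the NHN, NMIG and N%L columns per VM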

Large VMs on ESXi prior to vSphere 4.1

Things start to become interesting for VMs with more vCPUs than there are cores on a single socket. The hypervisor scheduler has to let such a VM span multiple NUMA nodes. A VM like this is not handled by the NUMA scheduler anymore – no home node is assigned. As a result, the VM’s vCPUs are not restricted to one or two NUMA nodes but can be scheduled anywhere on the system, and memory is allocated from all NUMA nodes in a round-robin fashion. Memory access latencies can increase dramatically.

Figure 5: A large VM spanning two NUMA nodes.

To avoid this, it is the administrator’s job to make sure every VM fits into a single NUMA node. This includes both the number of vCPUs and the amount of memory allocated to the VM.
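To do that sizing, you first need to know how big a NUMA node on your host actually is. A hedged sketch for the ESXi shell, assuming the esxcli namespaces below exist on your build; dividing the totals by the NUMA node count gives a rough per-node budget for vCPUs and memory:

# Physical CPU packages, cores and threads of the host
esxcli hardware cpu global get

# Total physical memory and the NUMA node count
esxcli hardware memory get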

Wide-VMs since vSphere 4.1

Introduced in vSphere 4.1, the concept of a “Wide-VM” addresses the issue of memory locality for virtual machines larger than a single NUMA node. The VM is split into two or more NUMA clients which are then treated as if they were separate VMs by the NUMA scheduler. That means each NUMA client is assigned its own home node and limited to the pCPUs of that node. Memory is allocated from the NUMA nodes the VM’s NUMA clients are assigned to. This improves locality and enhances performance for Wide-VMs. A technical white paper provided by VMware goes into more detail on how big the performance impact really is.

As a result, the chances of remote accesses and high latencies are reduced. But this is not the final solution, because the guest operating system is still unaware of what is happening underneath.
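As a rough mental model (this is my own back-of-the-envelope assumption; the exact split policy is up to the scheduler), the number of NUMA clients a Wide-VM ends up with follows from its vCPU count and the core count of a physical NUMA node:

# Hypothetical calculation, not an ESXi command
VCPUS=8            # vCPUs configured for the VM
CORES_PER_NODE=4   # physical cores per NUMA node on the host
CLIENTS=$(( (VCPUS + CORES_PER_NODE - 1) / CORES_PER_NODE ))
echo "This Wide-VM would be split into roughly $CLIENTS NUMA clients"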

Scheduling in the Guest OS

Before vSphere 5.0, the NUMA topology was unknown to the guest OS. The scheduler inside the guest OS was not aware of the number of NUMA nodes, their associated local memory or the number of cores per socket. From the OS’s perspective, every vCPU appeared as its own single-core socket, and all memory could be accessed from all sockets at the same speed. Due to this unawareness, a scheduling decision made by the OS could suddenly leave a well-performing process suffering from bad memory locality after it was moved from one vCPU to another.

In figure 5, the VM spans two NUMA nodes with 4 vCPUs on one node and 2 vCPUs on the other. The OS sees 6 single-core sockets and treats them all as scheduling targets of equal quality for any running process. But in reality, moving a process from the leftmost vCPU to the rightmost vCPU migrates it from one physical NUMA node to another.

vNUMA since vSphere 5.0

vNUMA exposes the NUMA topology to the guest OS, allowing for better scheduling decisions inside the operating system. ESXi creates virtual sockets visible to the OS, each with an equal number of vCPUs visible as cores. Memory is split evenly across the sockets, creating multiple NUMA nodes from the OS’s perspective. Using hardware version 8 for your VMs, you can use the vSphere Client to configure vNUMA per VM:

This results in two lines in the VM’s .vmx configuration file:

numvcpus = "8"
cpuid.coresPerSocket = "4"

Well, this is not the end of the story. Here is what I read in the Resource Management Guide:

If the number of cores per socket (cpuid.coresPerSocket) is greater than one, and the number of virtual cores in the virtual machine is greater than 8, the virtual NUMA node size matches the virtual socket size.

The best way to understand this is to take a look inside a Linux OS and investigate the CPU from there. I configured a Debian Squeeze 64-bit VM with 2 virtual sockets and 2 cores per socket using the vSphere Client, and used the /proc/cpuinfo file and a tool called numactl to gather the following info:

root@vnumademo:~# numactl --hardware
available: 1 nodes (0-0)
node 0 cpus: 0 1 2 3
node 0 size: 1023 MB
node 0 free: 898 MB
node distances:
node   0
0:  10
root@vnumademo:~# cat /proc/cpuinfo | grep "physical id"
physical id     : 0
physical id     : 0
physical id     : 1
physical id     : 1
root@vnumademo:~#

The numactl tool shows only a single NUMA node – even though I configured 2 virtual sockets in the vSphere Client, remember? Well, a socket does not necessarily mean a NUMA node (see above). From the OS’s perspective, this is a UMA system with 2 sockets.

Next, I configured the VM with 2 virtual sockets and 6 cores per socket. This time we exceed 8 vCPUs, so Linux should see a NUMA system now. And it does:

root@vnumademo:~# numactl --hardware
available: 2 nodes (0-1)
node 0 cpus: 0 1 2 3 4 5
node 0 size: 511 MB
node 0 free: 439 MB
node 1 cpus: 6 7 8 9 10 11
node 1 size: 511 MB
node 1 free: 462 MB
node distances:
node   0   1
0:  10  20
1:  20  10
root@vnumademo:~# cat /proc/cpuinfo | grep "physical id"
physical id     : 0
physical id     : 0
physical id     : 0
physical id     : 0
physical id     : 0
physical id     : 0
physical id     : 1
physical id     : 1
physical id     : 1
physical id     : 1
physical id     : 1
physical id     : 1
root@vnumademo:~#

As explained above, vNUMA kicks in at 9 vCPUs. To lower that threshold, configure the numa.vcpu.maxPerVirtualNode advanced setting for that VM. This setting defaults to 4 (the value is per virtual node).
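Sticking with the .vmx style shown above, a hedged example of what such an override could look like if you wanted virtual NUMA nodes of two vCPUs each (adjust the value to your host’s topology):

numa.vcpu.maxPerVirtualNode = "2"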

Bottom Lines for Administrators

vSphere 4.0 and before:

  • Configure a VM with fewer vCPUs than the number of physical cores per socket.
  • Configure a VM with less memory than what is local to a single physical NUMA node.

vSphere 4.1:

  • Configuring a VM with more vCPUs than the number of physical cores per socket is a bit less of a problem, but there is still a chance of remote accesses.

vSphere 5.0:

  • Configuring 8 or fewer vCPUs for a VM does not change much compared to vSphere 4.1.
  • Assigning more than 8 vCPUs to a VM spread across multiple virtual sockets creates virtual NUMA nodes inside the guest, allowing for better scheduling decisions in the guest.

For every version of vSphere, please note that this whole issue of memory latency might not even apply to your VM! For VMs with low memory workloads the question may be irrelevant, as the performance loss is minimal.

Sources and Links

  • http://frankdenneman.nl/2010/12/node-interleaving-enable-or-disable/
  • http://www.vmware.com/resources/techresources/10131
  • http://frankdenneman.nl/2010/09/esx-4-1-numa-scheduling/
  • http://frankdenneman.nl/2010/02/sizing-vms-and-numa-nodes/
  • http://cto.vmware.com/vnuma-what-it-is-and-why-it-matters/
  • http://labs.vmware.com/publications/performance-evaluation-of-hpc-benchmarks-on-vmwares-esxi-server
