<!-- BEGIN MUNGE: UNVERSIONED_WARNING -->

<!-- BEGIN STRIP_FOR_RELEASE -->

<img src="http://kubernetes.io/img/warning.png" alt="WARNING"
     width="25" height="25">
<img src="http://kubernetes.io/img/warning.png" alt="WARNING"
     width="25" height="25">
<img src="http://kubernetes.io/img/warning.png" alt="WARNING"
     width="25" height="25">
<img src="http://kubernetes.io/img/warning.png" alt="WARNING"
     width="25" height="25">
<img src="http://kubernetes.io/img/warning.png" alt="WARNING"
     width="25" height="25">

<h2>PLEASE NOTE: This document applies to the HEAD of the source tree</h2>

If you are using a released version of Kubernetes, you should
refer to the docs that go with that version.

<!-- TAG RELEASE_LINK, added by the munger automatically -->
<strong>
The latest release of this document can be found
[here](http://releases.k8s.io/release-1.2/docs/design/nodeaffinity.md).

Documentation for other releases can be found at
[releases.k8s.io](http://releases.k8s.io).
</strong>
--

<!-- END STRIP_FOR_RELEASE -->

<!-- END MUNGE: UNVERSIONED_WARNING -->
					
						
# Node affinity and NodeSelector

## Introduction

This document proposes a new label selector representation, called `NodeSelector`,
that is similar in many ways to `LabelSelector`, but is a bit more flexible and is
intended to be used only for selecting nodes.
					
						
In addition, we propose to replace the `map[string]string` in `PodSpec` that the scheduler
currently uses as part of restricting the set of nodes onto which a pod is
eligible to schedule, with a field of type `Affinity` that contains one or
more affinity specifications. In this document we discuss `NodeAffinity`, which
contains one or more of the following:

* a field called `RequiredDuringSchedulingRequiredDuringExecution` that will be
represented by a `NodeSelector`, and thus generalizes the scheduling behavior of
the current `map[string]string` but still serves the purpose of restricting
the set of nodes onto which the pod can schedule. In addition, unlike the behavior
of the current `map[string]string`, when it becomes violated the system will
try to eventually evict the pod from its node.
* a field called `RequiredDuringSchedulingIgnoredDuringExecution`, which is identical
to `RequiredDuringSchedulingRequiredDuringExecution` except that the system
may or may not try to eventually evict the pod from its node.
* a field called `PreferredDuringSchedulingIgnoredDuringExecution` that specifies which nodes are
preferred for scheduling among those that meet all scheduling requirements.
					
						
(In practice, as discussed later, we will actually *add* the `Affinity` field
rather than replacing `map[string]string`, due to backward compatibility requirements.)

The affinity specifications described above allow a pod to request various properties
that are inherent to nodes, for example "run this pod on a node with an Intel CPU" or, in a
multi-zone cluster, "run this pod on a node in zone Z."
([This issue](https://github.com/kubernetes/kubernetes/issues/9044) describes
some of the properties that a node might publish as labels, which affinity expressions
can match against.)
They do *not* allow a pod to request to schedule
(or not schedule) on a node based on what other pods are running on the node. That
feature is called "inter-pod topological affinity/anti-affinity" and is described
[here](https://github.com/kubernetes/kubernetes/pull/18265).
					
						
## API

### NodeSelector

```go
// A node selector represents the union of the results of one or more label queries
// over a set of nodes; that is, it represents the OR of the selectors represented
// by the nodeSelectorTerms.
type NodeSelector struct {
	// nodeSelectorTerms is a list of node selector terms. The terms are ORed.
	NodeSelectorTerms []NodeSelectorTerm `json:"nodeSelectorTerms,omitempty"`
}

// An empty node selector term matches all objects. A null node selector term
// matches no objects.
type NodeSelectorTerm struct {
	// matchExpressions is a list of node selector requirements. The requirements are ANDed.
	MatchExpressions []NodeSelectorRequirement `json:"matchExpressions,omitempty"`
}

// A node selector requirement is a selector that contains values, a key, and an operator
// that relates the key and values.
type NodeSelectorRequirement struct {
	// key is the label key that the selector applies to.
	Key string `json:"key" patchStrategy:"merge" patchMergeKey:"key"`
	// operator represents a key's relationship to a set of values.
	// Valid operators are In, NotIn, Exists, DoesNotExist, Gt, and Lt.
	Operator NodeSelectorOperator `json:"operator"`
	// values is an array of string values. If the operator is In or NotIn,
	// the values array must be non-empty. If the operator is Exists or DoesNotExist,
	// the values array must be empty. If the operator is Gt or Lt, the values
	// array must have a single element, which will be interpreted as an integer.
	// This array is replaced during a strategic merge patch.
	Values []string `json:"values,omitempty"`
}

// A node selector operator is the set of operators that can be used in
// a node selector requirement.
type NodeSelectorOperator string

const (
	NodeSelectorOpIn           NodeSelectorOperator = "In"
	NodeSelectorOpNotIn        NodeSelectorOperator = "NotIn"
	NodeSelectorOpExists       NodeSelectorOperator = "Exists"
	NodeSelectorOpDoesNotExist NodeSelectorOperator = "DoesNotExist"
	NodeSelectorOpGt           NodeSelectorOperator = "Gt"
	NodeSelectorOpLt           NodeSelectorOperator = "Lt"
)
```
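
The matching semantics implied by the comments above (terms are ORed, requirements within a term are ANDed) could be sketched roughly as follows. This is only an illustration, not the scheduler's actual code: the function names, the `example.com/...` label keys, the treatment of a missing key under `NotIn`, and the handling of non-integer values under `Gt`/`Lt` are all assumptions, and the types are repeated without JSON tags so that the sketch compiles on its own.

```go
package main

import (
	"fmt"
	"strconv"
)

type NodeSelectorOperator string

const (
	NodeSelectorOpIn           NodeSelectorOperator = "In"
	NodeSelectorOpNotIn        NodeSelectorOperator = "NotIn"
	NodeSelectorOpExists       NodeSelectorOperator = "Exists"
	NodeSelectorOpDoesNotExist NodeSelectorOperator = "DoesNotExist"
	NodeSelectorOpGt           NodeSelectorOperator = "Gt"
	NodeSelectorOpLt           NodeSelectorOperator = "Lt"
)

type NodeSelectorRequirement struct {
	Key      string
	Operator NodeSelectorOperator
	Values   []string
}

type NodeSelectorTerm struct {
	MatchExpressions []NodeSelectorRequirement
}

type NodeSelector struct {
	NodeSelectorTerms []NodeSelectorTerm
}

// contains reports whether s is one of values.
func contains(values []string, s string) bool {
	for _, v := range values {
		if v == s {
			return true
		}
	}
	return false
}

// matchesRequirement evaluates a single requirement against a node's labels.
func matchesRequirement(req NodeSelectorRequirement, labels map[string]string) bool {
	value, present := labels[req.Key]
	switch req.Operator {
	case NodeSelectorOpIn:
		return present && contains(req.Values, value)
	case NodeSelectorOpNotIn:
		// Assumption: a node that lacks the key satisfies NotIn.
		return !present || !contains(req.Values, value)
	case NodeSelectorOpExists:
		return present
	case NodeSelectorOpDoesNotExist:
		return !present
	case NodeSelectorOpGt, NodeSelectorOpLt:
		if !present || len(req.Values) != 1 {
			return false
		}
		have, errHave := strconv.Atoi(value)
		want, errWant := strconv.Atoi(req.Values[0])
		if errHave != nil || errWant != nil {
			return false // Assumption: non-integer values never match.
		}
		if req.Operator == NodeSelectorOpGt {
			return have > want
		}
		return have < want
	}
	return false // Unknown operators match nothing in this sketch.
}

// matchesTerm ANDs the requirements; a term with no requirements matches every node.
func matchesTerm(term NodeSelectorTerm, labels map[string]string) bool {
	for _, req := range term.MatchExpressions {
		if !matchesRequirement(req, labels) {
			return false
		}
	}
	return true
}

// matchesSelector ORs the terms; a selector with no terms matches nothing here (assumption).
func matchesSelector(sel NodeSelector, labels map[string]string) bool {
	for _, term := range sel.NodeSelectorTerms {
		if matchesTerm(term, labels) {
			return true
		}
	}
	return false
}

func main() {
	// Hypothetical label keys, used only for illustration.
	sel := NodeSelector{NodeSelectorTerms: []NodeSelectorTerm{
		{MatchExpressions: []NodeSelectorRequirement{
			{Key: "example.com/cpu-vendor", Operator: NodeSelectorOpIn, Values: []string{"intel", "amd"}},
			{Key: "example.com/cores", Operator: NodeSelectorOpGt, Values: []string{"4"}},
		}},
		{MatchExpressions: []NodeSelectorRequirement{
			{Key: "example.com/dedicated", Operator: NodeSelectorOpExists},
		}},
	}}
	fmt.Println(matchesSelector(sel, map[string]string{"example.com/cpu-vendor": "amd", "example.com/cores": "8"})) // true
	fmt.Println(matchesSelector(sel, map[string]string{"example.com/cpu-vendor": "arm"}))                           // false
}
```

The first node satisfies both requirements of the first term, so the OR over terms succeeds. The second node satisfies neither term.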
					
						

### NodeAffinity

We will add one field to `PodSpec`:

```go
Affinity *Affinity  `json:"affinity,omitempty"`
```

The `Affinity` type is defined as follows:

```go
type Affinity struct {
	NodeAffinity *NodeAffinity `json:"nodeAffinity,omitempty"`
}

type NodeAffinity struct {
	// If the affinity requirements specified by this field are not met at
	// scheduling time, the pod will not be scheduled onto the node.
	// If the affinity requirements specified by this field cease to be met
	// at some point during pod execution (e.g. due to a node label update),
	// the system will try to eventually evict the pod from its node.
	RequiredDuringSchedulingRequiredDuringExecution *NodeSelector  `json:"requiredDuringSchedulingRequiredDuringExecution,omitempty"`
	// If the affinity requirements specified by this field are not met at
	// scheduling time, the pod will not be scheduled onto the node.
	// If the affinity requirements specified by this field cease to be met
	// at some point during pod execution (e.g. due to a node label update),
	// the system may or may not try to eventually evict the pod from its node.
	RequiredDuringSchedulingIgnoredDuringExecution  *NodeSelector  `json:"requiredDuringSchedulingIgnoredDuringExecution,omitempty"`
	// The scheduler will prefer to schedule pods to nodes that satisfy
	// the affinity expressions specified by this field, but it may choose
	// a node that violates one or more of the expressions. The node that is
	// most preferred is the one with the greatest sum of weights, i.e.
	// for each node that meets all of the scheduling requirements (resource
	// request, RequiredDuringScheduling affinity expressions, etc.),
	// compute a sum by iterating through the elements of this field and adding
	// "weight" to the sum if the node matches the corresponding MatchExpressions; the
	// node(s) with the highest sum are the most preferred.
	PreferredDuringSchedulingIgnoredDuringExecution []PreferredSchedulingTerm  `json:"preferredDuringSchedulingIgnoredDuringExecution,omitempty"`
}

// An empty preferred scheduling term matches all objects with implicit weight 0
// (i.e. it's a no-op). A null preferred scheduling term matches no objects.
type PreferredSchedulingTerm struct {
	// weight is in the range 1-100
	Weight int  `json:"weight"`
	// matchExpressions is a list of node selector requirements. The requirements are ANDed.
	MatchExpressions []NodeSelectorRequirement  `json:"matchExpressions,omitempty"`
}
```
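
The weight-summing rule described in the comment on `PreferredDuringSchedulingIgnoredDuringExecution` might look roughly like the sketch below. It is a simplified illustration under stated assumptions: `Requirement` is a cut-down stand-in for `NodeSelectorRequirement` that handles only `In` and `Exists`, and the function names, node names, and label keys are hypothetical.

```go
package main

import "fmt"

// Requirement is a simplified stand-in for NodeSelectorRequirement that
// supports only the In and Exists operators (an assumption made for brevity).
type Requirement struct {
	Key      string
	Operator string // "In" or "Exists"
	Values   []string
}

type PreferredSchedulingTerm struct {
	Weight           int
	MatchExpressions []Requirement
}

// matchesAll ANDs the requirements against a node's labels.
func matchesAll(reqs []Requirement, labels map[string]string) bool {
	for _, r := range reqs {
		value, present := labels[r.Key]
		switch r.Operator {
		case "Exists":
			if !present {
				return false
			}
		case "In":
			found := false
			for _, v := range r.Values {
				if present && v == value {
					found = true
				}
			}
			if !found {
				return false
			}
		default:
			return false // Operators outside this sketch never match.
		}
	}
	return true
}

// preferenceScore adds up the weight of every term whose expressions the node matches.
func preferenceScore(terms []PreferredSchedulingTerm, labels map[string]string) int {
	sum := 0
	for _, t := range terms {
		if matchesAll(t.MatchExpressions, labels) {
			sum += t.Weight
		}
	}
	return sum
}

func main() {
	terms := []PreferredSchedulingTerm{
		{Weight: 80, MatchExpressions: []Requirement{{Key: "example.com/zone", Operator: "In", Values: []string{"z1"}}}},
		{Weight: 20, MatchExpressions: []Requirement{{Key: "example.com/ssd", Operator: "Exists"}}},
	}
	nodes := []struct {
		name   string
		labels map[string]string
	}{
		{"node-a", map[string]string{"example.com/zone": "z1", "example.com/ssd": "true"}},
		{"node-b", map[string]string{"example.com/zone": "z2", "example.com/ssd": "true"}},
		{"node-c", map[string]string{"example.com/zone": "z2"}},
	}
	for _, n := range nodes {
		fmt.Printf("%s: %d\n", n.name, preferenceScore(terms, n.labels))
	}
	// Prints:
	// node-a: 100
	// node-b: 20
	// node-c: 0
}
```

Only nodes that already pass every scheduling requirement would be scored this way. Among those, `node-a` would be the most preferred.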
					
						

Unfortunately, the name of the existing `map[string]string` field in PodSpec is `NodeSelector`
and we can't change it since this name is part of the API. Hopefully this won't
cause too much confusion.
					
						
## Examples

**TODO: fill in this section**

* Run this pod on a node with an Intel or AMD CPU

* Run this pod on a node in availability zone Z
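
As a tentative illustration only, the two cases above might be expressed with the proposed types as follows. The label keys `example.com/cpu-vendor` and `example.com/zone` and the helper function names are hypothetical; the labels that nodes will actually publish are discussed in #9044. The sketch repeats only the parts of the API types it needs and uses `RequiredDuringSchedulingIgnoredDuringExecution`.

```go
package main

import (
	"encoding/json"
	"fmt"
)

type NodeSelectorOperator string

const NodeSelectorOpIn NodeSelectorOperator = "In"

type NodeSelectorRequirement struct {
	Key      string               `json:"key"`
	Operator NodeSelectorOperator `json:"operator"`
	Values   []string             `json:"values,omitempty"`
}

type NodeSelectorTerm struct {
	MatchExpressions []NodeSelectorRequirement `json:"matchExpressions,omitempty"`
}

type NodeSelector struct {
	NodeSelectorTerms []NodeSelectorTerm `json:"nodeSelectorTerms,omitempty"`
}

type NodeAffinity struct {
	RequiredDuringSchedulingIgnoredDuringExecution *NodeSelector `json:"requiredDuringSchedulingIgnoredDuringExecution,omitempty"`
}

type Affinity struct {
	NodeAffinity *NodeAffinity `json:"nodeAffinity,omitempty"`
}

// Hypothetical label keys; real clusters may publish different ones.
const (
	cpuVendorLabel = "example.com/cpu-vendor"
	zoneLabel      = "example.com/zone"
)

// requiredAffinity wraps a single requirement in a one-term required node affinity.
func requiredAffinity(req NodeSelectorRequirement) Affinity {
	return Affinity{NodeAffinity: &NodeAffinity{
		RequiredDuringSchedulingIgnoredDuringExecution: &NodeSelector{
			NodeSelectorTerms: []NodeSelectorTerm{
				{MatchExpressions: []NodeSelectorRequirement{req}},
			},
		},
	}}
}

// intelOrAMD: run this pod on a node with an Intel or AMD CPU.
func intelOrAMD() Affinity {
	return requiredAffinity(NodeSelectorRequirement{
		Key: cpuVendorLabel, Operator: NodeSelectorOpIn, Values: []string{"intel", "amd"},
	})
}

// inZone: run this pod on a node in the given availability zone.
func inZone(zone string) Affinity {
	return requiredAffinity(NodeSelectorRequirement{
		Key: zoneLabel, Operator: NodeSelectorOpIn, Values: []string{zone},
	})
}

func main() {
	for _, a := range []Affinity{intelOrAMD(), inZone("Z")} {
		out, err := json.MarshalIndent(a, "", "  ")
		if err != nil {
			panic(err)
		}
		fmt.Println(string(out))
	}
}
```

The "Intel or AMD" case uses a single `In` requirement with two values. Two ORed terms with one value each would express the same thing.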
					
						

## Backward compatibility

When we add `Affinity` to PodSpec, we will deprecate, but not remove, the current field in PodSpec:

```go
NodeSelector map[string]string `json:"nodeSelector,omitempty"`
```
					
						
Old versions of the scheduler will ignore the `Affinity` field.
New versions of the scheduler will apply their scheduling predicates to both `Affinity` and `nodeSelector`,
i.e. the pod can only schedule onto nodes that satisfy both sets of requirements. We will not
attempt to convert between `Affinity` and `nodeSelector`.
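
A minimal sketch of how a new scheduler might combine the two checks is shown below. It assumes a cut-down `NodeSelectorRequirement` that behaves like the `In` operator only, and the function names and label keys are hypothetical. The point it illustrates is that the legacy `nodeSelector` and the required node affinity are ANDed.

```go
package main

import "fmt"

// NodeSelectorRequirement is simplified here: it always behaves like the
// In operator (an assumption made to keep the sketch short).
type NodeSelectorRequirement struct {
	Key    string
	Values []string
}

type NodeSelectorTerm struct {
	MatchExpressions []NodeSelectorRequirement
}

type NodeSelector struct {
	NodeSelectorTerms []NodeSelectorTerm
}

type NodeAffinity struct {
	RequiredDuringSchedulingIgnoredDuringExecution *NodeSelector
}

type Affinity struct {
	NodeAffinity *NodeAffinity
}

type PodSpec struct {
	NodeSelector map[string]string // legacy field
	Affinity     *Affinity         // new field
}

// matchesLegacySelector requires every key/value pair to be present on the node.
func matchesLegacySelector(selector, labels map[string]string) bool {
	for k, v := range selector {
		if got, ok := labels[k]; !ok || got != v {
			return false
		}
	}
	return true
}

// matchesRequiredAffinity ORs the terms and ANDs the requirements in each term.
// A pod without a required node affinity is unconstrained by it.
func matchesRequiredAffinity(a *Affinity, labels map[string]string) bool {
	if a == nil || a.NodeAffinity == nil || a.NodeAffinity.RequiredDuringSchedulingIgnoredDuringExecution == nil {
		return true
	}
	for _, term := range a.NodeAffinity.RequiredDuringSchedulingIgnoredDuringExecution.NodeSelectorTerms {
		termOK := true
		for _, req := range term.MatchExpressions {
			reqOK := false
			for _, v := range req.Values {
				if got, ok := labels[req.Key]; ok && got == v {
					reqOK = true
				}
			}
			if !reqOK {
				termOK = false
				break
			}
		}
		if termOK {
			return true
		}
	}
	return false
}

// podFitsNode applies both predicates; the node must satisfy both.
func podFitsNode(pod PodSpec, labels map[string]string) bool {
	return matchesLegacySelector(pod.NodeSelector, labels) && matchesRequiredAffinity(pod.Affinity, labels)
}

func main() {
	pod := PodSpec{
		NodeSelector: map[string]string{"example.com/disk": "ssd"},
		Affinity: &Affinity{NodeAffinity: &NodeAffinity{
			RequiredDuringSchedulingIgnoredDuringExecution: &NodeSelector{NodeSelectorTerms: []NodeSelectorTerm{
				{MatchExpressions: []NodeSelectorRequirement{{Key: "example.com/zone", Values: []string{"z1", "z2"}}}},
			}},
		}},
	}
	fmt.Println(podFitsNode(pod, map[string]string{"example.com/disk": "ssd", "example.com/zone": "z1"})) // true
	fmt.Println(podFitsNode(pod, map[string]string{"example.com/disk": "ssd", "example.com/zone": "z3"})) // false
	fmt.Println(podFitsNode(pod, map[string]string{"example.com/disk": "hdd", "example.com/zone": "z1"})) // false
}
```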
					
						

Old versions of non-scheduling clients will not know how to do anything semantically meaningful
with `Affinity`, but we don't expect that this will cause a problem.

See [this comment](https://github.com/kubernetes/kubernetes/issues/341#issuecomment-140809259)
for more discussion.

Users should not start using `NodeAffinity` until the full implementation has been in Kubelet and the master
for enough binary versions that we feel comfortable that we will not need to roll back either Kubelet
or master to a version that does not support them. Longer-term we will use a programmatic approach to
enforcing this (#4855).
					
						
## Implementation plan

1. Add the `Affinity` field to PodSpec and the `NodeAffinity`, `PreferredDuringSchedulingIgnoredDuringExecution`,
   and `RequiredDuringSchedulingIgnoredDuringExecution` types to the API
2. Implement a scheduler predicate that takes `RequiredDuringSchedulingIgnoredDuringExecution` into account
3. Implement a scheduler priority function that takes `PreferredDuringSchedulingIgnoredDuringExecution` into account
4. At this point, the feature can be deployed and `PodSpec.NodeSelector` can be marked as deprecated
5. Add the `RequiredDuringSchedulingRequiredDuringExecution` field to the API
6. Modify the scheduler predicate from step 2 to also take `RequiredDuringSchedulingRequiredDuringExecution` into account
7. Add `RequiredDuringSchedulingRequiredDuringExecution` to Kubelet's admission decision
8. Implement code in Kubelet *or* the controllers that evicts a pod that no longer satisfies
   `RequiredDuringSchedulingRequiredDuringExecution`
   (see [this comment](https://github.com/kubernetes/kubernetes/issues/12744#issuecomment-164372008)).

We assume Kubelet publishes labels describing the node's membership in all of the relevant scheduling
domains (e.g. node name, rack name, availability zone name, etc.). See #9044.
					
						
## Extensibility

The design described here is the result of careful analysis of use cases, a decade of experience
with Borg at Google, and a review of similar features in other open-source container orchestration
systems. We believe that it properly balances the goal of expressiveness against the goals of
simplicity and efficiency of implementation. However, we recognize that
use cases may arise in the future that cannot be expressed using the syntax described here.
Although we are not implementing an affinity-specific extensibility mechanism for a variety
of reasons (simplicity of the codebase, simplicity of cluster deployment, desire for Kubernetes
users to get a consistent experience, etc.), the regular Kubernetes
annotation mechanism can be used to add or replace affinity rules. The way this would work is:

1. Define one or more annotations to describe the new affinity rule(s)
1. User (or an admission controller) attaches the annotation(s) to pods to request the desired scheduling behavior.
   If the new rule(s) *replace* one or more fields of `Affinity` then the user would omit those fields
   from `Affinity`; if they are *additional rules*, then the user would fill in `Affinity` as well as the
   annotation(s).
1. Scheduler takes the annotation(s) into account when scheduling.
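
To make the steps above concrete, here is a purely hypothetical sketch of an annotation-driven rule that the proposed syntax cannot express: requiring that a node label's value start with a given prefix. The annotation key `scheduler.example.com/label-prefix`, its `labelKey=prefix` value format, and the function name are all invented for illustration and are not part of this proposal.

```go
package main

import (
	"fmt"
	"strings"
)

// labelPrefixAnnotation is a hypothetical annotation key. Its value has the
// form "labelKey=prefix" and asks for nodes whose label value starts with prefix.
const labelPrefixAnnotation = "scheduler.example.com/label-prefix"

// fitsLabelPrefixRule is the check a customized scheduler could run in
// addition to the standard Affinity predicates.
func fitsLabelPrefixRule(podAnnotations, nodeLabels map[string]string) bool {
	raw, ok := podAnnotations[labelPrefixAnnotation]
	if !ok {
		return true // The pod did not request the extra rule.
	}
	parts := strings.SplitN(raw, "=", 2)
	if len(parts) != 2 {
		return false // Assumption: a malformed annotation matches no node.
	}
	value, present := nodeLabels[parts[0]]
	return present && strings.HasPrefix(value, parts[1])
}

func main() {
	annotations := map[string]string{labelPrefixAnnotation: "example.com/zone=us-east"}
	fmt.Println(fitsLabelPrefixRule(annotations, map[string]string{"example.com/zone": "us-east-1a"})) // true
	fmt.Println(fitsLabelPrefixRule(annotations, map[string]string{"example.com/zone": "eu-west-1b"})) // false
}
```

Because this rule is additional rather than a replacement, a pod using it would still fill in `Affinity` as usual.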
					
						

If some particular new syntax becomes popular, we would consider upstreaming it by integrating
it into the standard `Affinity`.

## Future work

Are there any other fields we should convert from `map[string]string` to `NodeSelector`?

## Related issues

The review for this proposal is in #18261.

The main related issue is #341. Issue #367 is also related. Those issues reference other
related issues.
					
						