2990 Commits

Author SHA1 Message Date
upodroid dedd4df0a2 fetch cni plugins from GitHub releases 2024-12-18 19:48:06 +01:00
Paco Xu 59dfb0e779 skip if cri proxy is disabled/undefined 2024-11-19 11:17:07 +08:00
Laura Lorenz 9ab0d81d76 Now that sleep is shorter, only expect to reach 3 within 30s
Focused too much on the container restart one in commit that fixed that

Signed-off-by: Laura Lorenz <lauralorenz@google.com>
2024-11-13 01:39:58 +00:00
Laura Lorenz 59f9858086 Move function specific to container restart test inline
Signed-off-by: Laura Lorenz <lauralorenz@google.com>
2024-11-12 23:59:30 +00:00
Laura Lorenz 529d5ba9d3 Don't overly indirect image name
Signed-off-by: Laura Lorenz <lauralorenz@google.com>
2024-11-12 23:34:57 +00:00
Laura Lorenz 8e7b2af712 Use a better util
Signed-off-by: Laura Lorenz <lauralorenz@google.com>
2024-11-12 23:30:03 +00:00
Laura Lorenz 285d433dea Clearer image pull test and utils
Signed-off-by: Laura Lorenz <lauralorenz@google.com>
2024-11-12 23:30:00 +00:00
Laura Lorenz e03d0f60ef Orient tests to run faster, but tolerate infra slowdowns up to 5 minutes
Signed-off-by: Laura Lorenz <lauralorenz@google.com>
2024-11-12 21:48:28 +00:00
Laura Lorenz d293c5088f Fix spelling
Signed-off-by: Laura Lorenz <lauralorenz@google.com>
2024-11-12 21:12:20 +00:00
Laura Lorenz 1da8ca816e Extract restart number properly
Signed-off-by: Laura Lorenz <lauralorenz@google.com>
2024-11-12 20:00:11 +00:00
Laura Lorenz 2732d57e33 Missed refactor of container name here
Signed-off-by: Laura Lorenz <lauralorenz@google.com>
2024-11-12 19:50:11 +00:00
Laura Lorenz e6059d7386 Fix typecheck and verify
Signed-off-by: Laura Lorenz <lauralorenz@google.com>
2024-11-12 19:48:38 +00:00
Laura Lorenz f032068ef7 Focus on restart numbers instead of timing
Signed-off-by: Laura Lorenz <lauralorenz@google.com>
2024-11-12 07:12:24 +00:00
Laura Lorenz bad037b505 Formatting
Signed-off-by: Laura Lorenz <lauralorenz@google.com>
2024-11-12 04:48:10 +00:00
Laura Lorenz 15bae1eadf Add container restart test too
Signed-off-by: Laura Lorenz <lauralorenz@google.com>
2024-11-12 04:30:46 +00:00
Laura Lorenz fc4ac5efeb Move image pull backoff test to be with other image pull tests
Signed-off-by: Laura Lorenz <lauralorenz@google.com>
2024-11-12 01:27:44 +00:00
Laura Lorenz 2479d91f2a Fix test to count pull tries
Signed-off-by: Laura Lorenz <lauralorenz@google.com>
2024-11-12 01:27:34 +00:00
Laura Lorenz 6ef05dbd01 The idea of how this test should work
Signed-off-by: Laura Lorenz <lauralorenz@google.com>
2024-11-11 17:55:41 +00:00
Laura Lorenz 6337a28a68 Organize into its own context
Signed-off-by: Laura Lorenz <lauralorenz@google.com>
2024-11-11 17:55:41 +00:00
Laura Lorenz f913b7afe8 Adding imagepull backoff test
Signed-off-by: Laura Lorenz <lauralorenz@google.com>
2024-11-11 17:55:41 +00:00
Kubernetes Prow Robot 1dd81aa1c9
Merge pull request #126653 from zhifei92/fix-podstatus
fix the issue of losing the pending phase after a node restart.
2024-11-07 21:06:54 +00:00
Kubernetes Prow Robot ef37cb503b
Merge pull request #128634 from thockin/remove_PodHostIPs_gate_for_1.32
Remove PodHostIPs feature gates
2024-11-07 13:47:54 +00:00
zhifei92 bed96b4eb6 fix: fix the issue of losing the pending phase after a node restart. 2024-11-07 21:10:11 +08:00
Lan Liang 6e5a3cde50
Remove PodHostIPs feature gates.
Signed-off-by: Lan Liang <gcslyp@gmail.com>
2024-11-06 23:10:36 -08:00
Kubernetes Prow Robot 6cc3570466
Merge pull request #128190 from HarshalNeelkamal/external-jwt
Add plugin and key-cache for ExternalJWTSigner integration
2024-11-07 06:29:45 +00:00
Kubernetes Prow Robot c462d4c8e5
Merge pull request #126096 from utam0k/support-disabling-oom-group-kill
kubelet: new kubelet config option for disabling group oom kill
2024-11-07 06:29:36 +00:00
Harshal Neelkamal 6fdacf0411 Add plugin and key-cache for ExternalJWTSigner integration 2024-11-07 03:16:23 +00:00
utam0k 4f909c14a0
kubelet: new kubelet config option for disabling group oom kill
Signed-off-by: utam0k <k0ma@utam0k.jp>
2024-11-07 12:03:04 +09:00
Kubernetes Prow Robot 48c65d1870
Merge pull request #128576 from bart0sh/PR166-refactor-kubelet-stop-and-restart
e2e_node: refactor Kubelet stopping and restarting
2024-11-06 20:10:40 +00:00
Patrick Ohly 33ea278c51 DRA: use v1beta1 API
No code is left that depends on v1alpha3, except of course the code
implementing that version.
2024-11-06 13:03:19 +01:00
Ed Bartosh 3aa95dafea e2e_node: refactor stopping and restarting kubelet
Moved Kubelet health checks from test cases to the stopKubelet API.
This should make the API cleaner and easier to use.
2024-11-06 11:34:48 +02:00
Kubernetes Prow Robot 98b4ee6bfa
Merge pull request #126525 from dshebib/addSidecarE2EImgTest
Restart sidecar container when the image has changed
2024-11-06 00:35:35 +00:00
Kubernetes Prow Robot f64eeb523d
Merge pull request #128096 from bart0sh/PR161-e2e_node-consolidate-NFSServer-APIs
e2e_node: consolidated NFSServer APIs.
2024-11-05 00:33:35 +00:00
Abhijit Hoskeri d86debe500 e2e_node: Pass e2eCriProxy instead of updating global.
e2eCriProxy is defined in a _test.go and referenced
in a non-test file. This confuses gopls.

It's also clearer to future readers.
2024-11-02 17:40:49 -07:00
Kubernetes Prow Robot 453efd7a4b
Merge pull request #121604 from pacoxu/image-pull-e2e
[node-e2e] add test cases for serialize and parallel image pulling
2024-10-31 08:01:26 +00:00
Paco Xu 82df7a7d82 use cri proxy injector for parallel pulling image tests 2024-10-31 14:50:50 +08:00
Kubernetes Prow Robot daef8c2419
Merge pull request #127266 from pohly/dra-admin-access-in-status
DRA API: AdminAccess in DeviceRequestAllocationResult + DRAAdminAccess feature gate
2024-10-30 03:41:25 +00:00
Kubernetes Prow Robot 5fcef4f79d
Merge pull request #128422 from bart0sh/PR163-density-e2e_node-adjust-limits
density test: adjust CPU and memory limits
2024-10-30 02:37:31 +00:00
Kubernetes Prow Robot a339a36a36
Merge pull request #127506 from ffromani/cpu-pool-size-metrics
node: metrics: add metrics about cpu pool sizes
2024-10-30 00:17:24 +00:00
Ed Bartosh 04f7a86001 density test: adjust CPU and memory limits
Adjusted limits based on recent job log:
I1028 20:05:42.079182 1002 resource_usage_test.go:199] Resource usage:
  container cpu(cores) memory_working_set(MB) memory_rss(MB)
  "kubelet" 0.024      22.17                  14.20
  "runtime" 0.041      409.70                 84.21

I1028 20:05:42.079274 1002 resource_usage_test.go:206] CPU usage of containers:
  container 50th% 90th% 95th% 99th% 100th%
  "/"       N/A   N/A   N/A   N/A   N/A
  "runtime" 0.014 0.834 0.834 0.834 1.083
  "kubelet" 0.023 0.093 0.093 0.093 0.164

Increasing 95th percentile for runtime CPU usage should also make
pull-kubernetes-node-kubelet-containerd-flaky less flaky.
2024-10-30 00:48:56 +02:00
Patrick Ohly f3fef01e79 DRA API: AdminAccess in DeviceRequestAllocationResult
Drivers need to know that because admin access may also grant additional
permissions. The allocator needs to ignore such results when determining which
devices are considered as allocated.

In both cases it is conceptually cleaner to not rely on the content of the
ClaimSpec.
2024-10-29 09:50:07 +01:00
Kubernetes Prow Robot 685b8b3ba1
Merge pull request #126981 from kannon92/stable-empty-dir-promotion
KEP-1967: promote size backed memory volumes to stable
2024-10-29 01:00:54 +00:00
Kubernetes Prow Robot 1d8828ce70
Merge pull request #128091 from saschagrunert/cni-plugins
Update cni-plugins to v1.6.0
2024-10-27 03:01:06 +00:00
Francesco Romani 14ec0edd10 node: metrics: add metrics about cpu pool sizes
Add metrics about the sizing of the cpu pools.
Currently the cpumanager maintains 2 cpu pools:
- shared pool: this is where all pods with non-exclusive
  cpu allocation run
- exclusive pool: this is the union of the set of exclusive
  cpus allocated to containers, if any (requires static policy in use).

By reporting the size of the pools, users (humans or machines)
can get better insight and more feedback about how resources are
actually allocated to the workload and how the node resources are used.
2024-10-24 15:35:51 +02:00
Kubernetes Prow Robot 8c7160205d
Merge pull request #127922 from PiotrProkop/topology-manager-policy-options-e2e
add e2e tests for prefer-closest-numa-nodes TopologyManagerPolicyOption
2024-10-24 14:17:03 +01:00
PiotrProkop a6eb3281cc add e2e tests for prefer-closest-numa-nodes TopologyManagerPolicyOption suboptimal allocation
Signed-off-by: PiotrProkop <pprokop@nvidia.com>
2024-10-24 11:45:39 +02:00
Ed Bartosh 2ac5dfe379 e2e_node: check container metrics conditionally
When PodAndContainerStatsFromCRI FG is enabled, Kubelet tries to get the
list of metrics from the CRI runtime using CRI API 'ListMetricDescriptors'.

As this API is implemented in neither the CRI-O nor the Containerd versions
used in the test-infra, the ResourceMetrics test case fails to gather
certain container metrics.

Excluding container metrics from the expected list of metrics if
PodAndContainerStatsFromCRI is enabled should solve the issue.
2024-10-23 21:08:36 +03:00
Kubernetes Prow Robot c6669ea7d6
Merge pull request #127155 from ffromani/alignment-metrics
node: metrics: add resource alignment metrics
2024-10-23 09:54:58 +01:00
Francesco Romani c025861e0c node: metrics: add resource alignment metrics
In order to improve the observability of the resource management
in kubelet, cpu allocation and NUMA alignment, we add more metrics
to report if resource alignment is in effect.

More precise reporting would probably use pod status,
but this would require more invasive and riskier changes,
and possibly extra interactions with the APIServer.

We start adding metrics to report if containers got their
compute resources aligned.
If metrics are growing, the assignment is working as expected;
if metrics stay constant, perhaps at zero, no resource
alignment is done.

Extra fixes brought by this work
- retroactively add labels for existing tests
- running the metrics tests demands precise accounting to avoid flakes;
  ensure the node state is restored to pristine between each test, to
  minimize the aforementioned risk of flakes.
- The test pod command line was wrong; because of this the pod could not
  reach the Running state. This went unnoticed so far because
  no test using this utility function actually needed a pod
  in the Running state.

Signed-off-by: Francesco Romani <fromani@redhat.com>
2024-10-23 08:05:38 +02:00
Davanum Srinivas abbc5ad346
Copy limited pieces of code we use from runc's apparmor and utils packages
Signed-off-by: Davanum Srinivas <davanum@gmail.com>
2024-10-22 09:56:22 -04:00