1027 Commits

Author SHA1 Message Date
Aravindh Puthiyaparambil 00be157dab
kubelet: use env vars in node log query PS command
- Use environment variables to pass string arguments in the node log
  query PS command
- Split getLoggingCmd into getLoggingCmdEnv and getLoggingCmdArgs
  for better modularization
2025-01-13 14:25:35 -08:00
Patrick Ohly a5de75458e DRA API: bump maximum size of ReservedFor to 256
The original limit of 32 seemed sufficient for a single GPU on a node. But for
shared non-local resources it is too low. For example, a ResourceClaim might be
used to allocate an interconnect channel that connects all pods of a workload
running on several different nodes, in which case the number of pods can be
considerably larger.

256 is high enough for currently planned systems. If we need something even
higher in the future, an alternative approach might be needed to avoid
scalability problems.

Normally, increasing such a limit would have to be done incrementally over two
releases. In this case we decided on
Slack (https://kubernetes.slack.com/archives/CJUQN3E4T/p1734593174791519) to
make an exception and apply this change to current master for 1.33 and backport
it to the next 1.32.x patch release for production usage.

This breaks downgrades to a 1.32 release without this change if there are
ResourceClaims with a number of consumers > 32 in ReservedFor. In practice,
this breakage is very unlikely because there are no workloads yet which need so
many consumers and such downgrades to a previous patch release are also
unlikely. Downgrades to 1.31 already weren't supported when using DRA v1beta1.
2025-01-09 14:27:03 +01:00
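The downgrade concern in the commit above can be illustrated with a minimal Go sketch. It is not the Kubernetes validation code: validateReservedFor and both constant names are hypothetical, and only the values 32 and 256 come from the commit message.

```go
package main

import "fmt"

// Limits quoted in the commit message. The constant names are hypothetical.
const (
	oldReservedForMaxSize = 32
	newReservedForMaxSize = 256
)

// validateReservedFor is an illustrative helper, not the Kubernetes validation code.
// It rejects a consumer list that has more entries than the given limit.
func validateReservedFor(consumers []string, maxSize int) error {
	if len(consumers) > maxSize {
		return fmt.Errorf("reservedFor: %d entries exceed the maximum of %d", len(consumers), maxSize)
	}
	return nil
}

func main() {
	// A claim shared by 100 pods, e.g. an interconnect channel for one workload.
	consumers := make([]string, 100)

	// Accepted by a release that carries the new limit.
	fmt.Println("new limit:", validateReservedFor(consumers, newReservedForMaxSize))

	// Rejected by a release that still enforces the old limit, which is the
	// downgrade breakage described in the commit message.
	fmt.Println("old limit:", validateReservedFor(consumers, oldReservedForMaxSize))
}
```

A claim with more than 32 consumers is valid under the new limit but fails the same check in a release that still uses the old one.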
lauralorenz 7fe41da522
KEP-4603: Node specific kubelet config for maximum backoff down to 1 second (#128374)
* Add feature gate, API, and conflict validation tests for enablecrashloopbackoffmax

Signed-off-by: Laura Lorenz <lauralorenz@google.com>

* Handle when current base is longer than node max

Signed-off-by: Laura Lorenz <lauralorenz@google.com>

* Update pkg/features/kube_features.go

Co-authored-by: Tsubasa Nagasawa <toversus2357@gmail.com>

* Fix indentation

Signed-off-by: Laura Lorenz <lauralorenz@google.com>

* Follow convention for success test

Signed-off-by: Laura Lorenz <lauralorenz@google.com>

* Normalize casing, and change field to Duration

Signed-off-by: Laura Lorenz <lauralorenz@google.com>

* Fix json name and some other casing errors

Signed-off-by: Laura Lorenz <lauralorenz@google.com>

* Another one I missed before

Signed-off-by: Laura Lorenz <lauralorenz@google.com>

* Don't clobber global max function

Signed-off-by: Laura Lorenz <lauralorenz@google.com>

* Change to flat value in defaults.go

Signed-off-by: Laura Lorenz <lauralorenz@google.com>

* Streamline validation and defaults

Signed-off-by: Laura Lorenz <lauralorenz@google.com>

* Fix typecheck

Signed-off-by: Laura Lorenz <lauralorenz@google.com>

* Lint

Signed-off-by: Laura Lorenz <lauralorenz@google.com>

* Tighten up validation for subsecond values

Signed-off-by: Laura Lorenz <lauralorenz@google.com>

* Rename field from MaxBackOffPeriod to MaxContainerRestartPeriod

Signed-off-by: Laura Lorenz <lauralorenz@google.com>

* A few missed references to renames

Signed-off-by: Laura Lorenz <lauralorenz@google.com>

* Only compare flags in flags test

Signed-off-by: Laura Lorenz <lauralorenz@google.com>

* Don't mess with SetDefault signature

Nobody messes with SetDefault signature

Signed-off-by: Laura Lorenz <lauralorenz@google.com>

* Fix stale signature change, and update test data

Signed-off-by: Laura Lorenz <lauralorenz@google.com>

* Inspect current feature gates at defaulting time

Signed-off-by: Laura Lorenz <lauralorenz@google.com>

* Don't use the global feature gate for temp usage

Signed-off-by: Laura Lorenz <lauralorenz@google.com>

* Expose default error, and some comments

Signed-off-by: Laura Lorenz <lauralorenz@google.com>

* Hint fuzzer for less arbitrary values to FeatureGates

Signed-off-by: Laura Lorenz <lauralorenz@google.com>

---------

Signed-off-by: Laura Lorenz <lauralorenz@google.com>
Co-authored-by: Tsubasa Nagasawa <toversus2357@gmail.com>
2024-11-09 01:44:43 +00:00
Kubernetes Prow Robot c25f5eefe4
Merge pull request #128407 from ndixita/pod-level-resources
[PodLevelResources] Pod Level Resources Feature Alpha
2024-11-08 07:10:50 +00:00
Kubernetes Prow Robot 45260fd76a
Merge pull request #127857 from Jefftree/cle-v1alpha2
Coordinated Leader Election add v1alpha2
2024-11-08 07:10:43 +00:00
Kubernetes Prow Robot 6db94779e4
Merge pull request #128686 from thockin/take_over_pr-125233
Add missing comments in k8s.io/api/core/v1
2024-11-08 05:24:59 +00:00
ndixita 85488b5f10 Generated files and compatability data from API changes 2024-11-08 03:00:50 +00:00
Bo Wang 495af2a3d4
Add missing comments in k8s.io/api/core/v1
Signed-off-by: Bo Wang <wangbob@uniontech.com>
2024-11-07 18:42:33 -08:00
Jefftree e86c38b249 generated 2024-11-08 02:27:20 +00:00
Kubernetes Prow Robot 4cf2818f96
Merge pull request #128240 from LionelJouin/KEP-4817
DRA: Implementation of ResourceClaim.Status.Devices (KEP-4817)
2024-11-08 02:21:24 +00:00
Kubernetes Prow Robot 4d10ae8fdc
Merge pull request #127513 from tkashem/delete-undecryptable
KEP-3926: unsafe deletion of corrupt objects
2024-11-08 02:21:04 +00:00
Abu Kashem aff05b0bca api: run codegen
run 'make update' to code gen for changes in meta/v1 DeleteOptions
2024-11-07 17:37:55 -05:00
Kubernetes Prow Robot 3300aa1783
Merge pull request #128247 from mattcary/autodelete-ga
Promote StatefulSetAutoDeletePVC to stable in 1.32
2024-11-07 22:20:43 +00:00
Lionel Jouin d84c8d2a64 [KEP-4817] make update 2024-11-07 22:19:09 +01:00
Kubernetes Prow Robot 9660e5c4cd
Merge pull request #127360 from knight42/feat/split-stdout-stderr-server-side
API: add a new `Stream` field to `PodLogOptions`
2024-11-07 19:44:45 +00:00
Kubernetes Prow Robot 50362ac7d0 Promote StatefulSetAutoDeletePVC to stable for 1.32. 2024-11-07 09:43:49 -08:00
Lionel Jouin d28b50e0a0 [KEP-4817] make update
Signed-off-by: Lionel Jouin <lionel.jouin@est.tech>
2024-11-07 10:36:09 +01:00
Kubernetes Prow Robot c462d4c8e5
Merge pull request #126096 from utam0k/support-disabling-oom-group-kill
kubelet: new kubelet config option for disabling group oom kill
2024-11-07 06:29:36 +00:00
Jian Zeng 4193824215
chore: update generated code
Signed-off-by: Jian Zeng <anonymousknight96@gmail.com>
2024-11-07 13:52:16 +08:00
Kubernetes Prow Robot afc204104c
Merge pull request #128601 from pohly/dra-api-opaque-parameters-length-limit
DRA API: opaque parameters length limit
2024-11-07 03:53:35 +00:00
utam0k 4f909c14a0
kubelet: new kubelet config option for disabling group oom kill
Signed-off-by: utam0k <k0ma@utam0k.jp>
2024-11-07 12:03:04 +09:00
Kubernetes Prow Robot fa0979c15f
Merge pull request #124074 from carlory/clean-100001
fix description for PersistentVolumeSource and VolumeSource
2024-11-06 22:07:29 +00:00
Kubernetes Prow Robot 96250d4411
Merge pull request #124918 from SergeyKanzhelev/commentIgnoringBadStatuses
added a comment that statuses lists are not being validated
2024-11-06 20:09:29 +00:00
Patrick Ohly 446f20aa3e DRA API: add maximum length of opaque parameters
This had been left out unintentionally earlier. Because theoretically there
might now be existing objects with parameters that are larger than whatever
limit gets enforced now, the limit only gets checked when parameters get
created or modified.

This is similar to the validation of CEL expressions and for consistency, the
same 10 Ki limit as for those is chosen.

Because the limit is not enforced for stored parameters, it can be increased in
the future, with the caveat that users who need larger parameters then depend
on the newer Kubernetes release with a higher limit. Lowering the limit is
harder because creating deployments that worked in older Kubernetes will not
work anymore with newer Kubernetes.
2024-11-06 17:29:51 +01:00
Patrick Ohly 30f5282656 DRA API: rename DeviceCapacity.Quantity to DeviceCapacity.Value
Based on review
feedback (https://github.com/kubernetes/kubernetes/pull/127511#discussion_r1823521172).
2024-11-06 13:03:20 +01:00
Patrick Ohly 81fd64256c DRA API: use DeviceCapacity struct instead of plain Quantity
This enables a future extension where capacity of a single device gets consumed
by different claims. The semantic without any additional fields is the same as
before: a capacity cannot be split up and is only an attribute of a device.

Because it's semantically the same as before, two-way conversion to v1alpha3 is
possible.
2024-11-06 13:03:19 +01:00
Patrick Ohly 0ee52b23cd DRA API: generated files 2024-11-06 13:03:19 +01:00
carlory 7cb4a1f144 fix description for PersistentVolumeSource and VolumeSource 2024-11-06 10:51:04 +08:00
Kubernetes Prow Robot 2d6c8a129d
Merge pull request #127134 from jpbetz/mutating-admission
KEP-3962: MutatingAdmissionPolicy Alpha
2024-11-05 17:31:38 +00:00
Kubernetes Prow Robot f56db61db5
Merge pull request #126862 from carlory/HPAContainerMetrics
Remove generally available feature gate HPAContainerMetrics
2024-11-05 16:19:29 +00:00
Kubernetes Prow Robot bc79d3ba87
Merge pull request #128396 from ritazh/deprecate-EnforceMountableSecretsAnnotation
deprecate EnforceMountableSecretsAnnotation in 1.32
2024-11-05 06:07:40 +00:00
Joe Betz 700e3b5664 Update OpenAPI and fix openAPI tests to handle unexported jsonreferences
Co-authored-by: Alexander Zielensk <alexzielenski@gmail.com>
2024-11-04 21:40:54 -05:00
Joe Betz fe3a7f5291 generate code 2024-11-04 21:40:47 -05:00
Rita Zhang e7cdc59555
deprecate EnforceMountableSecretsAnnotation in 1.32
Signed-off-by: Rita Zhang <rita.z.zhang@gmail.com>
2024-11-04 13:13:32 -08:00
Sergey Kanzhelev 4fc209f12b generated files 2024-11-03 06:28:45 +00:00
Jan Safranek 3867cb40ad Regenerated API 2024-11-01 12:45:56 +01:00
Patrick Ohly 4419568259 DRA: treat AdminAccess as a new feature gated field
Using the "normal" logic for a feature gated field simplifies the
implementation of the feature gate.

There is one (entirely theoretical!) problem with updating from 1.31: if a claim
was allocated in 1.31 with admin access, the status field was not set because
it didn't exist yet. If a driver now follows the current definition of "unset =
off", then it will not grant admin access even though it should. This is
theoretical because drivers are starting to support admin access with 1.32, so
there shouldn't be any claim where this problem could occur.
2024-10-29 10:22:31 +01:00
Patrick Ohly 9a7e4ccab2 DRA admin access: add feature gate
The new DRAAdminAccess feature gate has the following effects:
- If disabled in the apiserver, the spec.devices.requests[*].adminAccess
  field gets cleared. Same in the status. In both cases the scenario
  that it was already set and a claim or claim template gets updated
  is special: in those cases, the field is not cleared.

  Also, allocating a claim with admin access is allowed regardless of the
  feature gate and the field is not cleared. In practice, the scheduler
  will not do that.
- If disabled in the resource claim controller, creating ResourceClaims
  with the field set gets rejected. This prevents running workloads
  which depend on admin access.
- If disabled in the scheduler, claims with admin access don't get
  allocated. The effect is the same.

The alternative would have been to ignore the fields in claim controller and
scheduler. This is bad because a monitoring workload then runs, blocking
resources that probably were meant for production workloads.
2024-10-29 09:50:11 +01:00
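The apiserver behavior described in the commit above (clear the field when the gate is off, unless the stored object already used it) can be sketched in Go as follows. The deviceRequest type and both functions are hypothetical simplifications, not the real API types or the Kubernetes strategy code.

```go
package main

import "fmt"

// deviceRequest is a simplified, hypothetical stand-in for the API type.
type deviceRequest struct {
	Name        string
	AdminAccess *bool
}

// adminAccessInUse reports whether any request has the field set.
func adminAccessInUse(requests []deviceRequest) bool {
	for _, r := range requests {
		if r.AdminAccess != nil {
			return true
		}
	}
	return false
}

// dropDisabledAdminAccess clears AdminAccess in newReqs when the feature gate
// is disabled. oldReqs is nil on create. If the stored object already had the
// field set, the update keeps it, as described in the commit message.
func dropDisabledAdminAccess(gateEnabled bool, newReqs, oldReqs []deviceRequest) {
	if gateEnabled || adminAccessInUse(oldReqs) {
		return
	}
	for i := range newReqs {
		newReqs[i].AdminAccess = nil
	}
}

func main() {
	enabled := true

	// Create with the gate disabled: the field is cleared.
	created := []deviceRequest{{Name: "gpu", AdminAccess: &enabled}}
	dropDisabledAdminAccess(false, created, nil)
	fmt.Println("cleared on create:", created[0].AdminAccess == nil)

	// Update of an object that already had the field: it is preserved.
	stored := []deviceRequest{{Name: "gpu", AdminAccess: &enabled}}
	updated := []deviceRequest{{Name: "gpu", AdminAccess: &enabled}}
	dropDisabledAdminAccess(false, updated, stored)
	fmt.Println("kept on update:", updated[0].AdminAccess != nil)
}
```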
Patrick Ohly f3fef01e79 DRA API: AdminAccess in DeviceRequestAllocationResult
Drivers need to know that because admin access may also grant additional
permissions. The allocator needs to ignore such results when determining which
devices are considered as allocated.

In both cases it is conceptually cleaner to not rely on the content of the
ClaimSpec.
2024-10-29 09:50:07 +01:00
Kubernetes Prow Robot 86b99869cb
Merge pull request #128299 from SergeyKanzhelev/updateDHS
Update Device Health fields description for KEP-4680
2024-10-28 22:19:01 +00:00
Kubernetes Prow Robot 3690cb7f9a
Merge pull request #128101 from pohly/dra-api-cel-cost-limit
DRA API: implement CEL cost limit
2024-10-26 20:18:52 +01:00
Sergey Kanzhelev aed81e5d47 regenerate files 2024-10-26 07:11:06 +00:00
Patrick Ohly f548fc2264 DRA API: implement CEL cost limit
The main purpose is to protect against denial-of-service attacks.  Scheduling
time depends a lot on unpredictable factors and expected scheduling time also
varies, so no attempt is made to limit the overall time spent on evaluating CEL
expressions per claim.
2024-10-23 21:24:45 +02:00
Kubernetes Prow Robot 0e6961e898
Merge pull request #126955 from tallclair/cleanup
Remove corev1.Binding deprecation message
2024-10-23 02:21:19 +01:00
carlory d62ee4ab5f Remove generally available feature gate HPAContainerMetrics 2024-10-18 14:37:53 +08:00
Kubernetes Prow Robot b1b4e5d397
Merge pull request #128003 from pohly/dra-classic-dra-removal
DRA: remove "classic DRA"
2024-10-18 00:55:17 +01:00
Kubernetes Prow Robot 51f76febd7
Merge pull request #127402 from mimowo/managed-by-beta-update
Graduate JobManagedBy to Beta in 1.32
2024-10-17 19:27:14 +01:00
Kubernetes Prow Robot c5a85abecb
Merge pull request #122867 from oilbeater/patch-1
fix broken links
2024-10-17 19:27:06 +01:00
Kubernetes Prow Robot f5b92902a3
Merge pull request #124434 from tu1h/fix-compute-resources-link
API docs: point outdate link to current link
2024-10-17 17:17:03 +01:00
Michal Wozniak 70a8ceb6f0 Graduate JobManagedBy to Beta in 1.32
# Conflicts:
#	pkg/features/kube_features.go
2024-10-17 09:01:54 +02:00