
Work Distribution

1 - ManifestWork

What is ManifestWork

ManifestWork is used to define a group of Kubernetes resources on the hub to be applied to the managed cluster. In the open-cluster-management project, a ManifestWork resource must be created in the cluster namespace. A work agent, implemented in the work project, runs on the managed cluster and monitors the ManifestWork resources in the cluster namespace on the hub cluster.

The following example shows a ManifestWork that deploys a Deployment to the managed cluster:

apiVersion: work.open-cluster-management.io/v1
kind: ManifestWork
metadata:
  namespace: <target managed cluster>
  name: hello-work-demo
spec:
  workload:
    manifests:
      - apiVersion: apps/v1
        kind: Deployment
        metadata:
          name: hello
          namespace: default
        spec:
          selector:
            matchLabels:
              app: hello
          template:
            metadata:
              labels:
                app: hello
            spec:
              containers:
                - name: hello
                  image: quay.io/asmacdo/busybox
                  command: ["sh", "-c", 'echo "Hello, Kubernetes!" && sleep 3600']

Status tracking

The work agent tracks all the resources defined in a ManifestWork and updates its status. There are two types of status in a ManifestWork: resourceStatus tracks the status of each manifest in the ManifestWork, and conditions reflects the overall status of the ManifestWork. The work agent currently checks whether a resource is Available, meaning the resource exists on the managed cluster, and Applied, meaning the resource defined in the ManifestWork has been applied to the managed cluster.

Here is an example.

apiVersion: work.open-cluster-management.io/v1
kind: ManifestWork
metadata: ...
spec: ...
status:
  conditions:
    - lastTransitionTime: "2021-06-15T02:26:02Z"
      message: Apply manifest work complete
      reason: AppliedManifestWorkComplete
      status: "True"
      type: Applied
    - lastTransitionTime: "2021-06-15T02:26:02Z"
      message: All resources are available
      reason: ResourcesAvailable
      status: "True"
      type: Available
  resourceStatus:
    manifests:
      - conditions:
          - lastTransitionTime: "2021-06-15T02:26:02Z"
            message: Apply manifest complete
            reason: AppliedManifestComplete
            status: "True"
            type: Applied
          - lastTransitionTime: "2021-06-15T02:26:02Z"
            message: Resource is available
            reason: ResourceAvailable
            status: "True"
            type: Available
        resourceMeta:
          group: apps
          kind: Deployment
          name: hello
          namespace: default
          ordinal: 0
          resource: deployments
          version: v1

Fine-grained field values tracking

Optionally, we can let the work agent aggregate and report certain fields from the distributed resources back to the hub cluster by setting a FeedbackRule for the ManifestWork:

apiVersion: work.open-cluster-management.io/v1
kind: ManifestWork
metadata: ...
spec:
  workload: ...
  manifestConfigs:
    - resourceIdentifier:
        group: apps
        resource: deployments
        namespace: default
        name: hello
      feedbackRules:
        - type: WellKnownStatus
        - type: JSONPaths
          jsonPaths:
            - name: isAvailable
              path: '.status.conditions[?(@.type=="Available")].status'

The feedback rules instruct the work agent to periodically fetch the latest state of the resources and scrape only the expected fields from them, which helps trim the payload size of the status. Note that the collected feedback values on the ManifestWork are not updated unless the latest value differs from the previously recorded value. Currently, two kinds of FeedbackRule are supported:

  • WellKnownStatus: uses a pre-built template of feedback values for well-known Kubernetes resources.
  • JSONPaths: a valid Kubernetes JSONPath that selects a scalar field from the resource. Currently supported types are Integer, String, Boolean and JsonRaw. JsonRaw is returned only when the RawFeedbackJsonString feature gate is enabled on the agent; the agent then returns the whole structure as a JSON string (see the sketch below).
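
A minimal sketch of enabling that feature gate, assuming the Klusterlet CR accepts it under workConfiguration in the same way as the agent feature gates shown later on this page:

# Assumption: the Klusterlet CR exposes this feature gate under
# workConfiguration, like the other agent feature gates on this page.
spec:
  workConfiguration:
    featureGates:
      - feature: RawFeedbackJsonString
        mode: Enable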

The default feedback value scraping interval is 30 seconds, and we can override it by setting --status-sync-interval on the work agent. Too short an interval can put an excessive burden on the control plane of the managed cluster, so a recommended lower bound for the interval is generally 5 seconds.
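
As a sketch of one way to set the flag (the deployment name and namespace are assumptions that vary by installation, and the klusterlet operator may revert manual edits to the deployments it manages):

# Assumption: the work agent runs as the klusterlet-work-agent deployment
# in the open-cluster-management-agent namespace.
$ kubectl -n open-cluster-management-agent edit deployment klusterlet-work-agent
# then append to the work agent container args, for example:
#   - --status-sync-interval=60s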

In the end, the scraped values from feedback rules will be shown in the status:

apiVersion: work.open-cluster-management.io/v1
kind: ManifestWork
metadata: ...
spec: ...
status:
  resourceStatus:
    manifests:
      - conditions: ...
        resourceMeta: ...
        statusFeedback:
          values:
            - fieldValue:
                integer: 1
                type: Integer
              name: ReadyReplicas
            - fieldValue:
                integer: 1
                type: Integer
              name: Replicas
            - fieldValue:
                integer: 1
                type: Integer
              name: AvailableReplicas
            - fieldValue:
                string: "True"
                type: String
              name: isAvailable

Garbage collection

To ensure the resources applied by a ManifestWork are reliably recorded, the work agent creates an AppliedManifestWork on the managed cluster for each ManifestWork as an anchor for the resources relating to that ManifestWork. When the ManifestWork is deleted, the work agent runs a Foreground deletion: the ManifestWork stays in a deleting state until all of its related resources have been fully cleaned up on the managed cluster.
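
The anchor resources can be listed directly on the managed cluster; AppliedManifestWork is cluster-scoped:

# Run against the managed cluster.
$ kubectl get appliedmanifestworks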

Delete options

A user can explicitly choose not to garbage collect the applied resources when a ManifestWork is deleted by specifying a deleteOption in the ManifestWork. By default, deleteOption is set to Foreground, which means the applied resources on the spoke are deleted together with the ManifestWork. A user can set it to Orphan so that the applied resources are not deleted. Here is an example:

apiVersion: work.open-cluster-management.io/v1
kind: ManifestWork
metadata: ...
spec:
  workload: ...
  deleteOption:
    propagationPolicy: Orphan

Alternatively, a user can orphan only certain resources defined in the ManifestWork by setting deleteOption to SelectivelyOrphan. Here is an example with SelectivelyOrphan specified: it ensures the removal of the Deployment resource specified in the ManifestWork while the Service resource is kept.

apiVersion: work.open-cluster-management.io/v1
kind: ManifestWork
metadata:
  name: selective-delete-work
spec:
  workload: ...
  deleteOption:
    propagationPolicy: SelectivelyOrphan
    selectivelyOrphans:
      orphaningRules:
        - group: ""
          resource: services
          namespace: default
          name: helloworld

Resource Race and Adoption

It is possible to create two ManifestWorks for the same cluster that define the same resource. For example, a user can create two ManifestWorks on cluster1 that both contain the Deployment resource hello in the default namespace. If the content of the resource differs between them, the two ManifestWorks will fight over the resource; this is by design, since each ManifestWork is treated as equal and each declares ownership of the resource. If another controller on the managed cluster tries to manipulate a resource applied by a ManifestWork, that controller will also fight with the work agent.

When one of the ManifestWorks is deleted, the applied resource is not removed, regardless of how DeleteOption is set. The remaining ManifestWork keeps ownership of the resource.

To resolve such conflicts, the user can choose a different update strategy to alleviate resource contention.

  • CreateOnly: with this strategy, the work-agent only ensures the creation of the manifest if the resource does not exist. The work-agent does not update the resource afterwards, so the ownership of the whole resource can be taken over by another ManifestWork or controller.
  • ServerSideApply: with this strategy, the work-agent runs server-side apply for the manifest. The default field manager is work-agent and can be customized. If another ManifestWork or controller takes ownership of a certain field in the manifest, the original ManifestWork reports a conflict. The user can then prune the original ManifestWork so that it only retains the fields it should own.
  • ReadOnly: with this strategy, the work-agent does not apply the manifest onto the cluster, but it can still read resource fields and return results when feedback rules are defined. Only the metadata of the manifest needs to be defined in the spec of the ManifestWork with this strategy; a sketch follows this list.
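
A minimal sketch of the ReadOnly strategy, assuming a hello Deployment already exists on the managed cluster (the ManifestWork name here is hypothetical); only the manifest's metadata is declared, and a feedback rule reports its status:

apiVersion: work.open-cluster-management.io/v1
kind: ManifestWork
metadata:
  namespace: <target managed cluster>
  name: hello-work-readonly   # hypothetical name
spec:
  workload:
    manifests:
      - apiVersion: apps/v1
        kind: Deployment
        metadata:
          name: hello
          namespace: default
  manifestConfigs:
    - resourceIdentifier:
        group: apps
        resource: deployments
        namespace: default
        name: hello
      updateStrategy:
        type: ReadOnly
      feedbackRules:
        - type: WellKnownStatus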

An example of using the ServerSideApply strategy follows:

  1. User creates a ManifestWork with ServerSideApply specified:
apiVersion: work.open-cluster-management.io/v1
kind: ManifestWork
metadata:
  namespace: <target managed cluster>
  name: hello-work-demo
spec:
  workload: ...
  manifestConfigs:
    - resourceIdentifier:
        group: apps
        resource: deployments
        namespace: default
        name: hello
      updateStrategy:
        type: ServerSideApply
  2. User creates another ManifestWork with ServerSideApply but a different field manager:
apiVersion: work.open-cluster-management.io/v1
kind: ManifestWork
metadata:
  namespace: <target managed cluster>
  name: hello-work-replica-patch
spec:
  workload:
    manifests:
      - apiVersion: apps/v1
        kind: Deployment
        metadata:
          name: hello
          namespace: default
        spec:
          replicas: 3
  manifestConfigs:
    - resourceIdentifier:
        group: apps
        resource: deployments
        namespace: default
        name: hello
      updateStrategy:
        type: ServerSideApply
        serverSideApply:
          force: true
          fieldManager: work-agent-another

The second ManifestWork defines only replicas in the manifest, so it takes ownership of replicas. If the first ManifestWork is then updated to add the replicas field with a different value, it will get a conflict condition and the manifest will not be updated by it.

Instead of creating a second ManifestWork, the user can also set an HPA for this deployment. The HPA will likewise take ownership of replicas, and any update of the replicas field in the first ManifestWork will then return a conflict condition.
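
For illustration, a minimal sketch of such an HPA on the managed cluster targeting the hello Deployment; the name and scaling values are arbitrary assumptions:

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: hello-hpa   # hypothetical name
  namespace: default
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: hello
  minReplicas: 1    # arbitrary values for illustration
  maxReplicas: 5
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 80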

Permission setting for work agent

All workload manifests are applied to the managed cluster by the work agent, and by default the work agent has the following permissions on the managed cluster:

  • the clusterRole admin (instead of cluster-admin) to apply common Kubernetes resources
  • managing customresourcedefinitions, but not specific custom resource instances
  • managing clusterrolebindings, rolebindings, clusterroles and roles, including the bind and escalate permissions; this is why extra permissions can be granted to the work-agent service account using ManifestWork

So if the workload manifests to be applied on the managed cluster exceed the above permissions, for example some custom resource instances, an error like ... is forbidden: User "system:serviceaccount:open-cluster-management-agent:klusterlet-work-sa" cannot get resource ... will be reflected in the ManifestWork status.
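
Such errors surface in the status conditions of the ManifestWork on the hub and can be inspected with, for example:

# Run against the hub cluster; <cluster namespace> and <name> are placeholders.
$ kubectl -n <cluster namespace> get manifestwork <name> -o yaml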

To prevent this, the klusterlet-work-sa service account used by the work-agent needs to be granted the corresponding permissions. There are several ways to do this:

  • add the permission on the managed cluster directly; we can
    • aggregate a new clusterRole for the to-be-applied resources to the existing admin clusterRole
    • OR create a role/clusterRole and roleBinding/clusterRoleBinding for the klusterlet-work-sa service account
  • add the permission on the hub cluster with another ManifestWork that includes
    • a clusterRole with the label "open-cluster-management.io/aggregate-to-work": "true" for the to-be-applied resources; the rules defined in that clusterRole will be aggregated to the work agent (OCM version >= v0.12.0)
    • OR a role/clusterRole and roleBinding/clusterRoleBinding for the klusterlet-work-sa service account

Below is an example of using ManifestWork to give klusterlet-work-sa permission for the resource machines.cluster.x-k8s.io.

  • Option 1: Use aggregated clusterRole
apiVersion: work.open-cluster-management.io/v1
kind: ManifestWork
metadata:
  namespace: cluster1
  name: permission-set
spec:
  workload:
    manifests:
      - apiVersion: rbac.authorization.k8s.io/v1
        kind: ClusterRole
        metadata:
          name: open-cluster-management:klusterlet-work:my-role
          labels:
            open-cluster-management.io/aggregate-to-work: "true"  # with this label, the clusterRole will be selected for aggregation
        rules:
          # Allow the agent to manage machines
          - apiGroups: ["cluster.x-k8s.io"]
            resources: ["machines"]
            verbs: ["get", "list", "watch", "create", "update", "patch", "delete"]
  • Option 2: Use clusterRole and clusterRoleBinding
apiVersion: work.open-cluster-management.io/v1
kind: ManifestWork
metadata:
  namespace: cluster1
  name: permission-set
spec:
  workload:
    manifests:
      - apiVersion: rbac.authorization.k8s.io/v1
        kind: ClusterRole
        metadata:
          name: open-cluster-management:klusterlet-work:my-role
        rules:
          # Allow the agent to manage machines
          - apiGroups: ["cluster.x-k8s.io"]
            resources: ["machines"]
            verbs: ["get", "list", "watch", "create", "update", "patch", "delete"]
      - apiVersion: rbac.authorization.k8s.io/v1
        kind: ClusterRoleBinding
        metadata:
          name: open-cluster-management:klusterlet-work:my-binding
        roleRef:
          apiGroup: rbac.authorization.k8s.io
          kind: ClusterRole
          name: open-cluster-management:klusterlet-work:my-role
        subjects:
          - kind: ServiceAccount
            name: klusterlet-work-sa
            namespace: open-cluster-management-agent

Treating defaulting/immutable fields in API

The kube-apiserver sets defaulting/immutable fields for some APIs if the user does not set them, and deploying these APIs with ManifestWork may fail. This is because in the reconcile loop, the work agent tries to update the immutable or defaulted field after comparing the desired manifest in the ManifestWork with the existing resource in the cluster, and the update either fails or does not take effect.

Let’s use Job as an example. The kube-apiserver sets a default selector and label on the Pods of a Job if the user does not set spec.selector in the Job. These fields are immutable, so the ManifestWork will report AppliedManifestFailed when we apply a Job without spec.selector using ManifestWork.

apiVersion: work.open-cluster-management.io/v1
kind: ManifestWork
metadata:
  namespace: cluster1
  name: example-job
spec:
  workload:
    manifests:
      - apiVersion: batch/v1
        kind: Job
        metadata:
          name: pi
          namespace: default
        spec:
          template:
            spec:
              containers:
                - name: pi
                  image: perl:5.34.0
                  command: ["perl", "-Mbignum=bpi", "-wle", "print bpi(2000)"]
              restartPolicy: Never
          backoffLimit: 4

There are two options to fix this issue.

  1. Specify the fields manually if they are configurable. For example, set spec.manualSelector=true and your own labels in spec.selector of the Job, and set the same labels on the pod template.
apiVersion: work.open-cluster-management.io/v1
kind: ManifestWork
metadata:
  namespace: cluster1
  name: example-job-1
spec:
  workload:
    manifests:
      - apiVersion: batch/v1
        kind: Job
        metadata:
          name: pi
          namespace: default
        spec:
          manualSelector: true
          selector:
            matchLabels:
              job: pi
          template:
            metadata:
              labels:
                job: pi
            spec:
              containers:
                - name: pi
                  image: perl:5.34.0
                  command: ["perl", "-Mbignum=bpi", "-wle", "print bpi(2000)"]
              restartPolicy: Never
          backoffLimit: 4
  2. Set the updateStrategy to ServerSideApply in the ManifestWork for the API.
apiVersion: work.open-cluster-management.io/v1
kind: ManifestWork
metadata:
  namespace: cluster1
  name: example-job
spec:
  manifestConfigs:
    - resourceIdentifier:
        group: batch
        resource: jobs
        namespace: default
        name: pi
      updateStrategy:
        type: ServerSideApply
  workload:
    manifests:
      - apiVersion: batch/v1
        kind: Job
        metadata:
          name: pi
          namespace: default
        spec:
          template:
            spec:
              containers:
                - name: pi
                  image: perl:5.34.0
                  command: ["perl", "-Mbignum=bpi", "-wle", "print bpi(2000)"]
              restartPolicy: Never
          backoffLimit: 4

Dynamic identity authorization

By default, all manifests in a ManifestWork are applied by the work-agent using its mounted service account to make requests against the managed cluster. The work agent has very high permissions on the managed cluster, which means that any hub user with write access to ManifestWork resources can dispatch any resources that the work-agent can manipulate to the managed cluster.

The executor subject feature (introduced in release 0.9.0) provides a way to clarify the owner identity (executor) of the ManifestWork before it takes effect, so that we can explicitly check whether the executor has sufficient permissions on the managed cluster.

The following example declares the owner “executor1” of the ManifestWork, so before the work-agent applies the “default/test” ConfigMap to the managed cluster, it first checks whether the ServiceAccount “default/executor1” has permission to apply this ConfigMap:

apiVersion: work.open-cluster-management.io/v1
kind: ManifestWork
metadata:
  namespace: cluster1
  name: example-manifestwork
spec:
  executor:
    subject:
      type: ServiceAccount
      serviceAccount:
        namespace: default
        name: executor1
  workload:
    manifests:
      - apiVersion: v1
        data:
          a: b
        kind: ConfigMap
        metadata:
          namespace: default
          name: test

Not every hub user can specify any executor at will. Hub users can only use executors for which they have the execute-as (virtual verb) permission. For example, hub users bound to the following Role can use the “executor1” ServiceAccount in the “default” namespace on the managed cluster.

apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: cluster1-executor1
  namespace: cluster1
rules:
  - apiGroups:
      - work.open-cluster-management.io
    resources:
      - manifestworks
    verbs:
      - execute-as
    resourceNames:
      - system:serviceaccount:default:executor1
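
For completeness, a sketch of a RoleBinding that grants this Role to a hub user; the user name alice is hypothetical:

apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: cluster1-executor1-binding   # hypothetical name
  namespace: cluster1
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: cluster1-executor1
subjects:
  - apiGroup: rbac.authorization.k8s.io
    kind: User
    name: alice   # hypothetical hub user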

For backward compatibility, if the executor is absent, the work agent keeps using the mounted service account to apply resources. However, using the executor is encouraged, so there is a feature gate NilExecutorValidating to control whether a hub user is allowed to leave the executor unset. It is disabled by default; we can use the following ClusterManager configuration to enable it. When it is enabled, not setting the executor is treated as using the “/klusterlet-work-sa” (namespace empty, name klusterlet-work-sa) virtual service account on the managed cluster for permission verification, which means only hub users with execute-as permission on manifestworks with the resource name “system:serviceaccount::klusterlet-work-sa” are allowed to leave the executor unset.

spec:
  workConfiguration:
    featureGates:
      - feature: NilExecutorValidating
        mode: Enable

The work-agent uses the SubjectAccessReview API to check whether an executor has permission on the manifest resources, which can cause a large number of SAR requests against the managed cluster API server. A feature gate ExecutorValidatingCaches (added in release 0.10.0) therefore caches the results of the executor’s permission checks on manifest resources. It only works when the managed cluster uses RBAC mode authorization and is also disabled by default, but it can be enabled with the following Klusterlet configuration:

spec:
  workConfiguration:
    featureGates:
      - feature: ExecutorValidatingCaches
        mode: Enable

Enhancement proposal: Work Executor Group

2 - ManifestWorkReplicaSet

What is ManifestWorkReplicaSet

ManifestWorkReplicaSet is an aggregator API that uses ManifestWork and Placement to create ManifestWorks for the clusters selected by placements.

The following example shows a ManifestWorkReplicaSet that deploys a CronJob and a Namespace to a group of clusters selected by placements.

apiVersion: work.open-cluster-management.io/v1alpha1
kind: ManifestWorkReplicaSet
metadata:
  name: mwrset-cronjob
  namespace: ocm-ns
spec:
  placementRefs:
    - name: placement-rollout-all                    # Name of a created Placement
      rolloutStrategy:
        rolloutType: All
    - name: placement-rollout-progressive            # Name of a created Placement
      rolloutStrategy:
        rolloutType: Progressive
        progressive:
          minSuccessTime: 5m
          progressDeadline: 10m
          maxFailures: 5%
          mandatoryDecisionGroups:
            - groupName: "prod-canary-west"
            - groupName: "prod-canary-east"
    - name: placement-rollout-progressive-per-group  # Name of a created Placement
      rolloutStrategy:
        rolloutType: ProgressivePerGroup
        progressivePerGroup:
          progressDeadline: 10m
          maxFailures: 2
  manifestWorkTemplate:
    deleteOption:
      propagationPolicy: SelectivelyOrphan
      selectivelyOrphans:
        orphaningRules:
          - group: ''
            name: ocm-ns
            namespace: ''
            resource: Namespace
    manifestConfigs:
      - feedbackRules:
          - jsonPaths:
              - name: lastScheduleTime
                path: .status.lastScheduleTime
              - name: lastSuccessfulTime
                path: .status.lastSuccessfulTime
            type: JSONPaths
        resourceIdentifier:
          group: batch
          name: sync-cronjob
          namespace: ocm-ns
          resource: cronjobs
    workload:
      manifests:
        - kind: Namespace
          apiVersion: v1
          metadata:
            name: ocm-ns
        - kind: CronJob
          apiVersion: batch/v1
          metadata:
            name: sync-cronjob
            namespace: ocm-ns
          spec:
            schedule: '* * * * *'
            concurrencyPolicy: Allow
            suspend: false
            jobTemplate:
              spec:
                backoffLimit: 2
                template:
                  spec:
                    containers:
                      - name: hello
                        image: 'quay.io/prometheus/busybox:latest'
                        args:
                          - /bin/sh
                          - '-c'
                          - date; echo Hello from the Kubernetes cluster

The placementRefs use the Rollout Strategy API to apply the manifestWorkTemplate to the selected clusters. In the example above, the placementRefs refer to three placements: placement-rollout-all, placement-rollout-progressive and placement-rollout-progressive-per-group. For more information regarding the rollout strategies, check the Rollout Strategy section of the placement documentation. Note: the placement references must be in the same namespace as the ManifestWorkReplicaSet.
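
For reference, a minimal sketch of what one of the referenced placements could look like; it assumes a ManagedClusterSet has already been bound to the ocm-ns namespace, and the cluster set name is a placeholder:

apiVersion: cluster.open-cluster-management.io/v1beta1
kind: Placement
metadata:
  name: placement-rollout-all
  namespace: ocm-ns
spec:
  clusterSets:
    - <your clusterset>   # placeholder for a bound ManagedClusterSet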

Status tracking

The ManifestWorkReplicaSet example above refers to three placements; each one has its own placementSummary in the ManifestWorkReplicaSet status. The placementSummary shows the number of ManifestWorks applied to the placement's clusters, based on the placementRef's rolloutStrategy, and the total number of clusters. The ManifestWorkReplicaSet summary aggregates the placementSummaries, showing the total number of applied ManifestWorks across all clusters.

The ManifestWorkReplicaSet has three status conditions:

  1. PlacementVerified verifies the placementRefs status, e.g. a placement that does not exist or an empty cluster selection.
  2. PlacementRolledOut verifies the rollout strategy status: progressing or complete.
  3. ManifestWorkApplied verifies the created ManifestWork status: applied, progressing, degraded or available.

The ManifestWorkReplicaSet determines the ManifestWorkApplied condition status based on the resource state (applied or available) of each ManifestWork.

Here is an example.

apiVersion: work.open-cluster-management.io/v1alpha1
kind: ManifestWorkReplicaSet
metadata:
  name: mwrset-cronjob
  namespace: ocm-ns
spec:
  placementRefs:
    - name: placement-rollout-all
      ...
    - name: placement-rollout-progressive
      ...
    - name: placement-rollout-progressive-per-group
      ...
  manifestWorkTemplate:
    ...
status:
  conditions:
    - lastTransitionTime: '2023-04-27T02:30:54Z'
      message: ''
      reason: AsExpected
      status: 'True'
      type: PlacementVerified
    - lastTransitionTime: '2023-04-27T02:30:54Z'
      message: ''
      reason: Progressing
      status: 'False'
      type: PlacementRolledOut
    - lastTransitionTime: '2023-04-27T02:30:54Z'
      message: ''
      reason: AsExpected
      status: 'True'
      type: ManifestworkApplied
  placementSummary:
    - name: placement-rollout-all
      availableDecisionGroups: 1 (10 / 10 clusters applied)
      summary:
        applied: 10
        available: 10
        progressing: 0
        degraded: 0
        total: 10
    - name: placement-rollout-progressive
      availableDecisionGroups: 3 (20 / 30 clusters applied)
      summary:
        applied: 20
        available: 20
        progressing: 0
        degraded: 0
        total: 20
    - name: placement-rollout-progressive-per-group
      availableDecisionGroups: 4 (15 / 20 clusters applied)
      summary:
        applied: 15
        available: 15
        progressing: 0
        degraded: 0
        total: 15
  summary:
    applied: 45
    available: 45
    progressing: 0
    degraded: 0
    total: 45

Release and Enable Feature

ManifestWorkReplicaSet is an alpha release and is not enabled by default. In order to enable the ManifestWorkReplicaSet feature, it has to be enabled in the cluster-manager instance on the hub. Use the following command to edit the cluster-manager CR (custom resource) on the hub cluster:

$ oc edit ClusterManager cluster-manager

Add the workConfiguration field to the cluster-manager CR as shown below and save:

kind: ClusterManager
metadata:
  name: cluster-manager
spec:
  ...
  workConfiguration:
    featureGates:
      - feature: ManifestWorkReplicaSet
        mode: Enable

To verify that the ManifestWorkReplicaSet feature has been enabled successfully, check the cluster-manager using the command below:

$ oc get ClusterManager cluster-manager -o yaml

Under status -> generations you should find that the cluster-manager-work-controller deployment has been added, as below:

kind: ClusterManager
metadata:
  name: cluster-manager
spec: ...
status:
  ...
  generations:
    ...
    - group: apps
      lastGeneration: 2
      name: cluster-manager-work-webhook
      namespace: open-cluster-management-hub
      resource: deployments
      version: v1
    - group: apps
      lastGeneration: 1
      name: cluster-manager-work-controller
      namespace: open-cluster-management-hub
      resource: deployments
      version: v1