-
Notifications
You must be signed in to change notification settings - Fork 789
feat(provider): add External Metrics provider#1863
feat(provider): add External Metrics provider#1863jlore-decathlon wants to merge 6 commits intofluxcd:mainfrom
Conversation
Proposed addition
The current Datadog metric provider relies on their Metric API.
However, this API has pretty low rate limits, and people with a moderately sized infrastructure tend to reach these limits quite easily when scaling their usage of Flagger or datadog-based autoscaling (like KEDA).
Datadog offers a more scalable alternative by making its Cluster Agent batch requests by groups of 35 see Cluster Agent Autoscaling Metrics. It then makes these metrics available within the cluster by exposing an endpoint following Kubernetes External Metrics API.
Note
This endpoint is not documented by Datadog, as they expect people to have the agent register against the control plane as the cluster's external metrics provider and then making these metrics available through k8s API Server, removing the need to query the endpoint directly.
However, by implementing a kubernetes API, its behavior is predictable and stable enough to be used directly.
We've relied on the way KEDA implemented a similar feature during design and implementation. However, Flagger is not an autoscaling solution so we're not going to mimic the metric proxy Keda operates. We simply propose to query the external metric server directly. By doing this, we also chose to make the provider generic and compatible with any external metrics server. The downside is that we cannot abstract the way datadog names its metrics which isn't trivial.
fix: #1235
Any alternatives you've considered?
We've pondered modifying the Datadog metric provider instead of making an external metrics provider. But we felt that this had the benefit of making other external metric providers compatible and kept the code datadog-agnostic.
We could theoretically make it even more generic and use any kubernetes metric API (standard, Custom or External), but I think Flagger already offers this
Disclaimer
- This PR was peer programmed with my colleague @mveroone.
- We're not Go developers, yet we did our best to follow the project's coding conventions and guidelines. Any feedback is welcome and we'll be happy to rework any part of that contribution you think needs it.
- AI disclosure : AIL 1 (see https://danielmiessler.com/blog/ai-influence-level-ail). Minor code autocomplete, but mostly manual coding and writing.
- We've built the docker image and tested end-to-end on one of our GKE clusters against Datadog cluster agent endpoint. it seems to work as expected, but the lack of feedback from Flagger leaves room for some doubts, but we were unsure if adding some logging was a good idea.
85c595d to
139a34a
Compare
3757b5a to
72ad54a
Compare
72ad54a to
9925699
Compare
implementations. Datadog Cluster Agent can batch metric queries
and expose them through an endpoint compatible with Kubernetes External
Metrics API.
This implementations allows to use this endpoint and any other server
implementing Kubernetes External Metrics API. Including k8s API server
itself.
Co-authored-by: Johan Lore
Co-authored-by: Maxime Veroone
Signed-off-by: Johan Lore
9925699 to
2ee47e0
Compare
31f6e7d to
eeeccfc
Compare
eeeccfc to
86cc361
Compare
aryan9600
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thank you for opening this PR!
Signed-off-by: Maxime Veroone
eb8b59f to
2e0a69c
Compare
|
Note : if that's okay, we'll squash commits after a few rounds of review so we can fix the DCO |
| restConfig.TLSClientConfig = rest.TLSClientConfig{ | ||
| Insecure: provider.InsecureSkipVerify, | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is there a reason why we override the entire tls client config here with an insecure tls configuration?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
None, just lack of golang skills from my part. let me know if the fix is the right one.
| if tokenBytes, ok := credentials["token"]; ok { | ||
| restConfig.BearerToken = string(tokenBytes) | ||
| } | ||
| // TODO: handle user name/password auth if needed |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lets address this TODO
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm choosing to just remove it.
ExternalMetrics Servers use the same principle as Kubernetes API Server, where Bearer Token auth is standard. No known need for user/password auth so far, so let's not handle needs we don't know we have.
| // at the ExternalMetricsProvider's Address, using the provided query string, | ||
| // and returns the *first* result as a float64. | ||
| func (p *ExternalMetricsProvider) RunQuery(query string) (float64, error) { | ||
| // The Provider interface only allows a plain string query so decode it |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| // The Provider interface only allows a plain string query so decode it |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done in my commit
| // where only the metricName is required. | ||
| // and returns the namespace, metricName, and labelSelector separately. | ||
| func parseExternalMetricsQuery(query string) (namespace string, metricName string, labelSelector labels.Selector, err error) { | ||
| u, err := url.Parse("dummy:///" + query) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can we add a comment explaining why we add a dummy scheme here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
|
|
||
| // ExternalMetricsProvider fetches metrics from an ExternalMetricsProvider. | ||
| type ExternalMetricsProvider struct { | ||
| timeout time.Duration |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we need to set the rest.Config's Timeout field to this value - this timeout gets used internally for the http client used to send requests.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Indeed, I overlooked it.
However, I'm a little uncomfortable in hardcoding it, although since we're limited by the Provider Interface / API, which I tried not to extend in this PR, that seems like an acceptable trade-off. WDYT?
Also I'm not sure if I should test it ? if it's part of the restClient, it's already tested ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
no need to test it. i understand your qualms about hardcoding it but as you said we're limited by the Provider interface so we just need to swallow this poor design. if you're up for it, please feel free to raise another PR which adds a timeout field to the Provider APi and modify all other providers to use that over the hardcoded default.
Signed-off-by: Maxime Veroone
* Adjusted a few comments
* Fixed a wrongful override of TLSClient
aryan9600
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
looking good - a few nits
| if err != nil || restConfig == nil { | ||
| return nil, fmt.Errorf("Not in a kubernetes cluster: %w", err) | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| if err != nil || restConfig == nil { | |
| return nil, fmt.Errorf("Not in a kubernetes cluster: %w", err) | |
| } | |
| if err != nil { | |
| return nil, fmt.Errorf("not in a kubernetes cluster: %w", err) | |
| } | |
| if restConfig == nil { | |
| return nil, fmt.Errorf("not in a kubernetes cluster: rest config is nil") | |
| } |
| InsecureSkipVerify bool | ||
| creds map[string][]byte | ||
| builderFunc func() (*rest.Config, error) | ||
| wantErr bool |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| wantErr bool |
no point of this field if we are not testing the error path
| client: &fakeExternalMetricsClient, | ||
| } | ||
|
|
||
| tests := []struct { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lets add one test for the error path
| metadata: | ||
| name: external-metric-server-token | ||
| namespace: default | ||
| data: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| data: | |
| stringData: |