Improve metrics building performance, limit endpoints to whitelist #1616

timvisee · 2023-03-28T13:49:23Z

Improves #1541 with suggestions from #1541 (comment).

This improves upon the existing metrics system. The metrics output is now collected and formatted more efficiently: values are written directly instead of going through a registry.

It now uses a whitelist to report only the most significant endpoints, a selection of search, recommend and upsert endpoints:

const REST_ENDPOINT_WHITELIST: &[&str] = &[
    "/collections/{name}/index",
    "/collections/{name}/points",
    "/collections/{name}/points/payload",
    "/collections/{name}/points/recommend",
    "/collections/{name}/points/recommend/batch",
    "/collections/{name}/points/search",
    "/collections/{name}/points/search/batch",
];
const GRPC_ENDPOINT_WHITELIST: &[&str] = &[
    "/qdrant.Points/OverwritePayload",
    "/qdrant.Points/Recommend",
    "/qdrant.Points/RecommendBatch",
    "/qdrant.Points/Search",
    "/qdrant.Points/SearchBatch",
    "/qdrant.Points/SetPayload",
    "/qdrant.Points/Upsert",
];
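The later note in this PR that the whitelist must be sorted suggests a binary search lookup. The following is a minimal sketch under that assumption, not the code from this PR: `is_whitelisted` and `is_strictly_sorted` are hypothetical helper names, and only the REST list is repeated here.

```rust
// Copied from the PR description above; must stay sorted for binary search.
const REST_ENDPOINT_WHITELIST: &[&str] = &[
    "/collections/{name}/index",
    "/collections/{name}/points",
    "/collections/{name}/points/payload",
    "/collections/{name}/points/recommend",
    "/collections/{name}/points/recommend/batch",
    "/collections/{name}/points/search",
    "/collections/{name}/points/search/batch",
];

/// Hypothetical helper: true if `endpoint` is in the sorted `whitelist`.
/// Binary search gives wrong answers on an unsorted slice, hence the
/// sorting requirement.
fn is_whitelisted(whitelist: &[&str], endpoint: &str) -> bool {
    whitelist.binary_search(&endpoint).is_ok()
}

/// Hypothetical guard: true if the list is sorted and has no duplicates.
fn is_strictly_sorted(whitelist: &[&str]) -> bool {
    whitelist.windows(2).all(|pair| pair[0] < pair[1])
}
```

A check such as `is_strictly_sorted(REST_ENDPOINT_WHITELIST)` could live in a unit test so that a new, misplaced entry is caught before it silently disappears from the metrics output.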

This also:

- limits REST timing output to requests with an HTTP 200 response
- sets the Content-Type header for /metrics
- fixes an incorrect value for the cluster commit metric
- adds a basic OpenAPI test for /metrics
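As an illustration of the first point, the sketch below keeps only the records with an HTTP 200 status before timings are reported. The `ResponseStat` struct, its field names and `timed_stats` are assumptions made for this example and do not reflect the actual types in the codebase.

```rust
/// Hypothetical per-endpoint record; the field names are assumptions.
#[derive(Debug, Clone, PartialEq)]
struct ResponseStat {
    endpoint: &'static str,
    status: u16,
    count: u64,
    avg_duration_secs: f64,
}

/// Keeps only the records whose timings would be reported,
/// that is, responses with HTTP status 200.
fn timed_stats(stats: &[ResponseStat]) -> Vec<&ResponseStat> {
    stats.iter().filter(|stat| stat.status == 200).collect()
}
```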

Snippet of output with whitelisting:

# HELP app_info information about qdrant server
# TYPE app_info counter
app_info{name="qdrant",version="0.11.1"} 1
# HELP collections_total number of collections
# TYPE collections_total gauge
collections_total 4
# HELP cluster_enabled is cluster support enabled
# TYPE cluster_enabled counter
cluster_enabled 0
# HELP rest_responses_total total number of responses
# TYPE rest_responses_total counter
rest_responses_total{method="PUT",endpoint="/collections/{name}/points",status="200"} 5
rest_responses_total{method="POST",endpoint="/collections/{name}/points/search/batch",status="200"} 5
rest_responses_total{method="POST",endpoint="/collections/{name}/points/search",status="200"} 10
rest_responses_total{method="PUT",endpoint="/collections/{name}/index",status="200"} 15
rest_responses_total{method="POST",endpoint="/collections/{name}/points",status="200"} 5
# HELP rest_responses_fail_total total number of failed responses
# TYPE rest_responses_fail_total counter
rest_responses_fail_total{method="PUT",endpoint="/collections/{name}/points",status="200"} 0
rest_responses_fail_total{method="POST",endpoint="/collections/{name}/points/search/batch",status="200"} 0
rest_responses_fail_total{method="POST",endpoint="/collections/{name}/points/search",status="200"} 0
rest_responses_fail_total{method="PUT",endpoint="/collections/{name}/index",status="200"} 0
rest_responses_fail_total{method="POST",endpoint="/collections/{name}/points",status="200"} 0
# HELP rest_responses_avg_duration_seconds average response duration
# TYPE rest_responses_avg_duration_seconds gauge
rest_responses_avg_duration_seconds{method="PUT",endpoint="/collections/{name}/points",status="200"} 0.008645599609375
rest_responses_avg_duration_seconds{method="POST",endpoint="/collections/{name}/points/search/batch",status="200"} 0.0003522000122070312
rest_responses_avg_duration_seconds{method="POST",endpoint="/collections/{name}/points/search",status="200"} 0.0004141000061035156
rest_responses_avg_duration_seconds{method="PUT",endpoint="/collections/{name}/index",status="200"} 0.0002236666717529297
rest_responses_avg_duration_seconds{method="POST",endpoint="/collections/{name}/points",status="200"} 0.000155
# HELP rest_responses_min_duration_seconds minimum response duration
# TYPE rest_responses_min_duration_seconds gauge
rest_responses_min_duration_seconds{method="PUT",endpoint="/collections/{name}/points",status="200"} 0.006217
rest_responses_min_duration_seconds{method="POST",endpoint="/collections/{name}/points/search/batch",status="200"} 0.000283
rest_responses_min_duration_seconds{method="POST",endpoint="/collections/{name}/points/search",status="200"} 0.000337
rest_responses_min_duration_seconds{method="PUT",endpoint="/collections/{name}/index",status="200"} 0.000145
rest_responses_min_duration_seconds{method="POST",endpoint="/collections/{name}/points",status="200"} 0.000142
# HELP rest_responses_max_duration_seconds maximum response duration
# TYPE rest_responses_max_duration_seconds gauge
rest_responses_max_duration_seconds{method="PUT",endpoint="/collections/{name}/points",status="200"} 0.012047
rest_responses_max_duration_seconds{method="POST",endpoint="/collections/{name}/points/search/batch",status="200"} 0.000429
rest_responses_max_duration_seconds{method="POST",endpoint="/collections/{name}/points/search",status="200"} 0.000577
rest_responses_max_duration_seconds{method="PUT",endpoint="/collections/{name}/index",status="200"} 0.00032
rest_responses_max_duration_seconds{method="POST",endpoint="/collections/{name}/points",status="200"} 0.00017
# HELP grpc_responses_total total number of responses
# TYPE grpc_responses_total counter
grpc_responses_total{endpoint="/qdrant.Points/Recommend"} 8
grpc_responses_total{endpoint="/qdrant.Points/Upsert"} 8
grpc_responses_total{endpoint="/qdrant.Points/Search"} 32
# HELP grpc_responses_fail_total total number of failed responses
# TYPE grpc_responses_fail_total counter
grpc_responses_fail_total{endpoint="/qdrant.Points/Recommend"} 0
grpc_responses_fail_total{endpoint="/qdrant.Points/Upsert"} 0
grpc_responses_fail_total{endpoint="/qdrant.Points/Search"} 0
# HELP grpc_responses_avg_duration_seconds average response duration
# TYPE grpc_responses_avg_duration_seconds gauge
grpc_responses_avg_duration_seconds{endpoint="/qdrant.Points/Recommend"} 0.0001723572998046875
grpc_responses_avg_duration_seconds{endpoint="/qdrant.Points/Upsert"} 0.001592212890625
grpc_responses_avg_duration_seconds{endpoint="/qdrant.Points/Search"} 0.0005532779541015625
# HELP grpc_responses_min_duration_seconds minimum response duration
# TYPE grpc_responses_min_duration_seconds gauge
grpc_responses_min_duration_seconds{endpoint="/qdrant.Points/Recommend"} 0.000151
grpc_responses_min_duration_seconds{endpoint="/qdrant.Points/Upsert"} 0.001379
grpc_responses_min_duration_seconds{endpoint="/qdrant.Points/Search"} 0.000419
# HELP grpc_responses_max_duration_seconds maximum response duration
# TYPE grpc_responses_max_duration_seconds gauge
grpc_responses_max_duration_seconds{endpoint="/qdrant.Points/Recommend"} 0.000292
grpc_responses_max_duration_seconds{endpoint="/qdrant.Points/Upsert"} 0.001893
grpc_responses_max_duration_seconds{endpoint="/qdrant.Points/Search"} 0.000765
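The output above is in the Prometheus text exposition format: a `# HELP` line, a `# TYPE` line, then one sample per line. Writing values directly, without a registry, could look roughly like the sketch below. `write_metric_family` and its parameters are hypothetical and are not taken from this PR.

```rust
use std::fmt::Write;

/// Hypothetical helper: appends one metric family to `out`.
/// `samples` pairs a pre-formatted label string (may be empty) with a value.
fn write_metric_family(
    out: &mut String,
    name: &str,
    help: &str,
    kind: &str,
    samples: &[(&str, f64)],
) {
    // Writing into a `String` cannot fail, so the results are ignored.
    let _ = writeln!(out, "# HELP {name} {help}");
    let _ = writeln!(out, "# TYPE {name} {kind}");
    for (labels, value) in samples {
        if labels.is_empty() {
            let _ = writeln!(out, "{name} {value}");
        } else {
            // `{{` and `}}` emit literal braces around the label set.
            let _ = writeln!(out, "{name}{{{labels}}} {value}");
        }
    }
}
```

Appending to a single `String` avoids building intermediate metric objects, which is one plausible way to get the performance improvement described in this PR.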

All Submissions:

Have you followed the guidelines in our Contributing document?
Have you checked to ensure there aren't other open Pull Requests for the same update/change?

Changes to Core Features:

Have you added an explanation of what your changes do and why you'd like us to include them?
Have you written new tests for your core changes, as applicable?
Have you successfully run tests with your changes locally?

ffuugoo

LGTM. Great job, @timvisee! 👍

timvisee · 2023-03-29T11:07:27Z

Added the sorting comment. All is green. Merging now.

…1616)

* Fix incorrect metrics value for cluster commit
* Rewrite metrics logic, don't use registry, write values directly
* Only report REST timings for requests having HTTP 200 response
* Limit metrics reporting of endpoints to whitelist
  The whitelist contains a selection of search, recommend and upsert endpoints.
* Add MetricsParam, remove detail level, keep anonymize
* Request metrics in basic API test
* Specify content type for metrics endpoint
* Add OpenAPI test for metrics endpoint, remove from basic API test
  This test probes for some strings that must exist in the output
* Add note that metrics endpoint whitelist must be sorted

timvisee added 8 commits March 28, 2023 15:41

Fix incorrect metrics value for cluster commit

de83241

Rewrite metrics logic, don't use registry, write values directly

4bd2239

Only report REST timings for requests having HTTP 200 response

00386c8

Limit metrics reporting of endpoints to whitelist

6c40282

The whitelist contains a selection of search, recommend and upsert endpoints.

Add MetricsParam, remove detail level, keep anonymize

212b817

Request metrics in basic API test

ba9af4d

Specify content type for metrics endpoint

51a54c9

Add OpenAPI test for metrics endpoint, remove from basic API test

40d871d

This test probes for some strings that must exist in the output
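A test that probes the output for required strings can be sketched as follows. This is an illustration in Rust only: `missing_metrics` is a hypothetical helper, and the actual OpenAPI test in the repository may be written differently.

```rust
/// Hypothetical check: returns every expected substring that is
/// absent from the metrics `output`.
fn missing_metrics<'a>(output: &str, expected: &[&'a str]) -> Vec<&'a str> {
    expected
        .iter()
        .copied()
        .filter(|needle| !output.contains(*needle))
        .collect()
}
```

A test would then assert that the returned list is empty for names such as `app_info` and `collections_total`, which appear in the sample output in the PR description.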

timvisee changed the title ~~Draft: Improve metrics building performance, limit endpoints to whitelist~~ Improve metrics building performance, limit endpoints to whitelist Mar 28, 2023

timvisee requested review from ffuugoo and generall March 28, 2023 15:51

timvisee marked this pull request as ready for review March 28, 2023 15:51

ffuugoo reviewed Mar 29, 2023

src/common/metrics.rs (outdated, resolved)

ffuugoo approved these changes Mar 29, 2023

Add note that metrics endpoint whitelist must be sorted

0112a29

timvisee merged commit 91a0200 into dev Mar 29, 2023

timvisee mentioned this pull request Mar 29, 2023

Add aggregated vector count to prometheus /metrics endpoint #1598

Merged

8 tasks

generall mentioned this pull request Apr 19, 2023

upd wal commit #1749

Closed

8 tasks

agourlay deleted the improve-metrics branch July 12, 2023 15:47

frittentheke mentioned this pull request Jul 18, 2024

Some metrics (like cluster_enabled) are wrongly typed as COUNTER instead of GAUGE // _total postfix only to be used for counters #4696

Open
