cAdvisor Configuration

1 Overview

cAdvisor Runtime Options

https://github.com/google/cadvisor/blob/v0.50.0/docs/runtime_options.md

This document describes the set of runtime flags available in cAdvisor.

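2 Container labels

--store_container_labels=false: do not convert container labels and environment variables into labels on the Prometheus metrics for each container.
--whitelisted_container_labels: comma-separated list of container labels to convert into labels on the Prometheus metrics for each container. store_container_labels must be set to false for this to take effect.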

3 Container env

--env_metadata_whitelist: comma-separated list of environment variable keys to collect for containers. Only the containerd and docker runtimes are supported at present.

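4 Limiting which containers are monitored

--docker_only=false: when enabled, do not report raw cgroup metrics, except for the root cgroup.
--raw_cgroup_prefix_whitelist: comma-separated list of cgroup path prefixes that must be collected even when --docker_only is specified.
--disable_root_cgroup_stats=false: disable collecting root cgroup stats.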

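5 Container hints

Container hints are a way to pass extra information about a container to cAdvisor, which lets cAdvisor augment the stats it gathers. For details on the container hints format, see its definition. Note that container hints are currently used only by the raw container driver.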

--container_hints="/etc/cadvisor/container_hints.json": location of the container hints file

6 CPU

--enable_load_reader=false: Whether to enable cpu load reader
--max_procs=0: max number of CPUs that can be used simultaneously. Less than 1 for default (number of cores).

7 Debugging and logging

cAdvisor-native flags that help with debugging:

--log_backtrace_at="": when logging hits line file:N, emit a stack trace
--log_cadvisor_usage=false: Whether to log the usage of the cAdvisor container
--version=false: print cAdvisor version and exit
--profiling=false: Enable profiling via web interface host:port/debug/pprof/

Some flags from glog that we find useful:

--log_dir="": If non-empty, write log files in this directory
--logtostderr=false: log to standard error instead of files
--alsologtostderr=false: log to standard error as well as files
--stderrthreshold=0: logs at or above this threshold go to stderr
--v=0: log level for V logs
--vmodule=: comma-separated list of pattern=N settings for file-filtered logging

8 Docker

--docker="unix:///var/run/docker.sock": docker endpoint (default "unix:///var/run/docker.sock")
--docker_root="/var/lib/docker": DEPRECATED: docker root is read from docker info (this is a fallback, default: /var/lib/docker) (default "/var/lib/docker")
--docker-tls: use TLS to connect to docker
--docker-tls-cert="cert.pem": client certificate for TLS-connection with docker
--docker-tls-key="key.pem": private key for TLS-connection with docker
--docker-tls-ca="ca.pem": trusted CA for TLS-connection with docker

9 Podman

--podman="unix:///var/run/podman/podman.sock": podman endpoint (default "unix:///var/run/podman/podman.sock")

10 Housekeeping

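Housekeeping is the set of periodic actions cAdvisor performs. During these actions, cAdvisor gathers container stats. The flags below control how and when cAdvisor performs housekeeping.

10.1 Dynamic housekeeping

Dynamic housekeeping intervals let cAdvisor vary how often it gathers stats, depending on how active a container is. Turning this off gives predictable housekeeping intervals but increases cAdvisor's resource usage.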

--allow_dynamic_housekeeping=true: Whether to allow the housekeeping interval to be dynamic

10.2 Housekeeping intervals

cAdvisor has two kinds of housekeeping: global and per-container.

Global housekeeping is a single housekeeping pass done once for the whole of cAdvisor. It typically detects new containers. Today cAdvisor discovers new containers through kernel events, so global housekeeping mostly serves as a backup in case any events were missed.

Per-container housekeeping runs once for each container cAdvisor tracks. It typically fetches container stats.

--global_housekeeping_interval=1m0s: Interval between global housekeepings
--housekeeping_interval=1s: Interval between container housekeepings
--max_housekeeping_interval=1m0s: Largest interval to allow between container housekeepings (default 1m0s)
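The interaction between --housekeeping_interval, --max_housekeeping_interval and --allow_dynamic_housekeeping can be illustrated with a small Python sketch. It assumes a simple back-off policy (double the interval while a container's stats are unchanged, reset to the base interval once they change). The policy and the function name next_interval are illustrative assumptions, not a statement of cAdvisor's exact heuristic.

```python
def next_interval(current, base, maximum, stats_changed, allow_dynamic=True):
    """Return the next per-container housekeeping interval, in seconds.

    Assumed policy (illustrative only): back off by doubling while the
    container is idle, never exceeding `maximum`; fall back to `base`
    as soon as the stats change or when dynamic housekeeping is off.
    """
    if not allow_dynamic or stats_changed:
        return base
    return min(current * 2, maximum)


# Simulate five housekeeping rounds with the documented defaults
# (--housekeeping_interval=1s, --max_housekeeping_interval=1m0s).
interval = 1.0
history = []
for changed in [False, False, False, True, False]:
    interval = next_interval(interval, base=1.0, maximum=60.0, stats_changed=changed)
    history.append(interval)

print(history)
```

With dynamic housekeeping disabled the function always returns the base interval, which matches the "predictable but more expensive" trade-off described above.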

11 HTTP

Specify where cAdvisor listens.

--http_auth_file="": HTTP auth file for the web UI
--http_auth_realm="localhost": HTTP auth realm for the web UI (default "localhost")
--http_digest_file="": HTTP digest file for the web UI
--http_digest_realm="localhost": HTTP digest realm for the web UI (default "localhost")
--listen_ip="": IP to listen on, defaults to all IPs
--port=8080: port to listen (default 8080)
--url_base_prefix=/: optional path prefix added to all resource URLs; useful when running cAdvisor behind a proxy. (default /)

12 Local storage duration

cAdvisor stores the most recent historical data in memory. How long a history it keeps can be configured with the --storage_duration flag.

--storage_duration=2m0s: How long to store data.

13 Machine

--boot_id_file="/proc/sys/kernel/random/boot_id": Comma-separated list of files to check for boot-id. Use the first one that exists. (default "/proc/sys/kernel/random/boot_id")
--machine_id_file="/etc/machine-id,/var/lib/dbus/machine-id": Comma-separated list of files to check for machine-id. Use the first one that exists. (default "/etc/machine-id,/var/lib/dbus/machine-id")
--update_machine_info_interval=5m: Interval between machine info updates. (default 5m)

14 Metrics

--application_metrics_count_limit=100: Max number of application metrics to store (per container) (default 100)
--collector_cert="": Collector's certificate, exposed to endpoints for certificate based authentication.
--collector_key="": Key for the collector's certificate
--disable_metrics=<metrics>: comma-separated list of metrics to be disabled. Options are advtcp,app,cpu,cpuLoad,cpu_topology,cpuset,disk,diskIO,hugetlb,memory,memory_numa,network,oom_event,percpu,perf_event,process,referenced_memory,resctrl,sched,tcp,udp. (default advtcp,cpu_topology,cpuset,hugetlb,memory_numa,process,referenced_memory,resctrl,sched,tcp,udp)
--enable_metrics=<metrics>: comma-separated list of metrics to be enabled. If set, overrides 'disable_metrics'. Options are advtcp,app,cpu,cpuLoad,cpu_topology,cpuset,disk,diskIO,hugetlb,memory,memory_numa,network,oom_event,percpu,perf_event,process,referenced_memory,resctrl,sched,tcp,udp.
--prometheus_endpoint="/metrics": Endpoint to expose Prometheus metrics on (default "/metrics")
--disable_root_cgroup_stats=false: Disable collecting root Cgroup stats
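The precedence between --disable_metrics and --enable_metrics can be sketched in Python as follows. The metric names and the default disabled list are copied from the flag descriptions above; the function effective_metrics is a hypothetical helper that models only the documented rule (a non-empty enable_metrics overrides disable_metrics) and does not reproduce cAdvisor's own flag parsing or validation.

```python
ALL_METRICS = set(
    "advtcp,app,cpu,cpuLoad,cpu_topology,cpuset,disk,diskIO,hugetlb,memory,"
    "memory_numa,network,oom_event,percpu,perf_event,process,"
    "referenced_memory,resctrl,sched,tcp,udp".split(",")
)
DEFAULT_DISABLED = set(
    "advtcp,cpu_topology,cpuset,hugetlb,memory_numa,process,"
    "referenced_memory,resctrl,sched,tcp,udp".split(",")
)


def effective_metrics(disable=None, enable=None):
    """Return the set of metrics that would be collected.

    disable / enable mirror the comma-separated flag values;
    None means the flag was not passed on the command line.
    """
    if enable:
        # Documented rule: if set, enable_metrics overrides disable_metrics.
        return {m for m in enable.split(",") if m}
    if disable is None:
        disabled = DEFAULT_DISABLED
    else:
        disabled = {m for m in disable.split(",") if m}
    return ALL_METRICS - disabled


print(sorted(effective_metrics()))                   # flag defaults
print(sorted(effective_metrics(disable="percpu")))   # only percpu disabled
print(sorted(effective_metrics(disable="percpu", enable="cpu,memory")))
```

Note that passing --disable_metrics explicitly replaces the default disabled list rather than extending it, so metrics such as tcp and udp become enabled unless they are listed again.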

15 Storage drivers

--storage_driver="": Storage driver to use. Data is always cached shortly in memory, this controls where data is pushed besides the local cache. Empty means none. Options are: <empty>, bigquery, elasticsearch, influxdb, kafka, redis, statsd, stdout
--storage_driver_buffer_duration="1m0s": Writes in the storage driver will be buffered for this duration, and committed to the non memory backends as a single transaction (default 1m0s)
--storage_driver_db="cadvisor": database name (default "cadvisor")
--storage_driver_host="localhost:8086": database host:port (default "localhost:8086")
--storage_driver_password="root": database password (default "root")
--storage_driver_secure=false: use secure connection with database
--storage_driver_table="stats": table name (default "stats")
--storage_driver_user="root": database username (default "root")

16 Perf events

--perf_events_config="": Path to a JSON file containing configuration of perf events to measure. Empty value disables perf events measuring.

Core perf events can be exposed on the Prometheus endpoint per CPU or aggregated by event. This is controlled through the --disable_metrics and --enable_metrics parameters with the percpu option. For example:

--disable_metrics="percpu": core perf events are aggregated.
--disable_metrics="": core perf events are exposed per CPU.

Exposing many perf events per CPU can cause a "too many open files" error, because the process exceeds the system limit on open file descriptors. Try raising the maximum number of file descriptors with ulimit -n <value>.
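Before raising the limit it can help to compare the current file descriptor limit with a rough estimate of what per-CPU perf events need. The Python sketch below assumes one descriptor per event, per CPU, per monitored container. That multiplication is a back-of-the-envelope assumption rather than cAdvisor's exact accounting, and the event and container counts are made-up example values.

```python
import os
import resource


def estimated_fds(num_events, num_cpus, num_containers):
    """Rough upper bound: one perf descriptor per event, CPU and container."""
    return num_events * num_cpus * num_containers


# Current soft and hard limits for open files (what `ulimit -n` reports
# is the soft limit).
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)

# Example values only: 8 configured events, 20 monitored containers.
needed = estimated_fds(num_events=8, num_cpus=os.cpu_count() or 1, num_containers=20)

print(f"soft limit: {soft}, hard limit: {hard}, estimated need: {needed}")
if needed > soft:
    print("Estimated need exceeds the soft limit; consider `ulimit -n` or aggregation.")
```

If the estimate is far above the limit, switching to the aggregated form with --disable_metrics="percpu" may be a better fix than raising the limit.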

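The aggregated form of core perf events significantly reduces the volume of data. In the aggregated form, the scaling ratio (container_perf_metric_scaling ratio) reports the lowest scaling ratio observed for a given event, that is, the worst precision.

16.1 Perf subsystem introduction

One of the goals of the kernel perf subsystem is to instrument the CPU performance counters that make application profiling possible. Profiling is performed by setting up performance counters that count hardware events (for example, the number of retired instructions or the number of cache misses). The counters are CPU hardware registers, and their number is limited.

Other goals of the perf subsystem (such as tracing) are beyond the scope of this documentation; see the Further reading section below to learn more about them.

Familiarize yourself with the following perf-event terms:

multiplexing: 2nd Generation Intel® Xeon® Scalable Processors provide 4 counters per hyper-thread. If the number of configured events is greater than the number of available counters, Linux multiplexes the counting, and some (or even all) of the events are not counted all of the time. In that situation, information is provided about how long each event was actually counted and how long it was enabled. The counter value that cAdvisor exposes is scaled automatically.
grouping: when the counted events are used to calculate derived metrics, it is reasonable to measure them in a transactional manner, meaning that all events in a group must be counted over the same period of time. Keep in mind that it is impossible to group more events than there are counters available.
uncore events: events that can be counted by PMUs outside the core.
PMU: Performance Monitoring Unit.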
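The automatic scaling mentioned under multiplexing can be sketched in Python. The sketch uses the usual perf convention: the ratio is time running divided by time enabled, and the raw count is divided by that ratio. The function name scale_counter, the sample readings, and the choice to sum per-CPU values in the aggregated form are illustrative assumptions.

```python
def scale_counter(raw, time_enabled, time_running):
    """Extrapolate a multiplexed counter to the full enabled period.

    Returns (scaled_value, scaling_ratio). A ratio of 1.0 means the event
    was counted the whole time; lower ratios mean lower precision.
    """
    if time_enabled == 0 or time_running == 0:
        # The event never got a hardware counter; nothing to extrapolate.
        return 0.0, 0.0
    ratio = time_running / time_enabled
    return raw / ratio, ratio


# Hypothetical per-CPU readings: (raw count, time enabled, time running).
readings = {"cpu0": (1000, 200, 100), "cpu1": (900, 200, 200)}
scaled = {cpu: scale_counter(*r) for cpu, r in readings.items()}

# Aggregated form: one value per event, reported with the lowest
# (worst-precision) scaling ratio across CPUs.
total = sum(value for value, _ in scaled.values())
worst_ratio = min(ratio for _, ratio in scaled.values())
print(total, worst_ratio)
```

Here cpu0 was counted for only half of the enabled time, so its raw count is doubled and the aggregated ratio reflects that lower precision.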

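16.1.1 Getting config values

Using the perf tools:

Identify the event in the perf list output.
Run the command: perf stat -I 5000 -vvv -e EVENT_NAME
Find the perf_event_attr section in the perf stat output and copy the config and type fields into the configuration file.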

perf_event_attr:

 type                             18
 size                             112
 config                           0x304
 sample_type                      IDENTIFIER
 read_format                      TOTAL_TIME_ENABLED|TOTAL_TIME_RUNNING
 disabled                         1
 inherit                          1
 exclude_guest                    1
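Copying the two fields by hand is error-prone, so the extraction can be sketched in Python. The sketch assumes the perf_event_attr section has the two-column layout shown above; the helper name custom_event_from_attr and the SAMPLE text are illustrative and not part of perf or cAdvisor.

```python
import json
import re

# Sample text in the same layout as the perf_event_attr section above.
SAMPLE = """perf_event_attr:
  type                             18
  size                             112
  config                           0x304
  sample_type                      IDENTIFIER
  read_format                      TOTAL_TIME_ENABLED|TOTAL_TIME_RUNNING
  disabled                         1
"""


def custom_event_from_attr(text, name):
    """Build a custom_events entry from a perf_event_attr dump."""
    # Each field line is "<key> <spaces> <value>"; the header line has no value.
    fields = dict(re.findall(r"^[ \t]*(\w+)[ \t]+(\S+)[ \t]*$", text, flags=re.M))
    return {
        "type": int(fields["type"], 0),   # accepts decimal or 0x-prefixed values
        "config": [fields["config"]],     # kept as a hex string, as in the config file
        "name": name,
    }


entry = custom_event_from_attr(SAMPLE, "event_name")
print(json.dumps(entry, indent=2))
```

The resulting dictionary has the same shape as one element of the custom_events list in the perf events configuration file.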

The configuration file should look like the following:

{

 "core": {
   "events": [
     "event_name"
   ],
   "custom_events": [
     {
       "type": 18,
       "config": [
         "0x304"
       ],
       "name": "event_name"
     }
   ]
 },
 "uncore": {
   "events": [
     "event_name"
   ],
   "custom_events": [
     {
       "type": 18,
       "config": [
         "0x304"
       ],
       "name": "event_name"
     }
   ]
 }

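}

Config values can also be obtained from:

Intel® 64 and IA32 Architectures Performance Monitoring Events

Uncore events configuration

Uncore event names should take the form PMU_PREFIX/event_name, where PMU_PREFIX means that statistics are counted on all PMUs whose names start with that prefix.

Let's explain this with an example: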

{

 "uncore": {
   "events": [
     "uncore_imc/cas_count_read",
     "uncore_imc_0/cas_count_write",
     "cas_count_all"
   ],
   "custom_events": [
     {
       "config": [
         "0x304"
       ],
       "name": "uncore_imc_0/cas_count_write"
     },
     {
       "type": 19,
       "config": [
         "0x304"
       ],
       "name": "cas_count_all"
     }
   ]
 }

}

uncore_imc/cas_count_read: because of the uncore_imc type and the absence of an entry in custom events, it is counted by all Integrated Memory Controller PMUs, using the config provided by the libpfm package (through this function: https://man7.org/linux/man-pages/man3/pfm_get_os_event_encoding.3.html).

uncore_imc_0/cas_count_write: because of the uncore_imc_0 type and its entry in custom events, it is counted by the uncore_imc_0 PMU using the provided config.

cas_count_all: because its entry in custom events contains a type field, the event is counted by the PMU with type 19 using the provided config.
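The prefix rule behind these three cases can be sketched in Python. The helper names split_event and matching_pmus are hypothetical, and the list of PMU names is made-up example data; on a real machine the names would come from the system's list of PMU devices.

```python
def split_event(name):
    """Split 'PMU_PREFIX/event_name' into (prefix, event_name).

    Names without a slash have no prefix; their PMU has to come from
    the `type` field of a custom_events entry instead.
    """
    prefix, sep, event = name.partition("/")
    return (prefix, event) if sep else ("", name)


def matching_pmus(event_name, available_pmus):
    """Return the PMUs whose names start with the event's prefix."""
    prefix, _ = split_event(event_name)
    if not prefix:
        return []
    # Plain prefix match, as described above: 'uncore_imc' selects every
    # uncore_imc_N device, while 'uncore_imc_0' narrows the selection.
    return [pmu for pmu in available_pmus if pmu.startswith(prefix)]


pmus = ["uncore_imc_0", "uncore_imc_1", "uncore_cha_0"]  # example data
for name in ["uncore_imc/cas_count_read", "uncore_imc_0/cas_count_write", "cas_count_all"]:
    print(name, "->", matching_pmus(name, pmus))
```

A side effect of plain prefix matching is that a prefix such as uncore_imc_1 would also select a device named uncore_imc_10, so full device names should be chosen with care on machines with many PMUs.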

16.1.2 Configuring perf events by name

It is possible to configure perf events by name, using the events supported by libpfm4. See the libpfm4 documentation for details.

The perf events supported on a platform can be discovered with pmu.py, a Python script shipped with libpfm4. See the script's requirements.

Example configuration of perf events using event names supported in libpfm4

Example output of pmu.py:

$ python pmu.py
INSTRUCTIONS 1
    u 0
    k 1
    period 3
    freq 4
    precise 5
    excl 6
    mg 7
    mh 8
    cpu 9
    pinned 10
INSTRUCTION_RETIRED 192
    e 2
    i 3
    c 4
    t 5
    intx 7
    intxcp 8
    u 0
    k 1
    period 3
    freq 4
    excl 6
    mg 7
    mh 8
    cpu 9
    pinned 10
UNC_M_CAS_COUNT 4
    RD 3
    WR 12
    e 0
    i 1
    t 2
    period 3
    freq 4
    excl 6
    cpu 9
    pinned 10

Perf events configuration for the listed events:

{

 "core": {
   "events": [
     "instructions",
     "instruction_retired"
   ]
 },
 "uncore": {
   "events": [
     "uncore_imc/unc_m_cas_count:rd",
     "uncore_imc/unc_m_cas_count:wr"
   ]
 }

}

Note: PMU_PREFIX is provided in the same way as in the configuration with config values.

16.1.3 Grouping

{

 "core": {
   "events": [
     ["instructions", "instruction_retired"]
   ]
 },
 "uncore": {
   "events": [
     ["uncore_imc_0/unc_m_cas_count:rd", "uncore_imc_0/unc_m_cas_count:wr"],
     ["uncore_imc_1/unc_m_cas_count:rd", "uncore_imc_1/unc_m_cas_count:wr"]
   ]
 }

}
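In this configuration each inner list is one group. Since a group cannot contain more events than there are counters available, a quick check can be sketched in Python. The function oversized_groups and its counters parameter are hypothetical; the default of 4 is taken from the per-hyper-thread example in the terms above, and the real limit depends on the CPU and on the PMU (uncore PMUs may differ).

```python
import json

CONFIG = json.loads("""
{
  "core": {
    "events": [["instructions", "instruction_retired"]]
  },
  "uncore": {
    "events": [
      ["uncore_imc_0/unc_m_cas_count:rd", "uncore_imc_0/unc_m_cas_count:wr"],
      ["uncore_imc_1/unc_m_cas_count:rd", "uncore_imc_1/unc_m_cas_count:wr"]
    ]
  }
}
""")


def oversized_groups(config, counters=4):
    """Return (section, group) pairs whose group exceeds the counter budget."""
    too_big = []
    for section in ("core", "uncore"):
        for entry in config.get(section, {}).get("events", []):
            # A list entry is a group; a plain string is a non-grouped event.
            if isinstance(entry, list) and len(entry) > counters:
                too_big.append((section, entry))
    return too_big


print(oversized_groups(CONFIG))              # all groups fit in 4 counters
print(len(oversized_groups(CONFIG, counters=1)))
```

An empty result only means the groups are small enough under the assumed budget; it does not prove the kernel will schedule them.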

16.2 Further reading

perf Examples on Brendan Gregg's blog
Kernel Perf Wiki
man perf_event_open
perf subsystem in the Linux kernel
Uncore Performance Monitoring Reference Manuals

See the example configuration below:

{
  "core": {
    "events": [
      "instructions",
      "instructions_retired"
    ],
    "custom_events": [
      {
        "type": 4,
        "config": [
          "0x5300c0"
        ],
        "name": "instructions_retired"
      }
    ]
  },
  "uncore": {
    "events": [
      "uncore_imc/cas_count_read"
    ],
    "custom_events": [
      {
        "config": [
          "0xc04"
        ],
        "name": "uncore_imc/cas_count_read"
      }
    ]
  }
}

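In the example above:

instructions is measured as a non-grouped event and is specified using the human-friendly interface available through perf list. Any name that appears in the output of the perf list command can be used. This is the interface most users will rely on.
instructions_retired is measured as a non-grouped event and is specified using an advanced API that can describe any available perf event (some events have no name and cannot be specified as a plain string). The event name should be a human-readable string; it becomes the metric name.
cas_count_read is measured as a non-grouped uncore event on all Integrated Memory Controller Performance Monitoring Units, because the type field is unset and the name carries the uncore_imc prefix.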

17 Resctrl

To obtain metrics, cAdvisor creates its own monitoring groups with the cadvisor prefix.

The resctrl file system is not hierarchical like cgroups, so users should set the --docker_only flag to avoid race conditions and unexpected behaviour.

--resctrl_interval=0: Resctrl mon groups updating interval. Zero value disables updating mon groups.

18 Storage driver specific instructions

InfluxDB instructions.
ElasticSearch instructions.
Kafka instructions.
Prometheus instructions.