Introduction
SkyWalking is an open-source observability platform used to collect, analyze, aggregate, and visualize data from services and cloud-native infrastructure. SkyWalking provides an easy way to maintain a clear view of distributed systems, even across clouds. It is a modern application performance management (APM) tool, designed for cloud-native, container-based distributed systems.
SkyWalking provides observability for services, service instances, endpoints, and processes. The terms "Service", "Instance", and "Endpoint" are used widely everywhere today, so it is worth defining their specific meanings in the context of SkyWalking:
- Service: represents a group of workloads that provide the same behavior for handling incoming requests. You can define the service name when using an instrumentation agent or SDK. SkyWalking can also use the name you define in a platform such as Istio.
- Service Instance: each individual workload in the service group is known as an instance. Like pods in Kubernetes, it does not have to be a single OS process; however, if you are using an instrumentation agent, an instance is in fact a real OS process.
- Endpoint: a path in a service for incoming requests, such as an HTTP URI path or a gRPC service class plus method signature.
- Process: an operating system process. In some scenarios, a service instance is not a process; for example, a Kubernetes pod can contain multiple processes.
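As a concrete illustration, when using the Java instrumentation agent the service name is typically defined in the agent's `agent.config`. The sketch below follows the upstream default; `Your_ApplicationName` is a placeholder to replace with your own service name:

```properties
# agent/config/agent.config (excerpt)
# The service name shown in the SkyWalking UI.
# Can be overridden via the SW_AGENT_NAME environment variable.
agent.service_name=${SW_AGENT_NAME:Your_ApplicationName}
```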
Architecture
SkyWalking is logically split into four parts: Probes, Platform backend, Storage, and UI.
![](https://news.xinpengboligang.com/upload/keji/0dc5e3c43da764c1fd94d295a3964853.jpeg)
- Probes collect telemetry data in various formats (SkyWalking, Zipkin, OpenTelemetry, Prometheus, Zabbix, etc.), including metrics, traces, logs, and events.
- Platform backend supports data aggregation, analysis, and streaming processing for traces, metrics, logs, and events. It can act as an aggregator role, a receiver role, or both.
- Storage houses SkyWalking data through an open/pluggable interface. You can choose an existing implementation, such as ElasticSearch, H2, MySQL, TiDB, or BanyanDB, or implement your own.
- UI is a highly customizable web-based interface that allows SkyWalking end users to visualize and manage SkyWalking data.
Usage
(Omitted)
Configuration
Storage configuration
storage:
  selector: ${SW_STORAGE:elasticsearch}
  elasticsearch:
    namespace: ${SW_NAMESPACE:""}
    clusterNodes: ${SW_STORAGE_ES_CLUSTER_NODES:localhost:9200}
    protocol: ${SW_STORAGE_ES_HTTP_PROTOCOL:"http"}
    trustStorePath: ${SW_STORAGE_ES_SSL_JKS_PATH:""}
    trustStorePass: ${SW_STORAGE_ES_SSL_JKS_PASS:""}
    user: ${SW_ES_USER:""}
    password: ${SW_ES_PASSWORD:""}
    secretsManagementFile: ${SW_ES_SECRETS_MANAGEMENT_FILE:""} # Secrets management file in the properties format, including the username and password, managed by a 3rd-party tool.
    dayStep: ${SW_STORAGE_DAY_STEP:1} # The number of days covered by a single minute/hour/day index.
    indexShardsNumber: ${SW_STORAGE_ES_INDEX_SHARDS_NUMBER:1} # Shard number of new indexes
    indexReplicasNumber: ${SW_STORAGE_ES_INDEX_REPLICAS_NUMBER:1} # Replica number of new indexes
    # Specify the settings for each index individually.
    # If configured, this setting has the highest priority and overrides the generic settings.
    specificIndexSettings: ${SW_STORAGE_ES_SPECIFIC_INDEX_SETTINGS:""}
    # Super datasets are defined in the code, such as trace segments. The following three settings
    # improve ES performance when storing super-size data in ES.
    superDatasetDayStep: ${SW_STORAGE_ES_SUPER_DATASET_DAY_STEP:-1} # The number of days covered by a super-size dataset record index; when the value is less than 0, the default is the same as dayStep.
    superDatasetIndexShardsFactor: ${SW_STORAGE_ES_SUPER_DATASET_INDEX_SHARDS_FACTOR:5} # This factor provides more shards for super datasets: shards number = indexShardsNumber * superDatasetIndexShardsFactor. This factor also affects Zipkin traces.
    superDatasetIndexReplicasNumber: ${SW_STORAGE_ES_SUPER_DATASET_INDEX_REPLICAS_NUMBER:0} # The replica number of super-size dataset record indexes; the default value is 0.
    indexTemplateOrder: ${SW_STORAGE_ES_INDEX_TEMPLATE_ORDER:0} # The order of the index template
    bulkActions: ${SW_STORAGE_ES_BULK_ACTIONS:1000} # Execute the async bulk record data every ${SW_STORAGE_ES_BULK_ACTIONS} requests
    flushInterval: ${SW_STORAGE_ES_FLUSH_INTERVAL:10} # Flush the bulk every 10 seconds, regardless of the number of requests
    concurrentRequests: ${SW_STORAGE_ES_CONCURRENT_REQUESTS:2} # The number of concurrent requests
    resultWindowMaxSize: ${SW_STORAGE_ES_QUERY_MAX_WINDOW_SIZE:10000}
    metadataQueryMaxSize: ${SW_STORAGE_ES_QUERY_MAX_SIZE:5000}
    segmentQueryMaxSize: ${SW_STORAGE_ES_QUERY_SEGMENT_SIZE:200}
    profileTaskQueryMaxSize: ${SW_STORAGE_ES_QUERY_PROFILE_TASK_SIZE:200}
    profileDataQueryScrollBatchSize: ${SW_STORAGE_ES_QUERY_PROFILE_DATA_SCROLLING_BATCH_SIZE:100}
    oapAnalyzer: ${SW_STORAGE_ES_OAP_ANALYZER:"{\"analyzer\":{\"oap_analyzer\":{\"type\":\"stop\"}}}"} # The OAP analyzer.
    oapLogAnalyzer: ${SW_STORAGE_ES_OAP_LOG_ANALYZER:"{\"analyzer\":{\"oap_log_analyzer\":{\"type\":\"standard\"}}}"} # The OAP log analyzer. It can be customized via the ES analyzer configuration to support log formats in more languages, such as Chinese and Japanese logs.
    advanced: ${SW_STORAGE_ES_ADVANCED:""}
    # Set it to `true` to shard metrics indices into multiple physical indices
    # (one index template per metric/meter aggregation function), the same as versions before 9.2.0.
    logicSharding: ${SW_STORAGE_ES_LOGIC_SHARDING:false}
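Switching to another backend only requires changing the selector and providing that backend's section. As a sketch, a minimal H2 setup based on the defaults shipped in application.yml looks like this (adjust the JDBC URL for your deployment):

```yaml
storage:
  selector: ${SW_STORAGE:h2}
  h2:
    driver: ${SW_STORAGE_H2_DRIVER:org.h2.jdbcx.JdbcDataSource}
    url: ${SW_STORAGE_H2_URL:jdbc:h2:mem:skywalking-oap-db}
    user: ${SW_STORAGE_H2_USER:sa}
    metadataQueryMaxSize: ${SW_STORAGE_H2_QUERY_MAX_SIZE:5000}
```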
TTL
# Set a timeout on metrics data. After the timeout has expired, the metrics data will automatically be deleted.
recordDataTTL: ${SW_CORE_RECORD_DATA_TTL:3} # Unit is day
metricsDataTTL: ${SW_CORE_METRICS_DATA_TTL:7} # Unit is day
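Note that these two TTL keys do not belong to the storage module; they sit under the core module in application.yml, roughly as follows:

```yaml
core:
  default:
    # Records include traces and logs.
    recordDataTTL: ${SW_CORE_RECORD_DATA_TTL:3} # Unit is day
    # All metrics, at every aggregation granularity.
    metricsDataTTL: ${SW_CORE_METRICS_DATA_TTL:7} # Unit is day
```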
Alarm configuration
Observability Analysis Language (OAL)
OAL (Observability Analysis Language) is used to analyze incoming data in streaming mode. OAL focuses on metrics of services, service instances, and endpoints, so the language is easy to learn and use.
OAL scripts now live in the /config folder.
Scripts should be named *.oal.
// Declare the metrics.
METRICS_NAME = from(CAST SCOPE.(* | [FIELD][,FIELD ...]))
[.filter(CAST FIELD OP [INT | STRING])]
.FUNCTION([PARAM][, PARAM ...])
// Disable a hard-coded metric.
disable(METRICS_NAME);
Scope and field reference
https://skywalking.apache.org/docs/main/v9.3.0/en/concepts-and-designs/scope-definitions/
Available Scope values
Service, ServiceInstance, Endpoint, ServiceRelation, ServiceInstanceRelation, and EndpointRelation
Filter
Use filters to build conditions on field values, given a field name and an expression.
Expressions can be chained with and, or, and (...). The supported operators are ==, !=, >, <, >=, <=, in [...], like %..., like ...%, like %...%, contain, and not contain, with type detection based on the field type. Incompatible combinations may trigger compile or code-generation errors.
Aggregation functions
longAvg.
doubleAvg.
percent.
rate.
count.
histogram.
apdex.
p99, p95, p90, p75, p50.
Demo
// Calculate p99 of both Endpoint1 and Endpoint2
endpoint_p99 = from(Endpoint.latency).filter(name in ("Endpoint1", "Endpoint2")).summary(0.99)
// Calculate p99 of Endpoint name started with `serv`
serv_Endpoint_p99 = from(Endpoint.latency).filter(name like "serv%").summary(0.99)
// Calculate the avg response time of each Endpoint
endpoint_resp_time = from(Endpoint.latency).avg()
// Calculate the p50, p75, p90, p95 and p99 of each Endpoint by 10 ms steps.
endpoint_percentile = from(Endpoint.latency).percentile(10)
// Calculate the percentage of responses whose status is true, for each service.
endpoint_success = from(Endpoint.*).filter(status == true).percent()
// Calculate the sum of response code in [404, 500, 503], for each service.
endpoint_abnormal = from(Endpoint.*).filter(httpResponseStatusCode in [404, 500, 503]).count()
// Calculate the sum of request type in [RequestType.RPC, RequestType.gRPC], for each service.
endpoint_rpc_calls_sum = from(Endpoint.*).filter(type in [RequestType.RPC, RequestType.gRPC]).count()
// Calculate the sum of endpoint name in ["/v1", "/v2"], for each service.
endpoint_url_sum = from(Endpoint.*).filter(name in ["/v1", "/v2"]).count()
// Calculate the sum of calls for each service.
endpoint_calls = from(Endpoint.*).count()
// Calculate the CPM with the GET method for each service.The value is made up with `tagKey:tagValue`.
// Option 1, use `tags contain`.
service_cpm_http_get = from(Service.*).filter(tags contain "http.method:GET").cpm()
// Option 2, use `tag[key]`.
service_cpm_http_get = from(Service.*).filter(tag["http.method"] == "GET").cpm();
// Calculate the CPM with the HTTP method except for the GET method for each service.The value is made up with `tagKey:tagValue`.
service_cpm_http_other = from(Service.*).filter(tags not contain "http.method:GET").cpm()
disable(segment);
disable(endpoint_relation_server_side);
disable(top_n_database_statement);
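The apdex and histogram functions from the list above are not covered in the demo; they can be sketched as follows, with parameter values taken from the upstream OAL examples:

```
// Apdex score of each service; `name` selects the Apdex threshold group, `status` marks success.
service_apdex = from(Service.latency).apdex(name, status)
// Latency heat map across all services, in 100 ms steps with 20 buckets.
all_heatmap = from(All.latency).histogram(100, 20)
```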
Alarms
Individual rules
- Rule name
- Metrics name
- Include names
- Exclude names
- Include names regex
- Exclude names regex
- Include labels
- Exclude labels
- Include labels regex
- Exclude labels regex
- Tags
- Threshold
- Operator
- Period
- Count
- Only as condition
- Silence period
Composite rules
- Rule name
- Expression
- Message
- Tags
Demo
rules:
  # Rule unique name, must end with `_rule`.
  endpoint_percent_rule:
    # The metrics value needs to be long, double, or int.
    metrics-name: endpoint_percent
    threshold: 75
    op: <
    # The length of time to evaluate the metrics
    period: 10
    # How many times the metrics must match the condition before an alarm is triggered
    count: 3
    # How many checks the alarm keeps silent for after being triggered; defaults to the same as period.
    silence-period: 10
    # Specify whether the rule can send notifications or only acts as a condition of a composite rule
    only-as-condition: false
    tags:
      level: WARNING
  service_percent_rule:
    metrics-name: service_percent
    # [Optional] Default: match all services in this metrics
    include-names:
      - service_a
      - service_b
    exclude-names:
      - service_c
    # Single-value metrics threshold.
    threshold: 85
    op: <
    period: 10
    count: 4
    only-as-condition: false
  service_resp_time_percentile_rule:
    # The metrics value needs to be long, double, or int.
    metrics-name: service_percentile
    op: ">"
    # Multiple-value metrics threshold. Thresholds for P50, P75, P90, P95, P99.
    threshold: 1000,1000,1000,1000,1000
    period: 10
    count: 3
    silence-period: 5
    message: Percentile response time of service {name} alarm in 3 minutes of last 10 minutes, due to more than one condition of p50 > 1000, p75 > 1000, p90 > 1000, p95 > 1000, p99 > 1000
    only-as-condition: false
  meter_service_status_code_rule:
    metrics-name: meter_status_code
    exclude-labels:
      - "200"
    op: ">"
    threshold: 10
    period: 10
    count: 3
    silence-period: 5
    message: The request number of entity {name} non-200 status is more than expected.
    only-as-condition: false
composite-rules:
  comp_rule:
    # Must satisfy both the percent rule and the response time rule
    expression: service_percent_rule && service_resp_time_percentile_rule
    message: Service {name} successful rate is less than 80% and P50 of response time is over 1000ms
    tags:
      level: CRITICAL
WeChat notifications
wechatHooks:
  textTemplate: |-
    {
      "msgtype": "text",
      "text": {
        "content": "Apache SkyWalking Alarm: \n %s."
      }
    }
  webhooks:
    - https://qyapi.weixin.qq.com/cgi-bin/webhook/send?key=dummy_key
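Other channels follow the same hook pattern. As a sketch based on the upstream alarm settings, a Slack hook would look roughly like this (the webhook URL below is a placeholder):

```yaml
slackHooks:
  textTemplate: |-
    {
      "type": "section",
      "text": {
        "type": "mrkdwn",
        "text": ":alarm_clock: *Apache SkyWalking Alarm* \n **%s**."
      }
    }
  webhooks:
    - https://hooks.slack.com/services/x/y/z
```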