Introduction
SkyWalking is an open-source observability platform used to collect, analyze, aggregate, and visualize data from services and cloud-native infrastructure. SkyWalking provides an easy way to maintain a clear view of distributed systems, even across clouds. It is a modern application performance management (APM) tool, designed for cloud-native, container-based distributed systems.
SkyWalking provides observability for services, service instances, endpoints, and processes. The terms "Service", "Instance", and "Endpoint" are used widely everywhere today, so it is worth defining their specific meanings in the context of SkyWalking:
- Service: represents a group of workloads that provide the same behavior for handling incoming requests. You can define the service name when using an instrumentation agent or SDK. SkyWalking can also use the name you define in a platform such as Istio.
- Service Instance: each individual workload in the service group is known as an instance. Like pods in Kubernetes, it does not have to be a single OS process; however, if you are using an instrumentation agent, an instance is in fact a real OS process.
- Endpoint: a path in a service for incoming requests, such as an HTTP URI path or a gRPC service class plus method signature.
- Process: an operating system process. In some scenarios, a service instance is not a process; for example, a Kubernetes pod can contain multiple processes.
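As a concrete illustration, when using the Java instrumentation agent the service name is typically defined in the agent's `agent.config`. The sketch below follows the upstream default; `Your_ApplicationName` is a placeholder to replace with your own service name:

```properties
# agent/config/agent.config (excerpt)
# The service name shown in the SkyWalking UI.
# Can be overridden via the SW_AGENT_NAME environment variable.
agent.service_name=${SW_AGENT_NAME:Your_ApplicationName}
```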
Architecture
SkyWalking is logically split into four parts: Probes, Platform backend, Storage, and UI.
![](https://news.xinpengboligang.com/upload/keji/0dc5e3c43da764c1fd94d295a3964853.jpeg)
- Probes collect telemetry data in various formats (SkyWalking, Zipkin, OpenTelemetry, Prometheus, Zabbix, etc.), including metrics, traces, logs, and events.
- Platform backend supports data aggregation, analysis, and streaming processing for traces, metrics, logs, and events. It can act as an aggregator role, a receiver role, or both.
- Storage houses SkyWalking data through an open/pluggable interface. You can choose an existing implementation, such as ElasticSearch, H2, MySQL, TiDB, or BanyanDB, or implement your own.
- UI is a highly customizable web-based interface that allows SkyWalking end users to visualize and manage SkyWalking data.
Usage
(Omitted)
Configuration
Storage configuration
storage:
  selector: ${SW_STORAGE:elasticsearch}
  elasticsearch:
    namespace: ${SW_NAMESPACE:""}
    clusterNodes: ${SW_STORAGE_ES_CLUSTER_NODES:localhost:9200}
    protocol: ${SW_STORAGE_ES_HTTP_PROTOCOL:"http"}
    trustStorePath: ${SW_STORAGE_ES_SSL_JKS_PATH:""}
    trustStorePass: ${SW_STORAGE_ES_SSL_JKS_PASS:""}
    user: ${SW_ES_USER:""}
    password: ${SW_ES_PASSWORD:""}
    secretsManagementFile: ${SW_ES_SECRETS_MANAGEMENT_FILE:""} # Secrets management file in the properties format, including the username and password, managed by a 3rd-party tool.
    dayStep: ${SW_STORAGE_DAY_STEP:1} # The number of days covered by a single minute/hour/day index.
    indexShardsNumber: ${SW_STORAGE_ES_INDEX_SHARDS_NUMBER:1} # Shard number of new indexes
    indexReplicasNumber: ${SW_STORAGE_ES_INDEX_REPLICAS_NUMBER:1} # Replica number of new indexes
    # Specify the settings for each index individually.
    # If configured, this setting has the highest priority and overrides the generic settings.
    specificIndexSettings: ${SW_STORAGE_ES_SPECIFIC_INDEX_SETTINGS:""}
    # Super datasets are defined in the code, such as trace segments. The following three settings
    # improve ES performance when storing super-size data in ES.
    superDatasetDayStep: ${SW_STORAGE_ES_SUPER_DATASET_DAY_STEP:-1} # The number of days covered by a super-size dataset record index; when the value is less than 0, the default is the same as dayStep.
    superDatasetIndexShardsFactor: ${SW_STORAGE_ES_SUPER_DATASET_INDEX_SHARDS_FACTOR:5} # This factor provides more shards for super datasets: shards number = indexShardsNumber * superDatasetIndexShardsFactor. This factor also affects Zipkin traces.
    superDatasetIndexReplicasNumber: ${SW_STORAGE_ES_SUPER_DATASET_INDEX_REPLICAS_NUMBER:0} # The replica number of super-size dataset record indexes; the default value is 0.
    indexTemplateOrder: ${SW_STORAGE_ES_INDEX_TEMPLATE_ORDER:0} # The order of the index template
    bulkActions: ${SW_STORAGE_ES_BULK_ACTIONS:1000} # Execute the async bulk record data every ${SW_STORAGE_ES_BULK_ACTIONS} requests
    flushInterval: ${SW_STORAGE_ES_FLUSH_INTERVAL:10} # Flush the bulk every 10 seconds, regardless of the number of requests
    concurrentRequests: ${SW_STORAGE_ES_CONCURRENT_REQUESTS:2} # The number of concurrent requests
    resultWindowMaxSize: ${SW_STORAGE_ES_QUERY_MAX_WINDOW_SIZE:10000}
    metadataQueryMaxSize: ${SW_STORAGE_ES_QUERY_MAX_SIZE:5000}
    segmentQueryMaxSize: ${SW_STORAGE_ES_QUERY_SEGMENT_SIZE:200}
    profileTaskQueryMaxSize: ${SW_STORAGE_ES_QUERY_PROFILE_TASK_SIZE:200}
    profileDataQueryScrollBatchSize: ${SW_STORAGE_ES_QUERY_PROFILE_DATA_SCROLLING_BATCH_SIZE:100}
    oapAnalyzer: ${SW_STORAGE_ES_OAP_ANALYZER:"{\"analyzer\":{\"oap_analyzer\":{\"type\":\"stop\"}}}"} # The OAP analyzer.
    oapLogAnalyzer: ${SW_STORAGE_ES_OAP_LOG_ANALYZER:"{\"analyzer\":{\"oap_log_analyzer\":{\"type\":\"standard\"}}}"} # The OAP log analyzer. It can be customized via the ES analyzer configuration to support log formats in more languages, such as Chinese and Japanese logs.
    advanced: ${SW_STORAGE_ES_ADVANCED:""}
    # Set it to `true` to shard metrics indices into multiple physical indices
    # (one index template per metric/meter aggregation function), the same as versions before 9.2.0.
    logicSharding: ${SW_STORAGE_ES_LOGIC_SHARDING:false}
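Switching to another backend only requires changing the selector and providing that backend's section. As a sketch, a minimal H2 setup based on the defaults shipped in application.yml looks like this (adjust the JDBC URL for your deployment):

```yaml
storage:
  selector: ${SW_STORAGE:h2}
  h2:
    driver: ${SW_STORAGE_H2_DRIVER:org.h2.jdbcx.JdbcDataSource}
    url: ${SW_STORAGE_H2_URL:jdbc:h2:mem:skywalking-oap-db}
    user: ${SW_STORAGE_H2_USER:sa}
    metadataQueryMaxSize: ${SW_STORAGE_H2_QUERY_MAX_SIZE:5000}
```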
TTL
# Set a timeout on metrics data. After the timeout has expired, the metrics data will automatically be deleted.
recordDataTTL: ${SW_CORE_RECORD_DATA_TTL:3} # Unit is day
metricsDataTTL: ${SW_CORE_METRICS_DATA_TTL:7} # Unit is day
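Note that these two TTL keys do not belong to the storage module; they sit under the core module in application.yml, roughly as follows:

```yaml
core:
  default:
    # Records include traces and logs.
    recordDataTTL: ${SW_CORE_RECORD_DATA_TTL:3} # Unit is day
    # All metrics, at every aggregation granularity.
    metricsDataTTL: ${SW_CORE_METRICS_DATA_TTL:7} # Unit is day
```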
Alarm configuration
Observability Analysis Language (OAL)
OAL (Observability Analysis Language) is used to analyze incoming data in streaming mode. OAL focuses on metrics of services, service instances, and endpoints, so the language is easy to learn and use.
OAL scripts now live in the /config folder.
Scripts should be named *.oal.
// Declare the metrics.
METRICS_NAME = from(CAST SCOPE.(* | [FIELD][,FIELD ...]))
[.filter(CAST FIELD OP [INT | STRING])]
.FUNCTION([PARAM][, PARAM ...])
// Disable a hard-coded metric.
disable(METRICS_NAME);
Scope and field reference
https://skywalking.apache.org/docs/main/v9.3.0/en/concepts-and-designs/scope-definitions/
Available Scope values
Service, ServiceInstance, Endpoint, ServiceRelation, ServiceInstanceRelation, and EndpointRelation
Filter
Use filters to build conditions on field values, given a field name and an expression.
Expressions can be chained with and, or, and (...). The supported operators are ==, !=, >, <, >=, <=, in [...], like %..., like ...%, like %...%, contain, and not contain, with type detection based on the field type. Incompatible combinations may trigger compile or code-generation errors.
Aggregation functions
longAvg.
doubleAvg.
percent.
rate.
count.
histogram.
apdex.
p99, p95, p90, p75, p50.
Demo
// Calculate p99 of both Endpoint1 and Endpoint2
endpoint_p99 = from(Endpoint.latency).filter(name in ("Endpoint1", "Endpoint2")).summary(0.99)
// Calculate p99 of Endpoint name started with `serv`
serv_Endpoint_p99 = from(Endpoint.latency).filter(name like "serv%").summary(0.99)
// Calculate the avg response time of each Endpoint
endpoint_resp_time = from(Endpoint.latency).avg()
// Calculate the p50, p75, p90, p95 and p99 of each Endpoint by 10 ms steps.
endpoint_percentile = from(Endpoint.latency).percentile(10)
// Calculate the percentage of responses whose status is true, for each service.
endpoint_success = from(Endpoint.*).filter(status == true).percent()
// Calculate the sum of response code in [404, 500, 503], for each service.
endpoint_abnormal = from(Endpoint.*).filter(httpResponseStatusCode in [404, 500, 503]).count()
// Calculate the sum of request type in [RequestType.RPC, RequestType.gRPC], for each service.
endpoint_rpc_calls_sum = from(Endpoint.*).filter(type in [RequestType.RPC, RequestType.gRPC]).count()
// Calculate the sum of endpoint name in ["/v1", "/v2"], for each service.
endpoint_url_sum = from(Endpoint.*).filter(name in ["/v1", "/v2"]).count()
// Calculate the sum of calls for each service.
endpoint_calls = from(Endpoint.*).count()
// Calculate the CPM with the GET method for each service.The value is made up with `tagKey:tagValue`.
// Option 1, use `tags contain`.
service_cpm_http_get = from(Service.*).filter(tags contain "http.method:GET").cpm()
// Option 2, use `tag[key]`.
service_cpm_http_get = from(Service.*).filter(tag["http.method"] == "GET").cpm();
// Calculate the CPM with the HTTP method except for the GET method for each service.The value is made up with `tagKey:tagValue`.
service_cpm_http_other = from(Service.*).filter(tags not contain "http.method:GET").cpm()
disable(segment);
disable(endpoint_relation_server_side);
disable(top_n_database_statement);
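The apdex and histogram functions from the list above are not covered in the demo; they can be sketched as follows, with parameter values taken from the upstream OAL examples:

```
// Apdex score of each service; `name` selects the Apdex threshold group, `status` marks success.
service_apdex = from(Service.latency).apdex(name, status)
// Latency heat map across all services, in 100 ms steps with 20 buckets.
all_heatmap = from(All.latency).histogram(100, 20)
```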
Alarms
Individual rules
- Rule name
- Metrics name
- Include names
- Exclude names
- Include names regex
- Exclude names regex
- Include labels
- Exclude labels
- Include labels regex
- Exclude labels regex
- Tags
- Threshold
- Operator
- Period
- Count
- Only as condition
- Silence period
Composite rules
- Rule name
- Expression
- Message
- Tags
Demo
rules:
  # Rule unique name, must end with `_rule`.
  endpoint_percent_rule:
    # The metrics value needs to be long, double, or int.
    metrics-name: endpoint_percent
    threshold: 75
    op: <
    # The length of time to evaluate the metrics
    period: 10
    # How many times the metrics must match the condition before an alarm is triggered
    count: 3
    # How many checks the alarm keeps silent for after being triggered; defaults to the same as period.
    silence-period: 10
    # Specify whether the rule can send notifications or only acts as a condition of a composite rule
    only-as-condition: false
    tags:
      level: WARNING
  service_percent_rule:
    metrics-name: service_percent
    # [Optional] Default: match all services in this metrics
    include-names:
      - service_a
      - service_b
    exclude-names:
      - service_c
    # Single-value metrics threshold.
    threshold: 85
    op: <
    period: 10
    count: 4
    only-as-condition: false
  service_resp_time_percentile_rule:
    # The metrics value needs to be long, double, or int.
    metrics-name: service_percentile
    op: ">"
    # Multiple-value metrics threshold. Thresholds for P50, P75, P90, P95, P99.
    threshold: 1000,1000,1000,1000,1000
    period: 10
    count: 3
    silence-period: 5
    message: Percentile response time of service {name} alarm in 3 minutes of last 10 minutes, due to more than one condition of p50 > 1000, p75 > 1000, p90 > 1000, p95 > 1000, p99 > 1000
    only-as-condition: false
  meter_service_status_code_rule:
    metrics-name: meter_status_code
    exclude-labels:
      - "200"
    op: ">"
    threshold: 10
    period: 10
    count: 3
    silence-period: 5
    message: The request number of entity {name} non-200 status is more than expected.
    only-as-condition: false
composite-rules:
  comp_rule:
    # Must satisfy both the percent rule and the response time rule
    expression: service_percent_rule && service_resp_time_percentile_rule
    message: Service {name} successful rate is less than 80% and P50 of response time is over 1000ms
    tags:
      level: CRITICAL
WeChat notifications
wechatHooks:
  textTemplate: |-
    {
      "msgtype": "text",
      "text": {
        "content": "Apache SkyWalking Alarm: \n %s."
      }
    }
  webhooks:
    - https://qyapi.weixin.qq.com/cgi-bin/webhook/send?key=dummy_key
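Other channels follow the same hook pattern. As a sketch based on the upstream alarm settings, a Slack hook would look roughly like this (the webhook URL below is a placeholder):

```yaml
slackHooks:
  textTemplate: |-
    {
      "type": "section",
      "text": {
        "type": "mrkdwn",
        "text": ":alarm_clock: *Apache SkyWalking Alarm* \n **%s**."
      }
    }
  webhooks:
    - https://hooks.slack.com/services/x/y/z
```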