Loki

Installing Loki

cd /tmp
yum -y install unzip curl
# download the Loki v2.3.0 binary (here via the ghproxy mirror)
curl -O -L "https://ghproxy.com/https://github.com/grafana/loki/releases/download/v2.3.0/loki-linux-amd64.zip"
unzip "loki-linux-amd64.zip"
sudo cp loki-linux-amd64 /usr/local/bin/
sudo chmod u+x /usr/local/bin/loki-linux-amd64
mkdir -p /etc/loki
mkdir -p /data/loki
# single-node config: in-memory ring, BoltDB index, filesystem chunk store
cat << EOF > /etc/loki/loki-local-config.yaml
auth_enabled: false
server:
  http_listen_port: 3100
ingester:
  lifecycler:
    address: 127.0.0.1
    ring:
      kvstore:
        store: inmemory
      replication_factor: 1
    final_sleep: 0s
  chunk_idle_period: 5m
  chunk_retain_period: 30s
  max_transfer_retries: 0
schema_config:
  configs:
    - from: 2021-09-14
      store: boltdb
      object_store: filesystem
      schema: v11
      index:
        prefix: index_
        period: 168h
storage_config:
  boltdb:
    directory: /data/loki/index
  filesystem:
    directory: /data/loki/chunks
limits_config:
  enforce_metric_name: false
  reject_old_samples: true
  reject_old_samples_max_age: 168h
chunk_store_config:
  max_look_back_period: 0s
table_manager:
  retention_deletes_enabled: false
  retention_period: 0s
EOF
# systemd unit so Loki starts at boot and restarts on failure
cat << EOF > /etc/systemd/system/lokid.service
[Unit]
Description=lokid service node
After=network-online.target
[Service]
Type=simple
ExecStart=/usr/local/bin/loki-linux-amd64 --config.file=/etc/loki/loki-local-config.yaml
Restart=always
RestartSec=30s
[Install]
WantedBy=multi-user.target
EOF
systemctl daemon-reload && systemctl enable lokid && systemctl start lokid
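
If the service came up cleanly, Loki should answer on port 3100 within a few seconds. A quick smoke test (the /ready endpoint is covered in more detail in the HTTP API section below):

systemctl status lokid --no-pager
curl -s localhost:3100/ready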

Installing Promtail

cd /tmp
curl -O -L "https://ghproxy.com/https://github.com/grafana/loki/releases/download/v2.3.0/promtail-linux-amd64.zip"
unzip "promtail-linux-amd64.zip"
sudo cp promtail-linux-amd64 /usr/local/bin/
sudo chmod u+x /usr/local/bin/promtail-linux-amd64
mkdir -p /etc/promtail
cd /etc/promtail
# fetch the upstream example config (it tails /var/log/*log under the label job=varlogs)
wget https://raw.githubusercontent.com/grafana/loki/main/clients/cmd/promtail/promtail-local-config.yaml
cat > /etc/systemd/system/promtail.service <<EOF
[Unit]
Description=Promtail service
After=network.target
[Service]
Type=simple
ExecStart=/usr/local/bin/promtail-linux-amd64 --config.file /etc/promtail/promtail-local-config.yaml
[Install]
WantedBy=multi-user.target
EOF
systemctl daemon-reload && systemctl enable promtail && systemctl start promtail
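
For reference, the fetched file looks roughly like this (copied from the upstream example; verify against the file you actually downloaded, since main may change):

server:
  http_listen_port: 9080
  grpc_listen_port: 0

positions:
  filename: /tmp/positions.yaml

clients:
  - url: http://localhost:3100/loki/api/v1/push

scrape_configs:
  - job_name: system
    static_configs:
      - targets:
          - localhost
        labels:
          job: varlogs
          __path__: /var/log/*log

This is where the job="varlogs" label used in the logcli examples below comes from.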

Installing logcli

cd /tmp
curl -O -L "https://ghproxy.com/https://github.com/grafana/loki/releases/download/v2.3.0/logcli-linux-amd64.zip"
unzip "logcli-linux-amd64.zip"
sudo cp logcli-linux-amd64 /usr/local/bin/
sudo chmod u+x /usr/local/bin/logcli-linux-amd64
[root@xxx promtail]# logcli-linux-amd64 labels job
http://localhost:3100/loki/api/v1/label/job/values?end=1631585752373078818&start=1631582152373078818
varlogs
[root@cf-prod-ops promtail]# logcli-linux-amd64 query '{job="varlogs"}'
http://localhost:3100/loki/api/v1/query_range?direction=BACKWARD&end=1631585815853130831&limit=30&query=%7Bjob%3D%22varlogs%22%7D&start=1631582215853130831
Common labels: {filename="/var/log/cloud-init.log", job="varlogs"}
2021-09-14T10:08:51+08:00 {} 2021-09-01 09:43:13,281 - handlers.py[DEBUG]: finish: modules-final: SUCCESS: running modules for final
2021-09-14T10:08:51+08:00 {} 2021-09-01 09:43:13,281 - util.py[DEBUG]: cloud-init mode 'modules' took 0.199 seconds (0.19)

Component overview

Official architecture diagram: ../images/loki_architecture_components.svg

Distributor

Validates incoming writes and forwards the streams to the ingesters.

Ingester

Handles the actual storage: it builds chunks from incoming lines, flushes them to the backing store, and serves its in-memory data to queriers.

Querier

Pulls logs from the ingesters (and, when needed, the backing store) to answer queries.

Query Frontend

Sits in front of the queriers, exposing the query API and handling queueing and splitting of large queries.

Data flow

Chunks are stored in a NoSQL key-value fashion: a hash ID is generated from the key for fast lookup, and the chunk data itself is stored as the item (see ../images/chunks_diagram.png). The overall write flow, and the matching read flow, break down as:

  • The distributor receives an HTTP/1 request to store data for streams.
  • Each stream is hashed from its label set.
  • The stream is sent to the appropriate ingesters and their replicas.
  • Each ingester creates a chunk for the stream's data or appends to an existing one.
  • The distributor returns a success code.
  • The querier receives an HTTP/1 read request.
  • The querier passes the query to all ingesters for their in-memory data.
  • Each ingester that has matching data returns it.
  • If no ingester returns data, the querier lazily loads the matching records from the backing store and runs the query against them.
  • The querier deduplicates the data received from the ingesters and returns the result.

Common clients

logstash

https://grafana.com/docs/loki/next/clients/logstash/

logstash-plugin install logstash-output-loki
Using bundled JDK: /usr/share/logstash/jdk
OpenJDK 64-Bit Server VM warning: Option UseConcMarkSweepGC was deprecated in version 9.0 and will likely be removed in a future release.
Validating logstash-output-loki
Resolving mixin dependencies
Installing logstash-output-loki
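
Once the plugin is installed, a pipeline only needs a loki output pointing at the push endpoint. A minimal sketch (the file input path is just an example; see the docs linked above for batching and label options):

input {
  file {
    path => "/var/log/messages"
  }
}
output {
  loki {
    url => "http://localhost:3100/loki/api/v1/push"
  }
}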

Clustered Loki deployment

memberlist

memberlist:
  join_members: ["172.16.27.71", "172.16.27.85"]
  node_name: "172.16.27.71"
  dead_node_reclaim_time: 30s
  gossip_to_dead_nodes_time: 15s
  left_ingesters_timeout: 30s
  bind_addr: ['0.0.0.0']
  bind_port: 7946

Note that join_members expects IPs here by default, even though the official docs show hostnames, which is odd; in my setup node_name is an IP as well. It may simply be that this happens when hostnames cannot be resolved.

ingester

Set the ingester's replication_factor to 2:

ingester:
  lifecycler:
    address: 127.0.0.1  # in a real cluster, set this to the node's reachable IP
    ring:
      kvstore:
        # memberlist (not inmemory), so ingesters on different nodes share one ring
        store: memberlist
      replication_factor: 2
    final_sleep: 0s
  chunk_idle_period: 5m
  chunk_retain_period: 30s
  max_transfer_retries: 0
distributor:
  ring:
    kvstore:
      store: memberlist

This is the memberlist-backed ring configuration for the ingester and the distributor.

Verification

With memberlist configured, it can now be referenced as the kvstore in many places. Check systemctl status lokid: startup usually succeeds at this point, and a log line like the following confirms the node joined the cluster: caller=memberlist_client.go:504 msg="joined memberlist cluster" reached_nodes=1
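
The ring can also be inspected over HTTP; in this version /ring returns an HTML status page, so it is easiest to read in a browser:

curl -s localhost:3100/ring | head -n 5
journalctl -u lokid | grep memberlist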

Which components can join memberlist? Most of them seem to support it; check the GitHub issues for specifics.

The lifecycler and the ring are also worth understanding. Per the official architecture docs, the lifecycler manages an ingester's tokens in the hash ring, and each ingester maintains a state: PENDING, JOINING, ACTIVE, LEAVING, or UNHEALTHY.

Notice that the members form the cluster via gossip, so gossip deserves a look.

gossip

Gossip is a protocol, also called an epidemic protocol; the Zhihu answers on gossip are a good introduction. In essence, each node periodically spreads what it knows to a few peers, so information eventually propagates through the whole cluster at a predictable rate.

Adding caches

The following is copied from the 云原生小白 article; I have not tried it yet.

query_range:
  results_cache:
    cache:
      redis:
        endpoint: redis:6379
        expiration: 1h
  cache_results: true

index_queries_cache_config:
  redis:
    endpoint: redis:6379
    expiration: 1h

chunk_store_config:
  chunk_cache_config:
    redis:
      endpoint: redis:6379
      expiration: 1h
  write_dedupe_cache_config:
    redis:
      endpoint: redis:6379
      expiration: 1h

HTTP API

The HTTP API is a very important piece: it exposes the cluster's state and is closely tied to everyday use, so it has to be understood. Backend endpoints:

# GET /ready
curl localhost:3100/ready
ready

# GET /metrics
curl -s localhost:3100/metrics|head -n 5
# HELP cortex_cache_corrupt_chunks_total Total count of corrupt chunks found in cache.
# TYPE cortex_cache_corrupt_chunks_total counter
cortex_cache_corrupt_chunks_total 0
# HELP cortex_chunk_store_chunks_per_query Distribution of #chunks per query.
# TYPE cortex_chunk_store_chunks_per_query histogram

#GET /config
curl -s localhost:3100/config|head -n 5
target: all
http_prefix: ""
server:
  http_listen_address: ""
  http_listen_port: 3100

#GET /loki/api/v1/status/buildinfo
curl -s localhost:3100/loki/api/v1/status/buildinfo
{"version":"","revision":"","branch":"","buildUser":"","buildDate":"","goVersion":""}

Going through the Loki HTTP API docs once is recommended.

grafana

Add Loki as a data source.
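
This can be done in the UI, or via file-based provisioning; a minimal sketch, assuming a default Grafana install with Loki on the same host:

# /etc/grafana/provisioning/datasources/loki.yaml
apiVersion: 1
datasources:
  - name: Loki
    type: loki
    access: proxy
    url: http://localhost:3100

Once the data source is added, Grafana's Explore view shows the following cheat sheet: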

Loki Cheat Sheet
See your logs
Start by selecting a log stream from the Log browser, or alternatively you can write a stream selector into the query field.
Here is an example of a log stream:
{job="default/prometheus"}
Combine stream selectors
{app="cassandra",namespace="prod"}
Returns all log lines from streams that have both labels.
Filtering for search terms.
{app="cassandra"} |~ "(duration|latency)s*(=|is|of)s*[d.]+"
{app="cassandra"} |= "exact match"
{app="cassandra"} != "do not match"
LogQL supports exact and regular expression filters.
Log pipeline
{job="mysql"} |= "metrics" | logfmt | duration > 10s
This query targets the MySQL job, filters out logs that don’t contain the word "metrics" and parses each log line to extract more labels and filters with them.
Count over time
count_over_time({job="mysql"}[5m])
This query counts all the log lines within the last five minutes for the MySQL job.
Rate
rate(({job="mysql"} |= "error" != "timeout")[10s])
This query gets the per-second rate of all non-timeout errors within the last ten seconds for the MySQL job.
Aggregate, count, and group
sum(count_over_time({job="mysql"}[5m])) by (level)
Get the count of logs during the last five minutes, grouping by level.

FAQ

entry with timestamp ignored reason 'entry out of order' for stream
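
This means a client pushed a line whose timestamp is older than the newest entry already accepted for that stream. Loki 2.3 rejects out-of-order writes, so make sure each stream is written by a single, ordered source (for example, avoid two promtail instances pushing the same label set); newer releases (2.4+) can accept out-of-order writes via unordered_writes in limits_config.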

Maximum active stream limit exceeded

limits_config:
  max_streams_per_user: 0
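
Setting max_streams_per_user to 0 disables the per-tenant active stream limit; raising it to a larger finite value is safer if you still want some protection against label-cardinality explosions.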