ElasticSearch

ELK Preparation

Add the package sources

For details, see https://www.elastic.co/guide/en/logstash/7.16/installing-logstash.html#_yum

Debian family

wget -qO - https://artifacts.elastic.co/GPG-KEY-elasticsearch | sudo apt-key add -
sudo apt-get install apt-transport-https
sudo sh -c 'echo "deb https://artifacts.elastic.co/packages/7.x/apt stable main" > /etc/apt/sources.list.d/elastic-7.x.list'

CentOS family

sudo rpm --import https://artifacts.elastic.co/GPG-KEY-elasticsearch
cat <<EOF > /etc/yum.repos.d/elastic.repo
[logstash-7.x]
name=Elastic repository for 7.x packages
baseurl=https://artifacts.elastic.co/packages/7.x/yum
gpgcheck=1
gpgkey=https://artifacts.elastic.co/GPG-KEY-elasticsearch
enabled=1
autorefresh=1
type=rpm-md
EOF

Logstash

Installation

Debian family
sudo apt-get install logstash
CentOS family
sudo yum -y install logstash

ElasticSearch

Introduction

See https://www.elastic.co/guide/cn/elasticsearch/guide/current/getting-started.html

Elasticsearch is a real-time distributed search and analytics engine built on the Lucene library. It is mainly used for full-text search, structured search, analytics, or any combination of the three.

Common use cases include system log analysis, application data analysis, security auditing, and keyword search.

ES is document-oriented: complex objects such as geo data and dates can be stored as-is, which is an advantage over relational databases.

Installation

  # if centos
  yum -y install elasticsearch
  # if debian
  apt-get install elasticsearch

Startup

systemctl start elasticsearch
systemctl enable elasticsearch
lsof -i:9200
COMMAND   PID          USER   FD   TYPE   DEVICE SIZE/OFF NODE NAME
java    32481 elasticsearch  280u  IPv4 29805357      0t0  TCP localhost:wap-wsp (LISTEN)

Possible errors:

If startup fails with failed; error='Not enough space' (errno=12), adjust the ES JVM options:

liuliancao@liuliancao-dev:~/projects/lion$ sudo cat /etc/elasticsearch/jvm.options|grep Xm
## -Xms4g
## -Xmx4g
-Xms200m
-Xmx200m
Startup takes a while; about 20s on my VM.

Production JVM options for reference

-Xms4g
-Xmx4g
8-13:-XX:+UseConcMarkSweepGC
8-13:-XX:CMSInitiatingOccupancyFraction=75
8-13:-XX:+UseCMSInitiatingOccupancyOnly
14-:-XX:+UseG1GC
14-:-XX:G1ReservePercent=25
14-:-XX:InitiatingHeapOccupancyPercent=30
-Djava.io.tmpdir=${ES_TMPDIR}
-XX:+HeapDumpOnOutOfMemoryError
-XX:HeapDumpPath=/data/lib/elasticsearch
-XX:ErrorFile=/data/log/elasticsearch/hs_err_pid%p.log
8:-XX:+PrintGCDetails
8:-XX:+PrintGCDateStamps
8:-XX:+PrintTenuringDistribution
8:-XX:+PrintGCApplicationStoppedTime
8:-Xloggc:/data/log/elasticsearch/gc.log
8:-XX:+UseGCLogFileRotation
8:-XX:NumberOfGCLogFiles=32
8:-XX:GCLogFileSize=64m
9-:-Xlog:gc*,gc+age=trace,safepoint:file=/data/log/elasticsearch/gc.log:utctime,pid,tags:filecount=32,filesize=64m

Testing

cat logstash-first.conf
input { stdin { } }
output {
  elasticsearch { hosts => ["localhost:9200"] }
  stdout { codec => rubydebug }
}
# logstash -f logstash-first.conf
Using bundled JDK: /usr/share/logstash/jdk
OpenJDK 64-Bit Server VM warning: Option UseConcMarkSweepGC was deprecated in version 9.0 and will likely be removed in a future release.
hello, world!
[INFO ] 2021-07-07 11:09:32.287 [Ruby-0-Thread-10: :1] elasticsearch - Installing ILM policy {"policy"=>{"phases"=>{"hot"=>{"actions"=>{"rollover"=>{"max_size"=>"50gb", "max_age"=>"30d"}}}}}} {:name=>"logstash-policy"}
{
      "@version" => "1",
       "message" => "hello, world!",
          "host" => "xxx",
    "@timestamp" => 2021-07-07T03:09:32.187Z
}

This shows the data was successfully written into ES.
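
To verify from the ES side, you can search the index Logstash just wrote to (a sketch assuming the default logstash-* index pattern):

curl -H 'Content-Type: application/json' 'http://localhost:9200/logstash-*/_search?pretty' -d '
{
  "query": { "match": { "message": "hello, world!" } }
}'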

Cluster setup

See the cluster setup reference.

Three servers.

RESTful API with JSON over http

Interaction goes through port 9200.

liuliancao@liuliancao-dev:~/projects/lion$ sudo lsof -i:9200
COMMAND   PID          USER   FD   TYPE DEVICE SIZE/OFF NODE NAME
java    50762 elasticsearch  284u  IPv6 141518      0t0  TCP localhost:9200 (LISTEN)
java    50762 elasticsearch  285u  IPv6 141519      0t0  TCP localhost:9200 (LISTEN)

Curl, Groovy, Javascript, .NET, PHP, Perl, Python, Ruby (https://www.elastic.co/guide/en/elasticsearch/client/index.html)
Curl

curl -X<VERB> '<PROTOCOL>://<HOST>:<PORT>/<PATH>?<QUERY_STRING>' -d '<BODY>'

Count the documents in the cluster

curl -XGET 'http://localhost:9200/_count?pretty' -d '
{
    "query": {
        "match_all": {}
    }
}'

The actual run failed at first:

liuliancao@liuliancao-dev:~/projects/lion$ curl -XGET 'http://localhost:9200/_count?pretty' -d '{ "query": { "match_all": {} } }'
{
  "error" : "Content-Type header [application/x-www-form-urlencoded] is not supported",
  "status" : 406
}

The header needs adjusting. The result below means no shards or documents exist yet:

liuliancao@liuliancao-dev:~/projects/lion$ curl -XGET -H 'Content-Type: application/json' 'http://localhost:9200/_count?pretty' -d '{ "query": { "match_all": {} } }'
{
  "count" : 0,
  "_shards" : {
    "total" : 0,
    "successful" : 0,
    "skipped" : 0,
    "failed" : 0
  }
}

Objects are stored as JSON.
Some ES concepts:
Index
Type
Property
Checking cluster status
# curl -XGET 'http://localhost:9200/_cluster/health'
{"cluster_name":"web","status":"red","timed_out":false,"number_of_nodes":6,"number_of_data_nodes":3,"active_primary_shards":4416,"active_shards":4416,"relocating_shards":0,"initializing_shards":12,"unassigned_shards":34046,"delayed_unassigned_shards":0,"number_of_pending_tasks":66,"number_of_in_flight_fetch":0,"task_max_waiting_in_queue_millis":907745,"active_shards_percent_as_number":11.477881166502053}
List all indices
curl -X GET "localhost:9200/_cat/indices?v"
Delete indices by wildcard
DELETE /your-index-pattern*

If you prefer a UI, indices can also be deleted from Kibana's index management page.
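
The curl equivalent (a sketch: the pattern is a placeholder, and the cluster must permit wildcard deletion, i.e. action.destructive_requires_name is not enforced):

curl -XDELETE 'http://localhost:9200/your-index-pattern*'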

Setting number_of_replicas and number_of_shards for an index

Recently our cluster ran out of shards, so a colleague and I reviewed the parameters. Index settings are split into dynamic and static parameters: https://www.elastic.co/guide/en/elasticsearch/reference/6.5/index-modules.html#_static_index_settings

https://www.elastic.co/guide/en/elasticsearch/reference/6.5/index-modules.html#dynamic-index-settings

In my case the indices start with logstash-; if you have no matching template you need to create one. I mainly wanted to lower number_of_shards and number_of_replicas.

You cannot change number_of_shards by PUTting settings on an existing index; you can only set it through a template so it affects future indices. For existing indices you have to reindex.
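
number_of_replicas, by contrast, is a dynamic setting and can be changed on an existing index directly (a sketch; the index name is an example):

curl -XPUT -H 'Content-Type: application/json' 'http://localhost:9200/logstash-2023.05.01/_settings' -d '
{
  "index": { "number_of_replicas": 0 }
}'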

Modify the template

  PUT /_template/logstash
  {
    "index_patterns": ["*"],
    "settings": {
      "number_of_replicas": 0,
      "number_of_shards": 3
    }
  }

Example reindex and alias creation

POST _reindex
{
  "source": {
    "index": "xxx-2023.05-x"
  },
  "dest": {
    "index": "xxx-2023.05-x-new"
  }
}

Later the cluster was still red; checking showed unassigned shards remained. Deleting the red indices restored the cluster.

GET _cluster/allocation/explain?pretty

The output shows what the problem is.

https://www.elastic.co/guide/en/elasticsearch/reference/current/cluster-allocation-explain.html — the errors basically fall into these few categories.
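
If the explanation shows shards that have hit the allocation retry limit, you can ask the cluster to retry them; this is the same endpoint the FAQ error below points at:

curl -XPOST 'http://localhost:9200/_cluster/reroute?retry_failed=true'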

DSL

Query

A typical query: https://www.elastic.co/guide/en/elasticsearch/reference/current/query-filter-context.html

  GET /_search
  {
    "query": {
      "bool": {
        "must": [
          { "match": { "title":   "Search"        }},
          { "match": { "content": "Elasticsearch" }}
        ],
        "filter": [
          { "term":  { "status": "published" }},
          { "range": { "publish_date": { "gte": "2015-01-01" }}}
        ]
      }
    }
  }
Regex matching

https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-regexp-query.html

  GET /_search
  {
      "query": {
            "regexp": {
                    "user.id": {
                              "value": "k.*y",
                              "flags": "ALL",
                              "case_insensitive": true,
                              "max_determinized_states": 10000,
                              "rewrite": "constant_score"
                    }
            }
      }
  }

Aggregation queries

Sorting by count inside an aggregation
"aggs": {
       "hours data": {
          "date_histogram": {
            "field": "@timestamp",
            "calendar_interval": "1m",
            "time_zone": "Asia/Shanghai",
            "min_doc_count": 100,
            "order": {
              "_count": "desc"
            }
          }
       }
  }

kibana

Trying it out

Visit the server's address on port 5601 in a browser. Putting it behind nginx with SSL is recommended; it is safer.
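
A minimal nginx reverse-proxy sketch, assuming Kibana on localhost and certificates under /etc/nginx/ssl/ (the server_name is a placeholder):

server {
    listen 443 ssl;
    server_name kibana.example.com;                  # placeholder domain

    ssl_certificate     /etc/nginx/ssl/kibana.crt;   # assumed cert path
    ssl_certificate_key /etc/nginx/ssl/kibana.key;

    location / {
        proxy_pass http://127.0.0.1:5601;            # Kibana default port
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
    }
}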

FAQ

ES error, Kibana fails to start

shard has exceeded the maximum number of retries [5] on failed allocation attempts - manually call [/_cluster/reroute?retry_failed=true] to retry, [unassigned_inforeason=ALLOCATION_FAILED], at[2024-03-24T12:14:02.651Z], failed_attempts[5], failed_nodes[[joxyW01nTNCGvFW1IjPQMQ, JaEcQBEZTOiztZdlj-iZBw, delayed=false, details[failed shard on node [JaEcQBEZTOiztZdlj-iZBw]: failed recovery, failure RecoveryFailedExceptionlogstash-overseas-ssjj2-hall-server_accesslog-2024.03; nested: CircuitBreakingExceptionparent] Data too large, data for [internal:index/shard/recovery/start_recovery] would be [4212820374/3.9gb], which is larger than the limit of [4080218931/3.7gb], real usage: [4212805912/3.9gb], new bytes reserved: [14462/14.1kb], usages [request=0/0b, fielddata=259024/252.9kb, in_flight_requests=23636/23kb, model_inference=0/0b, eql_sequence=0/0b, accounting=449937262/429mb; ], allocation_status[no_attempt]]]

The result: Kibana kept going down and ES was in an abnormal state, with active shards never reaching 100%.

Fix: in elasticsearch.yml

indices.breaker.total.use_real_memory: false
indices.breaker.total.limit: 70%

After adding this, the cluster status turned green.

Elasticsearch in action

Importing the book's data into Elasticsearch 6.8

The book targets a fairly old version and has many incompatibilities, so I made a simple adaptation for 6.8.

You can use my fork at https://github.com/liuliancao/elasticsearch-in-action/tree/6.8 for the import.

The import method is simply to put the script on the server and run it. I did not use mapping.json; I let ES generate the mappings automatically. Also, since newer ES versions no longer support multiple types, the indices become get-together-event and get-together-group,

corresponding to the book's get-together/event and get-together/group.

I dropped parent relationships and the like for now; look at the official examples when you need to study them. The main thing is to understand that the concept exists and extend from there.

Chapter2

insert data p26

root@elk-test-all-in-one:~# curl -H "Content-Type: application/json" -XPUT '192.168.8.150:9200/get-together/group/1?pretty' -d '{"name":"Elasticsearch Denver","organizer":"Lee"}'
{
  "_index" : "get-together",
  "_type" : "group",
  "_id" : "1",
  "_version" : 1,
  "result" : "created",
  "_shards" : {
    "total" : 2,
    "successful" : 1,
    "failed" : 0
  },
  "_seq_no" : 0,
  "_primary_term" : 1
}

create an index

root@elk-test-all-in-one:~# curl -H "Content-Type: application/json" -XPUT '192.168.8.150:9200/new-index'
{"acknowledged":true,"shards_acknowledged":true,"index":"new-index"}

get index mapping

root@elk-test-all-in-one:~# curl  '192.168.8.150:9200/get-together/_mapping/group?pretty'
{
  "get-together" : {
    "mappings" : {
      "group" : {
        "properties" : {
          "name" : {
            "type" : "text",
            "fields" : {
              "keyword" : {
                "type" : "keyword",
                "ignore_above" : 256
              }
            }
          },
          "organizer" : {
            "type" : "text",
            "fields" : {
              "keyword" : {
                "type" : "keyword",
                "ignore_above" : 256
              }
            }
          }
        }
      }
    }
  }
}

insert data by script (pay attention to your version: switch to the matching 6.x or 7.x branch)

# clone the repository
git clone https://github.com/dakrone/elasticsearch-in-action.git

# switch to a branch that matches your version. Master works with 1.x and 2.x
# but we currently support 5.x, 6.x and 7.x as well:
git clone https://github.com/dakrone/elasticsearch-in-action.git -b 6.x

# index the sample data
elasticsearch-in-action/populate.sh

get all data

curl "192.168.8.150:9200/get-together/_search?pretty"

search data (uri search)

root@elk-test-all-in-one:~# curl "192.168.8.150:9200/get-together/_search?pretty&q=EC2"
{
  "took" : 11,
  "timed_out" : false,
  "_shards" : {
    "total" : 2,
    "successful" : 2,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : 1,
    "max_score" : 2.148003,
    "hits" : [
      {
        "_index" : "get-together",
        "_type" : "_doc",
        "_id" : "101",
        "_score" : 2.148003,
        "_routing" : "1",
        "_source" : {
          "relationship_type" : {
            "name" : "event",
            "parent" : "1"
          },
          "host" : "Sean",
          "title" : "Sunday, Surly Sunday",
          "description" : "Sort out any setup issues and work on Surlybird issues. We can use the EC2 node as a bounce point for pairing.",
          "attendees" : [
            "Daniel",
            "Michael",
            "Sean"
          ],
          "date" : "2013-07-21T18:30",
          "location_event" : {
            "name" : "IRC, #denofclojure"
          },
          "reviews" : 2
        }
      }
    ]
  }
}

json query (request body search)

A generic JSON search: build a query whose query string is EC2, equivalent to INDEX/_search?q=EC2 above.

root@elk-test-all-in-one:~# curl -H "Content-Type: application/json"  "192.168.8.150:9200/get-together/_search?pretty" -d '{"query":{"query_string":{"query":"EC2"}}}'
{
  "took" : 5,
  "timed_out" : false,
  "_shards" : {
    "total" : 2,
    "successful" : 2,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : 1,
    "max_score" : 2.148003,
    "hits" : [
      {
        "_index" : "get-together",
        "_type" : "_doc",
        "_id" : "101",
        "_score" : 2.148003,
        "_routing" : "1",
        "_source" : {
          "relationship_type" : {
            "name" : "event",
            "parent" : "1"
          },
          "host" : "Sean",
          "title" : "Sunday, Surly Sunday",
          "description" : "Sort out any setup issues and work on Surlybird issues. We can use the EC2 node as a bounce point for pairing.",
          "attendees" : [
            "Daniel",
            "Michael",
            "Sean"
          ],
          "date" : "2013-07-21T18:30",
          "location_event" : {
            "name" : "IRC, #denofclojure"
          },
          "reviews" : 2
        }
      }
    ]
  }
}

query like kibana

Here the query is literally a Lucene query string.

root@lqx-elk-test-all-in-one:~# curl -H "Content-Type: application/json"  "192.168.8.150:9200/get-together/_search?pretty" -d '{"query":{"query_string":{"query":"title:Liberator AND description:JBoss"}}}'
{
  "took" : 27,
  "timed_out" : false,
  "_shards" : {
    "total" : 2,
    "successful" : 2,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : 1,
    "max_score" : 4.2471023,
    "hits" : [
      {
        "_index" : "get-together",
        "_type" : "_doc",
        "_id" : "100",
        "_score" : 4.2471023,
        "_routing" : "1",
        "_source" : {
          "relationship_type" : {
            "name" : "event",
            "parent" : "1"
          },
          "host" : [
            "Lee",
            "Troy"
          ],
          "title" : "Liberator and Immutant",
          "description" : "We will discuss two different frameworks in Clojure for doing different things. Liberator is a ring-compatible web framework based on Erlang Webmachine. Immutant is an all-in-one enterprise application based on JBoss.",
          "attendees" : [
            "Lee",
            "Troy",
            "Daniel",
            "Tom"
          ],
          "date" : "2013-09-05T18:00",
          "location_event" : {
            "name" : "Stoneys Full Steam Tavern",
            "geolocation" : "39.752337,-105.00083"
          },
          "reviews" : 4
        }
      }
    ]
  }
}

query with term

A term is an exact keyword, so a term query means an exact match on a field. Here I couldn't get any results no matter what; my data probably differs from the book's, and the fields are analyzed (tokenized). With a keyword-typed field it should work. In general, map unique fields like name and id as keyword: term queries then work, and skipping analysis also performs better.

This is the old index, where the fields are text and similar types with no keyword type.

root@lqx-elk-test-all-in-one:~# curl  -H "Content-Type: application/json"  "192.168.8.150:9200/get-together?pretty"
{
  "get-together" : {
    "aliases" : { },
    "mappings" : {
      "_doc" : {
        "properties" : {
          "attendees" : {
            "type" : "text",
            "fields" : {
              "verbatim" : {
                "type" : "keyword"
              }
            }
          },
          "created_on" : {
            "type" : "date",
            "format" : "yyyy-MM-dd"
          },
          "date" : {
            "type" : "date",
            "format" : "date_hour_minute"
          },
          "description" : {
            "type" : "text",
            "term_vector" : "with_positions_offsets"
          },
          "host" : {
            "type" : "text"
          },
          "location_event" : {
            "properties" : {
              "geolocation" : {
                "type" : "geo_point"
              },
              "name" : {
                "type" : "text"
              }
            }
          },
          "location_group" : {
            "type" : "text"
          },
          "members" : {
            "type" : "text"
          },
          "name" : {
            "type" : "text"
          },
          "organizer" : {
            "type" : "text"
          },
          "relationship_type" : {
            "type" : "join",
            "eager_global_ordinals" : true,
            "relations" : {
              "group" : "event"
            }
          },
          "reviews" : {
            "type" : "integer",
            "null_value" : 0
          },
          "tags" : {
            "type" : "text",
            "fields" : {
              "verbatim" : {
                "type" : "keyword"
              }
            }
          },
          "title" : {
            "type" : "text"
          }
        }
      }
    },
    "settings" : {
      "index" : {
        "number_of_shards" : "2",
        "provided_name" : "get-together",
        "creation_date" : "1724298898585",
        "analysis" : {
          "filter" : {
            "myCustomFilter1" : {
              "type" : "lowercase"
            },
            "myCustomFilter2" : {
              "type" : "kstem"
            }
          },
          "char_filter" : {
            "myCustomCharFilter" : {
              "type" : "mapping",
              "mappings" : [
                "ph=>f",
                " u => you ",
                "ES=>Elasticsearch"
              ]
            }
          },
          "analyzer" : {
            "myCustomAnalyzer" : {
              "filter" : [
                "myCustomFilter1",
                "myCustomFilter2"
              ],
              "char_filter" : [
                "myCustomCharFilter"
              ],
              "type" : "custom",
              "tokenizer" : "myCustomTokenizer"
            }
          },
          "tokenizer" : {
            "myCustomTokenizer" : {
              "type" : "letter"
            },
            "myCustomNGramTokenizer" : {
              "type" : "ngram",
              "min_gram" : "2",
              "max_gram" : "3"
            }
          }
        },
        "number_of_replicas" : "1",
        "uuid" : "awp57IJXQdmuGMLsrMuvOA",
        "version" : {
          "created" : "6082399"
        }
      }
    }
  }
}

First create an index yourself and insert three documents.

# add liuliancao-in-action
root@lqx-elk-test-all-in-one:~# curl -XPUT -H "Content-Type: application/json"  "192.168.8.150:9200/liuliancao-in-action" -d '{"mappings":{"_doc":{"properties":{"name":{"type":"keyword"},"gender":{"type":"text"},"description":{"type":"text"}}}}}'
{"acknowledged":true,"shards_acknowledged":true,"index":"liuliancao-in-action"}
curl -XPOST -H "Content-Type: application/json"  "192.168.8.150:9200/liuliancao-in-action/_doc" -d '{"name":"lnh","description":"a sly fox", "gender":"male"}'
curl -XPOST -H "Content-Type: application/json"  "192.168.8.150:9200/liuliancao-in-action/_doc" -d '{"name":"ysg","description":"a big wolf", "gender":"male"}'
curl -XPOST -H "Content-Type: application/json"  "192.168.8.150:9200/liuliancao-in-action/_doc" -d '{"name":"asd","description":"a big wolf", "gender":"female"}'

Now run a query:

root@lqx-elk-test-all-in-one:~# curl -H "Content-Type: application/json"  "192.168.8.150:9200/liuliancao-in-action/_search?pretty" -d '{"query":{"term":{"name":"lnh"}}}'
{
  "took" : 2,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : 1,
    "max_score" : 0.6931472,
    "hits" : [
      {
        "_index" : "liuliancao-in-action",
        "_type" : "_doc",
        "_id" : "4v70kZEBPb4xbihRsA4c",
        "_score" : 0.6931472,
        "_source" : {
          "name" : "lnh",
          "description" : "a sly fox",
          "gender" : "male"
        }
      }
    ]
  }
}

query with aggregations term

Based on the index created above, aggregate on the name field with the terms aggregation; the aggregation is named xingshi_daquan.

root@lqx-elk-test-all-in-one:~# curl -H "Content-Type: application/json"  "192.168.8.150:9200/liuliancao-in-action/_search?pretty" -d '{"aggregations":{"xingshi_daquan":{"terms":{"field":"name"}}}}'
{
  "took" : 3,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : 3,
    "max_score" : 1.0,
    "hits" : [
      {
        "_index" : "liuliancao-in-action",
        "_type" : "_doc",
        "_id" : "Y_5pkpEBPb4xbihR2RN8",
        "_score" : 1.0,
        "_source" : {
          "name" : "asd",
          "description" : "a big wolf",
          "gender" : "female"
        }
      },
      {
        "_index" : "liuliancao-in-action",
        "_type" : "_doc",
        "_id" : "av5qkpEBPb4xbihRWBO1",
        "_score" : 1.0,
        "_source" : {
          "name" : "ysg",
          "description" : "a big wolf",
          "gender" : "male"
        }
      },
      {
        "_index" : "liuliancao-in-action",
        "_type" : "_doc",
        "_id" : "4v70kZEBPb4xbihRsA4c",
        "_score" : 1.0,
        "_source" : {
          "name" : "lnh",
          "description" : "a sly fox",
          "gender" : "male"
        }
      }
    ]
  },
  "aggregations" : {
    "xingshi_daquan" : {
      "doc_count_error_upper_bound" : 0,
      "sum_other_doc_count" : 0,
      "buckets" : [
        {
          "key" : "asd",
          "doc_count" : 1
        },
        {
          "key" : "lnh",
          "doc_count" : 1
        },
        {
          "key" : "ysg",
          "doc_count" : 1
        }
      ]
    }
  }
}

You can see each name ends up in its own bucket with its doc_count.

Chapter3

add index mapping and change it

Create a new index with field mappings; re-PUTting the index to modify them fails, saying the index already exists.

My version is 6.8; see https://www.elastic.co/guide/en/elasticsearch/reference/7.8/mapping.html

root@lqx-elk-test-all-in-one:~# curl -XPUT -H "Content-Type: application/json"  "192.168.8.150:9200/liuliancao-in-action-test1" -d '{"mappings":{"_doc":{"properties":{"name":{"type":"keyword"},"gender":{"type":"text"},"description":{"type":"text"},"books":{"type":"text"}}}}}'
{"acknowledged":true,"shards_acknowledged":true,"index":"liuliancao-in-action-test1"}
root@lqx-elk-test-all-in-one:~# curl -XPUT -H "Content-Type: application/json"  "192.168.8.150:9200/liuliancao-in-action-test1" -d '{"mappings":{"_doc":{"properties":{"name":{"type":"keyword"},"gender":{"type":"text"},"description":{"type":"text"},"books":{"type":"text"}, "sales":{"type":"number"}}}}}'
{"error":{"root_cause":[{"type":"resource_already_exists_exception","reason":"index [liuliancao-in-action-test1/oC9fSj5TSgKL7EE5MDUy3g] already exists","index_uuid":"oC9fSj5TSgKL7EE5MDUy3g","index":"liuliancao-in-action-test1"}],"type":"resource_already_exists_exception","reason":"index [liuliancao-in-action-test1/oC9fSj5TSgKL7EE5MDUy3g] already exists","index_uuid":"oC9fSj5TSgKL7EE5MDUy3g","index":"liuliancao-in-action-test1"},"status":400}

Note that _doc here is just a name; you can call it something else, for example:

PUT /my-index-000001
{
  "mappings": {
    "lqx_mappings": {
      "properties": {
        "age": {
          "type": "integer"
        },
        "email": {
          "type": "keyword"
        },
        "name": {
          "type": "keyword"
        }
      }
    }
  }
}
PUT /my-index-000001/_mappings/lqx_mappings
{
   "properties": {
     "publish":{"type":"integer"}
   }
}


GET /my-index-000001/_mappings
{
  "my-index-000001" : {
    "mappings" : {
      "lqx_mappings" : {
        "properties" : {
          "age" : {
            "type" : "integer"
          },
          "email" : {
            "type" : "keyword"
          },
          "name" : {
            "type" : "text"
          },
          "publish" : {
            "type" : "integer"
          }
        }
      }
    }
  }
}

So the failing attempt above can be done with this curl instead:

root@lqx-elk-test-all-in-one:~# curl -XPUT -H "Content-Type: application/json"  "192.168.8.150:9200/liuliancao-in-action/_mappings/_doc" -d '{"properties":{"publish":{"type":"integer"}}}'
{"acknowledged":true}

When you try to modify an existing field you get an error; at that point probably only a reindex will solve it.

root@lqx-elk-test-all-in-one:~# curl -XPUT -H "Content-Type: application/json"  "192.168.8.150:9200/liuliancao-in-action/_mappings/_doc" -d '{"properties":{"publish":{"type":"keyword"}}}'
{"error":{"root_cause":[{"type":"remote_transport_exception","reason":"[node-2][192.168.8.150:9301][indices:admin/mapping/put]"}],"type":"illegal_argument_exception","reason":"mapper [publish] of different type, current_type [integer], merged_type [keyword]"},"status":400}

But we can add sub-fields:

root@lqx-elk-test-all-in-one:~# curl -XPUT -H "Content-Type: application/json"  "192.168.8.150:9200/liuliancao-in-action/_mappings/_doc" -d '{"properties":{"publish":{"type":"integer","fields":{"verbatim":{"type":"keyword"}}}}}'
{"acknowledged":true}

So the ES API varies across versions; experiment a lot. Sometimes it is genuinely painful.

get specific fields

If you only want specific fields, the book's fields= parameter is gone; use _source or stored_fields (depends on the mapping) instead.

root@lqx-elk-test-all-in-one:~# curl -XGET -H "Content-Type: application/json"  "192.168.8.150:9200/liuliancao-in-action/_search?pretty" -d '{"_source":["name"], "query":{"match_all":{}}}'
{
  "took" : 1,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : 3,
    "max_score" : 1.0,
    "hits" : [
      {
        "_index" : "liuliancao-in-action",
        "_type" : "_doc",
        "_id" : "Y_5pkpEBPb4xbihR2RN8",
        "_score" : 1.0,
        "_source" : {
          "name" : "asd"
        }
      },
      {
        "_index" : "liuliancao-in-action",
        "_type" : "_doc",
        "_id" : "av5qkpEBPb4xbihRWBO1",
        "_score" : 1.0,
        "_source" : {
          "name" : "ysg"
        }
      },
      {
        "_index" : "liuliancao-in-action",
        "_type" : "_doc",
        "_id" : "4v70kZEBPb4xbihRsA4c",
        "_score" : 1.0,
        "_source" : {
          "name" : "lnh"
        }
      }
    ]
  }
}

Another way is stored_fields, but we cannot change an existing field to stored, so run a reindex:

curl -XPUT -H "Content-Type: application/json"  "192.168.8.150:9200/liuliancao-in-action-stored" -d '{
      "mappings":{"stored_mappings":{
        "properties" : {
          "description" : {
            "type" : "text"
          },
          "gender" : {
            "type" : "text"
          },
          "name" : {
            "type" : "keyword",
	    "store": true
          },
          "publish" : {
            "type" : "integer",
            "fields" : {
              "verbatim" : {
                "type" : "keyword"
              }
            }
          }
        }
      }
   }
  }
'

root@lqx-elk-test-all-in-one:~# curl -H "Content-Type: application/json"  "192.168.8.150:9200/_reindex" -d '{"source":{"index":"liuliancao-in-action"},"dest":{"index":"liuliancao-in-action-stored"}}'
{"took":182,"timed_out":false,"total":4,"updated":0,"created":4,"deleted":0,"batches":1,"version_conflicts":0,"noops":0,"retries":{"bulk":0,"search":0},"throttled_millis":0,"requests_per_second":-1.0,"throttled_until_millis":0,"failures":[]}

Now try stored_fields:

root@lqx-elk-test-all-in-one:~# curl -H "Content-Type: application/json"  "192.168.8.150:9200/liuliancao-in-action-stored/_search?pretty" -d '{"stored_fields":["name"], "_source": false}'
{
  "took" : 5,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : 5,
    "max_score" : 1.0,
    "hits" : [
      {
        "_index" : "liuliancao-in-action-stored",
        "_type" : "stored_mappings",
        "_id" : "Y_5pkpEBPb4xbihR2RN8",
        "_score" : 1.0,
        "fields" : {
          "name" : [
            "asd"
          ]
        }
      },
      {
        "_index" : "liuliancao-in-action-stored",
        "_type" : "stored_mappings",
        "_id" : "av5qkpEBPb4xbihRWBO1",
        "_score" : 1.0,
        "fields" : {
          "name" : [
            "ysg"
          ]
        }
      },
      {
        "_index" : "liuliancao-in-action-stored",
        "_type" : "stored_mappings",
        "_id" : "cP7klpEBPb4xbihRsj-4",
        "_score" : 1.0,
        "fields" : {
          "name" : [
            "ysg"
          ]
        }
      },
      {
        "_index" : "liuliancao-in-action-stored",
        "_type" : "stored_mappings",
        "_id" : "4v70kZEBPb4xbihRsA4c",
        "_score" : 1.0,
        "fields" : {
          "name" : [
            "lnh"
          ]
        }
      },
      {
        "_index" : "liuliancao-in-action-stored",
        "_type" : "stored_mappings",
        "_id" : "bf7klpEBPb4xbihRhT9x",
        "_score" : 1.0,
        "fields" : {
          "name" : [
            "lnh"
          ]
        }
      }
    ]
  }
}

stored_fields is generally used to avoid fetching the entire _source; the benefits are less data transfer, better performance, and not exposing extra information.

update doc by id

For updates on 6.8, see https://www.elastic.co/guide/en/elasticsearch/reference/6.8/docs-update.html; on newer versions replace 6.8 with your version or current.

Update doc content: use POST INDEX_NAME/_doc/DOC_ID/_update -d '{"doc":{}}'

Update with a script: POST INDEX_NAME/_doc/DOC_ID/_update -d '{"script":{}}'

See the official example; you can use variables in the script and assign computed values.

POST test/_doc/1/_update
{
    "script" : {
        "source": "if (ctx._source.tags.contains(params.tag)) { ctx.op = 'delete' } else { ctx.op = 'none' }",
        "lang": "painless",
        "params" : {
            "tag" : "green"
        }
    }
}
root@lqx-elk-test-all-in-one:~# curl -H "Content-Type: application/json" -XPOST "192.168.8.150:9200/liuliancao-in-action/_doc/4v70kZEBPb4xbihRsA4c/_update" -d '{"doc":{"description":"a very sly fox"}}'
{"_index":"liuliancao-in-action","_type":"_doc","_id":"4v70kZEBPb4xbihRsA4c","_version":2,"result":"updated","_shards":{"total":2,"successful":2,"failed":0}}

delete doc

Deletion is straightforward, but note that a deleted doc cannot be recovered unless you have snapshots, so be careful with this operation: DELETE /INDEX_NAME/_doc/DOC_ID

root@lqx-elk-test-all-in-one:~# curl  -XDELETE -H "Content-Type: application/json"  "192.168.8.150:9200/liuliancao-in-action/_doc/av5qkpEBPb4xbihRWBO1"
{"_index":"liuliancao-in-action","_type":"_doc","_id":"av5qkpEBPb4xbihRWBO1","_version":2,"result":"deleted","_shards":{"total":2,"successful":2,"failed":0},"_seq_no":4,"_primary_term":1}

close or open index

Often there is an index we don't want to delete because we may need it later, but it isn't used right now and we don't want it consuming resources. In that case close it first; only some metadata stays in memory.

root@lqx-elk-test-all-in-one:~# curl  -XPOST -H "Content-Type: application/json"  "192.168.8.150:9200/liuliancao-in-action/_close"
{"acknowledged":true}
root@lqx-elk-test-all-in-one:~# curl  -H "Content-Type: application/json"  "192.168.8.150:9200/liuliancao-in-action/_search"
{"error":{"root_cause":[{"type":"index_closed_exception","reason":"closed","index_uuid":"MrfdgdmuSjOoMGZwTPEysw","index":"liuliancao-in-action"}],"type":"index_closed_exception","reason":"closed","index_uuid":"MrfdgdmuSjOoMGZwTPEysw","index":"liuliancao-in-action"},"status":400}
root@lqx-elk-test-all-in-one:~# curl  -XPOST -H "Content-Type: application/json"  "192.168.8.150:9200/liuliancao-in-action/_open"
{"acknowledged":true,"shards_acknowledged":true}
root@lqx-elk-test-all-in-one:~# curl  -H "Content-Type: application/json"  "192.168.8.150:9200/liuliancao-in-action/_search"
{"took":2,"timed_out":false,"_shards":{"total":5,"successful":5,"skipped":0,"failed":0},"hits":{"total":3,"max_score":1.0,"hits":[{"_index":"liuliancao-in-action","_type":"_doc","_id":"Y_5pkpEBPb4xbihR2RN8","_score":1.0,"_source":{"name":"asd","description":"a big wolf", "gender":"female"}},{"_index":"liuliancao-in-action","_type":"_doc","_id":"cP7klpEBPb4xbihRsj-4","_score":1.0,"_source":{"name":"ysg","descriptio-storedn":"a big wolf", "gender":"male"}},{"_index":"liuliancao-in-action","_type":"_doc","_id":"4v70kZEBPb4xbihRsA4c","_score":1.0,"_source":{"name":"lnh","description":"a very sly fox","gender":"male"}}]}}

Chapter4 Searching your data

search url

For the search DSL you can use the REST API; the keyword is _search.

You can search the whole cluster or a specific index, and you can use * for glob matching.

root@lqx-elk-test-all-in-one:~# curl -XGET -H "Content-Type: application/json"  "192.168.8.150:9200/_search?q=ysg&pretty"
root@lqx-elk-test-all-in-one:~# curl -XGET -H "Content-Type: application/json"  "192.168.8.150:9200/liuliancao-in-action/_search?q=ysg&pretty"

For a search, pay attention to the query condition (query), the number of hits to return (size), where paging starts (from), whether to return all fields (_source), and the desired sort order (sort).

The common ones follow (taken from the book); the output is long so I won't paste it.

root@lqx-elk-test-all-in-one:~# curl -XGET -H "Content-Type: application/json"  "192.168.8.150:9200/get-together/_search?from=10&size=10" # get 10 hits starting from the 10th
root@lqx-elk-test-all-in-one:~# curl -XGET -H "Content-Type: application/json"  "192.168.8.150:9200/get-together/_search?sort=date:asc" # date ascending
root@lqx-elk-test-all-in-one:~# curl -XGET -H "Content-Type: application/json"  "192.168.8.150:9200/get-together/_search?sort=date:desc&_source=host,title,date&pretty" # date descending, returning only host, title, date

search url with request body

With various parameters in the request body we get much more powerful searches.

The URI parameters above can also go in the body; the query match_all part can even be omitted.

root@lqx-elk-test-all-in-one:~# curl -XGET -H "Content-Type: application/json"  "192.168.8.150:9200/get-together/_search?pretty" -d '{"query":{"match_all":{}}, "from":10,"size":2, "_source":"host,title,date"}'
{
  "took" : 1,
  "timed_out" : false,
  "_shards" : {
    "total" : 2,
    "successful" : 2,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : 20,
    "max_score" : 1.0,
    "hits" : [
      {
        "_index" : "get-together",
        "_type" : "_doc",
        "_id" : "102",
        "_score" : 1.0,
        "_routing" : "1",
        "_source" : { }
      },
      {
        "_index" : "get-together",
        "_type" : "_doc",
        "_id" : "103",
        "_score" : 1.0,
        "_routing" : "2",
        "_source" : { }
      }
    ]
  }
}

When specifying _source we can use include and exclude:

root@lqx-elk-test-all-in-one:~# curl -XGET -H "Content-Type: application/json"  "192.168.8.150:9200/get-together/_search?pretty" -d '{"from":10,"size":1, "_source":{"include":["host","title","date"]}}'
{
  "took" : 1,
  "timed_out" : false,
  "_shards" : {
    "total" : 2,
    "successful" : 2,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : 20,
    "max_score" : 1.0,
    "hits" : [
      {
        "_index" : "get-together",
        "_type" : "_doc",
        "_id" : "102",
        "_score" : 1.0,
        "_routing" : "1",
        "_source" : {
          "date" : "2013-07-11T18:00",
          "host" : "Daniel",
          "title" : "10 Clojure coding techniques you should know, and project openbike"
        }
      }
    ]
  }
}
root@lqx-elk-test-all-in-one:~# curl -XGET -H "Content-Type: application/json"  "192.168.8.150:9200/get-together/_search?pretty" -d '{"from":10,"size":1, "_source":{"exclude":["host","title","date"]}}'
{
  "took" : 2,
  "timed_out" : false,
  "_shards" : {
    "total" : 2,
    "successful" : 2,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : 20,
    "max_score" : 1.0,
    "hits" : [
      {
        "_index" : "get-together",
        "_type" : "_doc",
        "_id" : "102",
        "_score" : 1.0,
        "_routing" : "1",
        "_source" : {
          "relationship_type" : {
            "parent" : "1",
            "name" : "event"
          },
          "reviews" : 3,
          "attendees" : [
            "Lee",
            "Tyler",
            "Daniel",
            "Stuart",
            "Lance"
          ],
          "location_event" : {
            "name" : "Stoneys Full Steam Tavern",
            "geolocation" : "39.752337,-105.00083"
          },
          "description" : "What are ten Clojure coding techniques that you wish everyone knew? We will also check on the status of Project Openbike."
        }
      }
    ]
  }
}

You can even combine them to take set differences.
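
A sketch combining the two, assuming you want all location_event subfields except the geolocation (field names come from the mapping above; not a captured run):

curl -XGET -H "Content-Type: application/json"  "192.168.8.150:9200/get-together/_search?pretty" -d '
{
  "from": 10, "size": 1,
  "_source": {
    "include": ["location_event.*"],
    "exclude": ["location_event.geolocation"]
  }
}'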

First, a simple query:

root@lqx-elk-test-all-in-one:~# curl   "192.168.8.150:9200/get-together-event/_search?pretty" -d '{"query":{"match":{"title":"hadoop"}}}'
{
  "took" : 19,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : 3,
    "max_score" : 1.3338978,
    "hits" : [
      {
        "_index" : "get-together-event",
        "_type" : "_doc",
        "_id" : "1145",
        "_score" : 1.3338978,
        "_source" : {
          "host" : "Yann",
          "title" : "Using Hadoop with Elasticsearch",
          "description" : "We will walk through using Hadoop with Elasticsearch for big data crunching!",
          "attendees" : [
            "Yann",
            "Bill",
            "James"
          ],
          "date" : "2013-09-09T18:30",
          "location_event" : {
            "name" : "SkillsMatter Exchange",
            "geolocation" : "51.524806,-0.099095"
          },
          "reviews" : 2
        }
      },
      {
        "_index" : "get-together-event",
        "_type" : "_doc",
        "_id" : "111",
        "_score" : 0.9227539,
        "_source" : {
          "host" : "Andy",
          "title" : "Moving Hadoop to the mainstream",
          "description" : "Come hear about how Hadoop is moving to the main stream",
          "attendees" : [
            "Andy",
            "Matt",
            "Bill"
          ],
          "date" : "2013-07-21T18:00",
          "location_event" : {
            "name" : "Courtyard Boulder Louisville",
            "geolocation" : "39.959409,-105.163497"
          },
          "reviews" : 4
        }
      },
      {
        "_index" : "get-together-event",
        "_type" : "_doc",
        "_id" : "109",
        "_score" : 0.6333549,
        "_source" : {
          "host" : "Andy",
          "title" : "Hortonworks, the future of Hadoop and big data",
          "description" : "Presentation on the work that hortonworks is doing on Hadoop",
          "attendees" : [
            "Andy",
            "Simon",
            "David",
            "Sam"
          ],
          "date" : "2013-06-19T18:00",
          "location_event" : {
            "name" : "SendGrid Denver office",
            "geolocation" : "39.748477,-104.998852"
          },
          "reviews" : 2
        }
      }
    ]
  }
}

The book stresses that filter differs from query: a filter does not compute a score, so it is a bit faster, and since filtering compares bits (set to 1 on a match), filter results are also cached. The book's query syntax is obsolete by now; see https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-bool-query.html

There are a few keywords here: must is the equivalent of the old query and computes a score; filter computes no score; must_not excludes matches (score 0); should optionally matches, raising the score when it does, while non-matching docs are still returned.

root@lqx-elk-test-all-in-one:~# curl   "192.168.8.150:9200/get-together-event/_search?pretty" -d '{
  "query": {
    "bool": {
      "must": {
         "term": {
            "title": "hadoop"
          }
        },
      "filter": {
          "term": {"host": "andy"}
      }
      }
    }
  }'
{
  "took" : 2,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : 2,
    "max_score" : 0.9227539,
    "hits" : [
      {
        "_index" : "get-together-event",
        "_type" : "_doc",
        "_id" : "111",
        "_score" : 0.9227539,
        "_source" : {
          "host" : "Andy",
          "title" : "Moving Hadoop to the mainstream",
          "description" : "Come hear about how Hadoop is moving to the main stream",
          "attendees" : [
            "Andy",
            "Matt",
            "Bill"
          ],
          "date" : "2013-07-21T18:00",
          "location_event" : {
            "name" : "Courtyard Boulder Louisville",
            "geolocation" : "39.959409,-105.163497"
          },
          "reviews" : 4
        }
      },
      {
        "_index" : "get-together-event",
        "_type" : "_doc",
        "_id" : "109",
        "_score" : 0.6333549,
        "_source" : {
          "host" : "Andy",
          "title" : "Hortonworks, the future of Hadoop and big data",
          "description" : "Presentation on the work that hortonworks is doing on Hadoop",
          "attendees" : [
            "Andy",
            "Simon",
            "David",
            "Sam"
          ],
          "date" : "2013-06-19T18:00",
          "location_event" : {
            "name" : "SendGrid Denver office",
            "geolocation" : "39.748477,-104.998852"
          },
          "reviews" : 2
        }
      }
    ]
  }
}
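
A sketch of must_not and should against the same sample data (the field values are taken from the hits above; not a captured run):

curl -H "Content-Type: application/json"  "192.168.8.150:9200/get-together-event/_search?pretty" -d '
{
  "query": {
    "bool": {
      "must":     { "match": { "title": "hadoop" } },
      "must_not": { "term":  { "host": "yann" } },
      "should":   { "match": { "description": "mainstream" } }
    }
  }
}'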

search queries

match_all

match_all returns all documents and is the simplest query.

root@lqx-elk-test-all-in-one:~# curl   "192.168.8.150:9200/liuliancao-in-action/_search?pretty" -d '{"query":{"match_all":{}}}'
{
  "took" : 18,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : 4,
    "max_score" : 1.0,
    "hits" : [
      {
        "_index" : "liuliancao-in-action",
        "_type" : "_doc",
        "_id" : "Y_5pkpEBPb4xbihR2RN8",
        "_score" : 1.0,
        "_source" : {
          "name" : "asd",
          "description" : "a big wolf",
          "gender" : "female"
        }
      },
      {
        "_index" : "liuliancao-in-action",
        "_type" : "_doc",
        "_id" : "cP7klpEBPb4xbihRsj-4",
        "_score" : 1.0,
        "_source" : {
          "name" : "ysg",
          "descriptio-storedn" : "a big wolf",
          "gender" : "male"
        }
      },
      {
        "_index" : "liuliancao-in-action",
        "_type" : "_doc",
        "_id" : "4v70kZEBPb4xbihRsA4c",
        "_score" : 1.0,
        "_source" : {
          "name" : "lnh",
          "description" : "a very sly fox",
          "gender" : "male"
        }
      },
      {
        "_index" : "liuliancao-in-action",
        "_type" : "_doc",
        "_id" : "__51nZEBPb4xbihR7H9p",
        "_score" : 1.0,
        "_source" : {
          "name" : "ysg",
          "descriptio-storedn" : "a big wolf",
          "gender" : "male"
        }
      }
    ]
  }
}
filter query

Next, use the filter query described above to fetch results whose name must be ysg.

root@lqx-elk-test-all-in-one:~# curl   "192.168.8.150:9200/liuliancao-in-action/_search?pretty" -d '{"query":{"bool":{"must":{"term":{"name":"ysg"}}}}}'
{
  "took" : 6,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : 2,
    "max_score" : 0.2876821,
    "hits" : [
      {
        "_index" : "liuliancao-in-action",
        "_type" : "_doc",
        "_id" : "cP7klpEBPb4xbihRsj-4",
        "_score" : 0.2876821,
        "_source" : {
          "name" : "ysg",
          "descriptio-storedn" : "a big wolf",
          "gender" : "male"
        }
      },
      {
        "_index" : "liuliancao-in-action",
        "_type" : "_doc",
        "_id" : "__51nZEBPb4xbihR7H9p",
        "_score" : 0.2876821,
        "_source" : {
          "name" : "ysg",
          "descriptio-storedn" : "a big wolf",
          "gender" : "male"
        }
      }
    ]
  }
}

root@lqx-elk-test-all-in-one:~# curl   "192.168.8.150:9200/liuliancao-in-action/_search?pretty" -d '{"query":{"bool":{"filter":{"term":{"name":"ysg"}}}}}'
{
  "took" : 1,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : 2,
    "max_score" : 0.0,
    "hits" : [
      {
        "_index" : "liuliancao-in-action",
        "_type" : "_doc",
        "_id" : "cP7klpEBPb4xbihRsj-4",
        "_score" : 0.0,
        "_source" : {
          "name" : "ysg",
          "descriptio-storedn" : "a big wolf",
          "gender" : "male"
        }
      },
      {
        "_index" : "liuliancao-in-action",
        "_type" : "_doc",
        "_id" : "__51nZEBPb4xbihR7H9p",
        "_score" : 0.0,
        "_source" : {
          "name" : "ysg",
          "descriptio-storedn" : "a big wolf",
          "gender" : "male"
        }
      }
    ]
  }
}
query_string

You can also query URI-style with _search?q=:

root@lqx-elk-test-all-in-one:~# curl   "192.168.8.150:9200/liuliancao-in-action/_search?q=ysg&pretty"
{
  "took" : 54,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : 2,
    "max_score" : 0.2876821,
    "hits" : [
      {
        "_index" : "liuliancao-in-action",
        "_type" : "_doc",
        "_id" : "cP7klpEBPb4xbihRsj-4",
        "_score" : 0.2876821,
        "_source" : {
          "name" : "ysg",
          "descriptio-storedn" : "a big wolf",
          "gender" : "male"
        }
      },
      {
        "_index" : "liuliancao-in-action",
        "_type" : "_doc",
        "_id" : "__51nZEBPb4xbihR7H9p",
        "_score" : 0.2876821,
        "_source" : {
          "name" : "ysg",
          "descriptio-storedn" : "a big wolf",
          "gender" : "male"
        }
      }
    ]
  }
}

root@lqx-elk-test-all-in-one:~# curl   "192.168.8.150:9200/liuliancao-in-action/_search?pretty" -d '{"query":{"query_string":{"query":"ysg"}}}'
{
  "took" : 4,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : 2,
    "max_score" : 0.2876821,
    "hits" : [
      {
        "_index" : "liuliancao-in-action",
        "_type" : "_doc",
        "_id" : "cP7klpEBPb4xbihRsj-4",
        "_score" : 0.2876821,
        "_source" : {
          "name" : "ysg",
          "descriptio-storedn" : "a big wolf",
          "gender" : "male"
        }
      },
      {
        "_index" : "liuliancao-in-action",
        "_type" : "_doc",
        "_id" : "__51nZEBPb4xbihR7H9p",
        "_score" : 0.2876821,
        "_source" : {
          "name" : "ysg",
          "descriptio-storedn" : "a big wolf",
          "gender" : "male"
        }
      }
    ]
  }
}

match_phrase

match is an analyzed (tokenized) query. match_phrase matches the full phrase. match_phrase_prefix matches the full phrase but lets the last term match as a prefix: "i like j" matches "i like john". multi_match runs one query against several fields at once. In the official example below, the query "this is a test" is run against both the subject and message fields.

GET /_search
{
  "query": {
    "multi_match" : {
      "query":    "this is a test", 
      "fields": [ "subject", "message" ] 
    }
  }
}
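
For comparison, a match_phrase sketch against the sample data, requiring the whole phrase to appear in description (not a captured run):

curl -H "Content-Type: application/json"  "192.168.8.150:9200/get-together-event/_search?pretty" -d '
{
  "query": {
    "match_phrase": { "description": "big data" }
  }
}'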

range query

root@lqx-elk-test-all-in-one:~# curl   "192.168.8.150:9200/get-together-event/_search?pretty" -d '{
  "query": {
     "range": {
        "date": {
            "gte": "2013-06-19T18:00",
            "lt": "2013-06-20T18:00"
         }
      }
    }
 }'
{
  "took" : 3,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : 1,
    "max_score" : 1.0,
    "hits" : [
      {
        "_index" : "get-together-event",
        "_type" : "_doc",
        "_id" : "109",
        "_score" : 1.0,
        "_source" : {
          "host" : "Andy",
          "title" : "Hortonworks, the future of Hadoop and big data",
          "description" : "Presentation on the work that hortonworks is doing on Hadoop",
          "attendees" : [
            "Andy",
            "Simon",
            "David",
            "Sam"
          ],
          "date" : "2013-06-19T18:00",
          "location_event" : {
            "name" : "SendGrid Denver office",
            "geolocation" : "39.748477,-104.998852"
          },
          "reviews" : 2
        }
      }
    ]
  }
}

summary

Finally, the chapter's summary.

Use case → query type to use:
You want to take input from a user, similar to a Google-style interface, and search for documents with the input. Use a match query or the simple_query_string query if you want to support +/- and search in specific fields.
You want to take input as a phrase and search for documents containing that phrase, perhaps with some amount of leniency (slop). Use a match_phrase query with an amount of slop to find phrases similar to what the user is searching for.
You want to search for a single word in a not_analyzed field, knowing exactly how the word should appear. Use a term query because query terms aren’t analyzed.
You want to combine many different searches or types of searches, creating a single search out of them. Use the bool query to combine any number of subqueries into a single query.
You want to search for certain words across many fields in a document. Use the multi_match query, which behaves similarly to the match query but on multiple fields.
You want to return every document from a search. Use the match_all query to return all documents from a search.
You want to search a field for values that are between two specified values. Use a range query to search within documents with values between a certain range.
You want to search a field for values that start with a specified string. Use a prefix query to search for terms starting with a given string.
You want to autocomplete the value of a single word based on what the user has already typed in. Use a prefix query to send what the user has typed in and get back exact matches starting with the text.
You want to search for all documents that have no value for a specified field. Use the missing filter to filter out documents that are missing fields.
  1. If you just want search-engine-style results, use a match query; you can also use _search?q= or query:{"query_string":{"query":""}}.
  2. If you want to match specific text or a whole phrase, use match_phrase.
  3. If you want exact matches with no analysis, on a not_analyzed field, use term.
  4. If you want to combine multiple queries, use bool.
  5. If you want to run the same query against multiple fields, use multi_match.
  6. If you want all results, use match_all.
  7. If you want data within a specific date or numeric interval, use range.
  8. If you want values starting with a given string, use a prefix query.
  9. If you want documents with no value for a field, use the missing filter.

Chapter5 Analyzing your data

When a document is indexed it passes through the character filter (preprocess the raw characters), the tokenizer (split the text into terms), token filters (further process the terms, e.g. lowercasing), and finally indexing (hand off to Lucene for storage).

settings or config file

I'll just paste this here; for details see https://www.elastic.co/guide/en/elasticsearch/reference/6.8/analyzer-anatomy.html

index:
  analysis:
    analyzer:
      myCustomAnalyzer:
        type: custom
        tokenizer: myCustomTokenizer
        filter: [myCustomFilter1, myCustomFilter2]
        char_filter: myCustomCharFilter
    tokenizer:
      myCustomTokenizer:
        type: letter
    filter:
      myCustomFilter1:
        type: lowercase
      myCustomFilter2:
        type: kstem
    char_filter:
      myCustomCharFilter:
        type: mapping
        mappings: ["ph=>f", "u =>you"]

analyze

To test how a given sentence is analyzed, use _analyze. The newer _analyze API has changed; see https://www.elastic.co/guide/en/elasticsearch/reference/6.8/_testing_analyzers.html

root@lqx-elk-test-all-in-one:~# curl  -XPOST  "192.168.8.150:9200/_analyze?pretty" -d '{"analyzer":"standard","text":"share your experience with NoSql & big data technologies"}'
{
  "tokens" : [
    {
      "token" : "share",
      "start_offset" : 0,
      "end_offset" : 5,
      "type" : "<ALPHANUM>",
      "position" : 0
    },
    {
      "token" : "your",
      "start_offset" : 6,
      "end_offset" : 10,
      "type" : "<ALPHANUM>",
      "position" : 1
    },
    {
      "token" : "experience",
      "start_offset" : 11,
      "end_offset" : 21,
      "type" : "<ALPHANUM>",
      "position" : 2
    },
    {
      "token" : "with",
      "start_offset" : 22,
      "end_offset" : 26,
      "type" : "<ALPHANUM>",
      "position" : 3
    },
    {
      "token" : "nosql",
      "start_offset" : 27,
      "end_offset" : 32,
      "type" : "<ALPHANUM>",
      "position" : 4
    },
    {
      "token" : "big",
      "start_offset" : 35,
      "end_offset" : 38,
      "type" : "<ALPHANUM>",
      "position" : 5
    },
    {
      "token" : "data",
      "start_offset" : 39,
      "end_offset" : 43,
      "type" : "<ALPHANUM>",
      "position" : 6
    },
    {
      "token" : "technologies",
      "start_offset" : 44,
      "end_offset" : 56,
      "type" : "<ALPHANUM>",
      "position" : 7
    }
  ]
}
root@lqx-elk-test-all-in-one:~# curl  -XPOST  "192.168.8.150:9200/_analyze?pretty" -d '{"tokenizer":"whitespace","filter":["lowercase","reverse"],"text":"share your experience with NoSql & big data technologies"}'
{
  "tokens" : [
    {
      "token" : "erahs",
      "start_offset" : 0,
      "end_offset" : 5,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "ruoy",
      "start_offset" : 6,
      "end_offset" : 10,
      "type" : "word",
      "position" : 1
    },
    {
      "token" : "ecneirepxe",
      "start_offset" : 11,
      "end_offset" : 21,
      "type" : "word",
      "position" : 2
    },
    {
      "token" : "htiw",
      "start_offset" : 22,
      "end_offset" : 26,
      "type" : "word",
      "position" : 3
    },
    {
      "token" : "lqson",
      "start_offset" : 27,
      "end_offset" : 32,
      "type" : "word",
      "position" : 4
    },
    {
      "token" : "&",
      "start_offset" : 33,
      "end_offset" : 34,
      "type" : "word",
      "position" : 5
    },
    {
      "token" : "gib",
      "start_offset" : 35,
      "end_offset" : 38,
      "type" : "word",
      "position" : 6
    },
    {
      "token" : "atad",
      "start_offset" : 39,
      "end_offset" : 43,
      "type" : "word",
      "position" : 7
    },
    {
      "token" : "seigolonhcet",
      "start_offset" : 44,
      "end_offset" : 56,
      "type" : "word",
      "position" : 8
    }
  ]
}

termvector

Normally you use _analyze to test analysis; sometimes you want to see how an index actually tokenized things, and for that there is _termvectors. If interested, see https://www.cnblogs.com/huangying2124/p/12854592.html

To summarize, term vectors can be used in two ways. One is to configure them on a field in the index settings, adding something like "term_vector": "with_positions_offsets" to record positions and offsets; this is called index-time (see the sketch below).
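
A minimal sketch of the index-time form, assuming ES 6.x mapping syntax (the index name termvector-demo is made up; newer versions also require the JSON Content-Type header):

curl -XPUT 'localhost:9200/termvector-demo' -H 'Content-Type: application/json' -d '{
  "mappings": {
    "_doc": {
      "properties": {
        "description": {
          "type": "text",
          "term_vector": "with_positions_offsets"
        }
      }
    }
  }
}'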

The other is much like _analyze above, analyzing on the fly; this is called query-time.

root@lqx-elk-test-all-in-one:~# curl  '192.168.8.150:9200/liuliancao-in-action/_doc/_termvectors?pretty' -d '{"doc":{"name":"lb","description":"an old cat called lb","gener":"male"}}'
{
  "_index" : "liuliancao-in-action",
  "_type" : "_doc",
  "_version" : 0,
  "found" : true,
  "took" : 1,
  "term_vectors" : {
    "description" : {
      "field_statistics" : {
        "sum_doc_freq" : 0,
        "doc_count" : 0,
        "sum_ttf" : 0
      },
      "terms" : {
        "an" : {
          "term_freq" : 1,
          "tokens" : [
            {
              "position" : 0,
              "start_offset" : 0,
              "end_offset" : 2
            }
          ]
        },
        "called" : {
          "term_freq" : 1,
          "tokens" : [
            {
              "position" : 3,
              "start_offset" : 11,
              "end_offset" : 17
            }
          ]
        },
        "cat" : {
          "term_freq" : 1,
          "tokens" : [
            {
              "position" : 2,
              "start_offset" : 7,
              "end_offset" : 10
            }
          ]
        },
        "lb" : {
          "term_freq" : 1,
          "tokens" : [
            {
              "position" : 4,
              "start_offset" : 18,
              "end_offset" : 20
            }
          ]
        },
        "old" : {
          "term_freq" : 1,
          "tokens" : [
            {
              "position" : 1,
              "start_offset" : 3,
              "end_offset" : 6
            }
          ]
        }
      }
    }
  }
}

To summarize, _termvectors reports per-term statistics: how often a term occurs across the whole index (ttf), its position, how many documents contain it (doc_freq), and term_freq, its frequency within the current document.

To gather statistics for several documents at once, use _mtermvectors.

root@lqx-elk-test-all-in-one:~# curl  '192.168.8.150:9200/_mtermvectors?pretty' -d '{
  "docs": [
    {
      "_index": "liuliancao-in-action",
      "_type": "_doc",
      "term_statistics": true,
      "doc": {
          "name": "lb1",
          "description": "langbo 1",
          "gender": "male"
      }
    },
    {
      "_index": "liuliancao-in-action",
      "_type": "_doc",
      "term_statistics": true,
      "doc": {
          "name": "lb2",
          "description": "langbo 2",
          "gender": "female"
      }
    }
  ]
}'

{
  "docs" : [
    {
      "_index" : "liuliancao-in-action",
      "_type" : "_doc",
      "_version" : 0,
      "found" : true,
      "took" : 0,
      "term_vectors" : {
        "gender" : {
          "field_statistics" : {
            "sum_doc_freq" : 1,
            "doc_count" : 1,
            "sum_ttf" : 1
          },
          "terms" : {
            "male" : {
              "doc_freq" : 1,
              "ttf" : 1,
              "term_freq" : 1,
              "tokens" : [
                {
                  "position" : 0,
                  "start_offset" : 0,
                  "end_offset" : 4
                }
              ]
            }
          }
        },
        "description" : {
          "field_statistics" : {
            "sum_doc_freq" : 0,
            "doc_count" : 0,
            "sum_ttf" : 0
          },
          "terms" : {
            "1" : {
              "term_freq" : 1,
              "tokens" : [
                {
                  "position" : 1,
                  "start_offset" : 7,
                  "end_offset" : 8
                }
              ]
            },
            "langbo" : {
              "term_freq" : 1,
              "tokens" : [
                {
                  "position" : 0,
                  "start_offset" : 0,
                  "end_offset" : 6
                }
              ]
            }
          }
        }
      }
    },
    {
      "_index" : "liuliancao-in-action",
      "_type" : "_doc",
      "_version" : 0,
      "found" : true,
      "took" : 0,
      "term_vectors" : {
        "description" : {
          "field_statistics" : {
            "sum_doc_freq" : 0,
            "doc_count" : 0,
            "sum_ttf" : 0
          },
          "terms" : {
            "2" : {
              "term_freq" : 1,
              "tokens" : [
                {
                  "position" : 1,
                  "start_offset" : 7,
                  "end_offset" : 8
                }
              ]
            },
            "langbo" : {
              "term_freq" : 1,
              "tokens" : [
                {
                  "position" : 0,
                  "start_offset" : 0,
                  "end_offset" : 6
                }
              ]
            }
          }
        },
        "gender" : {
          "field_statistics" : {
            "sum_doc_freq" : 1,
            "doc_count" : 1,
            "sum_ttf" : 1
          },
          "terms" : {
            "female" : {
              "term_freq" : 1,
              "tokens" : [
                {
                  "position" : 0,
                  "start_offset" : 0,
                  "end_offset" : 6
                }
              ]
            }
          }
        }
      }
    }
  ]
}

built-in analyzers

standard

The standard analyzer combines the standard tokenizer, the standard token filter, the lowercase token filter, and the stop token filter.

simple

The simple analyzer splits text on anything that is not a letter and lowercases the tokens. It doesn't work well for Asian languages that don't separate words with whitespace, though, so use it only for European languages.

whitespace

The whitespace analyzer does nothing but split text into tokens around whitespace—very simple!

stop

The stop analyzer behaves like the simple analyzer but additionally filters out stopwords from the token stream.

keyword

The keyword analyzer takes the entire field and generates a single token on it. Keep in mind that rather than using the keyword tokenizer in your mappings, it’s better to set the index setting to not_analyzed.
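
The not_analyzed setting is pre-5.x syntax; in the 6.x mappings used elsewhere in these notes, the equivalent is the keyword field type. A minimal sketch (hypothetical index name; newer versions require the JSON Content-Type header):

curl -XPUT 'localhost:9200/keyword-demo' -H 'Content-Type: application/json' -d '{
  "mappings": {
    "_doc": {
      "properties": {
        "tag": { "type": "keyword" }
      }
    }
  }
}'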

pattern

The pattern analyzer allows you to specify a pattern for tokens to be broken apart. But because the pattern would have to be specified regardless, it often makes more sense to use a custom analyzer and combine the existing pattern tokenizer with any needed token filters.

language and multilingual

Elasticsearch supports a wide variety of language-specific analyzers out of the box. There are analyzers for arabic, armenian, basque, brazilian, bulgarian, catalan, chinese, cjk, czech, danish, dutch, english, finnish, french, galician, german, greek, irish, hindi, hungarian, indonesian, italian, norwegian, persian, portuguese, romanian, russian, sorani, spanish, swedish, turkish, and thai. You can specify the language-specific analyzer by using one of those names, but make sure you use the lowercase name! If you want to analyze a language not included in this list, there may be a plugin for it as well.

Snowball

The snowball analyzer uses the standard tokenizer and token filter (like the standard analyzer), with the lowercase token filter and the stop filter; it also stems the text using the snowball stemmer. Don’t worry if you aren’t sure what stemming is; we’ll discuss it in more detail near the end of this chapter.
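
A quick way to see the snowball analyzer's stemming via _analyze (assuming it is available as a built-in, as in 6.x; output omitted, and newer versions require the JSON Content-Type header):

curl -XPOST 'localhost:9200/_analyze?pretty' -H 'Content-Type: application/json' -d '{"analyzer":"snowball","text":"share your experiences"}'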

Tokenization

For tokenization I think the book's descriptions are the most apt, so I will quote the book's text below.

Standard tokenizer

The standard tokenizer is a grammar-based tokenizer that’s good for most European languages; it also handles segmenting Unicode text but with a default max token length of 255. It also removes punctuation like commas and periods:

root@lqx-elk-test-all-in-one:~# curl -XPOST 'localhost:9200/_analyze?pretty' -d '{"tokenizer":"standard", "text":"I have,potatoes."}'
{
  "tokens" : [
    {
      "token" : "I",
      "start_offset" : 0,
      "end_offset" : 1,
      "type" : "<ALPHANUM>",
      "position" : 0
    },
    {
      "token" : "have",
      "start_offset" : 2,
      "end_offset" : 6,
      "type" : "<ALPHANUM>",
      "position" : 1
    },
    {
      "token" : "potatoes",
      "start_offset" : 7,
      "end_offset" : 15,
      "type" : "<ALPHANUM>",
      "position" : 2
    }
  ]
}
keyword
root@lqx-elk-test-all-in-one:~# curl -XPOST 'localhost:9200/_analyze?pretty' -d '{"tokenizer":"keyword", "text":"I have,potatoes."}'
{
  "tokens" : [
    {
      "token" : "I have,potatoes.",
      "start_offset" : 0,
      "end_offset" : 16,
      "type" : "word",
      "position" : 0
    }
  ]
}
letter

The letter tokenizer takes the text and divides it into tokens at things that are not letters. For example, with the sentence “Hi, there.” the tokens would be Hi and there because the comma, space, and period are all nonletters:

root@lqx-elk-test-all-in-one:~# curl -XPOST 'localhost:9200/_analyze?pretty' -d '{"tokenizer":"letter", "text":"I have,potatoes."}'
{
  "tokens" : [
    {
      "token" : "I",
      "start_offset" : 0,
      "end_offset" : 1,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "have",
      "start_offset" : 2,
      "end_offset" : 6,
      "type" : "word",
      "position" : 1
    },
    {
      "token" : "potatoes",
      "start_offset" : 7,
      "end_offset" : 15,
      "type" : "word",
      "position" : 2
    }
  ]
}
lowercase
root@lqx-elk-test-all-in-one:~# curl -XPOST 'localhost:9200/_analyze?pretty' -d '{"tokenizer":"lowercase", "text":"I have,potatoes."}'
{
  "tokens" : [
    {
      "token" : "i",
      "start_offset" : 0,
      "end_offset" : 1,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "have",
      "start_offset" : 2,
      "end_offset" : 6,
      "type" : "word",
      "position" : 1
    },
    {
      "token" : "potatoes",
      "start_offset" : 7,
      "end_offset" : 15,
      "type" : "word",
      "position" : 2
    }
  ]
}
whitespace
root@lqx-elk-test-all-in-one:~# curl -XPOST 'localhost:9200/_analyze?pretty' -d '{"tokenizer":"whitespace", "text":"I have,potatoes."}'
{
  "tokens" : [
    {
      "token" : "I",
      "start_offset" : 0,
      "end_offset" : 1,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "have,potatoes.",
      "start_offset" : 2,
      "end_offset" : 16,
      "type" : "word",
      "position" : 1
    }
  ]
}
pattern

First create an index and configure its analyzer

root@lqx-elk-test-all-in-one:~# curl -XPUT 'localhost:9200/liuliancao-in-action-special' -d '{
   "settings": {
       "analysis": {
         "analyzer": {
           "liuliancao-special-analyzer": {
                   "tokenizer": "liuliancao-special-tokenizer"
           }
         },
         "tokenizer": {
           "liuliancao-special-tokenizer": {
             "type": "pattern",
             "pattern": ","
           }
         }
       }
     }
  }'
{"acknowledged":true,"shards_acknowledged":true,"index":"liuliancao-in-action-special"}

Then test the pattern

root@lqx-elk-test-all-in-one:~# curl -XPOST 'localhost:9200/liuliancao-in-action-special/_analyze?pretty' -d '{"tokenizer":"liuliancao-special-tokenizer", "text":"I have,potatoes."}'
{
  "tokens" : [
    {
      "token" : "I have",
      "start_offset" : 0,
      "end_offset" : 6,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "potatoes.",
      "start_offset" : 7,
      "end_offset" : 16,
      "type" : "word",
      "position" : 1
    }
  ]
}
path hierarchy

Generates tokens along the path hierarchy

root@lqx-elk-test-all-in-one:~# curl -XPOST 'localhost:9200/liuliancao-in-action-special/_analyze?pretty' -d '{"tokenizer":"path_hierarchy", "text":"/var/log/nginx.log"}'
{
  "tokens" : [
    {
      "token" : "/var",
      "start_offset" : 0,
      "end_offset" : 4,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "/var/log",
      "start_offset" : 0,
      "end_offset" : 8,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "/var/log/nginx.log",
      "start_offset" : 0,
      "end_offset" : 18,
      "type" : "word",
      "position" : 0
    }
  ]
}

Token filters

Standard

Does nothing (a pass-through filter).

Lowercase
root@lqx-elk-test-all-in-one:~# curl -XPOST 'localhost:9200/liuliancao-in-action-special/_analyze?pretty' -d '{"filter":["lowercase"], "text":"I have,potatoes."}'
{
  "tokens" : [
    {
      "token" : "i have,potatoes.",
      "start_offset" : 0,
      "end_offset" : 16,
      "type" : "word",
      "position" : 0
    }
  ]
}
length filter

Define a length filter

root@lqx-elk-test-all-in-one:~# curl -XPUT 'localhost:9200/liuliancao-in-action-length' -d '{"settings":{"index":{"analysis":{"filter":{"liuliancao-length-filter":{"type":"length", "max":8, "min":2}}}}}}'
{"acknowledged":true,"shards_acknowledged":true,"index":"liuliancao-in-action-length"}

Use the length filter. Note that you must specify a tokenizer here, otherwise the request returns an error

root@lqx-elk-test-all-in-one:~# curl -XPOST 'localhost:9200/liuliancao-in-action-length/_analyze?pretty' -d '{"tokenizer":"standard", "filter":["liuliancao-length-filter"], "text":"I have a book"}'
{
  "tokens" : [
    {
      "token" : "have",
      "start_offset" : 2,
      "end_offset" : 6,
      "type" : "<ALPHANUM>",
      "position" : 1
    },
    {
      "token" : "book",
      "start_offset" : 9,
      "end_offset" : 13,
      "type" : "<ALPHANUM>",
      "position" : 3
    }
  ]
}
Stop

The stop filter drops the configured stopwords from the token stream; the stopword list can also be kept in a file

root@lqx-elk-test-all-in-one:~# curl -XPUT 'localhost:9200/liuliancao-in-action-stop' -d '{"settings":{"index":{"analysis":{"filter":{"liuliancao-stop-filter":{"type":"stop", "stopwords":"a"}}}}}}'
{"acknowledged":true,"shards_acknowledged":true,"index":"liuliancao-in-action-stop"}

Test it: the a is gone

root@lqx-elk-test-all-in-one:~# curl -XPOST 'localhost:9200/liuliancao-in-action-stop/_analyze?pretty' -d '{"tokenizer":"standard", "filter":["liuliancao-stop-filter"], "text":"I have a book"}'
{
  "tokens" : [
    {
      "token" : "I",
      "start_offset" : 0,
      "end_offset" : 1,
      "type" : "<ALPHANUM>",
      "position" : 0
    },
    {
      "token" : "have",
      "start_offset" : 2,
      "end_offset" : 6,
      "type" : "<ALPHANUM>",
      "position" : 1
    },
    {
      "token" : "book",
      "start_offset" : 9,
      "end_offset" : 13,
      "type" : "<ALPHANUM>",
      "position" : 3
    }
  ]
}
Reverse

As the name implies, this reverses every token

root@lqx-elk-test-all-in-one:~# curl -XPOST 'localhost:9200/liuliancao-in-action-stop/_analyze?pretty' -d '{"tokenizer":"standard", "filter":["reverse"], "text":"I have a book"}'
{
  "tokens" : [
    {
      "token" : "I",
      "start_offset" : 0,
      "end_offset" : 1,
      "type" : "<ALPHANUM>",
      "position" : 0
    },
    {
      "token" : "evah",
      "start_offset" : 2,
      "end_offset" : 6,
      "type" : "<ALPHANUM>",
      "position" : 1
    },
    {
      "token" : "a",
      "start_offset" : 7,
      "end_offset" : 8,
      "type" : "<ALPHANUM>",
      "position" : 2
    },
    {
      "token" : "koob",
      "start_offset" : 9,
      "end_offset" : 13,
      "type" : "<ALPHANUM>",
      "position" : 3
    }
  ]
}
Unique

The unique filter removes duplicate tokens from the stream.
root@lqx-elk-test-all-in-one:~# curl -XPOST 'localhost:9200/liuliancao-in-action-stop/_analyze?pretty' -d '{"tokenizer":"standard", "filter":["unique"], "text":"I book a book"}'
{
  "tokens" : [
    {
      "token" : "I",
      "start_offset" : 0,
      "end_offset" : 1,
      "type" : "<ALPHANUM>",
      "position" : 0
    },
    {
      "token" : "book",
      "start_offset" : 2,
      "end_offset" : 6,
      "type" : "<ALPHANUM>",
      "position" : 1
    },
    {
      "token" : "a",
      "start_offset" : 7,
      "end_offset" : 8,
      "type" : "<ALPHANUM>",
      "position" : 2
    }
  ]
}
Ascii folding

The ascii folding token filter converts Unicode characters that aren’t part of the regular ASCII character set into the ASCII equivalent, if one exists for the character. For example, you can convert the Unicode “ü” into an ASCII “u” as shown here:

root@lqx-elk-test-all-in-one:~# curl -XPOST 'localhost:9200/liuliancao-in-action-stop/_analyze?pretty' -d '{"tokenizer":"standard", "filter":["asciifolding"],"text":"hello ünicode"}'
{
  "tokens" : [
    {
      "token" : "hello",
      "start_offset" : 0,
      "end_offset" : 5,
      "type" : "<ALPHANUM>",
      "position" : 0
    },
    {
      "token" : "unicode",
      "start_offset" : 6,
      "end_offset" : 13,
      "type" : "<ALPHANUM>",
      "position" : 1
    }
  ]
}
synonym

Synonyms

First define the synonyms (I misspelled treasure as tresure below, sorry)

root@lqx-elk-test-all-in-one:~# curl -XPUT 'localhost:9200/liuliancao-in-action-synonym' -d '{"settings":{"index":{"analysis":{"filter":{"liuliancao-synonym-filter":{"type":"synonym", "expand":true, "synonyms":["book=>tresure","pig=>lovely"]}}}}}}'
{"acknowledged":true,"shards_acknowledged":true,"index":"liuliancao-in-action-synonym"}

Test the analysis

root@lqx-elk-test-all-in-one:~# curl -XPOST 'localhost:9200/liuliancao-in-action-synonym/_analyze?pretty' -d '{"tokenizer":"standard", "filter":["liuliancao-synonym-filter"],"text":"i have a pig book."}'
{
  "tokens" : [
    {
      "token" : "i",
      "start_offset" : 0,
      "end_offset" : 1,
      "type" : "<ALPHANUM>",
      "position" : 0
    },
    {
      "token" : "have",
      "start_offset" : 2,
      "end_offset" : 6,
      "type" : "<ALPHANUM>",
      "position" : 1
    },
    {
      "token" : "a",
      "start_offset" : 7,
      "end_offset" : 8,
      "type" : "<ALPHANUM>",
      "position" : 2
    },
    {
      "token" : "lovely",
      "start_offset" : 9,
      "end_offset" : 12,
      "type" : "SYNONYM",
      "position" : 3
    },
    {
      "token" : "tresure",
      "start_offset" : 13,
      "end_offset" : 17,
      "type" : "SYNONYM",
      "position" : 4
    }
  ]
}

Ngrams, edge ngrams, and shingles

These are ways of forming tokens. Using the book's example word, they fall into ngrams, edge ngrams, and shingles; concrete examples make them clear.

Ngrams

Take the word spaghetti. 1-grams split it one character at a time: s, p, a, g, h, e, t, t, i. Bigrams split it into sp, pa, ag, gh, he, et, tt, ti; think of a sliding window two characters wide. Trigrams give spa, pag, agh, ghe, het, ett, tti.

Ngrams are good for fuzzy matching and correction: even when the input contains a typo, many of its ngrams still match, so the comparison score stays relatively high. A sketch follows.
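
A minimal sketch mirroring the edge-ngram example further below (hypothetical index and filter names; the 6.x filter type is nGram, later renamed ngram, and newer versions require the JSON Content-Type header):

curl -XPUT 'localhost:9200/liuliancao-in-action-ngram' -H 'Content-Type: application/json' -d '{"settings":{"index":{"analysis":{"filter":{"liuliancao-ngram-filter":{"type":"nGram","min_gram":2,"max_gram":3}}}}}}'
curl -XPOST 'localhost:9200/liuliancao-in-action-ngram/_analyze?pretty' -H 'Content-Type: application/json' -d '{"tokenizer":"keyword","filter":["liuliancao-ngram-filter"],"text":"spaghetti"}'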

Edge ngrams

If you build grams only from the start of the word, you get edge ngrams. For the same word spaghetti, min_gram 2 and max_gram 6 produce sp, spa, spag, spagh, spaghe

First create an edgeNGram filter

root@lqx-elk-test-all-in-one:~# curl -XPUT 'localhost:9200/liuliancao-in-action-edge-ngram' -d '{"settings":{"index":{"analysis":{"filter":{"liuliancao-edgengram-filter":{"type":"edgeNGram", "min_gram":2, "max_gram":6}}}}}}'
{"acknowledged":true,"shards_acknowledged":true,"index":"liuliancao-in-action-edge-ngram"}

Test it

root@lqx-elk-test-all-in-one:~# curl -XPOST 'localhost:9200/liuliancao-in-action-edge-ngram/_analyze?pretty' -d '{"tokenizer":"standard", "filter":["liuliancao-edgengram-filter"],"text":"spaghetti"}'
{
  "tokens" : [
    {
      "token" : "sp",
      "start_offset" : 0,
      "end_offset" : 9,
      "type" : "<ALPHANUM>",
      "position" : 0
    },
    {
      "token" : "spa",
      "start_offset" : 0,
      "end_offset" : 9,
      "type" : "<ALPHANUM>",
      "position" : 0
    },
    {
      "token" : "spag",
      "start_offset" : 0,
      "end_offset" : 9,
      "type" : "<ALPHANUM>",
      "position" : 0
    },
    {
      "token" : "spagh",
      "start_offset" : 0,
      "end_offset" : 9,
      "type" : "<ALPHANUM>",
      "position" : 0
    },
    {
      "token" : "spaghe",
      "start_offset" : 0,
      "end_offset" : 9,
      "type" : "<ALPHANUM>",
      "position" : 0
    }
  ]
}

Of course you can also define a complete analyzer; note the key is filter, not filters. Sandwiching the edgeNGram filter between two reverse filters yields edge ngrams taken from the end of the word (suffixes), as the output below shows.

root@lqx-elk-test-all-in-one:~# curl -XPUT 'localhost:9200/liuliancao-in-action-edge-ngram-with-analyzer' -d '{
  "settings": {
    "index": {
      "analysis": {
        "analyzer": {
          "egde": {
            "type": "custom",
            "tokenizer": "standard",
            "filter": [
              "reverse",
              "liuliancao-edgengram-filter",
              "reverse"
            ]
          }
        },
        "filter": {
          "liuliancao-edgengram-filter": {
            "type": "edgeNGram",
            "min_gram": 2,
            "max_gram": 6
          }
        }
      }
    }
  }
}'
{"acknowledged":true,"shards_acknowledged":true,"index":"liuliancao-in-action-edge-ngram-with-analyzer"}

Test it

root@lqx-elk-test-all-in-one:~# curl -XPOST 'localhost:9200/liuliancao-in-action-edge-ngram-with-analyzer/_analyze?pretty' -d '{"analyzer":"egde","text":"spaghetti"}'
{
  "tokens" : [
    {
      "token" : "ti",
      "start_offset" : 0,
      "end_offset" : 9,
      "type" : "<ALPHANUM>",
      "position" : 0
    },
    {
      "token" : "tti",
      "start_offset" : 0,
      "end_offset" : 9,
      "type" : "<ALPHANUM>",
      "position" : 0
    },
    {
      "token" : "etti",
      "start_offset" : 0,
      "end_offset" : 9,
      "type" : "<ALPHANUM>",
      "position" : 0
    },
    {
      "token" : "hetti",
      "start_offset" : 0,
      "end_offset" : 9,
      "type" : "<ALPHANUM>",
      "position" : 0
    },
    {
      "token" : "ghetti",
      "start_offset" : 0,
      "end_offset" : 9,
      "type" : "<ALPHANUM>",
      "position" : 0
    }
  ]
}
Shingles

A shingle is essentially an ngram at the token (word) level

Create a shingle filter

root@lqx-elk-test-all-in-one:~# curl -XPUT 'localhost:9200/liuliancao-in-action-with-shingle' -d '{"settings":{"index":{"analysis":{"analyzer":{"liuliancao-shingle":{"type":"custom","tokenizer":"standard","filter":["liuliancao-shingle-filter"]}},"filter":{"liuliancao-shingle-filter":{"type":"shingle", "min_shingle_size":2, "max_shingle_size":3, "output_unigrams": false}}}}}}'
{"acknowledged":true,"shards_acknowledged":true,"index":"liuliancao-in-action-with-shingle"}

Test it

root@lqx-elk-test-all-in-one:~# curl -XPOST 'localhost:9200/liuliancao-in-action-with-shingle/_analyze?pretty' -d '{"analyzer":"liuliancao-shingle","text":"hello i am john"}'
{
  "tokens" : [
    {
      "token" : "hello i",
      "start_offset" : 0,
      "end_offset" : 7,
      "type" : "shingle",
      "position" : 0
    },
    {
      "token" : "hello i am",
      "start_offset" : 0,
      "end_offset" : 10,
      "type" : "shingle",
      "position" : 0,
      "positionLength" : 2
    },
    {
      "token" : "i am",
      "start_offset" : 6,
      "end_offset" : 10,
      "type" : "shingle",
      "position" : 1
    },
    {
      "token" : "i am john",
      "start_offset" : 6,
      "end_offset" : 15,
      "type" : "shingle",
      "position" : 1,
      "positionLength" : 2
    },
    {
      "token" : "am john",
      "start_offset" : 8,
      "end_offset" : 15,
      "type" : "shingle",
      "position" : 2
    }
  ]
}

Stemming

Stemming extracts the stem (root) of a word. For example, under some algorithms the stem of administrations is administr, which then matches administrator, administration, and administrate instead of only the exact word, widening the search rather than requiring full matches

Algorithmic stemming

Common stemming algorithms include snowball, porter_stem, kstem, and others. They differ in how aggressively they reduce words to stems

stemmer       administrations   administrators   Administrate
snowball      administr         administr        Administer
porter_stem   administr         administr        Administer
kstem         administration    administrator    Administrate

You can see kstem is the least aggressive

Testing stemming
root@lqx-elk-test-all-in-one:~# curl -XPOST 'localhost:9200/liuliancao-in-action-with-shingle/_analyze?pretty' -d '{"tokenizer":"standard","filter":["kstem"], "text":"administrators"}'
{
  "tokens" : [
    {
      "token" : "administrator",
      "start_offset" : 0,
      "end_offset" : 14,
      "type" : "<ALPHANUM>",
      "position" : 0
    }
  ]
}
root@lqx-elk-test-all-in-one:~# curl -XPOST 'localhost:9200/liuliancao-in-action-with-shingle/_analyze?pretty' -d '{"tokenizer":"standard","filter":["snowball"], "text":"administrators"}'
{
  "tokens" : [
    {
      "token" : "administr",
      "start_offset" : 0,
      "end_offset" : 14,
      "type" : "<ALPHANUM>",
      "position" : 0
    }
  ]
}
root@lqx-elk-test-all-in-one:~# curl -XPOST 'localhost:9200/liuliancao-in-action-with-shingle/_analyze?pretty' -d '{"tokenizer":"standard","filter":["porter_stem"], "text":"administrators"}'
{
  "tokens" : [
    {
      "token" : "administr",
      "start_offset" : 0,
      "end_offset" : 14,
      "type" : "<ALPHANUM>",
      "position" : 0
    }
  ]
}
Stemming with dictionaries

Sometimes algorithmic stemming produces odd stems. For a specific language you can instead look up stems in a dictionary; this kind of filter is called hunspell

Setting it up is fairly involved, so I won't demonstrate it.

Prevent stemming

Sometimes you don't want certain words stemmed. In that case put a keyword-marker filter in front of the stemming filter in the chain, and those tokens will not be stemmed. You can also check whether the stemming filter in question has an override or other related parameter. A sketch follows.
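
A minimal sketch using transient analysis in _analyze (assuming ES 6.x, which accepts inline filter definitions; the protected word is arbitrary, and newer versions require the JSON Content-Type header). administrators should come through unstemmed while administrations is stemmed:

curl -XPOST 'localhost:9200/_analyze?pretty' -H 'Content-Type: application/json' -d '{
  "tokenizer": "standard",
  "filter": [
    { "type": "keyword_marker", "keywords": ["administrators"] },
    "snowball"
  ],
  "text": "administrators administrations"
}'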

Chapter 6 Searching with relevancy

In ES a single query may match many documents, each with some relevance to the query. ES expresses the strength of this relevance as a score; the default method for computing it is TF-IDF, where TF stands for term frequency and IDF for inverse document frequency.

term frequency

Term frequency: the number of times the keyword occurs in the text. For the keyword Elasticsearch:

  1. "We will discuss Elasticsearch" contains Elasticsearch once
  2. "Tuesday the Elasticsearch team will gather to answer questions about Elasticsearch" contains it twice, so it is more relevant

\mathrm{tf}_{i,j} = \frac{n_{i,j}}{\sum_{k} n_{k,j}}

The formula: the number of occurrences of term i in document j, divided by the total occurrences of all terms in document j

inverse document frequency

Inverse document frequency measures how rare a term is across the whole document set

  1. We use Elasticsearch to power the search for our website
  2. The developers like Elasticsearch so far
  3. The scoring of documents is calculated by the scoring field

Here the appears in all three documents, so occurrences of the should not drive the score up

IDF says that the more widely a term appears, the more its score contribution should be reduced

The formula

\mathrm{idf}_{i} = \lg \frac{|D|}{|\{j : t_{i} \in d_{j}\}|}

Divide the total number of documents by the number of documents containing the term. Since that count sits in the denominator, appearing in more documents lowers the value; finally take a logarithm. Because the denominator could be 0, it is usually written with a +1

Why the logarithm? If a word is extremely frequent, like a or the, TF is large while |D| divided by the containing-document count is close to 1, which runs against our intent, since these are exactly the words we want to screen out; with the log the value drops toward 0, so the score is more reasonable. The other extreme is a very rare word: TF is small but IDF is huge, roughly the total number of documents, for example when the word appears only once in a large corpus; the log shrinks this a great deal as well. It is a common smoothing device in statistics, since logarithms tend to describe this kind of regularity well.

Finally, tfidf = tf * idf
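
A quick worked example with made-up numbers: a term occurring 2 times in a 10-term document, in 1 of 3 documents:

\mathrm{tf} = \frac{2}{10} = 0.2, \quad \mathrm{idf} = \lg\frac{3}{1} \approx 0.477, \quad \mathrm{tfidf} \approx 0.2 \times 0.477 \approx 0.095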

lucene scoring formula

Lucene's formula for computing the score: ../images/elasticsearch/lucene-score-formula.png

If you are interested, google for more on it

TF-IDF holds that a keyword matters more the more often it occurs, while occurring in too many documents lowers its weight; TF-IDF emphasizes the importance of a term.

other scoring methods

Some other similarity algorithms:

  • Okapi BM25
  • Divergence from randomness, or DFR similarity
  • Information based, or IB similarity
  • LM Dirichlet similarity
  • LM Jelinek Mercer similarity

To change the scoring method, one option is to modify the index mapping and set the field's similarity to the one you want, e.g. BM25; as with filters, you can also define a custom BM25 of your own

If you want to change it globally, edit the elasticsearch configuration and add index.similarity.default.type: BM25

About BM25, the following is pasted from the book: BM25 has three main settings—k1, b, and discount_overlaps:

k1 and b are numeric settings used to tweak how the scoring is calculated. k1 controls how important term frequency is to the score (how often the term occurs in the document, or TF from earlier in this chapter). b is a number between 0 and 1 that controls what degrees of impact the length of the document has on the score. k1 is set to 1.2 and b is set to 0.75 by default.
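
A sketch of defining a tuned BM25 similarity and assigning it to a field (hypothetical index name; ES 6.x syntax, and newer versions require the JSON Content-Type header):

curl -XPUT 'localhost:9200/liuliancao-in-action-bm25' -H 'Content-Type: application/json' -d '{
  "settings": {
    "index": {
      "similarity": {
        "my_bm25": { "type": "BM25", "k1": 1.2, "b": 0.75 }
      }
    }
  },
  "mappings": {
    "_doc": {
      "properties": {
        "description": { "type": "text", "similarity": "my_bm25" }
      }
    }
  }
}'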

Boosting

For example, when we want matches on title to carry more weight in a query, we can use boost, a more elegant way to add points for certain matches.

Boosting happens either at index time or at query time

At index time, set something like "boost": 2 on the field definition. This is generally discouraged, since the value cannot be changed later without reindexing

At query time, just add "boost" to your query

root@lqx-elk-test-all-in-one:~# curl   "localhost:9200/liuliancao-in-action/_search?pretty" -d '{
  "query": {
    "bool": {
      "should": [
        {
          "match": {
            "name": {
              "query": "ysg",
              "boost": 2
            }
          }
        },
        {
          "match": {
            "description": "wolf"
          }
        }
      ]
    }
  }
}'
{
  "took" : 3,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : 3,
    "max_score" : 0.5753642,
    "hits" : [
      {
        "_index" : "liuliancao-in-action",
        "_type" : "_doc",
        "_id" : "cP7klpEBPb4xbihRsj-4",
        "_score" : 0.5753642,
        "_source" : {
          "name" : "ysg",
          "descriptio-storedn" : "a big wolf",
          "gender" : "male"
        }
      },
      {
        "_index" : "liuliancao-in-action",
        "_type" : "_doc",
        "_id" : "__51nZEBPb4xbihR7H9p",
        "_score" : 0.5753642,
        "_source" : {
          "name" : "ysg",
          "descriptio-storedn" : "a big wolf",
          "gender" : "male"
        }
      },
      {
        "_index" : "liuliancao-in-action",
        "_type" : "_doc",
        "_id" : "Y_5pkpEBPb4xbihR2RN8",
        "_score" : 0.2876821,
        "_source" : {
          "name" : "asd",
          "description" : "a big wolf",
          "gender" : "female"
        }
      }
    ]
  }
}

Without the boost

root@lqx-elk-test-all-in-one:~# curl   "localhost:9200/liuliancao-in-action/_search?pretty" -d '{
  "query": {
    "bool": {
      "should": [
        {
          "match": {
            "name": {
              "query": "ysg"
            }
          }
        },
        {
          "match": {
            "description": "wolf"
          }
        }
      ]
    }
  }
}'
{
  "took" : 2,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : 3,
    "max_score" : 0.2876821,
    "hits" : [
      {
        "_index" : "liuliancao-in-action",
        "_type" : "_doc",
        "_id" : "Y_5pkpEBPb4xbihR2RN8",
        "_score" : 0.2876821,
        "_source" : {
          "name" : "asd",
          "description" : "a big wolf",
          "gender" : "female"
        }
      },
      {
        "_index" : "liuliancao-in-action",
        "_type" : "_doc",
        "_id" : "cP7klpEBPb4xbihRsj-4",
        "_score" : 0.2876821,
        "_source" : {
          "name" : "ysg",
          "descriptio-storedn" : "a big wolf",
          "gender" : "male"
        }
      },
      {
        "_index" : "liuliancao-in-action",
        "_type" : "_doc",
        "_id" : "__51nZEBPb4xbihR7H9p",
        "_score" : 0.2876821,
        "_source" : {
          "name" : "ysg",
          "descriptio-storedn" : "a big wolf",
          "gender" : "male"
        }
      }
    ]
  }
}

You can see that after boosting, name matters more

For multiple fields you can write ["name^3", "description"]

For example

root@lqx-elk-test-all-in-one:~# curl   "localhost:9200/liuliancao-in-action/_search?pretty" -d '{
  "query": {
    "multi_match": {
          "query": "ysg",
           "fields": ["name", "description"]
    }
  }
}'
{
  "took" : 1,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : 2,
    "max_score" : 0.2876821,
    "hits" : [
      {
        "_index" : "liuliancao-in-action",
        "_type" : "_doc",
        "_id" : "cP7klpEBPb4xbihRsj-4",
        "_score" : 0.2876821,
        "_source" : {
          "name" : "ysg",
          "descriptio-storedn" : "a big wolf",
          "gender" : "male"
        }
      },
      {
        "_index" : "liuliancao-in-action",
        "_type" : "_doc",
        "_id" : "__51nZEBPb4xbihR7H9p",
        "_score" : 0.2876821,
        "_source" : {
          "name" : "ysg",
          "descriptio-storedn" : "a big wolf",
          "gender" : "male"
        }
      }
    ]
  }
}

root@lqx-elk-test-all-in-one:~# curl   "localhost:9200/liuliancao-in-action/_search?pretty" -d '{
  "query": {
    "multi_match": {
          "query": "ysg",
           "fields": ["name^2", "description"]
    }
  }
}'
{
  "took" : 2,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : 2,
    "max_score" : 0.5753642,
    "hits" : [
      {
        "_index" : "liuliancao-in-action",
        "_type" : "_doc",
        "_id" : "cP7klpEBPb4xbihRsj-4",
        "_score" : 0.5753642,
        "_source" : {
          "name" : "ysg",
          "descriptio-storedn" : "a big wolf",
          "gender" : "male"
        }
      },
      {
        "_index" : "liuliancao-in-action",
        "_type" : "_doc",
        "_id" : "__51nZEBPb4xbihR7H9p",
        "_score" : 0.5753642,
        "_source" : {
          "name" : "ysg",
          "descriptio-storedn" : "a big wolf",
          "gender" : "male"
        }
      }
    ]
  }
}

You can also specify boosts inside a query string

root@lqx-elk-test-all-in-one:~# curl   "192.168.8.150:9200/liuliancao-in-action/_search?pretty" -d '{"query":{"query_string":{"query":"ysg^3 AND wolf"}}}'
{
  "took" : 3,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : 2,
    "max_score" : 1.1507283,
    "hits" : [
      {
        "_index" : "liuliancao-in-action",
        "_type" : "_doc",
        "_id" : "cP7klpEBPb4xbihRsj-4",
        "_score" : 1.1507283,
        "_source" : {
          "name" : "ysg",
          "descriptio-storedn" : "a big wolf",
          "gender" : "male"
        }
      },
      {
        "_index" : "liuliancao-in-action",
        "_type" : "_doc",
        "_id" : "__51nZEBPb4xbihR7H9p",
        "_score" : 1.1507283,
        "_source" : {
          "name" : "ysg",
          "descriptio-storedn" : "a big wolf",
          "gender" : "male"
        }
      }
    ]
  }
}

We can run explain on the scores. The method in the book no longer works, but you can do it from the command line with explain=true

root@lqx-elk-test-all-in-one:~# curl   "localhost:9200/liuliancao-in-action/_search?pretty&explain=true" -d '{
  "query": {
    "query_string": {
          "query": "ysg"    }
  }        
}'
{
  "took" : 4,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : 2,
    "max_score" : 0.2876821,
    "hits" : [
      {
        "_shard" : "[liuliancao-in-action][2]",
        "_node" : "nT9mFU16SKSUraEe8v6zjQ",
        "_index" : "liuliancao-in-action",
        "_type" : "_doc",
        "_id" : "cP7klpEBPb4xbihRsj-4",
        "_score" : 0.2876821,
        "_source" : {
          "name" : "ysg",
          "descriptio-storedn" : "a big wolf",
          "gender" : "male"
        },
        "_explanation" : {
          "value" : 0.2876821,
          "description" : "max of:",
          "details" : [
            {
              "value" : 0.2876821,
              "description" : "weight(name:ysg in 0) [PerFieldSimilarity], result of:",
              "details" : [
                {
                  "value" : 0.2876821,
                  "description" : "score(doc=0,freq=1.0 = termFreq=1.0\n), product of:",
                  "details" : [
                    {
                      "value" : 0.2876821,
                      "description" : "idf, computed as log(1 + (docCount - docFreq + 0.5) / (docFreq + 0.5)) from:",
                      "details" : [
                        {
                          "value" : 1.0,
                          "description" : "docFreq",
                          "details" : [ ]
                        },
                        {
                          "value" : 1.0,
                          "description" : "docCount",
                          "details" : [ ]
                        }
                      ]
                    },
                    {
                      "value" : 1.0,
                      "description" : "tfNorm, computed as (freq * (k1 + 1)) / (freq + k1) from:",
                      "details" : [
                        {
                          "value" : 1.0,
                          "description" : "termFreq=1.0",
                          "details" : [ ]
                        },
                        {
                          "value" : 1.2,
                          "description" : "parameter k1",
                          "details" : [ ]
                        },
                        {
                          "value" : 0.0,
                          "description" : "parameter b (norms omitted for field)",
                          "details" : [ ]
                        }
                      ]
                    }
                  ]
                }
              ]
            }
          ]
        }
      },
      {
        "_shard" : "[liuliancao-in-action][4]",
        "_node" : "nT9mFU16SKSUraEe8v6zjQ",
        "_index" : "liuliancao-in-action",
        "_type" : "_doc",
        "_id" : "__51nZEBPb4xbihR7H9p",
        "_score" : 0.2876821,
        "_source" : {
          "name" : "ysg",
          "descriptio-storedn" : "a big wolf",
          "gender" : "male"
        },
        "_explanation" : {
          "value" : 0.2876821,
          "description" : "max of:",
          "details" : [
            {
              "value" : 0.2876821,
              "description" : "weight(name:ysg in 0) [PerFieldSimilarity], result of:",
              "details" : [
                {
                  "value" : 0.2876821,
                  "description" : "score(doc=0,freq=1.0 = termFreq=1.0\n), product of:",
                  "details" : [
                    {
                      "value" : 0.2876821,
                      "description" : "idf, computed as log(1 + (docCount - docFreq + 0.5) / (docFreq + 0.5)) from:",
                      "details" : [
                        {
                          "value" : 1.0,
                          "description" : "docFreq",
                          "details" : [ ]
                        },
                        {
                          "value" : 1.0,
                          "description" : "docCount",
                          "details" : [ ]
                        }
                      ]
                    },
                    {
                      "value" : 1.0,
                      "description" : "tfNorm, computed as (freq * (k1 + 1)) / (freq + k1) from:",
                      "details" : [
                        {
                          "value" : 1.0,
                          "description" : "termFreq=1.0",
                          "details" : [ ]
                        },
                        {
                          "value" : 1.2,
                          "description" : "parameter k1",
                          "details" : [ ]
                        },
                        {
                          "value" : 0.0,
                          "description" : "parameter b (norms omitted for field)",
                          "details" : [ ]
                        }
                      ]
                    }
                  ]
                }
              ]
            }
          ]
        }
      }
    ]
  }
}

Reducing scoring impact with query rescoring

This part is about performance. Computing scores normally has little impact, but as the document count grows, custom script functions, phrase matching, and the like can get expensive; the question is how to reduce that impact.

Elasticsearch lets you compute scores in successive passes; this approach is called rescoring. The book's example: ../images/elasticsearch/rescoring.png

First filter all documents down to the top 20 or so that match the cheap query, then run the rescore query (whose inner query may be expensive) on just those; in plain terms, process the data in several passes rather than reaching for extra machinery. A sketch follows.
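
A minimal sketch against this document's example index (assuming the standard rescore API; ES 6.x syntax, and newer versions require the JSON Content-Type header):

curl 'localhost:9200/liuliancao-in-action/_search?pretty' -H 'Content-Type: application/json' -d '{
  "query": { "match": { "description": "wolf" } },
  "rescore": {
    "window_size": 20,
    "query": {
      "rescore_query": {
        "match_phrase": { "description": "big wolf" }
      },
      "query_weight": 0.7,
      "rescore_query_weight": 1.2
    }
  }
}'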

function_score

For scores, Elasticsearch supports an even higher level of customization: function_score

weight
root@lqx-elk-test-all-in-one:~# curl   "localhost:9200/liuliancao-in-action/_search?pretty" -d '{
  "query": {
    "function_score": {
        "query": { "match": {"description": "wolf" }},
         "functions": [
             {
                "weight": 1.5,
                "filter": {"term":{"description": "big"}}
             }      
      ]
    }
  }
}'
{
  "took" : 49,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : 1,
    "max_score" : 0.43152314,
    "hits" : [
      {
        "_index" : "liuliancao-in-action",
        "_type" : "_doc",
        "_id" : "Y_5pkpEBPb4xbihR2RN8",
        "_score" : 0.43152314,
        "_source" : {
          "name" : "asd",
          "description" : "a big wolf",
          "gender" : "female"
        }
      }
    ]
  }
}

functions takes a list, so several functions can be combined; a sketch follows
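
A minimal sketch combining two weight functions (made-up values; score_mode controls how the function results are combined, and newer versions require the JSON Content-Type header):

curl 'localhost:9200/liuliancao-in-action/_search?pretty' -H 'Content-Type: application/json' -d '{
  "query": {
    "function_score": {
      "query": { "match": { "description": "wolf" } },
      "functions": [
        { "weight": 1.5, "filter": { "term": { "description": "big" } } },
        { "weight": 2,   "filter": { "term": { "gender": "male" } } }
      ],
      "score_mode": "sum"
    }
  }
}'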

field_value_factor
  root@lqx-elk-test-all-in-one:~# curl -XGET 'localhost:9200/get-together-event/_search?pretty' -d '{
  "query": {
    "function_score": {
      "query": {
        "match": {
          "description": "elasticsearch"
        }
      },
      "functions": [
         {
           "field_value_factor": {
                "field": "reviews",
                "factor": 2.5,
                "modifier": "ln"
            }
         }
      ]
    }
  }
}'
{
  "took" : 109,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : 7,
    "max_score" : 1.7507017,
    "hits" : [
      {
        "_index" : "get-together-event",
        "_type" : "_doc",
        "_id" : "103",
        "_score" : 1.7507017,
        "_source" : {
          "host" : "Lee",
          "title" : "Introduction to Elasticsearch",
          "description" : "An introduction to ES and each other. We can meet and greet and I will present on some Elasticsearch basics and how we use it.",
          "attendees" : [
            "Lee",
            "Martin",
            "Greg",
            "Mike"
          ],
          "date" : "2013-04-17T19:00",
          "location_event" : {
            "name" : "Stoneys Full Steam Tavern",
            "geolocation" : "39.752337,-105.00083"
          },
          "reviews" : 5
        }
      },
      {
        "_index" : "get-together-event",
        "_type" : "_doc",
        "_id" : "1125",
        "_score" : 1.6013412,
        "_source" : {
          "host" : "Dave Nolan",
          "title" : "real-time Elasticsearch",
          "description" : "We will discuss using Elasticsearch to index data in real time",
          "attendees" : [
            "Dave",
            "Shay",
            "John",
            "Harry"
          ],
          "date" : "2013-02-18T18:30",
          "location_event" : {
            "name" : "SkillsMatter Exchange",
            "geolocation" : "51.524806,-0.099095"
          },
          "reviews" : 3
        }
      },
      {
        "_index" : "get-together-event",
        "_type" : "_doc",
        "_id" : "108",
        "_score" : 1.5690377,
        "_source" : {
          "host" : "Elyse",
          "title" : "Piggyback on Elasticsearch training in San Francisco",
          "description" : "We can piggyback on training by Elasticsearch to have some Q&A time with the ES devs",
          "attendees" : [
            "Shay",
            "Igor",
            "Uri",
            "Elyse"
          ],
          "date" : "2013-05-23T19:00",
          "location_event" : {
            "name" : "NoSQL Roadshow",
            "geolocation" : "37.787742,-122.398964"
          },
          "reviews" : 5
        }
      },
      {
        "_index" : "get-together-event",
        "_type" : "_doc",
        "_id" : "1145",
        "_score" : 1.1603596,
        "_source" : {
          "host" : "Yann",
          "title" : "Using Hadoop with Elasticsearch",
          "description" : "We will walk through using Hadoop with Elasticsearch for big data crunching!",
          "attendees" : [
            "Yann",
            "Bill",
            "James"
          ],
          "date" : "2013-09-09T18:30",
          "location_event" : {
            "name" : "SkillsMatter Exchange",
            "geolocation" : "51.524806,-0.099095"
          },
          "reviews" : 2
        }
      },
      {
        "_index" : "get-together-event",
        "_type" : "_doc",
        "_id" : "1135",
        "_score" : 1.0098797,
        "_source" : {
          "host" : "Dave",
          "title" : "Elasticsearch at Rangespan and Exonar",
          "description" : "Representatives from Rangespan and Exonar will come and discuss how they use Elasticsearch",
          "attendees" : [
            "Dave",
            "Andrew",
            "David",
            "Clint"
          ],
          "date" : "2013-06-24T18:30",
          "location_event" : {
            "name" : "Alumni Theatre",
            "geolocation" : "51.51558,-0.117699"
          },
          "reviews" : 3
        }
      },
      {
        "_index" : "get-together-event",
        "_type" : "_doc",
        "_id" : "107",
        "_score" : 0.80400664,
        "_source" : {
          "host" : "Mik",
          "title" : "Logging and Elasticsearch",
          "description" : "Get a deep dive for what Elasticsearch is and how it can be used for logging with Logstash as well as Kibana!",
          "attendees" : [
            "Shay",
            "Rashid",
            "Erik",
            "Grant",
            "Mik"
          ],
          "date" : "2013-04-08T18:00",
          "location_event" : {
            "name" : "Salesforce headquarters",
            "geolocation" : "37.793592,-122.397033"
          },
          "reviews" : 3
        }
      },
      {
        "_index" : "get-together-event",
        "_type" : "_doc",
        "_id" : "104",
        "_score" : 0.70427096,
        "_source" : {
          "host" : "Lee",
          "title" : "Queries and Filters",
          "description" : "A get together to talk about different ways to query Elasticsearch, what works best for different kinds of applications.",
          "attendees" : [
            "Lee",
            "Greg",
            "Richard"
          ],
          "date" : "2013-06-17T18:00",
          "location_event" : {
            "name" : "Stoneys Full Steam Tavern",
            "geolocation" : "39.752337,-105.00083"
          },
          "reviews" : 1
        }
      }
    ]
  }
}

The idea is that a field's value influences the score, whereas boost earlier influenced it through matching. Here the function computes ln(2.5 * reviews), and with the default boost_mode (multiply) the final score is the query score times that value.
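
Spelled out (my reading of the field_value_factor behavior, not a formula from the book):

\mathrm{score} = \_score \times \ln(2.5 \times \mathrm{reviews})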

script

For more on scripts, see https://www.elastic.co/guide/en/elasticsearch/reference/6.8/modules-scripting-using.html

Think of it as letting you reference variables or fields and do arithmetic on them. The book's example kept failing for me, so I won't reproduce the output; a corrected form of the query (using the declared param instead of a literal) is below

curl -XGET 'localhost:9200/get-together-event/_search?pretty' -d '{
  "query": {
    "function_score": {
      "query": {
        "match": {
          "description": "elasticsearch"
        }
      },
      "functions": [
        {
          "script_score": {
            "script": {
              "lang": "painless",
              "source": "params.myweight * _score",
              "params": {
                "myweight": 3
              }
            }
          }
        }
      ]
    }
  }
}'
random

You can also use a random score

root@lqx-elk-test-all-in-one:~# curl -XGET 'localhost:9200/get-together-event/_search?pretty' -d '{
  "query": {
    "function_score": {
      "query": {
        "match": {
          "description": "elasticsearch"
        }
      },
      "functions": [
         {
           "random_score": {
               "seed": 1234
            }
         }
      ]
    }
  }
}'
{
  "took" : 19,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : 7,
    "max_score" : 0.31896096,
    "hits" : [
      {
        "_index" : "get-together-event",
        "_type" : "_doc",
        "_id" : "103",
        "_score" : 0.31896096,
        "_source" : {
          "host" : "Lee",
          "title" : "Introduction to Elasticsearch",
          "description" : "An introduction to ES and each other. We can meet and greet and I will present on some Elasticsearch basics and how we use it.",
          "attendees" : [
            "Lee",
            "Martin",
            "Greg",
            "Mike"
          ],
          "date" : "2013-04-17T19:00",
          "location_event" : {
            "name" : "Stoneys Full Steam Tavern",
            "geolocation" : "39.752337,-105.00083"
          },
          "reviews" : 5
        }
      },
      {
        "_index" : "get-together-event",
        "_type" : "_doc",
        "_id" : "107",
        "_score" : 0.3066414,
        "_source" : {
          "host" : "Mik",
          "title" : "Logging and Elasticsearch",
          "description" : "Get a deep dive for what Elasticsearch is and how it can be used for logging with Logstash as well as Kibana!",
          "attendees" : [
            "Shay",
            "Rashid",
            "Erik",
            "Grant",
            "Mik"
          ],
          "date" : "2013-04-08T18:00",
          "location_event" : {
            "name" : "Salesforce headquarters",
            "geolocation" : "37.793592,-122.397033"
          },
          "reviews" : 3
        }
      },
      {
        "_index" : "get-together-event",
        "_type" : "_doc",
        "_id" : "1135",
        "_score" : 0.21054034,
        "_source" : {
          "host" : "Dave",
          "title" : "Elasticsearch at Rangespan and Exonar",
          "description" : "Representatives from Rangespan and Exonar will come and discuss how they use Elasticsearch",
          "attendees" : [
            "Dave",
            "Andrew",
            "David",
            "Clint"
          ],
          "date" : "2013-06-24T18:30",
          "location_event" : {
            "name" : "Alumni Theatre",
            "geolocation" : "51.51558,-0.117699"
          },
          "reviews" : 3
        }
      },
      {
        "_index" : "get-together-event",
        "_type" : "_doc",
        "_id" : "1125",
        "_score" : 0.18382785,
        "_source" : {
          "host" : "Dave Nolan",
          "title" : "real-time Elasticsearch",
          "description" : "We will discuss using Elasticsearch to index data in real time",
          "attendees" : [
            "Dave",
            "Shay",
            "John",
            "Harry"
          ],
          "date" : "2013-02-18T18:30",
          "location_event" : {
            "name" : "SkillsMatter Exchange",
            "geolocation" : "51.524806,-0.099095"
          },
          "reviews" : 3
        }
      },
      {
        "_index" : "get-together-event",
        "_type" : "_doc",
        "_id" : "104",
        "_score" : 0.16013168,
        "_source" : {
          "host" : "Lee",
          "title" : "Queries and Filters",
          "description" : "A get together to talk about different ways to query Elasticsearch, what works best for different kinds of applications.",
          "attendees" : [
            "Lee",
            "Greg",
            "Richard"
          ],
          "date" : "2013-06-17T18:00",
          "location_event" : {
            "name" : "Stoneys Full Steam Tavern",
            "geolocation" : "39.752337,-105.00083"
          },
          "reviews" : 1
        }
      },
      {
        "_index" : "get-together-event",
        "_type" : "_doc",
        "_id" : "1145",
        "_score" : 0.092752025,
        "_source" : {
          "host" : "Yann",
          "title" : "Using Hadoop with Elasticsearch",
          "description" : "We will walk through using Hadoop with Elasticsearch for big data crunching!",
          "attendees" : [
            "Yann",
            "Bill",
            "James"
          ],
          "date" : "2013-09-09T18:30",
          "location_event" : {
            "name" : "SkillsMatter Exchange",
            "geolocation" : "51.524806,-0.099095"
          },
          "reviews" : 2
        }
      },
      {
        "_index" : "get-together-event",
        "_type" : "_doc",
        "_id" : "108",
        "_score" : 0.020456363,
        "_source" : {
          "host" : "Elyse",
          "title" : "Piggyback on Elasticsearch training in San Francisco",
          "description" : "We can piggyback on training by Elasticsearch to have some Q&A time with the ES devs",
          "attendees" : [
            "Shay",
            "Igor",
            "Uri",
            "Elyse"
          ],
          "date" : "2013-05-23T19:00",
          "location_event" : {
            "name" : "NoSQL Roadshow",
            "geolocation" : "37.787742,-122.398964"
          },
          "reviews" : 5
        }
      }
    ]
  }
}
Decay Functions

Decay functions: if you are interested, see https://www.elastic.co/guide/en/elasticsearch/reference/6.8/query-dsl-function-score-query.html. The idea is that the score changes according to one of several decay functions. ../images/elasticsearch/function-score-decay.png

I got an error here. The cause: when I imported the data I didn't follow the book's mapping, so geolocation was mapped automatically; it needs to be of type geo_point (see the sketch after the error below)

The query means: within 100m of this location the score stays unchanged, while at 2km it is multiplied by 0.5

curl -XGET 'localhost:9200/get-together-event/_search?pretty' -d '{
  "query": {
    "function_score": {
      "query": {
        "match": {
          "description": "elasticsearch"
        }
      },
      "functions": [
        {
          "gauss": {
            "location_event.geolocation": {
              "origin": "40.018528, -105.275806",
              "offset": "100m",
              "scale": "2km",
              "decay": 0.5
            }
          }
        }
      ]
    }
  }
}'
  "reason" : "field [location_event.geolocation] is of type [indexed,tokenized], but only numeric types are supported.",