在Elasticsearch调研中,已经学习了es的similarity、plugin和在idea中debug启动。有了以上,就可以自己制作es评分插件。这篇博客说下如何自定义script_score实现自己的评分算法。
老规矩,先上项目地址es-score-plugin ——仅适用于elasticsearch6.6.2版本
目的
es评分bm25或者tf-rdf考虑了分出来的token的频率和该文档在整个索引中的表现(权重),虽然专业但是有点重,并且在某些特别的需求下不太好。
在某些情况下,不如只考虑两个字符串之间的关系。相关的算法也挺多的,horspool是一种,我自己也实现了一种。
es插件项目的代码结构(maven项目)
跟着Elasticsearch自定义插件开发这个成功走通。
maven package后会在target中生成.zip压缩文件,把这个zip解压缩到elasticsearch的plugins文件夹,然后重启es即可(集群每个node都需要重启)
核心代码
package org.elasticsearch.plugin.score;
import org.apache.logging.log4j.LogManager;
import org.apache.logging.log4j.Logger;
import org.apache.lucene.index.LeafReader;
import org.apache.lucene.index.LeafReaderContext;
import org.elasticsearch.common.settings.Settings;
import org.elasticsearch.plugins.Plugin;
import org.elasticsearch.plugins.ScriptPlugin;
import org.elasticsearch.script.ScoreScript;
import org.elasticsearch.script.ScoreScript.LeafFactory;
import org.elasticsearch.script.ScriptContext;
import org.elasticsearch.script.ScriptEngine;
import org.elasticsearch.search.lookup.LeafDocLookup;
import org.elasticsearch.search.lookup.SearchLookup;
import org.elasticsearch.search.lookup.SourceLookup;
import java.io.IOException;
import java.util.Collection;
import java.util.List;
import java.util.Map;
/**
* An example script plugin that adds a {@link ScriptEngine} implementing expert scoring.
*/
public class ExpertScriptPlugin extends Plugin implements ScriptPlugin {
protected static final Logger logger = LogManager.getLogger(ExpertScriptPlugin.class);
@Override
public ScriptEngine getScriptEngine(Settings settings, Collection<ScriptContext<?>> contexts) {
return new MyExpertScriptEngine();
}
/**
* An example {@link ScriptEngine} that uses Lucene segment details to implement pure document frequency scoring.
*/
// tag::expert_engine
private static class MyExpertScriptEngine implements ScriptEngine {
@Override
public String getType() {
return "expert_scripts";
}
@Override
public <T> T compile(String scriptName, String scriptSource,
ScriptContext<T> context, Map<String, String> params) {
if (context.equals(ScoreScript.CONTEXT) == false) {
throw new IllegalArgumentException(getType()
+ " scripts cannot be used for context ["
+ context.name + "]");
}
// we use the script "source" as the script identifier
if ("horspool".equals(scriptSource)) {
ScoreScript.Factory factory = PureDfLeafFactory::new;
return context.factoryClazz.cast(factory);
}
throw new IllegalArgumentException("Unknown script name "
+ scriptSource);
}
@Override
public void close() {
// optionally close resources
}
private static class PureDfLeafFactory implements LeafFactory {
private final Map<String, Object> params;
private final SearchLookup lookup;
private final List<String> fields;
private final String query;
private PureDfLeafFactory(
Map<String, Object> params, SearchLookup lookup) {
if (params.containsKey("field") == false) {
throw new IllegalArgumentException(
"Missing parameter [field]");
}
if (params.containsKey("query") == false) {
throw new IllegalArgumentException(
"Missing parameter [query]");
}
this.params = params;
this.lookup = lookup;
fields = (List<String>) params.get("field");
query = params.get("query").toString();
}
@Override
public boolean needs_score() {
return true; // Return true if the script needs the score
}
@Override
public ScoreScript newInstance(LeafReaderContext context)
throws IOException {
LeafReader reader = context.reader();
SourceLookup source = lookup.getLeafSearchLookup(context).source();
LeafDocLookup doc = lookup.getLeafSearchLookup(context).doc();
return new ScoreScript(params, lookup, context) {
int currentDocid = -1;
@Override
public void setDocument(int docid) {
currentDocid = docid;
}
@Override
public double execute() {
//获取原来的评分
double rawScore = this.get_score();
final double[] maxScore = {0.0};
fields.forEach(fieldWeight -> {
String[] split = fieldWeight.split("\\^");
String field = split[0];
double weight = split.length == 2 ? Double.parseDouble(split[1]) : 1;
source.setSegmentAndDocument(context, currentDocid);
String value = "";
if (source.containsKey(field)) {
value = String.valueOf(source.get(field));
} else {
return;
}
double horspool = Horspool.calHorspoolScoreWrapper(value, query);
horspool = horspool * weight;
if (horspool > maxScore[0]) {
maxScore[0] = horspool;
}
});
return maxScore[0];
}
};
}
}
}
// end::expert_engine
}
horspool算法入参就两个文本,不受其他文档的影响,这是与es内置评分算法的主要区别。es的内置评分算法肯定是考虑最周全的,在这里自定义评分参数也许只能对某些场景有一些优化,但至少提供了一种可能,毕竟es自己也出于某种考虑给我们提供了这种expertScriptEngine。
使用
{
"query": {
"function_score": {
"query": {
"match_all": {}
},
"boost_mode": "replace",
"functions": [
{
"script_score": {
"script": {
"source": "horspool",
"lang": "expert_scripts",
"params": {
"field": ["title^2","body^1"],
"query": "{{query}}"
}
}
}
}
]
}
}
}
参考文档
Advanced scripts using script engines
elasticsearch6.6.2 javaDoc ——专家脚本模式有可能需要查查看
gist
暂时没有用到但是有些作用的代码:
// 打印当前document某field的分词结果(terms)
Terms terms = null;
String temp="";
try {
terms = reader.terms(field);
TermsEnum iterator = terms.iterator();
for (int i = 0; i < terms.size(); i++) {
BytesRef next = iterator.next();
if (next==null){
break;
}
temp+="|"+Term.toString(next);
}
System.out.println(value+"=="+temp);
} catch (IOException e) {
e.printStackTrace();
}
// 获取某字段值(不推荐,推荐使用sourceLookup)
Document document = reader.document(currentDocid);
String doc=new String(document.getBinaryValue("_source").bytes);
JSONObject jsonObject=JSONObject.parseObject(doc);
String value=jsonObject.getString(field);