Example using WikipediaTokenizer in Lucene
Question
I want to use WikipediaTokenizer in a Lucene project - http://lucene.apache.org/java/3_0_2/api/contrib-wikipedia/org/apache/lucene/wikipedia/analysis/WikipediaTokenizer.html - but I have never used Lucene. I just want to convert a Wikipedia string into a list of tokens. However, I see that there are only four methods available in this class: end, incrementToken, reset, and reset(reader). Can someone point me to an example of how to use it?

Thanks.
Answer
In Lucene 3.0, the next() method was removed. You should now use incrementToken() to iterate through the tokens; it returns false when you reach the end of the input stream. To read each token, use the methods of the AttributeSource class: depending on which attributes you want (term, type, payload, etc.), register the class type of the corresponding attribute with your tokenizer using the addAttribute method.
The following partial code sample is from the test class of WikipediaTokenizer, which you can find in the Lucene source distribution.
import java.io.StringReader;

import org.apache.lucene.analysis.tokenattributes.TermAttribute;
import org.apache.lucene.analysis.tokenattributes.TypeAttribute;
import org.apache.lucene.wikipedia.analysis.WikipediaTokenizer;

...
// 'test' holds the wiki markup under test; 'tcm' maps each token's text to
// its expected type. Both are defined earlier in the test setup.
WikipediaTokenizer tf = new WikipediaTokenizer(new StringReader(test));
int count = 0;
int numItalics = 0;
int numBoldItalics = 0;
int numCategory = 0;
int numCitation = 0;
// Register the attributes to read for each token before iterating.
TermAttribute termAtt = tf.addAttribute(TermAttribute.class);
TypeAttribute typeAtt = tf.addAttribute(TypeAttribute.class);
// incrementToken() advances to the next token; false means end of input.
while (tf.incrementToken()) {
  String tokText = termAtt.term();
  //System.out.println("Text: " + tokText + " Type: " + typeAtt.type());
  String expectedType = (String) tcm.get(tokText);
  // assertTrue comes from the enclosing JUnit test case.
  assertTrue("expectedType is null and it shouldn't be for: " + tf.toString(), expectedType != null);
  assertTrue(typeAtt.type() + " is not equal to " + expectedType + " for " + tf.toString(), typeAtt.type().equals(expectedType));
  count++;
  // Tally tokens by the wiki-specific types WikipediaTokenizer emits.
  if (typeAtt.type().equals(WikipediaTokenizer.ITALICS)) {
    numItalics++;
  } else if (typeAtt.type().equals(WikipediaTokenizer.BOLD_ITALICS)) {
    numBoldItalics++;
  } else if (typeAtt.type().equals(WikipediaTokenizer.CATEGORY)) {
    numCategory++;
  } else if (typeAtt.type().equals(WikipediaTokenizer.CITATION)) {
    numCitation++;
  }
}
...
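Since the question just asks for converting a Wikipedia string into a list of tokens, here is a minimal standalone sketch of that usage, assuming Lucene 3.0.x with the contrib-wikipedia jar on the classpath; the class name and the markup string are made up for illustration:

import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

import org.apache.lucene.analysis.tokenattributes.TermAttribute;
import org.apache.lucene.wikipedia.analysis.WikipediaTokenizer;

public class WikipediaTokenizerExample {
  public static void main(String[] args) throws Exception {
    // Hypothetical sample input; any string of MediaWiki markup works here.
    String wikiText = "'''Lucene''' is a [[search engine]] library. [[Category:Software]]";

    WikipediaTokenizer tokenizer = new WikipediaTokenizer(new StringReader(wikiText));
    // Register the term attribute so we can read each token's text.
    TermAttribute termAtt = tokenizer.addAttribute(TermAttribute.class);

    List<String> tokens = new ArrayList<String>();
    while (tokenizer.incrementToken()) { // returns false at end of stream
      tokens.add(termAtt.term());
    }
    tokenizer.end();
    tokenizer.close();

    System.out.println(tokens);
  }
}

If you also need each token's wiki-specific type (italics, category, citation, and so on), register a TypeAttribute the same way, as the test sample above does.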