Tokenize, remove stop words using Lucene with Java
Problem description
I am trying to tokenize a txt file and remove stop words from it with Lucene. I have this:
public String removeStopWords(String string) throws IOException {
    Set<String> stopWords = new HashSet<String>();
    stopWords.add("a");
    stopWords.add("an");
    stopWords.add("I");
    stopWords.add("the");
    TokenStream tokenStream = new StandardTokenizer(Version.LUCENE_43, new StringReader(string));
    tokenStream = new StopFilter(Version.LUCENE_43, tokenStream, stopWords);
    StringBuilder sb = new StringBuilder();
    CharTermAttribute token = tokenStream.getAttribute(CharTermAttribute.class);
    while (tokenStream.incrementToken()) {
        if (sb.length() > 0) {
            sb.append(" ");
        }
        sb.append(token.toString());
        System.out.println(sb);
    }
    return sb.toString();
}}
My main looks like this:
    String file = "..../datatest.txt";
    TestFileReader fr = new TestFileReader();
    fr.imports(file);
    System.out.println(fr.content);
    String text = fr.content;
    Stopwords stopwords = new Stopwords();
    stopwords.removeStopWords(text);
    System.out.println(stopwords.removeStopWords(text));
This gives me an error, but I can't figure out why.
Recommended answer
I had the same problem. To remove stop words with Lucene, you can either use the default stop set returned by EnglishAnalyzer.getDefaultStopSet(), or build your own custom stop-word list.
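The code below shows a corrected version of your removeStopWords(). Compared with the code in the question, it holds the stop words in a CharArraySet and calls tokenStream.reset() before the first incrementToken(), which Lucene's TokenStream contract requires: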
public static String removeStopWords(String textFile) throws Exception {
    // Lucene's built-in English stop words
    CharArraySet stopWords = EnglishAnalyzer.getDefaultStopSet();
    TokenStream tokenStream = new StandardTokenizer(Version.LUCENE_48, new StringReader(textFile.trim()));
    tokenStream = new StopFilter(Version.LUCENE_48, tokenStream, stopWords);
    StringBuilder sb = new StringBuilder();
    CharTermAttribute charTermAttribute = tokenStream.addAttribute(CharTermAttribute.class);
    // reset() must be called before the first incrementToken()
    tokenStream.reset();
    while (tokenStream.incrementToken()) {
        String term = charTermAttribute.toString();
        sb.append(term + " ");
    }
    // finish and release the stream, as the TokenStream workflow expects
    tokenStream.end();
    tokenStream.close();
    return sb.toString();
}
To use a custom list of stop words, use the following:
// CharArraySet stopWords = EnglishAnalyzer.getDefaultStopSet(); // Lucene's default set
final List<String> stop_Words = Arrays.asList("fox", "the");
final CharArraySet stopSet = new CharArraySet(Version.LUCENE_48, stop_Words, true);
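The third constructor argument above (true) asks for case-insensitive matching. To make that behaviour concrete, here is a minimal plain-Java sketch that does not use Lucene at all. The class StopWordSketch and its removeStopWords helper are hypothetical names made up for illustration, a TreeSet with a case-insensitive comparator stands in for CharArraySet, and splitting on whitespace is a simplifying assumption that is much cruder than StandardTokenizer.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

class StopWordSketch {

    // Hypothetical helper, not part of Lucene: drops every whitespace-separated
    // token that appears in stopWords, comparing case-insensitively.
    static String removeStopWords(String text, List<String> stopWords) {
        // Stand-in for a CharArraySet built with ignoreCase = true.
        Set<String> stopSet = new TreeSet<>(String.CASE_INSENSITIVE_ORDER);
        stopSet.addAll(stopWords);

        StringBuilder sb = new StringBuilder();
        // Assumption: whitespace splitting only, no punctuation handling.
        for (String token : text.trim().split("\\s+")) {
            if (token.isEmpty() || stopSet.contains(token)) {
                continue; // skip blanks and stop words
            }
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(token);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        List<String> stopWords = Arrays.asList("fox", "the");
        // prints "quick brown jumps"
        System.out.println(removeStopWords("The quick brown Fox jumps", stopWords));
    }
}
```

Both "The" and "Fox" are dropped even though the list holds only lowercase entries, which is the effect the ignoreCase flag is meant to give. With a case-sensitive set, a capitalised token would pass through unless a lowercasing step ran first.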