How to extract Document Term Vector in Lucene 3.5.0
Question
I am using Lucene 3.5.0 and I want to output the term vectors of each document. For example, I want to know the frequency of a term across all documents and within each specific document. My indexing code is:
import java.io.BufferedReader;
import java.io.File;
import java.io.FileFilter;
import java.io.FileReader;
import java.io.IOException;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;
public class Indexer {

    public static void main(String[] args) throws Exception {
        if (args.length != 2) {
            throw new IllegalArgumentException("Usage: java " + Indexer.class.getName() + " <index dir> <data dir>");
        }
        String indexDir = args[0];
        String dataDir = args[1];
        long start = System.currentTimeMillis();
        Indexer indexer = new Indexer(indexDir);
        int numIndexed;
        try {
            numIndexed = indexer.index(dataDir, new TextFilesFilter());
        } finally {
            indexer.close();
        }
        long end = System.currentTimeMillis();
        System.out.println("Indexing " + numIndexed + " files took " + (end - start) + " milliseconds");
    }

    private IndexWriter writer;

    public Indexer(String indexDir) throws IOException {
        // Create (or overwrite) the index in indexDir using the StandardAnalyzer.
        Directory dir = FSDirectory.open(new File(indexDir));
        writer = new IndexWriter(dir,
                new StandardAnalyzer(Version.LUCENE_35),
                true,
                IndexWriter.MaxFieldLength.UNLIMITED);
    }

    public void close() throws IOException {
        writer.close();
    }

    public int index(String dataDir, FileFilter filter) throws Exception {
        File[] files = new File(dataDir).listFiles();
        for (File f : files) {
            if (!f.isDirectory() &&
                !f.isHidden() &&
                f.exists() &&
                f.canRead() &&
                (filter == null || filter.accept(f))) {
                // The first line of each file is treated as the document's URL.
                // Open the File itself; new FileReader(f.getName()) would only
                // resolve correctly when the program is run from inside dataDir.
                BufferedReader inputStream = new BufferedReader(new FileReader(f));
                String url = inputStream.readLine();
                inputStream.close();
                indexFile(f, url);
            }
        }
        return writer.numDocs();
    }

    private static class TextFilesFilter implements FileFilter {
        public boolean accept(File path) {
            return path.getName().toLowerCase().endsWith(".txt");
        }
    }

    protected Document getDocument(File f, String url) throws Exception {
        Document doc = new Document();
        doc.add(new Field("contents", new FileReader(f)));
        doc.add(new Field("urls", url, Field.Store.YES, Field.Index.NOT_ANALYZED));
        doc.add(new Field("filename", f.getName(), Field.Store.YES, Field.Index.NOT_ANALYZED));
        doc.add(new Field("fullpath", f.getCanonicalPath(), Field.Store.YES, Field.Index.NOT_ANALYZED));
        return doc;
    }

    private void indexFile(File f, String url) throws Exception {
        System.out.println("Indexing " + f.getCanonicalPath());
        Document doc = getDocument(f, url);
        writer.addDocument(doc);
    }
}
Can anybody help me write a program to do that? Thanks.
Accepted answer
First of all, you don't need to store term vectors just to know the frequency of a term in documents. Lucene stores these numbers anyway for use in its TF-IDF calculations. You can access this information by calling IndexReader.termDocs(term) and iterating over the result.
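A minimal sketch of that approach, assuming the index built by the Indexer above. The class name, the index-path argument and the example term "lucene" are made up for illustration; with StandardAnalyzer the term text must be the analyzed (lowercased) form.

import java.io.File;

import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.index.TermDocs;
import org.apache.lucene.store.FSDirectory;

public class TermFrequencies {
    public static void main(String[] args) throws Exception {
        // args[0] is the index directory written by the Indexer above.
        IndexReader reader = IndexReader.open(FSDirectory.open(new File(args[0])));
        Term term = new Term("contents", "lucene");  // hypothetical example term

        // Document frequency: how many documents contain the term at least once.
        System.out.println("docFreq = " + reader.docFreq(term));

        // Term frequency per document: iterate over the postings for this term.
        TermDocs termDocs = reader.termDocs(term);
        while (termDocs.next()) {
            System.out.println("doc " + termDocs.doc() + " -> freq " + termDocs.freq());
        }
        termDocs.close();
        reader.close();
    }
}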
If you have some other purpose in mind and you actually need to access the term vectors themselves, then you need to tell Lucene to store them by passing Field.TermVector.YES as the last argument of the Field constructor. Then you can retrieve the vectors, e.g. with IndexReader.getTermFreqVector().
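In the question's getDocument() method, that would mean indexing the contents field with a term vector, for example:

    doc.add(new Field("contents", new FileReader(f), Field.TermVector.YES));

After rebuilding the index, a small program can dump the stored vector of every document. This is only a sketch: the class name and index-path argument are assumptions, while the "contents" and "filename" fields follow the question's code.

import java.io.File;

import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.TermFreqVector;
import org.apache.lucene.store.FSDirectory;

public class DumpTermVectors {
    public static void main(String[] args) throws Exception {
        // args[0] is the index directory written by the Indexer above.
        IndexReader reader = IndexReader.open(FSDirectory.open(new File(args[0])));
        for (int docId = 0; docId < reader.maxDoc(); docId++) {
            if (reader.isDeleted(docId)) {
                continue;
            }
            // Returns null if no term vector was stored for this document/field.
            TermFreqVector vector = reader.getTermFreqVector(docId, "contents");
            if (vector == null) {
                continue;
            }
            String[] terms = vector.getTerms();
            int[] freqs = vector.getTermFrequencies();
            System.out.println("Document " + docId + " (" + reader.document(docId).get("filename") + "):");
            for (int i = 0; i < terms.length; i++) {
                System.out.println("  " + terms[i] + " : " + freqs[i]);
            }
        }
        reader.close();
    }
}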