A First Look at Lucene 4.5 Full-Text Search

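Recently I wanted to study Lucene, but most tutorials online cover Lucene 3.x. Lucene releases move astonishingly fast, though, and it is already at version 4.8, so I had to turn to the official documentation. My English is not great, but even a quick comparison shows that the changes from 3.x to 4.x are very large. Below I use version 4.5 and share my first impressions of Lucene.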

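Let's start with a diagram (from "Lucene in Action", http://www.linuxidc.com/Linux/2013-10/91052.htm):

[Figure: the basic Lucene workflow]

The diagram illustrates Lucene's basic workflow vividly. In short: 1. build the index; 2. search the index.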

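I still only half understand the deeper theory myself, so I will explain things from a beginner's point of view.

Lucene has a lot in common with a database. Here is a simple comparison:

                  Database           Lucene
Basic concepts    Column / field     Field
                  Row / record       Document
Basic operations  Query (SELECT)     Searcher
                  Insert (INSERT)    IndexWriter.addDocument
                  Delete (DELETE)    IndexReader.delete
                  Update (UPDATE)    Not supported (delete, then re-add)

The table above comes from another blogger's post, and I find it easy to follow. Note that it describes older Lucene versions: in 4.x, documents are deleted through IndexWriter.deleteDocuments, and IndexWriter.updateDocument performs the delete-then-re-add in a single call. You can think of it this way: Lucene builds an index over the data in your database, and when you later need a full-text search you look things up in that index, much like the index of a dictionary.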
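To make the dictionary analogy concrete, here is a minimal sketch of an inverted index written with the Java standard library only. It is a toy under my own assumptions, not how Lucene is implemented: the class name TinyInvertedIndex and its methods are hypothetical, and the "analyzer" is just a lowercase split on non-word characters.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

/** Toy inverted index: maps each lowercase word to the IDs of the rows containing it. */
final class TinyInvertedIndex {
    private final Map<String, Set<Integer>> postings = new HashMap<>();

    /** Splits the text on non-word characters and records docId under every word. */
    void add(int docId, String text) {
        for (String word : text.toLowerCase().split("\\W+")) {
            if (word.isEmpty()) {
                continue; // split() can yield an empty first element
            }
            postings.computeIfAbsent(word, k -> new TreeSet<>()).add(docId);
        }
    }

    /** Returns the sorted IDs of all rows containing the word, without scanning the rows. */
    List<Integer> search(String word) {
        Set<Integer> ids = postings.get(word.toLowerCase());
        return ids == null ? new ArrayList<>() : new ArrayList<>(ids);
    }

    public static void main(String[] args) {
        TinyInvertedIndex index = new TinyInvertedIndex();
        index.add(1, "Lucene in Action");
        index.add(2, "Hadoop in Practice");
        index.add(3, "Getting started with Lucene 4.5");
        System.out.println(index.search("lucene")); // prints [1, 3]
    }
}
```

The lookup goes straight from the word to the matching rows, which is the same reason a dictionary index is faster than reading every page.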

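Enough talk; let the code speak. First, add three jars to the project:

lucene-analyzers-common-4.5.0, lucene-core-4.5.0, lucene-queryparser-4.5.0

In keeping with object-oriented thinking, first create a JavaBean, LuceneBeans, to hold the data you need: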

package pojo;

// Plain JavaBean holding the data of one record to be indexed
public class LuceneBeans {

    private String id;
    private String title;
    private String introduce;
    private String addtime;
    private String category;

    public LuceneBeans() {
        super();
    }

    public LuceneBeans(String id, String title, String introduce,
            String addtime, String category) {
        super();
        this.id = id;
        this.title = title;
        this.introduce = introduce;
        this.addtime = addtime;
        this.category = category;
    }

    public String getId() {
        return id;
    }

    public void setId(String id) {
        this.id = id;
    }

    public String getTitle() {
        return title;
    }

    public void setTitle(String title) {
        this.title = title;
    }

    public String getIntroduce() {
        return introduce;
    }

    public void setIntroduce(String introduce) {
        this.introduce = introduce;
    }

    public String getAddtime() {
        return addtime;
    }

    public void setAddtime(String addtime) {
        this.addtime = addtime;
    }

    public String getCategory() {
        return category;
    }

    public void setCategory(String category) {
        this.category = category;
    }
}

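The code below wraps the basic Lucene operations in an IndexUtil class. Because creating an IndexReader is expensive, the class keeps a single shared instance. The comments point out some differences between 4.x and 3.x along the way.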

package lucene;

import java.io.File;
import java.io.IOException;
import java.util.ArrayList;
import java.util.Date;
import java.util.List;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StringField;
import org.apache.lucene.document.TextField;
import org.apache.lucene.document.Field.Store;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.Term;
import org.apache.lucene.index.IndexWriterConfig.OpenMode;
import org.apache.lucene.queryparser.classic.ParseException;
import org.apache.lucene.queryparser.classic.QueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;

import pojo.LuceneBeans;

public class IndexUtil {

    private static Directory directory = null;
    private static IndexReader reader = null;

    /**
     * Returns the path where the index is stored.
     * @return the index directory path
     */
    public static String getIndexDir() {
        // Path of the classpath root, where the .class files live
        String classpath = IndexUtil.class.getResource("/").getPath();
        // Replace %20 in the path with a space (other escapes are left untouched)
        classpath = classpath.replaceAll("%20", " ");
        // Index location: WEB-INF/classes/index
        String path = classpath + "index/";
        return path;
    }

    public IndexUtil() {
        try {
            // Directory implementations include CompoundFileDirectory, FileSwitchDirectory,
            // FSDirectory, NRTCachingDirectory, RAMDirectory and others, each with its own
            // storage strategy. The advantage of FSDirectory.open is that it picks a
            // suitable file-system implementation for you.
            directory = FSDirectory.open(new File(getIndexDir()));
            // 3.x used: reader = IndexReader.open(directory);
            // That is discouraged since 4.x, which uses DirectoryReader.open(directory).
            // Note: this throws if no index has been created in the directory yet.
            reader = DirectoryReader.open(directory);
        } catch (IOException e) {
            e.printStackTrace();
        }
    }

    /**
     * Creates an IndexWriter.
     * @return the writer, or null if it could not be created
     */
    public static IndexWriter getIndexWriter() {
        // Create the analyzer. You can plug in a custom one as needed;
        // here we use the standard analyzer that ships with Lucene.
        Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_45);
        IndexWriter indexWriter = null;
        try {
            // Open the directory here as well, so that this static method also
            // works when no IndexUtil instance has been constructed yet
            if (directory == null) {
                directory = FSDirectory.open(new File(getIndexDir()));
            }
            IndexWriterConfig iwConfig = new IndexWriterConfig(Version.LUCENE_45, analyzer);
            iwConfig.setOpenMode(OpenMode.CREATE_OR_APPEND);
            indexWriter = new IndexWriter(directory, iwConfig);
            return indexWriter;
        } catch (IOException e) {
            e.printStackTrace();
            return null;
        }
    }

    // -------- remaining methods omitted
}
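The "single shared instance" idea can be sketched without Lucene. The example below is only an illustration under assumptions: ExpensiveReader is a hypothetical stand-in for an IndexReader, and the initialization-on-demand holder idiom shown here is one common way to get a lazily created, thread-safe singleton in Java. It is not how IndexUtil above does it.

```java
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical stand-in for a resource that is costly to open, such as an index reader. */
final class ExpensiveReader {
    /** Counts how many times a reader has actually been opened. */
    static final AtomicInteger OPEN_COUNT = new AtomicInteger();

    ExpensiveReader() {
        OPEN_COUNT.incrementAndGet(); // imagine slow disk I/O here
    }
}

/** Hands out one shared ExpensiveReader, created lazily on first use. */
final class SharedReader {
    private SharedReader() { }

    // The JVM initializes Holder only when get() first touches it,
    // and class initialization is thread-safe, so no explicit locking is needed.
    private static final class Holder {
        static final ExpensiveReader INSTANCE = new ExpensiveReader();
    }

    static ExpensiveReader get() {
        return Holder.INSTANCE;
    }

    public static void main(String[] args) {
        ExpensiveReader first = SharedReader.get();
        ExpensiveReader second = SharedReader.get();
        System.out.println(first == second);            // prints true
        System.out.println(ExpensiveReader.OPEN_COUNT); // prints 1
    }
}
```

With a real index, a shared reader also has to be refreshed after new documents are written; in Lucene 4.x, DirectoryReader.openIfChanged exists for that purpose.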

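Before creating an index we need an IndexWriter. The steps resemble inserting a row into a database: you create a record (Document), add fields (Field) to it, and describe how each field should be handled (Store, Index), and so on.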

/**
 * Indexes a batch of beans.
 * @param list the beans to index
 * @return true if the whole batch was indexed and committed
 */
public boolean createIndexList(List<LuceneBeans> list) {
    boolean result = false;
    IndexWriter indexWriter = IndexUtil.getIndexWriter();
    if (indexWriter == null) {
        // The writer could not be created, so nothing can be indexed
        return false;
    }
    try {
        if (list != null && list.size() > 0) {
            Document doc = null;
            for (int i = 0; i < list.size(); i++) {
                doc = new Document();
                LuceneBeans lb = list.get(i);
                // StringField: indexed as a single term; TextField: tokenized
                doc.add(new StringField("id", lb.getId(), Store.YES));
                doc.add(new TextField("title", lb.getTitle(), Store.YES));
                doc.add(new TextField("introduce", lb.getIntroduce(), Store.YES));
                doc.add(new StringField("addtime", lb.getAddtime(), Store.YES));
                doc.add(new StringField("category", lb.getCategory(), Store.YES));
                indexWriter.addDocument(doc);
            }
            // Commit once after the whole batch has been added
            indexWriter.commit();
            result = true;
        }
    } catch (IOException e) {
        e.printStackTrace();
    } finally {
        try {
            indexWriter.close();
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
    return result;
}
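Since IndexWriter implements java.io.Closeable, the finally block can be replaced by try-with-resources on Java 7 or later. The sketch below shows only the shape of that pattern: RecordingWriter is a hypothetical stand-in that records which calls were made, so the example runs without Lucene on the classpath. It is my own illustration, not part of the original code.

```java
import java.io.Closeable;
import java.io.IOException;
import java.util.Arrays;
import java.util.List;

/** Hypothetical stand-in for an index writer; it only records what was called. */
final class RecordingWriter implements Closeable {
    int added = 0;
    int commits = 0;
    boolean closed = false;

    void addDocument(String doc) throws IOException { added++; }

    void commit() throws IOException { commits++; }

    @Override
    public void close() throws IOException { closed = true; }
}

final class BatchIndexSketch {
    /** Adds every document, commits once, and always closes the writer. */
    static boolean indexAll(RecordingWriter writer, List<String> docs) {
        try (RecordingWriter w = writer) { // closed automatically, even on failure
            if (docs == null || docs.isEmpty()) {
                return false;
            }
            for (String doc : docs) {
                w.addDocument(doc);
            }
            w.commit(); // one commit for the whole batch
            return true;
        } catch (IOException e) {
            e.printStackTrace();
            return false;
        }
    }

    public static void main(String[] args) {
        RecordingWriter writer = new RecordingWriter();
        boolean ok = indexAll(writer, Arrays.asList("doc 1", "doc 2", "doc 3"));
        // prints: true 3 1 true
        System.out.println(ok + " " + writer.added + " " + writer.commits + " " + writer.closed);
    }
}
```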

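In 3.x the same field was written like this:

doc.add(new Field("id", lb.getId(), Field.Store.YES, Field.Index.NOT_ANALYZED_NO_NORMS));

While we are here, let's look at Store and Index.

Store (storage option): whether the field's value is saved in the index (YES: stored, so the original value can be retrieved; NO: not stored, so it cannot be retrieved, although the field can still be indexed).

Index (indexing option):

//|---- ANALYZED: tokenize and index; suitable for titles, body text, and the like

//|---- NOT_ANALYZED: index without tokenizing; suitable for exact-match values such as ID numbers, names, and IDs

//|---- ANALYZED_NO_NORMS: tokenize but do not store norms; norms hold index-time boost and field-length normalization information

//|---- NOT_ANALYZED_NO_NORMS: neither tokenize nor store norms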

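From 4.x on this changed: you now pick a more specific field class such as StringField, TextField, or IntField.

So why does the 4.x code specify only Store and no Index? The official documentation explains it:

TextField: indexed and tokenized automatically.

StringField: indexed but not tokenized; the whole value becomes a single term.

So a lot of code in 4.x is written differently from 3.x. The old style is still accepted for backward compatibility, but it is better to keep up with the times!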
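The practical difference between a tokenized and an untokenized field can be shown with a small standard-library sketch. The assumptions are mine: FieldTermsSketch and its methods are hypothetical, and the whitespace-and-lowercase splitter is a crude stand-in for a real analyzer, not StandardAnalyzer's actual behavior.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

final class FieldTermsSketch {
    /** TextField-like: the value is broken into several lowercase terms. */
    static List<String> tokenizedTerms(String value) {
        List<String> terms = new ArrayList<>();
        for (String word : value.toLowerCase().split("\\s+")) {
            if (!word.isEmpty()) {
                terms.add(word);
            }
        }
        return terms;
    }

    /** StringField-like: the whole value is kept as exactly one term. */
    static List<String> singleTerm(String value) {
        return Collections.singletonList(value);
    }

    public static void main(String[] args) {
        String title = "Lucene in Action";
        System.out.println(tokenizedTerms(title)); // prints [lucene, in, action]
        System.out.println(singleTerm(title));     // prints [Lucene in Action]
        // A lookup for the term "lucene" matches only the tokenized form
        System.out.println(tokenizedTerms(title).contains("lucene")); // prints true
        System.out.println(singleTerm(title).contains("lucene"));     // prints false
    }
}
```

This is why the article's code uses TextField for title and introduce, which are searched by keyword, and StringField for id, addtime, and category, which are matched as whole values.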