In Lucene, a Document is the unit of indexing and search.
- An index consists of a sequence of documents.
- A document consists of one or more fields.
- A field is a named sequence of terms.
- A term is a string.
To index data with Lucene, you first convert it into plain text from which a stream of tokens can be produced. From that text you build a Document containing one or more fields, and then call the addDocument method of the IndexWriter class to add the document to the index. During this process, an Analyzer may be used to make the text more suitable for indexing.
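To make this flow concrete, here is a minimal indexing sketch. It is only a sketch, assuming Lucene 6.x; the index path, analyzer choice, and field names are illustrative rather than taken from the source code discussed below.

import java.nio.file.Paths;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StringField;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.FSDirectory;

public class IndexingSketch {
  public static void main(String[] args) throws Exception {
    // The analyzer is what turns field text into a stream of tokens during indexing.
    IndexWriterConfig config = new IndexWriterConfig(new StandardAnalyzer());
    try (IndexWriter writer = new IndexWriter(
        FSDirectory.open(Paths.get("/tmp/lucene-demo")), config)) {
      Document doc = new Document();
      // An identifier field, indexed literally (not tokenized) and stored.
      doc.add(new StringField("id", "doc-1", Field.Store.YES));
      // A body field whose text is tokenized by the analyzer.
      doc.add(new TextField("body", "Lucene converts text into a token stream", Field.Store.NO));
      // Hand the document over to the index.
      writer.addDocument(doc);
    }
  }
}

StringField and TextField are convenience subclasses of Field that preconfigure a suitable FieldType; the Field class they build on is described next.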
Field
In Lucene, fields may be stored, in which case their text is stored in the index literally, in a non-inverted manner. Fields that are inverted are called indexed. A field may be both stored and indexed.
The text of a field may be tokenized into terms to be indexed, or the text of a field may be used literally as a term to be indexed. Most fields are tokenized, but sometimes it is useful for certain identifier fields to be indexed literally.
Each field has three parts: name, type and value. Values may be text (String, Reader or pre-analyzed TokenStream), binary (byte[]), or numeric (a Number). Fields are optionally stored in the index, so that they may be returned with hits on the document.
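As a quick illustration of these combinations, here is a small sketch (the field names and values are made up) that puts a tokenized field, a literally indexed identifier field, and a stored-only field into one document:

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StoredField;
import org.apache.lucene.document.StringField;
import org.apache.lucene.document.TextField;

public class FieldKindsSketch {
  static Document buildDocument() {
    Document doc = new Document();
    // Tokenized, indexed and stored: the analyzer breaks the text into terms.
    doc.add(new TextField("title", "Lucene in Action", Field.Store.YES));
    // Indexed literally as a single term (not tokenized), typical for identifiers.
    doc.add(new StringField("isbn", "9781933988177", Field.Store.YES));
    // Stored only, never inverted: returned with hits but not searchable.
    doc.add(new StoredField("coverImage", new byte[] {0x41, 0x42}));
    return doc;
  }
}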
Now let's dive into the source code.
public class Field implements IndexableField {
protected final FieldType type;
protected final String name;
protected float boost = 1.0f;
/**
 * The field's value (fieldsData); it may be one of:
 * 1. a Numeric value
 * 2. a Binary value
 * 3. a String value
 * 4. a Reader value
 */
protected Object fieldsData;
/** Pre-analyzed tokenStream for indexed fields; this is
* separate from fieldsData because you are allowed to
* have both; eg maybe field has a String value but you
* customize how it's tokenized */
protected TokenStream tokenStream;
/**
 * ...
 * getters and setters for the attributes above (omitted here)
 */
/**
 * Creates the TokenStream used for indexing this field. If appropriate,
 * implementations should use the given Analyzer to create the TokenStreams.
 * @param analyzer Analyzer that should be used to create the TokenStreams from
 * @param reuse TokenStream for a previous instance of this field name. This allows
 *              custom field types (like StringField and NumericField) that do not use
 *              the analyzer to still have good performance. Note: the passed-in type
 *              may be inappropriate, for example if you mix up different types of Fields
 *              for the same field name. So it's the responsibility of the implementation to
 *              check.
 * @return TokenStream value for indexing the document. Should always return
 *         a non-null value if the field is to be indexed
 */
@Override
public TokenStream tokenStream(Analyzer analyzer, TokenStream reuse) {
if (fieldType().indexOptions() == IndexOptions.NONE) {
// Not indexed
return null;
}
final FieldType.LegacyNumericType numericType = fieldType().numericType();
if (numericType != null) {
if (!(reuse instanceof LegacyNumericTokenStream && ((LegacyNumericTokenStream)reuse).getPrecisionStep() == type.numericPrecisionStep())) {
// lazy init the TokenStream as it is heavy to instantiate
// (attributes,...) if not needed (stored field loading)
reuse = new LegacyNumericTokenStream(type.numericPrecisionStep());
}
final LegacyNumericTokenStream nts = (LegacyNumericTokenStream) reuse;
// initialize value in TokenStream
final Number val = (Number) fieldsData;
switch (numericType) {
case INT:
nts.setIntValue(val.intValue());
break;
case LONG:
nts.setLongValue(val.longValue());
break;
case FLOAT:
nts.setFloatValue(val.floatValue());
break;
case DOUBLE:
nts.setDoubleValue(val.doubleValue());
break;
default:
throw new AssertionError("Should never get here");
}
return reuse;
}
if (!fieldType().tokenized()) {
if (stringValue() != null) {
if (!(reuse instanceof StringTokenStream)) {
// lazy init the TokenStream as it is heavy to instantiate
// (attributes,...) if not needed
reuse = new StringTokenStream();
}
((StringTokenStream) reuse).setValue(stringValue());
return reuse;
} else if (binaryValue() != null) {
if (!(reuse instanceof BinaryTokenStream)) {
// lazy init the TokenStream as it is heavy to instantiate
// (attributes,...) if not needed
reuse = new BinaryTokenStream();
}
((BinaryTokenStream) reuse).setValue(binaryValue());
return reuse;
} else {
throw new IllegalArgumentException("Non-Tokenized Fields must have a String value");
}
}
if (tokenStream != null) {
return tokenStream;
} else if (readerValue() != null) {
return analyzer.tokenStream(name(), readerValue());
} else if (stringValue() != null) {
return analyzer.tokenStream(name(), stringValue());
}
throw new IllegalArgumentException("Field must have either TokenStream, String, Reader or Number value; got " + this);
}
}
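Which branch of tokenStream a field takes depends entirely on the kind of value it carries. The following is a small sketch of that, assuming Lucene 6.x; the field names are illustrative, and reuse is passed as null so each stream is created fresh.

import java.io.StringReader;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StringField;
import org.apache.lucene.document.TextField;

public class TokenStreamBranches {
  public static void main(String[] args) throws Exception {
    StandardAnalyzer analyzer = new StandardAnalyzer();

    // String value on a non-tokenized field: the StringTokenStream branch.
    Field id = new StringField("id", "doc-1", Field.Store.NO);
    // String value on a tokenized field: analyzer.tokenStream(name(), stringValue()).
    Field body = new TextField("body", "hello token stream", Field.Store.NO);
    // Reader value on a tokenized field: analyzer.tokenStream(name(), readerValue()).
    Field fromReader = new TextField("bodyFromReader", new StringReader("read me"));

    for (Field f : new Field[] {id, body, fromReader}) {
      TokenStream ts = f.tokenStream(analyzer, null);
      System.out.println(f.name() + " -> " + ts.getClass().getSimpleName());
      // Consume the stream following the TokenStream contract.
      ts.reset();
      while (ts.incrementToken()) {
        // tokens would be read via attributes here
      }
      ts.end();
      ts.close();
    }
  }
}

During indexing, Lucene passes the previous field instance's stream as reuse, which is what lets StringTokenStream, BinaryTokenStream and LegacyNumericTokenStream instances be recycled instead of re-created for every document.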
Document
Documents are the unit of indexing and search. A Document is a set of fields. Each field has a name and a textual value. A field may be stored with the document, in which case it is returned with search hits on the document. Thus each document should typically contain one or more stored fields which uniquely identify it.
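For example, it is typically a stored identifier field that lets you map a search hit back to your own data. Here is a short sketch of the retrieval side, assuming an index built as in the earlier indexing example and the lucene-queryparser module on the classpath:

import java.nio.file.Paths;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.queryparser.classic.QueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.store.FSDirectory;

public class SearchSketch {
  public static void main(String[] args) throws Exception {
    try (DirectoryReader reader = DirectoryReader.open(
        FSDirectory.open(Paths.get("/tmp/lucene-demo")))) {
      IndexSearcher searcher = new IndexSearcher(reader);
      QueryParser parser = new QueryParser("body", new StandardAnalyzer());
      for (ScoreDoc hit : searcher.search(parser.parse("token"), 10).scoreDocs) {
        // Only stored fields are available on the retrieved Document.
        Document doc = searcher.doc(hit.doc);
        System.out.println(doc.get("id"));
      }
    }
  }
}

The Document class itself is essentially a thin wrapper around a list of fields: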
public final class Document implements Iterable<IndexableField> {
private final List<IndexableField> fields = new ArrayList<>();
@Override
public Iterator<IndexableField> iterator() {
return fields.iterator();
}
/**
 * Adds a field to a document.
 * Several fields may be added with the same name.
 */
public final void add(IndexableField field) {
fields.add(field);
}
public final void removeField(String name) {
Iterator<IndexableField> it = fields.iterator();
while (it.hasNext()) {
IndexableField field = it.next();
if (field.name().equals(name)) {
it.remove();
return;
}
}
}
/**
* Returns an array of bytes for the first (or only) field that has the name
* specified as the method parameter.
*/
public final BytesRef getBinaryValue(String name) {
for (IndexableField field : fields) {
if (field.name().equals(name)) {
final BytesRef bytes = field.binaryValue();
if (bytes != null) {
return bytes;
}
}
}
return null;
}
public final IndexableField getField(String name) {
for (IndexableField field : fields) {
if (field.name().equals(name)) {
return field;
}
}
return null;
}
}
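Finally, a brief usage sketch of the accessors above (field names are illustrative):

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StoredField;
import org.apache.lucene.document.StringField;
import org.apache.lucene.index.IndexableField;
import org.apache.lucene.util.BytesRef;

public class DocumentApiSketch {
  public static void main(String[] args) {
    Document doc = new Document();
    doc.add(new StringField("id", "doc-1", Field.Store.YES));
    doc.add(new StoredField("payload", new byte[] {1, 2, 3}));

    IndexableField id = doc.getField("id");           // first field with this name, or null
    BytesRef payload = doc.getBinaryValue("payload"); // first binary value, or null
    System.out.println(id.stringValue() + " / " + payload.length + " bytes");

    // Removes only the first field with the given name; it only affects this
    // in-memory Document, not an index the document was already added to.
    doc.removeField("payload");
  }
}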