Paul Masurel ab0c8ed2b7 | 8 years ago
`tantivy-cli` is the project hosting the command line interface for tantivy, a search engine project.
In this tutorial, we will create a brand new index with the articles of English Wikipedia in it.
There are a couple of ways to install `tantivy-cli`.
If you are a Rust programmer, you probably have `cargo` installed and you can just run `cargo install tantivy-cli`.
Alternatively, if you are on 64-bit Linux, you can directly download a static binary: binaries/linux_x86_64/, and save it in a directory on your system's `PATH`.
new
Let's create a directory in which your index will be stored.
# create the directory
mkdir wikipedia-index
We will now initialize the index and create its schema. The schema defines the list of your fields, and for each field its name, its type (`u32` or `str`), and how it should be indexed.
You can find more information about the latter on [tantivy's schema documentation page](http://fulmicoton.com/tantivy/tantivy/schema/index.html).
In our case, our documents will contain a title, a body, and a url.
We want the title and the body to be tokenized and indexed. We also want to add the term frequency and term positions to our index. (To be honest, phrase queries are not yet implemented in tantivy, so the positions won't be really useful in this tutorial.)
Running `tantivy new` will start a wizard that will help you define the schema of the new index.
Like all the other commands of `tantivy`, you will have to pass it your index directory via the `-i` or `--index` parameter as follows:
tantivy new -i wikipedia-index
Answer the questions as follows:
Creating new index
Let's define its schema!
New field name ? title
Text or unsigned 32-bit integer (T/I) ? T
Should the field be stored (Y/N) ? Y
Should the field be indexed (Y/N) ? Y
Should the field be tokenized (Y/N) ? Y
Should the term frequencies (per doc) be in the index (Y/N) ? Y
Should the term positions (per doc) be in the index (Y/N) ? Y
Add another field (Y/N) ? Y
New field name ? body
Text or unsigned 32-bit integer (T/I) ? T
Should the field be stored (Y/N) ? Y
Should the field be indexed (Y/N) ? Y
Should the field be tokenized (Y/N) ? Y
Should the term frequencies (per doc) be in the index (Y/N) ? Y
Should the term positions (per doc) be in the index (Y/N) ? Y
Add another field (Y/N) ? Y
New field name ? url
Text or unsigned 32-bit integer (T/I) ? T
Should the field be stored (Y/N) ? Y
Should the field be indexed (Y/N) ? N
Add another field (Y/N) ? N
[
  {
    "name": "title",
    "type": "text",
    "options": {
      "indexing": "position",
      "stored": true
    }
  },
  {
    "name": "body",
    "type": "text",
    "options": {
      "indexing": "position",
      "stored": true
    }
  },
  {
    "name": "url",
    "type": "text",
    "options": {
      "indexing": "unindexed",
      "stored": true
    }
  }
]
After the wizard has finished, a `meta.json` file should exist at `wikipedia-index/meta.json`.
It is a fairly human readable JSON, so you can check its content.
It contains two sections:
index
Tantivy's `index` command offers a way to index a JSON file.
The file must contain one JSON object per line.
The structure of this JSON object must match that of our schema definition.
{"body": "some text", "title": "some title", "url": "http://somedomain.com"}
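Before feeding a large file to the indexer, it can help to check a few lines against the schema. The sketch below is only an illustration: it assumes that "matching the schema" means every key is a field declared in the schema and every value is a string (all three fields here are text fields). tantivy's own rules may be stricter or looser. The helper names `schema_field_names` and `check_line` are hypothetical and not part of tantivy.

```python
import json

# Field list as printed by the `tantivy new` wizard earlier in this tutorial.
SCHEMA_JSON = """
[
  {"name": "title", "type": "text", "options": {"indexing": "position", "stored": true}},
  {"name": "body", "type": "text", "options": {"indexing": "position", "stored": true}},
  {"name": "url", "type": "text", "options": {"indexing": "unindexed", "stored": true}}
]
"""


def schema_field_names(schema_json):
    """Return the set of field names declared in the schema."""
    return {field["name"] for field in json.loads(schema_json)}


def check_line(line, field_names):
    """Return the problems found in one input line (empty list if none).

    Assumption: a line is acceptable when it is a JSON object whose keys are
    all schema fields and whose values are all strings.
    """
    try:
        doc = json.loads(line)
    except json.JSONDecodeError as err:
        return [f"not valid JSON: {err.msg}"]
    if not isinstance(doc, dict):
        return ["not a JSON object"]
    problems = []
    for key, value in doc.items():
        if key not in field_names:
            problems.append(f"unknown field {key!r}")
        elif not isinstance(value, str):
            problems.append(f"field {key!r} is not a string")
    return problems


if __name__ == "__main__":
    fields = schema_field_names(SCHEMA_JSON)
    sample = [
        '{"body": "some text", "title": "some title", "url": "http://somedomain.com"}',
        '{"body": "some text", "author": "someone"}',
    ]
    for number, line in enumerate(sample, start=1):
        print(number, check_line(line, fields) or "ok")
```

To check a real file, the same `check_line` function can be applied to each line read from `wiki-articles.json` instead of the two sample strings.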
For this tutorial, you can download a corpus with the 5 million+ English Wikipedia articles in the right format here: wiki-articles.json (2.34 GB). Make sure to decompress the file:
bunzip2 wiki-articles.json.bz2
If you are in a rush you can download 100 articles in the right format here.
The `index` command will index your documents.
By default it will use as many threads as there are cores on your machine.
You can change the number of threads by passing it the `-t` parameter.
On my computer (8 core Xeon(R) CPU X3450 @ 2.67GHz), it will take around 6 minutes.
cat wiki-articles.json | tantivy index -i ./wikipedia-index
While it is indexing, you can peek at the index directory to check what is happening.
ls ./wikipedia-index
If you indexed the 5 million articles, you should see a lot of new files, all with the following format:
The main file is `meta.json`.
Our index is in fact divided into segments. Each segment acts as an individual smaller index. Its name is simply a UUID.
serve
Tantivy's CLI also embeds a search server. You can run it with the following command:
tantivy serve -i wikipedia-index
By default, it will serve on port `3000`.
You can search for the top 20 most relevant documents for the query `Barack Obama` by accessing the following URL in your browser:
http://localhost:3000/api/?q=barack+obama&explain=true&nhits=20
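If you would rather build such URLs from a script than type them by hand, the following sketch shows one way to do it. It only relies on what the URL above shows: the `/api/` path and the `q`, `explain` and `nhits` parameters. The function name `build_search_url` and its `host` and `port` arguments are hypothetical, and the format of the server's response is not covered here.

```python
from urllib.parse import urlencode


def build_search_url(query, nhits=20, host="localhost", port=3000):
    """Build a search URL for the server started by `tantivy serve`.

    The parameter names (q, explain, nhits) are copied from the example URL
    in this tutorial; urlencode takes care of escaping the query text.
    """
    params = {
        "q": query,
        "explain": "true",
        "nhits": nhits,
    }
    return f"http://{host}:{port}/api/?{urlencode(params)}"


if __name__ == "__main__":
    # Prints: http://localhost:3000/api/?q=barack+obama&explain=true&nhits=20
    print(build_search_url("barack obama"))
```

Escaping matters as soon as the query contains characters such as `&` or `+`, which would otherwise be read as part of the URL syntax rather than as part of the query.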
merge
Each of tantivy's indexer threads builds its own independent segment. When its buffer is full, it closes its running segment and starts working on a new one. You should currently have more than 50 segments in your directory.
Having that many segments can hurt your query performance.
Calling `tantivy merge` will merge your segments into one.
tantivy merge -i ./wikipedia-index
(The command takes less than 4 minutes on my computer)
Note that your files are still there even after having run the command.
However, `meta.json` only lists one of the segments.
You will still need to remove the files manually.