|
@@ -13,13 +13,16 @@ In this tutorial, we will create a brand new index with the articles of English |
|
|
## Install |
|
|
## Install |
|
|
|
|
|
|
|
|
There are two ways to get `tantivy`. |
|
|
There are two ways to get `tantivy`. |
|
|
If you are a rust programmer, you can run `cargo install tantivy-cli`. |
|
|
|
|
|
Alternatively, if you are on `Linux 64bits`, you can download a |
|
|
|
|
|
|
|
|
If you are a Rust programmer, you probably have `cargo` installed and you can just |
|
|
|
|
|
run `cargo install tantivy-cli`. |
|
|
|
|
|
|
|
|
|
|
|
Alternatively, if you are on `Linux 64bits`, you can directly try and download a |
|
|
static binary: [binaries/linux_x86_64/](http://fulmicoton.com/tantivy-files/binaries/linux_x86_64/tantivy) |
|
|
static binary: [binaries/linux_x86_64/](http://fulmicoton.com/tantivy-files/binaries/linux_x86_64/tantivy) |
|
|
|
|
|
|
|
|
## Creating the index |
|
|
|
|
|
|
|
|
|
|
|
Create a directory in which your index will be stored. |
|
|
|
|
|
|
|
|
## Creating the index: `new` |
|
|
|
|
|
|
|
|
|
|
|
Let's create a directory in which your index will be stored. |
|
|
|
|
|
|
|
|
```bash |
|
|
```bash |
|
|
# create the directory |
|
|
# create the directory |
|
@@ -27,21 +30,41 @@ Create a directory in which your index will be stored. |
|
|
``` |
|
|
``` |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
We will now initialize the index and create it's schema. |
|
|
|
|
|
|
|
|
We will now initialize the index and create its schema. |
|
|
|
|
|
The [schema](http://fulmicoton.com/tantivy/tantivy/schema/index.html) defines |
|
|
|
|
|
the list of your fields, and for each field : |
|
|
|
|
|
- its name |
|
|
|
|
|
- its type, currently `u32` or `str` |
|
|
|
|
|
- how it should be indexed. |
|
|
|
|
|
|
|
|
|
|
|
You can find more information about the latter on |
|
|
|
|
|
[tantivy's schema documentation page](http://fulmicoton.com/tantivy/tantivy/schema/index.html). |
|
|
|
|
|
|
|
|
Our documents will contain |
|
|
|
|
|
|
|
|
In our case, our documents will contain |
|
|
* a title |
|
|
* a title |
|
|
* a body |
|
|
* a body |
|
|
* a url |
|
|
* a url |
|
|
|
|
|
|
|
|
|
|
|
We want the title and the body to be tokenized and indexed. We want |
|
|
|
|
|
to also add the term frequency and term positions to our index. |
|
|
|
|
|
(To be honest, phrase queries are not yet implemented in tantivy, |
|
|
|
|
|
so the positions won't be really useful in this tutorial.) |
|
|
|
|
|
|
|
|
Running `tantivy new` will start a wizard that will help you go through |
|
|
Running `tantivy new` will start a wizard that will help you go through |
|
|
the definition of the schema of our new index. |
|
|
the definition of the schema of our new index. |
|
|
|
|
|
|
|
|
|
|
|
Like all the other commands of `tantivy`, you will have to |
|
|
|
|
|
pass it your index directory via the `-i` or `--index` |
|
|
|
|
|
parameter as follows. |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
```bash |
|
|
```bash |
|
|
tantivy new -i wikipedia-index |
|
|
tantivy new -i wikipedia-index |
|
|
``` |
|
|
``` |
|
|
|
|
|
|
|
|
When asked answer to the question as follows: |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
When the wizard asks its questions, answer as follows: |
|
|
|
|
|
|
|
|
```none |
|
|
```none |
|
|
|
|
|
|
|
@@ -83,24 +106,24 @@ When asked answer to the question as follows: |
|
|
"name": "title", |
|
|
"name": "title", |
|
|
"type": "text", |
|
|
"type": "text", |
|
|
"options": { |
|
|
"options": { |
|
|
"indexing": "position", |
|
|
|
|
|
"stored": true |
|
|
|
|
|
|
|
|
"indexing": "position", |
|
|
|
|
|
"stored": true |
|
|
} |
|
|
} |
|
|
}, |
|
|
}, |
|
|
{ |
|
|
{ |
|
|
"name": "body", |
|
|
"name": "body", |
|
|
"type": "text", |
|
|
"type": "text", |
|
|
"options": { |
|
|
"options": { |
|
|
"indexing": "position", |
|
|
|
|
|
"stored": true |
|
|
|
|
|
|
|
|
"indexing": "position", |
|
|
|
|
|
"stored": true |
|
|
} |
|
|
} |
|
|
}, |
|
|
}, |
|
|
{ |
|
|
{ |
|
|
"name": "url", |
|
|
"name": "url", |
|
|
"type": "text", |
|
|
"type": "text", |
|
|
"options": { |
|
|
"options": { |
|
|
"indexing": "unindexed", |
|
|
|
|
|
"stored": true |
|
|
|
|
|
|
|
|
"indexing": "unindexed", |
|
|
|
|
|
"stored": true |
|
|
} |
|
|
} |
|
|
} |
|
|
} |
|
|
] |
|
|
] |
|
@@ -108,14 +131,20 @@ When asked answer to the question as follows: |
|
|
|
|
|
|
|
|
``` |
|
|
``` |
|
|
|
|
|
|
|
|
If you want to know more about the meaning of these options, you can check out the [schema doc page](http://fulmicoton.com/tantivy/tantivy/schema/index.html). |
|
|
|
|
|
|
|
|
Once the wizard has finished, a `meta.json` file will have been written to `wikipedia-index/meta.json`. |
|
|
|
|
|
It is fairly human-readable JSON, so you can inspect its contents. |
|
|
|
|
|
|
|
|
|
|
|
It contains two sections : |
|
|
|
|
|
- segments (currently empty, but we will change that soon) |
|
|
|
|
|
- schema |
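
To make the two sections concrete, here is a minimal Python sketch that summarizes a `meta.json` payload. It assumes `segments` is a list and `schema` is a list of field entries shaped like the wizard output; the helper names `summarize_meta` and `load_meta` and the `SAMPLE_META` payload are hypothetical and not part of tantivy.

```python
import json
from pathlib import Path

# Hypothetical payload mirroring the schema defined in this tutorial.
# Assumption: a freshly created index has an empty "segments" list.
SAMPLE_META = {
    "segments": [],
    "schema": [
        {"name": "title", "type": "text",
         "options": {"indexing": "position", "stored": True}},
        {"name": "body", "type": "text",
         "options": {"indexing": "position", "stored": True}},
        {"name": "url", "type": "text",
         "options": {"indexing": "unindexed", "stored": True}},
    ],
}


def summarize_meta(meta: dict) -> dict:
    """Return the segment count and the field names found in a meta.json payload."""
    segments = meta.get("segments", [])
    fields = [field["name"] for field in meta.get("schema", [])]
    return {"segment_count": len(segments), "fields": fields}


def load_meta(index_dir: str) -> dict:
    """Read and parse <index_dir>/meta.json."""
    path = Path(index_dir) / "meta.json"
    return json.loads(path.read_text(encoding="utf-8"))


print(summarize_meta(SAMPLE_META))
```

Calling `summarize_meta(load_meta("wikipedia-index"))` before and after indexing should show the segment count change, provided the file layout matches the assumptions above.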
|
|
|
|
|
|
|
|
The json displayed at the end has been written in `wikipedia-index/meta.json`. |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
# Get the documents to index |
|
|
|
|
|
|
|
|
# Indexing the documents: `index` |
|
|
|
|
|
|
|
|
Tantivy's index command offers a way to index a json file. |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
Tantivy's `index` command offers a way to index a json file. |
|
|
More accurately, the file must contain one document per line, in a json format. |
|
|
More accurately, the file must contain one document per line, in a json format. |
|
|
The structure of this JSON object must match that of our schema definition. |
|
|
The structure of this JSON object must match that of our schema definition. |
|
|
|
|
|
|
|
@@ -123,49 +152,51 @@ The structure of this JSON object must match that of our schema definition. |
|
|
{"body": "some text", "title": "some title", "url": "http://somedomain.com"} |
|
|
{"body": "some text", "title": "some title", "url": "http://somedomain.com"} |
|
|
``` |
|
|
``` |
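
Before feeding a large file to the indexer, it can help to check that every line has the expected shape. The sketch below is one possible check, written in Python; the helper names `check_line` and `count_invalid` are hypothetical, and the rule it applies (exactly the fields `title`, `body` and `url`, all strings) is an assumption that may be stricter than what tantivy itself requires.

```python
import json

# Field names taken from the schema defined in this tutorial.
EXPECTED_FIELDS = {"title", "body", "url"}


def check_line(line: str) -> bool:
    """Return True if the line is a JSON object with exactly the expected string fields."""
    try:
        doc = json.loads(line)
    except json.JSONDecodeError:
        return False
    if not isinstance(doc, dict) or set(doc) != EXPECTED_FIELDS:
        return False
    return all(isinstance(value, str) for value in doc.values())


def count_invalid(lines) -> int:
    """Count the non-empty lines that fail the check."""
    return sum(1 for line in lines if line.strip() and not check_line(line))
```

Since a file object iterates line by line, `count_invalid(open("wiki-articles.json", encoding="utf-8"))` would scan the corpus without loading it all into memory.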
|
|
|
|
|
|
|
|
You can download a corpus of more than 5 millions articles from wikipedia |
|
|
|
|
|
|
|
|
For this tutorial, you can download a corpus of the 5 million+ English Wikipedia articles |
|
|
formatted in the right format here : [wiki-articles.json (2.34 GB)](https://www.dropbox.com/s/wwnfnu441w1ec9p/wiki-articles.json.bz2?dl=0). |
|
|
formatted in the right format here : [wiki-articles.json (2.34 GB)](https://www.dropbox.com/s/wwnfnu441w1ec9p/wiki-articles.json.bz2?dl=0). |
|
|
If you are in a rush you can [download 100 articles in the right format here](http://fulmicoton.com/tantivy-files/wiki-articles-1000.json). |
|
|
|
|
|
|
|
|
|
|
|
Make sure to uncompress the file |
|
|
Make sure to uncompress the file |
|
|
|
|
|
|
|
|
```bash |
|
|
```bash |
|
|
bunzip2 wiki-articles.json.bz2 |
|
|
bunzip2 wiki-articles.json.bz2 |
|
|
``` |
|
|
|
|
|
|
|
|
``` |
|
|
|
|
|
|
|
|
# Index the documents. |
|
|
|
|
|
|
|
|
If you are in a rush you can [download 100 articles in the right format here](http://fulmicoton.com/tantivy-files/wiki-articles-1000.json). |
|
|
|
|
|
|
|
|
The `index` command will index your document. |
|
|
The `index` command will index your document. |
|
|
By default it will use as many threads as there are core on your machine. |
|
|
|
|
|
|
|
|
By default it will use as many threads as there are cores on your machine. |
|
|
|
|
|
You can change the number of threads by passing it the `-t` parameter. |
|
|
|
|
|
|
|
|
On my computer (8 core Xeon(R) CPU X3450 @ 2.67GHz), it only takes 7 minutes. |
|
|
|
|
|
|
|
|
On my computer (8 core Xeon(R) CPU X3450 @ 2.67GHz), it will take around 6 minutes. |
|
|
|
|
|
|
|
|
``` |
|
|
``` |
|
|
cat /data/wiki-articles | tantivy index -i wikipedia-index |
|
|
|
|
|
|
|
|
cat wiki-articles.json | tantivy index -i ./wikipedia-index |
|
|
``` |
|
|
``` |
|
|
|
|
|
|
|
|
While it is indexing, you can peek at the index directory |
|
|
While it is indexing, you can peek at the index directory |
|
|
to check what is happening. |
|
|
to check what is happening. |
|
|
|
|
|
|
|
|
```bash |
|
|
```bash |
|
|
ls wikipedia-index |
|
|
|
|
|
|
|
|
ls ./wikipedia-index |
|
|
``` |
|
|
``` |
|
|
|
|
|
|
|
|
If you indexed the 5 millions articles, you should see a lot of files, all with the following format |
|
|
|
|
|
|
|
|
If you indexed the 5 million articles, you should see a lot of new files, all with the following format |
|
|
The main file is `meta.json`. |
|
|
The main file is `meta.json`. |
|
|
|
|
|
|
|
|
Our index is in fact divided in segments. Each segment acts as an individual smaller index. |
|
|
Our index is in fact divided in segments. Each segment acts as an individual smaller index. |
|
|
It is named by a uuid. |
|
|
|
|
|
Each different files is storing a different datastructure for the index. |
|
|
|
|
|
|
|
|
Its name is simply a UUID. |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
# Serve the search index |
|
|
# Serve the search index |
|
|
|
|
|
|
|
|
|
|
|
Tantivy's cli also embeds a search server. |
|
|
|
|
|
You can run it with the following command. |
|
|
|
|
|
|
|
|
``` |
|
|
``` |
|
|
tantivy serve -i wikipedia-index |
|
|
tantivy serve -i wikipedia-index |
|
|
``` |
|
|
``` |
|
|
|
|
|
|
|
|
You can start a small server with a JSON API to search into wikipedia. |
|
|
|
|
|
By default, the server is serving on the port `3000`. |
|
|
By default, the server is serving on the port `3000`. |
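
As a rough illustration of how a client might address the server, the Python sketch below builds a search URL for it. Only the port `3000` comes from this tutorial; the `/api/` path and the `q` and `nhits` query parameters are assumptions to verify against the tantivy-cli documentation, and `build_search_url` is a hypothetical helper.

```python
from urllib.parse import urlencode


def build_search_url(query: str, nhits: int = 10,
                     host: str = "localhost", port: int = 3000) -> str:
    """Build a search URL for a locally running `tantivy serve` instance.

    Assumption: the server exposes its JSON API under /api/ and accepts
    `q` (the query string) and `nhits` (the number of results).
    """
    params = urlencode({"q": query, "nhits": nhits})
    return f"http://{host}:{port}/api/?{params}"


print(build_search_url("barack obama"))
```

If those assumptions hold, the resulting URL can be opened in a browser or fetched with `urllib.request.urlopen` while the server is running.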
|
|
|
|
|
|
|
|
|
|
|
|