some grammar/punctuation fixes to README

develop
Michael J. Curry 8 years ago
parent
commit
f1ffcb7ef1
No known key found for this signature in database GPG Key ID: 58EEF5BB97F3E791
1 changed files with 26 additions and 32 deletions
README.md (+26, -32)

@@ -12,14 +12,14 @@ In this tutorial, we will create a brand new index with the articles of English

 ## Installing the tantivy CLI.

-There are simple way to add the `tantivy` CLI to your computer.
+There are a couple ways to add the `tantivy` CLI to your computer.

 If you are a rust programmer, you probably have `cargo` installed and you can just
 run `cargo install tantivy-cli`.

-Alternatively, if you are on `Linux 64bits`, you can directly download a
+Alternatively, if you are on 64-bit Linux, you can directly download a
 static binary: [binaries/linux_x86_64/](http://fulmicoton.com/tantivy-files/binaries/linux_x86_64/tantivy),
-and save it in a directory of your system's `PATH`.
+and save it in a directory on your system's `PATH`.


@@ -36,7 +36,7 @@ Let's create a directory in which your index will be stored.

 We will now initialize the index and create its schema.
 The [schema](http://fulmicoton.com/tantivy/tantivy/schema/index.html) defines
-the list of your fields, and for each field :
+the list of your fields, and for each field:
 - its name
 - its type, currently `u32` or `str`
 - how it should be indexed.
@@ -49,17 +49,17 @@ In our case, our documents will contain
 * a body
 * a url

-We want the title and the body to be tokenized and index. We want
-to also add the term frequency and term positions to our index.
+We want the title and the body to be tokenized and indexed. We also want
+to add the term frequency and term positions to our index.
 (To be honest, phrase queries are not yet implemented in tantivy,
 so the positions won't be really useful in this tutorial.)

-Running `tantivy new` will start a wizard that will help you go through
-the definition of the schema of our new index.
+Running `tantivy new` will start a wizard that will help you
+define the schema of the new index.

 Like all the other commands of `tantivy`, you will have to
 pass it your index directory via the `-i` or `--index`
-parameter as follows.
+parameter as follows:


 ```bash
@@ -68,7 +68,7 @@ parameter as follows.



-When asked answer to the question, answer as follows:
+Answer the questions as follows:

 ```none

@@ -135,30 +135,29 @@ When asked answer to the question, answer as follows:

 ```

-After the wizard has finished, a `meta.json` has been written in `wikipedia-index/meta.json`.
+After the wizard has finished, a `meta.json` should exist in `wikipedia-index/meta.json`.
 It is a fairly human readable JSON, so you may check its content.

-It contains two sections :
+It contains two sections:
 - segments (currently empty, but we will change that soon)
 - schema
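
The two sections named in this hunk can be checked with a short script. This is only a minimal sketch: it assumes `meta.json` is a JSON object whose `segments` entry is a list, and the helper name `summarize_meta` is hypothetical (it is not part of tantivy).

```python
import json
from pathlib import Path


def summarize_meta(index_dir):
    """Report the top-level sections of meta.json and how many segments it lists."""
    meta_path = Path(index_dir) / "meta.json"
    meta = json.loads(meta_path.read_text(encoding="utf-8"))
    # Assumption: "segments" holds a list; a missing key is treated as empty.
    segments = meta.get("segments", [])
    return {"sections": sorted(meta), "segment_count": len(segments)}
```

Calling `summarize_meta("wikipedia-index")` right after the wizard should report a segment count of zero if the README's description holds; the count is expected to grow once documents are indexed.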






-# Indexing the document : `index`
+# Indexing the document: `index`


 Tantivy's `index` command offers a way to index a json file.
-More accurately, the file must contain one document per line, in a json format.
+The file must contain one JSON object per line.
 The structure of this JSON object must match that of our schema definition.

 ```json
 {"body": "some text", "title": "some title", "url": "http://somedomain.com"}
 ```
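
The one-object-per-line requirement can be checked before indexing. The sketch below is an illustration, not part of the tantivy CLI: the field names come from the example document above, while the helper names `check_line` and `check_file`, the `limit` parameter, and the choice to treat all three fields as required are assumptions.

```python
import json

# Field names taken from the example document; requiring exactly these
# three fields is an assumption made for this sketch.
EXPECTED_FIELDS = {"body", "title", "url"}


def check_line(line):
    """Return None if the line is a JSON object with the expected fields, else an error message."""
    try:
        doc = json.loads(line)
    except json.JSONDecodeError as err:
        return f"invalid JSON: {err.msg}"
    if not isinstance(doc, dict):
        return "not a JSON object"
    if set(doc) != EXPECTED_FIELDS:
        return f"unexpected fields: {sorted(doc)}"
    return None


def check_file(path, limit=1000):
    """Check the first `limit` lines of a file and return (line_number, error) pairs."""
    problems = []
    with open(path, encoding="utf-8") as handle:
        for number, line in enumerate(handle, start=1):
            if number > limit:
                break
            error = check_line(line)
            if error is not None:
                problems.append((number, error))
    return problems
```

Checking only the first lines keeps the run short on a multi-gigabyte corpus; raise `limit` to scan more of the file.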


-For this tutorial, you can download a corpus with the 5 millions+ English articles of wikipedia
-formatted in the right format here : [wiki-articles.json (2.34 GB)](https://www.dropbox.com/s/wwnfnu441w1ec9p/wiki-articles.json.bz2?dl=0).
-Make sure to uncompress the file
+For this tutorial, you can download a corpus with the 5 million+ English Wikipedia articles in the right format here: [wiki-articles.json (2.34 GB)](https://www.dropbox.com/s/wwnfnu441w1ec9p/wiki-articles.json.bz2?dl=0).
+Make sure to decompress the file

 ```bash
 bunzip2 wiki-articles.json.bz2
@@ -183,7 +182,7 @@ to check what is happening.
 ls ./wikipedia-index
 ```

-If you indexed the 5 millions articles, you should see a lot of new files, all with the following format
+If you indexed the 5 million articles, you should see a lot of new files, all with the following format
 The main file is `meta.json`.

 Our index is in fact divided in segments. Each segment acts as an individual smaller index.
@@ -192,7 +191,7 @@ Its named is simply a uuid.



-# Serve the search index : `serve`
+# Serve the search index: `serve`

 Tantivy's cli also embeds a search server.
 You can run it with the following command.
@@ -201,7 +200,7 @@ You can run it with the following command.
 tantivy serve -i wikipedia-index
 ```

-By default, the server is serving on the port `3000`.
+By default, it will serve on port `3000`.

 You can search for the top 20 most relevant documents for the query `Barack Obama` by accessing
 the following [url](http://localhost:3000/api/?q=barack+obama&explain=true&nhits=20) in your browser
@@ -209,13 +208,13 @@ the following [url](http://localhost:3000/api/?q=barack+obama&explain=true&nhits
 http://localhost:3000/api/?q=barack+obama&explain=true&nhits=20
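
The query URL can also be assembled programmatically. This is a minimal sketch using only the Python standard library: the parameter names `q`, `explain`, and `nhits` and the default port come from the README text, while the helper name `search_url` and sending `explain=false` when the flag is off are assumptions.

```python
from urllib.parse import urlencode


def search_url(query, nhits=20, explain=True, host="localhost", port=3000):
    """Build the search URL for a running `tantivy serve` instance."""
    params = {
        "q": query,
        # Assumption: the server accepts the literal strings "true"/"false".
        "explain": "true" if explain else "false",
        "nhits": nhits,
    }
    # urlencode escapes spaces as "+", matching the URL shown in the README.
    return f"http://{host}:{port}/api/?{urlencode(params)}"
```

For example, `search_url("barack obama")` reproduces the URL above, which can then be opened in a browser or fetched with any HTTP client.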




-# Optimizing the index : `merge`
+# Optimizing the index: `merge`

-Each tantivy's indexer thread is closing a new segment every 100K documents (this is completely arbitrary at the moment).
-You should have more than 50 segments in your dictionary at the moment.
+Each of tantivy's indexer threads closes a new segment every 100K documents (this is completely arbitrary at the moment).
+You should have more than 50 segments in your dictionary.

-Having that many queries is hurting your query performance (well, mostly the fast ones).
-Tantivy merge will merge your segment into one.
+Having that many segments hurts your query performance (well, mostly the fast ones).
+Tantivy merge will merge your segments into one.

 ```
 tantivy merge -i ./wikipedia-index
@@ -224,10 +223,5 @@ Tantivy merge will merge your segment into one.
 (The command takes around 7 minutes on my computer)

 Note that your files are still there even after having run the command.
-`meta.json` however only lists one of the segments.
-You will still need to remove the files manually.
-
-
-
-
+However, `meta.json` only lists one of the segments.
+You will still need to remove the files manually.
