From e09b91cfd8975a9cc1cdbf460082b89b6ecf9ddb Mon Sep 17 00:00:00 2001
From: Paul Masurel
Date: Mon, 15 Aug 2016 00:52:20 +0900
Subject: [PATCH] update readme

---
 README.md | 91 +++++++++++++++++++++++++++++++++++++------------------
 1 file changed, 61 insertions(+), 30 deletions(-)

diff --git a/README.md b/README.md
index 41bb03e..4d3a1f2 100644
--- a/README.md
+++ b/README.md
@@ -13,13 +13,16 @@ In this tutorial, we will create a brand new index with the articles of English
 ## Install
 There are two ways to get `tantivy`.
-If you are a rust programmer, you can run `cargo install tantivy-cli`.
-Alternatively, if you are on `Linux 64bits`, you can download a
+If you are a rust programmer, you probably have `cargo` installed and you can just
+run `cargo install tantivy-cli`.
+
+Alternatively, if you are on `Linux 64bits`, you can directly try and download a
 static binary: [binaries/linux_x86_64/](http://fulmicoton.com/tantivy-files/binaries/linux_x86_64/tantivy)
-## Creating the index
-Create a directory in which your index will be stored.
+## Creating the index: `new`
+
+Let's create a directory in which your index will be stored.
 ```bash
 # create the directory
@@ -27,21 +30,41 @@ Create a directory in which your index will be stored.
 ```
-We will now initialize the index and create it's schema.
+We will now initialize the index and create its schema.
+The [schema](http://fulmicoton.com/tantivy/tantivy/schema/index.html) defines
+the list of your fields, and for each field :
+- its name
+- its type, currently `u32` or `str`
+- how it should be indexed.
+
+You can find more information about the latter on
+[tantivy's schema documentation page](http://fulmicoton.com/tantivy/tantivy/schema/index.html
-Our documents will contain
+In our case, our documents will contain
 * a title
 * a body
 * a url
+We want the title and the body to be tokenized and index. We want
+to also add the term frequency and term positions to our index.
+(To be honest, phrase queries are not yet implemented in tantivy,
+so the positions won't be really useful in this tutorial.)
+
 Running `tantivy new` will start a wizard that will help you go through the definition of the schema of our new index.
+Like all the other commands of `tantivy`, you will have to
+pass it your index directory via the `-i` or `--index`
+parameter as follows.
+
+
 ```bash
 tantivy new -i wikipedia-index
 ```
-When asked answer to the question as follows:
+
+
+When asked answer to the question, answer as follows:
 ```none
@@ -83,24 +106,24 @@ When asked answer to the question as follows:
 "name": "title",
 "type": "text",
 "options": {
- "indexing": "position",
- "stored": true
+ "indexing": "position",
+ "stored": true
 }
 },
 {
 "name": "body",
 "type": "text",
 "options": {
- "indexing": "position",
- "stored": true
+ "indexing": "position",
+ "stored": true
 }
 },
 {
 "name": "url",
 "type": "text",
 "options": {
- "indexing": "unindexed",
- "stored": true
+ "indexing": "unindexed",
+ "stored": true
 }
 }
 ]
@@ -108,14 +131,20 @@ When asked answer to the question as follows:
 ```
-If you want to know more about the meaning of these options, you can check out the [schema doc page](http://fulmicoton.com/tantivy/tantivy/schema/index.html).
+After the wizard has finished, a `meta.json` has been written in `wikipedia-index/meta.json`.
+It is a fairly human readable JSON, so you may check its content.
+
+It contains two sections :
+- segments (currently empty, but we will change that soon)
+- schema
-The json displayed at the end has been written in `wikipedia-index/meta.json`.
+
-# Get the documents to index
+# Indexing the document : `index`
-Tantivy's index command offers a way to index a json file.
+
+Tantivy's `index` command offers a way to index a json file.
 More accurately, the file must contain one document per line, in a json format.
 The structure of this JSON object must match that of our schema definition.
@@ -123,49 +152,51 @@ The structure of this JSON object must match that of our schema definition.
 {"body": "some text", "title": "some title", "url": "http://somedomain.com"}
 ```
-You can download a corpus of more than 5 millions articles from wikipedia
+For this tutorial, you can download a corpus with the 5 millions+ English articles of wikipedia
 formatted in the right format here : [wiki-articles.json (2.34 GB)](https://www.dropbox.com/s/wwnfnu441w1ec9p/wiki-articles.json.bz2?dl=0).
-If you are in a rush you can [download 100 articles in the right format here](http://fulmicoton.com/tantivy-files/wiki-articles-1000.json).
-
 Make sure to uncompress the file
 ```bash
 bunzip2 wiki-articles.json.bz2
-```
+```
-# Index the documents.
+If you are in a rush you can [download 100 articles in the right format here](http://fulmicoton.com/tantivy-files/wiki-articles-1000.json).
 The `index` command will index your document.
-By default it will use as many threads as there are core on your machine.
+By default it will use as many threads as there are cores on your machine.
+You can change the number of threads by passing it the `-t` parameter.
-On my computer (8 core Xeon(R) CPU X3450 @ 2.67GHz), it only takes 7 minutes.
+On my computer (8 core Xeon(R) CPU X3450 @ 2.67GHz), it will take around 6 minutes.
 ```
- cat /data/wiki-articles | tantivy index -i wikipedia-index
+ cat wiki-articles.json | tantivy index -i ./wikipedia-index
 ```
 While it is indexing, you can peek at the index directory to check what is happening.
 ```bash
- ls wikipedia-index
+ ls ./wikipedia-index
 ```
-If you indexed the 5 millions articles, you should see a lot of files, all with the following format
+If you indexed the 5 millions articles, you should see a lot of new files, all with the following format
 The main file is `meta.json`. Our index is in fact divided in segments. Each segment acts as an individual smaller index.
-It is named by a uuid.
-Each different files is storing a different datastructure for the index.
+Its named is simply a uuid.
+
+
 # Serve the search index
+Tantivy's cli also embeds a search server.
+You can run it with the following command.
+
 ```
 tantivy serve -i wikipedia-index
 ```
-You can start a small server with a JSON API to search into wikipedia.
 By default, the server is serving on the port `3000`.