Tantivy-cli is command line interface for [tantivy search engine](https://github.com/fulmicoton/tantivy). # Tutorial: Indexing Wikipedia with Tantivy CLI ## Introduction In this tutorial, we will create a brand new index with the articles of English wikipedia in it. ## Install There are two ways to get `tantivy`. If you are a rust programmer, you can run `cargo install tantivy-cli`. Alternatively, if you are on `Linux 64bits`, you can download a static binary: [binaries/linux_x86_64/](http://fulmicoton.com/tantivy-files/binaries/linux_x86_64/tantivy) ## Creating the index Create a directory in which your index will be stored. ```bash # create the directory mkdir wikipedia-index ``` We will now initialize the index and create it's schema. Our documents will contain * a title * a body * a url Running `tantivy new` will start a wizard that will help you go through the definition of the schema of our new index. ```bash tantivy new -i wikipedia-index ``` When asked answer to the question as follows: ```none Creating new index Let's define it's schema! New field name ? title Text or unsigned 32-bit Integer (T/I) ? T Should the field be stored (Y/N) ? Y Should the field be indexed (Y/N) ? Y Should the field be tokenized (Y/N) ? Y Should the term frequencies (per doc) be in the index (Y/N) ? Y Should the term positions (per doc) be in the index (Y/N) ? Y Add another field (Y/N) ? Y New field name ? body Text or unsigned 32-bit Integer (T/I) ? T Should the field be stored (Y/N) ? Y Should the field be indexed (Y/N) ? Y Should the field be tokenized (Y/N) ? Y Should the term frequencies (per doc) be in the index (Y/N) ? Y Should the term positions (per doc) be in the index (Y/N) ? Y Add another field (Y/N) ? Y New field name ? url Text or unsigned 32-bit Integer (T/I) ? T Should the field be stored (Y/N) ? Y Should the field be indexed (Y/N) ? N Add another field (Y/N) ? N [ { "name": "title", "type": "text", "options": { "indexing": "position", "stored": true } }, { "name": "body", "type": "text", "options": { "indexing": "position", "stored": true } }, { "name": "url", "type": "text", "options": { "indexing": "unindexed", "stored": true } } ] ``` If you want to know more about the meaning of these options, you can check out the [schema doc page](http://fulmicoton.com/tantivy/tantivy/schema/index.html). The json displayed at the end has been written in `wikipedia-index/meta.json`. # Get the documents to index Tantivy's index command offers a way to index a json file. More accurately, the file must contain one document per line, in a json format. The structure of this JSON object must match that of our schema definition. ```json {"body": "some text", "title": "some title", "url": "http://somedomain.com"} ``` You can download a corpus of more than 5 millions articles from wikipedia formatted in the right format here : [wiki-articles.json (2.34 GB)](https://www.dropbox.com/s/wwnfnu441w1ec9p/wiki-articles.json.bz2?dl=0). If you are in a rush you can [download 100 articles in the right format here](http://fulmicoton.com/tantivy-files/wiki-articles-1000.json). Make sure to uncompress the file ```bash bunzip2 wiki-articles.json.bz2 ``` # Index the documents. The `index` command will index your document. By default it will use as many threads as there are core on your machine. On my computer (8 core Xeon(R) CPU X3450 @ 2.67GHz), it only takes 7 minutes. ``` cat /data/wiki-articles | tantivy index -i wikipedia-index ``` While it is indexing, you can peek at the index directory to check what is happening. ```bash ls wikipedia-index ``` If you indexed the 5 millions articles, you should see a lot of files, all with the following format The main file is `meta.json`. Our index is in fact divided in segments. Each segment acts as an individual smaller index. It is named by a uuid. Each different files is storing a different datastructure for the index. # Serve the search index ``` tantivy serve -i wikipedia-index ``` You can start a small server with a JSON API to search into wikipedia. By default, the server is serving on the port `3000`.