- Tantivy-cli is a command-line interface for the [tantivy search engine](https://github.com/fulmicoton/tantivy).
-
-
-
- # Tutorial: Indexing Wikipedia with Tantivy CLI
-
- ## Introduction
-
- In this tutorial, we will create a brand new index containing the articles of the English Wikipedia.
-
- ## Install
-
- There are two ways to get `tantivy`.
- If you are a Rust programmer, you can run `cargo install tantivy-cli`.
- Alternatively, if you are on 64-bit Linux, you can download a
- static binary: [binaries/linux_x86_64/](http://fulmicoton.com/tantivy-files/binaries/linux_x86_64/tantivy)
-
- ## Creating the index
-
- Create a directory in which your index will be stored.
-
- ```bash
- # create the directory
- mkdir wikipedia-index
- ```
-
-
- We will now initialize the index and create its schema.
-
- Our documents will contain:
- * a title
- * a body
- * a url
-
- Running `tantivy new` will start a wizard that will help you go through
- the definition of the schema of our new index.
-
- ```bash
- tantivy new -i wikipedia-index
- ```
-
- When prompted, answer the questions as follows:
-
- ```none
-
- Creating new index
- Let's define it's schema!
-
-
-
- New field name ? title
- Text or unsigned 32-bit Integer (T/I) ? T
- Should the field be stored (Y/N) ? Y
- Should the field be indexed (Y/N) ? Y
- Should the field be tokenized (Y/N) ? Y
- Should the term frequencies (per doc) be in the index (Y/N) ? Y
- Should the term positions (per doc) be in the index (Y/N) ? Y
- Add another field (Y/N) ? Y
-
-
-
- New field name ? body
- Text or unsigned 32-bit Integer (T/I) ? T
- Should the field be stored (Y/N) ? Y
- Should the field be indexed (Y/N) ? Y
- Should the field be tokenized (Y/N) ? Y
- Should the term frequencies (per doc) be in the index (Y/N) ? Y
- Should the term positions (per doc) be in the index (Y/N) ? Y
- Add another field (Y/N) ? Y
-
-
-
- New field name ? url
- Text or unsigned 32-bit Integer (T/I) ? T
- Should the field be stored (Y/N) ? Y
- Should the field be indexed (Y/N) ? N
- Add another field (Y/N) ? N
-
- [
- {
- "name": "title",
- "type": "text",
- "options": {
- "indexing": "position",
- "stored": true
- }
- },
- {
- "name": "body",
- "type": "text",
- "options": {
- "indexing": "position",
- "stored": true
- }
- },
- {
- "name": "url",
- "type": "text",
- "options": {
- "indexing": "unindexed",
- "stored": true
- }
- }
- ]
-
-
- ```
-
- If you want to know more about the meaning of these options, you can check out the [schema doc page](http://fulmicoton.com/tantivy/tantivy/schema/index.html).
-
- The JSON displayed at the end has been written to `wikipedia-index/meta.json`.
-
-
- ## Get the documents to index
-
- Tantivy's `index` command offers a way to index a JSON file.
- More precisely, the file must contain one document per line, each formatted as a JSON object.
- The structure of each JSON object must match our schema definition.
-
- ```json
- {"body": "some text", "title": "some title", "url": "http://somedomain.com"}
- ```
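
To make the one-document-per-line layout concrete, here is a minimal sketch that writes a tiny corpus by hand. The file name `sample-docs.json` and the two documents are made up for illustration; only the three field names come from the schema defined above.

```bash
# Write a tiny two-document corpus, one JSON object per line.
# ASSUMPTION: sample-docs.json and its contents are illustrative only.
cat > sample-docs.json <<'EOF'
{"body": "Tantivy is a full-text search engine library.", "title": "Tantivy", "url": "http://example.com/tantivy"}
{"body": "Rust is a systems programming language.", "title": "Rust", "url": "http://example.com/rust"}
EOF

# Each line is one document, so the line count equals the document count.
wc -l < sample-docs.json
```

A file like this should be small enough to inspect by eye before moving on to the full dump.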
-
- You can download a corpus of more than 5 million Wikipedia articles
- in the right format here: [wiki-articles.json (2.34 GB)](https://www.dropbox.com/s/wwnfnu441w1ec9p/wiki-articles.json.bz2?dl=0).
- If you are in a rush, you can [download 100 articles in the right format here](http://fulmicoton.com/tantivy-files/wiki-articles-1000.json).
-
- Make sure to decompress the file:
-
- ```bash
- bunzip2 wiki-articles.json.bz2
- ```
-
- ## Index the documents
-
- The `index` command will index your documents.
- By default, it will use as many threads as there are cores on your machine.
-
- On my computer (8-core Xeon(R) CPU X3450 @ 2.67GHz), it only takes 7 minutes.
-
- ```bash
- # stream the decompressed corpus into the indexer
- cat wiki-articles.json | tantivy index -i wikipedia-index
- ```
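
Since the corpus holds one document per line, a quick trial run on a slice of it is just a matter of keeping the first few lines. The helper below is a small sketch of that idea; `take_sample` is a hypothetical name, not part of `tantivy`.

```bash
# take_sample FILE N: print the first N lines (that is, N documents) of FILE.
# ASSUMPTION: take_sample is a helper invented for this example.
take_sample() {
    head -n "$2" "$1"
}

# Possible usage (file names are illustrative):
#   take_sample wiki-articles.json 1000 > wiki-sample.json
```

The resulting file can then be piped into `tantivy index` in place of the full corpus.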
-
- While it is indexing, you can peek at the index directory
- to check what is happening.
-
- ```bash
- ls wikipedia-index
- ```
-
- If you indexed the 5 million articles, you should see a lot of files.
- The main file is `meta.json`.
-
- Our index is in fact divided into segments. Each segment acts as an individual, smaller index
- and is named by a UUID.
- Each of a segment's files stores a different data structure for the index.
-
-
- ## Serve the search index
-
- ```
- tantivy serve -i wikipedia-index
- ```
-
- This starts a small server with a JSON API for searching Wikipedia.
- By default, the server listens on port `3000`.
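
Once the server is up, queries can be sent over HTTP. The sketch below only builds a query URL. The `/api/` path and the `q` and `nhits` parameters are assumptions not stated in this section, so check them against your version of `tantivy serve`. `search_url` is a hypothetical helper.

```bash
# search_url QUERY [NHITS]: print a search URL for a locally running server.
# ASSUMPTION: the /api/ path and the q / nhits parameters are not confirmed
# by this section; only port 3000 is.
search_url() {
    local query nhits
    # Replace spaces with '+' (other special characters are not escaped).
    query=$(printf '%s' "$1" | tr ' ' '+')
    nhits=${2:-10}
    printf 'http://localhost:3000/api/?q=%s&nhits=%s\n' "$query" "$nhits"
}

search_url "barack obama" 20
```

With the server running, something like `curl "$(search_url "barack obama" 20)"` should return the matching documents as JSON, provided the endpoint matches the assumption above.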
-
-