[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)
![beacon for google analytics](https://ga-beacon.appspot.com/UA-88834340-1/tantivy-cli/README)

Tantivy-cli is the project hosting the command line interface for [tantivy](https://github.com/fulmicoton/tantivy), a search engine project.
# Tutorial: Indexing Wikipedia with Tantivy CLI

## Introduction

In this tutorial, we will create a brand new index containing the articles of English Wikipedia.

## Installing the tantivy CLI

There are two simple ways to add the `tantivy` CLI to your computer.

If you are a Rust programmer, you probably have `cargo` installed and you can just
run `cargo install tantivy-cli`.

Alternatively, if you are on 64-bit Linux, you can directly download a
static binary: [binaries/linux_x86_64/](http://fulmicoton.com/tantivy-files/binaries/linux_x86_64/tantivy),
and save it in a directory on your system's `PATH`.
## Creating the index: `new`

Let's create a directory in which your index will be stored.

```bash
# create the directory
mkdir wikipedia-index
```
We will now initialize the index and create its schema.
The [schema](http://fulmicoton.com/tantivy/tantivy/schema/index.html) defines
the list of your fields, and for each field:
- its name
- its type, currently `u32` or `str`
- how it should be indexed.

You can find more information about the latter on
[tantivy's schema documentation page](http://fulmicoton.com/tantivy/tantivy/schema/index.html).

In our case, our documents will contain
* a title
* a body
* a url

We want the title and the body to be tokenized and indexed. We also want
to add the term frequencies and term positions to our index.
(To be honest, phrase queries are not yet implemented in tantivy,
so the positions won't really be useful in this tutorial.)

Running `tantivy new` will start a wizard that will help you go through
the definition of the schema of our new index.

Like all the other commands of `tantivy`, you will have to
pass it your index directory via the `-i` or `--index`
parameter as follows.

```bash
tantivy new -i wikipedia-index
```
When prompted, answer the questions as follows:

```none
Creating new index
Let's define it's schema!

New field name ? title
Text or unsigned 32-bit Integer (T/I) ? T
Should the field be stored (Y/N) ? Y
Should the field be indexed (Y/N) ? Y
Should the field be tokenized (Y/N) ? Y
Should the term frequencies (per doc) be in the index (Y/N) ? Y
Should the term positions (per doc) be in the index (Y/N) ? Y
Add another field (Y/N) ? Y

New field name ? body
Text or unsigned 32-bit Integer (T/I) ? T
Should the field be stored (Y/N) ? Y
Should the field be indexed (Y/N) ? Y
Should the field be tokenized (Y/N) ? Y
Should the term frequencies (per doc) be in the index (Y/N) ? Y
Should the term positions (per doc) be in the index (Y/N) ? Y
Add another field (Y/N) ? Y

New field name ? url
Text or unsigned 32-bit Integer (T/I) ? T
Should the field be stored (Y/N) ? Y
Should the field be indexed (Y/N) ? N
Add another field (Y/N) ? N

[
  {
    "name": "title",
    "type": "text",
    "options": {
      "indexing": "position",
      "stored": true
    }
  },
  {
    "name": "body",
    "type": "text",
    "options": {
      "indexing": "position",
      "stored": true
    }
  },
  {
    "name": "url",
    "type": "text",
    "options": {
      "indexing": "unindexed",
      "stored": true
    }
  }
]
```
After the wizard has finished, a `meta.json` file has been written to `wikipedia-index/meta.json`.
It is fairly human-readable JSON, so you may check its content.

It contains two sections:
- segments (currently empty, but we will change that soon)
- schema
# Indexing the documents: `index`

Tantivy's `index` command offers a way to index a JSON file.
More accurately, the file must contain one document per line, in JSON format.
The structure of each JSON object must match that of our schema definition.

```json
{"body": "some text", "title": "some title", "url": "http://somedomain.com"}
```
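As a minimal sketch of this one-document-per-line layout, the helper below writes a two-document corpus. The function name `write_sample_corpus`, the file name in the example, and the document contents are made up for illustration; only the three field names come from the schema defined earlier.

```bash
# Hypothetical helper: write a tiny corpus to the path given as $1.
# Each line is one complete JSON document with the three schema fields.
write_sample_corpus() {
    cat > "$1" <<'EOF'
{"title": "First article", "body": "some text", "url": "http://somedomain.com/1"}
{"title": "Second article", "body": "some other text", "url": "http://somedomain.com/2"}
EOF
}

# Example: write_sample_corpus sample-articles.json
```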
For this tutorial, you can download a corpus with the 5 million+ English articles of Wikipedia,
already in the right format, here: [wiki-articles.json (2.34 GB)](https://www.dropbox.com/s/wwnfnu441w1ec9p/wiki-articles.json.bz2?dl=0).
Make sure to uncompress the file:

```bash
bunzip2 wiki-articles.json.bz2
```

If you are in a rush, you can [download 1000 articles in the right format here (11 MB)](http://fulmicoton.com/tantivy-files/wiki-articles-1000.json).

The `index` command will index your documents.
By default it will use as many threads as there are cores on your machine.
You can change the number of threads by passing it the `-t` parameter.

On my computer (8 core Xeon(R) CPU X3450 @ 2.67GHz), it takes around 6 minutes.

```bash
cat wiki-articles.json | tantivy index -i ./wikipedia-index
```
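If you want to cap the number of indexing threads, a small wrapper along these lines may help. It is only a sketch: `index_corpus` is a hypothetical helper name, and it assumes that `-t` takes the thread count as its next argument and that the corpus can be fed on standard input, as in the pipeline above.

```bash
# Hypothetical helper, not part of tantivy.
# ASSUMPTION: -t expects the number of threads as the following argument.
index_corpus() {
    corpus=$1      # path to the one-document-per-line JSON file
    index_dir=$2   # index directory initialized with `tantivy new`
    threads=$3     # number of indexing threads to request
    tantivy index -i "$index_dir" -t "$threads" < "$corpus"
}

# Example: index_corpus wiki-articles.json ./wikipedia-index 4
```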
While it is indexing, you can peek at the index directory
to check what is happening.

```bash
ls ./wikipedia-index
```

If you indexed the 5 million articles, you should see a lot of new files, all with the following format

The main file is `meta.json`.

Our index is in fact divided into segments. Each segment acts as an individual smaller index.
Its name is simply a UUID.
# Serve the search index: `serve`

Tantivy's CLI also embeds a search server.
You can run it with the following command.

```bash
tantivy serve -i wikipedia-index
```

By default, the server listens on port `3000`.

You can search for the top 20 most relevant documents for the query `Barack Obama` by accessing
the following [url](http://localhost:3000/api/?q=barack+obama&explain=true&nhits=20) in your browser:

http://localhost:3000/api/?q=barack+obama&explain=true&nhits=20
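
The same URL can be assembled from the command line. The sketch below only rebuilds the URL shown above; `search_url` is a hypothetical helper name, and it encodes spaces as `+` but leaves any other special characters untouched.

```bash
# Hypothetical helper: build the search URL for a query ($1) and a hit count ($2).
# Only spaces are encoded (as "+"); other special characters are passed through.
search_url() {
    query=$(printf '%s' "$1" | tr ' ' '+')
    printf 'http://localhost:3000/api/?q=%s&explain=true&nhits=%s\n' "$query" "$2"
}

# Print the URL for the example query.
search_url "barack obama" 20

# With `tantivy serve` running, the query could be sent with curl:
# curl "$(search_url "barack obama" 20)"
```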
# Optimizing the index: `merge`

Each of tantivy's indexer threads closes a new segment every 100K documents (this is completely arbitrary at the moment).
You should have more than 50 segments in your directory at the moment.

Having that many segments hurts your query performance (well, mostly for the fast queries).
`tantivy merge` will merge your segments into one.

```bash
tantivy merge -i ./wikipedia-index
```

(The command takes around 7 minutes on my computer.)

Note that your files are still there even after having run the command.
`meta.json`, however, only lists one of the segments.
You will still need to remove the files manually.
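
Before deleting anything by hand, a dry run that lists the candidate files may be useful. The sketch below rests on two assumptions that this tutorial does not confirm: that each segment file is named `<segment uuid>.<extension>`, and that `meta.json` contains the UUID of every live segment as text. `list_stale_files` is a hypothetical helper name. It only prints paths and never deletes, so check its output against your own index first.

```bash
# Dry run: print the files in an index directory that meta.json does not mention.
# ASSUMPTION: segment files are named "<segment-uuid>.<extension>" and
# meta.json contains each live segment's uuid. Nothing is deleted here.
list_stale_files() {
    index_dir=$1
    # Drop dashes so that dashed and undashed uuids compare equal.
    live=$(tr -d '-' < "$index_dir/meta.json")
    for path in "$index_dir"/*; do
        [ -e "$path" ] || continue
        name=$(basename "$path")
        [ "$name" = "meta.json" ] && continue
        # Assumed segment id: the file name up to the first dot.
        id=$(printf '%s' "${name%%.*}" | tr -d '-')
        case $live in
            *"$id"*) ;;                     # still referenced, keep it
            *) printf '%s\n' "$path" ;;     # not referenced, candidate for removal
        esac
    done
}

# Example: list_stale_files ./wikipedia-index
```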