  1. [![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)
  2. ![beacon for google analytics](https://ga-beacon.appspot.com/UA-88834340-1/tantivy-cli/README)
  3. `tantivy-cli` is the project hosting the command line interface for [tantivy](https://github.com/tantivy-search/tantivy), a search engine project.
  4. # Tutorial: Indexing Wikipedia with Tantivy CLI
  5. ## Introduction
  6. In this tutorial, we will create a brand new index with the articles of English Wikipedia in it.
  7. ## Installing the tantivy CLI
  8. There are a couple of ways to install `tantivy-cli`.
  9. If you are a Rust programmer, you probably have `cargo` and `rustup` installed and you can just
  10. run `rustup run nightly cargo install tantivy-cli`. (`cargo install tantivy-cli` will work
  11. as well if nightly is your default toolchain).
  12. Alternatively, if you are on 64-bit Linux or Mac OS X, you can directly download a
  13. static binary for [Linux x86 64](https://github.com/tantivy-search/tantivy-cli/releases/download/0.4.0/tantivy-cli-0.4.0-x86_64-unknown-linux-musl.tar.gz) or for [Mac OS X](https://github.com/tantivy-search/tantivy-cli/releases/download/0.4.0/tantivy-cli-0.4.0-x86_64-apple-darwin.tar.gz)
  14. and save it in a directory on your system's `PATH`.
  15. ## Creating the index: `new`
  16. Let's create a directory in which your index will be stored.
  17. ```bash
  18. # create the directory
  19. mkdir wikipedia-index
  20. ```
  21. We will now initialize the index and create its schema.
  22. The [schema](https://tantivy-search.github.io/tantivy/tantivy/schema/index.html) defines
  23. the list of your fields, and for each field:
  24. - its name
  25. - its type, currently `u64`, `i64` or `str`
  26. - how it should be indexed.
  27. You can find more information about the latter on
  28. [tantivy's schema documentation page](https://tantivy-search.github.io/tantivy/tantivy/schema/index.html).
  29. In our case, our documents will contain
  30. * a title
  31. * a body
  32. * a url
  33. We want the title and the body to be tokenized and indexed. We also want
  34. to add the term frequency and term positions to our index.
  35. (To be honest, phrase queries are not yet implemented in tantivy,
  36. so the positions won't be really useful in this tutorial.)
  37. Running `tantivy new` will start a wizard that will help you
  38. define the schema of the new index.
  39. As with all the other `tantivy` commands, you have to
  40. pass it your index directory via the `-i` or `--index`
  41. parameter, as follows:
  42. ```bash
  43. tantivy new -i wikipedia-index
  44. ```
  45. Answer the questions as follows:
  46. ```none
  47. Creating new index
  48. Let's define its schema!
  49. New field name ? title
  50. Text or unsigned 32-bit integer (T/I) ? T
  51. Should the field be stored (Y/N) ? Y
  52. Should the field be indexed (Y/N) ? Y
  53. Should the field be tokenized (Y/N) ? Y
  54. Should the term frequencies (per doc) be in the index (Y/N) ? Y
  55. Should the term positions (per doc) be in the index (Y/N) ? Y
  56. Add another field (Y/N) ? Y
  57. New field name ? body
  58. Text or unsigned 32-bit integer (T/I) ? T
  59. Should the field be stored (Y/N) ? Y
  60. Should the field be indexed (Y/N) ? Y
  61. Should the field be tokenized (Y/N) ? Y
  62. Should the term frequencies (per doc) be in the index (Y/N) ? Y
  63. Should the term positions (per doc) be in the index (Y/N) ? Y
  64. Add another field (Y/N) ? Y
  65. New field name ? url
  66. Text or unsigned 32-bit integer (T/I) ? T
  67. Should the field be stored (Y/N) ? Y
  68. Should the field be indexed (Y/N) ? N
  69. Add another field (Y/N) ? N
  70. [
  71. {
  72. "name": "title",
  73. "type": "text",
  74. "options": {
  75. "indexing": "position",
  76. "stored": true
  77. }
  78. },
  79. {
  80. "name": "body",
  81. "type": "text",
  82. "options": {
  83. "indexing": "position",
  84. "stored": true
  85. }
  86. },
  87. {
  88. "name": "url",
  89. "type": "text",
  90. "options": {
  91. "indexing": "unindexed",
  92. "stored": true
  93. }
  94. }
  95. ]
  96. ```
  97. After the wizard has finished, a `meta.json` file should exist at `wikipedia-index/meta.json`.
  98. It is fairly human-readable JSON, so you can check its content.
  99. It contains two sections:
  100. - segments (currently empty, but we will change that soon)
  101. - schema
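To double-check what the wizard produced, the schema section can be summarized with a few lines of Python. This is only a sketch: it assumes `meta.json` is a JSON object with `segments` and `schema` keys, and that each schema entry has the shape printed by the wizard above. The `summarize_schema` helper and the inline sample are illustrative, not part of tantivy.

```python
import json

# Assumption: meta.json is a JSON object with "segments" and "schema" keys,
# and each schema entry looks like the wizard output shown in this tutorial.
SAMPLE_META = """
{
  "segments": [],
  "schema": [
    {"name": "title", "type": "text", "options": {"indexing": "position", "stored": true}},
    {"name": "body", "type": "text", "options": {"indexing": "position", "stored": true}},
    {"name": "url", "type": "text", "options": {"indexing": "unindexed", "stored": true}}
  ]
}
"""


def summarize_schema(meta_text):
    """Return (name, type, indexing, stored) for each field in the schema."""
    meta = json.loads(meta_text)
    return [
        (field["name"], field["type"],
         field["options"]["indexing"], field["options"]["stored"])
        for field in meta["schema"]
    ]


for row in summarize_schema(SAMPLE_META):
    print(row)
```

To inspect a real index, you could pass the content of `wikipedia-index/meta.json` instead of the sample string, provided the file follows the layout assumed here.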
  102. ## Indexing the document: `index`
  103. Tantivy's `index` command offers a way to index a JSON file.
  104. The file must contain one JSON object per line.
  105. The structure of this JSON object must match that of our schema definition.
  106. ```json
  107. {"body": "some text", "title": "some title", "url": "http://somedomain.com"}
  108. ```
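If you want to index your own documents rather than the Wikipedia corpus, a file in this one-object-per-line format could be produced along these lines. This is a minimal sketch: the `to_json_lines` helper and the sample articles are made up for illustration, and the field names are the ones from the schema defined earlier.

```python
import json

FIELDS = ("title", "body", "url")  # field names from the schema defined earlier


def to_json_lines(articles):
    """Serialize each article as one single-line JSON object."""
    lines = []
    for article in articles:
        missing = [name for name in FIELDS if name not in article]
        if missing:
            raise ValueError(f"missing fields: {missing}")
        # json.dumps escapes embedded newlines, so each object stays on one line.
        lines.append(json.dumps({name: article[name] for name in FIELDS}))
    return "\n".join(lines) + "\n"


articles = [  # made-up sample data
    {"title": "some title", "body": "some text\nwith a line break",
     "url": "http://somedomain.com"},
    {"title": "another title", "body": "more text",
     "url": "http://somedomain.com/2"},
]
output = to_json_lines(articles)
print(output, end="")
```

Writing `output` to a file gives something that can be piped to the `index` command in the same way as the Wikipedia dump.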
  109. For this tutorial, you can download a corpus with the 5 million+ English Wikipedia articles in the right format here: [wiki-articles.json (2.34 GB)](https://www.dropbox.com/s/wwnfnu441w1ec9p/wiki-articles.json.bz2?dl=0).
  110. Make sure to decompress the file:
  111. ```bash
  112. bunzip2 wiki-articles.json.bz2
  113. ```
  114. If you are in a rush you can [download 1,000 articles in the right format here (11 MB)](http://fulmicoton.com/tantivy-files/wiki-articles-1000.json).
  115. The `index` command will index your documents.
  116. By default, it uses 3 threads and a total buffer size of 1 GB, split
  117. across these threads.
  118. ```
  119. cat wiki-articles.json | tantivy index -i ./wikipedia-index
  120. ```
  121. You can change the number of threads with the `-t` parameter, and the total
  122. buffer size shared by the threads' heaps with the `-m` parameter. Note that tantivy's memory usage
  123. is greater than just this buffer size parameter.
  124. On my computer (8 core Xeon(R) CPU X3450 @ 2.67GHz), on 8 threads, indexing Wikipedia takes around 9 minutes.
  125. While tantivy is indexing, you can peek at the index directory to check what is happening.
  126. ```bash
  127. ls ./wikipedia-index
  128. ```
  129. The main file is `meta.json`.
  130. You should also see a lot of files with a UUID as filename, and different extensions.
  131. Our index is in fact divided into segments. Each segment acts as an individual smaller index.
  132. Its name is simply a UUID.
  133. If you decided to index the complete Wikipedia, you may also see some of these files disappear.
  134. Having too many segments can hurt search performance, so tantivy automatically starts
  135. merging segments.
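To watch segments appear and get merged, you could group the files of the index directory by their UUID. The sketch below assumes that every regular file other than `meta.json` and hidden files is named `<uuid>.<extension>`. The `group_segment_files` helper is hypothetical, and the `.ext1` / `.ext2` extensions in the demo are placeholders, not tantivy's actual extensions.

```python
import tempfile
import uuid
from collections import defaultdict
from pathlib import Path


def group_segment_files(index_dir):
    """Map each segment id (file stem) to the sorted extensions found for it."""
    segments = defaultdict(list)
    for path in Path(index_dir).iterdir():
        # Skip directories, meta.json and hidden files (assumed not to be segment files).
        if not path.is_file() or path.name == "meta.json" or path.name.startswith("."):
            continue
        segments[path.stem].append(path.suffix)
    return {segment_id: sorted(exts) for segment_id, exts in segments.items()}


# Demo on a throwaway directory with made-up file names.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "meta.json").write_text("{}")
    for _ in range(2):
        segment_id = uuid.uuid4().hex
        for ext in (".ext1", ".ext2"):  # placeholder extensions
            (root / (segment_id + ext)).touch()
    grouped = group_segment_files(root)

print(len(grouped), "segments")
for segment_id, extensions in grouped.items():
    print(segment_id, extensions)
```

Calling `group_segment_files("./wikipedia-index")` a few times while indexing runs should show the number of segments going up and down as merges happen, if the naming assumption holds.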
  136. ## Serve the search index: `serve`
  137. Tantivy's CLI also embeds a search server.
  138. You can run it with the following command:
  139. ```
  140. tantivy serve -i wikipedia-index
  141. ```
  142. By default, it will serve on port `3000`.
  143. You can search for the top 20 most relevant documents for the query `barack obama` by opening
  144. the following [URL](http://localhost:3000/api/?q=barack+obama&nhits=20) in your browser:
  145. http://localhost:3000/api/?q=barack+obama&nhits=20
  146. By default this query is treated as `barack OR obama`.
  147. You can also search for documents that contain both terms by adding a `+` sign before each term in your query:
  148. http://localhost:3000/api/?q=%2Bbarack%20%2Bobama%0A&nhits=20
  149. Also, a `-` sign makes it possible to exclude the documents containing a specific term:
  150. http://localhost:3000/api/?q=-barack%20%2Bobama%0A&nhits=20
  151. Finally, tantivy handles phrase queries:
  152. http://localhost:3000/api/?q=%22barack%20obama%22&nhits=20
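The `+` and `"` characters have to be percent-encoded when they appear in a URL, which is why the links above look cryptic. As a rough sketch, the URLs can be built with Python's standard library. The `search_url` helper is illustrative; it only relies on the `q` and `nhits` parameters and the default port used in this tutorial.

```python
from urllib.parse import urlencode

BASE_URL = "http://localhost:3000/api/"  # default port from this tutorial


def search_url(query, nhits=20):
    """Build the search URL, percent-encoding the query string."""
    return BASE_URL + "?" + urlencode({"q": query, "nhits": nhits})


print(search_url("barack obama"))    # either term
print(search_url("+barack +obama"))  # both terms required
print(search_url("-barack +obama"))  # obama, but not barack
print(search_url('"barack obama"'))  # exact phrase
```

`urlencode` writes spaces as `+` rather than `%20`. Both forms decode to a space in a query string, and the first URL of this section already uses the `+` form.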
  153. ## Search the index via the command line
  154. You may also use the `search` command to stream all documents matching a specific query.
  155. The documents are returned in an unspecified order.
  156. ```
  157. tantivy search -i wikipedia-index -q "barack obama"
  158. ```