  1. Tantivy-cli is a command-line interface for the [tantivy search engine](https://github.com/fulmicoton/tantivy).
  2. # Tutorial: Indexing Wikipedia with Tantivy CLI
  3. ## Introduction
  4. In this tutorial, we will create a brand new index containing the articles of English Wikipedia.
  5. ## Install
  6. There are two ways to get `tantivy`.
  7. If you are a Rust programmer, you can run `cargo install tantivy-cli`.
  8. Alternatively, if you are on 64-bit Linux, you can download a
  9. static binary: [binaries/linux_x86_64/](http://fulmicoton.com/tantivy-files/binaries/linux_x86_64/tantivy)
  10. ## Creating the index
  11. Create a directory in which your index will be stored.
  12. ```bash
  13. # create the directory
  14. mkdir wikipedia-index
  15. ```
  16. We will now initialize the index and create its schema.
  17. Our documents will contain:
  18. * a title
  19. * a body
  20. * a url
  21. Running `tantivy new` starts a wizard that walks you through
  22. defining the schema of our new index.
  23. ```bash
  24. tantivy new -i wikipedia-index
  25. ```
  26. When prompted, answer the questions as follows:
  27. ```none
  28. Creating new index
  29. Let's define it's schema!
  30. New field name ? title
  31. Text or unsigned 32-bit Integer (T/I) ? T
  32. Should the field be stored (Y/N) ? Y
  33. Should the field be indexed (Y/N) ? Y
  34. Should the field be tokenized (Y/N) ? Y
  35. Should the term frequencies (per doc) be in the index (Y/N) ? Y
  36. Should the term positions (per doc) be in the index (Y/N) ? Y
  37. Add another field (Y/N) ? Y
  38. New field name ? body
  39. Text or unsigned 32-bit Integer (T/I) ? T
  40. Should the field be stored (Y/N) ? Y
  41. Should the field be indexed (Y/N) ? Y
  42. Should the field be tokenized (Y/N) ? Y
  43. Should the term frequencies (per doc) be in the index (Y/N) ? Y
  44. Should the term positions (per doc) be in the index (Y/N) ? Y
  45. Add another field (Y/N) ? Y
  46. New field name ? url
  47. Text or unsigned 32-bit Integer (T/I) ? T
  48. Should the field be stored (Y/N) ? Y
  49. Should the field be indexed (Y/N) ? N
  50. Add another field (Y/N) ? N
  51. [
  52. {
  53. "name": "title",
  54. "type": "text",
  55. "options": {
  56. "indexing": "position",
  57. "stored": true
  58. }
  59. },
  60. {
  61. "name": "body",
  62. "type": "text",
  63. "options": {
  64. "indexing": "position",
  65. "stored": true
  66. }
  67. },
  68. {
  69. "name": "url",
  70. "type": "text",
  71. "options": {
  72. "indexing": "unindexed",
  73. "stored": true
  74. }
  75. }
  76. ]
  77. ```
  78. If you want to know more about the meaning of these options, you can check out the [schema doc page](http://fulmicoton.com/tantivy/tantivy/schema/index.html).
  79. The JSON displayed at the end has been written to `wikipedia-index/meta.json`.
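
To double-check the wizard answers, the schema JSON can be summarized programmatically. The following is a minimal sketch: it embeds the schema exactly as printed above rather than reading `meta.json`, and the helper name `summarize_schema` is hypothetical. It assumes that a field is searchable whenever its `indexing` option is anything other than `unindexed`.

```python
import json

# Schema as printed by the wizard above, embedded so the sketch is self-contained.
SCHEMA_JSON = """
[
  {"name": "title", "type": "text", "options": {"indexing": "position", "stored": true}},
  {"name": "body", "type": "text", "options": {"indexing": "position", "stored": true}},
  {"name": "url", "type": "text", "options": {"indexing": "unindexed", "stored": true}}
]
"""


def summarize_schema(schema):
    """Return one (name, type, searchable, stored) tuple per field entry."""
    rows = []
    for field in schema:
        options = field.get("options", {})
        # Assumption: any indexing mode other than "unindexed" makes the field searchable.
        searchable = options.get("indexing", "unindexed") != "unindexed"
        stored = bool(options.get("stored", False))
        rows.append((field["name"], field["type"], searchable, stored))
    return rows


if __name__ == "__main__":
    for name, ftype, searchable, stored in summarize_schema(json.loads(SCHEMA_JSON)):
        print(f"{name:<6} type={ftype} searchable={searchable} stored={stored}")
```

With the answers given above, `title` and `body` come out as searchable and stored, while `url` is stored only.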
  80. ## Get the documents to index
  81. Tantivy's `index` command offers a way to index a JSON file.
  82. More accurately, the file must contain one document per line, in JSON format.
  83. The structure of each JSON object must match that of our schema definition.
  84. ```json
  85. {"body": "some text", "title": "some title", "url": "http://somedomain.com"}
  86. ```
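
If you build your own corpus, it can help to check the file before indexing it. The sketch below is deliberately strict: it requires every line to be a JSON object with exactly the three string fields of the schema. That strictness is an assumption made for illustration, and tantivy itself may be more lenient. The names `check_line` and `check_corpus` are hypothetical helpers, not part of tantivy.

```python
import json

# Field names taken from the schema defined earlier in this tutorial.
EXPECTED_FIELDS = {"title", "body", "url"}


def check_line(line):
    """Return None if the line is a valid document, else a short error message."""
    try:
        doc = json.loads(line)
    except json.JSONDecodeError as err:
        return f"invalid JSON: {err.msg}"
    if not isinstance(doc, dict):
        return "not a JSON object"
    missing = EXPECTED_FIELDS - set(doc)
    unexpected = set(doc) - EXPECTED_FIELDS
    if missing or unexpected:
        return f"missing={sorted(missing)} unexpected={sorted(unexpected)}"
    if not all(isinstance(doc[name], str) for name in EXPECTED_FIELDS):
        return "all three fields should be strings"
    return None


def check_corpus(lines):
    """Yield (line_number, error) for every non-blank line that fails check_line."""
    for number, line in enumerate(lines, start=1):
        if line.strip():
            error = check_line(line)
            if error:
                yield number, error
```

Since an open text file iterates line by line, `check_corpus(open("wiki-articles.json", encoding="utf-8"))` scans the corpus without loading it into memory.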
  87. You can download a corpus of more than 5 million Wikipedia articles,
  88. already in the right format, here: [wiki-articles.json (2.34 GB)](https://www.dropbox.com/s/wwnfnu441w1ec9p/wiki-articles.json.bz2?dl=0).
  89. If you are in a rush, you can [download 100 articles in the right format here](http://fulmicoton.com/tantivy-files/wiki-articles-1000.json).
  90. The full corpus is bzip2-compressed, so make sure to decompress it before indexing:
  91. ```bash
  92. bunzip2 wiki-articles.json.bz2
  93. ```
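
If you only want a small test corpus, you do not have to decompress the whole archive first. As a rough sketch, the standard `bz2` module can stream the first lines of the compressed file into a separate file. The function name `write_sample` and the output file name used below are hypothetical.

```python
import bz2
import itertools


def write_sample(archive_path, sample_path, max_docs=1000):
    """Copy the first max_docs lines of a .bz2 corpus into an uncompressed file.

    Returns the number of lines actually written.
    """
    count = 0
    with bz2.open(archive_path, "rt", encoding="utf-8") as src, \
            open(sample_path, "w", encoding="utf-8") as dst:
        # islice stops reading (and decompressing) after max_docs lines.
        for line in itertools.islice(src, max_docs):
            dst.write(line)
            count += 1
    return count
```

For example, `write_sample("wiki-articles.json.bz2", "wiki-sample.json")` would produce a file with at most 1000 documents, one per line.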
  94. ## Index the documents
  95. The `index` command will index your documents.
  96. By default, it will use as many threads as there are cores on your machine.
  97. On my computer (8-core Xeon(R) CPU X3450 @ 2.67GHz), it takes only 7 minutes.
  98. ```bash
  99. cat wiki-articles.json | tantivy index -i wikipedia-index
  100. ```
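
The same step can be driven from a script. The sketch below only reuses the invocation shown above (`tantivy index -i <index directory>`) and feeds the corpus to the command's standard input instead of going through `cat`. The helper names `build_index_command` and `index_file` are hypothetical, and the sketch assumes that `tantivy` is on your `PATH`.

```python
import subprocess


def build_index_command(index_dir, tantivy_bin="tantivy"):
    """Return the argument list for the indexing command used in this tutorial."""
    return [tantivy_bin, "index", "-i", index_dir]


def index_file(corpus_path, index_dir, tantivy_bin="tantivy"):
    """Stream corpus_path into the index command's stdin; raise if it fails."""
    with open(corpus_path, "rb") as corpus:
        subprocess.run(
            build_index_command(index_dir, tantivy_bin),
            stdin=corpus,  # equivalent to piping the file with `cat`
            check=True,
        )
```

Calling `index_file("wiki-articles.json", "wikipedia-index")` is then equivalent to the shell pipeline.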
  101. While it is indexing, you can peek at the index directory
  102. to check what is happening.
  103. ```bash
  104. ls wikipedia-index
  105. ```
  106. If you indexed the 5 million articles, you should see a lot of files.
  107. The main file is `meta.json`.
  108. Our index is in fact divided into segments. Each segment acts as an individual, smaller index.
  109. Each segment is named by a UUID.
  110. Each of a segment's files stores a different data structure for the index.
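
To get a feel for this layout, you can group the files of the index directory by segment. This is only a sketch under an assumption that the text above does not spell out: that each segment file is named `<uuid>.<extension>`, so that the part before the extension identifies the segment. The function name `group_by_segment` is hypothetical.

```python
import os
from collections import defaultdict


def group_by_segment(index_dir):
    """Map each file stem (assumed to be a segment UUID) to its sorted extensions."""
    segments = defaultdict(list)
    for name in os.listdir(index_dir):
        # Skip the main metadata file and any hidden files.
        if name == "meta.json" or name.startswith("."):
            continue
        stem, ext = os.path.splitext(name)
        segments[stem].append(ext.lstrip("."))
    return {stem: sorted(exts) for stem, exts in segments.items()}
```

Running `group_by_segment("wikipedia-index")` while indexing is in progress shows how many segments currently exist and which files each one is made of.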
  111. ## Serve the search index
  112. ```bash
  113. tantivy serve -i wikipedia-index
  114. ```
  115. This starts a small server with a JSON API that lets you search Wikipedia.
  116. By default, the server listens on port `3000`.
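
Once the server is running, it can be queried from any HTTP client. The sketch below uses only the Python standard library. Note that the `/api/` path and the `q` and `nhits` query parameters are not described in this section: they are assumptions that you should verify against your version of tantivy-cli. The helper names `build_search_url` and `search` are hypothetical.

```python
import json
import urllib.parse
import urllib.request


def build_search_url(query, nhits=10, host="localhost", port=3000):
    """Build a search URL; the /api/ path and parameter names are assumptions."""
    params = urllib.parse.urlencode({"q": query, "nhits": nhits})
    return f"http://{host}:{port}/api/?{params}"


def search(query, **kwargs):
    """Send the query to a running `tantivy serve` instance and decode the JSON reply."""
    with urllib.request.urlopen(build_search_url(query, **kwargs)) as response:
        return json.load(response)
```

For instance, `build_search_url("barack obama")` yields a URL on port `3000` with the query URL-encoded, which you can also paste into a browser.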