From f1ffcb7ef17e4c356addac4f2b4e379f7fed2106 Mon Sep 17 00:00:00 2001 From: "Michael J. Curry" Date: Fri, 30 Sep 2016 10:24:16 -0400 Subject: [PATCH 1/5] some grammar/punctuation fixes to README --- README.md | 58 +++++++++++++++++++++++++------------------------------ 1 file changed, 26 insertions(+), 32 deletions(-) diff --git a/README.md b/README.md index 3b06e31..1bf2d08 100644 --- a/README.md +++ b/README.md @@ -12,14 +12,14 @@ In this tutorial, we will create a brand new index with the articles of English ## Installing the tantivy CLI. -There are simple way to add the `tantivy` CLI to your computer. +There are a couple ways to add the `tantivy` CLI to your computer. If you are a rust programmer, you probably have `cargo` installed and you can just run `cargo install tantivy-cli`. -Alternatively, if you are on `Linux 64bits`, you can directly download a +Alternatively, if you are on 64-bit Linux, you can directly download a static binary: [binaries/linux_x86_64/](http://fulmicoton.com/tantivy-files/binaries/linux_x86_64/tantivy), -and save it in a directory of your system's `PATH`. +and save it in a directory on your system's `PATH`. @@ -36,7 +36,7 @@ Let's create a directory in which your index will be stored. We will now initialize the index and create its schema. The [schema](http://fulmicoton.com/tantivy/tantivy/schema/index.html) defines -the list of your fields, and for each field : +the list of your fields, and for each field: - its name - its type, currently `u32` or `str` - how it should be indexed. @@ -49,17 +49,17 @@ In our case, our documents will contain * a body * a url -We want the title and the body to be tokenized and index. We want -to also add the term frequency and term positions to our index. +We want the title and the body to be tokenized and indexed. We also want +to add the term frequency and term positions to our index. 
(To be honest, phrase queries are not yet implemented in tantivy, so the positions won't be really useful in this tutorial.) -Running `tantivy new` will start a wizard that will help you go through -the definition of the schema of our new index. +Running `tantivy new` will start a wizard that will help you +define the schema of the new index. Like all the other commands of `tantivy`, you will have to pass it your index directory via the `-i` or `--index` -parameter as follows. +parameter as follows: ```bash @@ -68,7 +68,7 @@ parameter as follows. -When asked answer to the question, answer as follows: +Answer the questions as follows: ```none @@ -135,30 +135,29 @@ When asked answer to the question, answer as follows: ``` -After the wizard has finished, a `meta.json` has been written in `wikipedia-index/meta.json`. +After the wizard has finished, a `meta.json` should exist in `wikipedia-index/meta.json`. It is a fairly human readable JSON, so you may check its content. -It contains two sections : +It contains two sections: - segments (currently empty, but we will change that soon) - schema -# Indexing the document : `index` +# Indexing the document: `index` Tantivy's `index` command offers a way to index a json file. -More accurately, the file must contain one document per line, in a json format. +The file must contain one JSON object per line. The structure of this JSON object must match that of our schema definition. ```json {"body": "some text", "title": "some title", "url": "http://somedomain.com"} ``` -For this tutorial, you can download a corpus with the 5 millions+ English articles of wikipedia -formatted in the right format here : [wiki-articles.json (2.34 GB)](https://www.dropbox.com/s/wwnfnu441w1ec9p/wiki-articles.json.bz2?dl=0). 
-Make sure to uncompress the file +For this tutorial, you can download a corpus with the 5 million+ English Wikipedia articles in the right format here: [wiki-articles.json (2.34 GB)](https://www.dropbox.com/s/wwnfnu441w1ec9p/wiki-articles.json.bz2?dl=0). +Make sure to decompress the file ```bash bunzip2 wiki-articles.json.bz2 @@ -183,7 +182,7 @@ to check what is happening. ls ./wikipedia-index ``` -If you indexed the 5 millions articles, you should see a lot of new files, all with the following format +If you indexed the 5 million articles, you should see a lot of new files, all with the following format The main file is `meta.json`. Our index is in fact divided in segments. Each segment acts as an individual smaller index. @@ -192,7 +191,7 @@ Its named is simply a uuid. -# Serve the search index : `serve` +# Serve the search index: `serve` Tantivy's cli also embeds a search server. You can run it with the following command. @@ -201,7 +200,7 @@ You can run it with the following command. tantivy serve -i wikipedia-index ``` -By default, the server is serving on the port `3000`. +By default, it will serve on port `3000`. You can search for the top 20 most relevant documents for the query `Barack Obama` by accessing the following [url](http://localhost:3000/api/?q=barack+obama&explain=true&nhits=20) in your browser @@ -209,13 +208,13 @@ the following [url](http://localhost:3000/api/?q=barack+obama&explain=true&nhits http://localhost:3000/api/?q=barack+obama&explain=true&nhits=20 -# Optimizing the index : `merge` +# Optimizing the index: `merge` -Each tantivy's indexer thread is closing a new segment every 100K documents (this is completely arbitrary at the moment). -You should have more than 50 segments in your dictionary at the moment. +Each of tantivy's indexer threads closes a new segment every 100K documents (this is completely arbitrary at the moment). +You should have more than 50 segments in your dictionary. 
-Having that many queries is hurting your query performance (well, mostly the fast ones). -Tantivy merge will merge your segment into one. +Having that many segments hurts your query performance (well, mostly the fast ones). +Tantivy merge will merge your segments into one. ``` tantivy merge -i ./wikipedia-index @@ -224,10 +223,5 @@ Tantivy merge will merge your segment into one. (The command takes around 7 minutes on my computer) Note that your files are still there even after having run the command. -`meta.json` however only lists one of the segments. -You will still need to remove the files manually. - - - - - \ No newline at end of file +However, `meta.json` only lists one of the segments. +You will still need to remove the files manually. \ No newline at end of file From 2269c76575c8ce2e8e06e6da52ea6389a5a89d8a Mon Sep 17 00:00:00 2001 From: "Michael J. Curry" Date: Fri, 30 Sep 2016 10:29:45 -0400 Subject: [PATCH 2/5] small changes to strings --- src/commands/new.rs | 4 ++-- src/main.rs | 8 ++++---- 2 files changed, 6 insertions(+), 6 deletions(-) diff --git a/src/commands/new.rs b/src/commands/new.rs index 64ba920..fc2c312 100644 --- a/src/commands/new.rs +++ b/src/commands/new.rs @@ -121,7 +121,7 @@ fn ask_add_field_u32(field_name: &str, schema: &mut Schema) { fn ask_add_field(schema: &mut Schema) { println!("\n\n"); let field_name = prompt_input("New field name ", field_name_validate); - let text_or_integer = prompt_options("Text or unsigned 32-bit Integer", vec!('T', 'I')); + let text_or_integer = prompt_options("Text or unsigned 32-bit integer", vec!('T', 'I')); if text_or_integer =='T' { ask_add_field_text(&field_name, schema); } @@ -132,7 +132,7 @@ fn ask_add_field(schema: &mut Schema) { fn run_new(directory: PathBuf) -> tantivy::Result<()> { println!("\n{} ", Style::new().bold().fg(Green).paint("Creating new index")); - println!("{} ", Style::new().bold().fg(Green).paint("Let's define it's schema!")); + println!("{} ", 
Style::new().bold().fg(Green).paint("Let's define its schema!")); let mut schema = Schema::new(); loop { ask_add_field(&mut schema); diff --git a/src/main.rs b/src/main.rs index 56e65e9..33cfd66 100644 --- a/src/main.rs +++ b/src/main.rs @@ -64,7 +64,7 @@ fn main() { .short("t") .long("num_threads") .value_name("num_threads") - .help("Number of indexing thread. By default num cores - 1 will be used") + .help("Number of indexing threads. By default num cores - 1 will be used") .default_value("0")) ) .subcommand( @@ -75,13 +75,13 @@ fn main() { .short("q") .long("queries") .value_name("queries") - .help("File containing queries (one-per line) to run in the benchmark.") + .help("File containing queries (one per line) to run in the benchmark.") .required(true)) .arg(Arg::with_name("num_repeat") .short("n") .long("num_repeat") .value_name("num_repeat") - .help("Number of time to repeat the benchmark.") + .help("Number of times to repeat the benchmark.") .default_value("1")) ) .subcommand( @@ -109,4 +109,4 @@ fn main() { }, _ => {} } -} \ No newline at end of file +} From a09374808e2af8cfddd7f7d0c7310d7130ac605b Mon Sep 17 00:00:00 2001 From: "Michael J. Curry" Date: Fri, 30 Sep 2016 10:39:07 -0400 Subject: [PATCH 3/5] make readme consistent with small changes to text in code --- README.md | 8 ++++---- 1 file changed, 4 insertions(+), 4 deletions(-) diff --git a/README.md b/README.md index 1bf2d08..df5baf4 100644 --- a/README.md +++ b/README.md @@ -73,12 +73,12 @@ Answer the questions as follows: ```none Creating new index - Let's define it's schema! + Let's define its schema! New field name ? title - Text or unsigned 32-bit Integer (T/I) ? T + Text or unsigned 32-bit integer (T/I) ? T Should the field be stored (Y/N) ? Y Should the field be indexed (Y/N) ? Y Should the field be tokenized (Y/N) ? Y @@ -89,7 +89,7 @@ Answer the questions as follows: New field name ? body - Text or unsigned 32-bit Integer (T/I) ? T + Text or unsigned 32-bit integer (T/I) ? 
T Should the field be stored (Y/N) ? Y Should the field be indexed (Y/N) ? Y Should the field be tokenized (Y/N) ? Y @@ -100,7 +100,7 @@ Answer the questions as follows: New field name ? url - Text or unsigned 32-bit Integer (T/I) ? T + Text or unsigned 32-bit integer (T/I) ? T Should the field be stored (Y/N) ? Y Should the field be indexed (Y/N) ? N Add another field (Y/N) ? N From c1044aa7fc277216e4988acc26c87ea255ab5efb Mon Sep 17 00:00:00 2001 From: "Michael J. Curry" Date: Fri, 30 Sep 2016 10:46:58 -0400 Subject: [PATCH 4/5] more small changes to README --- README.md | 17 +++++++++-------- 1 file changed, 9 insertions(+), 8 deletions(-) diff --git a/README.md b/README.md index df5baf4..d369592 100644 --- a/README.md +++ b/README.md @@ -1,7 +1,7 @@ [![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT) -Tantivy-cli is the project hosting the command line interface for [tantivy](https://github.com/fulmicoton/tantivy), a search engine project. +`tantivy-cli` is the project hosting the command line interface for [tantivy](https://github.com/fulmicoton/tantivy), a search engine project. # Tutorial: Indexing Wikipedia with Tantivy CLI @@ -12,9 +12,9 @@ In this tutorial, we will create a brand new index with the articles of English ## Installing the tantivy CLI. -There are a couple ways to add the `tantivy` CLI to your computer. +There are a couple ways to install `tantivy-cli`. -If you are a rust programmer, you probably have `cargo` installed and you can just +If you are a Rust programmer, you probably have `cargo` installed and you can just run `cargo install tantivy-cli`. Alternatively, if you are on 64-bit Linux, you can directly download a @@ -136,7 +136,7 @@ Answer the questions as follows: ``` After the wizard has finished, a `meta.json` should exist in `wikipedia-index/meta.json`. -It is a fairly human readable JSON, so you may check its content. 
+It is a fairly human readable JSON, so you can check its content. It contains two sections: - segments (currently empty, but we will change that soon) @@ -182,11 +182,12 @@ to check what is happening. ls ./wikipedia-index ``` -If you indexed the 5 million articles, you should see a lot of new files, all with the following format +If you indexed the 5 million articles, you should see a lot of new files, all with the following format: + The main file is `meta.json`. Our index is in fact divided in segments. Each segment acts as an individual smaller index. -Its named is simply a uuid. +Its name is simply a uuid. @@ -211,7 +212,7 @@ the following [url](http://localhost:3000/api/?q=barack+obama&explain=true&nhits # Optimizing the index: `merge` Each of tantivy's indexer threads closes a new segment every 100K documents (this is completely arbitrary at the moment). -You should have more than 50 segments in your dictionary. +You should currently have more than 50 segments in your dictionary. Having that many segments hurts your query performance (well, mostly the fast ones). Tantivy merge will merge your segments into one. @@ -224,4 +225,4 @@ Tantivy merge will merge your segments into one. Note that your files are still there even after having run the command. However, `meta.json` only lists one of the segments. -You will still need to remove the files manually. \ No newline at end of file +You will still need to remove the files manually. 
From 09004de63dbddff51f1cdf4dd28a3bba65675aed Mon Sep 17 00:00:00 2001 From: Paul Masurel Date: Sat, 1 Oct 2016 00:31:53 +0900 Subject: [PATCH 5/5] Update README.md --- README.md | 11 ++++++----- 1 file changed, 6 insertions(+), 5 deletions(-) diff --git a/README.md b/README.md index d369592..cea4d38 100644 --- a/README.md +++ b/README.md @@ -211,17 +211,18 @@ the following [url](http://localhost:3000/api/?q=barack+obama&explain=true&nhits # Optimizing the index: `merge` -Each of tantivy's indexer threads closes a new segment every 100K documents (this is completely arbitrary at the moment). -You should currently have more than 50 segments in your dictionary. +Each of tantivy's indexer threads is building its own independent segment. +When its buffer is full, it closes its running segment, and starts working on a new one. +You should currently have more than 50 segments in your directory. -Having that many segments hurts your query performance (well, mostly the fast ones). -Tantivy merge will merge your segments into one. +Having that many segments can hurt your query performance. +Calling `tantivy merge` will merge your segments into one. ``` tantivy merge -i ./wikipedia-index ``` -(The command takes around 7 minutes on my computer) +(The command takes less than 4 minutes on my computer) Note that your files are still there even after having run the command. However, `meta.json` only lists one of the segments.
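
The README changed by this series gives the search URL `http://localhost:3000/api/?q=barack+obama&explain=true&nhits=20` for the query `Barack Obama`. As a rough illustration of how that URL is put together, here is a minimal Rust sketch. The helper `search_url` and its parameters are hypothetical and not part of `tantivy-cli`. It assumes the query contains only plain words, and it lowercases them purely to mirror the README's example, since the patches do not say whether the server requires that.

```rust
/// Hypothetical helper (not part of tantivy-cli): builds the search URL
/// shown in the README for a server listening on `port` on localhost.
fn search_url(port: u16, query: &str, nhits: u32) -> String {
    // Join whitespace-separated terms with '+', lowercased to mirror the
    // README's example URL. No percent-encoding is performed, so this
    // sketch only handles plain alphanumeric terms.
    let q = query
        .split_whitespace()
        .map(|term| term.to_lowercase())
        .collect::<Vec<String>>()
        .join("+");
    // `explain=true` is copied as-is from the README's example URL.
    format!(
        "http://localhost:{}/api/?q={}&explain=true&nhits={}",
        port, q, nhits
    )
}
```

Calling `search_url(3000, "Barack Obama", 20)` reproduces the URL from the README. A real client would also need to percent-encode the query terms.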