Browse Source

Added readme

develop
Paul Masurel 8 years ago
parent
commit
27cb9d41e1
2 changed files with 220 additions and 21 deletions
  1. +169
    -0
      README.md
  2. +51
    -21
      src/commands/serve.rs

+ 169
- 0
README.md View File

@@ -0,0 +1,169 @@
Tantivy-cli is command line interface for [tantivy search engine](https://github.com/fulmicoton/tantivy).



# Tutorial: Indexing Wikipedia with Tantivy CLI

## Introduction

In this tutorial, we will create a brand new index with the articles of English wikipedia in it.

## Install

There are two ways to get `tantivy`.
If you are a rust programmer, you can run `cargo install tantivy-cli`.
Alternatively, if you are on `Linux 64bits`, you can download a
static binary: [binaries/linux_x86_64/](http://fulmicoton.com/tantivy-files/binaries/linux_x86_64/tantivy)

## Creating the index

Create a directory in which your index will be stored.

```bash
# create the directory
mkdir wikipedia-index
```


We will now initialize the index and create it's schema.

Our documents will contain
* a title
* a body
* a url

Running `tantivy new` will start a wizard that will help you go through
the definition of the schema of our new index.

```bash
tantivy new -i wikipedia-index
```

When asked answer to the question as follows:

```none

Creating new index
Let's define it's schema!



New field name ? title
Text or unsigned 32-bit Integer (T/I) ? T
Should the field be stored (Y/N) ? Y
Should the field be indexed (Y/N) ? Y
Should the field be tokenized (Y/N) ? Y
Should the term frequencies (per doc) be in the index (Y/N) ? Y
Should the term positions (per doc) be in the index (Y/N) ? Y
Add another field (Y/N) ? Y



New field name ? body
Text or unsigned 32-bit Integer (T/I) ? T
Should the field be stored (Y/N) ? Y
Should the field be indexed (Y/N) ? Y
Should the field be tokenized (Y/N) ? Y
Should the term frequencies (per doc) be in the index (Y/N) ? Y
Should the term positions (per doc) be in the index (Y/N) ? Y
Add another field (Y/N) ? Y



New field name ? url
Text or unsigned 32-bit Integer (T/I) ? T
Should the field be stored (Y/N) ? Y
Should the field be indexed (Y/N) ? N
Add another field (Y/N) ? N

[
{
"name": "title",
"type": "text",
"options": {
"indexing": "position",
"stored": true
}
},
{
"name": "body",
"type": "text",
"options": {
"indexing": "position",
"stored": true
}
},
{
"name": "url",
"type": "text",
"options": {
"indexing": "unindexed",
"stored": true
}
}
]


```

If you want to know more about the meaning of these options, you can check out the [schema doc page](http://fulmicoton.com/tantivy/tantivy/schema/index.html).

The json displayed at the end has been written in `wikipedia-index/meta.json`.


# Get the documents to index

Tantivy's index command offers a way to index a json file.
More accurately, the file must contain one document per line, in a json format.
The structure of this JSON object must match that of our schema definition.

```json
{"body": "some text", "title": "some title", "url": "http://somedomain.com"}
```

You can download a corpus of more than 5 millions articles from wikipedia
formatted in the right format here : [wiki-articles.json (2.34 GB)](https://www.dropbox.com/s/wwnfnu441w1ec9p/wiki-articles.json.bz2?dl=0).
If you are in a rush you can [download 100 articles in the right format here](http://fulmicoton.com/tantivy-files/wiki-articles-1000.json).

Make sure to uncompress the file

```bash
bunzip2 wiki-articles.json.bz2
```

# Index the documents.

The `index` command will index your document.
By default it will use as many threads as there are core on your machine.

On my computer (8 core Xeon(R) CPU X3450 @ 2.67GHz), it only takes 7 minutes.

```
cat /data/wiki-articles | tantivy index -i wikipedia-index
```

While it is indexing, you can peek at the index directory
to check what is happening.

```bash
ls wikipedia-index
```

If you indexed the 5 millions articles, you should see a lot of files, all with the following format
The main file is `meta.json`.

Our index is in fact divided in segments. Each segment acts as an individual smaller index.
It is named by a uuid.
Each different files is storing a different datastructure for the index.


# Serve the search index

```
tantivy serve -i wikipedia-index
```

You can start a small server with a JSON API to search into wikipedia.
By default, the server is serving on the port `3000`.



+ 51
- 21
src/commands/serve.rs View File

@@ -1,3 +1,20 @@
/// This tantivy command starts a http server (by default on port 3000)
///
/// Currently the only entrypoint is /api/
/// and it takes the following query string argument
///
/// - `q=` : your query
// - `nhits`: the number of hits that should be returned. (default to 10)
/// - `explain=` : if true returns some information about the score.
///
///
/// For instance, the following call should return the 20 most relevant
/// hits for fulmicoton.
///
/// http://localhost:3000/api/?q=fulmicoton&explain=false&nhits=20
///


use clap::ArgMatches;
use iron::mime::Mime;
use iron::prelude::*;
@@ -7,8 +24,11 @@ use mount::Mount;
use persistent::Read;
use rustc_serialize::json::as_pretty_json;
use std::convert::From;
use std::error::Error;
use std::fmt::{self, Debug};
use std::path::Path;
use std::path::PathBuf;
use std::str::FromStr;
use tantivy;
use tantivy::collector;
use tantivy::collector::CountCollector;
@@ -19,12 +39,12 @@ use tantivy::query::Explanation;
use tantivy::query::Query;
use tantivy::query::QueryParser;
use tantivy::Result;
use tantivy::schema::Schema;
use tantivy::schema::Field;
use tantivy::schema::FieldType;
use tantivy::schema::NamedFieldDocument;
use tantivy::schema::Schema;
use tantivy::TimerTree;
use urlencoded::UrlEncodedQuery;
use std::str::FromStr;
use std::fmt::{self, Debug};
use std::error::Error;

pub fn run_serve_cli(matches: &ArgMatches) -> tantivy::Result<()> {
let index_directory = PathBuf::from(matches.value_of("index").unwrap());
@@ -40,7 +60,7 @@ struct Serp {
q: String,
num_hits: usize,
hits: Vec<Hit>,
timings: Vec<Timing>,
timings: TimerTree,
}

#[derive(RustcEncodable)]
@@ -49,12 +69,6 @@ struct Hit {
explain: Option<Explanation>,
}

#[derive(RustcEncodable)]
struct Timing {
name: String,
duration: i64,
}

struct IndexServer {
index: Index,
query_parser: QueryParser,
@@ -66,9 +80,21 @@ impl IndexServer {
fn load(path: &Path) -> IndexServer {
let index = Index::open(path).unwrap();
let schema = index.schema();
let body_field = schema.get_field("body").unwrap();
let title_field = schema.get_field("title").unwrap();
let query_parser = QueryParser::new(schema.clone(), vec!(body_field, title_field));
let default_fields: Vec<Field> = schema
.fields()
.iter()
.enumerate()
.filter(
|&(_, ref field_entry)| {
match *field_entry.field_type() {
FieldType::Str(_) => true,
FieldType::U32(_) => false
}
}
)
.map(|(i, _)| Field(i as u8))
.collect();
let query_parser = QueryParser::new(schema.clone(), default_fields);
IndexServer {
index: index,
query_parser: query_parser,
@@ -88,14 +114,17 @@ impl IndexServer {
let searcher = self.index.searcher().unwrap();
let mut count_collector = CountCollector::new();
let mut top_collector = TopCollector::with_limit(num_hits);
let mut timer_tree = TimerTree::new();
{
let _search_timer = timer_tree.open("search");
let mut chained_collector = collector::chain()
.add(&mut top_collector)
.add(&mut count_collector);
.add(&mut top_collector)
.add(&mut count_collector);
try!(query.search(&searcher, &mut chained_collector));
}
let hits: Vec<Hit> = top_collector.docs()
let hits: Vec<Hit> = {
let _fetching_timer = timer_tree.open("fetching docs");
top_collector.docs()
.iter()
.map(|doc_address| {
let doc: Document = searcher.doc(doc_address).unwrap();
@@ -108,12 +137,13 @@ impl IndexServer {
}
self.create_hit(&doc, explanation)
})
.collect();
.collect()
};
Ok(Serp {
q: q,
hits: hits,
num_hits: count_collector.count(),
timings: Vec::new(),
hits: hits,
timings: timer_tree,
})
}
}


Loading…
Cancel
Save