xapian-core  1.4.25
Xapian::TermGenerator Class Reference

Parses a piece of text and generates terms.

#include <termgenerator.h>

Classes

class  Internal
 

Public Types

enum  { FLAG_SPELLING = 128, FLAG_NGRAMS = 2048, FLAG_CJK_NGRAM = FLAG_NGRAMS }
 Flags to OR together and pass to TermGenerator::set_flags().
 
enum  stem_strategy {
  STEM_NONE, STEM_SOME, STEM_ALL, STEM_ALL_Z,
  STEM_SOME_FULL_POS
}
 Stemming strategies, for use with set_stemming_strategy().
 
enum  stop_strategy { STOP_NONE, STOP_ALL, STOP_STEMMED }
 Stopper strategies, for use with set_stopper_strategy().
 
typedef int flags
 For backward compatibility with Xapian 1.2.
 

Public Member Functions

 TermGenerator (const TermGenerator &o)
 Copy constructor.
 
TermGenerator & operator= (const TermGenerator &o)
 Assignment.
 
 TermGenerator ()
 Default constructor.
 
 ~TermGenerator ()
 Destructor.
 
void set_stemmer (const Xapian::Stem &stemmer)
 Set the Xapian::Stem object to be used for generating stemmed terms.
 
void set_stopper (const Xapian::Stopper *stop=NULL)
 Set the Xapian::Stopper object to be used for identifying stopwords.
 
void set_document (const Xapian::Document &doc)
 Set the current document.
 
const Xapian::Document & get_document () const
 Get the current document.
 
void set_database (const Xapian::WritableDatabase &db)
 Set the database to index spelling data to.
 
flags set_flags (flags toggle, flags mask=flags(0))
 Set flags.
 
void set_stemming_strategy (stem_strategy strategy)
 Set the stemming strategy.
 
void set_stopper_strategy (stop_strategy strategy)
 Set the stopper strategy.
 
void set_max_word_length (unsigned max_word_length)
 Set the maximum length word to index.
 
void index_text (const Xapian::Utf8Iterator &itor, Xapian::termcount wdf_inc=1, const std::string &prefix=std::string())
 Index some text.
 
void index_text (const std::string &text, Xapian::termcount wdf_inc=1, const std::string &prefix=std::string())
 Index some text in a std::string.
 
void index_text_without_positions (const Xapian::Utf8Iterator &itor, Xapian::termcount wdf_inc=1, const std::string &prefix=std::string())
 Index some text without positional information.
 
void index_text_without_positions (const std::string &text, Xapian::termcount wdf_inc=1, const std::string &prefix=std::string())
 Index some text in a std::string without positional information.
 
void increase_termpos (Xapian::termpos delta=100)
 Increase the term position used by index_text.
 
Xapian::termpos get_termpos () const
 Get the current term position.
 
void set_termpos (Xapian::termpos termpos)
 Set the current term position.
 
std::string get_description () const
 Return a string describing this object.
 

Private Attributes

Xapian::Internal::intrusive_ptr< Internal > internal
 

Detailed Description

Parses a piece of text and generates terms.

This module takes a piece of text and parses it to produce words which are then used to generate suitable terms for indexing. The terms generated are suitable for use with Query objects produced by the QueryParser class.

Definition at line 48 of file termgenerator.h.

Member Typedef Documentation

◆ flags

For backward compatibility with Xapian 1.2.

Definition at line 98 of file termgenerator.h.

Member Enumeration Documentation

◆ anonymous enum

anonymous enum

Flags to OR together and pass to TermGenerator::set_flags().

Enumerator
FLAG_SPELLING 

Index data required for spelling correction.

FLAG_NGRAMS 

Generate n-grams for scripts without explicit word breaks.

Spans of characters in such scripts are split into unigrams and bigrams, with the unigrams carrying positional information. Text in other scripts is split into words as normal.

The QueryParser::FLAG_NGRAMS flag needs to be passed to QueryParser.

This mode can also be enabled in 1.2.8 and later by setting environment variable XAPIAN_CJK_NGRAM to a non-empty value (but doing so was deprecated in 1.4.11).

In 1.4.x this feature was specific to CJK (Chinese, Japanese and Korean), but in 1.5.0 it's been extended to other languages. To reflect this change the new and preferred name is FLAG_NGRAMS, which was added as an alias for forward compatibility in Xapian 1.4.23. Use FLAG_CJK_NGRAM instead if you aim to support Xapian < 1.4.23.

Since
Added in Xapian 1.4.23.
FLAG_CJK_NGRAM 

Generate n-grams for scripts without explicit word breaks.

Old name - use FLAG_NGRAMS instead unless you aim to support Xapian < 1.4.23.

Since
Added in Xapian 1.3.4 and 1.2.22.

Definition at line 101 of file termgenerator.h.

◆ stem_strategy

Stemming strategies, for use with set_stemming_strategy().

Enumerator
STEM_NONE 
STEM_SOME 
STEM_ALL 
STEM_ALL_Z 
STEM_SOME_FULL_POS 

Definition at line 140 of file termgenerator.h.

◆ stop_strategy

Stopper strategies, for use with set_stopper_strategy().

Enumerator
STOP_NONE 
STOP_ALL 
STOP_STEMMED 

Definition at line 145 of file termgenerator.h.

Constructor & Destructor Documentation

◆ TermGenerator() [1/2]

TermGenerator::TermGenerator ( const TermGenerator &  o)

Copy constructor.

Definition at line 34 of file termgenerator.cc.

◆ TermGenerator() [2/2]

TermGenerator::TermGenerator ( )

Default constructor.

Definition at line 47 of file termgenerator.cc.

Referenced by operator=().

◆ ~TermGenerator()

TermGenerator::~TermGenerator ( )

Destructor.

Definition at line 49 of file termgenerator.cc.

Member Function Documentation

◆ get_description()

string TermGenerator::get_description ( ) const

Return a string describing this object.

Definition at line 143 of file termgenerator.cc.

References internal, and Xapian::Internal::str().

◆ get_document()

const Xapian::Document & TermGenerator::get_document ( ) const

Get the current document.

Definition at line 71 of file termgenerator.cc.

◆ get_termpos()

Xapian::termpos TermGenerator::get_termpos ( ) const

Get the current term position.

Definition at line 131 of file termgenerator.cc.

◆ increase_termpos()

void TermGenerator::increase_termpos ( Xapian::termpos  delta = 100)

Increase the term position used by index_text.

This can be used between indexing text from different fields or other places to prevent phrase searches from spanning between them (e.g. between the title and body text, or between two chapters in a book).

Parameters
delta  Amount to increase the term position by (default: 100).

Definition at line 125 of file termgenerator.cc.

Referenced by DEFINE_TESTCASE().

◆ index_text() [1/2]

void TermGenerator::index_text ( const Xapian::Utf8Iterator &  itor,
Xapian::termcount  wdf_inc = 1,
const std::string &  prefix = std::string() 
)

Index some text.

Parameters
itor  Utf8Iterator pointing to the text to index.
wdf_inc  The wdf increment (default 1).
prefix  The term prefix to use (default is no prefix).

Definition at line 109 of file termgenerator.cc.

Referenced by DEFINE_TESTCASE(), main(), make_netstats1_db(), and make_tg_db().

◆ index_text() [2/2]

void Xapian::TermGenerator::index_text ( const std::string &  text,
Xapian::termcount  wdf_inc = 1,
const std::string &  prefix = std::string() 
)
inline

Index some text in a std::string.

Parameters
text  The text to index.
wdf_inc  The wdf increment (default 1).
prefix  The term prefix to use (default is no prefix).

Definition at line 234 of file termgenerator.h.

◆ index_text_without_positions() [1/2]

void TermGenerator::index_text_without_positions ( const Xapian::Utf8Iterator &  itor,
Xapian::termcount  wdf_inc = 1,
const std::string &  prefix = std::string() 
)

Index some text without positional information.

Just like index_text, but no positional information is generated. This means that the database will be significantly smaller, but that phrase searching and NEAR won't be supported.

Parameters
itor  Utf8Iterator pointing to the text to index.
wdf_inc  The wdf increment (default 1).
prefix  The term prefix to use (default is no prefix).

Definition at line 117 of file termgenerator.cc.

Referenced by DEFINE_TESTCASE().

◆ index_text_without_positions() [2/2]

void Xapian::TermGenerator::index_text_without_positions ( const std::string &  text,
Xapian::termcount  wdf_inc = 1,
const std::string &  prefix = std::string() 
)
inline

Index some text in a std::string without positional information.

Just like index_text, but no positional information is generated. This means that the database will be significantly smaller, but that phrase searching and NEAR won't be supported.

Parameters
text  The text to index.
wdf_inc  The wdf increment (default 1).
prefix  The term prefix to use (default is no prefix).

Definition at line 264 of file termgenerator.h.

◆ operator=()

TermGenerator & TermGenerator::operator= ( const TermGenerator &  o)
default

Assignment.

Definition at line 37 of file termgenerator.cc.

References internal, and TermGenerator().

◆ set_database()

void TermGenerator::set_database ( const Xapian::WritableDatabase &  db)

Set the database to index spelling data to.

Definition at line 77 of file termgenerator.cc.

Referenced by DEFINE_TESTCASE().

◆ set_document()

void TermGenerator::set_document ( const Xapian::Document &  doc)

Set the current document.

Definition at line 64 of file termgenerator.cc.

Referenced by DEFINE_TESTCASE(), main(), make_netstats1_db(), and make_tg_db().

◆ set_flags()

TermGenerator::flags TermGenerator::set_flags ( flags  toggle,
flags  mask = flags(0) 
)

Set flags.

The new value of flags is: (flags & mask) ^ toggle

To just set the flags, pass the new flags in toggle and the default value for mask.

Parameters
toggle  Flags to XOR.
mask  Flags to AND with first.
Returns
The old flags setting.
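
The update rule can be worked through with plain integer arithmetic. The sketch below does not call Xapian at all: apply_flags is a hypothetical stand-in that mirrors the documented formula, and the two constants copy the enum values listed on this page.

```cpp
// Flag values copied from the anonymous enum documented above.
constexpr int FLAG_SPELLING = 128;
constexpr int FLAG_NGRAMS = 2048;

// Hypothetical stand-in for the documented rule:
//   new_flags = (flags & mask) ^ toggle
constexpr int apply_flags(int current, int toggle, int mask = 0) {
    return (current & mask) ^ toggle;
}

// Default mask of 0 discards the old value, so `toggle` becomes the new flags.
static_assert(apply_flags(FLAG_NGRAMS, FLAG_SPELLING) == FLAG_SPELLING,
              "set outright");

// Mask out one flag and XOR it back in: that flag is on, others are kept.
static_assert(apply_flags(FLAG_NGRAMS, FLAG_SPELLING, ~FLAG_SPELLING) ==
                  (FLAG_NGRAMS | FLAG_SPELLING),
              "turn one flag on");

// Mask out one flag and XOR nothing: that flag is off, others are kept.
static_assert(apply_flags(FLAG_NGRAMS | FLAG_SPELLING, 0, ~FLAG_SPELLING) ==
                  FLAG_NGRAMS,
              "turn one flag off");
```

The same toggle and mask pairs can be passed to set_flags() to set, enable, or disable individual flags.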

Definition at line 83 of file termgenerator.cc.

Referenced by DEFINE_TESTCASE().

◆ set_max_word_length()

void TermGenerator::set_max_word_length ( unsigned  max_word_length)

Set the maximum length word to index.

The limit is on the length of a word prior to stemming and prior to adding any term prefix.

The backends mostly impose a limit on the length of terms (often of about 240 bytes), but it's generally useful to have a lower limit to help prevent the index being bloated by useless junk terms from trying to index things like binary data, uuencoded data, ASCII art, etc.

This method was new in Xapian 1.3.1.

Parameters
max_word_length  The maximum length word to index, in bytes in UTF-8 representation. Default is 64.

Definition at line 103 of file termgenerator.cc.

Referenced by DEFINE_TESTCASE().

◆ set_stemmer()

void TermGenerator::set_stemmer ( const Xapian::Stem &  stemmer)

Set the Xapian::Stem object to be used for generating stemmed terms.

Definition at line 52 of file termgenerator.cc.

References stemmer.

Referenced by DEFINE_TESTCASE(), main(), make_netstats1_db(), and make_tg_db().

◆ set_stemming_strategy()

void TermGenerator::set_stemming_strategy ( stem_strategy  strategy)

Set the stemming strategy.

This method controls how the stemming algorithm is applied. It was new in Xapian 1.3.1.

Parameters
strategy  The strategy to use. Possible values are:
  • STEM_NONE: Don't perform any stemming - only unstemmed terms are generated.
  • STEM_SOME: Generate both stemmed (with a "Z" prefix) and unstemmed terms. No positional information is stored for unstemmed terms. This is the default strategy.
  • STEM_SOME_FULL_POS: Like STEM_SOME but positional information is stored for both stemmed and unstemmed terms. Added in Xapian 1.4.8.
  • STEM_ALL: Generate only stemmed terms (but without a "Z" prefix).
  • STEM_ALL_Z: Generate only stemmed terms (with a "Z" prefix).

Definition at line 91 of file termgenerator.cc.

Referenced by DEFINE_TESTCASE(), main(), and make_netstats1_db().

◆ set_stopper()

void TermGenerator::set_stopper ( const Xapian::Stopper *  stop = NULL)

Set the Xapian::Stopper object to be used for identifying stopwords.

Stemmed forms of stopwords aren't indexed, but unstemmed forms still are so that searches for phrases including stop words still work.

Parameters
stop  The Stopper object to set (default NULL, which means no stopwords).

Definition at line 58 of file termgenerator.cc.

Referenced by DEFINE_TESTCASE().

◆ set_stopper_strategy()

void TermGenerator::set_stopper_strategy ( stop_strategy  strategy)

Set the stopper strategy.

This method controls how the stopper is used. It was added in Xapian 1.4.1.

You need to also call set_stopper() for this to have any effect.

Parameters
strategy  The strategy to use. Possible values are:
  • STOP_NONE: Don't use the stopper.
  • STOP_ALL: If a word is identified as a stop word, skip it completely.
  • STOP_STEMMED: If a word is identified as a stop word, index its unstemmed form but skip the stem. Unstemmed forms are indexed with positional information by default, so this allows searches for phrases containing stopwords to be supported. (This is the default mode).

Definition at line 97 of file termgenerator.cc.

Referenced by DEFINE_TESTCASE().

◆ set_termpos()

void TermGenerator::set_termpos ( Xapian::termpos  termpos)

Set the current term position.

Parameters
termpos  The new term position to set.

Definition at line 137 of file termgenerator.cc.

Member Data Documentation

◆ internal

Xapian::Internal::intrusive_ptr<Internal> Xapian::TermGenerator::internal
private

Reference counted internals.

Definition at line 51 of file termgenerator.h.

Referenced by get_description(), and operator=().

