org.apache.solr.analysis
Class PatternTokenizerFactory

java.lang.Object
  extended by org.apache.solr.analysis.PatternTokenizerFactory
All Implemented Interfaces:
TokenizerFactory

public class PatternTokenizerFactory
extends java.lang.Object
implements TokenizerFactory

This tokenizer uses regex pattern matching to construct distinct tokens for the input stream. It takes two arguments, "pattern" and "group":

"pattern" is the regular expression.
"group" specifies which group to extract into tokens.

group=-1 (the default) is equivalent to "split". In this case, the tokens will be equivalent to the output from: http://java.sun.com/j2se/1.4.2/docs/api/java/lang/String.html#split(java.lang.String)

Using group >= 0 selects the matching group as the token. For example, if you have:

  pattern = \'([^\']+)\'
  group = 0
  input = aaa 'bbb' 'ccc'

the output will be two tokens: 'bbb' and 'ccc' (including the ' marks). With the same input but using group=1, the output would be: bbb and ccc (no ' marks).
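The group example above can be reproduced with plain java.util.regex, without Solr on the classpath. The following is a minimal sketch, not the factory's actual implementation: the class PatternGroupDemo and its extract helper are hypothetical names, and the regex is written as '([^']+)', which matches the same text as the escaped form shown above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class PatternGroupDemo {

    // Hypothetical helper: returns the text of the chosen group for every match.
    static List<String> extract(String regex, int group, String input) {
        Matcher matcher = Pattern.compile(regex).matcher(input);
        List<String> tokens = new ArrayList<>();
        while (matcher.find()) {
            tokens.add(matcher.group(group));
        }
        return tokens;
    }

    public static void main(String[] args) {
        String regex = "'([^']+)'";
        String input = "aaa 'bbb' 'ccc'";

        // group = 0 is the whole match, so the quote marks are kept.
        System.out.println(extract(regex, 0, input)); // prints ['bbb', 'ccc']

        // group = 1 is the first capturing group, so the quote marks are dropped.
        System.out.println(extract(regex, 1, input)); // prints [bbb, ccc]
    }
}
```

In both cases the unquoted text aaa produces no token, because it is not part of any match.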

Since:
solr1.2
Version:
$Id:$
Author:
ryan

Field Summary
protected  java.util.Map<java.lang.String,java.lang.String> args
           
protected  int group
           
static java.lang.String GROUP
           
protected  java.util.regex.Pattern pattern
           
static java.lang.String PATTERN
           
 
Constructor Summary
PatternTokenizerFactory()
           
 
Method Summary
 org.apache.lucene.analysis.TokenStream create(java.io.Reader input)
          Split the input using the configured pattern
 java.util.Map<java.lang.String,java.lang.String> getArgs()
          The arguments passed to init()
static java.util.List<org.apache.lucene.analysis.Token> group(java.util.regex.Matcher matcher, java.lang.String input, int group)
          Create tokens from the matches in a matcher
 void init(java.util.Map<java.lang.String,java.lang.String> args)
          Require a configured pattern
static java.util.List<org.apache.lucene.analysis.Token> split(java.util.regex.Matcher matcher, java.lang.String input)
          This behaves just like String.split(), but returns a list of Tokens rather than an array of strings
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

PATTERN

public static final java.lang.String PATTERN
See Also:
Constant Field Values

GROUP

public static final java.lang.String GROUP
See Also:
Constant Field Values

args

protected java.util.Map<java.lang.String,java.lang.String> args

pattern

protected java.util.regex.Pattern pattern

group

protected int group
Constructor Detail

PatternTokenizerFactory

public PatternTokenizerFactory()
Method Detail

init

public void init(java.util.Map<java.lang.String,java.lang.String> args)
Require a configured pattern

Specified by:
init in interface TokenizerFactory
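
As a rough illustration of what "require a configured pattern" could look like, the sketch below reads the "pattern" and "group" arguments from a map, fails when "pattern" is missing, and falls back to group=-1. It is an assumption-laden sketch rather than the factory's code: the class InitSketch is hypothetical, and the exception type was chosen only for the example.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Pattern;

public class InitSketch {
    Pattern pattern;
    int group = -1; // documented default: behave like "split"

    // Hypothetical init: the real factory may validate and report errors differently.
    void init(Map<String, String> args) {
        String regex = args.get("pattern");
        if (regex == null) {
            // Exception type is an assumption made for this sketch.
            throw new IllegalArgumentException("missing required argument: pattern");
        }
        pattern = Pattern.compile(regex);

        String groupArg = args.get("group");
        if (groupArg != null) {
            group = Integer.parseInt(groupArg);
        }
    }

    public static void main(String[] args) {
        Map<String, String> config = new HashMap<>();
        config.put("pattern", "'([^']+)'");
        config.put("group", "1");

        InitSketch sketch = new InitSketch();
        sketch.init(config);
        System.out.println(sketch.pattern.pattern() + " group=" + sketch.group);
    }
}
```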

getArgs

public java.util.Map<java.lang.String,java.lang.String> getArgs()
The arguments passed to init()

Specified by:
getArgs in interface TokenizerFactory

create

public org.apache.lucene.analysis.TokenStream create(java.io.Reader input)
Split the input using the configured pattern

Specified by:
create in interface TokenizerFactory

split

public static java.util.List<org.apache.lucene.analysis.Token> split(java.util.regex.Matcher matcher,
                                                                     java.lang.String input)
This behaves just like String.split(), but returns a list of Tokens rather than an array of strings
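
Since the method is described in terms of String.split(), it may help to recall what that reference behavior is. The sketch below only exercises the JDK (Pattern.split with the default limit, which yields the same result as String.split); the class SplitSemanticsDemo and its helper are hypothetical, and this page does not state how the Solr method treats empty strings.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

public class SplitSemanticsDemo {

    // Hypothetical helper: the strings String.split(regex) would produce for the input.
    static List<String> splitTexts(String regex, String input) {
        return Arrays.asList(Pattern.compile(regex).split(input));
    }

    public static void main(String[] args) {
        // The pattern describes the separators, not the tokens.
        System.out.println(splitTexts("\\s+", "aaa 'bbb' 'ccc'")); // prints [aaa, 'bbb', 'ccc']

        // Interior empty strings are kept by String.split ...
        System.out.println(splitTexts(",", "a,,b").size()); // prints 3

        // ... while trailing empty strings are removed.
        System.out.println(splitTexts(",", "a,b,,")); // prints [a, b]
    }
}
```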


group

public static java.util.List<org.apache.lucene.analysis.Token> group(java.util.regex.Matcher matcher,
                                                                     java.lang.String input,
                                                                     int group)
Create tokens from the matches in a matcher
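
A Lucene Token carries start and end offsets in addition to its text, so a group-based tokenizer would presumably take them from the matcher. The sketch below shows one way to do that with Matcher.start(int) and Matcher.end(int). It is not the method's actual body: the class GroupOffsetsSketch and its Piece type are hypothetical stand-ins that avoid a Lucene dependency.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class GroupOffsetsSketch {

    // Hypothetical stand-in for a token: text plus [start, end) offsets in the input.
    static final class Piece {
        final String text;
        final int start;
        final int end;

        Piece(String text, int start, int end) {
            this.text = text;
            this.start = start;
            this.end = end;
        }

        @Override
        public String toString() {
            return text + "[" + start + "," + end + ")";
        }
    }

    // Collects the chosen group of every match together with its offsets.
    static List<Piece> group(Matcher matcher, String input, int group) {
        List<Piece> pieces = new ArrayList<>();
        while (matcher.find()) {
            String text = matcher.group(group);
            if (text == null) {
                continue; // the group did not participate in this match
            }
            pieces.add(new Piece(text, matcher.start(group), matcher.end(group)));
        }
        return pieces;
    }

    public static void main(String[] args) {
        String input = "aaa 'bbb' 'ccc'";
        Matcher matcher = Pattern.compile("'([^']+)'").matcher(input);
        System.out.println(group(matcher, input, 1)); // prints [bbb[5,8), ccc[11,14)]
    }
}
```

The input parameter is unused in this sketch and is kept only to mirror the documented signature.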



Copyright © 2006 - 2008 The Apache Software Foundation