org.apache.commons.csv

Class CSVParser


public class CSVParser
extends java.lang.Object

Parses CSV files according to the specified configuration. Because CSV appears in many different dialects, the parser supports many configuration settings by allowing the specification of a CSVStrategy.

Parsing of a csv-string having tabs as separators, '"' as an optional value encapsulator, and comments starting with '#':

  String[][] data = 
   (new CSVParser(new StringReader("a\tb\nc\td"), new CSVStrategy('\t','"','#'))).getAllValues();
 

Parsing of a csv-string in Excel CSV format:

  String[][] data =
   (new CSVParser(new StringReader("a;b\nc;d"), CSVStrategy.EXCEL_STRATEGY)).getAllValues();
 

Internal parser state is completely covered by the strategy and the reader-state.

See the package documentation for more details.

Nested Class Summary

(package private) static class
CSVParser.Token
Token is an internal token representation.

Field Summary

private static String[]
EMPTY_STRING_ARRAY
Immutable empty String array.
private static int
INITIAL_TOKEN_LENGTH
length of the initial token (content-)buffer
protected static int
TT_EOF
Token (which can have content) when end of file is reached.
protected static int
TT_EORECORD
Token with content when end of a line is reached.
protected static int
TT_INVALID
Token has no valid content, i.e. is in its initialized state.
protected static int
TT_TOKEN
Token with content, at beginning or in the middle of a line.
private CharBuffer
code
private ExtendedBufferedReader
in
private ArrayList
record
A record buffer for getLine().
private CSVParser.Token
reusableToken
private CSVStrategy
strategy
private CharBuffer
wsBuf

Constructor Summary

CSVParser(InputStream input)
Deprecated. use CSVParser(Reader).
CSVParser(Reader input)
CSV parser using the default CSVStrategy.
CSVParser(Reader input, char delimiter)
Deprecated. use CSVParser(Reader,CSVStrategy).
CSVParser(Reader input, char delimiter, char encapsulator, char commentStart)
Deprecated. use CSVParser(Reader,CSVStrategy).
CSVParser(Reader input, CSVStrategy strategy)
Customized CSV parser using the given CSVStrategy

Method Summary

private CSVParser.Token
encapsulatedTokenLexer(CSVParser.Token tkn, int c)
An encapsulated token lexer. Encapsulated tokens are surrounded by the given encapsulating-string.
String[][]
getAllValues()
Parses the CSV according to the given strategy and returns the content as an array of records (where each record is an array of single values).
String[]
getLine()
Parses from the current point in the stream until the end of the current line.
int
getLineNumber()
Returns the current line number in the input stream.
CSVStrategy
getStrategy()
Obtain the specified CSV Strategy
private boolean
isEndOfFile(int c)
private boolean
isEndOfLine(int c)
Greedy: accepts \n and \r\n. This checker silently consumes the second control character...
private boolean
isWhitespace(int c)
protected CSVParser.Token
nextToken()
Convenience method for nextToken(null).
protected CSVParser.Token
nextToken(CSVParser.Token tkn)
Returns the next token.
String
nextValue()
Parses the CSV according to the given strategy and returns the next csv-value as string.
CSVParser
setStrategy(CSVStrategy strategy)
Deprecated. the strategy should be set in the constructor CSVParser(Reader,CSVStrategy).
private CSVParser.Token
simpleTokenLexer(CSVParser.Token tkn, int c)
A simple token lexer. Simple tokens are tokens which are not surrounded by encapsulators.
protected int
unicodeEscapeLexer(int c)
Decodes Unicode escapes.

Field Details

EMPTY_STRING_ARRAY

private static final String[] EMPTY_STRING_ARRAY
Immutable empty String array.

INITIAL_TOKEN_LENGTH

private static final int INITIAL_TOKEN_LENGTH
length of the initial token (content-)buffer
Field Value:
50

TT_EOF

protected static final int TT_EOF
Token (which can have content) when end of file is reached.
Field Value:
1

TT_EORECORD

protected static final int TT_EORECORD
Token with content when end of a line is reached.
Field Value:
2

TT_INVALID

protected static final int TT_INVALID
Token has no valid content, i.e. is in its initialized state.
Field Value:
-1

TT_TOKEN

protected static final int TT_TOKEN
Token with content, at beginning or in the middle of a line.
Field Value:
0

code

private final CharBuffer code

in

private final ExtendedBufferedReader in

record

private final ArrayList record
A record buffer for getLine(). Grows as necessary and is reused.

reusableToken

private final CSVParser.Token reusableToken

strategy

private CSVStrategy strategy

wsBuf

private final CharBuffer wsBuf

Constructor Details

CSVParser

public CSVParser(InputStream input)

Deprecated. use CSVParser(Reader).

Default strategy for the parser follows the default CSVStrategy.
Parameters:
input - an InputStream containing "csv-formatted" input

CSVParser

public CSVParser(Reader input)
CSV parser using the default CSVStrategy.
Parameters:
input - a Reader containing "csv-formatted" input

CSVParser

public CSVParser(Reader input,
                 char delimiter)

Deprecated. use CSVParser(Reader,CSVStrategy).

Customized value delimiter parser. The parser follows the default CSVStrategy except for the delimiter setting.
Parameters:
input - a Reader based on "csv-formatted" input
delimiter - a char used for value separation

CSVParser

public CSVParser(Reader input,
                 char delimiter,
                 char encapsulator,
                 char commentStart)

Deprecated. use CSVParser(Reader,CSVStrategy).

Customized CSV parser. The parser parses according to the given CSV dialect settings. Leading whitespace is trimmed, Unicode escapes are not interpreted, and empty lines are ignored.
Parameters:
input - a Reader based on "csv-formatted" input
delimiter - a char used for value separation
encapsulator - a char used as value encapsulation marker
commentStart - a char used for comment identification

CSVParser

public CSVParser(Reader input,
                 CSVStrategy strategy)
Customized CSV parser using the given CSVStrategy
Parameters:
input - a Reader containing "csv-formatted" input
strategy - the CSVStrategy used for CSV parsing

Method Details

encapsulatedTokenLexer

private CSVParser.Token encapsulatedTokenLexer(CSVParser.Token tkn,
                                               int c)
            throws IOException
An encapsulated token lexer. Encapsulated tokens are surrounded by the given encapsulating-string. The encapsulator itself may be included in the token using a doubling syntax (as in "", '') or using escaping (as in \", \'). Whitespace before and after an encapsulated token is ignored.
Parameters:
tkn - the current token
c - the current character
Returns:
a valid token object
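
The doubling syntax described above can be illustrated with a small, self-contained sketch. It is not the CSVParser source: the class name EncapsulatedTokenSketch and the method readEncapsulated are hypothetical, only the doubling form is handled (not backslash escaping), and surrounding whitespace is not skipped.

```java
// Illustrative sketch only; not the CSVParser source.
final class EncapsulatedTokenSketch {

    /**
     * Reads one encapsulated value. input.charAt(start) must be the opening
     * encapsulator. The decoded content is appended to out, and the index
     * just past the closing encapsulator is returned.
     */
    static int readEncapsulated(String input, int start, char encapsulator, StringBuilder out) {
        int pos = start + 1; // skip the opening encapsulator
        while (pos < input.length()) {
            char c = input.charAt(pos++);
            if (c != encapsulator) {
                out.append(c);
            } else if (pos < input.length() && input.charAt(pos) == encapsulator) {
                out.append(encapsulator); // doubled encapsulator: one literal character
                pos++;
            } else {
                return pos; // closing encapsulator reached
            }
        }
        throw new IllegalArgumentException("unterminated encapsulated token");
    }

    public static void main(String[] args) {
        StringBuilder out = new StringBuilder();
        int end = readEncapsulated("\"say \"\"hi\"\"\",x", 0, '"', out);
        System.out.println(out + " (next index: " + end + ")");
    }
}
```

A real lexer would additionally decide, from the character at the returned index, whether the token ends a value, a record, or the input.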

getAllValues

public String[][] getAllValues()
            throws IOException
Parses the CSV according to the given strategy and returns the content as an array of records (where each record is an array of single values).

The returned content starts at the current parse-position in the stream.

Returns:
matrix of records x values ('null' when end of file)

getLine

public String[] getLine()
            throws IOException
Parses from the current point in the stream until the end of the current line.
Returns:
array of values until end of line ('null' when end of file has been reached)

getLineNumber

public int getLineNumber()
Returns the current line number in the input stream. ATTENTION: if your CSV has multiline values, the returned number does not correspond to the record number.
Returns:
current line number
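
Why line numbers and record numbers diverge can be made concrete with a short sketch that counts both. It is an illustration under simplifying assumptions, not part of this class: the name LineVsRecordSketch is hypothetical, lines end in \n, the text ends with a line terminator, and encapsulators are balanced.

```java
// Illustrative sketch only; not the CSVParser source.
final class LineVsRecordSketch {

    /** Returns {physicalLines, records} for CSV text whose lines end in '\n'. */
    static int[] count(String csv, char encapsulator) {
        int lines = 0;
        int records = 0;
        boolean insideValue = false; // true while inside an encapsulated value
        for (int i = 0; i < csv.length(); i++) {
            char c = csv.charAt(i);
            if (c == encapsulator) {
                // A doubled encapsulator toggles twice, so the state is unchanged.
                insideValue = !insideValue;
            } else if (c == '\n') {
                lines++;
                if (!insideValue) {
                    records++; // only an unencapsulated line break ends a record
                }
            }
        }
        return new int[] {lines, records};
    }

    public static void main(String[] args) {
        int[] counts = count("a,\"x\ny\"\nb,c\n", '"');
        System.out.println(counts[0] + " lines, " + counts[1] + " records");
    }
}
```

For the sample input, the encapsulated value spans two physical lines, so the line count runs ahead of the record count.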

getStrategy

public CSVStrategy getStrategy()
Obtain the specified CSV Strategy
Returns:
strategy currently being used

isEndOfFile

private boolean isEndOfFile(int c)
Returns:
true if the given character indicates end of file

isEndOfLine

private boolean isEndOfLine(int c)
            throws IOException
Greedy: accepts \n and \r\n. This checker silently consumes the second control character...
Returns:
true if the given character is a line-terminator
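
A minimal sketch of such a greedy check follows. It is not the actual implementation: the class EndOfLineSketch and its String-backed read() and lookAhead() helpers are hypothetical stand-ins for the parser's reader, and treating a lone \r as an ordinary character is an assumption drawn from the description above, which names only \n and \r\n.

```java
// Illustrative sketch only; not the CSVParser source.
final class EndOfLineSketch {
    private final String input;
    private int pos;

    EndOfLineSketch(String input) {
        this.input = input;
    }

    /** Reads one character, or returns -1 at end of input. */
    int read() {
        return pos < input.length() ? input.charAt(pos++) : -1;
    }

    /** Peeks at the next character without consuming it. */
    int lookAhead() {
        return pos < input.length() ? input.charAt(pos) : -1;
    }

    /** Greedy check: true for "\n" and for "\r\n"; the "\n" of "\r\n" is consumed. */
    boolean isEndOfLine(int c) {
        if (c == '\r' && lookAhead() == '\n') {
            c = read(); // silently consume the second control character
        }
        return c == '\n';
    }

    public static void main(String[] args) {
        EndOfLineSketch s = new EndOfLineSketch("a\r\nb");
        s.read(); // 'a'
        System.out.println(s.isEndOfLine(s.read())); // checks the "\r\n" pair
        System.out.println((char) s.read());         // next unread character
    }
}
```

Because the check has a side effect on the input position, a caller should invoke it at most once per character read.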

isWhitespace

private boolean isWhitespace(int c)
Returns:
true if the given char is a whitespace character

nextToken

protected CSVParser.Token nextToken()
            throws IOException
Convenience method for nextToken(null).

nextToken

protected CSVParser.Token nextToken(CSVParser.Token tkn)
            throws IOException
Returns the next token. A token corresponds to a term, a record change or an end-of-file indicator.
Parameters:
tkn - an existing Token object to reuse. The caller is responsible for initializing the Token.
Returns:
the next token found
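
To show how the three token kinds relate to records, here is a hedged sketch of a consumer that groups tokens into records. It is an assumption about how a caller such as getLine() might use the token stream, not the actual code: RecordAssemblySketch, its Token class, and nextRecord are hypothetical, and only the constant values are taken from Field Details.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

// Illustrative sketch only; not the CSVParser source.
final class RecordAssemblySketch {
    static final int TT_TOKEN = 0;    // value, more values follow in this record
    static final int TT_EOF = 1;      // end of input, may still carry content
    static final int TT_EORECORD = 2; // last value of the current record

    /** Hypothetical stand-in for CSVParser.Token. */
    static final class Token {
        final int type;
        final String content;

        Token(int type, String content) {
            this.type = type;
            this.content = content;
        }
    }

    /** Collects values until a record ends; returns null when nothing is left. */
    static String[] nextRecord(Iterator<Token> tokens) {
        List<String> record = new ArrayList<>();
        while (tokens.hasNext()) {
            Token t = tokens.next();
            if (t.type == TT_EOF && record.isEmpty() && t.content.isEmpty()) {
                return null; // bare end of input, no pending values
            }
            record.add(t.content);
            if (t.type != TT_TOKEN) {
                break; // TT_EORECORD or TT_EOF closes the record
            }
        }
        return record.isEmpty() ? null : record.toArray(new String[0]);
    }

    public static void main(String[] args) {
        Iterator<Token> tokens = Arrays.asList(
                new Token(TT_TOKEN, "a"), new Token(TT_EORECORD, "b"),
                new Token(TT_TOKEN, "c"), new Token(TT_EOF, "d")).iterator();
        String[] record;
        while ((record = nextRecord(tokens)) != null) {
            System.out.println(Arrays.toString(record));
        }
    }
}
```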

nextValue

public String nextValue()
            throws IOException
Parses the CSV according to the given strategy and returns the next csv-value as string.
Returns:
next value in the input stream ('null' when end of file)

setStrategy

public CSVParser setStrategy(CSVStrategy strategy)

Deprecated. the strategy should be set in the constructor CSVParser(Reader,CSVStrategy).

Sets the specified CSV Strategy
Returns:
current instance of CSVParser to allow chained method calls

simpleTokenLexer

private CSVParser.Token simpleTokenLexer(CSVParser.Token tkn,
                                         int c)
            throws IOException
A simple token lexer. Simple tokens are tokens which are not surrounded by encapsulators. A simple token might contain escaped delimiters (as \, or \;). The token is finished when one of the following conditions becomes true:
  • end of line has been reached (TT_EORECORD)
  • end of stream has been reached (TT_EOF)
  • an unescaped delimiter has been reached (TT_TOKEN)
Parameters:
tkn - the current token
c - the current character
Returns:
the filled token
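
The three termination conditions listed above can be sketched as follows. This is a simplified illustration, not the CSVParser source: SimpleTokenSketch and its Token class are hypothetical, the input is a String rather than a reader, and dropping the escape character while keeping the escaped one is an assumption. The constant values match those under Field Details.

```java
// Illustrative sketch only; not the CSVParser source.
final class SimpleTokenSketch {
    static final int TT_INVALID = -1;
    static final int TT_TOKEN = 0;
    static final int TT_EOF = 1;
    static final int TT_EORECORD = 2;

    /** Hypothetical stand-in for CSVParser.Token. */
    static final class Token {
        int type = TT_INVALID;
        final StringBuilder content = new StringBuilder();
    }

    private final String input;
    private final char delimiter;
    private final char escape;
    private int pos;

    SimpleTokenSketch(String input, char delimiter, char escape) {
        this.input = input;
        this.delimiter = delimiter;
        this.escape = escape;
    }

    /** Reads characters until a delimiter, a line end, or the end of input. */
    Token next() {
        Token tkn = new Token();
        while (true) {
            if (pos >= input.length()) {
                tkn.type = TT_EOF; // end of stream reached
                return tkn;
            }
            char c = input.charAt(pos++);
            boolean crlf = c == '\r' && pos < input.length() && input.charAt(pos) == '\n';
            if (c == '\n' || crlf) {
                if (crlf) {
                    pos++; // consume the '\n' of "\r\n"
                }
                tkn.type = TT_EORECORD; // end of line reached
                return tkn;
            }
            if (c == delimiter) {
                tkn.type = TT_TOKEN; // unescaped delimiter reached
                return tkn;
            }
            if (c == escape && pos < input.length()) {
                tkn.content.append(input.charAt(pos++)); // escaped character kept literally
            } else {
                tkn.content.append(c);
            }
        }
    }

    public static void main(String[] args) {
        SimpleTokenSketch lexer = new SimpleTokenSketch("a\\,b,c\nd", ',', '\\');
        Token t;
        do {
            t = lexer.next();
            System.out.println(t.type + ": " + t.content);
        } while (t.type != TT_EOF);
    }
}
```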

unicodeEscapeLexer

protected int unicodeEscapeLexer(int c)
            throws IOException
Decodes Unicode escapes. Interprets "\uXXXX" escape sequences, where XXXX is a hexadecimal number.
Parameters:
c - current char, which is discarded because it is the "\" of "\uXXXX"
Returns:
the decoded character
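
As a rough illustration of the decoding step, the sketch below converts four hexadecimal digits into a character code. It is not the method's actual source: UnicodeEscapeSketch and decodeAfterU are hypothetical names, the input is a String with an explicit index, and rejecting malformed digits with an exception is an assumption.

```java
// Illustrative sketch only; not the CSVParser source.
final class UnicodeEscapeSketch {

    /**
     * Decodes the four hex digits that follow a backslash and a 'u'.
     * The index start must point at the first of the four digits.
     */
    static int decodeAfterU(String input, int start) {
        if (start + 4 > input.length()) {
            throw new IllegalArgumentException("incomplete unicode escape");
        }
        int value = 0;
        for (int i = start; i < start + 4; i++) {
            int digit = Character.digit(input.charAt(i), 16);
            if (digit < 0) {
                throw new IllegalArgumentException("not a hex digit: " + input.charAt(i));
            }
            value = (value << 4) | digit; // shift in four bits per digit
        }
        return value;
    }

    public static void main(String[] args) {
        String raw = "x=" + '\\' + "u0041;"; // backslash, 'u', then the digits 0041
        int start = raw.indexOf('u') + 1;
        System.out.println((char) decodeAfterU(raw, start));
    }
}
```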