org.apache.commons.csv
Class CSVParser
java.lang.Object
org.apache.commons.csv.CSVParser
public class CSVParser
extends java.lang.Object
Parses CSV files according to the specified configuration.
Because CSV appears in many different dialects, the parser supports many
configuration settings by allowing the specification of a CSVStrategy.
Parsing of a csv-string having tabs as separators,
'"' as an optional value encapsulator, and comments starting with '#':
String[][] data =
(new CSVParser(new StringReader("a\tb\nc\td"), new CSVStrategy('\t','"','#'))).getAllValues();
Parsing of a csv-string in Excel CSV format:
String[][] data =
(new CSVParser(new StringReader("a;b\nc;d"), CSVStrategy.EXCEL_STRATEGY)).getAllValues();
Internal parser state is completely covered by the strategy
and the reader-state.
See the package documentation for more details.
(package private) static class CSVParser.Token - Token is an internal token representation.
EMPTY_STRING_ARRAY
private static final String[] EMPTY_STRING_ARRAY
Immutable empty String array.
INITIAL_TOKEN_LENGTH
private static final int INITIAL_TOKEN_LENGTH
Length of the initial token (content) buffer.
TT_EOF
protected static final int TT_EOF
Token (which can have content) when end of file is reached.
TT_EORECORD
protected static final int TT_EORECORD
Token with content when end of a line is reached.
TT_INVALID
protected static final int TT_INVALID
Token has no valid content, i.e. is in its initialized state.
TT_TOKEN
protected static final int TT_TOKEN
Token with content, at beginning or in the middle of a line.
record
private final ArrayList record
A record buffer for getLine(). Grows as necessary and is reused.
CSVParser
public CSVParser(InputStream input)
Use CSVParser(Reader) instead.
The parser follows the default CSVStrategy.
input
- an InputStream containing "csv-formatted" stream
CSVParser
public CSVParser(Reader input)
input
- a Reader containing "csv-formatted" input
CSVParser
public CSVParser(Reader input,
char delimiter)
Use CSVParser(Reader,CSVStrategy) instead.
Customized value delimiter parser.
The parser follows the default CSVStrategy
except for the delimiter setting.
input
- a Reader based on "csv-formatted" input
delimiter
- a char used for value separation
CSVParser
public CSVParser(Reader input,
char delimiter,
char encapsulator,
char commentStart)
Use CSVParser(Reader,CSVStrategy) instead.
Customized csv parser.
The parser parses according to the given CSV dialect settings.
Leading whitespace is trimmed, Unicode escapes are
not interpreted, and empty lines are ignored.
input
- a Reader based on "csv-formatted" input
delimiter
- a char used for value separation
encapsulator
- a char used as value encapsulation marker
commentStart
- a char used for comment identification
CSVParser
public CSVParser(Reader input,
CSVStrategy strategy)
input
- a Reader containing "csv-formatted" input
strategy
- the CSVStrategy used for CSV parsing
encapsulatedTokenLexer
private CSVParser.Token encapsulatedTokenLexer(CSVParser.Token tkn,
int c)
throws IOException
An encapsulated token lexer.
Encapsulated tokens are surrounded by the given encapsulating-string.
The encapsulator itself might be included in the token using a
doubling syntax (as "", '') or using escaping (as in \", \').
Whitespace before and after an encapsulated token is ignored.
tkn
- the current token
c
- the current character
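The doubling syntax described above can be illustrated with a minimal standalone sketch. It is not the library's encapsulatedTokenLexer: the class EncapsulatedTokenSketch and the method readEncapsulated are hypothetical names, the sketch handles only the doubling form (no backslash escaping, no surrounding whitespace), and it assumes the input begins exactly at the opening encapsulator.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;

// Hypothetical illustration only; not part of org.apache.commons.csv.
final class EncapsulatedTokenSketch {

    /**
     * Reads one encapsulated value from the start of the input.
     * A doubled encapsulator inside the value stands for one literal encapsulator.
     */
    static String readEncapsulated(String input, char encapsulator) {
        try (StringReader in = new StringReader(input)) {
            if (in.read() != encapsulator) {
                throw new IllegalArgumentException("value must start with the encapsulator");
            }
            StringBuilder value = new StringBuilder();
            int c;
            while ((c = in.read()) != -1) {
                if (c != encapsulator) {
                    value.append((char) c);      // ordinary character
                    continue;
                }
                // An encapsulator was read: look at the character that follows it.
                int next = in.read();
                if (next == encapsulator) {
                    value.append(encapsulator);  // doubled: keep one literal encapsulator
                } else {
                    return value.toString();     // single: this was the closing encapsulator
                }
            }
            throw new IllegalArgumentException("unterminated encapsulated value");
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // The input is "say ""hi""",next and the value read is: say "hi"
        System.out.println(readEncapsulated("\"say \"\"hi\"\"\",next", '"'));
    }
}
```

The sketch stops at the first encapsulator that is not immediately followed by a second one, which is what makes the doubled form unambiguous.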
getAllValues
public String[][] getAllValues()
throws IOException
Parses the CSV according to the given strategy
and returns the content as an array of records
(where each record is an array of single values).
The returned content starts at the current parse-position in
the stream.
- matrix of records x values ('null' when end of file)
getLine
public String[] getLine()
throws IOException
Parses from the current point in the stream until
the end of the current line.
- array of values until end of line
('null' when end of file has been reached)
getLineNumber
public int getLineNumber()
Returns the current line number in the input stream.
ATTENTION: if your CSV has multi-line values, the returned
number does not correspond to the record number.
getStrategy
public CSVStrategy getStrategy()
Obtains the specified CSV strategy.
- strategy currently being used
isEndOfFile
private boolean isEndOfFile(int c)
- true if the given character indicates end of file
isEndOfLine
private boolean isEndOfLine(int c)
throws IOException
Greedy: accepts \n and \r\n.
When the given character is \r and a \n follows, this checker silently
consumes the second control character.
- true if the given character is a line-terminator
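The greedy behaviour can be sketched as follows. This is an illustration under assumptions, not the library's code: the class GreedyEndOfLineSketch and the helper countLineEnds are hypothetical, and a java.io.PushbackReader stands in for whatever lookahead mechanism the parser uses internally.

```java
import java.io.IOException;
import java.io.PushbackReader;
import java.io.StringReader;
import java.io.UncheckedIOException;

// Hypothetical illustration only; not part of org.apache.commons.csv.
final class GreedyEndOfLineSketch {

    /**
     * Returns true for "\n" and for "\r\n". In the "\r\n" case the "\n" is
     * read from the stream, so the caller never sees it.
     */
    static boolean isEndOfLine(int c, PushbackReader in) throws IOException {
        if (c == '\r') {
            int next = in.read();
            if (next == '\n') {
                c = next;            // silently consume the second control character
            } else if (next != -1) {
                in.unread(next);     // not part of a line break: put it back
            }
        }
        return c == '\n';
    }

    /** Counts the line terminators in the text using the greedy check above. */
    static int countLineEnds(String text) {
        try (PushbackReader in = new PushbackReader(new StringReader(text))) {
            int count = 0;
            int c;
            while ((c = in.read()) != -1) {
                if (isEndOfLine(c, in)) {
                    count++;
                }
            }
            return count;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(countLineEnds("a\r\nb\nc"));
    }
}
```

Because the "\n" of a "\r\n" pair is consumed inside the check, each Windows-style line break is counted once. In this sketch a lone "\r" is not treated as a line terminator, matching the two forms listed above.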
isWhitespace
private boolean isWhitespace(int c)
- true if the given char is a whitespace character
nextToken
protected CSVParser.Token nextToken()
throws IOException
Convenience method for nextToken(null).
nextToken
protected CSVParser.Token nextToken(CSVParser.Token tkn)
throws IOException
Returns the next token.
A token corresponds to a term, a record change or an
end-of-file indicator.
tkn
- an existing Token object to reuse. The caller is responsible for
initializing the Token.
nextValue
public String nextValue()
throws IOException
Parses the CSV according to the given strategy
and returns the next csv-value as string.
- next value in the input stream ('null' when end of file)
simpleTokenLexer
private CSVParser.Token simpleTokenLexer(CSVParser.Token tkn,
int c)
throws IOException
A simple token lexer.
Simple tokens are tokens which are not surrounded by encapsulators.
A simple token might contain escaped delimiters (as \, or \;). The
token is finished when one of the following conditions becomes true:
- end of line has been reached (TT_EORECORD)
- end of stream has been reached (TT_EOF)
- an unescaped delimiter has been reached (TT_TOKEN)
tkn
- the current token
c
- the current character
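The three termination conditions listed above can be sketched in a small standalone scanner. Everything here is hypothetical and only mirrors the description: the class SimpleTokenSketch, its Type enum and its Token holder are not the library's types. The sketch also assumes that the backslash of an escaped delimiter is dropped from the value and that only "\n" ends a line.

```java
// Hypothetical illustration only; not part of org.apache.commons.csv.
final class SimpleTokenSketch {

    /** Stand-ins for the TT_TOKEN, TT_EORECORD and TT_EOF token types. */
    enum Type { TOKEN, EORECORD, EOF }

    static final class Token {
        final Type type;
        final String content;
        final int end;   // index of the first character after this token

        Token(Type type, String content, int end) {
            this.type = type;
            this.content = content;
            this.end = end;
        }
    }

    /** Scans one unencapsulated token starting at index start. */
    static Token next(String input, int start, char delimiter) {
        StringBuilder content = new StringBuilder();
        int i = start;
        while (i < input.length()) {
            char c = input.charAt(i);
            if (c == '\\' && i + 1 < input.length() && input.charAt(i + 1) == delimiter) {
                content.append(delimiter);   // escaped delimiter belongs to the value
                i += 2;
            } else if (c == delimiter) {
                return new Token(Type.TOKEN, content.toString(), i + 1);     // unescaped delimiter
            } else if (c == '\n') {
                return new Token(Type.EORECORD, content.toString(), i + 1);  // end of line
            } else {
                content.append(c);
                i++;
            }
        }
        return new Token(Type.EOF, content.toString(), i);                   // end of stream
    }

    public static void main(String[] args) {
        String input = "a\\,b,c\nd";
        int pos = 0;
        Token t;
        do {
            t = next(input, pos, ',');
            System.out.println(t.type + " [" + t.content + "]");
            pos = t.end;
        } while (t.type != Type.EOF);
    }
}
```

Note that the end-of-stream token can still carry content, as it does for the last value of an input without a trailing line break.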
unicodeEscapeLexer
protected int unicodeEscapeLexer(int c)
throws IOException
Decodes Unicode escapes.
Interpretation of "\\uXXXX" escape sequences
where XXXX is a hex-number.
c
- current char which is discarded because it's the "\\" of "\\uXXXX"
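As an illustration of how "\\uXXXX" sequences can be interpreted, here is a minimal sketch that works on a whole string instead of a character stream. The class UnicodeEscapeSketch and its methods are hypothetical and are not the library's unicodeEscapeLexer. How the library treats malformed sequences is not stated here, so leaving them unchanged is an assumption of this sketch.

```java
// Hypothetical illustration only; not part of org.apache.commons.csv.
final class UnicodeEscapeSketch {

    /**
     * Replaces every backslash, 'u', four-hex-digit sequence with the
     * character it denotes. Malformed sequences are copied unchanged.
     */
    static String decode(String input) {
        StringBuilder out = new StringBuilder(input.length());
        int i = 0;
        while (i < input.length()) {
            char c = input.charAt(i);
            if (c == '\\' && i + 6 <= input.length() && input.charAt(i + 1) == 'u') {
                int code = parseHex4(input, i + 2);
                if (code >= 0) {
                    out.append((char) code);   // the decoded character
                    i += 6;                    // skip backslash, 'u' and four digits
                    continue;
                }
            }
            out.append(c);
            i++;
        }
        return out.toString();
    }

    /** Parses four hex digits starting at index from; returns -1 if any is not a hex digit. */
    static int parseHex4(String s, int from) {
        int value = 0;
        for (int k = from; k < from + 4; k++) {
            int digit = Character.digit(s.charAt(k), 16);
            if (digit < 0) {
                return -1;
            }
            value = value * 16 + digit;
        }
        return value;
    }

    public static void main(String[] args) {
        // The backslash is doubled in the Java literal, so the string holds a literal escape.
        System.out.println(decode("a\\u0041b"));
    }
}
```

The escapes are written with a doubled backslash in the Java literals for the same reason they appear doubled in this documentation: a single backslash followed by u would be translated by the Java compiler before the code ever runs.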