NAME
Text::CSV_XS - comma-separated values manipulation routines
SYNOPSIS
    # Functional interface
    use Text::CSV_XS qw( csv );

    # Read whole file in memory
    my $aoa = csv (in => "data.csv");    # as array of array
    my $aoh = csv (in => "data.csv",
                   headers => "auto");   # as array of hash

    # Write array of arrays as csv file
    csv (in => $aoa, out => "file.csv", sep_char => ";");

    # Only show lines where "code" is odd
    csv (in => "data.csv", filter => { code => sub { $_ % 2 }});

    # Object interface
    use Text::CSV_XS;

    my @rows;
    # Read/parse CSV
    my $csv = Text::CSV_XS->new ({ binary => 1, auto_diag => 1 });
    open my $fh, "<:encoding(utf8)", "test.csv" or die "test.csv: $!";
    while (my $row = $csv->getline ($fh)) {
        $row->[2] =~ m/pattern/ or next; # 3rd field should match
        push @rows, $row;
    }
    close $fh;

    # and write as CSV
    open $fh, ">:encoding(utf8)", "new.csv" or die "new.csv: $!";
    $csv->say ($fh, $_) for @rows;
    close $fh or die "new.csv: $!";
DESCRIPTION
Text::CSV_XS provides facilities for the composition and decomposition of comma-separated values. An instance of the Text::CSV_XS class will combine fields into a "CSV" string and parse a "CSV" string into fields.
The module accepts either strings or files as input and supports the use of user-specified characters for delimiters, separators, and escapes.
Embedded newlines
Important Note: The default behavior is to accept only ASCII characters in the range from 0x20 (space) to 0x7E (tilde). This means that fields cannot contain newlines. If your data contains newlines embedded in fields, characters above 0x7E (tilde), or binary data, you must set "binary => 1" in the call to "new". To cover the widest range of parsing options, you will always want to set binary.
But you still have to pass a complete, correct record to the "parse" method, which is more complicated than typical usage suggests:
    my $csv = Text::CSV_XS->new ({ binary => 1, eol => $/ });
    while (<>) {    # WRONG!
        $csv->parse ($_);
        my @fields = $csv->fields ();
    }
This will break, because the "while" loop might read broken lines: it does not care about the quoting. If you need to support embedded newlines, the way to go is to not pass "eol" to the parser (it accepts "\n", "\r", and "\r\n" by default) and then use "getline":
    my $csv = Text::CSV_XS->new ({ binary => 1 });
    open my $fh, "<", $file or die "$file: $!";
    while (my $row = $csv->getline ($fh)) {
        my @fields = @$row;
    }
The old(er) way of using global file handles is still supported:
    while (my $row = $csv->getline (*ARGV)) { ... }
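To see why "getline" is the safer route, here is a minimal sketch that reads a small made-up sample containing a quoted embedded newline. The sample data and the in-memory file handle are assumptions for illustration only; a real program would open a file instead.

```perl
use strict;
use warnings;
use Text::CSV_XS;

# Made-up sample: the second field of record 1 holds a quoted newline.
my $data = qq{1,"line one\nline two",x\n2,plain,y\n};

# Naive splitting on newlines sees three physical lines ...
my @physical_lines = split m/\n/, $data;

# ... whereas getline honours the quoting and yields two records.
my $csv = Text::CSV_XS->new ({ binary => 1, auto_diag => 1 });
open my $fh, "<", \$data or die "in-memory handle: $!";
my @records;
while (my $row = $csv->getline ($fh)) {
    push @records, $row;
}
close $fh;

printf "%d physical lines, %d records\n",
    scalar @physical_lines, scalar @records;
```

The embedded newline survives inside the second field of the first record, which is what a line-by-line "parse" loop cannot guarantee.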
Unicode
Unicode is only tested to work with perl-5.8.2 and up.
See also "BOM".
The simplest way to ensure the correct encoding is used for input and output is by either setting layers on the filehandles, or setting the "encoding" argument for "csv":
    open my $fh, "<:encoding(UTF-8)", "in.csv"  or die "in.csv: $!";
or
    my $aoa = csv (in => "in.csv",     encoding => "UTF-8");

    open my $fh, ">:encoding(UTF-8)", "out.csv" or die "out.csv: $!";
or
    csv (in => $aoa, out => "out.csv", encoding => "UTF-8");
On parsing (both for "getline" and "parse"), if the source is marked as being UTF8, then all fields that are marked binary will also be marked UTF8.
On combining ("print" and "combine"): if any of the combining fields was marked UTF8, the resulting string will be marked as UTF8. Note however that any fields that come before the first field marked UTF8 and that contain 8-bit characters which were not upgraded to UTF8 will remain "bytes" in the resulting string, possibly causing unexpected errors. If you pass data in different encodings, or you do not know whether the encodings differ, force all fields to be upgraded before you pass them on:
    $csv->print ($fh, [ map { utf8::upgrade (my $x = $_); $x } @data ]);
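As a rough illustration of what that "map" does, the sketch below upgrades copies of two made-up field values and checks the UTF8 flag before and after. The sample values are assumptions; only the core "utf8::upgrade" and "utf8::is_utf8" functions are used.

```perl
use strict;
use warnings;

# Made-up field values: "caf\xE9" is an 8-bit (Latin-1) string that is not
# marked UTF8; "\x{20AC}5" (EURO SIGN) is marked UTF8 from the start.
my @data = ("caf\xE9", "\x{20AC}5");

# Upgrade copies, so the caller's @data is left untouched.
my @upgraded = map { utf8::upgrade (my $x = $_); $x } @data;

for my $i (0 .. $#data) {
    printf "field %d: flagged before=%d, flagged after=%d, same text=%d\n", $i,
        utf8::is_utf8 ($data[$i])     ? 1 : 0,
        utf8::is_utf8 ($upgraded[$i]) ? 1 : 0,
        $data[$i] eq $upgraded[$i]    ? 1 : 0;
}
```

Upgrading changes only the internal representation: the characters compare equal, but every field now carries the UTF8 mark, so the combined string is consistent.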
For complete control over encoding, please use Text::CSV::Encoded:
    use Text::CSV::Encoded;
    my $csv = Text::CSV::Encoded->new ({
        encoding_in  => "iso-8859-1", # the encoding comes into Perl
        encoding_out => "cp1252",     # the encoding comes out of Perl
    });

    $csv = Text::CSV::Encoded->new ({ encoding => "utf8" });
    # combine () and print () accept *literally* utf8 encoded data
    # parse () and getline () return *literally* utf8 encoded data

    $csv = Text::CSV::Encoded->new ({ encoding => undef }); # default
    # combine () and print () accept UTF8 marked data
    # parse () and getline () return UTF8 marked data
BOM
BOM (or Byte Order Mark) handling is available only inside the "header" method. This method supports the following encodings: "utf-8", "utf-1", "utf-32be", "utf-32le", "utf-16be", "utf-16le", "utf-ebcdic", "scsu", "bocu-1", and "gb-18030". See Wikipedia's article on the byte order mark.
If a file has a BOM, the easiest way to deal with that is:
    my $aoh = csv (in => $file, detect_bom => 1);
All records will be encoded based on the detected BOM.
This implies a call to the "header" method, which by default also sets "column_names". So this is not the same as
    my $aoh = csv (in => $file, headers => "auto");
which only reads the first record to set "column_names" but ignores the meaning of any BOM that may be present.
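The difference between the two calls can be explored with a small sketch like the one below. It writes a made-up two-line file that starts with the UTF-8 BOM bytes and reads it back both ways; the temporary file, its contents, and the variable names are assumptions for illustration.

```perl
use strict;
use warnings;
use File::Temp qw( tempfile );
use Text::CSV_XS qw( csv );

# Write a tiny made-up CSV file that starts with a UTF-8 BOM.
my ($tfh, $file) = tempfile (SUFFIX => ".csv", UNLINK => 1);
binmode $tfh;
print $tfh "\xEF\xBB\xBF", "code,name\n", "1,foo\n";
close $tfh or die "$file: $!";

# detect_bom goes through the "header" method, which strips the BOM.
my $with_bom = csv (in => $file, detect_bom => 1);

# headers => "auto" takes the first record as-is for the column names.
my $plain    = csv (in => $file, headers => "auto");

printf "detect_bom:      has 'code' key: %s\n",
    exists $with_bom->[0]{code} ? "yes" : "no";
printf "headers => auto: has 'code' key: %s\n",
    exists $plain->[0]{code}    ? "yes" : "no";
```

With "detect_bom" the first column name is the clean "code"; with "headers => "auto"" the BOM bytes are not interpreted and can end up as part of the first column name.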
SPECIFICATION
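While no formal specification for CSV exists, RFC 4180 (1) describes the common format and establishes "text/csv" as the MIME type registered with the IANA. RFC 7111 (2) adds fragment identifiers to CSV.
Many informal documents exist that describe the "CSV" format. "How To: The Comma Separated Value (CSV) File Format" (3) provides an overview of the "CSV" format as used in widespread applications.
1) http://tools.ietf.org/html/rfc4180
2) http://tools.ietf.org/html/rfc7111
3) http://www.creativyst.com/Doc/Articles/CSV/CSV01.htm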
The basic rules are as follows:
CSV is a delimited data format that has fields/columns separated by the comma character and records/rows separated by newlines. Fields that contain a special character (comma, newline, or double quote), must be enclosed in double quotes. However, if a line contains a single entry that is the empty string, it may be enclosed in double quotes. If a field’s value contains a double quote character it is escaped by placing another double quote character next to it. The “CSV” file format does not require a specific character encoding, byte order, or line terminator format.
• Each record is a single line ended by a line feed (ASCII/"LF"=0x0A) or a carriage return and line feed pair (ASCII/"CRLF"="0x0D 0x0A"), however, line-breaks may be embedded.
• Fields are separated by commas.
• Allowable characters within a "CSV" field include 0x09 ("TAB") and the inclusive range of 0x20 (space) through 0x7E (tilde). In binary mode all characters are accepted, at least in quoted fields.
• A field within "CSV" must be surrounded by double-quotes to contain a separator character (comma).
Though this is the clearest and most restrictive definition, Text::CSV_XS is way more liberal than this and allows extensions:
• Line termination by a single carriage return is accepted by default.
• The separation, quote, and escape characters can be any ASCII character in the range from 0x20 (space) to 0x7E (tilde). Characters outside this range may or may not work as expected. Multibyte characters, like UTF "U+060C" (ARABIC COMMA), "U+FF0C" (FULLWIDTH COMMA), "U+241B" (SYMBOL FOR ESCAPE), "U+2424" (SYMBOL FOR NEWLINE), "U+FF02" (FULLWIDTH QUOTATION MARK), and "U+201C" (LEFT DOUBLE QUOTATION MARK) (to give some examples of what might look promising) work for newer versions of perl for "sep_char" and "quote_char", but not for "escape_char".
  If you use perl-5.8.2 or higher these three attributes are utf8-decoded, to increase the likelihood of success. This way "U+00FE" will be allowed as a quote character.
• A field in "CSV" must be surrounded by double-quotes to make an embedded double-quote, represented by a pair of consecutive double-quotes, valid. In binary mode you may additionally use the sequence "0 (a double quote followed by the digit zero) for representation of a NULL byte. Using 0x00 in binary mode is just as valid.
• Several violations of the above specification may be lifted by passing some options as attributes to the object constructor.
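The quoting and escaping rules in this section can be observed with a short round trip through "combine" and "parse". This is only a sketch: the field values are made up, and the default separator, quote, and escape characters are assumed.

```perl
use strict;
use warnings;
use Text::CSV_XS;

my $csv = Text::CSV_XS->new ({ binary => 1, auto_diag => 1 });

# Made-up fields: plain text, an embedded comma, embedded double quotes.
my @fields = ("plain", "with,comma", 'say "hi"');

# Only the fields that need it are quoted; embedded quotes are doubled.
$csv->combine (@fields) or die "combine failed";
my $line = $csv->string;
print "$line\n";    # plain,"with,comma","say ""hi"""

# Parsing the generated line gives back the original fields.
$csv->parse ($line) or die "parse failed";
my @back = $csv->fields;
print join (" | ", @back), "\n";
```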
METHODS
version
(Class method) Returns the current module version.
new
(Class method) Returns a new instance of class Text::CSV_XS. The attributes are described by the (optional) hash ref "\%attr".
    my $csv = Text::CSV_XS->new ({ attributes ... });
The following attributes are available:
eol
    my $csv = Text::CSV_XS->new ({ eol => $/ });
              $csv->eol (undef);
    my $eol = $csv->eol;
The end-of-line string to add to rows for "print" or the record separator for "getline".
When not passed in a parser instance, the default behavior is to accept "\n", "\r", and "\r\n", so it is probably safer to not specify "eol" at all. Passing "undef" or the empty string behaves the same.
When not passed in a generating instance, records are not terminated at all, so it is probably wise to pass something you expect. A safe choice for "eol" on output is either $/ or "\r\n".
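The effect of "eol" on output can be checked with a small sketch that prints the same made-up rows through two instances, one without "eol" and one with "\r\n". The in-memory output buffers and variable names are assumptions for illustration.

```perl
use strict;
use warnings;
use Text::CSV_XS;

my $bare = Text::CSV_XS->new ({ binary => 1 });                 # no eol
my $crlf = Text::CSV_XS->new ({ binary => 1, eol => "\r\n" });  # explicit eol

# In-memory buffers stand in for real output files.
my ($buf_bare, $buf_crlf) = ("", "");
open my $out_bare, ">", \$buf_bare or die "in-memory handle: $!";
open my $out_crlf, ">", \$buf_crlf or die "in-memory handle: $!";
binmode $_ for $out_bare, $out_crlf;

for my $row ([ 1, 2, 3 ], [ 4, 5, 6 ]) {
    $bare->print ($out_bare, $row);
    $crlf->print ($out_crlf, $row);
}
close $out_bare;
close $out_crlf;

# Show the raw bytes with the line endings made visible.
for ($buf_bare, $buf_crlf) {
    (my $shown = $_) =~ s/\r/\\r/g;
    $shown =~ s/\n/\\n/g;
    print "[$shown]\n";
}
```

Without "eol" the two records run together in the output, which is why passing an explicit value is the safer choice when generating CSV.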
Common values for “eol” are “