Using awk
to deal with CSV that uses quoted/unquoted delimiters¶
CSV files are a mess, yes.
Assume you have CSV files that use the comma as delimiter and quoted data fields that can contain the delimiter.
"first", "second", "last"
"fir,st", "second", "last"
"firtst one", "sec,ond field", "final,ly"
Simply using the comma as separator for awk
won't work here, of course.
Solution: Use the field separator ", "|^"|"$
for awk
.
This is an OR-ed list of 3 possible separators:
", " | matches the area between the datafields |
^" | matches the area left of the first datafield |
"$ | matches the area right of the last data field |
You can tune these delimiters if you have other needs (for example if you don't have a space after the commas).
Test:
The awk
command used for the CSV above just prints the fileds separated by ###
to see what's going on:
$ awk -v FS='", "|^"|"$' '{print $2"###"$3"###"$4}' data.csv
first###second###last
fir,st###second###last
firtst one###sec,ond field###final,ly
ATTENTION If the CSV data changes its format every now and then (for example it only quotes the data fields if needed, not always), then this way will not work.