Table of Contents
How to create a CSV file for the Excel Spreadsheet
Recently I've devoted some time to make some tests on what's the best way to store content programmatically for the excel spreadsheet. In other words, to make a program to output data that can be open directly by the microsoft excell program.
I've achieved some interesting results that I want to write down. And share with whatever reader you are.
The CSV format
The CSV format stands for “Comma Separated Values”. This file format isn't a “formal” format in the sense that is strictly defined and standarized, on the contrary, it is loosely defined and the implementarions may vary.
What I've find out researching into the way Excel stores this information and making some tries is this:
The general format is this:
value11<separator>value12<separator>value13\r\n value21<separator>value22<separator>value23\r\n
As <separator>
you can use comma (,), semicolon (;) and tab characters (vbTab, 0x09)
If the character <separator>
and in one of the values is a separator character, you have to put it among double quotes. Let's say that the separator is semicolon, and the first value is “hello;world”. The final codification of this would be:
"hello;world";value12;value13\r\n value21;value22;value23\r\n
How the carriage returns in the values are treated??? They are treated the same as the separator: if they are among a value, you have to put the value into a double quotes. Let's say that the third value has a carriage return. In that case, (foo\r\nbar)
, then you have to code it like this:
"hello;world";value12;"foo\r\nbar"\r\n value21;value22;value23\r\n
What happens with the lenght of the values??? The maximum length per record is 32,768 bytes (sounds familiar, isn't it?). Pay attention to the “bytes”, because if we are using Unicode as our encoding standard, the number of characters might not be equal to the number of bytes. You have to cut values longer than that because otherwise, excell will do it for you, putting the rest of the data in the following line.
The best way to encode a CSV format
After many tries, I've discovered that:
- You can use Unicode as an enconding format for your CSV files
- If you use Unicode, the best way to encode is to use <tab> as a field separator: the rest of separators (semicolon, comma) are not recognized for Excel as a proper CSV file, and then will force the user to use the “import as text” facility which is a little bit more cumbersome that just double click on the file.
A reference implementation in visual basic script
Here is my reference implementation in visual basic script (provided the input comes from a recordset), which can be easily converted into other programming language like java.
As a safety precaution, I've put behind double quotes any value that contains a “cr” or “lf” character in it.
'------------------------------------------------------------------- ' ' csv.vbs - helper functions for writing csv files ' '-------------------------------------------------------------------- function openCsv( byval filename ) dim fso dim textFile dim overwrite dim useUnicode overwrite = true useUnicode = true set fso = createobject( "Scripting.FileSystemObject" ) set openCsv = fso.createTextFile( filename, overWrite, useUnicode ) end function 'openCsv function closeCsv( byval fileObject ) fileObject.close set fileObject = nothing end function ' fileObject function writeHeaderCsv( byval fileObject, byval recordset ) dim nCont nCont = 0 do while nCont < recordset.fields.count fileObject.write item(recordset.fields(nCont).name) & recordSep() nCont = nCont + 1 loop ' nCont fileObject.write lineSep() end function ' writeHeaderCsv ' prints the current line of the ' recorset as a cvs line function writeLineCsv( byval fileObject, byval recordset ) dim nCont nCont = 0 do while nCont < recordset.fields.count fileObject.write item(recordset.fields(nCont).value) & recordSep() nCont = nCont + 1 loop ' nCont fileObject.writeline end function ' writeLine function lineSep() lineSep = vbLf end function ' lineSetp ' after trying several alternatives, the only ' separator that makes the excel correctly interpret ' the differents cells is vbTab, so I left as ' the default separator function recordSep() recordSep = vbTab ' ";" end function ' recordSep ' escape of items: ' value es;to must be set as "es;to" (with double quotes) ' value es"to" must be set as es""to"" (doubling the double quotes) function item( byval value ) if isempty( value ) then value = "" end if ' isempty if isnull( value ) then value = "" end if ' isnull item = value ' cut items longer than 32768 bytes long ' as you can imagine "bytes" doesn't correspond ' to "chars" in utf, so, to be sure, will cut ' to a prudent 30767 char (2000 bytes less) if len( item ) > 30767 then item = mid( item, 1, 30767 ) & "..." end if ' length( item ) item = replace( item, """", """""" ) if instr( item, vbCr ) > 0 or instr( item, vbLf ) > 0 then item = """" & item & """" end if ' instr( item if instr( item, recordSep() ) > 0 then item = """" & item & """" end if ' instr( item end function ' item
~~~DISQUS~~~