User Tools

Site Tools


windows:aboutexcelcsvformat

How to create a CSV file for the Excel Spreadsheet

Recently I've devoted some time to make some tests on what's the best way to store content programmatically for the excel spreadsheet. In other words, to make a program to output data that can be open directly by the microsoft excell program.

I've achieved some interesting results that I want to write down. And share with whatever reader you are.

The CSV format

The CSV format stands for “Comma Separated Values”. This file format isn't a “formal” format in the sense that is strictly defined and standarized, on the contrary, it is loosely defined and the implementarions may vary.

What I've find out researching into the way Excel stores this information and making some tries is this:

The general format is this:

  
value11<separator>value12<separator>value13\r\n
value21<separator>value22<separator>value23\r\n

As <separator> you can use comma (,), semicolon (;) and tab characters (vbTab, 0x09)

If the character <separator> and in one of the values is a separator character, you have to put it among double quotes. Let's say that the separator is semicolon, and the first value is “hello;world”. The final codification of this would be:

  
"hello;world";value12;value13\r\n
value21;value22;value23\r\n

How the carriage returns in the values are treated??? They are treated the same as the separator: if they are among a value, you have to put the value into a double quotes. Let's say that the third value has a carriage return. In that case, (foo\r\nbar), then you have to code it like this:

  
"hello;world";value12;"foo\r\nbar"\r\n
value21;value22;value23\r\n

What happens with the lenght of the values??? The maximum length per record is 32,768 bytes (sounds familiar, isn't it?). Pay attention to the “bytes”, because if we are using Unicode as our encoding standard, the number of characters might not be equal to the number of bytes. You have to cut values longer than that because otherwise, excell will do it for you, putting the rest of the data in the following line.

The best way to encode a CSV format

After many tries, I've discovered that:

  • You can use Unicode as an enconding format for your CSV files
  • If you use Unicode, the best way to encode is to use <tab> as a field separator: the rest of separators (semicolon, comma) are not recognized for Excel as a proper CSV file, and then will force the user to use the “import as text” facility which is a little bit more cumbersome that just double click on the file.

A reference implementation in visual basic script

Here is my reference implementation in visual basic script (provided the input comes from a recordset), which can be easily converted into other programming language like java.

As a safety precaution, I've put behind double quotes any value that contains a “cr” or “lf” character in it.

'-------------------------------------------------------------------
'
' csv.vbs - helper functions for writing csv files
'
'--------------------------------------------------------------------
 
function openCsv( byval filename ) 
 
	dim fso
	dim textFile 
	dim overwrite 
	dim useUnicode
 
	overwrite = true
	useUnicode = true
	set fso = createobject( "Scripting.FileSystemObject" )
	set openCsv = fso.createTextFile( filename, overWrite, useUnicode )
 
end function 'openCsv
 
 
function closeCsv( byval fileObject )
 
	fileObject.close
 
	set fileObject = nothing 
 
end function ' fileObject 
 
function writeHeaderCsv( byval fileObject, byval recordset )
 
	dim nCont 
 
	nCont = 0
	do while nCont < recordset.fields.count 
		fileObject.write item(recordset.fields(nCont).name) & recordSep()
		nCont = nCont + 1
	loop ' nCont 
	fileObject.write lineSep()
 
end function ' writeHeaderCsv
 
' prints the current line of the 
' recorset as a cvs line
function writeLineCsv( byval fileObject, byval recordset ) 
 
	dim nCont 
 
	nCont = 0
	do while nCont < recordset.fields.count 
		fileObject.write item(recordset.fields(nCont).value) & recordSep()
		nCont = nCont + 1
	loop ' nCont 
	fileObject.writeline
 
end function ' writeLine
 
function lineSep()
	lineSep = vbLf
end function ' lineSetp
 
' after trying several alternatives, the only 
' separator that makes the excel correctly interpret
' the differents cells is vbTab, so I left as 
' the default separator
function recordSep()
	recordSep = vbTab ' ";"
end function ' recordSep
 
 
' escape of items: 
' value es;to must be set as "es;to"  (with double quotes)
' value es"to" must be set as es""to"" (doubling the double quotes)
function item( byval value )
 
	if isempty( value ) then 
		value = ""
	end if ' isempty
	if isnull( value ) then 
		value = ""
	end if ' isnull 
 
	item = value 
	' cut items longer than 32768 bytes long
	' as you can imagine "bytes" doesn't correspond 
	' to "chars" in utf, so, to be sure, will cut 
	' to a prudent 30767 char (2000 bytes less)
	if len( item ) > 30767 then 
		item = mid( item, 1, 30767 ) & "..."
	end if ' length( item ) 
	item = replace( item, """", """""" )
	if instr( item, vbCr ) > 0 or instr( item, vbLf ) > 0 then 
		item = """" & item & """"
	end if ' instr( item
	if instr( item, recordSep() ) > 0 then 
		item = """" & item & """"
	end if ' instr( item
end function ' item

~~~DISQUS~~~

windows/aboutexcelcsvformat.txt · Last modified: 2014/12/24 10:42 (external edit)