User Tools

Site Tools


windows:aboutexcelcsvformat

Differences

This shows you the differences between two versions of the page.

Link to this comparison view

Next revision
Previous revision
windows:aboutexcelcsvformat [2014/10/29 09:59] – created rlunarowindows:aboutexcelcsvformat [2022/12/02 22:02] (current) – external edit 127.0.0.1
Line 7: Line 7:
 ===== The CSV format ===== ===== The CSV format =====
  
-The [[http://en.wikipedia.org/wiki/Comma-separated_values|CSV format]] stands for +The [[http://en.wikipedia.org/wiki/Comma-separated_values|CSV format]] stands for "Comma Separated Values". This file format isn't a "formal" format in the sense that is strictly defined and standarized, on the contrary, it is loosely defined and the implementarions may vary. 
  
 +What I've find out researching into the way Excel stores this information and making some tries is this: 
 +
 +The general format is this: 
 +
 +<code>  
 +value11<separator>value12<separator>value13\r\n
 +value21<separator>value22<separator>value23\r\n
 +</code>
 +
 +As ''<separator>'' you can use comma (,), semicolon (;) and tab characters (vbTab, 0x09)
 +
 +If the character ''<separator>'' and in one of the values is a separator character, you have to put it among double quotes. Let's say that the separator is semicolon, and the first value is "hello;world". The final codification of this would be: 
 +  
 +<code>  
 +"hello;world";value12;value13\r\n
 +value21;value22;value23\r\n
 +</code>
 +  
 +**How the carriage returns in the values are treated???** They are treated the same as the separator: if they are among a value, you have to put the value into a double quotes. Let's say that the third value has a carriage return. In that case, ''(foo\r\nbar)'', then you have to code it like this:
 +  
 +
 +<code>  
 +"hello;world";value12;"foo\r\nbar"\r\n
 +value21;value22;value23\r\n
 +</code>
 +
 +**What happens with the lenght of the values???** The maximum length per record is 32,768 **bytes** (sounds familiar, isn't it?). Pay attention to the "bytes", because if we are using Unicode as our encoding standard, the number of characters might not be equal to the number of bytes. You have to cut values longer than that because otherwise, excell will do it for you, putting the rest of the data in the following line. 
 +  
 +  
 +===== The best way to encode a CSV format =====
 +
 +After many tries, I've discovered that: 
 +
 +  * You can use Unicode as an enconding format for your CSV files
 +  * If you use Unicode, the best way to encode is to use <tab> as a field separator: the rest of separators (semicolon, comma) are not recognized for Excel as a proper CSV file, and then will force the user to use the "import as text" facility which is a little bit more cumbersome that just double click on the file. 
 +
 +===== A reference implementation in visual basic script =====
 +
 +Here is my reference implementation in visual basic script (provided the input comes from a recordset), which can be easily converted into other programming language like java.
 +
 +As a safety precaution, I've put behind double quotes any value that contains a "cr" or "lf" character in it.
 +
 +<code vbs>
 +'-------------------------------------------------------------------
 +'
 +' csv.vbs - helper functions for writing csv files
 +'
 +'--------------------------------------------------------------------
 +
 +function openCsv( byval filename ) 
 +
 + dim fso
 + dim textFile 
 + dim overwrite 
 + dim useUnicode
 +
 + overwrite = true
 + useUnicode = true
 + set fso = createobject( "Scripting.FileSystemObject" )
 + set openCsv = fso.createTextFile( filename, overWrite, useUnicode )
 +
 +end function 'openCsv
 +
 +
 +function closeCsv( byval fileObject )
 +
 + fileObject.close
 +
 + set fileObject = nothing 
 +
 +end function ' fileObject 
 +
 +function writeHeaderCsv( byval fileObject, byval recordset )
 +
 + dim nCont 
 +
 + nCont = 0
 + do while nCont < recordset.fields.count 
 + fileObject.write item(recordset.fields(nCont).name) & recordSep()
 + nCont = nCont + 1
 + loop ' nCont 
 + fileObject.write lineSep()
 +
 +end function ' writeHeaderCsv
 +
 +' prints the current line of the 
 +' recorset as a cvs line
 +function writeLineCsv( byval fileObject, byval recordset ) 
 +
 + dim nCont 
 +
 + nCont = 0
 + do while nCont < recordset.fields.count 
 + fileObject.write item(recordset.fields(nCont).value) & recordSep()
 + nCont = nCont + 1
 + loop ' nCont 
 + fileObject.writeline
 +
 +end function ' writeLine
 +
 +function lineSep()
 + lineSep = vbLf
 +end function ' lineSetp
 +
 +' after trying several alternatives, the only 
 +' separator that makes the excel correctly interpret
 +' the differents cells is vbTab, so I left as 
 +' the default separator
 +function recordSep()
 + recordSep = vbTab ' ";"
 +end function ' recordSep
 +
 +
 +' escape of items: 
 +' value es;to must be set as "es;to"  (with double quotes)
 +' value es"to" must be set as es""to"" (doubling the double quotes)
 +function item( byval value )
 +
 + if isempty( value ) then 
 + value = ""
 + end if ' isempty
 + if isnull( value ) then 
 + value = ""
 + end if ' isnull 
 +
 + item = value 
 + ' cut items longer than 32768 bytes long
 + ' as you can imagine "bytes" doesn't correspond 
 + ' to "chars" in utf, so, to be sure, will cut 
 + ' to a prudent 30767 char (2000 bytes less)
 + if len( item ) > 30767 then 
 + item = mid( item, 1, 30767 ) & "..."
 + end if ' length( item ) 
 + item = replace( item, """", """""" )
 + if instr( item, vbCr ) > 0 or instr( item, vbLf ) > 0 then 
 + item = """" & item & """"
 + end if ' instr( item
 + if instr( item, recordSep() ) > 0 then 
 + item = """" & item & """"
 + end if ' instr( item
 +end function ' item 
 +
 +
 +
 +</code>
 +
 +
 +
 +----
 +
 +~~~DISQUS~~~
  
  
windows/aboutexcelcsvformat.1414573140.txt.gz · Last modified: 2022/12/02 22:02 (external edit)