Using the comma separated values (CSV) and text delimited formats.
Introduction.
This article describes the characteristics of CSV and delimited formats for reading and saving data. The formats are introduced and alternatives are considered. Rules are provided for their use in files and web pages. Lastly, the implications of using CSV and delimited formats instead of the alternatives are covered.
Background.
This article is intended as an aid to understanding CSV and delimited formats. Some experience with applications that use flat files, such as databases and spreadsheets, is recommended.
GRML, or General Reuse Markup Language, is a file format and markup language with the columns and results of CSV and delimited formats and the tags and nesting of XML.
Defining CSV and Delimited formats.
CSV (Comma Separated Values) and delimited formats use a character to separate values of data. This character is called a delimiter. A delimited format is defined by the delimiter it uses to separate values. For example, the CSV format uses a comma delimiter, so it is a comma delimited format. A file using a TAB delimiter is in TAB delimited format. Unlike other formats, such as HTML or XML, the delimited formats do not use tags to define values.
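As a minimal sketch of the idea, the Python below splits one line at each occurrence of a delimiter. The helper name `split_values` and the sample lines are assumptions made for illustration; they are not part of any GRML tool.

```python
def split_values(line: str, delimiter: str) -> list[str]:
    """Split one line into values at every occurrence of the delimiter."""
    return line.split(delimiter)


# The same three values, once comma delimited (CSV) and once TAB delimited.
comma_line = "100,London,UK"
tab_line = "100\tLondon\tUK"

print(split_values(comma_line, ","))
print(split_values(tab_line, "\t"))
```

Only the delimiter changes between the two calls; the values that come back are the same.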
Delimited formats are used to exchange data between applications, read data from files, or save data to files. Many databases, spreadsheets, and web browsers read and save files using a delimited format. For example, Microsoft Excel, Microsoft Access, and Microsoft SQL Server read and save files using CSV and delimited formats.
With the emergence of XML, some have considered CSV and delimited formats to be legacy formats. With many new applications adding XML as an option to read and save data, the question has been asked whether other formats are necessary. Since the largest software providers (Microsoft, IBM, and Oracle) continue to support CSV and delimited formats, the question may be irrelevant. In addition, so many legacy systems depend on CSV and delimited formats that they have become a de facto industry standard.
XML is used in configuration files. These files are not designed to grow in size or to change frequently. They are used to set application, operating system, or user settings. However, as data size increases, so does the overhead of using XML over CSV and delimited formats. Generally, the larger the data and the more frequent the changes, the greater the cost to use XML.
With an alternative to CSV and delimited formats in mind, there is context for when and how they are used. With this context, rules are needed for use in file and web browsers. While there are many ways to read these formats, the rules described here are those compatible with GRML file and web browsers.
Rules for Use.
When reading or saving CSV and delimited files or web pages, certain rules are followed. Below are explanations of these rules for GRML file and web browsers.
- Each record is on one line AND the record separator consists of a line feed (ASCII/LF=0x0A), or a carriage return and line feed pair (ASCII/CRLF=0x0D 0x0A).
100, 1935 Sixth St., London, UK
35, 709 W. 42nd Str., New York, USA
929, 1 Pacific Coast Highway, Los Angeles, USA
120, 59 Avenida Atlantica, Rio de Janeiro, Brasil
Using a comma delimiter, four (4) records are created.
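The record rule can be sketched in Python as follows. This is an illustration under stated assumptions, not the GRML implementation: the function name `split_records` is hypothetical, and the sample text mixes LF and CRLF separators on purpose to show that both are accepted.

```python
def split_records(text: str) -> list[str]:
    """Split text into records at LF or CRLF record separators."""
    # Treat CRLF as a single separator by normalizing it to LF first.
    lines = text.replace("\r\n", "\n").split("\n")
    # A separator after the last record leaves one empty piece; drop it.
    if lines and lines[-1] == "":
        lines.pop()
    return lines


text = (
    "100, 1935 Sixth St., London, UK\r\n"
    "35, 709 W. 42nd Str., New York, USA\n"
    "929, 1 Pacific Coast Highway, Los Angeles, USA\r\n"
    "120, 59 Avenida Atlantica, Rio de Janeiro, Brasil\n"
)

records = split_records(text)
print(len(records))  # 4
```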
- Fields are separated by a character, or delimiter.
John,Doe,120 Any St.,"Anytown - VA",08123,
Using a comma delimiter, there are five (5) values.
John
Doe
120 Any St.
"Anytown - VA"
08123
- Leading and trailing space-characters adjacent to delimiters are ignored.
John , Doe ,
Using a comma delimiter, the values are,
John
Doe
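The two rules above (splitting at the delimiter and ignoring adjacent spaces) can be sketched together. The function name `split_fields` is hypothetical. The sketch also assumes that a delimiter at the very end of a record closes the last value instead of starting an empty one, which is what the value counts in the examples above imply.

```python
def split_fields(record: str, delimiter: str = ",") -> list[str]:
    """Split one record into values, ignoring spaces next to delimiters."""
    # Assumption: a trailing delimiter ends the last value and does not
    # create an extra empty value.
    if record.endswith(delimiter):
        record = record[: -len(delimiter)]
    # Split at each delimiter, then drop leading and trailing spaces.
    return [value.strip(" ") for value in record.split(delimiter)]


print(split_fields('John,Doe,120 Any St.,"Anytown - VA",08123,'))
print(split_fields("John , Doe ,"))
```

Note that the quotation marks stay part of the value, as in the listing above.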
- There are no embedded delimiters in a record.
In the earlier example, "Anytown - VA" is one value. If it is changed to "Anytown, VA", the values are,
"Anytown
VA"
To read as one value, use a delimiter other than a comma.
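A short sketch of this rule, using plain string splitting. The TAB delimited variant of the record is an assumption added for illustration. Quote-aware readers, such as Python's `csv` module, behave differently; the sketch follows the rule as stated here, where quotation marks do not protect a delimiter.

```python
# With a comma delimiter, the comma inside the quotation marks still splits.
comma_record = 'John,Doe,120 Any St.,"Anytown, VA",08123'
comma_values = comma_record.split(",")
print(comma_values)

# With a TAB delimiter, the embedded comma is ordinary text.
tab_record = 'John\tDoe\t120 Any St.\t"Anytown, VA"\t08123'
tab_values = tab_record.split("\t")
print(tab_values)
```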
- There are no embedded line-breaks in a record.
Conference room 1, "John,
Please bring the M. Mathers file for review
-J.L.
",10/18/2002
Using a comma delimiter, four records are created. The values of the first record are,
Conference room 1
"John
The value of the second is,
Please bring the M. Mathers file for review
The value of the third is,
-J.L.
The values of the fourth are,
"
10/18/2002
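The effect of an embedded line-break can be reproduced with a few lines of Python. This is a sketch only: it splits the example text at LF separators and shows that the quoted note does not stay in one record.

```python
# The example above as one string; "\n" marks each line-break.
text = (
    'Conference room 1, "John,\n'
    "Please bring the M. Mathers file for review\n"
    "-J.L.\n"
    '",10/18/2002'
)

# Every line-break ends a record, including those inside the quotation marks.
records = text.replace("\r\n", "\n").split("\n")

for number, record in enumerate(records, start=1):
    print(number, record)
```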
- The first record in a delimited file may be a header record containing column (field) names. Delimited formats do not have a mechanism for describing the first record as a header row. However, GRML file and web browsers use the first record for this purpose. This record is encoded like any other record, following the previous rules.
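A reader that treats the first record as a header might look like the sketch below. The function name `read_table` and the column names `id`, `street`, `city`, and `country` are hypothetical; the earlier address example has no header record, so one is added here for illustration.

```python
def read_table(text: str, delimiter: str = ",") -> list[dict[str, str]]:
    """Read delimited text, using the first record as column names."""
    # Split into records at LF or CRLF and skip empty lines.
    lines = [line for line in text.replace("\r\n", "\n").split("\n") if line]
    # Split each record into values and ignore spaces next to delimiters.
    rows = [[value.strip(" ") for value in line.split(delimiter)] for line in lines]
    header, data = rows[0], rows[1:]
    # Pair each value with the column name in the same position.
    return [dict(zip(header, row)) for row in data]


text = (
    "id, street, city, country\n"
    "100, 1935 Sixth St., London, UK\n"
    "35, 709 W. 42nd Str., New York, USA\n"
)

table = read_table(text)
print(table[0]["city"])  # London
```

The header record is parsed exactly like the data records; only its position gives it a different role.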
CSV and Delimited formats vs. XML.
GRML file and web browsers do not support XML. Browsers require a specific format of data. XML provides only a generic format of data. While XML is becoming more popular and many consider it a replacement for CSV and delimited formats, there are issues affecting its adoption.
Size and speed.
CSV and delimited formats use less formatting than XML. This results in smaller bandwidth and storage requirements than XML. When size and speed are important, CSV and delimited formats provide an alternative. For environments using high-cost bandwidth where large amounts of data are used, CSV and delimited formats require fewer resources.
Reading and saving.
There are fewer requirements for reading and saving CSV and delimited formats than for XML. XML uses nested tags to store data. Each tag has a name (the element) and must be closed. Parsers (which read a data format) and writers (which save a data format) require checks to enforce these rules. Using only a delimiter, CSV and delimited formats do not need to check tags and nesting. There is no need to find start and end tags.
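To make the difference concrete, the sketch below shows the kind of check a tag-based format needs and a delimited format does not. It is a deliberately simplified illustration, not a real XML parser: it assumes tags without attributes and without self-closing forms, and the name `tags_are_balanced` is hypothetical.

```python
import re

# Matches <name> and </name>; the first group is "/" for a closing tag.
TAG = re.compile(r"<(/?)([A-Za-z][\w.-]*)>")


def tags_are_balanced(text: str) -> bool:
    """Check that every opening tag is closed in the correct nesting order."""
    stack = []
    for closing, name in TAG.findall(text):
        if not closing:
            stack.append(name)  # remember the open tag
        elif not stack or stack.pop() != name:
            return False  # closing tag without a matching open tag
    return not stack  # any tag left on the stack was never closed


print(tags_are_balanced("<row><a>1</a><b>2</b></row>"))  # True
print(tags_are_balanced("<row><a>1</b></row>"))          # False

# The delimited equivalent needs no such bookkeeping.
print("1,2".split(","))
```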
Best-case scenario.
Compared with CSV and delimited formats, the theoretical best case for XML is one-letter element names for its tags and every field quoted in the delimited format. Even in this practically unrealistic case, the overhead of XML is still slightly more than 2 times that of CSV or delimited formats. Typically, it is much greater. In practical terms, expect XML to be 9 to 90 times the size of CSV or delimited formats.
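One way to arrive at a figure slightly above 2 is to count formatting characters per field. The sketch below is an interpretation, not the author's stated calculation: it ignores row-level tags and record separators, and the function names are hypothetical.

```python
def csv_overhead(quoted: bool) -> int:
    """Formatting characters a delimited format spends on one field."""
    # One delimiter, plus two quotation marks when the field is quoted.
    return len(",") + (len('""') if quoted else 0)


def xml_overhead(element_name: str) -> int:
    """Formatting characters XML spends on one field: <name> and </name>."""
    return len(f"<{element_name}>") + len(f"</{element_name}>")


# Best case for XML: one-letter element name against a quoted field.
best_case = xml_overhead("a") / csv_overhead(quoted=True)
print(round(best_case, 2))  # 2.33

# A longer element name against an unquoted field.
longer_name = xml_overhead("country") / csv_overhead(quoted=False)
print(longer_name)  # 19.0
```

Under these assumptions the ratio grows with the length of the element name, since each name is written twice per field.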
This is a direct comparison and considers transfers of regular tables (all rows of a column are the same type). The only circumstance where XML is comparable to CSV and delimited formats is when the data is very sparse and element names are minimal. In this case, XML can omit empty values, while a delimited format still spends a delimiter on each one, so XML makes up much of the difference. However, there is still a meaningful difference in overhead between XML and the delimited formats.
Conclusion.
The CSV and delimited formats described are compatible with many applications. Whether using Microsoft Excel or a GRML file and web browser, the rules for creating a CSV or delimited file apply equally. Take a database and save it using CSV and open it with a GRML browser. The data appears the same.
The advantage of CSV and delimited formats over XML was discussed. While XML is becoming more popular, CSV and delimited formats offer advantages in file size, parsing, and saving. While XML uses tags and nesting, CSV and delimited formats use a delimiter.
With the rules for using CSV and delimited formats covered, use them to read and save data with different applications, including GRML file and web browsers.
Other Links.
This article is reprinted from Using CSV and Delimited formats at GRMLBrowser.com.
More articles are available with file and web browser descriptions using the GRML, CSV, and delimited formats.