Not all CSVs are the same

The value of open data depends heavily on quality: not only the quality of the data itself, but the quality of the data format as well. Here is a list of rules to follow when creating CSV files.

CSV (short for Comma-Separated Values) is one of many standardized formats suitable for publishing open data. Yet a CSV file can still contain invalid data, which limits its further use and makes it harder for computers to parse. In the following lines we cover what a valid CSV file looks like, how to create one and which rules to follow.

There are two main parts of every CSV file:

Head

This is always the first row of the file. It defines the structure and content of the data part: each column represents one attribute, and the head contains its description. Text in the head must not contain diacritics or spaces. If your file's head contains spaces, replace them with underscores.
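The head rules above can be applied automatically. The following is a minimal Python sketch, assuming the column names arrive as plain strings; the function name normalize_header and the sample column names are made up for illustration.

```python
import unicodedata

def normalize_header(name: str) -> str:
    """Return a header name without diacritics and with spaces replaced by underscores."""
    # Split accented characters into a base letter plus a combining mark
    decomposed = unicodedata.normalize("NFKD", name.strip())
    # Drop the combining marks and keep the base letters
    without_marks = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Replace each run of whitespace with a single underscore
    return "_".join(without_marks.split())

print(normalize_header("Datum kontroly"))  # Datum_kontroly
print(normalize_header("Výše pokuty"))     # Vyse_pokuty
```

Letters that have no Unicode decomposition (for example "ł") pass through unchanged, so a stricter pipeline may need an extra replacement table.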

Data part

After the head comes the part of the file where the actual data are stored. It consists of rows that follow the attribute structure defined in the head. Each cell in a row represents one value. Cells cannot be merged across rows or columns.

The must-rules to follow:

  • Each cell represents one value of the attribute defined in the head
  • Cells cannot be merged together in any way
  • Dates must be written in ISO string format: YYYY-MM-DD (e.g. 2018-09-30)
  • Thousands are not separated by spaces or commas; use a dot instead
  • The values in cells are separated by commas
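Some of these rules can be checked by a script. The sketch below, written in Python, checks only two of them: that every row has as many cells as the head, and that a chosen column holds dates in YYYY-MM-DD form. The names is_iso_date and check_rows and the sample data are assumptions made for this example, not part of any standard tool.

```python
import csv
import io
import re
from datetime import datetime

# Exactly four digits, two digits, two digits, separated by hyphens
ISO_DATE = re.compile(r"\d{4}-\d{2}-\d{2}")

def is_iso_date(value: str) -> bool:
    """True if value is a real calendar date written as YYYY-MM-DD."""
    if not ISO_DATE.fullmatch(value):
        return False
    try:
        # Rejects impossible dates such as 2018-02-30
        datetime.strptime(value, "%Y-%m-%d")
    except ValueError:
        return False
    return True

def check_rows(text: str, date_column: str) -> list[str]:
    """Return a list of human-readable problems found in comma-separated text."""
    problems = []
    reader = csv.reader(io.StringIO(text))
    head = next(reader)
    date_index = head.index(date_column)
    # The head is row 1, so the data part starts at row 2
    for row_no, row in enumerate(reader, start=2):
        if len(row) != len(head):
            problems.append(f"row {row_no}: expected {len(head)} cells, found {len(row)}")
            continue
        if not is_iso_date(row[date_index]):
            problems.append(f"row {row_no}: '{row[date_index]}' is not YYYY-MM-DD")
    return problems

# Made-up sample: row 3 has a wrong date format, row 4 is missing a cell
sample = "id,date,amount\n1,2018-09-30,1500\n2,30.9.2018,2000\n3,2018-10-01\n"
for problem in check_rows(sample, "date"):
    print(problem)
```

Merged cells cannot be detected this way, because they exist only in spreadsheet formats and are already lost by the time a file is saved as CSV.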

UTF-8 encoding

The way the data are encoded is another very important criterion when judging open data quality. To comply with the standards, it is important to use UTF-8 encoding. This can be achieved rather easily with just a few clicks when saving the file.

Img

Before you hit Save, click on Tools and Web Options [www.webtoffee.com]

Img

Next, click on the Encoding tab and select UTF-8 [www.webtoffee.com]
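When a CSV file is produced by a script rather than saved from a spreadsheet, the encoding can be set explicitly. The following Python sketch shows one way to do it with the standard csv module; the helper names write_utf8_csv and read_utf8_csv, the file name and the sample rows are made up for illustration.

```python
import csv
import os
import tempfile

def write_utf8_csv(path, rows):
    """Write rows to path as a comma-separated file encoded in UTF-8."""
    # newline="" lets the csv module control line endings itself
    with open(path, "w", encoding="utf-8", newline="") as f:
        csv.writer(f).writerows(rows)

def read_utf8_csv(path):
    """Read a UTF-8 encoded CSV file back into a list of rows."""
    with open(path, encoding="utf-8", newline="") as f:
        return list(csv.reader(f))

# Made-up sample data with Czech diacritics in the data part
rows = [
    ["id", "city", "date"],
    ["1", "Plzeň", "2018-09-30"],
    ["2", "Český Krumlov", "2018-10-01"],
]

# Write into a temporary directory so the example leaves no stray files behind
path = os.path.join(tempfile.mkdtemp(), "example.csv")
write_utf8_csv(path, rows)
print(read_utf8_csv(path) == rows)  # True
```

Naming the encoding on both the writing and the reading side avoids depending on the operating system's default, which is not always UTF-8.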

A good example of well-prepared CSV files comes from the Czech Telecommunication Office in the Czech Republic. Their dataset Checks and fines serves as an example here and can be seen below this paragraph.

Img

Yet even this file contains a mistake. The people at the Czech Telecommunication Office used a semicolon ( ; ) as the separator. As a result, all the data are displayed in one cell, as can be seen in the picture. According to the standard, the values should be separated by a comma. Luckily, this is still a machine-readable format.
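Because a semicolon-separated file is still machine readable, it can be converted to the comma-separated form with a short script. The Python sketch below uses the standard csv module; the function name semicolons_to_commas and the sample row are invented for this example and are not taken from the actual dataset.

```python
import csv
import io

def semicolons_to_commas(text: str) -> str:
    """Rewrite semicolon-separated CSV text as comma-separated CSV text."""
    source = csv.reader(io.StringIO(text, newline=""), delimiter=";")
    target = io.StringIO(newline="")
    # The writer quotes any value that itself contains a comma
    writer = csv.writer(target, delimiter=",", lineterminator="\n")
    writer.writerows(source)
    return target.getvalue()

# Made-up sample: the second value contains a comma and must be quoted
sample = "id;subject;fine\n1;Operator A, Ltd.;1500\n"
print(semicolons_to_commas(sample))
```

Going through a CSV reader and writer, instead of replacing every semicolon in the text, keeps values that already contain commas or semicolons intact.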

We hope this article was helpful for you.

Apitalks team