EN100-web
EN100-web
EN100-web
You also want an ePaper? Increase the reach of your titles
YUMPU automatically turns print PDFs into web optimized ePapers that Google loves.
Special theme: Scientific Data Sharing and Re-use<br />
Understanding Open Data CSv file Structures<br />
for Reuse<br />
by Paulo Carvalho, Patrik Hitzelberger and Gilles Venturini<br />
Open Data (OD) is one of the most active movements contributing to the spread of information over<br />
the <strong>web</strong>. However, there is no common standard to publish datasets. Data is made available by<br />
different kind of entities (private and public), in various formats and according to different cost models.<br />
Even if the information is accessible, it does not mean it can be reused. Before being able to use it, an<br />
aspiring user must have knowledge about its structure, location of meaningful fields and other<br />
variables. Information visualization can help the user to understand the structure of OD datasets.<br />
The Public Research Centre Gabriel<br />
Lippmann (Luxembourg), together with<br />
the University François-Rabelais of<br />
Tours (France) are currently running a<br />
project studying OD integration and<br />
how information visualization may be<br />
used to support the end user in this<br />
process. There are numerous obstacles<br />
to overcome before the great business<br />
and societal potential of OD can be realized.<br />
This project aims to address some<br />
of these obstacles.<br />
A key problem is the plethora of available<br />
formats for OD, including: PDF,<br />
CSV, XLS, XML and others which may<br />
or may not be machine-readable. In<br />
2011, PDF was the most common<br />
format. Since then, however, things have<br />
been changing. The use of machinereadable<br />
formats is strongly encouraged<br />
by several initiatives. The use of tabular<br />
data representation, CSV format in particular,<br />
is continually growing. Despite<br />
the increasing popularity of such files,<br />
there is a dearth of recent studies on tabular<br />
information formats and their visualization.<br />
The last relevant work in this<br />
field was published in 1994 - Table Lens<br />
visualization for tabular information [1].<br />
Our project addresses this gap in the<br />
research, focusing on CSV format since<br />
it is currently one of the most OD used<br />
formats and the most important tabular<br />
data format.<br />
Although CSV is a simple tabular<br />
format, its semantic and syntactic interpretation<br />
may be difficult. Since CSV<br />
does not have a formal specification, the<br />
interpretation of files can be a complex<br />
process. Nevertheless, the interpretations<br />
converge to some basic ideas<br />
described in the RFC 4180 [2]. In order<br />
to reuse OD datasets, it is mandatory to<br />
get a global idea of their structures.<br />
Information visualization may be used<br />
to create a real-time solution in order to<br />
obtain a quick and efficient global<br />
overview of OD CSV files. Our intention<br />
is to provide an effective tool to:<br />
• estimate the size of analysed file(s);<br />
• show the entire structure of OD CSV<br />
files;<br />
• view what kind of data each cell contains;<br />
• detect errors so the user can complete/correct<br />
them before use;<br />
• detect and locate missing values;<br />
• compare the structure of different<br />
generations of CSV files, because<br />
OD datasets are commonly published<br />
periodically.<br />
Piled Chart has been created to achieve<br />
these objectives. Piled Chart shows the<br />
entire structure of a CSV file using a<br />
cell-based shape and a colour code. In<br />
the Piled Chart, sequences of rows with<br />
the same structure (rows with same data<br />
type in each cell) are piled (grouped)<br />
into a unique row (see Figure 1). The<br />
same principle is applied to the<br />
columns: columns with the same structure<br />
are grouped into a unique column.<br />
Furthermore, cells are colour-coded<br />
according to data type. Supported data<br />
types are currently numbers, strings,<br />
dates and percentage. A special colour is<br />
also applied for missing values.<br />
Because similar rows and columns are<br />
grouped, the CSV file can be represented<br />
using a reduced area giving the<br />
user a smaller surface to be analysed.<br />
Piled Chart shows great promise, but<br />
still has room for improvement. For<br />
Figure1: Example CSV file and an<br />
example of Piled Chart.<br />
now, Piled Chart is unable to inspect<br />
several files simultaneously. This limitation<br />
is important, especially for files<br />
published periodically by the same<br />
entity (e.g., the monthly expenses of a<br />
government). We do not yet know how<br />
the system responds when large files are<br />
analysed, thus tests with large files must<br />
be performed. The problem of scalability,<br />
which applies to many algorithms<br />
in information visualisation, is challenging.<br />
The user-friendliness of the<br />
approach also needs to be evaluated; a<br />
user must be able to easily understand<br />
the information shown on the graph and<br />
the interaction must be easy and fluid.<br />
The project will continue until the end of<br />
2016 in order to address these questions.<br />
References:<br />
[1] R.Rao, S. K. Card: “The table lens:<br />
merging graphical and symbolic<br />
representations in an interactive focus+<br />
context visualization for tabular<br />
information”, in proc. of the SIGCHI<br />
conference on Human factors in<br />
computing systems (pp. 318-322),<br />
ACM, 1994.<br />
[2] Y. Shafranovich: “Rfc 4180:<br />
Common format and mime type for<br />
comma-separated values (csv) files”,<br />
Cited on, 67, 2005.<br />
Please contact:<br />
Paulo Da Silva Carvalho<br />
Centre de Recherche Public - Gabriel<br />
Lippmann<br />
E-mail: dasilva@lippmann.lu<br />
36<br />
ERCIM NEWS 100 January 2015