Principle 4: Election data is open when it is complete and in bulk

Complete and in bulk

Data is open when it is complete and made available as a whole. Releasing a complete dataset is a clear act of transparency; any data that is omitted is inaccessible and cannot be used. Even when a user is interested in only one part of the data, having the whole dataset lets them put that specific interest in context. For example, an organization may wish to analyze the voter registration rate for districts in its region. It may focus most of its analysis on the data for that region but, with the entire dataset, it can quickly calculate the registration rate for the entire country and use that to put the regional rate in proper context.

When an EMB publishes an incomplete dataset, it risks being accused of purposefully hiding the information that is not included. The risk is often higher when the difference between what is released and what is withheld has a geographic dimension. In many countries, support for a particular candidate or party is highly correlated with geography (for instance, urban areas may support one candidate while more rural areas support another). Completeness is therefore especially important when the data has a geographic component, because leaving out one area can make the EMB appear biased against a particular candidate or party.
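The regional-versus-national comparison described above can be sketched in a few lines of Python. This is only an illustration: the regions, districts and voter counts are invented, and the rate is assumed to be registered voters divided by eligible voters.

```python
# Hypothetical figures for illustration only; this is not real election data.
# Each row: (region, district, registered voters, eligible voters)
ROWS = [
    ("North", "N-1", 8_000, 10_000),
    ("North", "N-2", 4_000, 10_000),
    ("South", "S-1", 9_000, 10_000),
    ("South", "S-2", 9_000, 10_000),
]


def registration_rate(rows):
    """Return total registered voters divided by total eligible voters."""
    registered = sum(row[2] for row in rows)
    eligible = sum(row[3] for row in rows)
    return registered / eligible


# The analyst's region of interest.
north = [row for row in ROWS if row[0] == "North"]
regional_rate = registration_rate(north)

# The national rate is only computable because the full dataset is available.
national_rate = registration_rate(ROWS)

print(f"North: {regional_rate:.1%}, national: {national_rate:.1%}")
```

With only the "North" rows, the analyst could report a 60% registration rate but could not tell whether that is high or low. The complete dataset shows it sits below the national figure.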

Releasing bulk data means that all the data is contained in a single file so that the entire dataset can be obtained in one download. For example, the Electoral Commission (IEC) of South Africa made the polling station-level results (called "voting district results") for the 2014 national and provincial elections available in bulk as one file for download (as a compressed .csv file). The IEC also noted the file type and file size next to the link. Releasing a complete dataset in bulk is often one of the simplest and most direct first steps an EMB can take to make data truly open. In cases where that one file might be too large and thus take too long to download, the data should also be offered as a small set of files. In Open Government Data: The Book, Joshua Tauberer defines "too large" as "when a data set is so large so as to be not practically downloadable in bulk. By today's standards, that would be a data set at least 10 gigabytes in size, or about 6 hours on a broadband connection."
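To make the "10 gigabytes, or about 6 hours" rule of thumb concrete, the short sketch below works out the connection speed those two figures imply and estimates download times for other file sizes. It assumes decimal units (1 gigabyte = 8,000 megabits) and ignores protocol overhead; the function names are illustrative and do not come from the book.

```python
MEGABITS_PER_GB = 8_000   # assumption: decimal units, 1 GB = 8,000 megabits
SECONDS_PER_HOUR = 3_600


def implied_speed_mbps(size_gb, hours):
    """Connection speed (megabits per second) implied by a size and a duration."""
    return size_gb * MEGABITS_PER_GB / (hours * SECONDS_PER_HOUR)


def download_hours(size_gb, speed_mbps):
    """Estimated hours needed to download size_gb gigabytes at speed_mbps."""
    return size_gb * MEGABITS_PER_GB / speed_mbps / SECONDS_PER_HOUR


# Speed implied by the quoted figures: 10 GB in about 6 hours.
implied = implied_speed_mbps(10, 6)
print(f"Implied speed: {implied:.1f} Mbps")

# At that same speed, a 500 MB results file would take well under an hour.
print(f"0.5 GB at that speed: {download_hours(0.5, implied):.2f} hours")
```

An EMB can run the same arithmetic with the connection speeds typical for its own users to decide whether a dataset should also be offered as a small set of files.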

Proper documentation

Sufficient documentation is another aspect of completeness. The data file should have corresponding documentation that describes the variables, fields and labels used in the file. At a minimum, the documentation should also include notes on the structure of the data and explanations of any abbreviations used. Ideally, it will also describe how the data was collected, the purpose of the collection and the target audience, and provide links to related auxiliary data and a point of contact for further questions. As encouraged in the G8 Open Data Charter and Technical Annex, the EMB should make sure that a dataset is "fully described, as appropriate, to help users to fully understand the data."

The Brazilian Superior Electoral Court's (TSE) repository of election data is a good example of an EMB offering data in bulk along with proper documentation. The repository includes voter registration data, candidate and party information, campaign finance data and election results data. A user can download the election results for 2012 and 2014 in bulk, and a "read me" file is included with the data. The "read me" file for the 2014 results, for example, details how the data are encoded, provides a description of each variable, notes when the data were last updated and includes contact information for further questions.
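One simple way to check that documentation covers every variable is to compare the column names in a data file against the accompanying data dictionary. The sketch below is a minimal illustration of that idea. The sample file, column names and descriptions are all hypothetical and are not drawn from the TSE or any other EMB.

```python
import csv
import io

# Hypothetical sample file; the column names are invented for illustration.
SAMPLE_CSV = """district_code,reg_voters,valid_votes,spoilt
D01,1200,950,12
D02,800,610,7
"""

# Hypothetical data dictionary: column name -> plain-language description.
DATA_DICTIONARY = {
    "district_code": "Identifier of the voting district",
    "reg_voters": "Number of registered voters",
    "valid_votes": "Number of valid ballots cast",
}


def undocumented_columns(csv_text, dictionary):
    """Return the columns in the CSV header that the dictionary does not describe."""
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader)
    return [name for name in header if name not in dictionary]


missing = undocumented_columns(SAMPLE_CSV, DATA_DICTIONARY)
print(missing)  # ['spoilt']
```

Here the abbreviation "spoilt" appears in the data but not in the documentation, which is exactly the kind of gap that leaves users guessing about what a field means.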