Principle 5: Election data is open when it is analyzable (i.e., available in a digital, machine readable format)
Data that are available in a digital, machine readable format can be quickly and easily analyzed. Access to "analyzable" election data is key to assessing the integrity of an electoral process. For electoral transparency, it is critical that members of the public be able to perform their own independent analysis of the raw data and verify the EMB's analysis. A dataset on its own imparts very little information to a human. Once the dataset is processed in some way, through analysis and/or visualization, it becomes useful and conveys insight.
Machine readable
In more technical terms, "machine readable" data is analyzable data. "Machine readable" is a way of saying that the dataset is in a format that a computer can easily understand. Put another way, it means that the data are structured well enough to allow automated processing. In its Open Data Handbook, the Open Knowledge Foundation defines machine readable formats as "ones which are able to have their data extracted by computer programs easily."
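As a minimal sketch of what "automated processing" can look like in practice, the Python snippet below reads a small table of polling station figures and computes turnout for each station. The column names and numbers are invented for illustration and are not drawn from any EMB's data.

```python
import csv
import io

# Hypothetical polling station figures; names and numbers are illustrative only.
RAW = """station_id,registered_voters,ballots_cast
PS-001,500,325
PS-002,420,210
PS-003,610,488
"""

def turnout_by_station(text):
    """Return {station_id: turnout percentage} computed from CSV text."""
    reader = csv.DictReader(io.StringIO(text))
    return {
        row["station_id"]: round(
            100 * int(row["ballots_cast"]) / int(row["registered_voters"]), 1
        )
        for row in reader
    }

turnout = turnout_by_station(RAW)
for station, pct in turnout.items():
    print(station, pct)
```

Because every value sits in a named column, the same few lines would work unchanged on a file with thousands of stations. A scanned image of the same table would first have to be retyped or run through error-prone text recognition.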
Machine readable versus digitally accessible
It is important to understand that "machine readable" is not the same as "digitally accessible" information. Scanning a report makes the content digitally accessible. However, a computer is not able to "understand" the information in the report. Data.gov's Primer on Machine Readability for Online Documents and Data has a useful illustration of the differences between machine readable and digitally accessible:
[The] distinction can be seen in the difference between a magazine cover and a barcode on that cover. A computer cannot directly understand what the picture on the magazine represents, even if it is presented in an online format, but it can read and understand the barcode, using it for identifying the price and tracking the purchase, for example.
When EMBs release information in analyzable (i.e., machine readable) formats, they are helping to bridge the gap between "documents" (which are usually static and frozen in their format) and "data" (which are dynamic and allow for further processing).
While many EMBs make information available as PDF documents, that format is not machine readable. In fact, many of the formats most commonly used to release election information (reports, contact information, election results and so on) are not machine readable. Formats like PDF, Word documents, JPG images and HTML pages do not have a structure that lends itself to automated analysis and processing. Instead, they are useful for displaying information on a screen or printing it on a page. Unfortunately, those formats make it very difficult to mechanically reconstruct and analyze their contents. While a computer can display the text in, say, a PDF document well, it is difficult, and sometimes nearly impossible, for it to "understand" the structure and context around the text.
Suitable formats for suitable purposes
As mentioned before, formats such as PDF, Word, JPG images and HTML pages do not have a structure that lends itself to automated processing. They are suited to displaying information on a screen or printing it. Historically, those formats matched the purpose intended at the time. When EMBs first released information, the expectation was that a human would read it, print it and then possibly take action, such as filling out a form. In some cases, users may have been able to query the data, but the underlying data itself was not available. The decision to release information in those closed file types was driven by familiarity with the formats (HTML and PDF), a narrow definition of "users," and a more limited expectation of what those users could and would do with the information.
Currently, EMBs that fully appreciate the range of possible uses of election information make that information available in a range of formats, including both "printable" and "analyzable" formats. EMBs in Georgia, Colombia and Mexico have made image scans of the polling station results forms available in near real-time on their websites[1]. They published the scans because they wished to be transparent about the primary source of the data and to allow any person to visually scrutinize the handwritten numbers, signatures and tallies. For the 2012 Presidential, Senate and Deputy Elections, Mexico's Federal Electoral Institute (IFE), now called the National Electoral Institute (INE), also made the preliminary and final results for each election available as a bulk download in a machine readable format (as a TXT file). The preliminary results are available as a compressed file under "Base de Datos" on the IFE's PREP site, while the final results are available as a bulk download for each state through the IFE's ATLAS system of historical data. The IFE made the data available that way because it wanted the media and election monitoring organizations to be able to quickly analyze and verify the information. The IFE example illustrates that matching the format to the purpose is not a matter of providing the machine readable data in place of the image scans, but of providing the machine readable data in addition to the images. Thus, the choice of format should match the purpose or purposes, which may mean that EMBs need to publish data in multiple formats almost simultaneously.
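To illustrate the kind of check that a bulk download makes possible, here is a hedged sketch in Python. It recomputes candidate totals from polling station level records and compares them with a published summary. The pipe delimiter, the column names and every figure are assumptions made for this example. They do not describe the actual layout of the IFE's files.

```python
import csv
import io
from collections import Counter

# Hypothetical pipe-delimited polling station results (illustrative values only).
STATION_RESULTS = """state|station_id|candidate|votes
A|001|Candidate X|120
A|001|Candidate Y|95
A|002|Candidate X|80
A|002|Candidate Y|110
"""

# Hypothetical totals as they might appear in a published summary.
PUBLISHED_TOTALS = {"Candidate X": 200, "Candidate Y": 210}

def recompute_totals(text, delimiter="|"):
    """Sum votes per candidate across all polling station rows."""
    totals = Counter()
    for row in csv.DictReader(io.StringIO(text), delimiter=delimiter):
        totals[row["candidate"]] += int(row["votes"])
    return dict(totals)

def find_discrepancies(recomputed, published):
    """Return {candidate: (recomputed, published)} wherever the two differ."""
    names = set(recomputed) | set(published)
    return {
        name: (recomputed.get(name, 0), published.get(name, 0))
        for name in names
        if recomputed.get(name, 0) != published.get(name, 0)
    }

recomputed = recompute_totals(STATION_RESULTS)
discrepancies = find_discrepancies(recomputed, PUBLISHED_TOTALS)
print(discrepancies)  # {'Candidate Y': (205, 210)}
```

A flagged difference is only a starting point. An observer would then go to the image scan of the relevant form to see whether the discrepancy comes from the form itself, from data entry, or from the summary.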
CSV, JSON and XML: Recommended formats for election data
- The most common structure of election data is "tabular" data, meaning data stored as a table or series of tables. The most common machine readable format for tabular data is the "Comma Separated Values" (CSV) file type. Nearly all databases and spreadsheet programs can save information in this format. A CSV file stores tabular data as plain text, which makes it easy for computers to process.
- The "JavaScript Object Notation" (JSON) format is a machine readable and non-proprietary format. JSON is derived from the JavaScript language used on many websites. JSON is better suited to representing hierarchical relationships in data (e.g., an organizational chart or different levels of electoral boundaries) than to representing tabular data. However, one disadvantage of both the CSV and JSON file formats is that they do not inherently include metadata, that is, information that explains the data. Metadata provides context for the data by describing the variables, how the data were collected or when they were last updated.
- XML or "eXtensible Markup Language" was developed to make the metadata of documents directly available and processable. The XML format makes it easier to include the proper documentation about a dataset. The XML format allows users to tag the information in a document so that computers can automatically index and extract it, thereby making the information easy to search and browse.
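The sketch below shows how the same small set of results might look in each of the three formats described above, and how each can be read using only Python's standard library. The district, candidates, vote counts and the `source` and `last_updated` attributes are all hypothetical. Real files would follow whatever layout the publishing EMB chooses.

```python
import csv
import io
import json
import xml.etree.ElementTree as ET

# Tabular layout: one row per district and candidate (hypothetical values).
CSV_TEXT = """district,candidate,votes
North,Candidate X,1200
North,Candidate Y,950
"""

# Hierarchical layout: results nested inside their district.
JSON_TEXT = """{
  "districts": [
    {"name": "North",
     "results": [{"candidate": "Candidate X", "votes": 1200},
                 {"candidate": "Candidate Y", "votes": 950}]}
  ]
}"""

# Tagged layout: descriptive attributes travel with the data itself.
XML_TEXT = """<results source="hypothetical EMB" last_updated="2014-01-01">
  <district name="North">
    <result candidate="Candidate X" votes="1200"/>
    <result candidate="Candidate Y" votes="950"/>
  </district>
</results>"""

def from_csv(text):
    rows = csv.DictReader(io.StringIO(text))
    return [(r["district"], r["candidate"], int(r["votes"])) for r in rows]

def from_json(text):
    data = json.loads(text)
    return [(d["name"], r["candidate"], r["votes"])
            for d in data["districts"] for r in d["results"]]

def from_xml(text):
    root = ET.fromstring(text)
    return [(d.get("name"), r.get("candidate"), int(r.get("votes")))
            for d in root.findall("district") for r in d.findall("result")]

records = from_csv(CSV_TEXT)
same_everywhere = records == from_json(JSON_TEXT) == from_xml(XML_TEXT)
last_updated = ET.fromstring(XML_TEXT).get("last_updated")
print(same_everywhere, last_updated)
```

All three readers produce the same list of records, so the choice between formats depends mainly on the shape of the data and on how much documentation needs to accompany it.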
[1] The Central Election Commission in Georgia posted images of the polling station forms in near real-time for the 2012 Parliamentary Election, the 2013 Presidential Election, and the 2014 Municipal Elections. Colombia's National Registrar published image scans of polling station forms for the 2014 Senate Election. In 2012, Mexico's Federal Electoral Institute (IFE) posted the images and preliminary results at the polling station level (called "Actas") for the Presidential, Senate and Deputy Elections on their PREP system website.