
4 Key Factors for Open Data Publication – From a User Standpoint

#Data for Social Impact: What Good Open Data Sources Should Be Like

By Hua Deng, Online Volunteer with the UNV online volunteering service for Regional Innovation Centre, UNDP Asia-Pacific.

Image by CILIP Photos from Flickr

This summer, I worked on a project built entirely on open data: data visualization and analysis of COVID-19 responses in Southeast Asia. While searching for open data sources and working with them, I found that there is no gold standard for publishing open data, and that data publishers are still exploring what works. Many details that seem trivial to a publisher can cause real inconvenience for users. I believe it is valuable for the community to look at open data from the perspective of a data user, which is why I wrote this blog.

From a user standpoint, I would summarize four key factors for publishing open data: (1) Data Preparation, (2) Data Publishment, (3) By-product, and (4) Community, each with a few sub-criteria. As an example, I compare three data sources I used in my project: the Philippines Department of Health (DOH), Google COVID-19 Community Mobility Reports, and the Oxford COVID-19 Government Response Tracker (OxCGRT). In the table below, each cell holds a short description and is colored green, yellow, or red to indicate good, not good enough, or bad practice. Each factor is discussed in detail in the following sections. Where I have no comment, I leave the cell blank.

Image by Author

#1 – Data Preparation

#1.1 – Consistency

Data publishers should settle on standards in the first place and stick to them when making updates. If an adjustment must be made, it should be clearly communicated to users; otherwise it causes confusion and inconvenience. For example, the Google Mobility Index at one point switched from "Metro Manila" to "National Capital Region", which are essentially different names for the same region in the Philippines. Because my project combined several data sources that each used their own naming conventions, I had to build a mapping dictionary myself. When Google made the change without announcing it (or without announcing it clearly), I had to figure out what was going on and fix my mapping accordingly.
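To make the idea of a mapping dictionary concrete, here is a minimal Python sketch. The alias table and the function name `canonical_region` are my own illustration (the "ncr" alias in particular is an assumption), not something any of the publishers provide:

```python
# Hypothetical alias table: every spelling seen in any source maps to one
# canonical name. The entries below are illustrative, not exhaustive.
REGION_ALIASES = {
    "metro manila": "National Capital Region",
    "national capital region": "National Capital Region",
    "ncr": "National Capital Region",  # assumed abbreviation
}


def canonical_region(name):
    """Return the canonical region name, or raise KeyError if unmapped."""
    key = " ".join(name.split()).lower()  # collapse whitespace, ignore case
    try:
        return REGION_ALIASES[key]
    except KeyError:
        # Failing loudly surfaces an unannounced rename immediately,
        # instead of silently dropping rows in a later join.
        raise KeyError(f"unmapped region name: {name!r}") from None
```

Raising an error on unknown names is a deliberate choice here: when a publisher renames a region without notice, the script stops at the first unmapped name rather than producing a quietly incomplete result.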

Consistency also means adhering to major conventions. As mentioned above, if different data sources all followed the same conventions, users would spend less effort building mappings on their own. Typical examples include geographical names and date formats. Of course, in some cases there are multiple prevailing conventions, which makes it very hard to coordinate adoption among different data owners. Possible solutions include: (1) data publishers clearly declare which conventions they use, so that users have what they need to do the mappings themselves; (2) third-party platforms or service providers maintain commonly used mappings, or automatically help users with standardization.
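Date formats are a good illustration of why a declared convention matters. The sketch below normalizes dates to ISO 8601 (YYYY-MM-DD) using Python's standard library. The list of input formats is an assumption for illustration; in practice it should come from each publisher's documentation:

```python
from datetime import datetime

# Assumed input formats, tried in order. Which formats a source really
# uses should be taken from its documentation, not guessed.
KNOWN_FORMATS = ("%Y-%m-%d", "%d/%m/%Y", "%Y%m%d")


def to_iso_date(text, formats=KNOWN_FORMATS):
    """Parse a date string in one of the given formats; return YYYY-MM-DD."""
    cleaned = text.strip()
    for fmt in formats:
        try:
            return datetime.strptime(cleaned, fmt).date().isoformat()
        except ValueError:
            continue  # not this format, try the next one
    raise ValueError(f"unrecognized date format: {text!r}")
```

Note that a string such as "03/04/2021" is ambiguous between day/month and month/day order. No code can resolve that on its own, which is exactly why the publisher needs to state the convention.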

#1.2 – Quality

Data users always hope for good-quality data, but the reality is sometimes the opposite. It is great to see that the Philippines Department of Health publishes COVID-19 data every day on case information, testing, quarantine facilities, and more. But we should also be aware of quality issues in the data, including obvious delays and errors, as analyzed by the University of the Philippines COVID-19 Pandemic Response Team. One noteworthy problem is missing values. If values are not missing at random but in a biased way, the reliability of every analysis based on the data suffers, and so does the policymaking that relies on it. To address this issue, data owners could pay more attention to the data collection process by setting clearer requirements and instructions. Adopting data quality assurance procedures would also be very helpful.
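One simple check a data user can run is to compare the share of missing values across groups: if one region is missing far more values than another, the gaps are probably not random. This is a rough sketch, and the field names and sample rows are made up for illustration:

```python
from collections import defaultdict


def missing_rate_by_group(rows, group_field, value_field):
    """Return {group: share of rows whose value_field is empty or absent}."""
    total = defaultdict(int)
    missing = defaultdict(int)
    for row in rows:
        group = row.get(group_field) or "(unknown)"
        total[group] += 1
        value = row.get(value_field)
        if value is None or str(value).strip() == "":
            missing[group] += 1
    return {group: missing[group] / total[group] for group in total}


# Made-up sample rows; "region" and "date_onset" are hypothetical fields.
sample_rows = [
    {"region": "A", "date_onset": "2021-01-03"},
    {"region": "A", "date_onset": ""},
    {"region": "B", "date_onset": "2021-01-05"},
    {"region": "B", "date_onset": "2021-01-06"},
]
rates = missing_rate_by_group(sample_rows, "region", "date_onset")
```

A large gap between groups does not prove bias by itself, but it is a cheap signal that the affected fields deserve a closer look before any analysis is built on them.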

#1.3 – Granularity

Within the bounds of data privacy and feasibility, open data should be as granular as possible to support a broad range of applications. For example, the Philippines DOH case data goes down to the individual level, which supports more types of analysis and application than aggregated country-level data. Users can roll the data up to the city, region, or country level themselves, whichever fits their needs.
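The roll-up is cheap for the user as long as the granular data exists, while the reverse is impossible. A small sketch, assuming one record per case with hypothetical `country`, `region`, and `city` fields (the records are made up):

```python
from collections import Counter


def count_cases(records, level_fields):
    """Count individual case records per combination of the given fields."""
    return Counter(tuple(rec[field] for field in level_fields) for rec in records)


# Made-up individual-level records for illustration only.
cases = [
    {"country": "Philippines", "region": "National Capital Region", "city": "Manila"},
    {"country": "Philippines", "region": "National Capital Region", "city": "Manila"},
    {"country": "Philippines", "region": "National Capital Region", "city": "Quezon City"},
    {"country": "Philippines", "region": "Central Visayas", "city": "Cebu City"},
]

by_city = count_cases(cases, ["region", "city"])
by_region = count_cases(cases, ["region"])
by_country = count_cases(cases, ["country"])
```

The same function serves every level because only the list of grouping fields changes.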

#1.4 – Usability

This might be less crucial, but it is very easy to improve for better user-friendliness. Users who are less experienced in data wrangling may hope for data in a format that meets their needs exactly. For example, Google Mobility data now offers either a Global CSV or Region CSVs for download, so that users can look only at the regions they are interested in. Another example is OxCGRT, which offers data in tabular format as well as in time series format. In addition, following naming conventions and giving variables intuitive names makes a dataset more user-friendly.
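To show what offering two layouts saves the user, here is a sketch of the conversion they would otherwise write themselves: turning long-format rows (one row per country and date) into a wide table (one row per country, one column per date). The field names and values are hypothetical and do not reflect the actual OxCGRT schema:

```python
def long_to_wide(rows, key_field, column_field, value_field):
    """Pivot long-format rows into {key: {column: value}}."""
    wide = {}
    for row in rows:
        wide.setdefault(row[key_field], {})[row[column_field]] = row[value_field]
    return wide


# Made-up long-format rows for illustration only.
long_rows = [
    {"country": "PHL", "date": "2021-02-06", "index": 70},
    {"country": "PHL", "date": "2021-02-07", "index": 72},
    {"country": "VNM", "date": "2021-02-06", "index": 55},
]
wide_table = long_to_wide(long_rows, "country", "date", "index")
```

This is only a few lines for an experienced user, but for someone new to data wrangling a ready-made file in the right shape removes a real hurdle.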

Screenshot from Google COVID-19 Community Mobility Reports on 2021–02–07
Screenshot from OxCGRT covid-policy-tracker Github Repository on 2021–02–07

#2 – Data Publishment (Accessibility & Maintenance)

There is no definitive standard for how to publish open data, so you can find many different practices. From a user standpoint, I can think of two key criteria: (1) whether the data is easy to access, and (2) whether a project built on it is easy to maintain.

The Philippines DOH data is published on Google Drive. The advantage is that everyone knows how to use Google Drive, so it is very accessible. The downside is the process: you first open a "READ ME FIRST" PDF file, then scroll down to the latest link in the middle of the file, and only then reach the folder with the data you need. This makes it harder to automate the maintenance of projects, as users have to repeat the procedure manually every time they download the latest file.

Screen Recording from Philippines Department of Health COVID-19 TRACKER on 2021–02–07

The Google Mobility data is easy both to access and to maintain. There are buttons on the website to trigger the download, and the download path can easily be embedded in a script. For OxCGRT, users can either download CSV files from GitHub or use the API. GitHub plus an API supports multiple types of usage and makes updates easy, though these two routes might seem less accessible to non-technical users.
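When a dataset sits behind a stable file URL, "embedding the path in your script" can be as short as the sketch below, which uses only Python's standard library. The function name is my own, and the URL in the comment is a placeholder; the real link should be copied from the publisher's download page:

```python
import csv
import io
import urllib.request


def fetch_csv_rows(url, timeout=30.0):
    """Download a CSV file and return its rows as a list of dictionaries."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        text = response.read().decode("utf-8-sig")  # tolerate a UTF-8 BOM
    return list(csv.DictReader(io.StringIO(text)))


# Example call with a placeholder URL (not a real data source):
# rows = fetch_csv_rows("https://example.org/path/to/latest.csv")
```

Run on a schedule, a call like this keeps a project up to date without anyone clicking through pages by hand, which is the maintenance advantage of a stable download path.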

There are other practices for sharing open data sources. For example, Microsoft maintains a data lake of COVID-19 open datasets on Azure, ready to use with some of the services integrated in Azure. In this case, it is usually not the publisher who decides to publish the data on Azure; rather, Azure collects and integrates the open data source on its platform under the license of the original publisher.


#3 – By-product (Documentation & Visualization & Report)

The "by-products" published along with data, such as documentation, visualizations, and reports, help users understand the data. Documentation is a must-have, while visualizations and reports are nice-to-haves. Without documentation, users cannot gain a detailed and unbiased understanding of the data, which is the foundation for any in-depth analysis and application. With visualizations and reports, users can play around with the data, do some basic analysis easily, and even take insights directly from the report. Providing basic visualizations and reports prevents redundant work in the community and lowers the barriers to open data usage.

The Philippines DOH data comes with metadata files describing what each sheet and each field is about. Its technical notes are updated every day in the "READ ME FIRST" PDF, which by now has become a really long file whose main points are quite hard for readers to find. There is also no description of the data collection procedure. I sometimes got confused when exploring the data and had to make assumptions to proceed, which may turn out to be wrong.

Google Mobility data provides reports for countries and regions. Each report contains basic descriptions and links to detailed instructions on how the data is collected, how the metrics are constructed, and what to be aware of when interpreting the data. There are also basic line charts and metrics for the country or region. The reports are highly standardized, which is why Google can offer separate reports for individual countries and regions. The downside is that the reports are too simple to yield in-depth insights.

OxCGRT, from my point of view, did a really great job, as it provides nearly all the support I could think of. A detailed and clearly written codebook and working paper, check; interactive dashboards offered by Our World in Data, check; regional summaries, check!

Screenshot from OxCGRT on 2021–02–07
Screenshot from OxCGRT on 2021–02–07

#4 – Community (Communication Channel & Use Case Sharing)

After data is published, more opportunities for communication and sharing are needed to maximize its impact and reach. There should be smooth communication channels between data publishers and users, so that users can report issues with the data and get timely answers, and data publishers can get feedback and improve any relevant aspect of the data. There should also be opportunities for users to share their findings and experience with other users, and thus contribute to the knowledge base of the community as a whole.

There is not yet a single platform that serves as the center of the open-data-for-social-good community. Data publishing and use case sharing mostly happen case by case, spread across the internet. Here I would like to introduce some good practices from different types of entities that play different roles in the community, and I am sure that more innovative and engaging practices will emerge in the future.

  • OxCGRT. It is open to feedback on the data, the analysis, or any other aspect of the project. Users can contact the team either via a survey form or by email.
  • Facebook Data for Good program. Facebook has a dedicated team to manage this program. The more mature datasets have already been made entirely public; datasets that might still need further testing or feedback are available to researchers and non-profits upon request. On the website, users can easily find "Case Studies" and "In the News" sections to learn from good use cases. The team has also set up Slack channels to promote communication between data publishers and data users.
  • Data4COVID19. It is a series of projects undertaken by The GovLab, one of which is a data collaborative repository. In the repository, you can find three types of information: (1) ongoing data collaborative projects, (2) data competitions, challenges, and calls for proposals, and (3) requests for data and expertise. All of this is done via a Google Sheet open for editing, which is technically simple but achieves great impact.
  • Data.govt.nz. It is a great one-stop website for open data in New Zealand, run by the government. Here you can find nearly anything you want: datasets, frameworks and guidelines for publishing data, communities and groups, to name just a few.

Thank you for reading! This is the final post in a series of blogs on #Data for Social Impact. In the first blog, I introduced in detail our project on data visualization and analysis for COVID-19 responses in Southeast Asia. In the second blog, I shared my story as an online United Nations volunteer on COVID-19 data projects, and my reflections on how to better utilize open data and crowdsourcing in the public sector. In this third blog, I discussed what good open data should look like from a user standpoint. I sincerely hope this series is helpful!

