
Contents
- Introduction
- What is Open Data?
- Open vs. Free vs. Online Data
- Where to find Open Data?
- International organizations
- United States
- Europe
- Latin America
- Asia
- Other Open Data sources (Google Public Data Explorer, Kaggle, FiveThirtyEight, UCI Machine Learning Repository etc.)
- Conclusions
Introduction
Data Science has the power to bring great contributions to building the world we want to live in. And there are already numerous use cases which demonstrate how it can be leveraged for solving real-world problems.
Some examples of such cases can also be found in my previous article on this subject:
To do so, however, we need data that is freely available for reuse and structured in a useful format. In this article, I go through some of the best-known and most important portals that can be used for this purpose.
What is open data?
‘Open data’ refers to data that are freely available without restrictions from copyright, patents or other mechanisms of control. (UNICEF Data)
In this context, it is not enough to just share data publicly in hard copy reports. For data to be considered fully open, it must follow certain principles that maximize its utility:
- to be structured using classifications accepted internationally (ISO-3166 for countries);
- to use non-proprietary file formats (such as JSON or CSV);
- to be available via standards-compliant communication interfaces (such as SDMX-JSON);
- and to have appropriate metadata describing it.
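The principles above can be illustrated with a small sketch. It is only a minimal example under stated assumptions: the indicator name, the values, the license string and the metadata field names are all hypothetical placeholders, not taken from any real dataset. It shows records keyed by ISO 3166-1 alpha-3 country codes, serialized to CSV and JSON, and bundled with a metadata record. It does not attempt to produce SDMX-JSON.

```python
import csv
import io
import json

# Hypothetical records: the values are placeholders, not real statistics.
# Countries are identified by ISO 3166-1 alpha-3 codes.
records = [
    {"country": "ROU", "year": 2019, "value": 1.0},
    {"country": "KOR", "year": 2019, "value": 2.0},
]

# Metadata describing the dataset (all field names and values are
# illustrative assumptions, not a standard schema).
metadata = {
    "title": "Example indicator",
    "license": "CC-BY-4.0",
    "country_classification": "ISO 3166-1 alpha-3",
    "fields": ["country", "year", "value"],
}


def to_csv(rows, fields):
    """Serialize rows to CSV text, a non-proprietary tabular format."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fields, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


def to_json(rows, meta):
    """Bundle the data with its metadata in a single JSON document."""
    return json.dumps({"metadata": meta, "data": rows}, indent=2)


csv_text = to_csv(records, metadata["fields"])
json_text = to_json(records, metadata)
print(csv_text)
print(json_text)
```

Keeping the metadata next to the data means that a reuser can tell which classification and license apply without consulting a separate document.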
Open data is part of a larger set of movements that also includes open-source software, open educational resources, open access, open science, open government and others.
More and more, certain types of data are being considered a ‘public good’ which, when made available for use, reuse and free distribution, can lead to better policy-making, better informed decisions, value creation and citizen-centric services. This is how the Open Government Data philosophy and its set of policies appeared.
Open government is a doctrine according to which citizens should have access to governmental documents and data for effective public oversight. By making governmental data open, public institutions show transparency and accountability toward the citizens they serve.
One amazing example I have encountered comes from Seoul, South Korea, where open data has become the norm and is used for tackling real challenges that the city and its citizens are facing. In Seoul, public institutions are not the only ones using the data they collect: any business, non-profit organization or regular citizen can access the data to build upon it or simply to check it for accountability reasons.
One of the goals of the City Hall is to provide open data to its citizens so that they can use them and build upon them. And by doing so, it has contributed to the creation of a new industry, in which many startups use the data provided for developing innovative solutions to some of the challenges faced within the city.
For more information on the example from South Korea and others like it, see the video below from The Economist:
Open vs. Free vs. Online Data
Open data is data without restrictions. Free data is data that is available without cost. Usually, open data is also free of charge. But when it comes to online data, not all of it can be used for free or without restrictions. In many cases it is copyrighted, being the property of its creators, and using it requires permission or payment of a fee.
Even when data is not copyrighted, things are not perfectly clear. Consider web scraping of data from LinkedIn. In 2019, a US Court of Appeals denied LinkedIn’s request to prevent the analytics company hiQ from scraping its data. Even so, LinkedIn does not appreciate anyone trying to scrape data from its platform, and warns against it in some articles.
Where to find Open Data?
Now, let’s get to the meat of this article: where can one find open data, be it governmental or of other types? Below, I cover sources of data provided by international organizations, sources specific to certain regions (US, Europe, Latin America, Asia), and other types of sources of global relevance.
International organizations
World Bank Open Data
Through this portal, the World Bank provides free and open access to a wide range of data regarding development in countries around the globe. This comes from their belief that by providing broader access to their data, they increase transparency and accountability, and help policy makers make better informed decisions.
Users can navigate the 4593 datasets either by country and region or by indicator, organized around different sectors (agriculture, education, gender, infrastructure, environment, urban development etc.).
What makes their search portal even more valuable is that it provides access to several types of data: time series, microdata (obtained from sample surveys, censuses and administrative systems), and geospatial data.
Moreover, if you wish to get a better impression of the type of information that can be extracted from their datasets, take a look at their 191 visualizations, which cover topics such as the number of people without access to electricity, the rise of global CO2 emissions, resource depletion, access to improved water sources etc.
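For readers who prefer to pull an indicator programmatically rather than through the portal, here is a minimal sketch. It rests on assumptions that are not part of this article: that the World Bank's Indicators API (v2) is reachable at the base URL below, that `SP.POP.TOTL` is the code for total population, and that a successful response is a two-element list of page metadata followed by observations. Please check the official API documentation before relying on any of these. The sample payload uses placeholder values, not real figures.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumed base URL of the World Bank Indicators API (v2).
BASE_URL = "https://api.worldbank.org/v2"


def build_url(country, indicator, start_year, end_year):
    """Build a query URL for one country and one indicator."""
    query = urlencode(
        {"format": "json", "date": f"{start_year}:{end_year}"}, safe=":"
    )
    return f"{BASE_URL}/country/{country}/indicator/{indicator}?{query}"


def parse_series(payload):
    """Turn the assumed [metadata, observations] response into {year: value}.

    Observations with a missing value are skipped.
    """
    _meta, observations = payload
    return {
        int(obs["date"]): obs["value"]
        for obs in observations
        if obs["value"] is not None
    }


def fetch_series(country, indicator, start_year, end_year):
    """Download and parse a series (requires network access)."""
    with urlopen(build_url(country, indicator, start_year, end_year)) as resp:
        return parse_series(json.load(resp))


# Offline sample shaped like the assumed response; values are placeholders.
sample = [
    {"page": 1, "pages": 1, "total": 3},
    [
        {"date": "2019", "value": 110},
        {"date": "2018", "value": None},
        {"date": "2017", "value": 100},
    ],
]
series = parse_series(sample)
print(build_url("ROU", "SP.POP.TOTL", 2017, 2019))
print(series)
```

The parsing step is kept separate from the download step so that it can be tried offline, as done here with the sample payload.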
OECD Data
The OECD Data portal provides access to 875 databases that can be searched according to the country of interest or topic (agriculture, development, economy, education, energy, environment, finance, government, health, innovation and technology, jobs, society).
One of the portal’s benefits is that it also provides data recorded over time, in some cases going back as far as 1959. One downside is that it mostly covers data related to countries which are members of the OECD. Romania, for example, is not one of them.
And if you do not wish to download datasets yet, and just want to explore what they have in store, you can make your own queries on large databases in their data warehouse, OECD.Stat.
United Nations Data
The United Nations’ data portal was created out of the belief that statistics should be considered a public good, which can serve evidence-based policy and better informed decision-making.
The portal aims to provide free access, through a single entry point, to over 60 million data points organized in 32 large databases compiled by the UN as well as by other international agencies. Examples of source organizations are: Food and Agriculture Organization, World Health Organization, The World Bank, the OECD, International Monetary Fund etc.
The search engine allows users to look for information by larger dataset, by source of data, or by topic. Each of these has a drop-down menu which, in my opinion, makes navigation easy.
Moreover, UN Data provides access to three specialized UNSD databases through separate individual portals: UN Comtrade, Monthly Bulletin of Statistics Online (MBS Online) and the well-known Sustainable Development Goals indicators. UN Comtrade is a repository of official international trade statistics, relevant analytical tables and publications. MBS Online provides access to economic and social statistics regarding more than 200 countries and territories in the world. It contains 55 tables with over 100 indicators on a variety of subjects, recorded over 80 years.
The United Nations Global SDG Database offers access to 460 data series that illustrate the progress registered towards achieving the Sustainable Development Goals. The search on the portal can be filtered by goals and their specific targets and indicators, as well as by geographic areas (it also includes country profiles) and years (2000 to 2019).
Some other features provided by the UN Data portal include access to popular statistical tables produced as part of the UN Statistical Yearbook and statistical profiles of countries (areas) and regions.
UNICEF DATA
The UNICEF DATA portal is for those wishing to work with data specifically about children and women. Their Data Warehouse includes datasets related to topics such as child mortality, child poverty, child protection and development, education, gender, maternal, child and newborn health, migration, nutrition, transition to work and others. Again, data can also be filtered by country.
GHO data repository – World Health Organization
When it comes to data, WHO has high coverage, as it works with 194 Member States from six regions. Through the Global Health Observatory, WHO provides access to the more than 1000 indicators it monitors, which can be navigated by theme under the SDG health and health-related targets, by category, or by country. Some examples of the types of data it provides are: road traffic injuries, noncommunicable diseases and mental health, mortality from environmental pollution, tobacco control, clean cities, Health Equity Monitor etc.

United States
DATA.GOV
The US Government’s open data portal helps users navigate more than 225 079 datasets from different government agencies, which can be used together with the tools and other resources provided to conduct research, develop web and mobile applications, design data visualizations and more.
One advantage of using it is that it allows filtering data by location (on a map), topic, format, type of data (geospatial or non-geospatial), organization, organization type, bureau and publisher.
One drawback of the portal is that, even though most datasets have valid metadata, there are still some that do not have working URLs that permit download.
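The drawback mentioned above can be partly checked in code. The sketch below assumes, without confirmation from this article, that the catalog exposes a CKAN-style `package_search` action at the URL shown and that each dataset in the response carries a `resources` list with `url` fields. The sample response and its dataset titles are hypothetical. Note that the check only detects missing URLs; verifying that a URL actually works would require a request to each one.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumption: the catalog exposes a CKAN-style "package_search" action here.
SEARCH_URL = "https://catalog.data.gov/api/3/action/package_search"


def search_url(query, rows=5):
    """Build the search URL for a keyword query returning `rows` datasets."""
    return f"{SEARCH_URL}?{urlencode({'q': query, 'rows': rows})}"


def search(query, rows=5):
    """Run the search and return the decoded JSON (requires network access)."""
    with urlopen(search_url(query, rows)) as resp:
        return json.load(resp)


def split_by_download_url(search_response):
    """Separate dataset titles into those with and without any resource URL."""
    with_url, without_url = [], []
    for dataset in search_response["result"]["results"]:
        urls = [res.get("url") for res in dataset.get("resources", [])]
        target = with_url if any(urls) else without_url
        target.append(dataset["title"])
    return with_url, without_url


# Hypothetical response in the assumed shape, used to try the logic offline.
sample_response = {
    "success": True,
    "result": {
        "count": 2,
        "results": [
            {"title": "Dataset A",
             "resources": [{"url": "https://example.org/a.csv"}]},
            {"title": "Dataset B", "resources": [{"url": ""}]},
        ],
    },
}
with_url, without_url = split_by_download_url(sample_response)
print("downloadable:", with_url)
print("no download URL:", without_url)
```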
US Census Bureau
The United States Census Bureau is in charge of producing data about the American people and economy, as its primary mission is to conduct the US Census every ten years. The data that it collects is then used by policy makers at all levels: federal, state or local.
Some examples of tools that it provides access to are American FactFinder, Census Data Explorer and QuickFacts, which allow users to search and visualize data according to their interests.
Europe
EU Open Data Portal
The EU Open Data Portal provides free access to data on a broad range of subjects, such as: education, environment, economy and finance, agriculture, forestry, food, health, government and public sector, justice, energy, science and technology, transport etc. The 15 561 datasets (to date) come from all EU institutions, bodies and agencies (e.g. Eurostat, the EU’s statistical office, the Joint Research Centre, the European Investment Bank, the European Commission Directorates-General, the Environment Agency etc.).
Most data provided on the portal can be reused free of charge, for both non-commercial and commercial purposes, on the condition that the source is acknowledged. And only a small number of datasets have special conditions of reuse, as a result of the necessity to protect third-party intellectual property rights.
As a bonus, the portal also provides access to a visualization catalogue that includes a collection of visual tools, training materials (data visualization workshops and webinars which involve working with tools such as D3.js, Qlik Sense, Webtools Maps, PowerBI) and re-usable visualizations.
European Data Portal
This Portal is managed by the Publications Office of the European Union and it harvests the metadata of the Public Sector Information available on public data portals across European countries. To date, it covers 36 countries, 81 catalogues and 1 089 978 datasets, through which one can search based on categories similar to those used by the EU Open Data Portal.
Moreover, it also includes information regarding the provision of data and the benefits of re-using data.
Open Government Data websites from all EU Member States
- data.gov.be
- data.egov.bg/
- data.gov.cz/english
- portal.opendata.dk
- govdata.de
- opendata.riik.ee
- data.gov.ie
- data.gov.gr
- datos.gob.es
- data.gouv.fr
- data.gov.hr
- dati.gov.it
- data.gov.cy
- opendata.gov.lt
- data.gov.lv
- data.public.lu
- data.gov.mt
- data.overheid.nl
- data.gv.at
- danepubliczne.gov.pl
- dados.gov.pt
- data.gov.ro
- podatki.gov.si
- data.gov.sk
- avoindata.fi
- oppnadata.se
Plus the United Kingdom, which is no longer part of the EU:
Asia
ADB Data Library
The Asian Development Bank ([ADB](https://www.adb.org/)) was established in 1966 and has 68 members, of which 49 are from Asia and the Pacific region. Its Data Library has a pretty intuitive search system, through which one can browse either by topic or by country. The repository contains (to date) 234 datasets, 45 dashboards and 10 data stories. Among the topics covered are: financial sector, poverty, people, public sector governance, economics, and others.
Another interesting ADB product, which I learned about during the Bank’s recent conference on evaluation, is EVA, an AI engine that scans evaluation and other types of documents in order to identify lessons from ADB’s operations in its member countries.
South Korea Open Government Data portal
South Korea is a very good example of best practice when it comes to open data. However, their website is designed only for native speakers.
Latin America
Numbers for Development
Numbers for Development is the Inter-American Development Bank’s Open Data portal, and it showcases socio-economic indicators for the Latin America and the Caribbean region. It is built upon seven data sources: Agrimonitor (tracks agricultural policies), INTrade (trade in the region), Latin Macro Watch (macroeconomics, social issues, trade, capital flows, markets and governance), Public Management, Social Pulse (living conditions), SIMS (labor markets), Sociometro (socio-economic conditions). The search can be filtered either by country or by indicator.
Below, I have added an interesting article regarding how big and open data were previously used for social good in Latin American countries:
Open Data portals from Latin American countries
Other Open Data sources
Google Public Data Explorer
The Google Public Data Explorer is in part a search engine that facilitates access to datasets provided by international organizations (such as those covered previously in this article), national statistical offices, NGOs and research institutions. In addition, the team behind it aims to make large datasets of public interest easier to explore, visualize and communicate, even for non-technical audiences.

Besides the Google Public Data Explorer, there is also the Google Dataset Search engine, which enables its users to find datasets stored across the Web through simple keyword searches. When using it, one can apply filters related to the download format, usage rights, topics, or the date of the last update. One criterion it uses for ranking datasets in search results is the number of scholarly articles that have cited a dataset.

FiveThirtyEight
FiveThirtyEight is a very comprehensive source of high-quality data coming from the field of journalism. The topics covered include: politics, sports, science & health, economics and culture.
Kaggle
Among open data sources, Kaggle might be the best known among data scientists, due to the community that it has built around it.
Kaggle supports a variety of publication formats for datasets, but they also encourage their dataset publishers to share their data in an accessible and non-proprietary format, where possible. Among the supported file types are: CSVs, JSON, and SQLite.
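As a rough illustration of why these non-proprietary formats are convenient, the sketch below reads all three with nothing but Python's standard library. The file names, the `scores` table and the single row of data are hypothetical and are created on the fly; nothing here is downloaded from Kaggle or uses Kaggle's own tooling.

```python
import csv
import json
import sqlite3
import tempfile
from pathlib import Path


def load_csv(path):
    """Read a CSV file into a list of dicts (all values stay strings)."""
    with open(path, newline="", encoding="utf-8") as f:
        return list(csv.DictReader(f))


def load_json(path):
    """Read a JSON file into the corresponding Python objects."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)


def load_sqlite(path, table):
    """Read every row of one table; `table` must be a trusted name."""
    conn = sqlite3.connect(str(path))
    try:
        conn.row_factory = sqlite3.Row
        return [dict(row) for row in conn.execute(f"SELECT * FROM {table}")]
    finally:
        conn.close()


# Create tiny hypothetical files in a temporary directory, then load them.
with tempfile.TemporaryDirectory() as tmp_name:
    tmp = Path(tmp_name)
    (tmp / "d.csv").write_text("name,score\na,1\n", encoding="utf-8")
    (tmp / "d.json").write_text('[{"name": "a", "score": 1}]', encoding="utf-8")
    conn = sqlite3.connect(str(tmp / "d.sqlite"))
    conn.execute("CREATE TABLE scores (name TEXT, score INTEGER)")
    conn.execute("INSERT INTO scores VALUES ('a', 1)")
    conn.commit()
    conn.close()

    csv_rows = load_csv(tmp / "d.csv")
    json_rows = load_json(tmp / "d.json")
    sqlite_rows = load_sqlite(tmp / "d.sqlite", "scores")

print(csv_rows, json_rows, sqlite_rows)
```

One difference worth noticing is that CSV carries no type information, so the score comes back as text, while JSON and SQLite return it as a number.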
One big advantage of Kaggle for those who are new to Data Science is that it supports learning by creating communities around each of its datasets, in which every interested user can contribute by solving tasks related to that dataset, submitting their results, participating in discussions, and giving and receiving feedback.
DBpedia
DBpedia was built on the most commonly used infoboxes within Wikipedia, and its ontology currently contains 4 233 000 instances, of which, for example, 1 450 000 are persons and 241 000 are organizations. Its data has previously benefited companies such as Apple, Google and IBM in some of their most important artificial intelligence projects.
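Instance counts like the ones above are the kind of figure one could, in principle, retrieve with a SPARQL query. The following is a minimal sketch under assumptions that go beyond this article: that a public endpoint is available at the URL below, that it accepts `query` and `format` parameters, that persons are modeled as the class `dbo:Person`, and that results follow the standard SPARQL JSON results layout. The sample result uses a placeholder count, not a real one.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumed public SPARQL endpoint; not confirmed by the article.
ENDPOINT = "https://dbpedia.org/sparql"


def count_query(class_name):
    """Build a SPARQL query counting instances of a DBpedia ontology class."""
    return (
        "PREFIX dbo: <http://dbpedia.org/ontology/>\n"
        f"SELECT (COUNT(DISTINCT ?s) AS ?n) WHERE {{ ?s a dbo:{class_name} }}"
    )


def request_url(query):
    """Encode the query into a GET URL asking for JSON results."""
    params = {"query": query, "format": "application/sparql-results+json"}
    return f"{ENDPOINT}?{urlencode(params)}"


def parse_count(results):
    """Extract the integer bound to ?n from SPARQL JSON results."""
    binding = results["results"]["bindings"][0]
    return int(binding["n"]["value"])


def count_instances(class_name):
    """Run the count against the endpoint (requires network access)."""
    with urlopen(request_url(count_query(class_name))) as resp:
        return parse_count(json.load(resp))


# Offline sample in the SPARQL JSON results layout; 42 is a placeholder.
sample_results = {
    "head": {"vars": ["n"]},
    "results": {
        "bindings": [
            {"n": {"type": "literal",
                   "datatype": "http://www.w3.org/2001/XMLSchema#integer",
                   "value": "42"}}
        ]
    },
}
print(count_query("Person"))
print(parse_count(sample_results))
```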
UCI Machine Learning Repository
The UC Irvine Machine Learning Repository contains 557 datasets that can be used for empirical analysis of Machine Learning algorithms. It was created in 1987 and has been used by students, educators and researchers as a primary source of machine learning datasets. Among the topics covered by their most recently uploaded datasets are: Facebook Large Page-Page Network, amphibians, early stage diabetes risk prediction, bitcoin, and others. The top 5 most popular datasets since 2007 cover: classes of iris plant, predicting whether income exceeds $50K/year based on census data, using chemical analysis to determine the origin of wines, diagnosing breast cancer, and the presence of heart disease in patients.
Conclusions
While going through the above-mentioned portals, I was amazed by the wealth of information available as well as by the additional tools some of them are offering for public use. Data truly can be beautiful.
As the amounts of data that become available in the world grow bigger and bigger, I believe we have increasing chances of using them for higher purposes, and in helping shape a better world.
Thank you for reading. I hope the content was useful. And if you believe that there are other sources of Open Data worth adding which were not included, please mention them through a comment.