How to Reveal Impressive Data Stories with Python

Creating a Digital Mindset for Interactive DataViz with Data-Driven Storytelling

Oğuzhan Yediel
Towards Data Science

One of the symbols of Istanbul, photo by the author

What Will I Tell You?

As I mentioned in my previous article, visualization is a key part of the data science world. The primary goal of any data science project is to create value. Regardless of the domain, data science teams touch the heart of the business and play an important role in the decisions being made. If you cannot visualize the results of a project properly, you cannot convey its value correctly. Data scientists are responsible for deciding which technologies and structures are used within the project. In meetings with stakeholders or other departments, however, you need to walk through the results of the project rather than its technical details, and insights from the data will help guide the meeting. This article describes how a visualization project should be put together, using ready-made data sets.

This article is written for visualization projects driven by analysis requests rather than for visualization systems that will be sent to production. I will explain the visualizations using basic chart types without going into too much detail. Producing complex visuals does not mean telling a good story. Revealing the patterns in your data with plots that are as simple as possible will make the project easier to follow.

The Common Approach

The common way is to keep the code in a Jupyter notebook, the data enthusiast’s main squeeze. Being able to run a notebook cell by cell is considered an advantage. However, all the code lives in one place, and it can be difficult to find the visualizations you want to present among code written in no particular order. It is often not possible to clean up the notebook you are working on, either for ‘lack of time’ or for other reasons. Presenting visualizations straight from such a notebook will not go down well in a meeting with stakeholders or other departments. The only thing that makes the common approach sustainable is to take the visualizations out of the notebook and narrate them through a presentation tool. In truth, this is a workaround that feeds your laziness, and I recommend it only in an emergency. To give a real-life example, my previous company ran into problems with its map system after some business decisions were made quickly. On the backend, a licensed cartographic service was replaced within a week by an open-source routing service called OSRM, and some statistics comparing the two services were requested. If you are in a similar situation and have only a few hours, you can build your visualizations in Jupyter and present the results in a presentation tool (or in the same notebook).

Another feature that makes the first way fast is that the visualizations are generally static. The main reason is that functions from very easy-to-use packages such as pandas, matplotlib, and seaborn are preferred for plotting.

Below are some examples of chart types drawn with the common method described above. The first two plots are static, while the last two are interactive Leaflet maps. You can find the code for the interactive ones in the Python file called imm_free_wifi_locs in the repo. For these visualizations I used a data set from the IMM Data Portal, which I will talk about a lot in this article. The data set contains the number of newly registered subscriptions to the WiFi service offered free of charge at some locations in Istanbul.

Change of monthly subscription count on the line graph
Indication of county-based subscription count on the bar graph

In addition to showing IMM’s free WiFi locations as points, the maps below also present them as hexagons with their subscription counts.

By the way, let’s talk about ‘Datapane’. Datapane is a Python library and API that makes it easy to share results with other people once you have analyzed your data. It is a popular way to share data science insights from Python, and you can build great reports and dashboards with this very easy-to-use library.

The Holistic Approach

The second way is to design the code as a project. I used JetBrains’ PyCharm for this. In a visualization study designed this way, besides keeping the code organized, you can offer a better meeting experience by presenting the outputs locally with libraries such as Streamlit. The next step of this systematic approach is to present the results to the relevant teams live, on a planned schedule. In this article, I will describe how such a project should be put together by going through the details of the second way. I will also show some of the basic visualizations I made on Istanbul Metropolitan Municipality (IMM) data sets using the ‘plotly’ library.

One of the most significant features that distinguishes the second way from the first is that the visualizations are usually dynamic and interactive. You can interact with the plots or save them as images using the toolbar that automatically appears on each generated plot.

Plotly Toolbar
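As a minimal sketch of that interactivity, the snippet below sums daily counts into monthly totals and wraps them in a Plotly line chart. The sample rows, function names, and chart labels are made up for illustration and are not taken from the project. Plotly is imported inside the chart function so that the aggregation helper also runs where Plotly is not installed.

```python
from collections import defaultdict


def monthly_totals(daily_rows):
    """Sum daily subscription counts into (YYYY-MM, total) pairs, sorted by month."""
    totals = defaultdict(int)
    for day, count in daily_rows:
        totals[day[:7]] += count  # "2020-11-03" -> "2020-11"
    return sorted(totals.items())


def build_line_figure(totals):
    """Return an interactive Plotly line chart for the monthly totals."""
    import plotly.graph_objects as go  # lazy import: only needed when a chart is built

    months = [month for month, _ in totals]
    counts = [count for _, count in totals]
    fig = go.Figure(go.Scatter(x=months, y=counts, mode="lines+markers"))
    fig.update_layout(
        title="Monthly new WiFi subscriptions",
        xaxis_title="Month",
        yaxis_title="New subscriptions",
    )
    return fig


# Made-up sample rows, not real IMM figures.
sample_rows = [("2020-10-30", 120), ("2020-11-01", 200), ("2020-11-02", 310)]
sample_monthly = monthly_totals(sample_rows)
```

Calling build_line_figure(sample_monthly).show(config={"displaylogo": False}) should open the chart with its toolbar but without the Plotly logo, and fig.write_html("chart.html") saves an interactive copy that can be shared as a file.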

IMM Open Data Portal

The positive change that started in the administration of Istanbul two years ago has increased the quality of service offered to the people of the city. As part of a series of transparent steps in the right direction, it was decided that Istanbul, one of the most beautiful cities in the world, should have an open data platform. IMM Open Data, launched in January 2020, provides access to data from IMM and its subsidiary companies. The data is free to access regardless of your intent. There are 171 data sets under 10 different categories, and the number of data sets on the portal is increasing day by day. You can also request a data set according to your field of study, interest, or need. The IMM Open Data Portal is a good resource for anyone interested in data visualization, and I suggest you practice by choosing some data sets from it.

Categories that have data sets

In the article, I used various data sets from the governance, people, environment, and mobility categories. The image below shows how many data sets there are under each category.

Lux

Before the charts, I want to talk about Lux. Lux is a Python library and application programming interface (API) that lets you examine general visualizations of any data set with a single line of code, and it simplifies the preliminary steps of data science by automating the data exploration process. Just import the library before you start your work. Using the visualizations created by Lux, you can get an overall picture of your data sets, plan your preprocessing steps, and share the visualizations with other people by saving them as ‘html’. For detailed information about Lux-API, you can check Ismael Araujo’s article.

Below you can see the Lux outputs for the daily IMM WiFi new user data.

Correlation graphs
Frequency of occurrence graphs for categorical columns

The Data Sets Used

  • Daily IMM WiFi New User Data
  • Istanbul Dam Occupancy Rates
  • Hourly Public Transport Data Set
  • Hourly Traffic Density Data Set
  • Transportation Management Center Traffic Announcement Data

Daily IMM WiFi New User Data

The first data set whose charts we will examine covers the domestic and foreign users of the Wi-Fi service provided by the Istanbul Metropolitan Municipality.

The line chart shows the monthly distribution of the total number of users registered for the free wireless Internet service. The highest number of subscriptions was reached in November 2020. The number of registered subscribers has exceeded 18 thousand.

The bar chart shows the number of subscriptions by district. According to the chart, Fatih is the district with the most registrations. One reason may be that many tourist attractions are located in this district. As can be seen from the stacked bar chart below, the share of foreign users is also highest in this district.

Istanbul Dam Occupancy Rates

The next data set contains daily changes in the occupancy rates of Istanbul’s dams. Along with the occupancy rates, there is also a column containing the amount of reserved water. It would be an even nicer data set if it included a breakdown by dam. After 2007, 2008, and 2014, the occupancy rate dropped below 20% once again in 2021, seven years later.
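The dry years mentioned above can be picked out with a simple threshold check. The following is only a sketch under assumptions: the rows are made-up (date, occupancy percentage) pairs rather than the real IMM data, and the function name is hypothetical.

```python
def dry_years(daily_rates, threshold=20.0):
    """Return the years in which the occupancy rate fell below `threshold` percent."""
    # Dates are ISO strings, so the first four characters are the year.
    years = {day[:4] for day, rate in daily_rates if rate < threshold}
    return sorted(years)


# Made-up sample rows, not real IMM figures.
sample_rates = [
    ("2014-08-01", 18.4),
    ("2014-08-02", 17.9),
    ("2019-08-01", 61.2),
    ("2021-01-05", 19.6),
]
print(dry_years(sample_rates))  # ['2014', '2021']
```

Keeping the threshold as a parameter makes it easy to try a different cut-off without touching the rest of the code.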

The monthly line graph of the reserve water amount, which naturally shows a positive correlation with the dam occupancy rate, is as follows.

I drew a bar chart by choosing random months from the autumn, winter, and summer seasons. It is noteworthy that the average occupancy rate in July was high, except in the dry years identified above.

The bar chart of monthly averages over the last 15 years is as follows. The high rate in the spring months is expected behavior.

Hourly Public Transport Data Set

Our third data set is public transport data, which is my favorite data set on the portal. It contains hourly passenger and journey data for public transportation in Istanbul. Within the project, I generally compared the statistics of January and February for 2020 and 2021. In this way, we can compare public transportation statistics from just before the start of the pandemic with those from the COVID-19 period in 2021. As a footnote, some modes of transport, such as rail systems, do not run at night.

The charts above and below include the average number of passengers and passes on a daily basis.

In the bar chart below, the shares of the total number of passengers by highway, railway, and seaway are given. Of course, the pandemic is a factor in the decline in 2021.
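The shares behind such a chart are plain percentages of the grand total. The sketch below shows the calculation on made-up passenger totals; the numbers and the function name are assumptions for illustration only.

```python
def passenger_shares(totals_by_mode):
    """Convert passenger totals per transport mode into percentage shares."""
    grand_total = sum(totals_by_mode.values())
    if grand_total == 0:
        # Avoid dividing by zero when there are no passengers at all.
        return {mode: 0.0 for mode in totals_by_mode}
    return {
        mode: round(100 * count / grand_total, 1)
        for mode, count in totals_by_mode.items()
    }


# Made-up totals, not real IMM figures.
sample_modes = {"highway": 700, "railway": 250, "seaway": 50}
sample_shares = passenger_shares(sample_modes)
print(sample_shares)  # {'highway': 70.0, 'railway': 25.0, 'seaway': 5.0}
```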

The bar chart below breaks the chart above down by line. In January 2020, about 60% of more than 100 million passengers preferred private-public buses. Looking at the railway details, the four most used lines in Istanbul do not change: Aksaray-Airport, Kabataş-Bağcılar, Marmaray, and Taksim-4.Levent. The average number of passengers on these four lines during the morning rush hour in February is visualized below. On the seaway, motorboats are preferred over city lines; however, the numbers of passengers using the two were close to each other in January 2021.

The most obvious thing in the line chart below is that the vast majority of white-collar employees have been working from home during the pandemic. The evidence is the change on the Taksim-4.Levent line, which is generally used by white-collar employees in the Mecidiyeköy, Gayrettepe, and Levent districts. In general, the average number of passengers on the given lines fell below 20 thousand in 2021 due to the pandemic.

İTÜ — Ayazağa Station, Taksim-4.Levent Line, July 2014, photo by the author

In the last chart of this part, the average hourly number of passengers on the Metrobus, Istanbul’s best-loved means of transport (just kidding), is visualized for January. In the month before the pandemic, approximately 80 thousand people used it during the evening rush hours. You can see part of that crowd in the photo below.

Zincirlikuyu Metrobus Station, January 2020, photo by the author

Hourly Traffic Density Data Set

The next data set contains Istanbul location density and traffic information on an hourly basis. The average number of vehicles on a day and hour basis is visualized in the heatmap below.
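A day-by-hour heatmap needs the hourly records averaged into a grid with one row per weekday and one column per hour. The sketch below shows one way to build that grid and hand it to Plotly. The timestamp format, sample rows, and function names are assumptions for illustration, not the project's actual code.

```python
from collections import defaultdict
from datetime import datetime

DAYS = ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"]


def day_hour_matrix(records):
    """Average vehicle counts into a 7 x 24 grid (rows: weekdays, columns: hours)."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for timestamp, vehicles in records:
        moment = datetime.strptime(timestamp, "%Y-%m-%d %H:%M:%S")
        key = (moment.weekday(), moment.hour)  # weekday(): Monday is 0
        sums[key] += vehicles
        counts[key] += 1
    # Cells without any record stay None so the heatmap shows them as gaps.
    return [
        [sums[(day, hour)] / counts[(day, hour)] if (day, hour) in counts else None
         for hour in range(24)]
        for day in range(7)
    ]


def build_heatmap(matrix):
    """Return a Plotly heatmap of the averaged grid."""
    import plotly.graph_objects as go  # lazy import: only needed when a chart is built

    fig = go.Figure(go.Heatmap(z=matrix, x=list(range(24)), y=DAYS))
    fig.update_layout(title="Average number of vehicles", xaxis_title="Hour of day")
    return fig


# Made-up sample rows: two Mondays at 18:00 and one Saturday at 03:00.
sample_traffic = [
    ("2021-01-04 18:00:00", 100),
    ("2021-01-11 18:30:00", 140),
    ("2021-01-09 03:00:00", 20),
]
sample_matrix = day_hour_matrix(sample_traffic)
```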

You can examine the heatmap in detail for the evening rush hours in the visualization below. Despite the pandemic conditions, there has not been a great decrease in car travel. It can be inferred that people prefer to use their personal vehicles instead of public transportation. A decrease is seen only on the days and hours covered by the night and weekend curfews.

On the Istanbul map, the density map visualization is as follows.

Transportation Management Center Traffic Announcement Data

Our last data set contains the instant announcements made by the Transportation Management Center (TMC) since 2018, along with their locations. I produced charts about the announcements themselves rather than location-based plots. The first chart is an interactive plot showing the distribution of the total number of announcements.

From the bar chart below, you can view the monthly distribution of Istanbul accident announcements in 2019 and 2020. The number of accidents increased in 2020 and reached 5500 in October 2020.

The next chart presents the average number of vehicle breakdowns over a two-and-a-half-year period for March, July, and October on a bar chart.

The last plot of the article is a scatter plot showing how many minutes pass on average between the start and end times of the announcements.
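The average duration is the mean difference between each announcement's end and start time. Here is a minimal sketch of that calculation, assuming the timestamps are strings in the format shown and that announcements without an end time should be skipped; the sample rows and names are made up.

```python
from datetime import datetime

TIME_FORMAT = "%Y-%m-%d %H:%M:%S"


def average_duration_minutes(announcements):
    """Return the mean gap in minutes between start and end times, or None if there is none."""
    durations = []
    for start, end in announcements:
        if not end:
            continue  # still open: no end time to compare against
        delta = datetime.strptime(end, TIME_FORMAT) - datetime.strptime(start, TIME_FORMAT)
        durations.append(delta.total_seconds() / 60)
    return sum(durations) / len(durations) if durations else None


# Made-up sample rows: 45 minutes, 15 minutes, and one open announcement.
sample_announcements = [
    ("2020-10-01 08:00:00", "2020-10-01 08:45:00"),
    ("2020-10-01 09:10:00", "2020-10-01 09:25:00"),
    ("2020-10-01 10:00:00", None),
]
print(average_duration_minutes(sample_announcements))  # 30.0
```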

About the Project and Graphs

In the article, only certain charts from the project were selected and presented via Datapane to keep them interactive. Within the project, the code for all charts was written to be as parametric as possible, so the desired charts can easily be reproduced by choosing the variables and time options you want. You can find all the code, including the charts in the article, on my GitHub page.

imm_dataviz/
|-- config.py
|-- dam_occupancy_rates_daily.py
|-- Dockerfile
|-- environment.yml
|-- LICENSE
|-- public_transport_hourly.py
|-- README.md
|-- requirements.txt
|-- traffic_announcements_instant.py
|-- traffic_density_hourly.py
|-- utils.py
|-- wifi_new_user_daily.py

Each Python file is built around the following three basic functions.

if __name__ == "__main__":
    # Enable exactly one of the three entry points.
    # main()
    putting_into_streamlit()
    # putting_into_datapane()

The main function lets you view all the charts one by one on your local machine. The putting_into_streamlit function, on the other hand, uses the Streamlit library to collect all the charts in one place and make them easier to compare. If you are going to do any visualization work or project, I strongly recommend Streamlit. The putting_into_datapane function uses the Datapane library, which I chose to present the charts interactively in this article.
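To make the three roles concrete, here is a rough skeleton of how such a file could be laid out. It is a sketch under assumptions, not the project's actual code: build_charts, the demo figure, and ENTRY_POINTS are hypothetical names, and the Datapane calls follow the Report and Plot API from around the time of writing, which may differ in other releases.

```python
def build_charts():
    """Return the charts of one module as {title: Plotly figure}."""
    import plotly.graph_objects as go  # lazy import: only needed when charts are built

    # Placeholder figure with made-up numbers; a real module would read its data set here.
    demo = go.Figure(go.Bar(x=["A", "B", "C"], y=[3, 1, 2]))
    demo.update_layout(title="Demo chart")
    return {"Demo chart": demo}


def main():
    """Open every chart one by one in the local browser."""
    for figure in build_charts().values():
        figure.show()


def putting_into_streamlit():
    """Render every chart on a single Streamlit page."""
    import streamlit as st

    st.title("IMM DataViz")
    for title, figure in build_charts().items():
        st.subheader(title)
        st.plotly_chart(figure)


def putting_into_datapane():
    """Save every chart into one standalone HTML report with Datapane."""
    import datapane as dp  # assumption: a release that still provides Report and Plot

    blocks = [dp.Plot(figure) for figure in build_charts().values()]
    dp.Report(*blocks).save(path="report.html")


# Hypothetical lookup table for choosing one entry point per run.
ENTRY_POINTS = {
    "local": main,
    "streamlit": putting_into_streamlit,
    "datapane": putting_into_datapane,
}
```

A file whose entry point is putting_into_streamlit is started with the streamlit run command followed by the file name, rather than with the plain Python interpreter.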

Notes on the Project & What’s Next?

  • The data on the portal are 45 days behind and are updated daily.
  • Streamlit will automatically pick up your operating system theme (light or dark) and change colors with your OS. After running the code, select the light theme from the top right.
  • I deliberately avoided writing common (util) functions so that each .py file can easily be treated as a separate unit.
  • Because the project was written to be as modular as possible, different charts can be drawn by changing the parameters in the config file.
  • The project is weak in terms of logging and try-except mechanisms. Logs can be added with a logging library such as Python’s built-in logging module, and exception and error handling can be added to the code.
  • On the content side:

      - New charts can be produced by combining meteorological data with dam occupancy rates or hourly public transport data.

      - New visualizations can be produced by combining the number of public transport boarding passes and ticket prices.

      - Location-based visualizations can be made from the traffic announcement data.

  • Do not hesitate to contribute as the visualization project is completely open-source.
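As a sketch of the logging and error handling mentioned in the notes above, the snippet below wraps a CSV read in try-except and reports the outcome through Python's built-in logging module. The logger name, the function name, and the file name used in the example are hypothetical.

```python
import csv
import logging

logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s %(levelname)s %(name)s: %(message)s",
)
logger = logging.getLogger("imm_dataviz")


def load_rows(csv_path):
    """Read a CSV file into a list of dicts; log the problem and return [] if it cannot be read."""
    try:
        with open(csv_path, newline="", encoding="utf-8") as handle:
            rows = list(csv.DictReader(handle))
    except (OSError, csv.Error) as exc:
        # A missing or unreadable file should not crash the whole dashboard.
        logger.error("Could not read %s: %s", csv_path, exc)
        return []
    logger.info("Loaded %d rows from %s", len(rows), csv_path)
    return rows


# A path that does not exist only produces a log line and an empty result.
missing_rows = load_rows("this_file_does_not_exist.csv")
```

Returning an empty list keeps the calling code simple, but re-raising the exception after logging it is an equally reasonable choice when a missing data set should stop the run.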

Conclusion

I presented a small-scale sample visualization engine built with libraries such as Plotly, Streamlit, and Datapane on some data sets I chose from the IMM Open Data Portal, which went live in January 2020. Thanks to the interactive data visualizations I created with Python, we gained insights into Istanbul’s data. I have tried to show you how to present a visualization project in data science with simple but effective pieces of code. I explained how analysis-based visualization projects should be presented to stakeholders or other departments and with which technologies, and I provided detailed information about using up-to-date libraries through the IMM DataViz project.
