
Melbourne Datathon 2019 Winner’s Diary


Photo by Author

Disclaimer: This is an overdue blog on the experience "WeCane" shared during our journey for the Melbourne datathon competition. The reader can expect to learn the steps I took to be successful in our submission.

Last year I had a goal: To win the Melbourne datathon competition.

My first task was to build a great team:

A great team is one where team members have the right motivation towards a common purpose, complementary skillsets and excellent communication and trust. To build a team, I needed to understand what value I bring to the table and the skills we would require as a team to win.

I am confident in my analytical thinking, problem-solving skills and business acumen, gathered through working on various projects in different industry verticals.

I knew I could structure a business problem into a technical problem and create a product wireframe with a suitable value proposition. With my research background, I was sure I could find the appropriate machine learning methodology to solve any data problem.

The complementary skills I was looking for were software development, refined visual design and data engineering, all of which would be required to build a great product.

I reached out to people I trusted and was more than lucky to have Lei Qian, Rohan Kirpekar and Anh Phuong Tran in my team. Letian Wang, Satyam Kumar and Ting Hu were also part of the team for some time and contributed to ideation, research and data engineering. I would also like to give special thanks to Deepak Kumar Singh and Akshit Vijay, who consulted with us and helped in the product development process.

Team members’ introduction:

  • Lei is a software developer with twelve years of product development experience.
  • Rohan is an experienced technical consultant with a computer science background.
  • Anh is a data analyst with a passion for data visualization.
  • Letian is a data engineer with a psychology background.
  • Satyam is an enthusiastic data analyst with many creative ideas.
  • Ting is a business analyst with problem-solving and consulting experience.

With the right team and support from well-wishing friends, I felt very confident that we had all the necessary ingredients to win.

Step two was identifying the specific problem statement and designing the solution

This part was the most exciting for me, as I like creative problem-solving. I am most excited about working with a new problem and researching possible solutions. The problem statement given to us was to build an application using spatio-temporal data (satellite data) that would help sugarcane farmers in Queensland. 95% of Australian sugar is grown in Queensland, and sugarcane is the second-highest export crop after wheat, contributing 2.5 billion dollars to the Australian economy. Growing Data and ANZ Bank were the designers of this brilliant problem statement.

I loved this problem statement, as I had worked in satellite image processing before, and revolutionizing agriculture using data science is my life’s mission.

Without much delay, our team got to work. Our first few meetings were wholly dedicated to ideation and thinking of possible solutions we could build for farmers or bankers.

We played around with various ideas, ranging from land profile monitoring to yield calculation based on weather forecasts.

It was only after conducting market research that we encountered the major problem: there are no set guidelines for agricultural financing, which results in high operational costs for banks and delays in loan approvals for farmers. Global sugar demand is expected to reach 199.6 million tons by 2024. Hence, to meet the growing demand, proper financing for sugarcane farmers is critical. Our aim was therefore to build a product that solves the financing problem.

We decided to combine two ideas, yield forecasting and loan risk estimation, to create the MVP of our application.

Photo by Author

Step three: getting our hands dirty with (dirty) data

Now we get to the juicy data science part of the article. I don’t know if you’ve ever worked with satellite data, but it is an entirely different experience from any other data format: you have pictures, you have time, and then you have clouds. Those damned clouds! We spent the better part of the first two months identifying and removing them to create a cloud-free series of images that could be used for yield forecasting.

We started out writing the cloud detection and removal code from scratch in Python, and it was not going well. While googling for fixes on Stack Overflow, we discovered the incredible Python library ‘Sentinel Hub’. That library single-handedly cut our testing and prototyping time tenfold. Before finding it, we struggled to process vast volumes of high-dimensional data on our modest laptops. We also moved to Colab’s free GPUs to reduce the time needed to experiment with different settings of the cloud detection and removal algorithms.

After some time experimenting with various techniques, we started getting good-quality, cloud-free images. The next step was to convert the images into a panel data series for modelling.
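
To give a feel for what "cloud-free" means in practice, here is a minimal sketch of one common approach: a per-pixel median over time that ignores cloudy observations. This is not the code we used in the competition (the Sentinel Hub tooling did the heavy lifting for us). The function name, array shapes and the idea of a precomputed boolean cloud mask are all assumptions made for illustration.

```python
import warnings

import numpy as np


def cloud_free_composite(images, cloud_masks):
    """Per-pixel median over time, ignoring cloudy observations.

    images:      float array, shape (time, height, width, bands)
    cloud_masks: bool array, shape (time, height, width), True = cloudy
    (Both shapes are assumptions for this sketch.)
    """
    images = np.asarray(images, dtype=float)
    cloud_masks = np.asarray(cloud_masks, dtype=bool)

    # Blank out cloudy pixels in every band so they cannot affect the median.
    masked = np.where(cloud_masks[..., np.newaxis], np.nan, images)

    # Median over the time axis, skipping NaNs. A pixel that is cloudy in
    # every image stays NaN; silence NumPy's warning for that case.
    with warnings.catch_warnings():
        warnings.simplefilter("ignore", category=RuntimeWarning)
        return np.nanmedian(masked, axis=0)
```

Applying a function like this to each time window (for example, each month) would give one clean image per window, which is the kind of series the modelling step needs.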

At the same time, we had to keep only the regions where sugarcane is harvested. So we used the code provided by Growing Data to generate sugarcane masks, and used two years of data from all bands of the Sentinel-2 satellite as input for our modelling.

To increase the accuracy of yield prediction, we divided each tile of satellite data into smaller 8×8 cubes and stored the pixel and band information along with their coordinates.
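
The splitting step could look roughly like the sketch below. It assumes each tile is a NumPy array of shape (height, width, bands) and that coordinates are pixel offsets of each cube's top-left corner. The function name and the `origin_xy` parameter are hypothetical, and our actual storage format was different.

```python
import numpy as np


def split_into_patches(tile, origin_xy=(0, 0), size=8):
    """Split a (height, width, bands) tile into size x size cubes.

    Returns a list of (x, y, patch) tuples, where (x, y) is the pixel
    coordinate of the patch's top-left corner, offset by origin_xy.
    Edge pixels that do not fill a whole patch are dropped.
    """
    tile = np.asarray(tile)
    height, width = tile.shape[:2]
    x0, y0 = origin_xy

    patches = []
    for row in range(0, height - size + 1, size):
        for col in range(0, width - size + 1, size):
            patch = tile[row:row + size, col:col + size, :]
            patches.append((x0 + col, y0 + row, patch))
    return patches
```

Keeping the coordinates next to each cube makes it possible to map a prediction back to a location on the farm later on.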

Lei was the main person working on the modelling part of the problem. She experimented with classical models such as linear regression and ARIMA, and later with decision-tree-based models such as XGBoost and random forest.

Performance of each modelling technique:

1.) ARIMA – Failed to learn the complexity of the data

2.) Spline-based linear regression – Underestimated yields and was unable to capture the complexity of the data structure

3.) XGBoost – Performed better than ARIMA and linear regression, but always underestimated

4.) Random forest – Was able to learn the complexity of the data structure, and its predicted values were close to the real values

Please see below our presentation explaining the modelling experiments and results:

We measured prediction accuracy by training the model on 2016–2017 data and testing it against actual yields measured in 2018. Once the model’s accuracy was within a reasonable margin of error, we stopped experimenting. We then used the same modelling technique to train on the 2017–2018 dataset and predict values for 2019.
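
The year-based hold-out described above can be sketched in a few lines of plain Python. The record layout (dictionaries with a "year" key) and the choice of mean absolute percentage error as the metric are assumptions for illustration; the post does not depend on any particular metric.

```python
def split_by_year(records, train_years, test_year):
    """Split records into a training set and a held-out test year."""
    train = [r for r in records if r["year"] in train_years]
    test = [r for r in records if r["year"] == test_year]
    return train, test


def mean_absolute_percentage_error(actual, predicted):
    """Average of |actual - predicted| / |actual|, as a percentage."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    errors = [abs(a - p) / abs(a) for a, p in zip(actual, predicted)]
    return 100.0 * sum(errors) / len(errors)
```

With helpers like these, the workflow is: call `split_by_year(records, {2016, 2017}, 2018)`, fit on the training part, score the 2018 predictions, and then repeat the fit on 2017–2018 to forecast 2019. Splitting by year rather than at random matters here, because a random split would let the model see data from the year it is being tested on.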

Step four: tool design and development

I won’t go into the details of this part, but I can give the basic design architecture we followed in developing the tool:

Data source (Sentinel Hub) – As mentioned before, the data comes from Sentinel Hub

Flask API (backend) – We chose Flask because all the processing and analysis was done in Python, which made integration easy

HTML, JavaScript, CSS (front end)

Heroku (deployment) – Simple deployment for prototype apps
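
For readers who have not used Flask, a backend of this shape can be very small. The sketch below is not our production code: the `/api/predict` route, the JSON payload with a `features` list, and the `predict_yield` placeholder (which simply averages its inputs) are all hypothetical stand-ins for the real model and endpoints.

```python
from flask import Flask, jsonify, request

app = Flask(__name__)


def predict_yield(features):
    """Placeholder for a trained model: returns the mean of the inputs."""
    return sum(features) / len(features)


@app.route("/api/predict", methods=["POST"])
def predict():
    # Parse the JSON body; fall back to an empty dict if it is missing.
    payload = request.get_json(silent=True) or {}
    features = payload.get("features")

    # Reject anything that is not a non-empty list of numbers.
    is_valid = (
        isinstance(features, list)
        and len(features) > 0
        and all(isinstance(v, (int, float)) for v in features)
    )
    if not is_valid:
        return jsonify(error="'features' must be a non-empty list of numbers"), 400

    return jsonify(predicted_yield=predict_yield(features))
```

The front end would then call this endpoint from JavaScript and render the result. Keeping the model behind a single function makes it easy to swap the placeholder for a real trained model without touching the route.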

To see the tool live in action, click on the link below:

https://sugarcaneapp.herokuapp.com/loan

Step five: presentation

Reaching the final stage was a great honour, given the beautiful entries from so many talented participants. I greatly admire the creativity of our fellow finalists.

The presentation was in two parts: a 5-minute video explaining the tool, and a 10-minute presentation explaining its benefits and how we built it. It was great that we got together and practised our final pitch the day before. Although there were last-minute jitters before the presentation, I was quite pleased with how flawlessly our performance went. I would once again like to thank my team members Lei, Anh and Rohan for their hard work and persistence.

https://www.youtube.com/watch?v=W9-CfppFx7g

I hope you liked this long blog and got some ideas for your own Datathon entry next year.

Finally, we are still working on making a difference in the lives of farmers, and if you’re interested, please reach out. Deepak Singh and I are working on the ‘Krishiguru’ project, and our aim is to provide insights that can be used for better decision-making in the agricultural sector.

You can reach us on the Facebook page: https://www.facebook.com/KrishiGuruIndia.

Related reading:

Datathon Guide

Melbourne Datathon Website

1st place winner blog

