
The articles
Here I’ll be talking about these two articles:
Machine learning ‘causing science crisis’
Can we trust scientific discoveries made using machine learning?
Please read them first so you can form your own point of view.
Here are more reviews:
Statistician: Machine learning is causing a "crisis in science"
Machine learning is contributing to a "reproducibility crisis" within science
Statistician Raises Red Flag about Reliability of Machine Learning | Digital Trends
What’s not wrong
Don’t get me wrong here: there are lots of important points made in the articles. I just think the reasons behind them may not be the right ones.
There’s a crisis of reproducibility in science

This is true. We are in trouble. In some sciences, only 36% of studies can be replicated. That’s not enough.
More data:

In their study they state:
Although the vast majority of researchers in our survey had failed to reproduce an experiment, less than 20% of respondents said that they had ever been contacted by another researcher unable to reproduce their work.
That’s a serious problem too: only a small number of people are trying to replicate studies, some of them are failing to reproduce the experiments, but only 20% are telling the original authors, so the chances of finding out that what you are doing is wrong are minimal.
But as you can see from the image above, the problem is happening everywhere, in almost every science. More on this later.
Machine learning is not bulletproof

If there is a crisis, it’s about how to do data science (DS) correctly. A lot of data science projects fail, and it’s not (always) because of a lack of skill or knowledge; data science projects need a clear and effective plan of attack to be successful and a methodology for running it.
Most machine learning projects live inside the DS world (in one way or another), and there are issues with the way we run the machine learning cycle inside DS.
Here are some of the most "organized" workflows for machine learning:




Did you see the problems there?
My questions:
- Are we sure we can reproduce the results we got?
- Are we testing enough?
- How are we versioning our models?
- Are we even versioning our data?
- Are we giving enough information to the readers to be able to reproduce our findings?
- Are we open to answering questions about what we are doing in machine learning and how?
I think for most organizations, the answers will be no.
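To make the first question concrete, the cheapest step toward reproducibility is pinning down every source of randomness before training. A minimal sketch (assuming NumPy is available; the `set_seed` helper is my own, not a standard API):

```python
import random

import numpy as np


def set_seed(seed: int = 42) -> None:
    """Fix the random seeds so a run can be repeated exactly."""
    random.seed(seed)
    np.random.seed(seed)


set_seed(42)
first_run = np.random.rand(3)

set_seed(42)
second_run = np.random.rand(3)

# With the same seed, both runs produce identical "random" numbers.
assert np.array_equal(first_run, second_run)
```

In a real project you would also record the library versions and hardware used, since those can change results even with fixed seeds.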
Analyzing what is wrong

Here I’ll take a look at some of the things the author mentions that I think are incorrect, and why.
A growing amount of scientific research involves using machine learning software to analyze data that has already been collected […] the answers they come up with are likely to be inaccurate or wrong because the software is identifying patterns that exist only in that data set and not the real world.
Wrong because:
- In most cases the data we have comes from the real world; are we saying here that the transactions we are using to build a forecasting model didn’t happen?
- Data can be dirty (most of the time it is), but we have many ways to improve the data we have and to detect when it’s dirty.
- Common machine learning cycles include dividing our data into training, validation and testing sets, and these "validation" or "test" datasets behave like new "real" data. And because they come from the original dataset, they can all be thought of as "real" representations of what’s happening.
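The split described in the last point can be sketched in a few lines (a toy illustration with NumPy; the `split` function and the 60/20/20 fractions are my own choices, not any particular library’s API):

```python
import numpy as np


def split(X, y, train_frac=0.6, val_frac=0.2, seed=0):
    """Shuffle once, then divide into train / validation / test sets."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))
    n_train = int(train_frac * len(X))
    n_val = int(val_frac * len(X))
    train_idx = idx[:n_train]
    val_idx = idx[n_train:n_train + n_val]
    test_idx = idx[n_train + n_val:]
    return ((X[train_idx], y[train_idx]),
            (X[val_idx], y[val_idx]),
            (X[test_idx], y[test_idx]))


X = np.arange(100).reshape(-1, 1)
y = np.arange(100)
train_set, val_set, test_set = split(X, y)
# 60 / 20 / 20 split; the held-out sets never touch training,
# so they act as "new" real data when we evaluate.
```

The key property is that the test set is held out until the very end, which is exactly what lets it stand in for unseen real-world data.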
Often these studies are not found out to be inaccurate until there’s another real big dataset that someone applies these techniques to and says ‘oh my goodness, the results of these two studies don’t overlap’
Wrong because:
- Big data does not necessarily mean better results. And this is simply how science works; it’s like saying we have a crisis in physics because my results don’t match another article’s.
- I’m not saying results aren’t wrong sometimes, but we are working on ways of trying to fool our own systems so we can be sure about that before publishing.
There is general recognition of a reproducibility crisis in science right now. I would venture to argue that a huge part of that does come from the use of machine learning techniques in science.
Wrong because:
- It’s true that almost every field in science is using machine learning to improve the way it works and its results, but I don’t think the problems are in machine learning itself: they come from overlooked data issues, poor machine learning workflows, poor testing and more. I would say it’s not about machine learning but about the way some people are doing it.
- Blaming the reproducibility crisis on machine learning and data science means not seeing the full picture.
A lot of these techniques are designed to always make a prediction. They never come back with ‘I don’t know,’ or ‘I didn’t discover anything,’ because they aren’t made to.
Wrong because:
- Yes, these systems are made for finding patterns in data, but if you do it correctly you will understand that some of those insights are meaningless.
- There are systems that do come back with "I don’t know": when we evaluate our models, we can find that they actually don’t know anything about the data.
- Machine learning is not just running an algorithm; it’s about testing algorithms, benchmarking them, validating them and more.
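A tiny example of a model that can say "I don’t know" (a sketch I made up, assuming the classifier exposes class probabilities; the 0.8 threshold is arbitrary): abstain whenever the top predicted probability is below a confidence threshold.

```python
import numpy as np


def predict_or_abstain(probs, threshold=0.8):
    """Return the class index when confident, or -1 ("I don't know") otherwise."""
    labels = probs.argmax(axis=1)
    confident = probs.max(axis=1) >= threshold
    return np.where(confident, labels, -1)


# Hypothetical class probabilities from some classifier.
probs = np.array([
    [0.95, 0.05],  # confident: class 0
    [0.55, 0.45],  # too uncertain: abstain
    [0.10, 0.90],  # confident: class 1
])
print(predict_or_abstain(probs))  # [ 0 -1  1]
```

Abstaining (or "rejection") classifiers like this are a well-studied idea; the point is that nothing forces a model to always answer.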
What’s not entirely wrong
The author said:
Can we really trust the discoveries that are currently being made using machine-learning techniques applied to large data sets? The answer in many situations is probably, ‘Not without checking,’ but work is underway on next-generation machine-learning systems that will assess the uncertainty and reproducibility of their predictions.
I think we can trust our predictions if we have a system that can tell when something is going wrong. We have tools for doing that, and data science methodologies are including more and more of them:
- Data governance;
- Data versioning;
- Code versioning;
- Continuous integration / deployment;
- Model versioning.
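To show what data and model versioning can look like at their simplest, here is a bare-bones sketch using only the Python standard library (the function names and the JSON log format are mine; real tools like DVC and MLflow do this properly): fingerprint the dataset so every model can be traced back to the exact bytes it was trained on.

```python
import hashlib
import json
from pathlib import Path


def data_version(path: str) -> str:
    """Fingerprint a dataset file by hashing its contents."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()


def log_run(data_path: str, model_name: str, metrics: dict,
            out: str = "run_log.json") -> dict:
    """Record everything needed to trace a model back to its data."""
    record = {
        "data_sha256": data_version(data_path),
        "model": model_name,
        "metrics": metrics,
    }
    Path(out).write_text(json.dumps(record, indent=2))
    return record


# Toy dataset and a hypothetical training run.
Path("dataset.csv").write_text("id,label\n1,0\n2,1\n")
record = log_run("dataset.csv", "baseline-v1", {"accuracy": 0.87})
```

If the dataset changes by a single byte, the hash changes, and you know the logged metrics no longer describe the current data.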
All the modern developments in technology are possible because we found ways of working with big and small data. There’s a long road ahead before we build bulletproof workflows, but it’s not that far away. Here I’m proposing a very simple one (disclaimer: it’s a work in progress):

What’s next?
Be careful about what you read. I’m not saying I have the truth in my hands; I’m just giving my opinion as someone who has been applying machine learning in production as a data scientist and is working on making it more reproducible, transparent and powerful.
Thanks to Genevera for bringing this important topic to the table. I’m not trying to be offensive here, just to open a conversation.