Notes from Industry

Postmortem: A Year of Data Science Peer Review in Startups

Shay Palachy Affek
Towards Data Science
Jul 7, 2021

About a year ago I suggested two peer review processes for data science projects, outlined a structure for each — one reviewing the research phase and one reviewing model design and implementation — and positioned them within the wider scope of a data science project flow (as it is practiced in startup companies). The framework also included a list of topics, pitfalls and questions that should be reviewed.

It was not just a mental exercise. The post was written with the intent of introducing formal peer review processes into the workflow of one of the data science teams I have helped to build and have been advising over the last three years. Since the post was written, I have had the opportunity both to act as a reviewer several times (with a couple of clients) and to witness how this approach was integrated into the team's dynamics and how it was practiced.

Unsurprisingly, things did not go exactly as planned. Thus, this post is about what worked and what didn't. I have focused on the most challenging aspects of trying to get data scientists to seek review from their peers. I hope this helps others who wish to formalize peer review processes in data science.

The back of a handsome peer reviewer (Photo by Charles Deluvio on Unsplash)

Reminder: What are we reviewing?

The review processes I presented are meant to complement a workflow composed of four phases: Scoping — Research — Model Development — Productization/Deployment. Two review processes are suggested: the first for the research phase and the second for the model development phase. Both outline a list of topics to review and detail many concrete questions for each.

Peer review processes as part of a DS project workflow (source: author)

The research review focuses on research methodology and on practical and theoretical issues meant to ensure the right approach is selected before moving on to model development. Examples are approach assumptions, model composability, domain adaptation, noise and bias, and existing implementations.

The model development review focuses on practical issues with the produced models, meant to prevent problematic models from getting to production. Examples include leakage, causality, evaluation metrics, overfitting and runtime.

Key Takeaways

This is a list of key takeaways from my experience. Each is meant to overcome a specific challenge I encountered.

📅 Takeaway #1: Schedule first

Challenge: Forming a habit is hard. Integrating the process into the team's habits and culture was (and is), I think, the main challenge. This is harder than code review, because there is no easy way to implement a mandatory gateway like the one in hosted version control systems (e.g. GitHub), where merging code into a specific branch (e.g. master) requires an approval on a PR (which you cannot give yourself).

A sign that does not say “Rabbits” (Photo by Drew Beamer on Unsplash)

Habit forming is also inherently tied to how well structured your project flow is; the more structured it is, the easier it should be to place a (usually) mandatory review process as a condition for progressing from one phase to another. With less structured processes, a lot more responsibility lies with the data scientist leading a project and his manager to identify the point in time when a review is required.

Takeaway: Have the DS schedule a research review — even without a reviewer, with just the head of the team invited — as soon as a new project starts; they can postpone it later, but they are less likely to skip it. You can also structure your workflow to incorporate these review phases more formally.

🔍 Takeaway #2: Encourage all to identify review opportunities

Challenge: Responsibility lies on the reviewee. Another key obstacle on the road to successful adoption was the fact that it's the responsibility of the reviewee (the data scientist whose work is under review) to make the peer review happen, which means it's his responsibility to:

  1. Identify that a research phase (or a phase with research in it) is ongoing.
  2. Identify that the phase is close to completion.
  3. Remember that the peer review mechanism exists.
  4. Inflict upon himself the extra work of preparing for the review, and the vulnerability of exposing his work to explicit peer scrutiny.

That is a lot of extra weight on one pair of shoulders. In the two peer review practices most closely related to our area of work — academic review and code review — there is no question of when or whether to pass a review; an article must pass review to get published in an academic venue; a pull request must be reviewed and approved by a different team member for it to be pulled into the main development branch (or at least, it's easy to set up your VCS to enforce this). We have no such privilege here.
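
For contrast, enforcing the code review gate is a configuration detail. What follows is a minimal sketch (my illustration, not part of the original framework) of doing so programmatically, assuming a GitHub repository and its branch protection REST endpoint; OWNER, REPO and TOKEN are hypothetical placeholders. Nothing comparable exists for gating a research phase on a review.

    # Minimal sketch: require one approving PR review before merging into main.
    # OWNER, REPO and TOKEN are hypothetical placeholders.
    import requests

    OWNER, REPO, TOKEN = "my-org", "my-repo", "ghp_example_token"

    response = requests.put(
        f"https://api.github.com/repos/{OWNER}/{REPO}/branches/main/protection",
        headers={
            "Accept": "application/vnd.github+json",
            "Authorization": f"Bearer {TOKEN}",
        },
        json={
            "required_status_checks": None,
            "enforce_admins": True,
            # The key part: at least one approval, which cannot come from the PR author.
            "required_pull_request_reviews": {"required_approving_review_count": 1},
            "restrictions": None,
        },
    )
    response.raise_for_status()  # fails loudly if the gate was not set up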

Takeaway: While it's not their responsibility, data scientists should be encouraged to actively help their peers ask themselves whether it's time for a review; weekly or daily team meetings are a great time to help everyone notice when a project is shifting from a more research-like dynamic to development.

👯‍♂️ Takeaway #3: Reviews can use research and model dev topics

Challenge: Phase boundaries are fuzzy. When I thought about the right time to review a research iteration — or a model — I had a somewhat idealized picture in mind of how a small data science project is structured.

The process I suggested assumed a world where data science teams explicitly divide their projects into scoping, research, model development and productization phases; where, upon entering a phase, its proposed duration (or a range for it) is declared, and the deliverables and definition of done (DoD) are clear; and thus that a peer review for the end of the research phase can be easily scheduled, sometimes even in advance.

Boundary detection is usually harder than this (Photo by Héctor J. Rivas on Unsplash)

In actuality, people do not divide their work into such distinct phases (although I believe it can be useful), and the boundaries between the two types of work are often blurry.

I'd also like my framework to be usable by most small data science teams working on a mix of research-based and product-feature-based projects, regardless of the workflow they're using; again, splitting the review process into two distinct types made it less relevant for such workflows.

Takeaway: When research and model development phase boundaries are blurry, or when work inherently includes aspects of both, review sessions can draw on topics and goals from both the research review list and the model dev one.

👂 Takeaway #4: Get comfortable critiquing non-technical issues

Problem: Brainstorming solutions is more fun. When sitting through a bunch of these sessions (formal and informal ones) I noticed a peculiar phenomenon: Data scientists tend to turn research review sessions into modelling brainstorming sessions.

Sticky notes are SOTA on most brainstorming benchmarks (Photo by MING Labs on Unsplash)

While brainstorming ways to model the problem can be beneficial to the reviewee, this usually isn't the main desired benefit from the session; by the time the reviewee has asked for a research review, he should have a clear map of possible modelling approaches to the problem, with a recommendation regarding which one (or ones) is the most appropriate for the project, considering product needs, resource constraints, available data, etc. This is doubly true if we're talking about a model development review session.

Furthermore, if we examine the topics I listed, not one of them focuses on modelling. Sure, the discussion and critique on each such topic can give birth to suggestions of a new modelling approach that might help solve a newly-discovered challenge, but:

  1. Solutions also come in the form of preprocessing, feature engineering, label transformations, implementation details and other aspects of the data science practice.
  2. More importantly, the reviewer should focus on helping the reviewee find the holes, weaknesses and unmitigated risks in his suggested solution to the problem. It is the reviewee’s job to come up with solutions to those issues.

So why is it that data scientists focus on modelling suggestions? I think it might be because critique of methodology and planning is harder to communicate and convince people of, while technical aspects are safer to critique, at least for technical people.

That shouldn’t be a surprise: Technical claims feel (and often — although not always — are) more objective, dispassionate and impersonal. Critique of work patterns and research methodology can be perceived — and sometimes also delivered — as attacks on the reviewee’s character. That shouldn’t be the case. We should all be able to safely admit that we’re not perfect at doing what we do, and that we can learn from our peers in all aspects. I think being aware of this bias can help us improve the level and scope of review our projects go through.

Takeaway: Peers must become comfortable with debating and critiquing issues in planning, research methodology and product/business assumptions. They should also, in my opinion, be actively encouraged to pay attention to these aspects of the project under review.

💥🥇 Takeaway #5: Making mistakes should be allowed; finding them should be lauded

Problem: Getting critiqued is intimidating. I don't think this one is special in any way to data science, besides the unique anti-pattern of treating an approach failure (or a negative research discovery) as a personal failure — a poisonous legacy of the academic background many of us come from (more on that later).

It is, however, considerably more important in research activity than in engineering, as fundamental failures are at the heart of our profession.

Takeaway: A culture of vulnerable communication, one in which people frequently and unapologetically admit to not knowing things, and in which identifying mistakes is considered a natural and common part of the work, is crucial for promoting frequent and candid peer review.

Takeaway: Both the reviewee and the reviewer should be lauded when a review exposes crucial mistakes or unaddressed issues — it is a mutual accomplishment; it means that not only did the reviewer have the wits to identify such issues, but also that the reviewee timed the review, presented the right information and framed it correctly so as to enable said identification.

📝 Takeaway #6: Topic focus

Problem: Numerous potential topics to touch. The list I made is long and concerns many issues — so much so that it can be nigh impossible to go through them all in a single meeting, at least beyond a superficial level. I think a simple takeaway is due here:

Takeaway: The reviewee and the reviewer should agree on a small subset of issues they are especially interested in tackling — for example, bias, KPI-objective alignment and domain adaptation — and focus on them.

If you can't dedicate time to discussing this together, then having either side choose 3–4 issues they want to focus on should be good enough.

📉 Takeaway #7: Drop prep if busy

Problem: Prep time is scarce. I’ve discovered that the reviewer usually can’t find the time to prepare for the review meeting — there are always more pressing tasks at hand. Pragmatically, I think that you can do without it, and also that the above focus on a few topics can help.

Takeaway: If preparation time is scarce, just reading through the part of the checklist that concerns the issues in focus should be enough to get an idea of the kind of problems and pitfalls that relate to them.

🧪💥 Takeaway #8: Review failed research

Problem: We usually don't review failed research. Failed research is — in many if not most cases — not a failure of the researcher; it is in the nature of research activity to include numerous negative discoveries: word embeddings do not capture the right semantics for our case; sequential architectures do not seem to facilitate fundamentally different representation learning on our data; pre-trained models seem to come from too distant a domain to be used effectively; etc. These are valuable discoveries that should be used to navigate the future of the project and to decide how our resources should be spent, given the usually large number of possible research directions.

A short story by GPT-3 (Photo by Kind and Curious on Unsplash)

Unfortunately, data scientists — and often their environment as well — have a tendency to internalize the failure of an approach to solve a problem as a failure of their capability as a data scientist, although the two are well separated. And even when we do personally fail — and we do, a lot — these failures should be taken as opportunities to learn and improve our process.

For example, say we've just spent 3 weeks on a deep examination of the use we might make of unsupervised entity embedding techniques and have come up with significant negative discoveries — e.g. that these techniques fail to produce strong enough representations for our entities, or that the resulting output will be very hard to integrate into current frameworks, or that we need several orders of magnitude more data to utilize them. These discoveries, in turn, might dictate that this research direction will not be pursued for the next couple of years. This is a significant decision — if this direction turns out to be crucial to our task, a mistake here could be costly in the long run — and should thus stand upon a well-supported research discovery that we believe in.

And so, in my opinion, it makes sense to put in another hour or two to consider together how well corroborated it is, whether we could have arrived at the same conclusion with a smaller investment of time and/or resources, how far-reaching the conclusion should be, what caveats might prompt a return to this research direction in the future, and so on.

Takeaway: A review should happen after every significant research iteration — even a failed one. It should be used both to learn about methodology and to better define the scope of the negative discovery.

The good bits

Here are some things that just worked well. :)

Preparation alone did a lot of the work

Maybe not most, but a lot. I think this is actually a good thing, since the prep work (when it was done) helped the data scientists under review take the required time to consider key risks and potential challenges to their work in a more in-depth manner than they would have otherwise.

A formal outlet for peer critique

Peer critique of ideas happens anyway, of course. People react to it differently, however, depending on context. I think the meetings gave a formal outlet to peer critique; since critique was expected, and was framed as something everyone and every project goes through, it helped the reviewees avoid treating it as a sign that they did something wrong or that they were being singled out.

This, in turn, makes it possible to let your guard down and be a bit more open to the content of the critique. This can be very different from getting critiqued on your research plan or results while presenting them to the whole team, which might feel like you're being shot down when you're trying to share something cool, or like people are focusing on the bad parts. Producing productive outcomes from peer feedback is hard enough as it is.

Finally, the peer review session itself can also help the reviewee to prepare for said presentation and build up their case, preventing such unmitigated critique and its negative side effects.

People got actual useful feedback

Which is perhaps the whole point, but it is still not trivial. While this gave an outlet to some critique that would have happened anyway, and while some things were identified during preparation, it was apparent that a lot of the useful feedback people got during those sessions wouldn't have been available to them otherwise, and that it positively impacted the way they work. This was also attested to by several people who got the chance to be reviewed.

Closing Words

That's it. I hope to keep iterating on this simple-but-hopefully-useful framework for data science peer review. A lot of my work is about reviewing the data science processes of my peers at client companies. I also get to talk about this formalization itself every once in a while, using it as a chance to convince people that they should perform peer review internally, and in a structured manner, inside their data science teams.

As always, I’d love to get your take on peer review in data science, and also your critique about my suggestions. Catch me at shaypalachy.com :)


Data Science Consultant. Teacher @ Tel Aviv University's business school. CEO @ Datahack nonprofit. www.shaypalachy.com