While I am a full-time academic, my area of expertise has given me the privilege of working on several consultancy projects over the years. I was recently interviewed by my university’s business and innovation department to offer advice to other academics seeking to engage with industry and undertake consultancies. The process got me thinking about what tips I would offer someone undertaking consultancy projects in data science. In this post, I share my top 5 tips for any data science consultancy. They should be relevant to anyone working in data science (and not just academics). The tips follow the stages of a typical project, from initial negotiations to the final handover. Let us get going!
1: Manage Expectations
It is quite possible that you have been asked to help with a project because the CEO of a company believes that they have a lot of data and would now like to apply "machine learning/artificial intelligence/data science" (AI/ML/DS) to help their business. However, they do not have a good idea of what really needs to be done; they simply believe that AI/ML/DS will offer them a competitive edge over others, and since you are the data scientist, you can figure out the rest. (Not every company you engage with will be in this situation, but it happens far more often than you would think!) Your first task, before you take on the project, is to manage your client’s expectations. You obviously do not have to be didactic in your approach, but you must use your communication skills to engage with your client at the right level. Most real-life projects will be vaguely defined (the hype and media coverage add to this), and it is your job during project on-boarding to adequately manage your client’s expectations and be transparent and clear about what is possible.
2: Vision Development
Using the word "vision" may make it sound lofty, but this is a crucial step. If the initial negotiations are fruitful and you do take on the project, do not rush to get your hands on the data and analyse it. Before you analyse any data or write any code, arrange a workshop-style meeting with your client’s team and work out a data science project plan that aligns with the business goals of the company. For a small company, this meeting will often involve both the CEO and someone from the development team who has hands-on knowledge of the company’s data (it should answer questions such as where the data is hosted, where any data science product would be deployed, and how the project fits into the bigger scheme of things). This is also your opportunity to understand the problem you are addressing (this crucial step separates a hacker/coder from a data scientist; do not underestimate its value!).
3: Implementation Plan
The next important step is the implementation plan, which will depend on the vision for the data science product developed in the previous step. Do not make this step unduly complicated. A simple model should guide it (think of the data flow!): data input, processing, and output. The key parts to figure out are listed below, followed by a short sketch of how they might fit together:
Data Input: Where will the data come from and how will it be fed in? (The "how" should also cover the amount, type, duration, frequency, etc.)
Processing: Where will the model sit? Is the company already using a cloud service that provides data science tools, or does it run its own standalone server (an option that is getting scarcer by the day)?
Output: Where will the output of the model go? For example, will it feed an intelligent dashboard, or is it part of a bigger architecture?
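To make this concrete, here is a minimal sketch in Python of how the input/processing/output model might translate into a pipeline skeleton. Everything in it is an illustrative assumption: the function names, the in-memory stand-in data, and the placeholder output file are mine, not any real client’s setup.

```python
# Hypothetical sketch of the input -> processing -> output data-flow model.
# All names and sources here are illustrative placeholders.
from dataclasses import dataclass

import pandas as pd


@dataclass
class DataFlowPlan:
    """The three answers worked out at the planning stage."""
    input_source: str     # e.g. "mobile app events via a third-party cloud"
    processing_host: str  # e.g. "the client's existing cloud ML service"
    output_target: str    # e.g. "an intelligent dashboard"


def fetch_input(plan: DataFlowPlan) -> pd.DataFrame:
    # In a real project this reads from wherever `plan.input_source` lives
    # (database, cloud bucket, API export). A tiny in-memory frame stands
    # in for that here so the sketch runs as-is.
    return pd.DataFrame({"user_id": [1, 2, 3], "value": [10.0, 12.5, 9.8]})


def run_model(df: pd.DataFrame) -> pd.DataFrame:
    # Processing: the model sits wherever `plan.processing_host` says.
    # An identity transform stands in for the eventual model.
    return df


def publish_output(df: pd.DataFrame, plan: DataFlowPlan) -> None:
    # Output: dashboard feed, message queue, file drop, etc.
    df.to_csv("example_output.csv", index=False)  # placeholder target


if __name__ == "__main__":
    plan = DataFlowPlan(
        input_source="mobile app events (third-party cloud)",
        processing_host="client's cloud ML service",
        output_target="intelligent dashboard",
    )
    publish_output(run_model(fetch_input(plan)), plan)
```

The value of even a toy skeleton like this is that it forces the three questions above to be answered explicitly before any modelling starts.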
I remember one client wishing to use data science in their business. When we got to this stage, I found that they had a mobile application whose user data was stored in a cloud system run by a third-party provider, and, separately, a website for their clients whose data was stored in a completely different place! This was probably a poor strategic decision at an earlier design phase, when they had not envisioned that they might one day want to combine the two data streams for insights, and it was a clear bottleneck for what they could do going forward. (There are workarounds, but that is not the point here.) These are the kinds of issues you need to uncover at this stage, and that is only possible if you carefully analyse the data flow and think through the data product you will eventually deliver.
4: Implementation
Presumably due to the wide interest in machine learning and data science, new methods with fancy names appear all the time (it is not humanly possible for any one person to keep up!). Quite often, a company wants a specific machine learning algorithm developed (sometimes they have even "patented" it), yet there is typically no specific reason for that choice. Why, out of all possible approaches, would the company want a particular class of algorithm (say, neural networks)? Data science is an empirical science, and very often there is no theoretical reason why a specific method is the most appropriate for a given problem (or the theoretical reasoning is computationally intractable to work out). Instead, it is usually a matter of going through a systematic workflow guided by experience and the insights drawn along the way. I would therefore suggest that before applying any model, you always do two basic things: visualize your data in various ways to better understand it (data types, relationships, how much is missing, distributions to spot outliers, etc.), and implement the simplest model first to serve as your benchmark. There is absolutely no reason to apply a complicated or fancy algorithm unless you know it will beat a simple one. I personally always start with a simple algorithm that is easily interpretable and easily implemented. My go-to algorithm for classification is regularized (LASSO) logistic regression, and for regression it is linear regression (despite its name, logistic regression is a classification algorithm!). I use the simple algorithm as my benchmark and then check whether other algorithms offer any additional benefit. I suggest you do the same (there is no reason to apply a fancy algorithm if it has no benefit; remember [Occam’s razor](https://simple.wikipedia.org/wiki/Occam%27s_razor) and one of my favourite quotes, from Einstein: "Everything should be made as simple as possible, but no simpler").
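To illustrate, here is a minimal sketch of that workflow using scikit-learn. The synthetic dataset and the gradient-boosting comparator are illustrative assumptions; the point is the shape of the process: inspect the data, benchmark a simple interpretable model, and only then see whether anything fancier earns its keep.

```python
# Minimal sketch: benchmark a simple, interpretable model before
# reaching for anything fancier. The synthetic dataset and the
# gradient-boosting comparator are illustrative assumptions.
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

# Step 1 (not shown here): visualize the data first -- data types,
# missingness, relationships, distributions to spot outliers.

# Step 2: a simple, interpretable benchmark -- L1-regularized (LASSO)
# logistic regression.
baseline = LogisticRegression(penalty="l1", solver="liblinear", C=1.0)
baseline_score = cross_val_score(baseline, X, y, cv=5).mean()

# Only adopt a more complex model if it clearly beats the benchmark.
fancy = GradientBoostingClassifier(random_state=0)
fancy_score = cross_val_score(fancy, X, y, cv=5).mean()

print(f"LASSO logistic regression (benchmark): {baseline_score:.3f}")
print(f"Gradient boosting:                     {fancy_score:.3f}")
```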
5: Handing Over
You should always ensure that you hand over your work properly so that the company can make use of it in the future. In addition to a presentation sharing the key aspects of the project, consider writing a simple, easy-to-read document to help whoever takes your work further. Remember, too, that the model you developed is not going to be the final model; it will likely need modifications. You need to clarify how the model can be updated and which performance metrics it is evaluated against, and have the rationale for all of this clearly documented.
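As one concrete (and hedged) way to do this, assuming a Python/scikit-learn project, you could persist the trained model alongside a small metadata file recording what it is, how it was evaluated, and how to update it. The field names and placeholder values below are my own suggestion, not a standard:

```python
# Hedged sketch of a handover artifact: the trained model plus a small
# metadata file. Field names and values are illustrative placeholders.
import json
from datetime import date

import joblib
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=500, n_features=10, random_state=0)
model = LogisticRegression(max_iter=1000).fit(X, y)

joblib.dump(model, "model.joblib")

metadata = {
    "model": "LogisticRegression (scikit-learn)",
    "trained_on": str(date.today()),
    "training_data": "description of the data export used (placeholder)",
    "metrics": {"cv_accuracy": None},  # record the real CV results here
    "update_procedure": "re-train on the latest export and compare "
                        "cv_accuracy against this file before replacing "
                        "model.joblib",
}
with open("model_metadata.json", "w") as f:
    json.dump(metadata, f, indent=2)
```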
In summary: manage expectations starting from the initial negotiations; understand the business needs and develop the vision for the data product accordingly; plan the implementation with a clear understanding of the company’s existing IT infrastructure; start implementation with a simple model and use it as your benchmark; and hand over with clear documentation. I hope this was helpful!