Is your AI Professional Grade?

What you can do to start developing data science for professionals.

Image source: https://www.publicdomainpictures.net/en/view-image.php?image=161590&picture=gavel-and-stethoscope

Professional Grade is a term that was popularized in the early 2000s by the advertising tagline "GMC: We Are Professional Grade." Today, it is used to distinguish a product from general-use (or Consumer Grade) alternatives and to communicate that it will work better or longer in a more stressful environment where it is used more frequently or by…well, professionals.

Most of the Artificial Intelligence we employ today is powered by some sort of machine learning model developed by a human, a Data Scientist. These models are typically built using a training data set. This training data is a critical factor in determining the "intelligence" level of the resulting Artificial Intelligence application. It’s a simple fact: better training data produces better models. A corollary is: more and richer training data will produce more robust models. Let’s discuss two important components data scientists must consider when searching for the perfect training dataset: relevance and specification.

Finding a dataset with relevant content is the first critical step in a data science project. If, for example, you wish to build a chatbot to answer customer support questions, you need training data that contains sample customer support questions. And not just any set of questions; if the set of questions is too narrow or too perfect (yes, too perfect!), the resulting model will not have the necessary variability to learn from and will not perform well in the wild. In Data Science terms, we would say the resulting model is not very robust. It is best to use data that consists of a broad range of well-formed and poorly-formed questions.

Locating high-quality, relevant training data is so critical to model development today that dedicated websites now exist to source unique data assets that can be leveraged for training data. For example, a big search provider like Google offers Dataset Search, a search engine solely dedicated to finding data sources. Also, each of the Big-3 cloud vendors has its own data sharing platform that provides unique and useful data sets: Google Cloud, AWS, Azure.

While locating data relevant to the project is a critical first step, the next important step is the process of specification, which is where much of the "intelligence" in Artificial Intelligence comes from. Most training data is found in a raw format and is missing a key component. In our earlier example of the customer support chatbot, once a data scientist has a rich set of sample customer support questions, they will then need the correct answers to those questions. In Data Science lingo, we call this data labeling. Data labeling – also known as data annotation or data tagging – is the process of adding information (labels) to raw data. This labeled information is what the resulting model will attempt to predict. To add these labels to the dataset, modern data scientists use labeling programs that read in a raw, unlabeled dataset and then go through it record-by-record while the user adds labels based on what they observe in each record. Some of the better labeling tools today employ machine learning assistance that makes suggestions, using the initial human inputs as a guide, to help speed up the human labeling as it progresses through the data. Just as the Big-3 cloud vendors provide plenty of raw training data, they also provide intelligent data labeling capabilities: Google AutoML, Amazon SageMaker Ground Truth, Azure data labeling.
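The ML-assisted labeling idea described above can be sketched in a few lines. This is a minimal illustration, not any vendor's actual tool: the records, labels, and similarity rule (token overlap with previously labeled records) are all hypothetical stand-ins for what a real labeling program does with a trained model.

```python
# Minimal sketch of ML-assisted labeling: after a human provides a few seed
# labels, the tool suggests a label for each new record based on token
# overlap with already-labeled records. All data here is hypothetical.

def tokens(text):
    return set(text.lower().split())

def suggest_label(record, labeled):
    """Return the label of the most similar labeled record, or None."""
    best_label, best_score = None, 0
    for text, label in labeled:
        score = len(tokens(record) & tokens(text))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Human-provided seed labels (hypothetical customer-support questions).
labeled = [
    ("how do i reset my password", "account"),
    ("where is my order", "shipping"),
]

# The tool proposes a label; a human confirms or corrects it.
print(suggest_label("i forgot my password help", labeled))  # → account
```

In a real tool the suggestion model is retrained as the human works through the data, so suggestions improve the further the annotator gets.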

The above methods are excellent tools for creating relevant and well-specified training datasets for Consumer Grade AI applications. Unfortunately, when it comes to Professional Grade development, there are additional challenges on both the relevance and specification components that are difficult to overcome without a lot of help.

When sourcing data for a Legal or Medical AI application, data provenance needs to be a primary consideration. While there are some sample snapshots of professional data available to the public, the vast majority of the data used by professionals is maintained and curated by industry-recognized companies like LexisNexis (Legal), Thomson Reuters (Legal), Elsevier (Medical), and Wolters Kluwer (Medical), to name a few. The data assets maintained by these organizations are large and complex and have been collected and curated by legal and medical professionals over many years. Naively asking for a data dump from these data providers can often lead to problems down the road, especially when a prospective customer of the application (e.g., a lawyer or physician) asks detailed questions about the AI application that require intimate knowledge of the provenance of the data. For example, when presenting a new application to a legal professional, it is not uncommon to get questions like: what jurisdictions does this cover, what areas of the law are covered, how far back does it go, or have you accounted for bad law?

Furthermore, many Data Science applications are fueled by constant access to, or feeding of, data from these professionally recognized data sources. So, even if the data science developer gets a one-time snapshot of data from one of these sources, once in production they will need to find a way to maintain the recency and completeness of this data. This is accomplished by creating programmatic links between the AI application and the data providers via application programming interfaces, or APIs. Since this data is proprietary and an extremely valuable asset to the providers, these APIs are typically behind paywalls. Since care and feeding of the models is of primary importance, API costs need to be considered, as they can add substantially to the Total Cost of Ownership (TCO) of running and maintaining the applications.
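To make the TCO point concrete, here is a small sketch of what such a programmatic link might look like, with per-call cost tracked alongside the fetch. The endpoint, pricing, and class names are all assumptions for illustration; real provider APIs have their own authentication, schemas, and fee structures.

```python
# Hypothetical sketch: pull periodic updates from a paid provider API and
# track per-call spend toward TCO. Pricing and endpoint are assumptions.

COST_PER_CALL = 0.05  # hypothetical per-request fee behind the paywall

class ProviderFeed:
    def __init__(self, cost_per_call=COST_PER_CALL):
        self.calls = 0
        self.cost_per_call = cost_per_call

    def fetch_updates(self, since):
        # A real system would make an authenticated HTTP request here, e.g.
        # GET https://api.provider.example/v1/updates?since=<date>
        self.calls += 1
        return []  # placeholder payload

    def api_cost(self):
        return self.calls * self.cost_per_call

feed = ProviderFeed()
for day in ["2023-01-01", "2023-01-02", "2023-01-03"]:
    feed.fetch_updates(since=day)
print(f"API spend so far: ${feed.api_cost():.2f}")  # → API spend so far: $0.15
```

Even this toy version shows why refresh frequency becomes a budgeting decision: a daily pull costs roughly thirty times more per month than a monthly one.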

In the typical Consumer Grade application, when it comes to labeling data, data scientists can usually do it themselves or with help from a friend. However, legal and medical data labeling requires advanced training, knowledge, and experience. For a Professional Grade AI project that requires legal or medical expertise, a data scientist will need to partner with a lawyer or doctor to perform the labeling. To scale up Consumer Grade human labeling for large datasets, it is not uncommon to farm it out to a large, inexpensive labor pool such as Mechanical Turk. However, due to the advanced educational requirements and rarity of professionals, scaling up the human labeling process for Professional Grade data is difficult and can be quite expensive.

In addition, once a professional is located to help with the data labeling, the nuances of legal and medical data will still make it difficult to perform a simple annotation. For example, it may sound simple to read some case law and determine a judge’s ruling. However, this seemingly easy for-or-against labeling process is in fact quite complicated and requires a person with a legal background to interpret the language of the document and make a determination. As such, this data is subject to an increased risk of interpretation bias. To reduce this, a best practice is to use a consensus-based approach, such as pair-wise labeling. This process involves having multiple experts label the same document and accepting only the labels that are agreed upon by two or more experts. The extra effort produces high-quality labels, but it adds substantially to the cost and time of the project.
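The two-or-more-experts rule described above reduces to a simple majority check per document. The sketch below illustrates it with hypothetical case documents and labels; a real workflow would also route the disagreements to an adjudicator.

```python
# Minimal sketch of consensus-based labeling: keep a document's label only
# when at least two experts agree; otherwise flag it (None) for review.
from collections import Counter

def consensus(expert_labels, min_agreement=2):
    """Return the agreed-upon label, or None if no label reaches agreement."""
    label, count = Counter(expert_labels).most_common(1)[0]
    return label if count >= min_agreement else None

# Each row: for-or-against labels from three legal experts (hypothetical).
annotations = {
    "case_001": ["for", "for", "against"],       # two agree -> "for"
    "case_002": ["for", "against", "against"],   # two agree -> "against"
    "case_003": ["for", "against", "unclear"],   # no agreement -> review
}

accepted = {doc: consensus(labels) for doc, labels in annotations.items()}
print(accepted)
```

Documents that come back as None are exactly where interpretation bias is likely lurking, so they are worth the extra expert time.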

Building a Professional Grade AI application starts with locating relevant data from a reliable, industry-recognized source. It can’t just be a one-time dump; professionals expect the information provided by an application to be current and well-maintained, so connecting to these sources via APIs should be considered. Enriching the data via labeling so that it can become a training dataset for modeling requires the help of subject matter experts like lawyers or doctors. Since professional data is nuanced and subject to interpretation, beware of interpretation bias and seek consensus across multiple experts to improve label quality. Developing a Professional Grade AI application is not easy and requires extra effort from the data scientist. While developing a Consumer Grade application can typically be done by an individual, developing a Professional Grade application requires the partnership of a team: data scientists, data providers, and subject matter experts. What will be the final proof that you have made a Professional Grade AI product? Simple: that a legal or medical professional trusts and uses it in their daily work!

