As the field of artificial intelligence (AI) continues to evolve, we are witnessing a growing trend: the rise of Large Language Models (LLMs) specifically tailored for particular industries. These industry-specific LLMs are not just adapted to the specialized terminology and context of a given field, but also offer customized AI solutions to tackle unique challenges within that industry. For instance, in healthcare, a specialized LLM could accelerate drug research and discovery, while in finance, a corresponding model could swiftly decode complex investment strategies.
Against this backdrop, the so-called "industry large models" can essentially be understood as "extensions of general large models applied within specific industries". Two core concepts deserve emphasis here: the first is the "general large model," and the second is "industry-specific data."
The true value of general large models lies not just in their enormous parameter count, but more importantly, in their wide applicability across multiple domains. This cross-domain universality not only enhances the adaptability of the model but also generates unique capabilities as the model evolves toward becoming more "general." Therefore, training a model solely with industry-specific data is a myopic approach that fundamentally contradicts the core philosophy of general large models, which is "universality."
As for industry-specific data, there are two main ways to apply it. The first is to fine-tune a general large model on this data, or to continue its pre-training with it. The second uses prompts or external databases, leveraging the "in-context learning" capabilities of general large models to solve particular industry problems. Both approaches have their advantages and limitations, but they share the common goal of harnessing the capabilities of general large models to address industry-specific challenges more accurately.
Balancing Immediate Benefits and Long-Term Vision
General large language models exhibit significant strengths in two dimensions: "knowledge" and "capability," where knowledge can be analogized to experience and capability to intelligence quotient (IQ).
Capability as IQ: After being trained on vast amounts of data, general large models behave like entities with extraordinarily high IQs, able to rapidly understand, learn, and tackle challenges. Take mathematical problem-solving as an example: a general large model needs only a brief description of the relevant formulas and concepts to solve intricate problems, a setting termed "zero-shot learning." Given a few example problems, it can tackle even more complicated mathematical challenges, a setting known as "few-shot learning." Industry-specific large models, by contrast, have studied specialized books and problem sets in greater depth. When there is a noticeable IQ gap between two large models, such as between GPT-4 and GPT-3.5, the higher-IQ model will achieve better accuracy even with less experience.
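The zero-shot/few-shot distinction above is, mechanically, just prompt construction. The following sketch illustrates it; the prompt template and the example math problems are illustrative assumptions, not taken from any particular model's documentation.

```python
def build_prompt(question, examples=None):
    """Assemble a prompt: with no examples it is zero-shot;
    with worked examples prepended it becomes few-shot."""
    parts = []
    for q, a in (examples or []):
        parts.append(f"Q: {q}\nA: {a}")
    parts.append(f"Q: {question}\nA:")
    return "\n\n".join(parts)

# Zero-shot: the model relies only on its built-in "IQ".
zero_shot = build_prompt("Solve 3x + 5 = 20 for x.")

# Few-shot: a couple of worked examples demonstrate the format and method.
few_shot = build_prompt(
    "Solve 3x + 5 = 20 for x.",
    examples=[("Solve 2x = 8 for x.", "x = 4"),
              ("Solve x - 3 = 1 for x.", "x = 4")],
)
```

The only difference between the two settings is whether worked examples precede the target question; the model's weights are untouched in both cases.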
Knowledge as Experience: The knowledge of general large models stems from the data they have been exposed to. This means that for certain specific questions or domains, if the model hasn’t encountered related data, it can’t provide an accurate answer. For instance, when asking for the proof of "Fermat’s Last Theorem," unless the model has been exposed to pertinent academic literature, it won’t be able to elucidate the theorem’s proof accurately.
Against this backdrop, industry-specific large models can be seen as "transitional products" when the capability of general large models falls short in specific domains. For instance, in the legal field, the ChatLaw model excels in legal questioning and information retrieval tasks because it’s trained on extensive specialized legal data. However, as the capability of general large models continually advances, the need for such specialized training might diminish. In the long run, by incorporating external knowledge bases or contextual information, general large models hold the potential to employ their advanced reasoning skills to tackle intricate industry challenges, gradually replacing industry-specific large models. This trajectory is not only inevitable in the development of Artificial General Intelligence (AGI) but also represents a more fundamental approach to addressing industry challenges.
Strategies for Industry Models
Solving industry challenges using large models can be broadly categorized into three main approaches:
Building from Scratch with Blended Data
This method involves creating a new model from the ground up using a combination of general and domain-specific data. Notable examples include BloombergGPT. However, it’s essential to note that this method may require a substantial amount of data and computational resources.
Adjusting Weights on General Models
This category encompasses two strategies:
- Continuing Pre-training on General Models: Models like LawGPT exemplify this strategy, which builds upon the foundation of a general model through continued pre-training. This iterative process enhances the model’s adaptability to a specific domain.
- Instruction Tuning (SFT): This approach applies supervised fine-tuning with instruction data on top of a general model. By adjusting the general model’s weights, it can better address specific industry problems.
Both of these methods aim to adapt a general model for industry-specific needs by adjusting its weights. While these approaches are generally less resource-intensive than complete retraining, they might not consistently achieve optimal performance levels. Also, achieving significant performance improvements compared to retraining industry-specific large models might prove challenging.
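The instruction-tuning route above consumes instruction/response pairs, typically serialized as JSON lines. The sketch below shows a plausible shape for such data; the field names (`instruction`, `input`, `output`) and the legal-domain examples are illustrative assumptions, since field naming varies across SFT pipelines.

```python
import json

# Hypothetical domain SFT records: instruction/response pairs.
sft_examples = [
    {"instruction": "Summarize the key obligation in this clause.",
     "input": "The lessee shall pay rent by the fifth of each month.",
     "output": "The lessee must pay rent monthly, by the 5th."},
    {"instruction": "Classify the legal area of this question.",
     "input": "Can my employer withhold my final paycheck?",
     "output": "Labor law."},
]

# Serialize to JSONL, one record per line, as many SFT pipelines expect.
jsonl = "\n".join(json.dumps(ex, ensure_ascii=False) for ex in sft_examples)

# Round-trip check: every line parses back into the original record.
parsed = [json.loads(line) for line in jsonl.splitlines()]
```

Fine-tuning on such pairs adjusts the general model's weights toward the domain's tasks, which is exactly the trade-off discussed above: cheaper than retraining, but the gains over a purpose-built industry model are not guaranteed.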
Combining General Models with Domain Knowledge
This category includes two strategies:
- Leveraging External Domain Knowledge: This method involves augmenting a general large model with domain-specific knowledge, such as utilizing vector databases. By integrating information from domain knowledge databases, the general model’s effectiveness in generating responses to industry-specific issues can be enhanced.
- Utilizing In-Context Learning for Domain Responses: Through "in-context learning," industry-specific challenges are addressed by constructing contextually relevant prompts, which a general large model then answers directly. As context windows continue to grow, prompts can carry more domain-specific knowledge, so even an unmodified general large model can respond effectively to industry-specific problems.
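The two strategies above can be combined: retrieve relevant entries from a domain knowledge base, then place them in the prompt for in-context use. The toy sketch below illustrates the pattern; real systems use learned embeddings and a vector database, whereas here a bag-of-words vector and cosine similarity stand in for both, and the legal snippets are invented for illustration.

```python
import math
from collections import Counter

# Hypothetical domain knowledge base (stand-in for a vector database).
DOMAIN_DOCS = [
    "Article 52: A contract is void if it violates mandatory legal provisions.",
    "Article 107: A breaching party shall bear liability for breach of contract.",
]

def embed(text):
    """Stand-in for an embedding model: a bag-of-words vector."""
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query, docs, k=1):
    """Return the k documents most similar to the query."""
    q = embed(query)
    return sorted(docs, key=lambda d: cosine(q, embed(d)), reverse=True)[:k]

def build_augmented_prompt(query, docs):
    """Prepend retrieved domain knowledge so a general model can answer."""
    context = "\n".join(retrieve(query, docs))
    return f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"

prompt = build_augmented_prompt(
    "What liability does a breaching party bear?", DOMAIN_DOCS)
```

The general model's weights never change here; all domain adaptation happens in the prompt, which is why this route scales with context-window size rather than with training compute.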
In conclusion, all these approaches share the common goal of tackling industry-specific challenges, but they differ in methodology and resource requirements. Choosing the right approach should consider available data, computational capabilities, and expected performance gains. When evaluating industry-specific large models, it’s essential to differentiate between these diverse approaches, preventing misleading comparisons between methods with varying workloads and costs.
Practical Implementation Guidelines
In the realm of industry-specific large models, a crucial yet often overlooked issue is the data ratio. Specifically, practitioners frequently discover that using a large amount of industry-specific data for model fine-tuning can paradoxically diminish the model’s capabilities. One potential explanation is the disparity in quality between industry-specific data and general data. Furthermore, the data ratio (how much specialized data is mixed with general data) may not be optimal. For example, some models like BloombergGPT employ a simple 1:1 data ratio, but the efficacy of this approach for pre-training, continuous pre-training, or fine-tuning remains an open question.
My experience in developing domain-specific language models across diverse sectors, from e-commerce and office management to smart home technologies, has provided me with valuable insights. Typically, these models are around 13B in size and leverage domain-specific data on the order of 20M to 100M. I’ve found that keeping domain-specific data at roughly 10% to 20% of the total dataset during the continuous pre-training phase consistently delivers optimal results. Exceeding this ratio often compromises the model’s general capabilities, such as summarization and question answering.
Another interesting observation is that during continuous pre-training, it is beneficial to simultaneously incorporate the data later used for Supervised Fine-Tuning (SFT). This lets the model acquire more domain-specific knowledge already during the pre-training phase.
However, it’s worth noting that this 10% to 20% ratio isn’t a one-size-fits-all solution. It can vary based on the specific pre-trained model in use, as well as other factors like model size and the proportion of original data. Fine-tuning these ratios often requires empirical adjustments to achieve the desired balance between general and domain-specific capabilities.
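The mixing described above reduces to a simple sampling step when assembling the continuous pre-training corpus. The sketch below targets a 15% domain share, inside the 10% to 20% band discussed; the corpus contents and the exact 15% figure are illustrative assumptions that would need empirical tuning, as noted.

```python
import random

def build_training_mix(general_docs, domain_docs, domain_fraction=0.15, seed=0):
    """Assemble a corpus in which domain data makes up roughly
    domain_fraction of the total; general data fills the remainder."""
    rng = random.Random(seed)
    n_domain = len(domain_docs)
    # Choose a general-data count so domain docs hit the requested fraction.
    n_general = round(n_domain * (1 - domain_fraction) / domain_fraction)
    mix = list(domain_docs) + rng.sample(general_docs,
                                         min(n_general, len(general_docs)))
    rng.shuffle(mix)
    return mix

# Toy corpora standing in for real document collections.
general = [f"general-{i}" for i in range(1000)]
domain = [f"domain-{i}" for i in range(30)]

mix = build_training_mix(general, domain, domain_fraction=0.15)
ratio = sum(d.startswith("domain") for d in mix) / len(mix)
```

In practice one would weight by token count rather than document count, and sweep `domain_fraction` empirically per base model, as the surrounding text cautions.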
When it comes to Supervised Fine-Tuning (SFT), the data ratio can be more flexible, sometimes even reaching a 1:1 balance for promising results. But this flexibility is contingent on the total volume of data used in SFT. If the dataset is small, the impact of this mixed data approach could be minimal.
When devising strategies to address industry-specific challenges, it’s essential to tailor the approach based on task complexity. For simpler tasks, well-crafted prompts are often more effective, especially when combined with a vector database. This approach can solve the majority of industry problems. For moderately complex tasks, Supervised Fine-Tuning (SFT) is generally more effective, particularly when some general data is mixed in. This can address most of the remaining issues. For more complex challenges, it may be prudent to wait for general large models to further enhance their capabilities.
Conclusion
As the field of AI continues to broaden, the role of industry-specific large models becomes increasingly critical in addressing specialized demands. However, it’s crucial to understand that these specialized models are not standalone constructs. They are, in essence, specialized applications of general large models tailored to meet unique industry needs. These industry-specific iterations act as interim solutions, bridging the gaps that general models currently can’t fill. Yet, as general models continue to evolve, the need for such specialized adaptations may diminish, making way for more universally applicable solutions.
Selecting the appropriate strategy to tackle industry-specific challenges is a nuanced endeavor. Among the complexities is the often-underestimated factor of data ratio, which can profoundly influence a model’s effectiveness. For straightforward tasks, the use of carefully designed prompts can yield excellent results, whereas more complex issues may require the enhanced capabilities of evolving general large models. In this ever-changing environment, the key to developing impactful industry solutions lies in the meticulous balancing of data types, model architecture, and task intricacy.