
2023 Predictions: What’s Next for AI Research?

Excited by the past year, we looked forward to 2023 and wondered what it would look like

This blog post was co-authored with Guy Eyal, an NLP team leader at Gong

TL;DR:

In 2022, large models achieved state-of-the-art results across a variety of tasks and domains. A significant breakthrough in natural language processing (NLP) came from training models to align with user intent and human preferences, which markedly improved generation quality. Looking ahead to 2023, we expect new methods that improve the alignment process (such as reinforcement learning with AI feedback), automatic metrics for measuring alignment effectiveness, and the emergence of personalized aligned models, even updated in an online manner. There will also likely be a focus on factuality issues, as well as on open-source tools and specialized compute resources that make aligned models feasible at industrial scale. Beyond NLP, we anticipate progress in other modalities such as audio processing, computer vision, and robotics, and in the development of multimodal models.


2022 AI Research Progress: A Year in Review

2022 was an excellent year for artificial intelligence and machine learning, with numerous large language models (LLMs) published and achieving state-of-the-art results across various benchmarks. These LLMs demonstrated superior performance through few-shot learning, surpassing smaller models that had been fine-tuned on the same tasks [1–3]. This has the potential to reduce the need for specialized, in-domain datasets. Techniques like Chain-of-Thought prompting [4] and Self-Consistency [5] also improved the reasoning capabilities of LLMs, leading to significant gains on reasoning benchmarks.
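To make the Self-Consistency idea concrete, here is a minimal sketch in Python; `sample_chain_of_thought` is a hypothetical stand-in for any stochastic LLM sampling call, not an API from [5]:

```python
from collections import Counter

def self_consistency_answer(prompt, sample_chain_of_thought, n_samples=10):
    """Sample several chain-of-thought completions and majority-vote on
    the final answers, in the spirit of Self-Consistency [5].

    `sample_chain_of_thought` is an assumed callable returning
    (reasoning_text, final_answer) from one sampled completion.
    """
    answers = []
    for _ in range(n_samples):
        _reasoning, answer = sample_chain_of_thought(prompt)  # temperature > 0
        answers.append(answer)
    # Diverse reasoning paths converging on the same answer are taken
    # as evidence that the answer is correct.
    return Counter(answers).most_common(1)[0][0]
```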

There were also notable advancements in dialogue systems resulting in more helpful, safe, and faithful models that could stay up-to-date through fine-tuning on annotated data and the use of retrieval from external knowledge sources [6–7].

In Automatic Speech Recognition (ASR), the use of an encoder-decoder transformer architecture allowed for more efficient scaling of model size, leading to a 50% reduction in word error rate on multiple ASR benchmarks without any domain adaptation [8].

Diffusion models [9–10], trained on large image datasets, made impressive strides in computer vision and sparked a new trend in AI art. Additionally, we saw the beginnings of multimodal models that use pre-trained LLMs to improve performance on tasks ranging from vision to robotics [11–13].

Finally, the release of ChatGPT [14] gave users a glimpse into the future of working with AI assistants across fields and domains.


Photo by Moritz Knöringer on Unsplash

2023 Predictions: The Year of Alignment

Excited by the past year, we looked ahead to 2023 and wondered what it would bring. Here are our thoughts:

Reinforcement Learning from Human Feedback (RLHF), an approach that aligns models with user intent and human preferences by learning from human-labeled comparisons, has become increasingly popular in recent months [15]. It shows promising results in generation quality, as can be seen by comparing the outputs of vanilla GPT-3 [16] and ChatGPT. RLHF is so effective that an instruction-tuned model can outperform a model more than 100 times its size [15].

Image source: InstructGPT [15]
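To make the RLHF recipe above concrete, here is a minimal sketch of the pairwise reward-model loss described in InstructGPT [15]; the `reward_model` interface and tensor shapes are illustrative assumptions, not OpenAI's actual code:

```python
import torch.nn.functional as F

def reward_model_loss(reward_model, chosen_ids, rejected_ids):
    """Pairwise preference loss used in RLHF-style reward modeling [15]:
    the reward of the human-preferred response should exceed the reward
    of the rejected one.

    `reward_model` is an assumed module mapping token ids to one scalar
    reward per sequence, shape (batch,).
    """
    r_chosen = reward_model(chosen_ids)
    r_rejected = reward_model(rejected_ids)
    # -log(sigmoid(r_chosen - r_rejected)), averaged over the batch
    return -F.logsigmoid(r_chosen - r_rejected).mean()
```

The trained reward model then scores policy outputs during the reinforcement-learning stage (PPO, in the case of [15]).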

Following this year's trend, we, like many others, expect alignment to remain a significant factor. However, we also predict active research on additional core capabilities that most models currently lack, which limits their applicability in many fields.

Reinforcement Learning with AI Feedback (RLAIF)

Currently, RLHF requires human-curated data. Although small compared to the pre-training data, this data requires extensive and expensive human labor. For example, OpenAI used 40 annotators to write and label 64K samples for instruction tuning [15]. An interesting and exciting alternative that we think will be utilized this year is to use other LLMs as the instructors and labelers: Reinforcement Learning with AI Feedback (RLAIF). RLAIF will enable cost reduction and fast scaling of the alignment process, as machines will do everything end to end. An interesting recent work by Anthropic [17] showed that, with good prompting, one can guide an LM to classify harmful outputs. These, in turn, are used to train the reward model needed for RLHF.
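A minimal sketch of how such AI feedback could replace the human labeling step (in the spirit of [17]); `llm` is a hypothetical text-completion callable, and the prompt wording is our own illustration:

```python
def ai_preference_label(llm, prompt, response_a, response_b):
    """Ask an LLM to pick the more helpful / less harmful response,
    producing a preference pair that can feed reward-model training
    (RLAIF, in the spirit of Constitutional AI [17]).

    `llm` is an assumed callable: str -> str.
    """
    critique_prompt = (
        "Consider the following request and two candidate responses.\n"
        f"Request: {prompt}\n"
        f"Response A: {response_a}\n"
        f"Response B: {response_b}\n"
        "Which response is more helpful and less harmful? Answer 'A' or 'B'."
    )
    verdict = llm(critique_prompt).strip().upper()
    # Return (chosen, rejected) pairs for the pairwise reward-model loss.
    if verdict.startswith("A"):
        return response_a, response_b
    return response_b, response_a
```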

Metrics For Alignment

We assume that many methods will be developed to achieve better alignment between a model's outputs and user intent. This, in turn, will further improve generation quality.

To understand which method is superior, automatic metrics should be developed alongside current human evaluation methods. This is a long-standing issue in NLP, as previous metrics have failed to correlate with human judgments.

Recently, two promising approaches have been introduced: MAUVE [18], which compares the distributions of human-written and model-generated text using divergence frontiers, and model-written evaluations [19], which use other language models to assess the quality of generated output. Further research in these areas may be a valuable direction during 2023.
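MAUVE, for instance, is already usable off the shelf. A minimal sketch using the authors' `mauve-text` package (the exact API is worth double-checking against the package documentation):

```python
# pip install mauve-text
import mauve

# Toy-sized lists for illustration; MAUVE needs hundreds of samples in practice.
human_texts = ["The cat sat on the mat.", "Rain is expected tomorrow."]
model_texts = ["A cat is sitting on a mat.", "Tomorrow will likely bring rain."]

# compute_mauve embeds both text sets with a pre-trained LM and compares
# their distributions via divergence frontiers [18].
out = mauve.compute_mauve(p_text=human_texts, q_text=model_texts)
print(out.mauve)  # closer to 1.0 means model text is distributionally closer to human text
```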

Personalized Aligned Models

Expecting a single model to be aligned with all of society makes little sense, as we are not aligned with each other. We therefore expect to see many different models aligned with different usages and users. We term these Personalized Aligned Models.

We’ll see various companies align models with their own needs, and big companies align many models with their different users. This will greatly improve the end user’s experience when using LLMs in personal assistants, internet searches, text editing, and more.

Open Source and Specialized Compute

To achieve personalization of aligned models at industry scale, two components that today exist only partially, if at all, will have to become publicly available: models that can be aligned and compute resources.

Models to be aligned: open-source models that are candidates for alignment will have to be developed, as current ones, such as Meta's OPT-IML [20], are not on par with paid APIs. Alternatively, we'll see paid APIs for model alignment: non-public models from Google / OpenAI / Cohere / AI21, with full serving options for consumers, will become available and serve as a valid business model.

Computational resources: although alignment is much cheaper than pre-training, it still requires very specialized computational resources. We therefore predict a race toward making such infrastructure accessible to the public, most likely in the cloud.

Handling Factuality Issues

The apparent fluency of LLM outputs may lead people to perceive the model as factually accurate and confident. However, a known limitation of LLMs, one that alignment has yet to solve, is their tendency to generate hallucinated content. We therefore see two important research directions that will flourish this year: outputting sources for generated text (citations) and outputting the model's confidence.

Outputting sources for a given output can be achieved in several ways. One interesting direction is to connect LLMs with text-retrieval mechanisms that ground the outputs in known sources [21]. This may also help models stay relevant, even though their training data ends at some point in the past. Another recently suggested idea is to do this in post-processing, by searching for the documents most proximal to the output [22]. While the latter will not solve hallucinations, it will make it easier for users to validate the results.
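A minimal sketch of the post-hoc attribution idea, assuming a generic `embed` function (any off-the-shelf sentence encoder) rather than the actual pipeline of [22]:

```python
import numpy as np

def attribute_output(output_text, documents, embed, top_k=3):
    """Post-hoc attribution in the spirit of [22]: retrieve the corpus
    documents closest to a generated output so users can verify it.

    `embed` is an assumed callable mapping str -> 1-D numpy vector.
    """
    q = embed(output_text)
    scored = []
    for doc in documents:
        d = embed(doc)
        # cosine similarity between the output and a candidate source
        sim = float(np.dot(q, d) / (np.linalg.norm(q) * np.linalg.norm(d)))
        scored.append((sim, doc))
    scored.sort(reverse=True)
    return scored[:top_k]  # top-k candidate sources to show as citations
```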

Recent work in other domains (ASR, for example [23]) has trained models with two outputs: token predictions and a per-token confidence score. Using similar methods, and extending the confidence score to cover the entire output, will help users take the results with the appropriate grain of salt.
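A minimal sketch of such a two-headed output layer (our own illustration; the sizes and design are assumptions, not the architecture of [23]):

```python
import torch
import torch.nn as nn

class TokenAndConfidenceHead(nn.Module):
    """Output layer with two heads per position: token logits and a
    per-token confidence score, in the spirit of confidence-aware ASR
    models such as RED-ACE [23]. Sizes here are illustrative.
    """
    def __init__(self, hidden_size=768, vocab_size=32000):
        super().__init__()
        self.token_head = nn.Linear(hidden_size, vocab_size)
        self.confidence_head = nn.Linear(hidden_size, 1)

    def forward(self, hidden_states):            # (batch, seq, hidden)
        logits = self.token_head(hidden_states)  # next-token prediction
        confidence = torch.sigmoid(self.confidence_head(hidden_states)).squeeze(-1)
        return logits, confidence                # per-token confidence in [0, 1]
```

A sequence-level confidence could then be derived, say, as the mean or minimum of the per-token scores.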

Online Alignment

As people change over time, with shifts in interests, beliefs, jobs, and family status, it makes sense that their personal assistants should adapt as well. One very promising research direction we predict is online alignment: users will be able to keep personalizing their models after deployment, with the alignment continuously updated by an online learning algorithm driven by user feedback [24].
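A minimal sketch of what such a loop could look like (all interfaces here are hypothetical; a real system would need far more care around stability, safety, and feedback quality):

```python
def online_alignment_loop(policy, reward_model, prompts, get_user_feedback):
    """Toy online-alignment loop: each deployed interaction yields
    feedback that updates the reward model, which in turn nudges the
    policy (in the spirit of iterated RLHF [24]).

    All four arguments are assumed, duck-typed interfaces.
    """
    for prompt in prompts:
        response = policy.generate(prompt)
        feedback = get_user_feedback(prompt, response)  # e.g., thumbs up / down
        if feedback is not None:
            # Fold the new preference into the reward model online, then
            # take a small policy-improvement step (e.g., one PPO update).
            reward_model.update(prompt, response, feedback)
            policy.improve(reward_model)
```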

What About Other Modalities?

We expect considerable improvements in the audio and speech recognition domains. We assume that Whisper-style models [8] will be extended to utilize unlabeled data through self-supervised pre-training (as in wav2vec 2.0 [25] and HuBERT [26]), which will significantly improve performance in challenging acoustic scenarios.

SpeechT5 [27] was an early bird, so we assume that T0-like models [28] for audio will be trained at scale (in both training data and model size), resulting in improved audio embeddings. This will enable a unified system for speech enhancement, diarization, and transcription. In the longer term, we expect auditory models to answer questions much as NLP models do: an audio segment will serve as the grounding context for the query, without an intermediate transcription step.

Multi-Modal Models

An important paradigm for the coming year will be large multimodal models. What will they look like? We suspect they may look very similar to language models: the user will prompt the model in one modality, and the model will be able to generate its output in a different modality (as in Unified-IO [29]).

Although very exciting, diffusion models [9] currently cannot classify images. This could be addressed by having them output text, similar to how we use LLMs for classification tasks today. Likewise, such models will be able to transcribe, generate, and enhance audio and video given good prompting.

What about aligning multimodal models? That is for the far future! Or, as we call it at the current pace of our field – in a few months.


Closing Thoughts

This post presents our predictions regarding the needed advances in AI research in 2023. Large models can perform a wide range of academic tasks, as shown by their impressive performance on standard benchmarks. However, their applicability still needs to improve: in real-world scenarios, these models encounter embarrassing failures (untruthful, toxic, or simply unhelpful outputs). We believe that aligning models with user needs and keeping them up to date can address many of these issues, which is why we have focused on the scalability and adaptability of the alignment process. If our hypothesis is correct, the field of generative language models will undergo significant changes soon. The potential uses of these models are vast, ranging from editing tools to domain-specific AI assistants that can automate manual labor in industries such as law, accounting, and engineering. Combining this with the predicted progress in model scale (GPT-4) and the application of the same methods to domains such as vision and audio processing promises another exciting year.

Thank you for reading! If you have any thoughts on these 2023 projections, we warmly welcome them in the comments.


References:

[1] Chowdhery, A., Narang, S., Devlin, J., Bosma, M., Mishra, G., Roberts, A., Barham, P., Chung, H. W., Sutton, C., Gehrmann, S., Schuh, P., Shi, K., Tsvyashchenko, S., Maynez, J., Rao, A., Barnes, P., Tay, Y., Shazeer, N., Prabhakaran, V., . . . Fiedel, N. (2022). PaLM: Scaling Language Modeling with Pathways. arXiv. https://doi.org/10.48550/arXiv.2204.02311

[2] Scao, T. L., Fan, A., Akiki, C., Pavlick, E., Ilić, S., Hesslow, D., Castagné, R., Luccioni, A. S., Yvon, F., Gallé, M., Tow, J., Rush, A. M., Biderman, S., Webson, A., Ammanamanchi, P. S., Wang, T., Sagot, B., Muennighoff, N., . . . Wolf, T. (2022). BLOOM: A 176B-Parameter Open-Access Multilingual Language Model. arXiv. https://doi.org/10.48550/arXiv.2211.05100

[3] Zhang, S., Roller, S., Goyal, N., Artetxe, M., Chen, M., Chen, S., Dewan, C., Diab, M., Li, X., Lin, X. V., Mihaylov, T., Ott, M., Shleifer, S., Shuster, K., Simig, D., Koura, P. S., Sridhar, A., Wang, T., & Zettlemoyer, L. (2022). OPT: Open Pre-trained Transformer Language Models. arXiv. https://doi.org/10.48550/arXiv.2205.01068

[4] Wei, J., Wang, X., Schuurmans, D., Bosma, M., Ichter, B., Xia, F., Chi, E., Le, Q., & Zhou, D. (2022). Chain of Thought Prompting Elicits Reasoning in Large Language Models. arXiv. https://doi.org/10.48550/arXiv.2201.11903

[5] Wang, X., Wei, J., Schuurmans, D., Le, Q., Chi, E., Narang, S., Chowdhery, A., & Zhou, D. (2022). Self-Consistency Improves Chain of Thought Reasoning in Language Models. arXiv. https://doi.org/10.48550/arXiv.2203.11171

[6] Thoppilan, R., De Freitas, D., Hall, J., Shazeer, N., Kulshreshtha, A., Cheng, H., Jin, A., Bos, T., Baker, L., Du, Y., Li, Y., Lee, H., Zheng, H. S., Ghafouri, A., Menegali, M., Huang, Y., Krikun, M., Lepikhin, D., Qin, J., . . . Le, Q. (2022). LaMDA: Language Models for Dialog Applications. arXiv. https://doi.org/10.48550/arXiv.2201.08239

[7] Shuster, K., Xu, J., Komeili, M., Ju, D., Smith, E. M., Roller, S., Ung, M., Chen, M., Arora, K., Lane, J., Behrooz, M., Ngan, W., Poff, S., Goyal, N., Szlam, A., Boureau, Y., Kambadur, M., & Weston, J. (2022). BlenderBot 3: a deployed conversational agent that continually learns to responsibly engage. arXiv. https://doi.org/10.48550/arXiv.2208.03188

[8] Radford, A., Kim, J. W., Xu, T., Brockman, G., McLeavey, C., & Sutskever, I. (2022). Robust Speech Recognition via Large-Scale Weak Supervision. OpenAI. https://cdn.openai.com/papers/whisper.pdf

[9] Rombach, R., Blattmann, A., Lorenz, D., Esser, P., & Ommer, B. (2021). High-Resolution Image Synthesis with Latent Diffusion Models. arXiv. https://doi.org/10.48550/arXiv.2112.10752

[10] Ramesh, A., Dhariwal, P., Nichol, A., Chu, C., & Chen, M. (2022). Hierarchical Text-Conditional Image Generation with CLIP Latents. arXiv. https://doi.org/10.48550/arXiv.2204.06125

[11] Tang, Z., Yang, Z., Wang, G., Fang, Y., Liu, Y., Zhu, C., Zeng, M., Zhang, C., & Bansal, M. (2022). Unifying Vision, Text, and Layout for Universal Document Processing. arXiv. https://doi.org/10.48550/arXiv.2212.02623

[12] Wang, P., Yang, A., Men, R., Lin, J., Bai, S., Li, Z., Ma, J., Zhou, C., Zhou, J., & Yang, H. (2022). OFA: Unifying Architectures, Tasks, and Modalities Through a Simple Sequence-to-Sequence Learning Framework. arXiv. https://doi.org/10.48550/arXiv.2202.03052

[13] Ahn, M., Brohan, A., Brown, N., Chebotar, Y., Cortes, O., David, B., Finn, C., Fu, C., Gopalakrishnan, K., Hausman, K., Herzog, A., Ho, D., Hsu, J., Ibarz, J., Ichter, B., Irpan, A., Jang, E., Ruano, R. J., Jeffrey, K., . . . Zeng, A. (2022). Do As I Can, Not As I Say: Grounding Language in Robotic Affordances. arXiv. https://doi.org/10.48550/arXiv.2204.01691

[14] ChatGPT. OpenAI. https://chat.openai.com/chat

[15] Ouyang, L., Wu, J., Jiang, X., Almeida, D., Wainwright, C. L., Mishkin, P., Zhang, C., Agarwal, S., Slama, K., Ray, A., Schulman, J., Hilton, J., Kelton, F., Miller, L., Simens, M., Askell, A., Welinder, P., Christiano, P., Leike, J., . . . Lowe, R. (2022). Training language models to follow instructions with human feedback. arXiv. https://doi.org/10.48550/arXiv.2203.02155

[16] Brown, T. B., Mann, B., Ryder, N., Subbiah, M., Kaplan, J., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., Agarwal, S., Krueger, G., Henighan, T., Child, R., Ramesh, A., Ziegler, D. M., Wu, J., Winter, C., Hesse, C., . . . Amodei, D. (2020). Language Models are Few-Shot Learners. arXiv. https://doi.org/10.48550/arXiv.2005.14165

[17] Bai, Y., Kadavath, S., Kundu, S., Askell, A., Kernion, J., Jones, A., Chen, A., Goldie, A., Mirhoseini, A., McKinnon, C., Chen, C., Olsson, C., Olah, C., Hernandez, D., Drain, D., Ganguli, D., Li, D., Perez, E., Kerr, J., . . . Kaplan, J. (2022). Constitutional AI: Harmlessness from AI Feedback. arXiv. https://doi.org/10.48550/arXiv.2212.08073

[18] Pillutla, K., Swayamdipta, S., Zellers, R., Thickstun, J., Welleck, S., Choi, Y., & Harchaoui, Z. (2021). MAUVE: Measuring the Gap Between Neural Text and Human Text using Divergence Frontiers. arXiv. https://doi.org/10.48550/arXiv.2102.01454

[19] Perez, E., Ringer, S., Lukošiūtė, K., Nguyen, K., Chen, E., Heiner, S., Pettit, C., Olsson, C., Kundu, S., Kadavath, S., Jones, A., Chen, A., Mann, B., Israel, B., Seethor, B., McKinnon, C., Olah, C., Yan, D., Amodei, D., . . . Kaplan, J. (2022). Discovering Language Model Behaviors with Model-Written Evaluations. arXiv. https://doi.org/10.48550/arXiv.2212.09251

[20] Iyer, S., Lin, X. V., Pasunuru, R., Mihaylov, T., Simig, D., Yu, P., Shuster, K., Wang, T., Liu, Q., Koura, P. S., Li, X., Pereyra, G., Wang, J., Dewan, C., Celikyilmaz, A., Zettlemoyer, L., & Stoyanov, V. (2022). OPT-IML: Scaling Language Model Instruction Meta Learning through the Lens of Generalization. arXiv. https://doi.org/10.48550/arXiv.2212.12017

[21] He, H., Zhang, H., & Roth, D. (2022). Rethinking with Retrieval: Faithful Large Language Model Inference. arXiv. https://doi.org/10.48550/arXiv.2301.00303

[22] Bohnet, B., Tran, V. Q., Verga, P., Aharoni, R., Andor, D., Soares, L. B., Eisenstein, J., Ganchev, K., Herzig, J., Hui, K., Kwiatkowski, T., Ma, J., Ni, J., Schuster, T., Cohen, W. W., Collins, M., Das, D., Metzler, D., Petrov, S., . . . Webster, K. (2022). Attributed Question Answering: Evaluation and Modeling for Attributed Large Language Models. arXiv. https://doi.org/10.48550/arXiv.2212.08037

[23] Gekhman, Z., Zverinski, D., Mallinson, J., & Beryozkin, G. (2022). RED-ACE: Robust Error Detection for ASR using Confidence Embeddings. arXiv. https://doi.org/10.48550/arXiv.2203.07172

[24] Bai, Y., Jones, A., Ndousse, K., Askell, A., Chen, A., DasSarma, N., Drain, D., Fort, S., Ganguli, D., Henighan, T., Joseph, N., Kadavath, S., Kernion, J., Conerly, T., Elhage, N., Hernandez, D., Hume, T., Johnston, S., Kravec, S., . . . Kaplan, J. (2022). Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback. arXiv. https://doi.org/10.48550/arXiv.2204.05862

[25] Baevski, A., Zhou, H., Mohamed, A., & Auli, M. (2020). wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations. arXiv. https://doi.org/10.48550/arXiv.2006.11477

[26] Hsu, W., Bolte, B., Tsai, Y., Lakhotia, K., Salakhutdinov, R., & Mohamed, A. (2021). HuBERT: Self-Supervised Speech Representation Learning by Masked Prediction of Hidden Units. arXiv. https://doi.org/10.48550/arXiv.2106.07447

[27] Ao, J., Wang, R., Zhou, L., Wang, C., Ren, S., Wu, Y., Liu, S., Ko, T., Li, Q., Zhang, Y., Wei, Z., Qian, Y., Li, J., & Wei, F. (2021). SpeechT5: Unified-Modal Encoder-Decoder Pre-Training for Spoken Language Processing. arXiv. https://doi.org/10.48550/arXiv.2110.07205

[28] Sanh, V., Webson, A., Raffel, C., Bach, S. H., Sutawika, L., Alyafeai, Z., Chaffin, A., Stiegler, A., Scao, T. L., Raja, A., Dey, M., Bari, M. S., Xu, C., Thakker, U., Sharma, S. S., Szczechla, E., Kim, T., Chhablani, G., Nayak, N., . . . Rush, A. M. (2021). Multitask Prompted Training Enables Zero-Shot Task Generalization. arXiv. https://doi.org/10.48550/arXiv.2110.08207

[29] Lu, J., Clark, C., Zellers, R., Mottaghi, R., & Kembhavi, A. (2022). Unified-IO: A Unified Model for Vision, Language, and Multi-Modal Tasks. arXiv. https://doi.org/10.48550/arXiv.2206.08916

[30] Mildenhall, B., Srinivasan, P. P., Tancik, M., Barron, J. T., Ramamoorthi, R., & Ng, R. (2020). NeRF: Representing Scenes as Neural Radiance Fields for View Synthesis. arXiv. https://doi.org/10.48550/arXiv.2003.08934

[31] Chen, C., Gao, R., Calamia, P., & Grauman, K. (2022). Visual Acoustic Matching. arXiv. https://doi.org/10.48550/arXiv.2202.06875

[32] Zhu, Z., Peng, S., Larsson, V., Xu, W., Bao, H., Cui, Z., Oswald, M. R., & Pollefeys, M. (2021). NICE-SLAM: Neural Implicit Scalable Encoding for SLAM. arXiv. https://doi.org/10.48550/arXiv.2112.12130

