
ChatGPT's multimodal journey! A PDF of the paper is attached

Author: AMiner Scientific and Technological Intelligence Mining

ChatGPT is once again fast-forwarding the progress bar!

Recently, ChatGPT was updated to let users upload one or more images for analysis and conversation.

OpenAI illustrates it with an example: take photos of your refrigerator and pantry, and let the AI decide what to have for dinner! (A blessing for the indecisive.)

In addition, ChatGPT's mobile app is adding a speech synthesis option that, paired with the existing speech recognition capability, enables fully spoken conversations with the AI assistant.

This ChatGPT update deepens our confidence in the capabilities of large models.

This article reviews the research trajectory of GPT, from GPT-1 to GPT-4 and now to GPT-4V. What has OpenAI done right along the way? Feel free to leave a comment and discuss.

1. GPT-4V(ision) System Card

The paper presents the GPT-4V(ision) system card. GPT-4V enables users to instruct GPT-4 to analyze user-supplied image inputs. Incorporating additional modalities, such as image inputs, into large language models (LLMs) is considered a key frontier in AI research and development: multimodal LLMs can extend the impact of language-only systems, solving new tasks and providing users with novel experiences through new interfaces and capabilities. In this paper, the authors analyze the safety properties of GPT-4V. Training of GPT-4V was completed in 2022, and early-user access began in March 2023. GPT-4 is the technology behind GPT-4V's visual capabilities, and its training process is the same: the model is first pre-trained to predict the next word in a document, using a large corpus of internet text and image data along with licensed datasets, and is then fine-tuned on additional data with reinforcement learning from human feedback (RLHF) to produce outputs preferred by human trainers. Compared with text-only language models, large multimodal models introduce different limitations and expand the risk surface. GPT-4V inherits the limitations and capabilities of each modality (text and vision) while also exhibiting emergent capabilities arising from the intersection of these modalities and from the reasoning abilities of large models. The system card describes how OpenAI prepared GPT-4's vision capabilities for deployment: the period of small-scale early access, the safety lessons OpenAI drew from it, the multimodal evaluations built to study deployment readiness, key findings from expert red teaming, and the mitigations OpenAI implemented before broad release.


Link: https://www.aminer.cn/pub/6512b48a80b4415e7a3267dc/?f=toutiao

2. GPT-4 Technical Report

This paper describes the development of GPT-4, a large-scale multimodal model that accepts image and text inputs and generates text outputs. While still less capable than humans in many real-world scenarios, GPT-4 demonstrates human-level performance on a variety of professional and academic benchmarks, including scoring among the top 10 percent of test takers on a simulated bar exam. GPT-4 is a Transformer-based model pre-trained to predict the next token in a document. The post-training alignment process improves performance on measures of factual accuracy and adherence to desired behavior. A core component of the project was developing infrastructure and optimization methods that behave predictably across a wide range of scales. This allowed some aspects of GPT-4's performance to be accurately predicted from models trained with far less compute than GPT-4.
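The predictable-scaling idea above can be sketched as fitting a power law to small training runs and extrapolating. This is a minimal illustration, not the report's actual procedure, and all the (compute, loss) numbers below are made up:

```python
import numpy as np

# Hypothetical (training FLOPs, final loss) pairs from small runs.
compute = np.array([1e18, 1e19, 1e20, 1e21])
loss = np.array([3.10, 2.62, 2.21, 1.87])

# Fit loss ~ a * compute**b, i.e. log(loss) = log(a) + b * log(compute).
b, log_a = np.polyfit(np.log(compute), np.log(loss), 1)

def predict_loss(c: float) -> float:
    """Extrapolate the fitted power law to a larger compute budget."""
    return float(np.exp(log_a) * c ** b)

# Predict loss at a compute scale 100x beyond the largest fitted run.
print(predict_loss(1e23))
```

The exponent `b` comes out negative: loss decreases smoothly as compute grows, which is what makes extrapolation to the large model plausible.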


Link: https://www.aminer.cn/pub/641130e378d68457a4a2986f/?f=toutiao

3. Training language models to follow instructions with human feedback

This paper shows how human feedback can be used to train language models to follow user instructions across a wide range of tasks. Making language models larger does not by itself make them better at following a user's intent: large models can produce untruthful or unhelpful outputs because they are not aligned with the user. To address this, the method combines supervised learning with reinforcement learning from human feedback (RLHF) to fine-tune GPT-3; the resulting models are called InstructGPT. In human evaluations, outputs from the 1.3-billion-parameter InstructGPT model were preferred to those of the 175-billion-parameter GPT-3, despite having about 100 times fewer parameters. In addition, InstructGPT shows improvements in truthfulness and reductions in toxic output, with minimal performance regressions on public natural language processing datasets. Although InstructGPT still makes simple mistakes, the results show that fine-tuning with human feedback is a promising direction for making language models follow user instructions.
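The RLHF pipeline described above trains a reward model on pairwise human comparisons before using it for reinforcement learning. A minimal sketch of the standard pairwise preference loss (the model is pushed to score the human-preferred output higher); the function name and scores are illustrative, not from the paper:

```python
import math

def preference_loss(r_chosen: float, r_rejected: float) -> float:
    """Pairwise loss for reward-model training on human comparisons:
    -log(sigmoid(r_chosen - r_rejected)). Small when the reward model
    already ranks the human-preferred output higher, large otherwise."""
    diff = r_chosen - r_rejected
    return -math.log(1.0 / (1.0 + math.exp(-diff)))

# Correct ranking -> small loss; inverted ranking -> large loss.
print(preference_loss(2.0, 0.5))
print(preference_loss(0.5, 2.0))
```

Once trained this way, the reward model's scalar score serves as the reward signal for fine-tuning the policy with reinforcement learning.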


Link: https://www.aminer.cn/pub/61f50e3ad18a2b03dd0e7489/?f=toutiao

4. Evaluating Large Language Models Trained on Code

This paper introduces Codex, a GPT language model fine-tuned on publicly available code from GitHub, and investigates its ability to write Python code. A distinct production version of Codex powers GitHub Copilot. On the HumanEval benchmark, Codex solves 28.8% of the problems, while GPT-3 solves 0% and GPT-J solves 11.4%. The paper also finds that repeated sampling from the model is a surprisingly effective strategy for producing working solutions: with 100 samples per problem, the solve rate rises to 70.2%. A closer look at the model also reveals its limitations, including difficulty with docstrings describing long chains of operations and with binding operations to variables. Finally, the paper discusses the potential broader impact of deploying powerful code generation technologies, covering safety, security, and economics.
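The repeated-sampling result can be quantified with the unbiased pass@k estimator from the Codex paper: generate n samples per problem, count the c that pass, and estimate the chance that at least one of k samples works. The n and c values below are illustrative, chosen so the per-sample rate is roughly 29%:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: n = samples generated per problem,
    c = samples that pass the unit tests, k = evaluation budget.
    pass@k = 1 - C(n-c, k) / C(n, k)."""
    if n - c < k:
        return 1.0  # fewer failures than the budget: some sample must pass
    return 1.0 - comb(n - c, k) / comb(n, k)

# A ~29% per-sample success rate becomes a near-certain success
# when 100 samples per problem are allowed.
print(pass_at_k(200, 58, 1))    # ~0.29
print(pass_at_k(200, 58, 100))  # close to 1.0
```

This is why the paper reports such a large jump between the single-sample rate (28.8%) and the 100-sample rate (70.2%): sampling widely and filtering is much more effective than a single greedy attempt.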


Link: https://www.aminer.cn/pub/60e7be6891e011dcbc23b0a0/?f=toutiao

5. WebGPT: Browser-assisted question-answering with human feedback

This paper fine-tunes GPT-3 to answer long-form questions using a text-based web-browsing environment, which lets the model search and navigate the web. By setting up the task so that humans can perform it, the model can be trained with imitation learning, and answer quality can then be optimized with human feedback. To make factual accuracy easier to judge, the model must collect references while browsing to support its answers. The model is trained and evaluated on the ELI5 dataset of questions asked by Reddit users. The best model is obtained by fine-tuning with behavior cloning and then applying rejection sampling against a reward model trained to predict human preferences. Its answers are preferred by humans 56% of the time over those of human demonstrators, and 69% of the time over the highest-voted answers from Reddit.
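The rejection-sampling (best-of-n) step described above is simple to sketch: sample several candidate answers, score each with the reward model, and keep the highest-scoring one. The toy reward function below is purely hypothetical, standing in for a learned model of human preference:

```python
def best_of_n(candidates, reward_model):
    """Rejection sampling: return the candidate the reward model
    scores highest. In WebGPT-style setups the reward model is a
    network trained on human preference comparisons."""
    return max(candidates, key=reward_model)

def toy_reward(answer: str) -> float:
    """Hypothetical stand-in reward: prefers longer answers that
    cite a reference marker."""
    return len(answer) + (10.0 if "[1]" in answer else 0.0)

answers = ["Short guess.", "A fuller answer with a reference [1]."]
print(best_of_n(answers, toy_reward))
```

Best-of-n needs no gradient updates to the policy at all, which is why it pairs well with behavior cloning: the cloned model proposes, the reward model disposes.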


Link: https://www.aminer.cn/pub/61bff4285244ab9dcb79c82c/?f=toutiao

6. Language Models are Few-Shot Learners

This paper studies how language models perform at few-shot learning. Substantial gains on many natural language processing tasks and benchmarks have come from pre-training on large text corpora and then fine-tuning on a specific task. However, this approach still requires thousands or tens of thousands of task-specific examples, in stark contrast to humans. This paper shows that scaling up language models greatly improves task-agnostic, few-shot performance, sometimes even reaching competitiveness with prior state-of-the-art fine-tuning approaches. Specifically, the paper trains GPT-3, an autoregressive language model with 175 billion parameters, 10 times more than any previous non-sparse language model, and tests its performance in the few-shot setting. For all tasks, GPT-3 is applied without any gradient updates or fine-tuning, with tasks specified purely through text interaction and a few demonstrations. GPT-3 achieves strong performance on many natural language processing datasets, including translation, question answering, and cloze tasks, as well as tasks requiring on-the-fly reasoning or domain adaptation, such as unscrambling words, using a novel word in a sentence, and performing 3-digit arithmetic. At the same time, the authors identify some datasets where GPT-3's few-shot learning still struggles, and some datasets where GPT-3 faces methodological issues related to training on large web corpora. Finally, they find that GPT-3 can generate news articles that human evaluators have difficulty distinguishing from articles written by humans, and discuss the broader societal impacts of this finding and of GPT-3 in general.
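"Specifying tasks purely through text interaction and a few demonstrations" just means assembling worked examples into the prompt. A minimal sketch of few-shot prompt construction; the Q/A format and the word-unscrambling demos are illustrative, not the paper's exact prompt template:

```python
def few_shot_prompt(demos, query):
    """Build a few-shot prompt: k worked examples followed by the
    new input, so the task is defined in-context with no gradient
    updates or fine-tuning."""
    lines = [f"Q: {q}\nA: {a}" for q, a in demos]
    lines.append(f"Q: {query}\nA:")
    return "\n\n".join(lines)

demos = [("unscramble 'tca'", "cat"), ("unscramble 'odg'", "dog")]
print(few_shot_prompt(demos, "unscramble 'hsif'"))
```

The model then continues the text after the final "A:", and the demonstrations alone tell it what task is being asked.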


Link: https://www.aminer.cn/pub/5ed0e04291e011915d9e43ee/?f=toutiao

7. Assessment of Empirical Troposphere Model GPT3 Based on NGL's Global Troposphere Products

This paper evaluates the performance of the GPT3 empirical troposphere model (Global Pressure and Temperature 3, unrelated to OpenAI's GPT-3) against the Nevada Geodetic Laboratory's (NGL) global troposphere products. GPT3 is the latest version of the GPT family of empirical models, used here to predict troposphere parameters for more than 16,000 sites worldwide. Given the large dataset, long time span, and wide site distribution, the study uses mean bias (BIAS) and root-mean-square error (RMS) as indicators to analyze the spatiotemporal characteristics of the model. The experimental results show that: (1) NGL's troposphere products match the accuracy of IGS products and can be used to evaluate empirical troposphere models. (2) The global mean bias of ZTD (Zenith Total Delay) predicted by GPT3 is -0.99 cm and the global mean RMS error is 4.41 cm; the model's accuracy is closely related to latitude and ellipsoidal height and shows significant seasonal variation. (3) For the north and east gradients, the global mean RMS errors predicted by GPT3 are 0.77 mm and 0.73 mm, respectively; the two are highly correlated and vary gradually from the equator toward low and high latitudes.
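The two accuracy indicators used above are straightforward to compute. A minimal sketch with hypothetical ZTD values in centimeters (for illustration only, not data from the study):

```python
import numpy as np

def bias_and_rms(predicted, reference):
    """Mean bias (BIAS) and root-mean-square error (RMS) between
    model-predicted values and reference troposphere products."""
    diff = np.asarray(predicted, dtype=float) - np.asarray(reference, dtype=float)
    return float(diff.mean()), float(np.sqrt((diff ** 2).mean()))

# Hypothetical ZTD values (cm) at four epochs for one site.
pred = [240.1, 238.7, 241.9, 239.4]
ref = [241.0, 239.9, 242.5, 240.8]
bias, rms = bias_and_rms(pred, ref)
print(bias, rms)
```

BIAS captures systematic over- or under-estimation (sign matters), while RMS captures overall dispersion; reporting both, as the study does, separates the two error sources.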


Link: https://www.aminer.cn/pub/5efdb2d09fced0a24b637f77/?f=toutiao

8. Learning to summarize from human feedback

This paper examines how training and evaluation data and methods for specific tasks hold up as language models become increasingly powerful. For example, summarization models are often trained to predict human reference summaries and evaluated with metrics like ROUGE, but these metrics are only rough proxies for the summary quality we actually care about. In this paper, the authors show that summary quality can be significantly improved by training a model to optimize for human preferences. They collect a large, high-quality dataset of human comparisons between summaries, use it to train a model to predict human preferences, and then use that model as a reward function to fine-tune a summarization policy with reinforcement learning. They apply the method to the TL;DR dataset of Reddit posts and find that their models significantly outperform both the human reference summaries and much larger models fine-tuned with supervised learning alone. Their models also transfer to CNN/DM news articles, producing summaries nearly as good as the human references without any news-specific fine-tuning. Extensive analyses of the dataset and fine-tuned models establish that the reward model generalizes to new datasets, and that optimizing the reward model yields better summaries, as judged by humans, than optimizing ROUGE. The authors hope this evidence motivates machine learning researchers to pay closer attention to how their training losses relate to the behavior they actually want.


Link: https://www.aminer.cn/pub/5f5356f991e0110c40a7bc3b/?f=toutiao

9. The Radicalization Risks of GPT-3 and Advanced Neural Language Models

This paper explores the radicalization risks of GPT-3 and advanced neural language models. Experimenting with prompts drawn from extremist narratives, social interactions, and radical ideologies, the authors found that GPT-3 is significantly better than GPT-2 at generating extremist text. They also demonstrate GPT-3's ability to accurately emulate interactive, informational, and influential content that could be used to radicalize individuals and draw them into violent extremist ideologies and behaviors. Despite OpenAI's strong safeguards, unregulated copycat technologies could enable online radicalization and recruitment at massive scale; in the absence of safeguards, successful and efficient weaponization requiring little experimentation is likely. AI stakeholders, policymakers, and governments should invest as soon as possible in social norms, public policy, and educational programs to preempt an influx of machine-generated disinformation and propaganda. Mitigation will require effective policy and cooperation among industry, government, and civil society.


Link: https://www.aminer.cn/pub/5f61dc3891e011fae8fd69dc/?f=toutiao

10. Can GPT-3 Pass a Writer’s Turing Test?

This paper surveys developments in natural language generation, culminating in large-scale statistical language models such as GPT-3. Earlier techniques relied on formal grammar systems, small statistical models, and large sets of heuristic rewriting rules. These older technologies were limited and error-prone, producing short, incoherent text or sustaining dialogue with humans only on narrow topics. Large-scale statistical language models have dramatically advanced the field in recent years, and GPT-3 is a prime example: it learns the regularities of language through massive exposure, without explicit programming or hand-written rules. Like a human child, GPT-3 acquires language from exposure alone, albeit at far greater scale. Having no explicit rules, it sometimes performs poorly on the simplest language tasks, yet can excel at more complex ones, such as imitating an author's style or holding a philosophical discussion.


Link: https://www.aminer.cn/pub/5fc0d8f09e795e733881396e/?f=toutiao

11. Fine-Tuning Language Models from Human Preferences

This paper examines how human preferences can be applied to natural language processing tasks to improve the behavior of language models. The authors build reward models from human judgments and apply reinforcement learning to four tasks: two stylistic continuation tasks (continuing text with positive sentiment, and with physically descriptive language) and summarization on the TL;DR and CNN/Daily Mail datasets. For the sentiment continuation task, a model trained with 5,000 human-evaluated comparisons achieves good results. For summarization, a model trained with 60,000 comparisons learns to copy whole sentences from the input while skipping irrelevant preamble; this yields reasonable ROUGE scores and very good ratings from human labelers, but may simply exploit the fact that labelers rely on simple heuristics.


Link: https://www.aminer.cn/pub/5d835fd63a55ac583ecde807/?f=toutiao

12. Language Models are Unsupervised Multitask Learners

Natural language processing tasks such as question answering, machine translation, reading comprehension, and summarization are typically approached with supervised learning on task-specific datasets. This paper demonstrates that when a language model is trained on a new dataset of millions of webpages called WebText, it begins to learn these tasks without any explicit supervision. Conditioned on a document plus questions, the language model reaches 55 F1 on the CoQA dataset, matching or exceeding three of the four baseline systems without using the 127,000+ training examples. The authors also find that the capacity of the language model is essential to successful zero-shot task transfer, and that increasing capacity improves performance across tasks in a log-linear fashion. The largest model, GPT-2, is a 1.5-billion-parameter Transformer that achieves state-of-the-art results on 7 out of 8 tested language modeling datasets in a zero-shot setting, yet still underfits WebText. These findings suggest a promising path toward natural language processing systems that learn tasks from naturally occurring demonstrations.


Link: https://www.aminer.cn/pub/5f8eab549e795e9e76f6f69e/?f=toutiao

13. Improving Language Understanding by Generative Pre-Training

This paper addresses how to improve natural language understanding. Although large corpora of unlabeled text are abundant, labeled data for learning specific tasks is scarce, making it difficult for discriminatively trained models to perform well. To solve this problem, the paper proposes generative pre-training of a language model on a diverse corpus of unlabeled text, followed by discriminative fine-tuning on each specific task. Unlike previous approaches, the method uses task-aware input transformations during fine-tuning to achieve effective transfer while requiring minimal changes to the model architecture. The paper demonstrates the effectiveness of this approach on a range of natural language understanding benchmarks: the general, task-agnostic model outperforms discriminatively trained models using architectures specifically crafted for each task, achieving state-of-the-art results on 9 of the 12 tasks studied. For example, it attains absolute improvements of 8.9% on commonsense reasoning (Stories Cloze Test), 5.7% on question answering (RACE), and 1.5% on textual entailment (MultiNLI).
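The "task-aware input transformations" above flatten each task's structured input into a single token sequence with delimiter tokens, so the pre-trained model's architecture barely changes across tasks. A minimal sketch; the delimiter token strings below are hypothetical placeholders, not the paper's actual vocabulary entries:

```python
def build_input(task: str, **fields) -> str:
    """Traversal-style input transformation (sketch): flatten a
    structured task input into one delimited token sequence."""
    START, DELIM, END = "<s>", "<$>", "<e>"  # hypothetical special tokens
    if task == "entailment":
        # Premise and hypothesis joined by a delimiter token.
        return f"{START} {fields['premise']} {DELIM} {fields['hypothesis']} {END}"
    if task == "similarity":
        # The paper processes both text orderings; one is shown here.
        return f"{START} {fields['text1']} {DELIM} {fields['text2']} {END}"
    raise ValueError(f"unknown task: {task}")

print(build_input("entailment",
                  premise="It is raining.",
                  hypothesis="The ground is wet."))
```

Because every task reduces to a single sequence, only a small linear output layer needs to be added on top of the pre-trained Transformer for each task.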


Link: https://www.aminer.cn/pub/5f8eab579e795e9e76f6f6a0/?f=toutiao

How to use AMiner AI?

Using AMiner AI is simple: open the AMiner homepage and enter the AMiner AI page from the navigation bar at the top of the page or the button in the lower-right corner.

On the AMiner AI page, you can chat about a single document or about the entire database (your personal library), upload local PDFs, or search for papers directly on AMiner.


AMiner AI entrance: AMiner - AI empowers scientific and technological intelligence mining-academic search-thesis search-thesis patent-literature tracking-scholar portrait
