
Databricks Dolly: A large language model that follows instructions


Databricks' Dolly is an instruction-following large language model trained on the Databricks machine learning platform and licensed for commercial use. Based on EleutherAI's pythia-12b, dolly-v2-12b was fine-tuned by Databricks on databricks-dolly-15k, a corpus of roughly 15,000 instruction/response records generated by Databricks employees across the capability domains from the InstructGPT paper: brainstorming, classification, closed QA, generation, information extraction, open QA, and summarization. dolly-v2-12b is not a state-of-the-art model, but it exhibits surprisingly high-quality instruction-following behavior that is not characteristic of the foundation model on which it is based.

Databricks is committed to ensuring that every organization and individual can benefit from the transformative power of AI. The Dolly model family represents our first steps along this journey, and we are excited to share this technology with the world.

The model is available on Hugging Face as databricks/dolly-v2-12b.

Model overview

dolly-v2-12b is a 12 billion parameter causal language model created by Databricks, derived from EleutherAI's Pythia-12b and fine-tuned on a corpus of roughly 15K instruction/response records generated by Databricks employees, released under a permissive license (CC-BY-SA).

Known limitations

Performance limits

dolly-v2-12b is not a state-of-the-art generative language model and, although quantitative benchmarking is ongoing, it is not designed to perform competitively with more modern model architectures or models subject to larger pretraining corpora.

The Dolly model family is under active development, so any list of shortcomings is unlikely to be exhaustive, but we include known limitations and missteps here as a way to document and share our preliminary findings with the community. In particular, dolly-v2-12b struggles with: syntactically complex prompts, programming problems, mathematical operations, factual errors, dates and times, open-ended question answering, hallucination, enumerating lists of a specific length, stylistic mimicry, having a sense of humor, and so on. Moreover, we found that dolly-v2-12b lacks some capabilities present in the original model, such as well-formatted letter writing.

Dataset limits

Like all language models, dolly-v2-12b reflects the content and limitations of its training corpus.

  • The Pile: GPT-J's pre-training corpus contains content collected mostly from the public internet and, like most web-scale datasets, it contains content that many users would find objectionable. As such, the model is likely to reflect these shortcomings, potentially overtly when explicitly asked to produce objectionable content, and sometimes subtly, as in the case of biased or harmful implicit associations.
  • databricks-dolly-15k: The training data on which dolly-v2-12b is instruction-tuned represents natural-language instructions generated by Databricks employees between March and April 2023, and includes passages from Wikipedia as reference passages for instruction categories such as closed QA and summarization. To the best of our knowledge, it does not contain obscenity, intellectual property, or personally identifying information about non-public figures, but it may contain typos and factual errors. The dataset may also reflect biases found in Wikipedia. Finally, the dataset likely reflects the interests and semantic choices of Databricks employees, a demographic that is not representative of the global population at large.

Databricks is committed to ongoing research and development efforts to develop useful, honest, and harmless AI technologies that maximize the potential of all individuals and organizations.

Getting started with response generation

If you'd like to simply test the model without training it, the model is available on Hugging Face as databricks/dolly-v2-12b.

To use the model with the transformers library on a machine with A100 GPUs:

from transformers import pipeline
import torch

# trust_remote_code is needed because the model ships its own pipeline class;
# device_map="auto" spreads the weights across available GPUs.
instruct_pipeline = pipeline(model="databricks/dolly-v2-12b", torch_dtype=torch.bfloat16, trust_remote_code=True, device_map="auto")

You can then use the pipeline to answer the instructions:

instruct_pipeline("Explain to me the difference between nuclear fission and fusion.")
           

Generating on other instances

A100 instance types aren't available in all cloud regions or can be difficult to provision. Inference can be done on other GPU instance types.

A10 GPU

The 6.9B and 2.8B parameter models should work as-is.

To generate using the 12B parameter model on A10s (e.g. g5.4xlarge, 1 x A10 24GB), you need to load and run generation with 8-bit weights, which affects the results slightly:

  • Also install bitsandbytes
  • Add model_kwargs={'load_in_8bit': True} to the pipeline() command shown above, as in the sketch after this list
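
Putting those together, a minimal sketch of 8-bit generation (assuming bitsandbytes and accelerate are installed alongside transformers):

from transformers import pipeline
import torch

# Same pipeline as above, but with weights quantized to 8-bit at load time
# so the 12B model fits in 24GB of A10 memory.
instruct_pipeline = pipeline(model="databricks/dolly-v2-12b", torch_dtype=torch.bfloat16, trust_remote_code=True, device_map="auto", model_kwargs={"load_in_8bit": True})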

V100 GPU

When using a V100 (e.g. p3.2xlarge, 1 x V100 16GB, or NC6s_v3), in all cases pass torch_dtype=torch.float16 to pipeline().

Otherwise, follow the steps above. The 12B parameter model may not function well in 8-bit on V100s.
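
A minimal sketch of the float16 variant (V100s lack native bfloat16 support, hence the dtype change):

from transformers import pipeline
import torch

# Same pipeline as above, but in float16 for V100s
instruct_pipeline = pipeline(model="databricks/dolly-v2-12b", torch_dtype=torch.float16, trust_remote_code=True, device_map="auto")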

Getting started with training

  • Add the dolly repo to Databricks (under Repos, click Add Repo, enter https://github.com/databrickslabs/dolly.git, then click Create Repo).
  • Start a 13.x ML (includes Apache Spark 3.4.0, GPU, Scala 2.12) single-node cluster with a node type having 8 A100 GPUs (e.g. Standard_ND96asr_v4 or p4d.24xlarge). Note that these instance types may not be available in all regions, or may be difficult to provision. In Databricks, note that you must select the GPU runtime first, and deselect Use Photon, for these instance types to appear (where supported).
  • Open the train_dolly notebook in the dolly repo (which is the train_dolly.py file in the Github dolly repo), attach it to your GPU cluster, and run all cells. When training finishes, the notebook saves the model under /dbfs/dolly_training (see the loading sketch after this list).
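
Once training completes, one way to load the saved model back is with standard transformers calls. A minimal sketch, assuming a checkpoint directory under /dbfs/dolly_training (the run subdirectory name here is a hypothetical placeholder; use whatever your training run actually wrote):

from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

# Hypothetical path: substitute the run directory written by your training job
model_path = "/dbfs/dolly_training/<your-run-directory>"

tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForCausalLM.from_pretrained(model_path, device_map="auto", torch_dtype=torch.bfloat16)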

Training on other instances

A100 instance types aren't available in all cloud regions or can be difficult to provision. For smaller Dolly model sizes, you can train on other GPU instance types with minor modifications to reduce memory usage. These modifications are not optimal, but they are easy to implement.

Simply select your GPU family type from the gpu_family widget, enter the number of GPUs available in the num_gpus widget, and then run the rest of the code. A number of different options will be set for you, to train the model for one of the following GPU types:

  • A100 (default)
  • A10
  • V100

Details of the different configurations are as follows.

A100 GPU

The A100 GPU is preferred for training all model sizes, and is the only GPU that can train the 12B parameter model in a reasonable amount of time. As such, this is the default configuration, set in the a100_config.json deepspeed configuration file.

A10 GPU

Training a 12B parameter model on A10 is not recommended.

To train on A10 instances (e.g. g5.24xlarge, 4 x A10 24GB; Standard_NV72ads_A10_v5, 2 x A10), simply select a10 from the gpu_family widget, enter the number of GPUs available in the num_gpus widget, and then run the rest of the code. This will use the a10_config.json deepspeed configuration file, which makes the following changes:

  • It sets per-device-train-batch-size and per-device-eval-batch-size to 3 in the train_dolly.py invocation of deepspeed
  • Within the "zero_optimization" section of the deepspeed configuration, it adds "offload_optimizer": { "device": "cpu", "pin_memory": true } (shown in context below)

V100 GPU

To run on V100 instances with 32GB of GPU memory (e.g. p3dn.24xlarge or Standard_ND40rs_v2), simply select v100 from the gpu_family widget, enter the number of GPUs available in the num_gpus widget, and then run the rest of the code. This will use the v100_config.json deepspeed configuration file, which makes the following changes:

  • It makes the changes described above for the A10
  • It enables the fp16 floating-point format
  • It sets per-device-train-batch-size and per-device-eval-batch-size to 3

You may be able to slightly increase the batch size on 32GB instances, compared with what works on the 24GB A10s above.

Run unit tests locally

# Pin Python 3.8 via pyenv, create a virtualenv, install dev deps, run the tests
pyenv local 3.8.13
python -m venv .venv
. .venv/bin/activate
pip install -r requirements_dev.txt
./run_pytest.sh
           

Citation

@online{DatabricksBlog2023DollyV2,
    author    = {Mike Conover and Matt Hayes and Ankit Mathur and Jianwei Xie and Jun Wan and Sam Shah and Ali Ghodsi and Patrick Wendell and Matei Zaharia and Reynold Xin},
    title     = {Free Dolly: Introducing the World's First Truly Open Instruction-Tuned LLM},
    year      = {2023},
    url       = {https://www.databricks.com/blog/2023/04/12/dolly-first-open-commercially-viable-instruction-tuned-llm},
    urldate   = {2023-06-30}
}           

Project Address:

https://github.com/databrickslabs/dolly
