
Build a ChatGPT that searches the right data source and answers questions accurately

Author: 吉祥庄钢铁侠

In this example, we will ask GPT to pick the right dataset to search whenever a user asks a question, and then answer the user's question from that data.


⚠️ You can simplify this example by using Embedbase Cloud instead of running Embedbase yourself.

If you do, you can skip ahead to the "Seed the dataset" section.
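
If you go the Embedbase Cloud route, the only changes needed in the scripts below are the base URL and an Authorization header. The values here are placeholders; take the real URL and API key from your Embedbase Cloud account:

# Placeholder Embedbase Cloud settings; substitute your own values.
EMBEDBASE_API_URL = "https://<your-embedbase-cloud-url>"
EMBEDBASE_API_KEY = "<your embedbase api key>"

# Then pass the key with every request, for example:
# requests.get(
#     f"{EMBEDBASE_API_URL}/v1/datasets",
#     headers={"Authorization": "Bearer " + EMBEDBASE_API_KEY},
# )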

Installation

Install the required dependencies in a virtual environment (note that the example code below uses the pre-1.0 openai Python client):

virtualenv env
source env/bin/activate
pip install embedbase pgvector psycopg2 openai uvicorn fire requests
           

Start Postgres as the Embedbase database

Run a Postgres instance to serve as the Embedbase database:

docker run -d --name pgvector -p 8080:8080 -p 5432:5432 \
 -e POSTGRES_DB=embedbase -e POSTGRES_PASSWORD=localdb \
 -v data:/var/lib/postgresql/data ankane/pgvector
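
Optionally, you can confirm the container accepts connections before moving on. This is a minimal check using psycopg2 (already in the pip install above) with the credentials from the docker run command; the default superuser of the Postgres image is postgres:

import psycopg2

# Connection details match the docker run flags above.
conn = psycopg2.connect(
    host="localhost",
    port=5432,
    dbname="embedbase",
    user="postgres",
    password="localdb",
)
with conn.cursor() as cur:
    cur.execute("SELECT 1")
    print(cur.fetchone())  # (1,) means Postgres is reachable
conn.close()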
           

Start Embedbase

Create a new file main.py with the following code:

import os
from embedbase import get_app
from embedbase.database.postgres_db import Postgres
from embedbase.embedding.openai import OpenAI
import uvicorn
OPENAI_API_KEY = os.environ.get("OPENAI_API_KEY")
app = (
    get_app()
    .use_embedder(OpenAI(OPENAI_API_KEY))
    .use_db(Postgres())
    .run()
)
if __name__ == "__main__":
    uvicorn.run("main:app", reload=True)
           

Start the Embedbase app with the following command:

python3 main.py
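
Before seeding, it can help to check that the API is reachable. uvicorn serves on port 8000 by default, and /v1/datasets is the same endpoint the app uses later in get_datasets:

import requests

# Should return HTTP 200; the dataset list stays empty until we seed data.
resp = requests.get("http://localhost:8000/v1/datasets")
print(resp.status_code)
print(resp.json())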
           

Seed the dataset

We need to add some data to Embedbase for ChatGPT to query. Create a new file ask.py with the following seeding code:

import json
import requests
import fire
 
# Set the Embedbase API URL
EMBEDBASE_API_URL = "http://localhost:8000"
# if using embedbase cloud, add your api key to the headers
# EMBEDBASE_API_KEY = "<your embedbase api key>"
 
def seed_dataset():
    animals = {
        "lion": {"weight": 190, "height": 1.2, "speed": 80},
        "elephant": {"weight": 5000, "height": 3.2, "speed": 40},
        "giraffe": {"weight": 800, "height": 5.5, "speed": 60},
        "zebra": {"weight": 350, "height": 1.5, "speed": 60},
        "rhinoceros": {"weight": 2300, "height": 1.8, "speed": 45},
        "crocodile": {"weight": 1000, "height": 4.5, "speed": 20},
        "hippopotamus": {"weight": 1500, "height": 1.5, "speed": 30},
        "cheetah": {"weight": 60, "height": 0.8, "speed": 110},
        "kangaroo": {"weight": 80, "height": 1.5, "speed": 56},
        "penguin": {"weight": 30, "height": 1.1, "speed": 10},
    }
    cars = [
        {"make": "Toyota", "model": "Camry", "year": 2022},
        {"make": "Honda", "model": "Civic", "year": 2021},
        {"make": "Ford", "model": "F-150", "year": 2023},
        {"make": "Tesla", "model": "Model S", "year": 2022},
        {"make": "Chevrolet", "model": "Corvette", "year": 2021},
        {"make": "Jeep", "model": "Wrangler", "year": 2022},
        {"make": "BMW", "model": "X5", "year": 2023},
        {"make": "Mercedes-Benz", "model": "S-Class", "year": 2022},
        {"make": "Audi", "model": "A4", "year": 2021},
        {"make": "Lamborghini", "model": "Aventador", "year": 2022},
    ]
 
    # clear the dataset just in case it already exists
    requests.get(f"{EMBEDBASE_API_URL}/v1/animals/clear",
        # if using embedbase cloud, add your api key to the headers
        # headers={
        #     "Authorization": "Bearer " + EMBEDBASE_API_KEY,
        # },
    )
    requests.get(f"{EMBEDBASE_API_URL}/v1/cars/clear",
        # if using embedbase cloud, add your api key to the headers
        # headers={
        #     "Authorization": "Bearer " + EMBEDBASE_API_KEY,
        # },
    )
 
    # seed the animals dataset
    requests.post(
        f"{EMBEDBASE_API_URL}/v1/animals",
        json={
            "documents": [
                # include the name alongside each animal's attributes
                {"data": json.dumps({"name": name, **attributes})}
                for name, attributes in animals.items()
            ]
        },
        # if using embedbase cloud, add your api key to the headers
        # headers={
        #     "Authorization": "Bearer " + EMBEDBASE_API_KEY,
        # },
    )
 
    # seed the cars dataset
    requests.post(
        f"{EMBEDBASE_API_URL}/v1/cars",
        json={"documents": [{"data": json.dumps(car)} for car in cars]},
        # if using embedbase cloud, add your api key to the headers
        # headers={
        #     "Authorization": "Bearer " + EMBEDBASE_API_KEY,
        # },
    )
 
if __name__ == "__main__":
    fire.Fire({
        "seed": seed_dataset,
    })

Seed both datasets by running:

python3 ask.py seed
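
To make sure the seed step worked, you can run a quick search against one of the new datasets (for example from a separate Python shell). This hits the same /v1/{dataset_id}/search endpoint the app will use below:

import requests

# Ask the animals dataset for documents related to speed.
resp = requests.post(
    "http://localhost:8000/v1/animals/search",
    json={"query": "fastest animal", "top_k": 3},
)
for hit in resp.json()["similarities"]:
    print(hit["data"])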
           

Search

Now we will create the main logic of our app. We will ask GPT to pick the right dataset to search whenever the user asks a question.

The flow goes as follows:

1. The user asks a question

2. We fetch the list of available datasets from `/datasets` and show it to GPT

3. GPT picks a dataset, and we query `/search` on that dataset with the question

4. GPT answers the question from the search results

Add the following helper functions to ask.py:

import re
import os
import json
import requests
import openai
import fire
 
# Set the Embedbase API URL
EMBEDBASE_API_URL = "http://localhost:8000"
 
 
def get_datasets():
    response = requests.get(
        f"{EMBEDBASE_API_URL}/v1/datasets",
        # if using embedbase cloud, add your api key to the headers
        # headers={
        #     "Authorization": "Bearer " + EMBEDBASE_API_KEY,
        # },
    )
    return [e["dataset_id"] for e in response.json()["datasets"]]
 
 
def search_dataset(dataset_id, query):
    payload = {"query": query, "top_k": 3}
    response = requests.post(
        f"{EMBEDBASE_API_URL}/v1/{dataset_id}/search", json=payload,
        # if using embedbase cloud, add your api key to the headers
        # headers={
        #     "Authorization": "Bearer " + EMBEDBASE_API_KEY,
        # },
    )
    return [e["data"] for e in response.json()["similarities"]]           

The functions above query the Embedbase API: get_datasets lists the available datasets, and search_dataset runs a semantic search against one of them.
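
As a quick illustration, you can try the helpers in a Python REPL with the functions above imported, the server running, and both datasets seeded; your exact hits will differ:

# Illustrative only; run in a REPL, not in ask.py itself.
print(get_datasets())                          # e.g. ['animals', 'cars']
print(search_dataset("cars", "a recent Ford pickup"))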

# ...
def ask_question(question, openai_model: str = "gpt-3.5-turbo"):
    datasets = get_datasets()
 
    # Prompt for GPT
    prompt = f"Given the following datasets:\n"
    for dataset in datasets:
        prompt += f"- {dataset}\n"
    prompt += f"\nChoose the best dataset to search and answer the following question:\n{question}"
 
    # Call GPT
    response = openai.ChatCompletion.create(
        model=openai_model,
        messages=[
            {
                "role": "system",
                "content": "You are a helpful assistant that selects a dataset to search for a given question. "
                "You always say ONLY the dataset name, nothing else. You are given a list of datasets and a question. "
                "For example, if the list of datasets is - plants\n- animals\n- cars\n- fruits\n- vegetables\n"
                "and the question is: What is the fastest animal?, you would say: [animals]",
            },
            {"role": "user", "content": prompt},
        ],
    )
 
    chosen_dataset = response.choices[0].message.content.strip()
    print(f"GPT chose the dataset: {chosen_dataset}")
 
    # extract the dataset name from the output of GPT
    # eg [animals] -> animals
    chosen_dataset = re.sub(r"\[|\]", "", chosen_dataset)
    search_results = search_dataset(chosen_dataset, question)
 
    # Call GPT again to answer the question based on the search results
    prompt = (
        f"Based on the following search results, answer the question: '{question}'\n"
    )
    for result in search_results:
        prompt += f"- {result}\n"
 
    response = openai.ChatCompletion.create(
        model=openai_model,
        messages=[
            {
                "role": "system",
                "content": "You are a helpful assistant that answers questions based on the provided search results.",
            },
            {"role": "user", "content": prompt},
        ],
    )
 
    answer = response.choices[0].message.content.strip()
 
    return answer
           

The code above calls GPT to answer the question. Now add a little glue logic so this function runs when the user asks a question:

def main(openai_key: str = None, openai_model: str = "gpt-3.5-turbo"):
    openai.api_key = openai_key or os.environ.get("OPENAI_API_KEY")
    question = input("Ask a question: ")
    answer = ask_question(question, openai_model)
    print(f"Answer: {answer}")

if __name__ == "__main__":
    fire.Fire({
        "ask": main,
        "seed": seed_dataset,
    })
           

You can now run the app and ask a question:

python3 ask.py ask --openai_key <your-openai-key>
  # feel free to add "--openai_model gpt-4" if you have access to it
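
The script reads the question from standard input, so try something the seeded data can answer, such as "What is the fastest animal?" or "Which car is the newest?". If OPENAI_API_KEY is already exported in your environment, you can also omit --openai_key, since main() falls back to os.environ.get("OPENAI_API_KEY").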