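Many modern data systems rely on structured data, such as a Postgres database or a Snowflake data warehouse. LlamaIndex provides a number of LLM-powered advanced features, both for creating structured data from unstructured data and for analyzing that structured data through augmented text-to-SQL capabilities.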


This guide walks through each of these features step by step. Specifically, we cover the following topics:

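Inferring structured data points: converting unstructured data into structured data.
Text-to-SQL (basic): how to query a set of tables using natural language.
Injecting context: how to inject context for each table into the text-to-SQL prompt. The context can be added manually or derived from unstructured documents.
Storing table context in an index: by default, we insert the context directly into the prompt. If the context is large, this is sometimes not feasible. Here we show how you can use a LlamaIndex data structure to hold the table context instead!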

We will walk through a sample database containing city, population, and country information.

1. Setup

First, we use SQLAlchemy to set up a simple in-memory SQLite database:

from sqlalchemy import create_engine, MetaData, Table, Column, String, Integer, select, column

engine = create_engine("sqlite:///:memory:")
# Leave MetaData unbound and pass the engine to create_all() explicitly;
# MetaData(bind=...) was removed in SQLAlchemy 2.0.
metadata_obj = MetaData()

Next, we create the city_stats table:

# create city SQL table
table_name = "city_stats"
city_stats_table = Table(
    table_name,
    metadata_obj,
    Column("city_name", String(16), primary_key=True),
    Column("population", Integer),
    Column("country", String(16), nullable=False),
)
metadata_obj.create_all(engine)

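Now it is time to insert some data points!

If you would like to populate this table by inferring structured data points from unstructured data, see the section below on inferring structured data points. Otherwise, you can populate the table directly: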

from sqlalchemy import insert

rows = [
    {"city_name": "Toronto", "population": 2731571, "country": "Canada"},
    {"city_name": "Tokyo", "population": 13929286, "country": "Japan"},
    {"city_name": "Berlin", "population": 3645000, "country": "Germany"},
]
for row in rows:
    stmt = insert(city_stats_table).values(**row)
    # engine.begin() commits the transaction when the block exits
    with engine.begin() as connection:
        connection.execute(stmt)
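
For readers who want to see what the setup above amounts to in plain SQL, here is a minimal sketch using Python's standard-library sqlite3 module instead of SQLAlchemy. The table and column names mirror the ones above; the helper name build_city_stats and the sample rows are illustrative assumptions, not part of LlamaIndex.

```python
import sqlite3

# Illustrative sample rows (city_name, population, country).
ROWS = [
    ("Toronto", 2731571, "Canada"),
    ("Tokyo", 13929286, "Japan"),
    ("Berlin", 3645000, "Germany"),
]


def build_city_stats(rows=ROWS):
    """Create an in-memory city_stats table and insert the given rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE city_stats ("
        "city_name VARCHAR(16) PRIMARY KEY, "
        "population INTEGER, "
        "country VARCHAR(16) NOT NULL)"
    )
    # Parameterized inserts keep values out of the SQL string.
    conn.executemany("INSERT INTO city_stats VALUES (?, ?, ?)", rows)
    conn.commit()
    return conn


conn = build_city_stats()
for row in conn.execute(
    "SELECT city_name, population, country FROM city_stats ORDER BY city_name"
):
    print(row)
```

Reading the rows back right after inserting them is a cheap way to check that the inserts were actually committed before handing the database to LlamaIndex.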

Finally, we wrap the SQLAlchemy engine in our SQLDatabase wrapper; this allows the database to be used within LlamaIndex:

from llama_index import SQLDatabase

sql_database = SQLDatabase(engine, include_tables=["city_stats"])

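If the database is already populated with data, we can instantiate the SQL index with an empty list of documents. Otherwise, see the section below.

from llama_index import GPTSQLStructStoreIndex

index = GPTSQLStructStoreIndex(
    [],
    sql_database=sql_database,
    table_name="city_stats",
)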

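2. Inferring Structured Data Points

LlamaIndex can convert unstructured data points into structured data. In this section, we show how to populate the city_stats table by ingesting the Wikipedia article for each city.

First, we use the Wikipedia reader from LlamaHub to load a few pages on the relevant cities.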

from llama_index import download_loader

WikipediaReader = download_loader("WikipediaReader")
wiki_docs = WikipediaReader().load_data(pages=['Toronto', 'Berlin', 'Tokyo'])

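When we build the SQL index, we can pass these documents as the first argument; they will be converted into structured data points and inserted into the database: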

from llama_index import GPTSQLStructStoreIndex, SQLDatabase

sql_database = SQLDatabase(engine, include_tables=["city_stats"])
# NOTE: the table_name specified here is the table that you
# want to extract into from unstructured documents.
index = GPTSQLStructStoreIndex(
    wiki_docs, 
    sql_database=sql_database, 
    table_name="city_stats",
)

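You can view the current table to verify that the data points have been inserted!

# view current table
# select() takes columns positionally; the list form was removed in SQLAlchemy 2.0
stmt = select(
    column("city_name"), column("population"), column("country")
).select_from(city_stats_table)

with engine.connect() as connection:
    results = connection.execute(stmt).fetchall()
    print(results)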
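
To make the idea of "turning unstructured text into a structured row" concrete, here is a toy sketch. It is not how LlamaIndex does the extraction (LlamaIndex prompts an LLM); a regular expression stands in for the model, and the function name extract_city_row and the sample sentence are hypothetical.

```python
import re
from typing import Optional

# Toy pattern standing in for the LLM: it only understands one sentence shape.
CITY_PATTERN = re.compile(
    r"(?P<city>\w+) is a city in (?P<country>[\w ]+?) "
    r"with a population of (?P<population>[\d,]+)"
)


def extract_city_row(text: str) -> Optional[dict]:
    """Return a city_stats-shaped dict, or None if the text does not match."""
    match = CITY_PATTERN.search(text)
    if match is None:
        return None
    return {
        "city_name": match.group("city"),
        # Strip thousands separators before converting to an integer.
        "population": int(match.group("population").replace(",", "")),
        "country": match.group("country"),
    }


snippet = "Toronto is a city in Canada with a population of 2,731,571."
print(extract_city_row(snippet))
```

The output has the same keys as the city_stats columns, which is the shape the index needs before it can insert a row. A real article rarely follows a single sentence pattern, which is why an LLM is used for this step.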

3. Text-to-SQL (Basic)

LlamaIndex offers text-to-SQL capabilities at both a basic and a more advanced level. In this section, we show how to use them at the basic level.

A simple example is shown here:

# set Logging to DEBUG for more detailed outputs
response = index.query("Which city has the highest population?", mode="default")
print(response)

You can access the underlying derived SQL query through response.extra_info['sql_query']. It should look something like this:

SELECT city_name, population
FROM city_stats
ORDER BY population DESC
LIMIT 1
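
To sanity-check a derived query like the one above without calling an LLM, you can run it directly against a small database. This is a minimal sketch with the standard-library sqlite3 module; the sample rows are illustrative assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE city_stats ("
    "city_name TEXT PRIMARY KEY, population INTEGER, country TEXT NOT NULL)"
)
conn.executemany(
    "INSERT INTO city_stats VALUES (?, ?, ?)",
    [
        ("Toronto", 2731571, "Canada"),
        ("Tokyo", 13929286, "Japan"),
        ("Berlin", 3645000, "Germany"),
    ],
)

# The query text matches the derived SQL shown above.
sql_query = """
SELECT city_name, population
FROM city_stats
ORDER BY population DESC
LIMIT 1
"""
result = conn.execute(sql_query).fetchall()
print(result)  # [('Tokyo', 13929286)]
```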

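4. Injecting Context

By default, the text-to-SQL prompt injects only the table schema information into the prompt. Often, however, you will want to add your own context as well. This section shows how to add context, either manually or by extracting it from documents.

We provide a context builder class to help manage context for your SQL tables: SQLContextContainerBuilder. This class takes a SQLDatabase object and a few other optional parameters, and builds a SQLContextContainer object that you can then pass to the index at construction time and at query time.

You can add context to the context builder manually. The code snippet below shows how: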

from llama_index.indices.struct_store import SQLContextContainerBuilder

# manually set text
# (note the trailing space after 'bar' so the two string pieces do not run together)
city_stats_text = (
    "This table gives information regarding the population and country of a given city.\n"
    "The user will query with codewords, where 'foo' corresponds to population and 'bar' "
    "corresponds to city."
)
table_context_dict = {"city_stats": city_stats_text}
context_builder = SQLContextContainerBuilder(sql_database, context_dict=table_context_dict)
context_container = context_builder.build_context_container()

# building the index
index = GPTSQLStructStoreIndex(
    wiki_docs, 
    sql_database=sql_database, 
    table_name="city_stats",
    sql_context_container=context_container
)
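
Conceptually, injecting context means the table description travels alongside the schema in the prompt sent to the LLM. The sketch below illustrates that idea only; the template wording and the function name build_text_to_sql_prompt are hypothetical and are not LlamaIndex's actual prompt.

```python
def build_text_to_sql_prompt(schema: str, question: str, context: str = "") -> str:
    """Assemble a text-to-SQL prompt from a schema, optional context, and a question."""
    parts = [
        "Given the table schema below, write a SQL query that answers the question.",
        f"Schema:\n{schema}",
    ]
    # Only include the context section when some context was supplied.
    if context:
        parts.append(f"Context:\n{context}")
    parts.append(f"Question: {question}")
    parts.append("SQL:")
    return "\n\n".join(parts)


schema = "city_stats(city_name VARCHAR(16), population INTEGER, country VARCHAR(16))"
context = "'foo' corresponds to population and 'bar' corresponds to city."
prompt = build_text_to_sql_prompt(schema, "Which bar has the highest foo?", context)
print(prompt)
```

Without the context section, a model would have no way to map the codewords 'foo' and 'bar' onto real columns, which is the situation the manual context above is meant to address.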

You can also choose to extract context from a set of unstructured documents. To do so, call SQLContextContainerBuilder.from_documents. We use the TableContextPrompt and the RefineTableContextPrompt (see the reference documentation).

from llama_index import Document

# this is a dummy document that we will extract context from
# in SQLContextContainerBuilder
city_stats_text = (
    "This table gives information regarding the population and country of a given city.\n"
)
context_documents_dict = {"city_stats": [Document(city_stats_text)]}
context_builder = SQLContextContainerBuilder.from_documents(
    context_documents_dict,
    sql_database
)
context_container = context_builder.build_context_container()

# building the index
index = GPTSQLStructStoreIndex(
    wiki_docs, 
    sql_database=sql_database, 
    table_name="city_stats",
    sql_context_container=context_container,
)

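5. Storing Table Context in an Index

A database collection can have many tables, and if each table has many columns plus descriptions associated with them, the total context can become quite large.

Fortunately, you can choose to store this table context in a LlamaIndex data structure! Then, when the SQL index is queried, we can use this "side" index to retrieve the right context to feed into the text-to-SQL prompt.

Here we use the derive_index_from_context function in SQLContextContainerBuilder to create a new index. You have the flexibility to choose which index class to specify and which arguments to pass in. We then use a helper method called query_index_for_context, a simple wrapper around the index.query call that wraps a query template and stores the context on the generated context container.

You can then build the context container and pass it to the index at query time!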

from llama_index import GPTSQLStructStoreIndex, SQLDatabase, GPTSimpleVectorIndex
from llama_index.indices.struct_store import SQLContextContainerBuilder

sql_database = SQLDatabase(engine)
# build a vector index from the table schema information
context_builder = SQLContextContainerBuilder(sql_database)
table_schema_index = context_builder.derive_index_from_context(
    GPTSimpleVectorIndex,
    store_index=True
)

query_str = "Which city has the highest population?"

# query the table schema index using the helper method
# to retrieve table context
context_builder.query_index_for_context(
    table_schema_index,
    query_str,
    store_context_str=True
)
context_container = context_builder.build_context_container()

# query the SQL index with the table context
response = index.query(query_str, sql_context_container=context_container)
print(response)
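
The retrieval step can be pictured as "pick the table descriptions most relevant to the question, and put only those into the prompt". The sketch below shows that idea with a toy word-overlap score in place of the vector index used above; the function names and the airport_stats table are hypothetical.

```python
import re


def tokenize(text: str) -> set:
    """Lowercase the text and split it into a set of alphabetic words."""
    return set(re.findall(r"[a-z]+", text.lower()))


def retrieve_table_context(query: str, table_contexts: dict, top_k: int = 1) -> dict:
    """Return the top_k table contexts sharing the most words with the query."""
    query_tokens = tokenize(query)
    ranked = sorted(
        table_contexts.items(),
        key=lambda item: len(query_tokens & tokenize(item[1])),
        reverse=True,
    )
    return dict(ranked[:top_k])


table_contexts = {
    "city_stats": "Schema: city_name, population, country. Population and country of each city.",
    "airport_stats": "Schema: airport_code, runway_count. Runway details for each airport.",
}
selected = retrieve_table_context("Which city has the highest population?", table_contexts)
print(list(selected))  # ['city_stats']
```

Only the selected context would then be passed into the text-to-SQL prompt, which keeps the prompt small even when the database has many tables. A vector index plays the same role but ranks by embedding similarity rather than by shared words.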

Original article (ChatGPT+SQL: Structured Data Augmentation): http://www.bimant.com/blog/chatgpt-structural-data-augment/