LangChain 核心模块:Data Conneciton - Document Loaders

使用文档加载器从源中加载数据作为文档。一个文档是一段文字和相关的元数据。

如,有用于加载简单 .txt 文件的文档加载器,用于加载 ArXiv 论文,或者任何网页的文本内容

Document 类

这段代码定义了一个名为Document的类,允许用户与文档的内容进行交互,可以查看文档的段落、摘要,以及使用查找功能来查询文档中的特定字符串。

# 基于BaseModel定义的文档类。
class Document(BaseModel):
"""接口,用于与文档进行交互。""" # 文档的主要内容。
page_content: str
# 用于查找的字符串。
lookup_str: str = ""
# 查找的索引,初次默认为0。
lookup_index = 0
# 用于存储任何与文档相关的元数据。
metadata: dict = Field(default_factory=dict) @property
def paragraphs(self) -> List[str]:
"""页面的段落列表。"""
# 使用"\n\n"将内容分割为多个段落。
return self.page_content.split("\n\n") @property
def summary(self) -> str:
"""页面的摘要(即第一段)。"""
# 返回第一个段落作为摘要。
return self.paragraphs[0] # 这个方法模仿命令行中的查找功能。
def lookup(self, string: str) -> str:
"""在页面中查找一个词,模仿cmd-F功能。"""
# 如果输入的字符串与当前的查找字符串不同,则重置查找字符串和索引。
if string.lower() != self.lookup_str:
self.lookup_str = string.lower()
self.lookup_index = 0
else:
# 如果输入的字符串与当前的查找字符串相同,则查找索引加1。
self.lookup_index += 1
# 找出所有包含查找字符串的段落。
lookups = [p for p in self.paragraphs if self.lookup_str in p.lower()]
# 根据查找结果返回相应的信息。
if len(lookups) == 0:
return "No Results"
elif self.lookup_index >= len(lookups):
return "No More Results"
else:
result_prefix = f"(Result {self.lookup_index + 1}/{len(lookups)})"
return f"{result_prefix} {lookups[self.lookup_index]}"

BaseLoader 类定义

BaseLoader 类定义了如何从不同的数据源加载文档,并提供了一个可选的方法来分割加载的文档。使用这个类作为基础,开发者可以为特定的数据源创建自定义的加载器,并确保所有这些加载器都提供了加载数据的方法。load_and_split方法还提供了一个额外的功能,可以根据需要将加载的文档分割为更小的块。

# 基础加载器类。
class BaseLoader(ABC):
"""基础加载器类定义。""" # 抽象方法,所有子类必须实现此方法。
@abstractmethod
def load(self) -> List[Document]:
"""加载数据并将其转换为文档对象。""" # 该方法可以加载文档,并将其分割为更小的块。
def load_and_split(
self, text_splitter: Optional[TextSplitter] = None
) -> List[Document]:
"""加载文档并分割成块。"""
# 如果没有提供特定的文本分割器,使用默认的字符文本分割器。
if text_splitter is None:
_text_splitter: TextSplitter = RecursiveCharacterTextSplitter()
else:
_text_splitter = text_splitter
# 先加载文档。
docs = self.load()
# 然后使用_text_splitter来分割每一个文档。
return _text_splitter.split_documents(docs)

使用 TextLoader 加载 Txt 文件

from langchain.document_loaders import TextLoader

docs = TextLoader('../tests/state_of_the_union.txt',encoding='utf-8').load()

使用 ArxivLoader 加载 ArXiv 论文

ArxivLoader 类定义

ArxivLoader 类专门用于从Arxiv平台获取文档。用户提供一个搜索查询,然后加载器与Arxiv API交互,以检索与该查询相关的文档列表。这些文档然后以标准的Document格式返回。

# 针对Arxiv平台的加载器类。
class ArxivLoader(BaseLoader):
"""从`Arxiv`加载基于搜索查询的文档。 此加载器负责将Arxiv的原始PDF文档转换为纯文本格式,以便于处理。
""" # 初始化方法。
def __init__(
self,
query: str,
load_max_docs: Optional[int] = 100,
load_all_available_meta: Optional[bool] = False,
):
self.query = query
"""传递给Arxiv API进行搜索的特定查询或关键字。"""
self.load_max_docs = load_max_docs
"""从搜索中检索文档的上限。"""
self.load_all_available_meta = load_all_available_meta
"""决定是否加载与文档关联的所有元数据的标志。""" # 基于查询获取文档的加载方法。
def load(self) -> List[Document]:
arxiv_client = ArxivAPIWrapper(
load_max_docs=self.load_max_docs,
load_all_available_meta=self.load_all_available_meta,
)
docs = arxiv_client.search(self.query)
return docs

ArxivLoader有以下参数:

  • query:用于在ArXiv中查找文档的文本
  • load_max_docs:默认值为100。使用它来限制下载的文档数量。下载所有100个文档需要时间,因此在实验中请使用较小的数字。
  • load_all_available_meta:默认值为False。默认情况下只下载最重要的字段:发布日期(文档发布/最后更新日期)、标题、作者、摘要。如果设置为True,则还会下载其他字段。

GPT-3 论文(Language Models are Few-Shot Learners) 为例,展示如何使用 ArxivLoader

GPT-3 论文的 Arxiv 链接:https://arxiv.org/abs/2005.14165

from langchain.document_loaders import ArxivLoader
query = "2005.14165"
docs = ArxivLoader(query=query, load_max_docs=5).load()
len(docs)
docs[0].metadata # meta-information of the Document
{'Published': '2020-07-22',
'Title': 'Language Models are Few-Shot Learners',
'Authors': 'Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeffrey Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, Dario Amodei',
'Summary': "Recent work has demonstrated substantial gains on many NLP tasks and\nbenchmarks by pre-training on a large corpus of text followed by fine-tuning on\na specific task. While typically task-agnostic in architecture, this method\nstill requires task-specific fine-tuning datasets of thousands or tens of\nthousands of examples. By contrast, humans can generally perform a new language\ntask from only a few examples or from simple instructions - something which\ncurrent NLP systems still largely struggle to do. Here we show that scaling up\nlanguage models greatly improves task-agnostic, few-shot performance, sometimes\neven reaching competitiveness with prior state-of-the-art fine-tuning\napproaches. Specifically, we train GPT-3, an autoregressive language model with\n175 billion parameters, 10x more than any previous non-sparse language model,\nand test its performance in the few-shot setting. For all tasks, GPT-3 is\napplied without any gradient updates or fine-tuning, with tasks and few-shot\ndemonstrations specified purely via text interaction with the model. GPT-3\nachieves strong performance on many NLP datasets, including translation,\nquestion-answering, and cloze tasks, as well as several tasks that require\non-the-fly reasoning or domain adaptation, such as unscrambling words, using a\nnovel word in a sentence, or performing 3-digit arithmetic. At the same time,\nwe also identify some datasets where GPT-3's few-shot learning still struggles,\nas well as some datasets where GPT-3 faces methodological issues related to\ntraining on large web corpora. Finally, we find that GPT-3 can generate samples\nof news articles which human evaluators have difficulty distinguishing from\narticles written by humans. We discuss broader societal impacts of this finding\nand of GPT-3 in general."}

使用 UnstructuredURLLoader 加载网页内容

使用非结构化分区函数(Unstructured)来检测MIME类型并将文件路由到适当的分区器(partitioner)。

支持两种模式运行加载程序:"single"和"elements"。如果使用"single"模式,文档将作为单个langchain Document对象返回。如果使用"elements"模式,非结构化库将把文档拆分成标题和叙述文本等元素。您可以在mode后面传入其他非结构化kwargs以应用不同的非结构化设置。

UnstructuredURLLoader 主要参数:

  • urls:待加载网页 URL 列表
  • continue_on_failure:默认True,某个URL加载失败后,是否继续
  • mode:默认single

以 ReAct 网页为例(https://react-lm.github.io/) 展示使用

from langchain.document_loaders import UnstructuredURLLoader
urls = [
"https://react-lm.github.io/",
]
loader = UnstructuredURLLoader(urls=urls)
data = loader.load()
data[0].metadata
print(data[0].page_content)

输出:

{'source': 'https://react-lm.github.io/'}
ReAct: Synergizing Reasoning and Acting in Language Models Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik Narasimhan, Yuan Cao [Paper] [Code] [Blogpost] [BibTex] Language models are getting better at reasoning (e.g. chain-of-thought prompting) and acting (e.g. WebGPT, SayCan, ACT-1), but these two directions have remained separate.
ReAct asks, what if these two fundamental capabilities are combined? Abstract While large language models (LLMs) have demonstrated impressive capabilities across tasks in language understanding and interactive decision making, their abilities for reasoning (e.g. chain-of-thought prompting) and acting (e.g. action plan generation) have primarily been studied as separate topics. In this paper, we explore the use of LLMs to generate both reasoning traces and task-specific actions in an interleaved manner, allowing for greater synergy between the two: reasoning traces help the model induce, track, and update action plans as well as handle exceptions, while actions allow it to interface with external sources, such as knowledge bases or environments, to gather additional information. We apply our approach, named ReAct, to a diverse set of language and decision making tasks and demonstrate its effectiveness over state-of-the-art baselines, as well as improved human interpretability and trustworthiness over methods without reasoning or acting components. Concretely, on question answering (HotpotQA) and fact verification (Fever), ReAct overcomes issues of hallucination and error propagation prevalent in chain-of-thought reasoning by interacting with a simple Wikipedia API, and generates human-like task-solving trajectories that are more interpretable than baselines without reasoning traces. On two interactive decision making benchmarks (ALFWorld and WebShop), ReAct outperforms imitation and reinforcement learning methods by an absolute success rate of 34% and 10% respectively, while being prompted with only one or two in-context examples. ReAct Prompting A ReAct prompt consists of few-shot task-solving trajectories, with human-written text reasoning traces and actions, as well as environment observations in response to actions (see examples in paper appendix!)
ReAct prompting is intuitive and flexible to design, and achieves state-of-the-art few-shot performances across a variety of tasks, from question answering to online shopping! HotpotQA Example The reason-only baseline (i.e. chain-of-thought) suffers from misinformation (in red) as it is not grounded to external environments to obtain and update knowledge, and has to rely on limited internal knowledge.
The act-only baseline suffers from the lack of reasoning, unable to synthesize the final answer despite having the same actions and observation as ReAct in this case.
In contrast, ReAct solves the task with a interpretable and factual trajectory. ALFWorld Example For decision making tasks, we design human trajectories with sparse reasoning traces, letting the LM decide when to think vs. act.
ReAct isn't perfect --- below is a failure example on ALFWorld. However, ReAct format allows easy human inspection and behavior correction by changing a couple of model thoughts, an exciting novel approach to human alignment! ReAct Finetuning: Initial Results Prompting has limited context window and learning support.
Initial finetuning results on HotpotQA using ReAct prompting trajectories suggest:
(1) ReAct is the best fintuning format across model sizes;
(2) ReAct finetuned smaller models outperform prompted larger models!
loader = UnstructuredURLLoader(urls=urls, mode="elements")
new_data = loader.load()
new_data[0].page_content

输出:

'ReAct: Synergizing Reasoning and Acting in Language Models'

LangChain基础篇 (04)的更多相关文章

  1. iOS系列 基础篇 04 探究视图生命周期

    iOS系列 基础篇 04 探究视图生命周期 视图是应用的一个重要的组成部份,功能的实现与其息息相关,而视图控制器控制着视图,其重要性在整个应用中不言而喻. 以视图的四种状态为基础,我们来系统了解一下视 ...

  2. Java多线程系列--“基础篇”04之 synchronized关键字

    概要 本章,会对synchronized关键字进行介绍.涉及到的内容包括:1. synchronized原理2. synchronized基本规则3. synchronized方法 和 synchro ...

  3. python 基础篇 04(列表 元组 常规操作)

    本节主要内容:1. 列表2. 列表的增删改查3. 列表的嵌套4. 元组和元组嵌套5. range 一. 列表1.1 列表的介绍列表是python的基础数据类型之一 ,其他编程语言也有类似的数据类型. ...

  4. MySQL基础篇(04):存储过程和视图,用法和特性详解

    本文源码:GitHub·点这里 || GitEE·点这里 一.存储过程 1.概念简介 存储程序是被存储在服务器中的组合SQL语句,经编译创建并保存在数据库中,用户可通过存储过程的名字调用执行.存储过程 ...

  5. Java基础篇(04):日期与时间API用法详解

    本文源码:GitHub·点这里 || GitEE·点这里 一.时间和日期 在系统开发中,日期与时间作为重要的业务因素,起到十分关键的作用,例如同一个时间节点下的数据生成,基于时间范围的各种数据统计和分 ...

  6. Java多线程系列 基础篇04 线程中断

    1. 中断线程 中断可以理解为线程的一个标志位属性,它表示一个运行中的线程是否被其他线程进行了中断操作,其他线程通过调用该线程的interrupt()方法对其进行中断操作,线程通过检查自身是否被中断来 ...

  7. Scala基础篇-04 try表达式

    1.try表达式 定义 try{} catch{} finally{} //例子 try{ Integer.parseInt("dog") }catch { }finally { ...

  8. mysql学习之基础篇04

    五种基本子句查询 查询是mysql中最重要的一环,我们今天就来说一下select的五种子句中的where条件查询: 首先我们先建立一张商品表:goods 由于商品数目太多,我就不一一列举了. 在这里我 ...

  9. 【Spark机器学习速成宝典】基础篇04数据类型(Python版)

    目录 Vector LabeledPoint Matrix 使用C4.5算法生成决策树 使用CART算法生成决策树 预剪枝和后剪枝 应用:遇到连续与缺失值怎么办? 多变量决策树 Python代码(sk ...

  10. [ASP.NET Core开发实战]基础篇04 主机

    主机定义 主机是封闭应用资源的对象. 设置主机 主机通常由 Program 类中的代码配置.生成和运行. HTTP项目(ASP.NET Core项目)创建泛型主机: public class Prog ...

随机推荐

  1. 数据库研发人员必看的MySQL 8.0新特性

    本文汇总了MySQL8.0 面向开发的新特性,总共有12个新特性,有想快速了解8.0新特性的朋友,可以看一下哈文章目录:1.公用表达式支持-CTE2.窗口函数3.表达式作为默认值:4.CHECK支持5 ...

  2. elasticsearch之python操作(非原生)

    elasticsearch 模块 Elasticsearch低级客户端.提供从Python到ES REST端点的直接映射. 连接集群节点 指定连接 es = Elasticsearch( ['172. ...

  3. nvm node版本管理

    1.说明 NVM是NODE JS的版本管理工具,可以安装nodejs切换nodejs版本. 2.安装NVM https://github.com/coreybutler/nvm-windows/rel ...

  4. 中电金信:GienTech动态| 获奖、合作、与伙伴共谋数字化转型…

    ​ ​ -- -- GienTech动态 -- -- 中电金信携"源启"亮相第十二届中国电子信息博览会 ​ 4月11日,为期三天的"第十二届中国电子信息博览会" ...

  5. 【Python】【MySQL】Python将JSON数据以文本形式存放到MySQL的Text类型字段中

    1.起因 在做一个自动打卡的玩意.登录会得到那个平台一系列的信息.我又不想专门修改.增加数据库字段来存放,所有打算直接将返回的JSON数据保存到一个MySQL字段中. 内容肯定不能直接放,考虑下比如数 ...

  6. shell 获取进程号

    # Shell最后运行的后台PID(后台运行的最后一个进程的进程ID号) $! # Shell本身的PID(即脚本运行的当前进程ID号 $$

  7. JVM实战—2.JVM内存设置与对象分配流转

    大纲 1.JVM内存划分的原理细节 2.对象在JVM内存中如何分配如何流转 3.部署线上系统时如何设置JVM内存大小 4.如何设置JVM堆内存大小 5.如何设置JVM栈内存与永久代大小 6.问题汇总 ...

  8. Qt/C++音视频开发81-采集本地麦克风/本地摄像头带麦克风/桌面采集和麦克风/本地设备和桌面推流

    一.前言 随着直播的兴起,采集本地摄像头和麦克风进行直播推流,也是一个刚需,最简单的做法是直接用ffmpeg命令行采集并推流,这种方式简单粗暴,但是不能实时预览画面,而且不方便加上一些特殊要求.之前就 ...

  9. Qt/C++音视频开发50-不同ffmpeg版本之间的差异处理

    一.前言 ffmpeg的版本众多,从2010年开始计算的项目的话,基本上还在使用的有ffmpeg2/3/4/5/6,最近几年版本彪的比较厉害,直接4/5/6,大版本之间接口有一些变化,特别是一些废弃接 ...

  10. Qt编写物联网管理平台45-采集数据转发

    一.前言 本系统严格意义上说是一个直连硬件的客户端软件,下面接的modbus协议的设备直接通过网络或者串口和软件通信,软件负责解析数据和存储记录.有时候客户想要领导办公室或者分管这一块的部门经理办公室 ...