Unleashing the Power of Semantic Chunking: A Journey with LlamaIndex



Each day we pick a practical, problem-focused article from abroad, translate it carefully, and walk through its key points, helping readers build hands-on problem-solving and coding skills.

Original title: Unleashing the Power of Semantic Chunking: A Journey with LlamaIndex

Original article: https://medium.com/ai-advances/unleashing-the-power-of-semantic-chunking-a-journey-with-llamaindex-767e3499ca73



Introduction

In the ever-expanding landscape of language models, getting the most out of an application often requires breaking large bodies of text into more digestible pieces. This process, known as semantic chunking, plays a key role in boosting the performance of models such as ChatGPT and in giving applications a form of long-term memory.

Defining Semantic Chunking

Semantic chunking, also called splitting, is the practice of breaking large volumes of text data into smaller, more manageable segments. In multimodal settings the concept extends beyond text to images as well. In this tutorial we will dig into the five levels of text splitting, exploring a range of strategies, including an interesting integration with LlamaIndex.

The Levels of Text Splitting

Level 1: Character Splitting - A Simple Start

At the most basic level we encounter character splitting. This means cutting the text into static character chunks, a straightforward but limited approach. The emphasis here is on simplicity: chunk sizes are fixed, regardless of content or structure.

Pros: easy and simple

Cons: rigid; ignores the structure of the text

Key concepts to grasp:

While character splitting may not be the ideal choice for your application, it serves as a cornerstone for understanding the basics of semantic chunking.
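To make the idea concrete, here is a minimal Python sketch of fixed-size character splitting; the chunk size of 35 and the sample sentence are illustrative assumptions, not something from the original article:

def character_split(text: str, chunk_size: int = 35) -> list[str]:
    # Cut the text into fixed-size character chunks, ignoring content and structure
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

text = "This is the text I would like to chunk up. It is the example text for this exercise."
for chunk in character_split(text):
    print(repr(chunk))

Notice how words get sliced mid-way: the splitter knows nothing about the text, which is exactly the rigidity the cons above describe.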

Level 2: Recursive Character Text Splitting - Through the Maze of Separators

Moving beyond plain character splitting, we enter the realm of recursive character text splitting. Here the process becomes more sophisticated, relying on a recursive approach driven by a defined list of separators. These separators act as guides, helping us create dynamic chunks that adapt to the nuances of the text.

Pros: better adaptability; dynamic chunking

Cons: added complexity

Digging deeper:

At this level it is essential to understand the recursive nature of the process. Picture the text as a maze, with the separators as markers guiding the recursive exploration. It is an intricate dance that allows for more nuanced splits, improving the model's ability to capture context.
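As a rough illustration of the recursive idea, here is a hand-rolled sketch (a library splitter such as LangChain's RecursiveCharacterTextSplitter implements the same idea more robustly); the separator priority list, the 200-character limit, and the sample text are assumptions for demonstration, and the separators themselves are dropped for simplicity:

def recursive_split(text: str, separators: list[str], chunk_size: int = 200) -> list[str]:
    # Base case: the text already fits, or we have run out of separators
    if len(text) <= chunk_size or not separators:
        return [text]
    sep, rest = separators[0], separators[1:]
    chunks: list[str] = []
    for piece in text.split(sep):
        if len(piece) <= chunk_size:
            chunks.append(piece)
        else:
            # Recurse with the next, finer-grained separator
            chunks.extend(recursive_split(piece, rest, chunk_size))
    return [c for c in chunks if c.strip()]

long_text = (
    "Semantic chunking breaks a document into meaningful pieces. " * 8
    + "\n\n"
    + "A second paragraph keeps its own chunks separate. " * 8
)
chunks = recursive_split(long_text, separators=["\n\n", "\n", ". ", " "])
print(len(chunks), chunks[0][:60])

The splitter first tries the coarsest separator (paragraph breaks) and only falls back to finer ones when a piece is still too large, which is what lets the chunks follow the text's own structure.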

Level 3: Document-Specific Splitting - Tailored for Diversity

Text is not one-size-fits-all. Document-specific chunking recognizes this, offering different chunking methods for different document types, whether PDFs, Python scripts, or Markdown files. This level ensures that your chunking strategy aligns with the distinctive structure of each kind of document.

Pros: tailored to the document type; improved relevance

Cons: requires knowledge of the document type

Crafting a strategy:

Imagine having a toolkit with different methods for different document formats. It is like holding a set of versatile keys, each unlocking the potential of semantic chunking in a different domain.
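To illustrate one key from that toolkit, here is a minimal sketch of a Markdown-aware splitter that treats headers as natural chunk boundaries (LlamaIndex and similar libraries ship ready-made parsers for Markdown, code, and other formats); the regex and the sample document are simplified assumptions:

import re

def markdown_split(md_text: str) -> list[str]:
    # Split a Markdown document at its headers, so each chunk is one section
    chunks, current = [], []
    for line in md_text.splitlines():
        if re.match(r"^#{1,6}\s", line) and current:
            chunks.append("\n".join(current))
            current = []
        current.append(line)
    if current:
        chunks.append("\n".join(current))
    return chunks

sections = markdown_split("# Title\nIntro text.\n## Setup\nInstall steps.\n## Usage\nRun it.")
print(sections)

A PDF- or code-specific splitter would follow the same pattern, just with boundaries suited to that format (pages and layout blocks, or functions and classes).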

Level 4: Semantic Splitting - Walking the Embedding Path

Stepping into the realm of semantic splitting, the focus shifts to chunking guided by an embedding walk. This involves a deeper understanding of the context within each chunk. It is no longer about characters or separators; it is about embedding the essence of meaning into every segment, creating an interconnected web of understanding.

Pros: contextual depth; enhanced semantic understanding

Cons: higher computational cost

The art of embedding:

Think of this level as an art form. Each chunk becomes a canvas, and the embedding walk is the brushstroke painting a richer, more intricate picture of the text's semantic landscape.
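Under the hood, this is essentially the idea behind the SemanticChunker used later in this article: embed neighboring sentences, measure the cosine distance between them, and break wherever the distance spikes above a percentile threshold. Below is a minimal sketch of that idea using sentence-transformers; the embedding model name matches the one used later, while the 95th-percentile default and the overall simplification are assumptions:

import numpy as np
from sentence_transformers import SentenceTransformer

def semantic_split(sentences: list[str], percentile: float = 95.0) -> list[str]:
    # Embed each sentence; normalized vectors make the dot product a cosine similarity
    model = SentenceTransformer("BAAI/bge-small-en-v1.5")
    emb = model.encode(sentences, normalize_embeddings=True)
    # Cosine distance between each sentence and the next
    distances = 1 - np.sum(emb[:-1] * emb[1:], axis=1)
    threshold = np.percentile(distances, percentile)
    # Start a new chunk wherever the semantic distance spikes above the threshold
    chunks, current = [], [sentences[0]]
    for sent, dist in zip(sentences[1:], distances):
        if dist > threshold:
            chunks.append(" ".join(current))
            current = []
        current.append(sent)
    chunks.append(" ".join(current))
    return chunks

sentences = [
    "I grew up writing BASIC on a home computer.",
    "My first programs were small games.",
    "Cooking, by contrast, is my weekend hobby.",
    "I bake bread most Saturdays.",
]
print(semantic_split(sentences))

The break lands between the programming sentences and the cooking ones because that is where the embeddings diverge most, with no separator or chunk size involved.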

Level 5: Agentic Splitting - Liberating Text with an Agent-Like System

At the cutting edge of text-splitting innovation we encounter agentic splitting. This experimental approach envisions text splitting through an agent-like system. It is a revolutionary method that becomes especially valuable if you anticipate token costs trending toward zero.

Pros: highly adaptive; a forward-looking approach

Cons: requires experimentation and testing

The future of text liberation:

Picture an agent navigating through the text, adapting dynamically to its subtleties. This level hints at a future where text is not merely split but liberated, enabling unparalleled adaptability and efficiency.
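The article gives no implementation for this level, but one way to picture an agent-like splitter is a loop in which an LLM decides, statement by statement, whether the next piece of text belongs to the current chunk. In the sketch below, llm_complete is a hypothetical helper standing in for any completion call (for instance, the LlamaCPP model configured later); the prompt wording is likewise an assumption:

def agentic_split(propositions: list[str], llm_complete) -> list[list[str]]:
    # Let an LLM decide, one proposition at a time, whether the next
    # proposition continues the current chunk or starts a new one
    chunks = [[propositions[0]]]
    for prop in propositions[1:]:
        prompt = (
            "Current chunk:\n" + "\n".join(chunks[-1]) +
            f"\n\nNew statement: {prop}\n"
            "Does the new statement belong to the current chunk? Answer YES or NO."
        )
        answer = llm_complete(prompt)  # llm_complete is a hypothetical completion helper
        if answer.strip().upper().startswith("YES"):
            chunks[-1].append(prop)
        else:
            chunks.append([prop])
    return chunks

One LLM call per statement is what makes this approach expensive today, and why it becomes attractive only as token costs fall.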

Achieving Chunking Excellence with LlamaIndex:

In essence, the LlamaIndex integration is a strategic move to unleash the full potential of semantic chunking, seamlessly blending innovation with practicality.

Implementing Semantic Chunking with LlamaIndex:

Now, let's dive into the practicalities of integrating LlamaIndex into your code for semantic chunking. The following code snippets provide a concise implementation guide:

Step 1: Install the libraries

!pip install llama_index html2text trulens_eval sentence-transformers
!CMAKE_ARGS="-DLLAMA_CUBLAS=on" FORCE_CMAKE=1 pip install llama-cpp-python --no-cache-dir

Step 2: Import the libraries

import os
import logging
import sys

import torch
import numpy as np

# Set up the OpenAI API key
os.environ["OPENAI_API_KEY"] = ""

logging.basicConfig(stream=sys.stdout, level=logging.INFO)
logging.getLogger().addHandler(logging.StreamHandler(stream=sys.stdout))

from llama_index import VectorStoreIndex, SimpleDirectoryReader, ServiceContext
from llama_index.llms import HuggingFaceLLM
from llama_index.llms import LlamaCPP
from llama_index.llms.llama_utils import messages_to_prompt, completion_to_prompt
from llama_index.llama_pack import download_llama_pack
from llama_index.response.notebook_utils import display_source_node
# note: this import resolves only after the pack is downloaded in Step 4
from semantic_chunking_pack.base import SemanticChunker
from llama_index.embeddings import HuggingFaceEmbedding
from llama_index.node_parser import SentenceSplitter
from llama_index.embeddings import OpenAIEmbedding
from llama_index.indices.postprocessor import SentenceTransformerRerank

Step 3: Download the data

# save under data/ so the SimpleDirectoryReader path in Step 5 resolves
!mkdir -p data
!wget 'https://raw.githubusercontent.com/run-llama/llama_index/main/docs/examples/data/paul_graham/paul_graham_essay.txt' -O 'data/pg_essay.txt'

Step 4: Customize the LLM

llm = LlamaCPP(
    # You can pass in the URL to a GGML model to download it automatically
    model_url='https://huggingface.co/TheBloke/zephyr-7B-alpha-GGUF/resolve/main/zephyr-7b-alpha.Q5_K_M.gguf',
    # optionally, you can set the path to a pre-downloaded model instead of model_url
    model_path=None,
    temperature=0.1,
    max_new_tokens=256,
    # llama2 has a context window of 4096 tokens, but we set it lower to allow for some wiggle room
    context_window=3900,
    # kwargs to pass to __call__()
    generate_kwargs={},
    # kwargs to pass to __init__()
    # set to at least 1 to use GPU
    model_kwargs={"n_gpu_layers": -1},
    # transform inputs into Llama2 format
    messages_to_prompt=messages_to_prompt,
    completion_to_prompt=completion_to_prompt,
    verbose=True,
)

# Download the semantic chunking package
download_llama_pack(
    "SemanticChunkingQueryEnginePack",
    "./semantic_chunking_pack",
    skip_load=True,
    # leave the below line commented out if using the notebook on main
    # llama_hub_url="https://raw.githubusercontent.com/run-llama/llama-hub/jerry/add_semantic_chunker/llama_hub"
)

Step 5: Initialize the dependencies

# load documents
documents = SimpleDirectoryReader(input_files=["/content/data/pg_essay.txt"]).load_data()

# initialize our custom embeddings
embed_model = HuggingFaceEmbedding(model_name="BAAI/bge-small-en-v1.5")

splitter = SemanticChunker(
    buffer_size=1,
    breakpoint_percentile_threshold=95,
    embed_model=embed_model,
)

# also a baseline splitter
base_splitter = SentenceSplitter(chunk_size=512)

# initialize the reranker
rerank = SentenceTransformerRerank(
    model="cross-encoder/ms-marco-MiniLM-L-12-v2",
    top_n=3,
)

service_context = ServiceContext.from_defaults(
    chunk_size=512,
    llm=llm,
    embed_model=embed_model,
)

nodes = splitter.get_nodes_from_documents(documents)
print(nodes[1].get_content())

The output is as follows:

The language we used was an early version of Fortran. You had to type programs on punch cards, then stack them in the card reader and press a button to load the program into memory and run it. The result would ordinarily be to print something on the spectacularly loud printer. I was puzzled by the 1401. I couldn't figure out what to do with it. And in retrospect there's not much I could have done with it. The only form of input to programs was data stored on punched cards, and I didn't have any data stored on punched cards. The only other option was to do things that didn't rely on any input, like calculate approximations of pi, but I didn't know enough math to do anything interesting of that type. So I'm not surprised I can't remember any programs I wrote, because they can't have done much. My clearest memory is of the moment I learned it was possible for programs not to terminate, when one of mine didn't. On a machine without time-sharing, this was a social as well as a technical error, as the data center manager's expression made clear. With microcomputers, everything changed. Now you could have a computer sitting right in front of you, on a desk, that could respond to your keystrokes as it was running instead of just churning through a stack of punch cards and then stopping. [1] The first of my friends to get a microcomputer built it himself. It was sold as a kit by Heathkit. I remember vividly how impressed and envious I felt watching him sitting in front of it, typing programs right into the computer. Computers were expensive in those days and it took me years of nagging before I convinced my father to buy one, a TRS-80, in about 1980. The gold standard then was the Apple II, but a TRS-80 was good enough. This was when I really started programming.

Step 6: Compare with the baseline

base_nodes = base_splitter.get_nodes_from_documents(documents)
print(base_nodes[2].get_content())

The output is as follows:

This was when I really started programming. I wrote simple games, a program to predict how high my model rockets would fly, and a word processor that my father used to write at least one book. There was only room in memory for about 2 pages of text, so he'd write 2 pages at a time and then print them out, but it was a lot better than a typewriter. Though I liked programming, I didn't plan to study it in college. In college I was going to study philosophy, which sounded much more powerful. It seemed, to my naive high school self, to be the study of the ultimate truths, compared to which the things studied in other fields would be mere domain knowledge. What I discovered when I got to college was that the other fields took up so much of the space of ideas that there wasn't much left for these supposed ultimate truths. All that seemed left for philosophy were edge cases that people in other fields felt could safely be ignored. I couldn't have put this into words when I was 18. All I knew at the time was that I kept taking philosophy courses and they kept being boring. So I decided to switch to AI. AI was in the air in the mid 1980s, but there were two things especially that made me want to work on it: a novel by Heinlein called The Moon is a Harsh Mistress, which featured an intelligent computer called Mike, and a PBS documentary that showed Terry Winograd using SHRDLU. I haven't tried rereading The Moon is a Harsh Mistress, so I don't know how well it has aged, but when I read it I was drawn entirely into its world. It seemed only a matter of time before we'd have Mike, and when I saw Winograd using SHRDLU, it seemed like that time would be a few years at most. All you had to do was teach SHRDLU more words. There weren't any classes in AI at Cornell then, not even graduate classes, so I started trying to teach myself. Which meant learning Lisp, since in those days Lisp was regarded as the language of AI. The commonly used programming languages then were pretty primitive, and programmers' ideas correspondingly so. The default language at Cornell was a Pascal-like language called PL/I, and the situation was similar elsewhere.

Step 6 (continued): Vectorize the content

vector_index = VectorStoreIndex(nodes, service_context=service_context)
query_engine = vector_index.as_query_engine(node_postprocessors=[rerank])

base_vector_index = VectorStoreIndex(base_nodes, service_context=service_context)
base_query_engine = base_vector_index.as_query_engine(node_postprocessors=[rerank])

response = query_engine.query(
    "Tell me about the author's programming journey through childhood to college"
)
print(str(response))

The output is as follows:

The author's programming journey began in childhood when the only form of input to programs was data stored on punched cards. However, the author did not have any data stored on punched cards, so they were limited in what they could do with programming. They did not know enough math to calculate interesting approximations of pi either. The author's clearest memory of programming during this time was when they learned that programs could fail to terminate. In college, the author initially planned to study philosophy but found it boring and switched to studying AI. They taught themselves Lisp, which was regarded as the language of AI at the time. Learning Lisp expanded their concept of a program and they became passionate about programming.


base_response = base_query_engine.query(
    "Tell me about the author's programming journey through childhood to college"
)
print(str(base_response))

The output is as follows:

The author's programming journey began in childhood when they started writing simple games and programs to predict the flight of model rockets. They also developed a word processor that their father used to write a book. Despite their interest in programming, the author initially planned to study philosophy in college. However, they found philosophy courses to be boring and decided to switch to studying AI. At that time, there were no AI classes at Cornell, so the author taught themselves by learning Lisp, which was considered the language of AI. The author's programming journey continued to evolve as they encountered new technologies, such as microcomputers, which allowed for more interactive and accessible programming experiences.

Step 7: TruLens evaluation

# Initialize TruLens
from trulens_eval import Feedback, Tru, TruLlama
from trulens_eval.feedback import Groundedness
from trulens_eval.feedback.provider.openai import OpenAI

tru = Tru()

# Initialize the feedback provider class
openai = OpenAI()

grounded = Groundedness(groundedness_provider=OpenAI())

# Define a groundedness feedback function
f_groundedness = (
    Feedback(grounded.groundedness_measure_with_cot_reasons)
    .on(TruLlama.select_source_nodes().node.text.collect())
    .on_output()
    .aggregate(grounded.grounded_statements_aggregator)
)

# Question/answer relevance between overall question and answer
f_qa_relevance = Feedback(openai.relevance).on_input_output()

# Question/statement relevance between question and each context chunk
f_qs_relevance = (
    Feedback(openai.qs_relevance)
    .on_input()
    .on(TruLlama.select_source_nodes().node.text)
    .aggregate(np.mean)
)

The output is as follows:

✅ In groundedness_measure_with_cot_reasons, input source will be set to __record__.app.query.rets.source_nodes[:].node.text.collect() .
✅ In groundedness_measure_with_cot_reasons, input statement will be set to __record__.main_output or `Select.RecordOutput` .
✅ In relevance, input prompt will be set to __record__.main_input or `Select.RecordInput` .
✅ In relevance, input response will be set to __record__.main_output or `Select.RecordOutput` .
✅ In qs_relevance, input question will be set to __record__.main_input or `Select.RecordInput` .
✅ In qs_relevance, input statement will be set to __record__.app.query.rets.source_nodes[:].node.text .


tru_query_engine_recorder = TruLlama(
    query_engine,
    app_id='LlamaIndex_App1',
    feedbacks=[f_groundedness, f_qa_relevance, f_qs_relevance],
)

# or as a context manager
with tru_query_engine_recorder as recording:
    query_engine.query("Tell me about the author's programming journey through childhood to college")

tru.run_dashboard()

Conclusion

In the journey of refining language model applications, the integration of LlamaIndex proves to be a pivotal move, especially in the realm of semantic chunking. As we conclude this expedition through text chunking, we find that LlamaIndex is not merely a tool but a catalyst for innovation and efficiency.

By seamlessly blending innovative indexing capabilities with semantic chunking techniques, LlamaIndex reshapes the landscape of data processing. Its strengths are many, from alternative representations that enrich the semantic depth of chunks to strategic approaches that ensure optimal retrieval and indexing.

In the code implementation section, we witnessed the practicality of this integration. The simplicity and flexibility of the implementation lay the groundwork for customization, allowing developers to tailor solutions to their unique needs.

As we traversed the levels, weighed strategies, and held to the principles of chunking, LlamaIndex stood as a beacon guiding continual improvement. It is not just a tool for today but a forward-looking one, hinting at a future where semantic chunking becomes synonymous with efficiency and innovation.

In conclusion, the synergy between LlamaIndex and semantic chunking testifies to the ongoing evolution of language model applications. The journey does not end here; it extends into a future where every chunk of text carries the potential for transformative impact. Embrace the integration, explore the possibilities, and let LlamaIndex be the guiding force in unleashing the full potential of your language models.

Resources

[1] https://github.com/run-llama/llama-hub/blob/main/llama_hub/llama_packs/node_parser/semantic_chunking/semantic_chunking.ipynb

[2] https://github.com/FullStackRetrieval-com/RetrievalTutorials/blob/main/5_Levels_Of_Text_Splitting.ipynb

[3] Retrieval-Augmented Generation for Large Language Models: A Survey

[4] https://www.youtube.com/watch?v=8OJC21T2SL4
