
Optimized RAG pipeline for large-scale ingestion and retrieval with HPE Private Cloud AI

By Ashish Kumar, Data and AI Infrastructure Solutions Architect, HPE

This blog explains the importance of retrieval-augmented generation (RAG) for AI applications and provides detailed guidance on designing ingestion and retrieval pipelines for large-scale RAG use cases. It highlights the features of the HPE Private Cloud AI solution that can be used to build optimized RAG pipelines.

Generative AI (GenAI) is a key focus area for every business strategy and initiative. It’s a subfield of artificial intelligence (AI) that uses advanced machine learning techniques to generate unique content or data, such as text, images, music, or even 3D models from an input request.

GenAI solutions use large language models (LLMs) that have been trained on massive amounts of data and can predict the next word or pixel using the user input request and the already generated output. LLMs have disrupted traditional machine learning by enabling efficient natural-language interaction. They can understand and generate human-like text across various languages and topics.

In the diagram below, you’ll find industry verticals and techniques for using LLMs.  

[Figure: Industry verticals and techniques for using LLMs]

 

LLMs are at the core of intelligent chatbots and natural language processing applications. They enable chatbots to respond in a human-like way. Publicly available LLMs are pretrained on a large body of publicly available data and, therefore, have no inherent knowledge of enterprise private data. One challenge with LLMs is their tendency to hallucinate or produce unpredictable responses by making up facts. LLMs can also become out of date if they are not constantly retrained on new data sets.  To mitigate these issues, LLMs are integrated with retrieval-augmented generation (RAG).

RAG is a use-case pipeline that optimizes the output generated by an LLM by providing additional context from authorized sources outside of the dataset that was used to train the LLM. RAG extends the already powerful capability of LLMs to specific domains or an organization's internal knowledge base, mitigating the need to retrain.

Given that the accuracy of the answers an LLM-based bot returns to user questions depends on the RAG implementation, it is critical to design a RAG architecture that can onboard a large number of domain-specific knowledge sources and convert them efficiently and securely into a knowledge base. This conversion requires an appropriate choice of embedding model, an optimized chunking methodology for the source data, and representation of those chunks as knowledge vectors in the vector database.


 

HPE Private Cloud AI solution for RAG

To simplify and accelerate development of GenAI applications, HPE and NVIDIA co-developed HPE Private Cloud AI. HPE Private Cloud AI is the first fully integrated, ready-to-deploy solution from the NVIDIA AI Computing by HPE initiative, a partnership to empower enterprises to achieve their AI goals. NVIDIA AI Computing by HPE combines expertise, technology, and cost efficiency to streamline AI deployments, mitigate risks, and manage AI costs over time.


 
Before you deploy your RAG, you will need to answer these questions:   

  • How do I design and build my RAG pipeline?
  • How do I deploy my embedding model on-premises, and which one should I choose?
  • How do I connect to multiple enterprise data sources to feed the data to the RAG ingestion pipeline?
  • What chunking strategy should I choose for optimized RAG ingestion and retrieval operations?
  • How do I deploy my vector store, and which one should I choose for RAG operations?
  • How many instances of the LLM, embedder, and reranker should be used in my RAG pipeline?

HPE Private Cloud AI, built on top of HPE AI Essentials, helps simplify and address these questions and challenges by offering an on-demand RAG capability. This allows you to create a RAG pipeline in three clicks:

  1. Choose your LLM.
  2. Select your data sources and choose the chunking strategy.
  3. Deploy.

This RAG deployment has a built-in vector database, and every data source is access-controlled to provide a secure RAG. The deployed RAG can be consumed by any GenAI application using industry-standard APIs, such as OpenAI-compatible APIs, LangChain framework APIs, or LlamaIndex framework APIs.
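As an illustration, here is a minimal sketch of how a GenAI application might call the deployed RAG through an OpenAI-compatible chat completions API. The endpoint URL, model name, and API token shown are placeholders rather than actual HPE Private Cloud AI values; substitute the ones provided by your own deployment.

```python
# Minimal sketch: querying a deployed RAG endpoint through an OpenAI-compatible API.
# The base_url, model name, and API key are placeholders for deployment-specific values.
from openai import OpenAI

client = OpenAI(
    base_url="https://<your-rag-endpoint>/v1",   # hypothetical endpoint URL
    api_key="<your-api-token>",                  # token issued by your deployment
)

response = client.chat.completions.create(
    model="<deployed-llm-name>",                 # the LLM chosen in step 1
    messages=[{"role": "user", "content": "Summarize our latest support policy."}],
)
print(response.choices[0].message.content)
```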

Here are the components that are available in HPE Private Cloud AI (PCAI) for end-to-end deployment of a RAG pipeline:

[Figure: HPE Private Cloud AI components for end-to-end RAG pipeline deployment]

Now let's dive into the details of the ingestion and retrieval portions of a particular RAG pipeline implementation and discuss the performance metrics that are useful when building the RAG.

Ingestion pipeline architecture

The first stage of a RAG use-case implementation is the ingestion pipeline, which integrates knowledge sources so that unstructured enterprise data can be onboarded.

[Figure: Ingestion pipeline architecture]

  1. The ingestion pipeline phase of the RAG enables integration with knowledge sources whereby unstructured enterprise data is first onboarded to the object datastore of HPE Private Cloud AI using S3 APIs.
  2. Unstructured data is then pre-processed to segregate text data and image data using a series of parallel tasks. This is achieved with PDF Parser, Text Parser, and Document Parser, depending on the nature of the documents.
  3. Images are further processed using a series of image models for different types of charts and tables. Text is fed directly to the next stage of the pipeline.
  4. Once preprocessing is complete, the data is chunked, and each chunk is transformed into a vector using the embedding model.
  5. Finally, the generated embeddings are stored in the vector database managed by HPE Private Cloud AI (a simplified code sketch of steps 4 and 5 follows this list).
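To make steps 4 and 5 concrete, here is a simplified sketch of chunking and embedding in Python. It is an illustration under stated assumptions rather than the actual PCAI implementation: embed() is a dummy stand-in for the embedding model endpoint, and the store list stands in for the managed vector database.

```python
# Simplified sketch of steps 4 and 5: chunk the preprocessed text, embed each
# chunk, and store the resulting vectors. embed() and `store` are placeholders.
from dataclasses import dataclass

@dataclass
class VectorRecord:
    doc_id: str
    chunk: str
    embedding: list[float]

def embed(text: str) -> list[float]:
    return [0.0, 0.0]          # dummy stand-in for a call to the embedding model

def chunk_text(text: str, chunk_size: int = 250) -> list[str]:
    words = text.split()
    return [" ".join(words[i:i + chunk_size]) for i in range(0, len(words), chunk_size)]

def ingest(doc_id: str, raw_text: str, store: list[VectorRecord]) -> None:
    for chunk in chunk_text(raw_text):                            # step 4: chunk the text
        store.append(VectorRecord(doc_id, chunk, embed(chunk)))   # step 5: embed and store
```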

Retrieval pipeline architecture of RAG

Once the generated embeddings are stored in the vector store, the retrieval pipeline is ready to accept a request, which will trigger retrieval.  

[Figure: Retrieval pipeline architecture]

User queries are first processed through the embedding model. The vector store retrievers then use the resulting vector to find similar vectors in the vector database. These vectors are then replaced with the corresponding document chunks. Further optimization is achieved by using a reranking model to enhance the search results by re-ordering the document chunks to prioritize more relevant answers. The retrieved responses are then processed through the inference models to generate the human-like language response.
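The flow described above can be sketched as follows. This is an illustrative outline only; the helper functions are dummy stand-ins for the embedding service, vector store retriever, reranking service, and LLM endpoint, not real PCAI or NIM APIs.

```python
# Illustrative retrieval flow: embed the query, search the vector store, rerank
# the retrieved chunks, and generate the answer. All helpers are dummy stand-ins.
from typing import NamedTuple

class Hit(NamedTuple):
    chunk: str
    score: float

def embed(text: str) -> list[float]:
    return [0.0]                                  # placeholder embedding service

def vector_store_search(vec: list[float], k: int) -> list[Hit]:
    return [Hit("example chunk", 1.0)][:k]        # placeholder vector store retriever

def rerank(query: str, chunks: list[str]) -> list[str]:
    return chunks                                 # placeholder reranking service

def llm_generate(prompt: str) -> str:
    return "generated answer"                     # placeholder LLM endpoint

def answer(query: str, top_k: int = 4) -> str:
    query_vec = embed(query)                      # 1. embed the user query
    hits = vector_store_search(query_vec, top_k)  # 2. similarity search in the vector DB
    chunks = [h.chunk for h in hits]              # 3. map hits back to document chunks
    ranked = rerank(query, chunks)                # 4. re-order chunks by relevance
    context = "\n\n".join(ranked)
    prompt = f"Answer using only this context:\n{context}\n\nQuestion: {query}"
    return llm_generate(prompt)                   # 5. generate the final response
```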

All of the models in the retrieval pipeline are spun up as NVIDIA Inference Microservices (NIM) with inference endpoints that are managed by HPE Private Cloud AI and HPE AI Essentials. The vector store is built as another service, backed by internal high-performance storage within HPE Private Cloud AI, and provides petabyte scale with highly scalable read performance for retrieval operations.

Performance metrics for ingestion and retrieval pipeline

Performance for a RAG pipeline can be measured and analyzed across several different factors. This blog is not intended to provide a comprehensive performance analysis of a complete RAG pipeline, but rather to provide some insights into the different components and how they impact overall performance.

When looking at ingestion and retrieval pipeline performance, there are several key metrics that are critical for the end-to-end performance evaluation of a RAG. 

[Figure: Key metrics for ingestion and retrieval pipeline performance]

One of the critical factors defining the performance metrics of a RAG is chunk size (also called split size). Chunking is the process of splitting large datasets into smaller pieces of information for efficient use by LLMs. Chunk size is decided at the time of ingestion and defines the size of the text passage extracted from source data for generating a retrieval index. There are many ways to split the data to improve the efficiency of RAG. We will now look at how ingestion throughput, retrieval throughput, and RAG accuracy metrics are affected by varying chunk sizes.
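A quick way to see why chunk size matters is to count how many chunks (and therefore vectors) a document produces at different split sizes. The sketch below approximates tokens by whitespace-separated words; a real pipeline would use the embedding model's tokenizer.

```python
# Illustration of how split size affects the number of vectors generated.
# Tokens are approximated by whitespace-separated words for simplicity.
def split(words: list[str], split_size: int, overlap: int = 0) -> list[list[str]]:
    step = max(split_size - overlap, 1)
    return [words[i:i + split_size] for i in range(0, len(words), step)]

document = ["token"] * 100_000          # a synthetic 100,000-token document
for split_size in (100, 200, 300, 500, 1000):
    n_chunks = len(split(document, split_size, overlap=20))
    print(f"split size {split_size:>5}: {n_chunks:>5} chunks -> {n_chunks} vectors")
```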

Ingestion is characterized by its throughput, which determines the time taken to load a given dataset and is defined as the number of files and pages processed per second. Since unstructured documents contain multiple types of input (text and images), one has to choose the appropriate text and image processing models to achieve high throughput. In this blog, we processed a dataset with a large number of PDF documents containing a mix of text, images, and charts through the ingestion pipeline.

The graphs below provide ingestion throughput relative to the chunk size.

[Figure: Ingestion throughput vs. chunk size]

The following are key observations about ingestion pipeline throughput with varying chunk sizes:

  • The chunk/split size plays a critical role in achieving an optimal ingestion pipeline. When we look at the graphs shown above, we see that a split size between 200 and 300 tokens generates an optimal number of vectors.
  • If we look at computation time, a split size between 200 and 300 tokens gives better computation time for the vectors generated.
  • Ingestion throughput (Files/Sec) is optimal for split sizes between 200-300 tokens. Lower split size values give suboptimal performance, and higher values give degraded performance.

Once the knowledge base is created, queries can be sent through the retrieval pipeline. This pipeline is characterized by its Time to First Token (TTFT), also called latency, and its throughput, which is measured in tokens/s or requests/s. The pipeline should be configured with the right level of concurrency to achieve optimal retrieval performance.
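A simple way to reason about these two metrics is to time a single streaming request: TTFT is the time until the first token arrives, and throughput is the total number of tokens divided by the elapsed time. The sketch below uses a dummy token stream; in practice, stream_tokens() would wrap a streaming call to the retrieval pipeline (for example, an OpenAI-compatible endpoint with stream=True).

```python
# Minimal sketch for measuring Time to First Token (TTFT) and token throughput
# of a single streaming request. stream_tokens() is a dummy stand-in.
import time
from typing import Iterator

def stream_tokens(query: str) -> Iterator[str]:
    yield from ("dummy response token " * 25).split()    # placeholder token stream

def measure(query: str) -> tuple[float, float]:
    start = time.perf_counter()
    first_token_at = None
    n_tokens = 0
    for _ in stream_tokens(query):
        if first_token_at is None:
            first_token_at = time.perf_counter()         # first token received: TTFT
        n_tokens += 1
    elapsed = time.perf_counter() - start
    return first_token_at - start, n_tokens / elapsed    # (TTFT seconds, tokens/s)

ttft, tps = measure("What is our latest support policy?")
print(f"TTFT: {ttft * 1000:.1f} ms, throughput: {tps:.0f} tokens/s")
```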

Additionally, the vector search operation relies on semantic similarity measures such as cosine similarity or Euclidean distance. These searches determine the “K” most relevant or similar items retrieved from the dataset in response to a query; “top_k” is a user-defined parameter specifying the number of items to return.
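For illustration, here is a brute-force top_k cosine-similarity search using NumPy. Production vector databases use approximate-nearest-neighbour indexes rather than a full scan, but the ranking logic is the same.

```python
# Brute-force top_k retrieval by cosine similarity over stored chunk embeddings.
import numpy as np

def top_k_cosine(query_vec: np.ndarray, embeddings: np.ndarray, top_k: int = 4) -> np.ndarray:
    # Normalise vectors so that the dot product equals cosine similarity.
    q = query_vec / np.linalg.norm(query_vec)
    e = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    scores = e @ q
    return np.argsort(scores)[::-1][:top_k]      # indices of the K most similar chunks

embeddings = np.random.rand(1000, 384)           # 1,000 stored chunk vectors
query_vec = np.random.rand(384)                  # embedded user query
print(top_k_cosine(query_vec, embeddings, top_k=4))
```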

The graphs below provide throughput and latency results for a typical RAG use case corresponding to a chatbot or virtual assistant, which is characterized by a small number of tokens (words) as input and a high number of tokens as output.

[Figure: Retrieval pipeline throughput and latency]

From this data, several key observations can be made regarding the retrieval pipeline throughput for the search operations:

  • To maintain optimal retrieval pipeline throughput, it is critical to maintain a balance of retrieval requests across the set of NIM services (NIM LLM inference, NIM embedder, and NIM reranker). The chain server refers to the LangChain chain that comprises the pipeline stages, including the embedding model, the LLM, the vector store retriever, and the reranker.
  • The choice of input and output token counts defines the use case considered. In our use case, we used an input length of 200 tokens and an output length of 1,000 tokens.
  • The graphs above were generated for top_k = 4. A higher value of K decreases performance but improves accuracy by providing additional context to the LLM. The choice of K depends heavily on the quality of the knowledge source and the chunk size.

Another aspect to consider in the performance of a RAG pipeline is the accuracy of the retrieved response. RAG accuracy is measured by comparing the retrieved RAG responses against ground-truth data. The ground-truth data is pre-generated from the same knowledge source as the RAG, using an LLM to generate a list of questions and answers. Key metrics for defining the accuracy of the response are as follows:

[Figure: RAG accuracy metrics]

  • Context Precision measures the proportion of relevant chunks in the retrieved contexts. It is the ratio of the number of relevant retrieved chunks at rank “K” to the total number of retrieved chunks at rank “K.” Context Precision ranges from 0 to 1, and a higher value indicates better precision.
  • Context Recall measures how many of the relevant documents (or pieces of information) were successfully retrieved. It focuses on not missing important results; higher recall means fewer relevant documents were left out. (A short calculation sketch for precision and recall follows this list.)
  • Faithfulness measures the factual consistency of the generated answer against the given context. It is calculated from the answer and the retrieved context and scaled to a range of 0 to 1, where higher is better.
  • Context Relevancy assesses how pertinent the generated answer is to the given prompt. A lower score is assigned to answers that are incomplete or contain redundant information, and higher scores indicate better relevancy.
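Using the definitions above, Context Precision at rank K and Context Recall can be computed from binary relevance labels on the retrieved chunks. The example below is a simplified illustration; evaluation frameworks such as RAGAS automate these calculations and typically use an LLM to judge relevance.

```python
# Simplified calculation of context precision@K and context recall, given binary
# relevance labels for the retrieved chunks and the total number of relevant
# chunks known to exist for the question.
def context_precision_at_k(relevance: list[bool], k: int) -> float:
    retrieved = relevance[:k]
    return sum(retrieved) / len(retrieved) if retrieved else 0.0

def context_recall(relevance: list[bool], total_relevant: int) -> float:
    return sum(relevance) / total_relevant if total_relevant else 0.0

# Example: 4 retrieved chunks, 3 of them relevant; 5 relevant chunks exist in total.
relevance = [True, True, False, True]
print(context_precision_at_k(relevance, k=4))       # 0.75
print(context_recall(relevance, total_relevant=5))  # 0.6
```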

Increasing the chunk size improves RAG accuracy up to a threshold value; chunk sizes below that threshold give suboptimal RAG accuracy.

Optimize high-performance RAG pipelines with HPE Private Cloud AI

Here is a summary of implementing a high-performance RAG pipeline for GenAI use cases with the HPE Private Cloud AI solution.

  • RAG enables document pre-processing, ingestion, embedding generation, and retrieval operations. RAG implementation with HPE Private Cloud AI is a simple three-step process: choose your LLM; select your data sources and chunking strategy; and deploy.
  • The ingestion and retrieval pipelines need to be optimally designed for efficient RAG operation.
  • Key metrics need to be defined for efficient RAG, as multiple parameters participate in every stage of the RAG pipeline.
  • GenAI applications can access enterprise data in real time, and pre-trained LLMs can now remain up to date with RAG.

Meet HPE Blogger Ashish Kumar, Data and AI Infrastructure Solutions Architect, HPE

Ashish Kumar is a Data and AI Infrastructure Solutions Architect at Hewlett Packard Enterprise. He is currently focused on defining high-performance generative AI (GenAI) solutions with specialized infrastructure and driving key GenAI use cases using HPE Private Cloud AI.

With 23 years of experience, Ashish has worked in various aspects of data and analytics infrastructure solutions, including conversational AI and vision AI solutions on HPE Private Cloud AI; Kubernetes-based real-time streaming analytics solutions with HPE PCaaS; and cybersecurity graph analytics solutions utilizing HPE’s Memory Driven Computing architecture with HPE Superdome Flex. He also specializes in real-time streaming analytics solutions with the HPE Elastic Platform for Analytics, as well as the HPE Elastic Platform for Analytics for Hadoop, Spark, and NoSQL databases. Additionally, Ashish has developed reference architectures for big data analytics, infrastructure solutions, and private cloud technologies.


Cloud Services Experts
Hewlett Packard Enterprise

twitter.com/HPE_GreenLake
linkedin.com/showcase/hpe-greenlake/
hpe.com/us/en/greenlake
