
Building your own RAG application using Together AI and LlamaIndex

January 11, 2024

By Together AI

Together AI provides the fastest cloud platform for building and running generative AI. Today we are launching the Together Embeddings endpoint. As part of a series of blog posts about the Together Embeddings endpoint release, we are excited to announce that you can build your own powerful RAG-based application right from the Together platform with LlamaIndex.

What is Retrieval Augmented Generation (RAG)?

Retrieval Augmented Generation (RAG) (original paper: Lewis et al.) leverages both generative and retrieval models for knowledge-intensive tasks. It improves generative AI applications by supplying up-to-date information and domain-specific data from external sources during response generation, reducing the risk of hallucinations and significantly improving accuracy.
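The core of RAG is simple: retrieved passages are spliced into the prompt so the model answers from them rather than from its training data alone. As a minimal illustrative sketch (the function name and prompt wording here are our own, not part of the Together or LlamaIndex APIs):

```python
def build_augmented_prompt(query: str, retrieved_passages: list[str]) -> str:
    """Combine retrieved context with the user query into a single prompt."""
    context = "\n\n".join(retrieved_passages)
    return (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {query}\nAnswer:"
    )

prompt = build_augmented_prompt(
    "What is RedPajama-Data-v2?",
    ["RedPajama-Data-v2 is an open dataset for training large language models."],
)
```

The generative model then completes this augmented prompt, grounding its answer in the retrieved text.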

Building a RAG system is cost- and data-efficient: it requires no expertise in training models, while keeping the advantages mentioned above. Note that you can still fine-tune an embedding or generative model to improve the quality of your RAG solution even further! Check out the Together fine-tuning API to get started.

Quickstart

To build RAG, you first need to create a vector store by indexing your source documents using an embedding model of your choice. LlamaIndex provides libraries to load and transform documents. After this step, you will create a VectorStoreIndex for your document objects with vector embeddings, and store them in a vector store. LlamaIndex supports numerous vector stores; see the complete list of supported vector stores here. Then, when you have a query, you will retrieve relevant information from the vector store, augment your original query with it, and use an LLM to generate the final output.
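Under the hood, retrieval from a vector store amounts to ranking stored document embeddings by similarity to the query embedding. A toy sketch of that step (the three-dimensional vectors below are made up for illustration; real embeddings from a model like m2-bert-80M-8k-retrieval have hundreds of dimensions, and LlamaIndex performs this ranking for you):

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def top_k(query_vec: list[float], store: list[tuple[str, list[float]]], k: int = 2) -> list[str]:
    """Return the k documents whose embeddings are most similar to the query."""
    ranked = sorted(store, key=lambda d: cosine_similarity(query_vec, d[1]), reverse=True)
    return [doc for doc, _ in ranked[:k]]

# Hypothetical store: (document text, embedding) pairs.
store = [
    ("doc about datasets", [0.9, 0.1, 0.0]),
    ("doc about GPUs",     [0.1, 0.9, 0.0]),
    ("doc about tokens",   [0.8, 0.2, 0.1]),
]
print(top_k([1.0, 0.0, 0.0], store, k=2))
```

The retrieved documents are then passed to the LLM alongside the query, which is exactly what the `similarity_top_k` parameter controls in the full example below.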

Below you will find an example of how you can incorporate a new article into your RAG application using the Together API and LlamaIndex, so that a generative model can respond with the correct information.

First, install the llama-index package with pip. See the installation documentation for other ways to install.

pip install -U llama-index

Set the environment variables for the API keys. You can find the Together API key under the settings tab in Together Playground.

import getpass
import os

os.environ["TOGETHER_API_KEY"] = getpass.getpass()

Now we will provide some of our recent blog posts, including the RedPajama-Data-v2 announcement, and ask "What is RedPajama-Data-v2?" with the retrieved information to the Mixtral-8x7B-Instruct-v0.1 model, which was trained before the blog post was released. We will use "togethercomputer/m2-bert-80M-8k-retrieval" for embeddings.

from llama_index import SimpleDirectoryReader, ServiceContext, VectorStoreIndex
from llama_index.embeddings import TogetherEmbedding
from llama_index.llms import TogetherLLM


# Provide a template following the LLM's original chat template.
# Mixtral Instruct expects "<s>[INST] ... [/INST]"; the closing </s>
# token ends a response, so it should not appear in the prompt itself.
def completion_to_prompt(completion: str) -> str:
    return f"<s>[INST] {completion} [/INST]"


def run_rag_completion(
    document_dir: str,
    query_text: str,
    embedding_model: str = "togethercomputer/m2-bert-80M-8k-retrieval",
    generative_model: str = "mistralai/Mixtral-8x7B-Instruct-v0.1"
    ) -> str:
    service_context = ServiceContext.from_defaults(
        llm=TogetherLLM(
            generative_model,
            temperature=0.8,
            max_tokens=256,
            top_p=0.7,
            top_k=50,
            # stop=...,
            # repetition_penalty=...,
            is_chat_model=False,
            completion_to_prompt=completion_to_prompt
        ),
        embed_model=TogetherEmbedding(embedding_model)
    )
    documents = SimpleDirectoryReader(document_dir).load_data()
    index = VectorStoreIndex.from_documents(documents, service_context=service_context)
    response = index.as_query_engine(similarity_top_k=5).query(query_text)

    return str(response)


query_text = "What is RedPajama-Data-v2? Describe in a simple sentence."
document_dir = "./sample_doc_data"

response = run_rag_completion(document_dir, query_text)
print(response)

>>> RedPajama-Data-v2 is an open dataset with 30 trillion tokens for training large language models, built from CommonCrawl data and containing 40+ quality annotations.

The answer reflects the correct and recent information included in the blog post! If we run the LLM completion with the same query, "What is RedPajama-Data-v2? Describe in a simple sentence.", but without retrieval, it returns a less informative response:

>>> RedPajama-Data-v2 is a large-scale dataset consisting of English text used for training language models.

Conclusion

The above example demonstrates how to build a RAG system using Together AI and LlamaIndex. By leveraging these tools, you can create a generative application that provides accurate and up-to-date responses by retrieving relevant data from your vector store.

As you continue to explore the capabilities of Together APIs and LlamaIndex, we encourage you to experiment with different use cases and applications. We are excited to see the innovative solutions that you will build using these powerful tools.

Thank you for following along with this tutorial!
