
Building your own RAG application using Together AI and LlamaIndex

January 11, 2024

By Together AI

Together AI provides the fastest cloud platform for building and running generative AI. Today we are launching the Together Embeddings endpoint. As part of a series of blog posts about the Together Embeddings endpoint release, we are excited to announce that you can build your own powerful RAG-based application right from the Together platform with LlamaIndex.

What is Retrieval Augmented Generation (RAG)?

Retrieval Augmented Generation (RAG) (original paper: Lewis et al.) leverages both generative and retrieval models for knowledge-intensive tasks. It improves generative AI applications by supplying up-to-date information and domain-specific data from external sources during response generation, reducing the risk of hallucinations and significantly improving accuracy.
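The core of RAG is simple: retrieved passages are spliced into the prompt so the model answers from them rather than from its training data alone. As a minimal illustrative sketch (the function name and prompt wording here are our own, not part of the Together or LlamaIndex APIs):

```python
def build_augmented_prompt(query: str, retrieved_passages: list[str]) -> str:
    """Combine retrieved context with the user query into a single prompt."""
    context = "\n\n".join(retrieved_passages)
    return (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {query}\nAnswer:"
    )

prompt = build_augmented_prompt(
    "What is RedPajama-Data-v2?",
    ["RedPajama-Data-v2 is an open dataset for training large language models."],
)
```

The generative model then completes this augmented prompt, grounding its answer in the retrieved text.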

Building a RAG system is cost- and data-efficient: it requires no expertise in training models, while keeping the advantages mentioned above. Note that you can still fine-tune an embedding or generative model to improve the quality of your RAG solution even further! Check out the Together fine-tuning API to get started.

Quickstart

To build RAG, you first need to create a vector store by indexing your source documents using an embedding model of your choice. LlamaIndex provides libraries to load and transform documents. After this step, you will create a VectorStoreIndex for your document objects with vector embeddings, and store them in a vector store. LlamaIndex supports numerous vector stores; see the complete list of supported vector stores here. Then, when you have a query, you will retrieve relevant information from the vector store, augment your original query with it, and use an LLM to generate the final output.
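Under the hood, retrieval from a vector store amounts to ranking stored document embeddings by similarity to the query embedding. A toy sketch of that step (the three-dimensional vectors below are made up for illustration; real embeddings from a model like m2-bert-80M-8k-retrieval have hundreds of dimensions, and LlamaIndex performs this ranking for you):

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def top_k(query_vec: list[float], store: list[tuple[str, list[float]]], k: int = 2) -> list[str]:
    """Return the k documents whose embeddings are most similar to the query."""
    ranked = sorted(store, key=lambda d: cosine_similarity(query_vec, d[1]), reverse=True)
    return [doc for doc, _ in ranked[:k]]

# Hypothetical store: (document text, embedding) pairs.
store = [
    ("doc about datasets", [0.9, 0.1, 0.0]),
    ("doc about GPUs",     [0.1, 0.9, 0.0]),
    ("doc about tokens",   [0.8, 0.2, 0.1]),
]
print(top_k([1.0, 0.0, 0.0], store, k=2))
```

The retrieved documents are then passed to the LLM alongside the query, which is exactly what the `similarity_top_k` parameter controls in the full example below.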

Below you will find an example of how you can incorporate a new article into your RAG application using the Together API and LlamaIndex, so that a generative model can respond with the correct information.

First, install the llama-index package with pip. See the installation documentation for other ways to install.

pip install -U llama-index

Set the environment variables for the API keys. You can find the Together API key under the settings tab in Together Playground.

import getpass
import os

os.environ["TOGETHER_API_KEY"] = getpass.getpass()

Now we will provide some of our recent blog posts, including the RedPajama-Data-v2 announcement, and ask "What is RedPajama-Data-v2?" with the retrieved information to the Mixtral-8x7B-Instruct-v0.1 model, which was trained before the blog post was released. We will use "togethercomputer/m2-bert-80M-8k-retrieval" for embeddings.

from llama_index import SimpleDirectoryReader, ServiceContext, VectorStoreIndex
from llama_index.embeddings import TogetherEmbedding
from llama_index.llms import TogetherLLM


# Provide a template following the LLM's original chat template.
# Mixtral Instruct expects "<s>[INST] ... [/INST]"; the closing </s>
# token ends a response, so it should not appear in the prompt itself.
def completion_to_prompt(completion: str) -> str:
    return f"<s>[INST] {completion} [/INST]"


def run_rag_completion(
    document_dir: str,
    query_text: str,
    embedding_model: str = "togethercomputer/m2-bert-80M-8k-retrieval",
    generative_model: str = "mistralai/Mixtral-8x7B-Instruct-v0.1"
    ) -> str:
    service_context = ServiceContext.from_defaults(
        llm=TogetherLLM(
            generative_model,
            temperature=0.8,
            max_tokens=256,
            top_p=0.7,
            top_k=50,
            # stop=...,
            # repetition_penalty=...,
            is_chat_model=False,
            completion_to_prompt=completion_to_prompt
        ),
        embed_model=TogetherEmbedding(embedding_model)
    )
    documents = SimpleDirectoryReader(document_dir).load_data()
    index = VectorStoreIndex.from_documents(documents, service_context=service_context)
    response = index.as_query_engine(similarity_top_k=5).query(query_text)

    return str(response)


query_text = "What is RedPajama-Data-v2? Describe in a simple sentence."
document_dir = "./sample_doc_data"

response = run_rag_completion(document_dir, query_text)
print(response)

>>> RedPajama-Data-v2 is an open dataset with 30 trillion tokens for training large language models, built from CommonCrawl data and containing 40+ quality annotations.

The answer reflects the correct and recent information included in the blog post! If we run the LLM completion with the same query, "What is RedPajama-Data-v2? Describe in a simple sentence.", but without retrieval, it returns a less informative response:

>>> RedPajama-Data-v2 is a large-scale dataset consisting of English text used for training language models.

Conclusion

The above example demonstrates how to build a RAG system using Together AI and LlamaIndex. By leveraging these tools, you can create a generative application that provides accurate and up-to-date responses by retrieving relevant data from your vector store.

As you continue to explore the capabilities of Together APIs and LlamaIndex, we encourage you to experiment with different use cases and applications. We are excited to see the innovative solutions that you will build using these powerful tools.

Thank you for following along with this tutorial!
