Research

RedPajama-Data-v2: An open dataset with 30 trillion tokens for training large language models

October 30, 2023

・

Together

Today, we’re releasing a new version of the RedPajama dataset, with 30 trillion filtered and deduplicated tokens (100+ trillions raw) from 84 CommonCrawl dumps covering 5 languages, along with 40+ pre-computed data quality annotations that can be used for further filtering and weighting.

Over the last half a year, we have been pleased to see that RedPajama-1T, which we released in March, has ignited the creation of many new language models. So many people from the community have downloaded this 5TB dataset---more than 190,000 times and have been using them in such creative ways! RedPajama-1T consists of 1 trillion high-quality English tokens, but it was only the first step. Today, with the release of RedPajama-V2, we are making a further step towards the development of open datasets by releasing a massive, 30 trillion token web dataset. This is, to our best knowledge, the largest public dataset released specifically for LLM training. Even more excitingly, we include 40+ pre-computed quality annotations, allowing the community to further filter and weigh the data. Specifically, this release includes:

Over 100 billion text documents with 100+ trillion raw tokens from 84 CommonCrawl dumps;
40+ of the most widely used quality annotations pre-computed for a deduplicated 30 trillion tokens subset;
Five languages: English, French, Spanish, German, and Italian
All data processing scripts are open source and available on GitHub; all data are available on HuggingFace.

Why RedPajama-Data-v2 and How to Use it?

A central ingredient to state-of-the-art open LLMs like Llama, Mistral, Falcon, MPT, and the RedPajama models is the large amounts of high-quality data that these models are trained on. For example, Llama 2 is trained on 2.4 trillion carefully curated tokens. The most prominent data sources are the crawls made publicly available by CommonCrawl. However, this data is crude and is not ideal for direct use for LLM training due to artifacts arising from the conversion of HTML to plain text, sources of generally low quality, and biases inherent to the distribution of content on the web. Getting the right dataset and data mixture is painful and any LLM developer has to go through the laborious, time-consuming, energy-intensive and expensive steps of processing and filtering this crude data. Although there have been several community projects around this effort, such as C4, RedPajama-1T, Refinedweb (Falcon), Dolma (AI2) and SlimPajama, many of them only cover a small portion of the CommonCrawl crawls; moreover, they represent a very specific way in which data are filtered.

With RedPajama-Data-v2, our goal is to lift this burden off the community and provide a pool of web data serving as a base from which high quality datasets for LLM training can be extracted and based on which LLM training data can be thoroughly researched. It provides, to our best knowledge, the most complete coverage on CommonCrawl (with 84 dumps processed). More importantly, we provide 40+ quality annotations — the result of different ML classifiers on data quality, minhash results that can be used for fuzzy deduplication, or heuristics such as “the fraction of words that contain no alphabetical character”. We provide our best effort implementations of quality annotations used in C4, Gopher, Pretrainer’s Guide, RefinedWeb and Data Selection for Language Models via Importance Resampling. These annotations provide a way for an LLM developer to easily slice and filter the data, combining these into a new data quality pipeline to create their own pre-training dataset.

Here are some examples! The following code snippets show how one can implement commonly used filtering rules in combination with the RedPajama-V2 dataset. For example, implementing the Gopher rules and use these to filter out documents that do not comply with the Gopher rules is as easy as:

def gopher_rules_pass(sample) -> bool:
    """ function returns True if the sample complies with Gopher rules """
    signals = json.loads(sample["quality_signals"])

    # rule 1: number of words between 50 and 10'000
    word_count = signals["rps_doc_word_count"][0][2]
    if word_count < 50 or word_count > 10_000:
        return False

    # rule 2: mean word length between 3 and 10
    mean_word_length = signals["rps_doc_mean_word_length"][0][2]
    if mean_word_length < 3 or mean_word_length > 10:
        return False

    # rule 2: symbol to word ratio below 0.1
    symbol_word_ratio = signals["rps_doc_symbol_to_word_ratio"][0][2]
    if  symbol_word_ratio > 0.1:
        return False

    # rule 3: 90% of lines need to start without a bullet point
    n_lines = signals["ccnet_nlines"][0][2]
    n_lines_bulletpoint_start = sum(map(lambda ln: ln[2], signals["rps_lines_start_with_bulletpoint"]))
    if n_lines_bulletpoint_start / n_lines > 0.9:
        return False

    # rule 4: the ratio between characters in the most frequent 2-gram and the total number 
    # of characters must be below 0.2
    top_2_gram_frac = signals["rps_doc_frac_chars_top_2gram"][0][2]
    if top_2_gram_frac > 0.2:
        return False

    # rule 5: ...


    return True

ds = load_dataset("togethercomputer/RedPajama-Data-V2", name="sample")
filtered_dataset = list(filter(gopher_rules_pass, ds["train"]))

In the above snippet, we have used the “sample” config to load just a subset of the dataset. In case you want to load the full dataset for, e.g., snapshot 2023-14 in English, you can run:

ds_iterator = load_dataset(
    "togethercomputer/RedPajama-Data-V2", 
    partition="head_middle",
    snapshots=["2023-14"], 
    languages=["en"], 
    name="default"
)

We can also use the rules used in RedPajama-v1 or C4:

def rpv1_rules_pass(sample) -> bool:
    """ function returns True if the sample complies with the filtering rules used in RP-V1 """
    signals = json.loads(sample["quality_signals"])

    # rule 1: the wikipedia reference classifier score must be higher than 0.25
    wikiref_score = signals["rps_doc_ml_wikiref_score"][0][2]
    if wikiref_score < 0.25:
        return False

    return True

def c4_rules_pass(sample) -> bool:
    """ function returns True if the sample complies with the filtering rules used in C4 """
    signals = json.loads(sample["quality_signals"])

    # rule 1: at least 3 sentences
    num_sentences = signals["rps_doc_num_sentences"][0][2]
    if num_sentences < 3:
        return False

    # rule 2: page may not contain bad words
    n_bad_words = signals["rps_doc_ldnoobw_words"][0][2]
    if n_bad_words > 0:
        return False

    # rule 3: page may not contain placeholder "lorem ipsum" text
    lorem_ipsum = signals["rps_doc_lorem_ipsum"][0][2]
    if lorem_ipsum > 0:
        return False

    # rule 4: ...

    return True

In the current release, we include 40+ quality annotations, but we very much view this as a “living” project where new additions will be made over time as the field moves towards a better understanding of LLM training data. We hope the community provides feedback, and we are looking forward to continuing to enrich our current pool of annotations.

Data Processing Steps

RedPajama-V2 focuses on CommonCrawl. Other data sources such as Wikipedia are available in RedPajama-V1. We also encourage you to enrich your data mixture with the Stack (by BigScience) for code and s2orc (by AI2) for scientific articles. RedPajama-V2 is built from the ground up based on publicly available web data, consisting of 84 crawls provided by CommonCrawl. The core components that this dataset is made of, are the source data (plain text), 40+ quality annotations, and deduplication clusters.

Creating the Source Data

The first processing step in building this dataset is to pass each CommonCrawl snapshot through the CCNet pipeline. We choose this pipeline due to its light processing, aligning with our guiding principle of preserving as much information in the raw dataset as possible and allowing downstream model developers to filter or reweight the dataset. We use the language filter in CCNet and keep five languages in this release: English, French, Spanish, German and Italian. This processing step produces 100 billion individual text documents.

Quality Annotations

In addition to the text documents processed by CCNet, we compute over 40 of the most widely used quality annotations for the “head” and “middle” buckets. The primary purpose of these annotations is to allow downstream model developers to filter or reweight the dataset based on their criteria, and to foster research into how these annotations should be used. In addition, we also plan, with the help of the community, to include more quality signals over time. With this release, we publish a first set of quality annotations, which consists of our implementations of the most common quality annotations that are described in C4, Gopher, Pretrainer’s Guide, RefinedWeb, in addition to several signals described in other papers. These annotations fall into the following categories:

Quality signals indicating how natural a given piece of text is. This includes simple heuristic measures such as the number of sentences, the number of words, the fraction of all-caps words, among others.
Quality signals indicating how repetitive a given piece of text is. Here follow the Gopher rules (Rae et al.) and compute the fraction of characters that appear in duplicated word n-grams and the fraction of characters in the most frequent word n-gram appearing in the documents.
Content-based quality signals are comprised of signals that take the content into account such as the density of words appearing in a list of blocked words (similar to C4), or documents which come from a list of domains flagged as containing potentially harmful or otherwise offensive content.
ML-based quality signals revolve around the idea of measuring how similar a given text is to a high-quality domain. Here we use fasttext classifiers trained on various high quality domains such as Wikipedia, as well as importance weights as proposed by Xie et al.
Deduplication signals with pre-computed Minhash signatures (with 128 permutations) which can be used for fuzzy deduplication at different degrees.

Annotation Tag	Description	Category	Reference
ccnet_bucket	head, middle or tail bucket of the perplexity score	CCNet	CCNet
ccnet_language_score	score of the language identification model	CCNet	CCNet
ccnet_length	number of characters	CCNet	CCNet
ccnet_nlines	number of lines	CCNet	CCNet
ccnet_original_length	number of characters before in-document line deduplication	CCNet	CCNet
ccnet_original_nlines	number of lines before in-document line deduplication	CCNet	CCNet
ccnet_perplexity	perplexity of an LM trained on Wikipedia	CCNet	CCNet
rps_doc_books_importance	Given a bag of {1,2}-wordgram model trained on Books p, and a model trained on the source domain q, This is the logarithm of the ratio p(doc)/q(doc).	ML Heuristics	Importance Resampling (Xie et al.)
rps_doc_openwebtext_importance	Given a bag of {1,2}-wordgram model trained on OpenWebText p, and a model trained on the source domain q, this is the logarithm of the ratio p(doc)/q(doc).	ML Heuristics	Importance Resampling (Xie et al.)
rps_doc_wikipedia_importance	Given a bag of {1,2}-wordgram model trained on Wikipedia articles p, and a model trained on the source domain q, this is the logarithm of the ratio p(doc)/q(doc).	ML Heuristics	Importance Resampling (Xie et al.)
rps_doc_ml_wikiref_score	Fasttext classifier prediction for the document being a Wikipedia reference. This is the same fasttext model used in the RedPajama-1T dataset. Only applies to English data..	ML Heuristics	LLaMA, RedPajama-1T
rps_doc_ml_palm_score	Fasttext classifier prediction for the document being a Wikipedia article, OpenWebText sample or a RedPajama-V1 book. Only for English data.	ML Heuristics	PaLM, GLaM
rps_doc_ml_wikipedia_score	Fasttext classifier prediction for the document being a Wikipedia article. This is used for non-English data	ML Heuristics	-
rps_doc_curly_bracket	The ratio between the number of occurrences of '{' or '}' and the number of characters in the raw text.	Natural Language	C4

Annotation Tag	Description	Category	Reference
rps_doc_frac_all_caps_words	The fraction of words in the content that only consist of uppercase letters. This is based on the raw content.	Natural Language	Pretrainer’s Guide
rps_doc_frac_lines_end_with_ellipsis	The fraction of lines that end with an ellipsis, where an ellipsis is defined as either "..." or "…".	Natural Language	RefinedWeb,Gopher
rps_doc_frac_no_alph_words	The fraction of words that contain no alphabetical character.	Natural Language	RefinedWeb,Gopher
rps_doc_lorem_ipsum	The ratio between the number of occurrences of 'lorem ipsum' and the number of characters in the content after normalisation.	Natural Language	C4
rps_doc_mean_word_length	The mean length of words in the content after normalisation.	Natural Language	RefinedWeb,Gopher
rps_doc_stop_word_fraction	The ratio between the number of stop words and the number of words in the document. Stop words are obtained from https://github.com/6/stopwords-json.	Natural Language	RefinedWeb,Gopher
rps_doc_symbol_to_word_ratio	The ratio of symbols to words in the content.. Symbols are defined "#", "...", and "…".	Natural Language	RefinedWeb,Gopher
rps_doc_frac_unique_words	The fraction of unique words in the content. This is also known as the degeneracy of a text sample. Calculated based on the normalised content.	Natural Language	Pretrainer’s Guide
rps_doc_unigram_entropy	The entropy of the unigram distribution of the content. This measures the diversity of the content and is computed using sum(-x / total * log(x / total)) where the sum is taken over counts of unique words in the normalised content.	Natural Language	-
rps_doc_word_count	The number of words in the content after normalisation.	Natural Language	RefinedWeb,Gopher
rps_lines_ending_with_terminal_punctution_mark	Indicates whether a line ends with a terminal punctuation mark. A terminal punctuation mark is defined as one o: ".", "!", "?", "”".	Natural Language	C4
rps_lines_javascript_counts	The number of occurrences of the word "javascript" in each line.	Natural Language	C4
rps_lines_num_words	The number of words in each line. This is computed based on the normalised text.	Natural Language	C4, RefinedWeb
rps_lines_numerical_chars_fraction	The ratio between number of numerical characters and total number of characters in each line. This is based on the normalised content.	Natural Language	RefinedWeb
rps_lines_start_with_bulletpoint	Whether the lines that start with a bullet point symbol. The following set of unicodes are considered a bullet point: \u2022 (bullet point), \u2023 (triangular bullet point), \u25B6 (black right pointing triangle), \u25C0 (black left pointing triangle), \u25E6 (white bullet point), \u25A0 (black square), \u25A1 (white square), \u25AA (black small square), \u25AB (white small square), \u2013 (en dash).	Natural Language	RefinedWeb,Gopher
rps_lines_uppercase_letter_fraction	The ratio between number of uppercase letters and total number of characters in each line. This is based on the raw text.	Natural Language	RefinedWeb
rps_doc_num_sentences	The number of sentences in the content. This is calculated using the regular expression r'\b[^.!?]+[.!?]*'.	Natural Language	C4

Annotation Tag	Description	Category	Reference
rps_doc_frac_chars_dupe_10grams	The fraction of characters in duplicate word 10grams.	Repetitiveness	RefinedWeb,Gopher
rps_doc_frac_chars_dupe_5grams	The fraction of characters in duplicate word 5grams.	Repetitiveness	RefinedWeb,Gopher
rps_doc_frac_chars_dupe_6grams	The fraction of characters in duplicate word 6grams.	Repetitiveness	RefinedWeb,Gopher
rps_doc_frac_chars_dupe_7grams	The fraction of characters in duplicate word 7grams.	Repetitiveness	RefinedWeb,Gopher
rps_doc_frac_chars_dupe_8grams	The fraction of characters in duplicate word 8grams.	Repetitiveness	RefinedWeb,Gopher
rps_doc_frac_chars_dupe_9grams	The fraction of characters in duplicate word 9grams.	Repetitiveness	RefinedWeb,Gopher
rps_doc_frac_chars_top_2gram	The fraction of characters in the top word 2gram.	Repetitiveness	RefinedWeb,Gopher
rps_doc_frac_chars_top_3gram	The fraction of characters in the top word 3gram.	Repetitiveness	RefinedWeb,Gopher
rps_doc_frac_chars_top_4gram	The fraction of characters in the top word 4gram.	Repetitiveness	RefinedWeb,Gopher
rps_doc_ldnoobw_words	The number of sequences of words that are contained in the List-of-Dirty-Naughty-Obscene-and-Otherwise-Bad-Words blocklist. The blocklist is obtained from https://github.com/LDNOOBW/List-of-Dirty-Naughty-Obscene-and-Otherwise-Bad-Words	Sensitive / toxic content	C4
rps_doc_ut1_blacklist	A categorical id corresponding to the list of categories of the domain of the document. Categories are obtained from the UT1 blacklist. The list is obtained from https://dsi.ut-capitole.fr/blacklists/	Sensitive / toxic content	RefinedWeb
minhash_signature_0.7	Banded minhash signature of the document, for fuzzy deduplication at Jaccard similarity 0.7. The signature is based on 128 hash functions and grouped into 14 bands and 9 rows for LSH.	Deduplication	SlimPajama,RefinedWeb
minhash_signature_0.8	Banded minhash signature of the document, for fuzzy deduplication at Jaccard similarity 0.8. The signature is based on 128 hash functions and grouped into 9 bands and 13 rows for LSH.	Deduplication	SlimPajama,RefinedWeb
minhash_signature_0.9	Banded minhash signature of the document, for fuzzy deduplication at Jaccard similarity 0.9. The signature is based on 128 hash functions and grouped into 5 bands and 25 rows for LSH..	Deduplication	SlimPajama,RefinedWeb
minhash_signature_1.0	Banded minhash signature of the document, for fuzzy deduplication at Jaccard similarity 1.0. The signature is based on 128 hash functions and grouped into 1 band and 128 rows for LSH.	Deduplication	SlimPajama,RefinedWeb

In addition to these minhash signatures, we conduct exact deduplication with a Bloom filter over the sha1 hash-digest of the document. These are stored as a separate quality annotation file to allow the original non-duplicated distribution to be recovered to facilitate research in this direction.

Dataset Statistics

RedPajama-v2 processed 84 CommonCrawl crawls and consists of 113B documents in the five languages (English, German, French, Spanish, and Italian). While we keep the tail partition of the resulting data, consisting of an estimated 80B documents, we also compute the number of documents and tokens for the head and middle partitions (before and after deduplication). Interestingly, while this reduces the token count by 60%, the number of documents decreases disproportionately more by 71%, indicating that the tail documents are generally shorter.

Partition	# Documents	Estimated Token Count
head + middle + tail	113.3B	123.7T
head + middle	32.8B	50.7T
head + middle (deduplicated)	20.8B	30.4T

We further deduplicated the head+middle documents using a Bloom filter, which leads to a reduction in the dataset size by roughly 40%. In the following figure, we show the development of the number of documents in the head+middle partition, as a function of the point in time of the crawl. What stands out here is that there is a relatively stable number until 2018, and a significantly smaller number of documents between 2014 and 2016 (up to 10x for, e.g., German). It is also worth noting how the number of unique documents over time develops. Specifically, since we ran the deduplication from the newest snapshot to the oldest, one expects an increasingly smaller number of unique documents in the corpus, which can be observed from the figure below (note the log-scale). However, it is worth pointing out the sudden drop in unique documents occurring for the crawls between 2014 and 2017. We believe that this can be explained from a different list of seeds used by the CommonCrawl web crawler during that period.

In the next figure, we show the distribution of the number of tokens per document, for the tail and the head+middle partitions. With a median per-document token count of 380, the tail documents are considerably shorter than the head+middle documents where the median is 741.

While the raw documents provide the basis for the RedPajama-V2 corpus, a further central component are the quality signals which we have computed for all documents in the head+middle partition. In the figure below, we show the distribution of the quality signals computed for documents from the 2023-06 snapshot.

Histograms of CCNet quality Signals for English documents from the 2023-06 snapshot.

Histograms of ML heuristics quality Signals for English documents from the 2023-06 snapshot.

Histograms of natural language quality Signals for English documents from the 2023-06 snapshot.

Histograms of quality signals measuring the repetitiveness of a text document. For English documents from the 2023-06 snapshot.

Dataset Structure

The core of the dataset is composed of the text documents, accompanied by the quality annotations and deduplication clusters. The structure largely follows the one defined by CCNet. Specifically, the documents for a given CommonCrawl snapshot (say, e.g., 2018-43) are partitioned into 5k shards where the key indicates the shard, language of the document, and the perplexity bucket (partition). The quality annotations and duplicates follow the same logic and “mirror” the source filenames:

The document files are left untouched and correspond 1-to-1 to the CCNet output, including the metadata fields. The quality signals, on the other hand, include document ids, metadata, and the quality signals themselves:

{
  "id": "2018-43/0000/en_head.json.gz/0", 
  "id_int": 7972430436813205988, 
  "metadata":{
    "cc_segment": "crawl-data/...",
    "cc_net_source": "2018-43/0000/en_head.json.gz",
    "url": "...",
    "source_domain": "...",
    "language": "en",
    "snapshot_id": "2018-43"
  },
  "quality_signals": {
    "ccnet_original_length": [[0, 7033, 8711.0]],
    "...": "...",
    "rps_doc_stop_word_fraction": [[0, 7033, 0.45121107]],
    "rps_lines_num_words": [[0, 25, 2], ..., [6980, 7033, 10]]
  }
}

Since we have quality signals that can characterise the quality on a line level (e.g., whether a line ends in a terminal punctuation mark), or on a document level we choose the logic used by Dolma, allowing for a unified representation of different types of signals. Specifically, each score corresponds to an array of tuples `(start, end, score)` where start and end correspond to the span in the document string where the score “applies”.

A “Living” Dataset

We envision the release of this dataset to be the start of a larger, community-driven development of large-scale datasets for LLMs. Along the data axis, we hope to continuously grow this pool and enrich it with additional domains and new snapshots over time. Along the data quality side, we view the current set of quality signals as an initial base set of signals that we hope to grow with new additions. In that sense, RedPajama-v2 should be seen as a pool that grows over time as the community learns more about harnessing the data for training performant language models. In the future, we plan to add more quality annotations such as: Contamination annotations against popular LLM benchmarks, topic modelling and classification annotations for each document, and other annotations that the community is excited about!

Model Building at Together

Together is building open models based on RedPajama-Dataset-V2, and we also help companies and organizations build custom models built with principled mixes of open and their proprietary datasets. If you are evaluating solutions to build models, please contact us here.

Acknowledgments

We are appreciative to so many partners and collaborators that together are pushing forward the frontier of open LLM models.

Thank you to the OLMo team at AI2 and friends at OpenGPT-X for the insightful discussions about datasets and data quality! Also for everyone who builds on the RedPajama dataset, including Cerebras for their SlimPajama efforts, and the over 500 models built on RedPajama to date by the open-source AI community.
We are grateful to the great team at EleutherAI for paving the path on open training datasets with The Pile and for open-sourcing code we use in training some of the RedPajama models.
Thank you to our partners of RedPajama-v1, including Ontocord.ai, MILA Québec AI Institute, ETH DS3Lab, Université de Montréal, Stanford Center for Research on Foundation Models (CRFM), Stanford Hazy Research research group and LAION.

Lower
Cost
20%
faster
training
4x
network
compression
117x

Q: Should I use the RedPajama-V2 Dataset out of the box?

RedPajama-V2 is conceptualized as a pool of data that serves as a foundation for creating high quality datasets. The dataset is thus not intended to be used out of the box and, depending on the application, data should be filtered out using the quality signals that accompany the data. With this dataset, we take the view that the optimal filtering of data is dependent on the intended use. Our goal is to provide all the signals and tooling that enables this.