How to evaluate your RAG Part 1: Synthetic Data

 
Last modified: 2024/08/13 | Estimated Reading Time: 25 min | Author: More Zhou

Evaluating a RAG system is a relatively complex task. In this article, I will start with how to build a synthetic dataset and share my own experiences along the way.
This article will cover the following topics:
  • How to build a synthetic dataset from plain text data sources to evaluate your RAG, and how to use an LLM as a critique agent to refine it and improve its quality.
  • How to extend the idea to multimodal scenarios, such as PDF, DOCX, and PPTX files that contain text, tables, and images.
  • Potential risks and drawbacks of these datasets.
As is well known, the evaluation of a modern RAG system is usually split into two parts: retrieval evaluation and response evaluation.
In this article, we'll focus on retrieval.
Of course, there are already many existing tools for evaluating RAG, such as ragas and rageval. However, this article will not use them; instead, it will use some basic tools to convey the overall idea.

Build from plain text (Markdown files)

First, let's examine how to construct a synthetic dataset from plain text and use it to evaluate the retrieval component of the RAG. The flowchart below shows this part of the process; I will walk through each step in order.
[Figure: flowchart of the plain-text synthetic dataset pipeline]

Split the documents

The data source can be a collection of one or more file types, such as the OpenAI Cookbook or the Azure documentation. However, since we are currently only discussing plain-text scenarios, this refers to Markdown files for the moment. So first, let's take the OpenAI Cookbook as an example and filter out all the Markdown files from it.
from langchain_community.document_loaders import GitLoader

loader = GitLoader(
    clone_url="https://github.com/openai/openai-cookbook",
    repo_path="openai-cookbook_data_github/",
    branch="main",
    file_filter=lambda file_path: file_path.endswith(".md"),  # Only get the markdown files
)
docs = loader.load()
docs[:3]
 
This loader packages each qualifying file into a LangChain Document. If we print one out, we find that it contains not only the plain-text content of the original document but also relevant metadata such as the filename, directory, file type, and so on.
 
[Document(metadata={'source': 'CONTRIBUTING.md', 'file_path': 'CONTRIBUTING.md', 'file_name': 'CONTRIBUTING.md', 'file_type': '.md'}, page_content="# Welcome, AI Chef\n\nThe OpenAI Cookbook is a community-driven resource aimed at sharing knowledge in a way that is accessible, engaging, and enriching for all AI builders.\n\nBefore contributing, read through the existing issues and pull requests to see if someone else is already working on something similar. That way you can avoid duplicating efforts.\n\n## What makes a good contribution?\n\nGenerally, we have found that the best contributions to the Cookbook are **useful**, **novel** or **creative**, or a combination of these.\n\n- **Useful:** Involves concepts or techniques that can be applied broadly and often, and can translate to practical use-cases and solving real-world problems. If you're doing something often, chances are others are too, and having reusable examples to reference can be very helpful.\n- **Novel:** Showcases new developments or techniques. Look out for new research on how to best use LLMs, or new models and capabilities in the API.\n- **Creative:** Uses LLMs in creative and innovative ways, or combines multiple APIs and tools in novel ways.\n\nAdditionally, we strive to maintain a **neutral** tone, and aim for **high quality** writing.\n\n- **Neutral:** Maintains a neutral stance on tools and products. While it's natural to have preferences for particular tools, a good guide avoids over-evangelizing or marketing specific products, ensuring integrity and inclusivity.\n- **High quality:** Well structured, clear and complete. Writing good content ensures others can fully benefit from it. See the rubric below for more details on how we assess the quality of submissions to the Cookbook.\n\n## Rubric\n\nTo ensure the quality of submissions, we have established a rubric that assesses each contribution on various areas. The purpose of this rating system is to maintain a high standard of quality, relevance, and uniqueness. Each area is rated on a scale from 1 to 4. Contributions that score lower than a 3 in any of the areas will generally be rejected.\n\nWe encourage contributors to familiarize themselves with this rubric before writing content. Understanding the criteria not only increases the chances of your contribution being accepted, but also helps in creating a resource that is comprehensive, clear, and beneficial for all users.\n\nFor additional advice on writing good documentation, refer to [What Makes Documentation Good](https://cookbook.openai.com/what_makes_documentation_good).\n\n| Criteria | Description | Score |\n| ------------ | --------------------------------------------------------------------------------------------------- | ----- |\n| Relevance | Is the content related to building with OpenAI technologies? Is it useful to others? | |\n| Uniqueness | Does the content offer new insights or unique information compared to existing documentation? | |\n| Clarity | Is the language easy to understand? Are things well-explained? Is the title clear? | |\n| Correctness | Are the facts, code snippets, and examples correct and reliable? Does everything execute correctly? | |\n| Conciseness | Is the content concise? Are all details necessary? Can it be made shorter? | |\n| Completeness | Is the content thorough and detailed? Are there things that weren’t explained fully? | |\n| Grammar | Are there grammatical or spelling errors present? 
| |\n\n### Breakdown\n\n| Criteria | 4 | 3 | 2 | 1 |\n| ------------ | --------------------------------------------- | ----------------------------------------- | --------------------------------------------- | ------------------------------------------ |\n| Relevance | Relevant and useful. | Relevant but not very useful. | Tangentially relevant. | Not relevant. |\n| Uniqueness | Completely unique with fresh insights. | Unique with minor overlaps. | Some unique aspects, but significant overlap. | Many similar guides/examples. |\n| Clarity | Clear language and structure. | Clear language, unclear structure. | Some sections unclear. | Confusing and unclear. |\n| Correctness | Completely error free. | Code works, minor improvements needed. | Few errors and warnings. | Many errors, code doesn't execute. |\n| Conciseness | Cannot be reduced in any section, or overall. | Mostly short, but could still be reduced. | Some long sections, and/or long overall. | Very long sections and overall, redundant. |\n| Completeness | Complete and detailed. | Mostly complete, minor additions needed. | Lacks some explanations. | Missing significant portions. |\n| Grammar | Perfect grammar. | Correct grammar, few typos. | Some spelling/grammatical errors. | Numerous spelling/grammatical errors. |\n"), Document(metadata={'source': 'README.md', 'file_path': 'README.md', 'file_name': 'README.md', 'file_type': '.md'}, page_content='<a href="https://cookbook.openai.com" target="_blank">\n <picture>\n <source media="(prefers-color-scheme: dark)" srcset="/images/openai-cookbook-white.png" style="max-width: 100%; width: 400px; margin-bottom: 20px">\n <img alt="OpenAI Cookbook Logo" src="/images/openai-cookbook.png" width="400px">\n </picture>\n</a>\n\n<h3></h3>\n \n> ✨ Navigate at [cookbook.openai.com](https://cookbook.openai.com)\n\nExample code and guides for accomplishing common tasks with the [OpenAI API](https://platform.openai.com/docs/introduction). To run these examples, you\'ll need an OpenAI account and associated API key ([create a free account here](https://beta.openai.com/signup)). Set an environment variable called `OPENAI_API_KEY` with your API key. Alternatively, in most IDEs such as Visual Studio Code, you can create an `.env` file at the root of your repo containing `OPENAI_API_KEY=<your API key>`, which will be picked up by the notebooks.\n\nMost code examples are written in Python, though the concepts can be applied in any language.\n\nFor other useful tools, guides and courses, check out these [related resources from around the web](https://cookbook.openai.com/related_resources).\n\n## Contributing\n\nThe OpenAI Cookbook is a community-driven resource. Whether you\'re submitting an idea, fixing a typo, adding a new guide, or improving an existing one, your contributions are greatly appreciated!\n\nBefore contributing, read through the existing issues and pull requests to see if someone else is already working on something similar. That way you can avoid duplicating efforts.\n\nIf there are examples or guides you\'d like to see, feel free to suggest them on the [issues page](https://github.com/openai/openai-cookbook/issues).\n\nIf you\'d like to contribute new content, make sure to read through our [contribution guidelines](/CONTRIBUTING.md). 
We welcome high-quality submissions of new examples and guides, as long as they meet our criteria and fit within the scope of the cookbook.\n\nThe contents of this repo are automatically rendered into [cookbook.openai.com](https://cookbook.openai.com) based on [registry.yaml](/registry.yaml).\n\n[![Open in GitHub Codespaces](https://github.com/codespaces/badge.svg)](https://github.com/codespaces/new?hide_repo_select=true&ref=main&repo=468576060&machine=basicLinux32gb&location=EastUs)\n'), Document(metadata={'source': '.github/pull_request_template.md', 'file_path': '.github/pull_request_template.md', 'file_name': 'pull_request_template.md', 'file_type': '.md'}, page_content='## Summary\n\nBriefly describe the changes and the goal of this PR. Make sure the PR title summarizes the changes effectively.\n\n## Motivation\n\nWhy are these changes necessary? How do they improve the cookbook?\n\n---\n\n## For new content\n\nWhen contributing new content, read through our [contribution guidelines](https://github.com/openai/openai-cookbook/blob/main/CONTRIBUTING.md), and mark the following action items as completed:\n\n- [ ] I have added a new entry in [registry.yaml](https://github.com/openai/openai-cookbook/blob/main/registry.yaml) (and, optionally, in [authors.yaml](https://github.com/openai/openai-cookbook/blob/main/authors.yaml)) so that my content renders on the cookbook website.\n- [ ] I have conducted a self-review of my content based on the [contribution guidelines](https://github.com/openai/openai-cookbook/blob/main/CONTRIBUTING.md#rubric):\n - [ ] Relevance: This content is related to building with OpenAI technologies and is useful to others.\n - [ ] Uniqueness: I have searched for related examples in the OpenAI Cookbook, and verified that my content offers new insights or unique information compared to existing documentation.\n - [ ] Spelling and Grammar: I have checked for spelling or grammatical mistakes.\n - [ ] Clarity: I have done a final read-through and verified that my submission is well-organized and easy to understand.\n - [ ] Correctness: The information I include is correct and all of my code executes successfully.\n - [ ] Completeness: I have explained everything fully, including all necessary references and citations.\n\nWe will rate each of these areas on a scale from 1 to 4, and will only accept contributions that score 3 or higher on all areas. Refer to our [contribution guidelines](https://github.com/openai/openai-cookbook/blob/main/CONTRIBUTING.md) for more details.\n')]
 
The next step is to provide these files to the Document Parser service, which generates the Parsed Content. What form this content takes depends primarily on the input formats the LLM can accept.
In the example above, since the data source consists of Markdown files, the Document Parser does not need to perform extensive processing and can read the files directly. However, not all data types are this straightforward, as will be discussed later. Thus, in the current example, the Parsed Content can simply be considered the content of the Markdown files.
The next step involves passing the content to the Document Splitter. The purpose of this step is to divide the file content into smaller chunks. This can be achieved using a straightforward text splitting method.
Here, I used the RecursiveCharacterTextSplitter as the splitter, but there are actually more options available.
from langchain.text_splitter import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
    chunk_size=3000,
    chunk_overlap=300,
    add_start_index=True,
    separators=["\n\n", "\n", ".", " ", "", "\n\n\n"],
)

docs_processed = []
for doc in docs[:3]:
    docs_processed += splitter.split_documents([doc])
docs_processed
 
[Document(metadata={'source': 'CONTRIBUTING.md', 'file_path': 'CONTRIBUTING.md', 'file_name': 'CONTRIBUTING.md', 'file_type': '.md', 'start_index': 0}, page_content="# Welcome, AI Chef\n\nThe OpenAI Cookbook is a community-driven resource aimed at sharing knowledge in a way that is accessible, engaging, and enriching for all AI builders.\n\nBefore contributing, read through the existing issues and pull requests to see if someone else is already working on something similar. That way you can avoid duplicating efforts.\n\n## What makes a good contribution?\n\nGenerally, we have found that the best contributions to the Cookbook are **useful**, **novel** or **creative**, or a combination of these.\n\n- **Useful:** Involves concepts or techniques that can be applied broadly and often, and can translate to practical use-cases and solving real-world problems. If you're doing something often, chances are others are too, and having reusable examples to reference can be very helpful.\n- **Novel:** Showcases new developments or techniques. Look out for new research on how to best use LLMs, or new models and capabilities in the API.\n- **Creative:** Uses LLMs in creative and innovative ways, or combines multiple APIs and tools in novel ways.\n\nAdditionally, we strive to maintain a **neutral** tone, and aim for **high quality** writing."), Document(metadata={'source': 'CONTRIBUTING.md', 'file_path': 'CONTRIBUTING.md', 'file_name': 'CONTRIBUTING.md', 'file_type': '.md', 'start_index': 1073}, page_content="Additionally, we strive to maintain a **neutral** tone, and aim for **high quality** writing.\n\n- **Neutral:** Maintains a neutral stance on tools and products. While it's natural to have preferences for particular tools, a good guide avoids over-evangelizing or marketing specific products, ensuring integrity and inclusivity.\n- **High quality:** Well structured, clear and complete. Writing good content ensures others can fully benefit from it. See the rubric below for more details on how we assess the quality of submissions to the Cookbook.\n\n## Rubric\n\nTo ensure the quality of submissions, we have established a rubric that assesses each contribution on various areas. The purpose of this rating system is to maintain a high standard of quality, relevance, and uniqueness. Each area is rated on a scale from 1 to 4. Contributions that score lower than a 3 in any of the areas will generally be rejected.\n\nWe encourage contributors to familiarize themselves with this rubric before writing content. Understanding the criteria not only increases the chances of your contribution being accepted, but also helps in creating a resource that is comprehensive, clear, and beneficial for all users.\n\nFor additional advice on writing good documentation, refer to [What Makes Documentation Good](https://cookbook.openai.com/what_makes_documentation_good)."), Document(metadata={'source': 'CONTRIBUTING.md', 'file_path': 'CONTRIBUTING.md', 'file_name': 'CONTRIBUTING.md', 'file_type': '.md', 'start_index': 2426}, page_content='| Criteria | Description | Score |\n| ------------ | --------------------------------------------------------------------------------------------------- | ----- |\n| Relevance | Is the content related to building with OpenAI technologies? Is it useful to others? | |\n| Uniqueness | Does the content offer new insights or unique information compared to existing documentation? | |\n| Clarity | Is the language easy to understand? Are things well-explained? Is the title clear? 
| |\n| Correctness | Are the facts, code snippets, and examples correct and reliable? Does everything execute correctly? | |\n| Conciseness | Is the content concise? Are all details necessary? Can it be made shorter? | |\n| Completeness | Is the content thorough and detailed? Are there things that weren’t explained fully? | |\n| Grammar | Are there grammatical or spelling errors present? | |\n\n### Breakdown')]
 
Although splitting the content into smaller pieces here resembles chunking, it is not the same as the chunking used for regular RAG indexing. Think of this step as using a simple splitting method to divide the content into relatively short passages; sometimes you can skip the splitting entirely.

Build the dataset with LLM

The next step is critical. Each text segment produced by the splitter will be sent to the LLM. The LLM, with its advanced capabilities, analyzes the content of each segment and generates a set of questions and answers based on that content. Additionally, it provides the evidence for each answer.
Specifically, let’s first construct a set of prompts to transform the LLM into an agent capable of reading context and performing self-questioning and answering. The function below constructs a prompt that lets the LLM generate a question and answer based on a given piece of context, along with the supporting evidence from that context.
 
def build_qa_prompting_msg(context):
    system_message = {
        "role": "system",
        "content": "You are an assistant who is very good at written work and is a logical thinker.",
    }
    QA_generation_prompt = f"""
Your task is to write a factoid question and an answer given a context.
Your factoid question should be answerable with a specific, concise piece of factual information from the context.
Your factoid question should be formulated in the same style as questions users could ask in a search engine.
You need to provide the specific evidence, which should be an entire paragraph or a whole sentence in context.
Please note that your evidence MUST NOT mention something like "according to the context" or "the title indicates that".

Provide your answer as follows:

Output:::
Factoid question: (your factoid question)
Answer: (your answer to the factoid question)
Evidence: (the evidence sentence from the context that supports the answer)

Now here is the context.

Context: {context}\n
Output:::"""
    user_message = {
        "role": "user",
        "content": QA_generation_prompt,
    }
    # Return a plain list of messages so it can be passed directly to the chat completions API.
    return [system_message, user_message]
 
Consequently, after this step, we obtain a set of QA pairs, which include the following things: source_doc, context, question, answer and evidence.
 
# "client" is assumed to be an initialized OpenAI-compatible chat client, e.g. client = OpenAI()
dataset = []
for d in docs_processed[:3]:
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=build_qa_prompting_msg(context=d.page_content),
        temperature=0.0,
    )
    response_text = response.choices[0].message.content
    question = response_text.split("Factoid question: ")[-1].split("Answer: ")[0]
    answer = response_text.split("Answer: ")[-1].split("Evidence: ")[0]
    evidence = response_text.split("Evidence: ")[-1]
    dataset.append(
        {
            "source_doc": d.metadata["source"],
            "context": d.page_content,
            "question": question,
            "answer": answer,
            "evidence": evidence,
        }
    )
dataset
 
Let's check what we got from the LLM.
 
[{'source_doc': 'CONTRIBUTING.md', 'context': "# Welcome, AI Chef\n\nThe OpenAI Cookbook is a community-driven resource aimed at sharing knowledge in a way that is accessible, engaging, and enriching for all AI builders.\n\nBefore contributing, read through the existing issues and pull requests to see if someone else is already working on something similar. That way you can avoid duplicating efforts.\n\n## What makes a good contribution?\n\nGenerally, we have found that the best contributions to the Cookbook are **useful**, **novel** or **creative**, or a combination of these.\n\n- **Useful:** Involves concepts or techniques that can be applied broadly and often, and can translate to practical use-cases and solving real-world problems. If you're doing something often, chances are others are too, and having reusable examples to reference can be very helpful.\n- **Novel:** Showcases new developments or techniques. Look out for new research on how to best use LLMs, or new models and capabilities in the API.\n- **Creative:** Uses LLMs in creative and innovative ways, or combines multiple APIs and tools in novel ways.\n\nAdditionally, we strive to maintain a **neutral** tone, and aim for **high quality** writing.", 'question': 'What are the three characteristics of the best contributions to the OpenAI Cookbook?\n\n', 'answer': 'Useful, novel, or creative.\n\n', 'evidence': '"Generally, we have found that the best contributions to the Cookbook are **useful**, **novel** or **creative**, or a combination of these."'}, {'source_doc': 'CONTRIBUTING.md', 'context': "Additionally, we strive to maintain a **neutral** tone, and aim for **high quality** writing.\n\n- **Neutral:** Maintains a neutral stance on tools and products. While it's natural to have preferences for particular tools, a good guide avoids over-evangelizing or marketing specific products, ensuring integrity and inclusivity.\n- **High quality:** Well structured, clear and complete. Writing good content ensures others can fully benefit from it. See the rubric below for more details on how we assess the quality of submissions to the Cookbook.\n\n## Rubric\n\nTo ensure the quality of submissions, we have established a rubric that assesses each contribution on various areas. The purpose of this rating system is to maintain a high standard of quality, relevance, and uniqueness. Each area is rated on a scale from 1 to 4. Contributions that score lower than a 3 in any of the areas will generally be rejected.\n\nWe encourage contributors to familiarize themselves with this rubric before writing content. Understanding the criteria not only increases the chances of your contribution being accepted, but also helps in creating a resource that is comprehensive, clear, and beneficial for all users.\n\nFor additional advice on writing good documentation, refer to [What Makes Documentation Good](https://cookbook.openai.com/what_makes_documentation_good).", 'question': 'What is the minimum score required in each area of the rubric for a contribution to be generally accepted?\n\n', 'answer': '3\n\n', 'evidence': '"Each area is rated on a scale from 1 to 4. Contributions that score lower than a 3 in any of the areas will generally be rejected."'}, {'source_doc': 'CONTRIBUTING.md', 'context': '| Criteria | Description | Score |\n| ------------ | --------------------------------------------------------------------------------------------------- | ----- |\n| Relevance | Is the content related to building with OpenAI technologies? Is it useful to others? 
| |\n| Uniqueness | Does the content offer new insights or unique information compared to existing documentation? | |\n| Clarity | Is the language easy to understand? Are things well-explained? Is the title clear? | |\n| Correctness | Are the facts, code snippets, and examples correct and reliable? Does everything execute correctly? | |\n| Conciseness | Is the content concise? Are all details necessary? Can it be made shorter? | |\n| Completeness | Is the content thorough and detailed? Are there things that weren’t explained fully? | |\n| Grammar | Are there grammatical or spelling errors present? | |\n\n### Breakdown', 'question': 'What is the criterion for evaluating the correctness of content related to building with OpenAI technologies?\n\n', 'answer': 'The criterion for evaluating correctness is whether the facts, code snippets, and examples are correct and reliable, and if everything executes correctly.\n\n', 'evidence': '"Correctness | Are the facts, code snippets, and examples correct and reliable? Does everything execute correctly?"'}]
 
As expected, the content generated by the LLM is indeed what we wanted.
  • source_doc: The filename of the file from which this segment of text was originally extracted.
  • context: The content that was previously segmented from the file; the LLM generates the question and answer based on it.
  • question and answer: The question and answer generated for us by the LLM.
  • evidence: The basis for the LLM's answer; it comes from a specific sentence within the context.
Up to this point, although the data format meets the requirements, the content quality is not yet guaranteed. We need to go a step further and use the LLM to evaluate it.

Refine the dataset with LLM critique agents

Following the approach in the work ReST meets ReAct, each sample in the dataset can be evaluated along three aspects.
 
  • Groundedness: This represents whether the context clearly supports an answer to the question. Even though the LLM generated the question from the context, hallucination may mean the context does not actually contain a clear answer.
  • Relevance: This represents whether the question is truly meaningful in the target business scenario. Consider who would search for information in the OpenAI Cookbook: the likely users are machine learning enthusiasts or AI application developers looking up how to use the OpenAI LLMs. For example, the second question in the previous example, “What is the minimum score required in each area of the rubric for a contribution to be generally accepted?” is not very meaningful; it has nothing to do with AI, ML, or OpenAI, so no LLM application developer would ask it. Even if the question and answer are correct, we don’t need it.
  • Standalone: This represents whether answering the question truly requires the given context. For example, even if provided with an introduction to Microsoft as context and the question “Who is the founder of Microsoft?” along with the corresponding answer, the question is meaningless because the LLM would have already learned this from Wikipedia or tons of webpages during its pre-training phase.
Finally, when constructing this LLM agent, we need to ensure that the LLM not only provides scores during evaluation but also gives specific reasons for its judgments.
 
def build_groundedness_critique_prompting_msg(question, context):
    system_message = {
        "role": "system",
        "content": "You are an assistant who is very good at written work and is a logical thinker.",
    }
    question_groundedness_critique_prompt = f"""
You will be given a context and a question.
Your task is to provide a 'total rating' scoring how well one can answer the given question unambiguously with the given context.
Give your answer on a scale of 1 to 5, where 1 means that the question is not answerable at all given the context, and 5 means that the question is clearly and unambiguously answerable with the context.

Provide your answer as follows:

Answer:::
Evaluation: (your rationale for the rating, as a text)
Total rating: (your rating, as a number between 1 and 5)

You MUST provide values for 'Evaluation:' and 'Total rating:' in your answer.

Now here are the question and context.

Question: {question}\n
Context: {context}\n
Answer::: """
    user_message = {
        "role": "user",
        "content": question_groundedness_critique_prompt,
    }
    return [system_message, user_message]


def build_relevance_critique_prompting_msg(question):
    system_message = {
        "role": "system",
        "content": "You are an assistant who is very good at written work and is a logical thinker.",
    }
    question_relevance_critique_prompt = f"""
You will be given a question.
Your task is to provide a 'total rating' representing how useful this question can be to AI application developers working with Large Language Models and OpenAI services.
Give your answer on a scale of 1 to 5, where 1 means that the question is not useful at all, and 5 means that the question is extremely useful.

Provide your answer as follows:

Answer:::
Evaluation: (your rationale for the rating, as a text)
Total rating: (your rating, as a number between 1 and 5)

You MUST provide values for 'Evaluation:' and 'Total rating:' in your answer.

Now here is the question.

Question: {question}\n
Answer::: """
    user_message = {
        "role": "user",
        "content": question_relevance_critique_prompt,
    }
    return [system_message, user_message]


def build_standalone_critique_prompting_msg(question):
    system_message = {
        "role": "system",
        "content": "You are an assistant who is very good at written work and is a logical thinker.",
    }
    question_standalone_critique_prompt = f"""
You will be given a question.
Your task is to provide a 'total rating' representing how context-independent this question is.
Give your answer on a scale of 1 to 5, where 1 means that the question depends on additional information to be understood, and 5 means that the question makes sense by itself.
For instance, if the question refers to a particular setting, like 'in the context' or 'in the document', the rating must be 1.
The questions can contain obscure technical nouns or acronyms like Gradio, Hub, Hugging Face or Space and still be a 5: it must simply be clear to an operator with access to documentation what the question is about.
For instance, "What is the name of the checkpoint from which the ViT model is imported?" should receive a 1, since there is an implicit mention of a context, thus the question is not independent from the context.

Provide your answer as follows:

Answer:::
Evaluation: (your rationale for the rating, as a text)
Total rating: (your rating, as a number between 1 and 5)

You MUST provide values for 'Evaluation:' and 'Total rating:' in your answer.

Now here is the question.

Question: {question}\n
Answer::: """
    user_message = {
        "role": "user",
        "content": question_standalone_critique_prompt,
    }
    return [system_message, user_message]
 
Let's take a look at how this LLM agent would score the dataset we generated earlier.
 
for sample in dataset[:3]:
    groundedness_response = client.chat.completions.create(
        model="gpt-4o",
        messages=build_groundedness_critique_prompting_msg(
            context=sample.get("context"),
            question=sample.get("question"),
        ),
        temperature=0.0,
    )
    relevance_response = client.chat.completions.create(
        model="gpt-4o",
        messages=build_relevance_critique_prompting_msg(question=sample.get("question")),
        temperature=0.0,
    )
    standalone_response = client.chat.completions.create(
        model="gpt-4o",
        messages=build_standalone_critique_prompting_msg(question=sample.get("question")),
        temperature=0.0,
    )
    evaluations = {
        "groundedness": groundedness_response.choices[0].message.content,
        "relevance": relevance_response.choices[0].message.content,
        "standalone": standalone_response.choices[0].message.content,
    }
    try:
        for criterion, evaluation in evaluations.items():
            score = int(evaluation.split("Total rating: ")[-1].strip())
            eval_text = evaluation.split("Total rating: ")[-2].split("Evaluation: ")[1]
            sample.update(
                {
                    f"{criterion}_score": score,
                    f"{criterion}_eval": eval_text,
                }
            )
    except Exception as e:
        continue
dataset[:3]
 
[{'source_doc': 'CONTRIBUTING.md', 'context': "# Welcome, AI Chef\n\nThe OpenAI Cookbook is a community-driven resource aimed at sharing knowledge in a way that is accessible, engaging, and enriching for all AI builders.\n\nBefore contributing, read through the existing issues and pull requests to see if someone else is already working on something similar. That way you can avoid duplicating efforts.\n\n## What makes a good contribution?\n\nGenerally, we have found that the best contributions to the Cookbook are **useful**, **novel** or **creative**, or a combination of these.\n\n- **Useful:** Involves concepts or techniques that can be applied broadly and often, and can translate to practical use-cases and solving real-world problems. If you're doing something often, chances are others are too, and having reusable examples to reference can be very helpful.\n- **Novel:** Showcases new developments or techniques. Look out for new research on how to best use LLMs, or new models and capabilities in the API.\n- **Creative:** Uses LLMs in creative and innovative ways, or combines multiple APIs and tools in novel ways.\n\nAdditionally, we strive to maintain a **neutral** tone, and aim for **high quality** writing.", 'question': 'What are the three characteristics of the best contributions to the OpenAI Cookbook?\n\n', 'answer': 'Useful, novel, or creative.\n\n', 'evidence': '"Generally, we have found that the best contributions to the Cookbook are **useful**, **novel** or **creative**, or a combination of these."', 'groundedness_score': 5, 'groundedness_eval': 'The context clearly outlines the three characteristics of the best contributions to the OpenAI Cookbook: useful, novel, and creative. Each characteristic is defined and explained in detail, making it straightforward to answer the question unambiguously.\n', 'relevance_score': 4, 'relevance_eval': 'This question is quite useful for AI application developers and those working with Large Language Models and OpenAI services. Understanding the characteristics of the best contributions to the OpenAI Cookbook can help developers create high-quality, valuable content that can benefit the community. It encourages best practices, fosters collaboration, and ensures that contributions are aligned with the needs and standards of the community.\n\n', 'standalone_score': 3, 'standalone_eval': 'The question "What are the three characteristics of the best contributions to the OpenAI Cookbook?" is somewhat context-dependent. It assumes the reader knows what the OpenAI Cookbook is and what constitutes a "contribution" to it. However, it does not refer to a specific document or setting, and someone familiar with OpenAI and its resources could reasonably understand and answer the question based on general knowledge or available documentation.\n\n'}, {'source_doc': 'CONTRIBUTING.md', 'context': "Additionally, we strive to maintain a **neutral** tone, and aim for **high quality** writing.\n\n- **Neutral:** Maintains a neutral stance on tools and products. While it's natural to have preferences for particular tools, a good guide avoids over-evangelizing or marketing specific products, ensuring integrity and inclusivity.\n- **High quality:** Well structured, clear and complete. Writing good content ensures others can fully benefit from it. 
See the rubric below for more details on how we assess the quality of submissions to the Cookbook.\n\n## Rubric\n\nTo ensure the quality of submissions, we have established a rubric that assesses each contribution on various areas. The purpose of this rating system is to maintain a high standard of quality, relevance, and uniqueness. Each area is rated on a scale from 1 to 4. Contributions that score lower than a 3 in any of the areas will generally be rejected.\n\nWe encourage contributors to familiarize themselves with this rubric before writing content. Understanding the criteria not only increases the chances of your contribution being accepted, but also helps in creating a resource that is comprehensive, clear, and beneficial for all users.\n\nFor additional advice on writing good documentation, refer to [What Makes Documentation Good](https://cookbook.openai.com/what_makes_documentation_good).", 'question': 'What is the minimum score required in each area of the rubric for a contribution to be generally accepted?\n\n', 'answer': '3\n\n', 'evidence': '"Each area is rated on a scale from 1 to 4. Contributions that score lower than a 3 in any of the areas will generally be rejected."', 'groundedness_score': 5, 'groundedness_eval': 'The context clearly states that each area of the rubric is rated on a scale from 1 to 4 and that contributions scoring lower than a 3 in any area will generally be rejected. This directly answers the question about the minimum score required in each area for a contribution to be generally accepted.\n', 'relevance_score': 2, 'relevance_eval': 'This question is quite specific and context-dependent, likely pertaining to a particular rubric used in a specific setting (e.g., academic, project evaluation, etc.). While it may be useful for understanding the criteria for acceptance in that specific context, it does not directly relate to AI application development or working with Large Language Models and OpenAI services. It lacks broader applicability to the field of AI and does not provide insights into AI development, deployment, or optimization.\n\n', 'standalone_score': 1, 'standalone_eval': 'This question refers to a specific rubric, which is not provided in the question itself. The term "each area of the rubric" implies that there is a predefined set of criteria or categories that need to be known to understand the question fully. Therefore, the question depends on additional information to be understood.\n\n'}, {'source_doc': 'CONTRIBUTING.md', 'context': '| Criteria | Description | Score |\n| ------------ | --------------------------------------------------------------------------------------------------- | ----- |\n| Relevance | Is the content related to building with OpenAI technologies? Is it useful to others? | |\n| Uniqueness | Does the content offer new insights or unique information compared to existing documentation? | |\n| Clarity | Is the language easy to understand? Are things well-explained? Is the title clear? | |\n| Correctness | Are the facts, code snippets, and examples correct and reliable? Does everything execute correctly? | |\n| Conciseness | Is the content concise? Are all details necessary? Can it be made shorter? | |\n| Completeness | Is the content thorough and detailed? Are there things that weren’t explained fully? | |\n| Grammar | Are there grammatical or spelling errors present? 
| |\n\n### Breakdown', 'question': 'What is the criterion for evaluating the correctness of content related to building with OpenAI technologies?\n\n', 'answer': 'The criterion for evaluating correctness is whether the facts, code snippets, and examples are correct and reliable, and if everything executes correctly.\n\n', 'evidence': '"Correctness | Are the facts, code snippets, and examples correct and reliable? Does everything execute correctly?"', 'groundedness_score': 5, 'groundedness_eval': 'The context provides a detailed list of criteria for evaluating content related to building with OpenAI technologies, including correctness. The criterion for correctness is explicitly mentioned as "Are the facts, code snippets, and examples correct and reliable? Does everything execute correctly?" This directly answers the question about the criterion for evaluating correctness.\n', 'relevance_score': 5, 'relevance_eval': 'This question is highly relevant for AI application developers and those working with Large Language Models and OpenAI services. Understanding the criteria for evaluating the correctness of content is crucial for ensuring the quality and reliability of AI-generated outputs. It helps developers set benchmarks, improve model performance, and maintain the integrity of their applications. This question addresses a fundamental aspect of AI development and quality assurance.\n\n', 'standalone_score': 4, 'standalone_eval': 'The question asks for the criterion for evaluating the correctness of content related to building with OpenAI technologies. While it mentions a specific context (OpenAI technologies), it does not refer to a particular document, setting, or additional information that is required to understand the question. An operator with access to OpenAI documentation or general knowledge about OpenAI technologies would understand the question.\n\n'}]
 
From the results above, we can see that the LLM agent we constructed evaluated the questions in the dataset along the three dimensions, just as we expected.
As expected, the second question, "What is the minimum score required in each area of the rubric for a contribution to be generally accepted?" received a very low score in relevance. The reason given is quite reasonable: "While it may be useful for understanding the criteria for acceptance in that specific context, it does not directly relate to AI application development or working with Large Language Models and OpenAI services."
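Once scores and rationales are attached to every sample, the remaining step is to keep only the samples that pass all three checks. Below is a minimal sketch of that filter; the threshold of 4 is an assumption, so adjust it to your own quality bar.

MIN_SCORE = 4  # assumed threshold; tune it to your own quality bar

filtered_dataset = [
    sample
    for sample in dataset
    if all(
        sample.get(f"{criterion}_score", 0) >= MIN_SCORE
        for criterion in ("groundedness", "relevance", "standalone")
    )
]
print(f"Kept {len(filtered_dataset)} of {len(dataset)} generated samples")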

Build from multimodal data

The above example introduced the principles and process of using LLM to build synthetic datasets. However, in real business scenarios, more complex data types are often involved, such as tables, images, audio recordings, etc. So how can this be extended to multimodal scenarios?
The approach for multimodal scenarios is shown in the figure below; I will explain each part step by step.
 
[Figure: flowchart of the multimodal synthetic dataset pipeline]
 
Currently, LLM APIs can directly accept only two types of data: text and images. This means that in multimodal scenarios, we need to convert different modal data into text or images as much as possible.
For example, this is Alphabet's financial report for the first quarter of 2024. It is a PDF file containing both tables and text. I will use the first page of this document as an example to explain, but the approach is the same for other pages.
[Figure: the first page of Alphabet's Q1 2024 financial report]
Since LLMs can natively accept only images or text content, such PDF files need some preprocessing. Different modal contents must be extracted separately and then recombined so that the LLM can fully understand them. This is the first step in the figure above: parsing and reconstruction.
Many tools can extract structured content from PDF files, such as unstructured and PyMuPDF. LangChain also provides several pre-packaged APIs to help us build LangChain Documents from PDFs.
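For example, here is a minimal sketch using the PyMuPDFLoader wrapper from langchain_community; the PDF path is a placeholder:

from langchain_community.document_loaders import PyMuPDFLoader

# Placeholder path; point this at your own PDF file.
pdf_loader = PyMuPDFLoader("alphabet_q1_2024.pdf")
pdf_docs = pdf_loader.load()  # one LangChain Document per page, with page-level metadata
pdf_docs[0].metadata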
 
However, I will use a new approach: parsing PDFs with a vision model. This idea comes from a recent open-source project I saw, gptpdf.
In brief, the project's approach is:
  1. Use PyMuPDF for basic parsing of the PDF to locate non-text parts.
  2. Convert the original PDF to an image. Wrap non-text parts in red rectangles. Crop and save the images within the red boxes separately.
  3. Send the entire image with red boxes to a vision-language model. The model determines the content within the red boxes. If it's a table, it's converted to HTML and returned. If it's an image, it's returned as a sub-figure reference.
Why use this method? Although many PDF parsing tools were mentioned earlier, none of them are perfect; common methods still have minor issues with tables, formulas, and images. The vision-model approach is comparatively robust, though more expensive. However, since our goal is to create a reusable evaluation dataset, quality should be the top priority within acceptable cost limits.
 
gptpdf uses the following two functions for PDF preprocessing. Due to space constraints, I've only kept these two core functions from the full PDF processing pipeline. They are responsible for:
  1. Converting the PDF to an image
  2. Marking identified images or table content with red rectangles
 
import os
import logging
from typing import List, Tuple

import fitz  # PyMuPDF
import shapely.geometry as sg
from shapely.geometry.base import BaseGeometry
from shapely.validation import explain_validity

# _merge_rects and _adsorb_rects_to_rects are helper functions from gptpdf that
# merge nearby rectangles and attach surrounding text blocks to them.


def _parse_rects(page: fitz.Page) -> List[Tuple[float, float, float, float]]:
    """
    Parse drawings in the page and merge adjacent rectangles.
    """
    drawings = page.get_drawings()
    is_short_line = lambda x: abs(x["rect"][3] - x["rect"][1]) < 1 and abs(x["rect"][2] - x["rect"][0]) < 30
    drawings = [drawing for drawing in drawings if not is_short_line(drawing)]

    # Convert the drawings to Shapely boxes
    rect_list = [sg.box(*drawing["rect"]) for drawing in drawings]

    images = page.get_image_info()
    image_rects = [sg.box(*image["bbox"]) for image in images]
    rect_list += image_rects

    merged_rects = _merge_rects(rect_list, distance=10, horizontal_distance=10)
    merged_rects = [rect for rect in merged_rects if explain_validity(rect) == "Valid Geometry"]

    is_large_content = lambda x: (len(x[4]) / max(1, len(x[4].split("\n")))) > 10
    small_text_area_rects = [sg.box(*x[:4]) for x in page.get_text("blocks") if not is_large_content(x)]
    large_text_area_rects = [sg.box(*x[:4]) for x in page.get_text("blocks") if is_large_content(x)]
    _, merged_rects = _adsorb_rects_to_rects(large_text_area_rects, merged_rects, distance=0.1)  # fully intersecting
    _, merged_rects = _adsorb_rects_to_rects(small_text_area_rects, merged_rects, distance=5)  # nearby

    merged_rects = _merge_rects(merged_rects, distance=10)
    merged_rects = [
        rect for rect in merged_rects
        if rect.bounds[2] - rect.bounds[0] > 20 and rect.bounds[3] - rect.bounds[1] > 20
    ]
    return [rect.bounds for rect in merged_rects]


def _parse_pdf_to_images(pdf_path: str, output_dir: str = "./") -> List[Tuple[str, List[str]]]:
    """
    Parse PDF to images and save to output_dir.
    """
    # open the pdf file
    pdf_document = fitz.open(pdf_path)
    image_infos = []
    for page_index, page in enumerate(pdf_document):
        logging.info(f"parse page: {page_index}")
        rect_images = []
        rects = _parse_rects(page)
        for index, rect in enumerate(rects):
            fitz_rect = fitz.Rect(rect)
            fitz_rect.y0 -= 35
            fitz_rect.y1 += 50
            # render the clipped area to a PNG
            pix = page.get_pixmap(clip=fitz_rect, matrix=fitz.Matrix(4, 4))
            name = f"{page_index}_{index}.png"
            pix.save(os.path.join(output_dir, name))
            rect_images.append(name)
            # draw the red rect
            big_fitz_rect = fitz.Rect(fitz_rect.x0 - 1, fitz_rect.y0 - 1, fitz_rect.x1 + 1, fitz_rect.y1 + 1)
            page.draw_rect(big_fitz_rect, color=(1, 0, 0), width=1)
            # add the image name in the left top corner
            text_x = fitz_rect.x0 + 2
            text_y = fitz_rect.y0 + 10
            text_rect = fitz.Rect(text_x, text_y - 9, text_x + 80, text_y + 2)
            page.draw_rect(text_rect, color=(1, 1, 1), fill=(1, 1, 1))
            page.insert_text((text_x, text_y), name, fontsize=10, color=(1, 0, 0))
        page_image_with_rects = page.get_pixmap(matrix=fitz.Matrix(3, 3))
        page_image = os.path.join(output_dir, f"{page_index}.png")
        page_image_with_rects.save(page_image)
        image_infos.append((page_image, rect_images))
    pdf_document.close()
    return image_infos
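As a rough usage sketch (the PDF filename and output directory are placeholders):

os.makedirs("parsed_pages", exist_ok=True)
# Placeholder input file; this produces an annotated PNG per page plus cropped sub-images.
image_infos = _parse_pdf_to_images("alphabet_q1_2024.pdf", output_dir="parsed_pages/")
for page_image, rect_images in image_infos:
    print(page_image, rect_images)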
 
After processing with these two functions, the original PDF file is converted into a PNG file with red rectangle markings. This covers steps 1 and 2 mentioned earlier.
[Figure: the first page rendered as a PNG, with the table region marked by a red rectangle]
In step 3, we send the image with red rectangle markings to the LLM and use prompt engineering to obtain further parsing results. The main work in this part is the prompt itself.
DEFAULT_PROMPT = """ You are a PDF document parser. Use markdown and LaTeX syntax to output the content of the images. Use markdown syntax to convert the text recognized in the image to markdown format output. You must: 1. Output the content in the same language as recognized in the image. For example, if the text recognized is in English, the output must also be in English. 2. Do not explain or output irrelevant text. Directly output the content from the image. For instance, do not output examples like "Below is the markdown text generated based on the image content." Instead, directly output the markdown. In the image, some areas are marked with red rectagles. If the area is an image, insert it into the output content using the ![]() format. If the area is a table, please use HTML format to represent it. Pay close attention to the format of the table area and ensure that the HTML format matches the original table format exactly. Enclose the table with three backticks before and after, as shown below: ``` <table> <!-- Table content here --> </table> ``` """
 
Let's examine the results returned by the LLM.
# Alphabet Announces First Quarter 2024 Results

MOUNTAIN VIEW, Calif. – April 25, 2024 – Alphabet Inc. (NASDAQ: GOOG, GOOGL) today announced financial results for the quarter ended March 31, 2024.

Sundar Pichai, CEO, said: “Our results in the first quarter reflect strong performance from Search, YouTube and Cloud. We are well under way with our Gemini era and there’s great momentum across the company. Our leadership in AI research and infrastructure, and our global product footprint, position us well for the next wave of AI innovation.”

Ruth Porat, President and Chief Investment Officer, CFO said: “Our strong financial results for the first quarter reflect revenue strength across the company and ongoing efforts to durably reengineer our cost base. We delivered revenues of $80.5 billion, up 15% year-on-year, and operating margin expansion.”

## Q1 2024 Financial Highlights (unaudited)

The following table summarizes our consolidated financial results for the quarters ended March 31, 2023 and 2024 (in millions, except for per share information and percentages).

```
<table>
  <tr><td></td><td>Quarter Ended March 31,</td></tr>
  <tr><td></td><td>2023</td><td>2024</td></tr>
  <tr><td></td><td>(unaudited)</td></tr>
  <tr><td>Revenues</td><td>$ 69,787</td><td>$ 80,539</td></tr>
  <tr><td>Change in revenues year over year</td><td>3 %</td><td>15 %</td></tr>
  <tr><td>Change in constant currency revenues year over year<sup>1</sup></td><td>6 %</td><td>16 %</td></tr>
  <tr><td>Operating income</td><td>$ 17,415</td><td>$ 25,472</td></tr>
  <tr><td>Operating margin</td><td>25 %</td><td>32 %</td></tr>
  <tr><td>Other income (expense), net</td><td>$ 790</td><td>$ 2,843</td></tr>
  <tr><td>Net income</td><td>$ 15,051</td><td>$ 23,662</td></tr>
  <tr><td>Diluted EPS</td><td>$ 1.17</td><td>$ 1.89</td></tr>
</table>
```

<sup>1</sup> Non-GAAP measure. See the section captioned “Reconciliation from GAAP Revenues to Non-GAAP Constant Currency Revenues and GAAP Percentage Change in Revenues to Non-GAAP Percentage Change in Constant Currency Revenues” for more details.
 
Looks great: the results match our expectations. The uploaded image contains both text and table content, and the table is parsed and returned as HTML. Next, we extract the table and store it in a separate HTML file. At this point, we have used the LLM to convert the mixed text-and-table PDF page into a text file in Markdown syntax and a table file in HTML syntax. Together these form the Document Segment.
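A minimal sketch of that extraction step, assuming the table comes back inside a fenced block as the prompt requested (page_markdown is the response string from the sketch above, and the output filenames match the ones used below):

import re

# Pull the fenced HTML table(s) out of the model's markdown output.
table_pattern = re.compile(r"```\s*(<table>.*?</table>)\s*```", re.DOTALL)
tables = table_pattern.findall(page_markdown)

for i, table_html in enumerate(tables):
    with open(f"0_{i}.html", "w") as f:
        f.write(table_html)

# Keep the remaining text as the markdown part of the Document Segment.
with open("0_0.md", "w") as f:
    f.write(table_pattern.sub("", page_markdown).strip())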
[Figure: the Document Segment, consisting of the extracted Markdown file and the HTML table file]
Following the previous process, we send this Document Segment to the LLM in order to generate the question and answer, along with supporting evidence. In multimodal scenarios, if the evidence is in the text, it should be the original text; if it comes from information in an image or a table, it can be replaced with the ID of that table or image. This design reflects the perspective of evaluating the retrieval functionality.
First, let's convert the markdown and HTML file into LangChain Documents based on their filetype.
from langchain.docstore.document import Document

docs = []
filelist = ["0_0.md", "0_0.html"]
for file in filelist:
    if file.endswith(tuple([".html", ".md"])):
        with open(file, "r") as f:
            content = f.read()
        docs.append(
            Document(
                page_content=content,
                metadata={"source": file, "type": "text"},
            )
        )
    elif file.endswith(tuple([".jpg", ".jpeg", ".png"])):
        # local_image_to_data_url is defined in the next snippet.
        data_url = local_image_to_data_url(file)
        docs.append(
            Document(
                page_content=data_url,
                metadata={"source": file, "type": "image"},
            )
        )
docs
 
Here are two LangChain Documents generated using the previously extracted text and table.
[Document(metadata={'source': '0_0.md', 'type': 'text'}, page_content='# Alphabet Announces First Quarter 2024 Results\n\nMOUNTAIN VIEW, Calif. – April 25, 2024 – Alphabet Inc. (NASDAQ: GOOG, GOOGL) today announced financial results for the quarter ended March 31, 2024.\n\nSundar Pichai, CEO, said: “Our results in the first quarter reflect strong performance from Search, YouTube and Cloud. We are well under way with our Gemini era and there’s great momentum across the company. Our leadership in AI research and infrastructure, and our global product footprint, position us well for the next wave of AI innovation.”\n\nRuth Porat, President and Chief Investment Officer, CFO said: “Our strong financial results for the first quarter reflect revenue strength across the company and ongoing efforts to durably reengineer our cost base. We delivered revenues of $80.5 billion, up 15% year-on-year, and operating margin expansion.”\n\n## Q1 2024 Financial Highlights (unaudited)\n\nThe following table summarizes our consolidated financial results for the quarters ended March 31, 2023 and 2024 (in millions, except for per share information and percentages).'), Document(metadata={'source': '0_0.html', 'type': 'text'}, page_content='<table>\n <tr>\n <td></td>\n <td></td>\n <td>Quarter Ended March 31,</td>\n </tr>\n <tr>\n <td></td>\n <td></td>\n <td>2023</td>\n <td>2024</td>\n </tr>\n <tr>\n <td></td>\n <td></td>\n <td>(unaudited)</td>\n </tr>\n <tr>\n <td>Revenues</td>\n <td>$ 69,787</td>\n <td>$ 80,539</td>\n </tr>\n <tr>\n <td>Change in revenues year over year</td>\n <td>3 %</td>\n <td>15 %</td>\n </tr>\n <tr>\n <td>Change in constant currency revenues year over year<sup>(1)</sup></td>\n <td>6 %</td>\n <td>16 %</td>\n </tr>\n <tr>\n <td>Operating income</td>\n <td>$ 17,415</td>\n <td>$ 25,472</td>\n </tr>\n <tr>\n <td>Operating margin</td>\n <td>25 %</td>\n <td>32 %</td>\n </tr>\n <tr>\n <td>Other income (expense), net</td>\n <td>$ 790</td>\n <td>$ 2,843</td>\n </tr>\n <tr>\n <td>Net income</td>\n <td>$ 15,051</td>\n <td>$ 23,662</td>\n </tr>\n <tr>\n <td>Diluted EPS</td>\n <td>$ 1.17</td>\n <td>$ 1.89</td>\n </tr>\n <tr>\n <td colspan="4"><sup>(1)</sup> Non-GAAP measure. See the section captioned “Reconciliation from GAAP Revenues to Non-GAAP Constant Currency Revenues and GAAP Percentage Change in Revenues to Non-GAAP Percentage Change in Constant Currency Revenues” for more details.</td>\n </tr> \n</table>')]
 
Next, similar to the previous process, we let the LLM generate the question and answer based on the multimodal context. Compared with before, this prompt includes some content related to tables and images; if you are interested, pay extra attention to those parts.
import base64
from mimetypes import guess_type


def local_image_to_data_url(image_path):
    mime_type, _ = guess_type(image_path)
    if mime_type is None:
        mime_type = "application/octet-stream"
    with open(image_path, "rb") as image_file:
        base64_encoded_data = base64.b64encode(image_file.read()).decode("utf-8")
    return f"data:{mime_type};base64,{base64_encoded_data}"


def build_qa_prompting_msg_list(documents: list):
    system_message = {
        "role": "system",
        "content": "You are an assistant who is very good at written work and is a logical thinker.",
    }
    QA_generation_prompt = """
Your task is to write a factoid question and an answer given a context list. The context list will be provided to you in order.
Your factoid question should be answerable with a specific, concise piece of factual information from the context.
Your factoid question should be formulated in the same style as questions users could ask in a search engine.
The context list may be in the form of text, tables, or images. Tables are presented in HTML format, and images are expressed in base64 format. Please pay attention to their order.
You need to provide the specific evidence. If the evidence comes from the text section, please return the reference from the text, which should be an entire paragraph or a whole sentence in context. If the evidence comes from a table or an image, please provide the name in order, such as table_1 or table_2.
Please note that your evidence MUST NOT mention something like "according to the context" or "the title indicates that".

Provide your answer as follows:

Output:::
Factoid question: (your factoid question)
Answer: (your answer to the factoid question)
Evidence: (the evidence sentence from the context that supports the answer)

Now here is the context list."""
    human_message_content = [{"type": "text", "text": QA_generation_prompt}]
    for d in documents:
        if d.metadata["type"] == "image":
            human_message_content.append(
                {
                    "type": "image_url",
                    "image_url": {"url": d.page_content},
                }
            )
        elif d.metadata["type"] == "text":
            human_message_content.append(
                {
                    "type": "text",
                    "text": d.page_content,
                }
            )
    human_message = {
        "role": "user",
        "content": human_message_content,
    }
    return [system_message, human_message]


document_message = build_qa_prompting_msg_list(docs)
response = client.chat.completions.create(model="gpt-4o", messages=document_message)
response.choices[0].message.content
 
I ran this twice: once the question came from the text, and the other time it came from the table, which aligns with our expectations.
[{'question': "What were Alphabet's revenues for the first quarter of 2024?\n", 'answer': '$80.5 billion\n', 'evidence': 'Ruth Porat, President and Chief Investment Officer, CFO said: “Our strong financial results for the first quarter reflect revenue strength across the company and ongoing efforts to durably reengineer our cost base. We delivered revenues of $80.5 billion, up 15% year-on-year, and operating margin expansion.”'}]
[{'question': "What was Alphabet Inc.'s operating income for the first quarter of 2024?\n", 'answer': '$25,472 million\n', 'evidence': 'table_1'}]
 
The subsequent steps are the same as before: we construct a critique agent to score the generated questions and select the high-scoring ones for the dataset, so I'll skip them here.
Overall, the approach to constructing a multimodal dataset is consistent with the previous methods. The difference lies in the need for more complex preprocessing of mixed-modal files, converting them into text or images that can be directly processed by the LLM. In other words, it involves reconstructing the original PDF, DOCX, and other mixed-modal files, then having the LLM read them in sequence (which mimics the human reading order) to generate the desired dataset.

Citation

Cited as:
@article{zhou2024rag,
  title   = "How to evaluate your RAG Part 1: Synthetic Dataset",
  author  = "Zhou, More",
  journal = "blog.xxm.plus",
  year    = "2024",
  month   = "Aug",
  url     = "https://blog.xxm.plus/how-to-evaluate-your-rag-part-1-synthetic-data"
}

References