Automatisation des tests LLM : Évaluation GPT-4 et suivi sur Google Sheets

Name: Automatisation des tests LLM : Évaluation GPT-4 et suivi sur Google Sheets
Rating: 4.5 (10 reviews)
Author: Adam Janes

Avancé

Ceci est unEngineering, AI Summarizationworkflow d'automatisation du domainecontenant 17 nœuds.Utilise principalement des nœuds comme Set, Limit, Merge, Webhook, HttpRequest. Automatisation des tests LLM : évaluation GPT-4 et suivi avec Google Sheets

Prérequis

•Point de terminaison HTTP Webhook (généré automatiquement par n8n)
•Peut nécessiter les informations d'identification d'authentification de l'API cible
•Informations d'identification Google Sheets API

Nœuds utilisés (17)

OutputParserStructured

Catégorie

Ingénierie

Résumé IA

Aperçu du workflow

Visualisation des connexions entre les nœuds, avec support du zoom et du déplacement

Fusion

Analyseur de sortie structurée

Mettre à jour les résultats

Limite

Extraire les données

Obtenir les tests

Exécuter le sous-workflow

Webhook

Chaîne LLM de base

Modèle de chat OpenRouter

Conserver les données d'origine

Déclencheur manuel

React Flow

Exporter le workflow

Copiez la configuration JSON suivante dans n8n pour importer et utiliser ce workflow

{
  "meta": {
    "instanceId": "45e293393b5dd8437fb351e5b1ef5511ef67e6e0826a1c10b9b68be850b67593"
  },
  "nodes": [
    {
      "id": "2dbc4a8a-4fb6-4679-9d96-2724f79fbac1",
      "name": "Fusion",
      "type": "n8n-nodes-base.merge",
      "position": [
        1980,
        600
      ],
      "parameters": {
        "mode": "combine",
        "options": {},
        "combineBy": "combineByPosition"
      },
      "typeVersion": 3.1
    },
    {
      "id": "146a6af3-58ec-4555-9202-3ce87a83af28",
      "name": "Analyseur de sortie structurée",
      "type": "@n8n/n8n-nodes-langchain.outputParserStructured",
      "position": [
        1540,
        520
      ],
      "parameters": {
        "jsonSchemaExample": "{\n  \"reasoning\": \"The Assistant fabricated a $1 million figure and a 12-month provision that are not found in the source. This breaches factual correctness and completeness. The output would mislead business stakeholders if used without correction.\",\n  \"decision\": \"Fail\"\n}"
      },
      "typeVersion": 1.2
    },
    {
      "id": "83da8236-e5fb-4847-8033-6559f575c7ff",
      "name": "Mettre à jour les résultats",
      "type": "n8n-nodes-base.googleSheets",
      "position": [
        960,
        200
      ],
      "parameters": {
        "columns": {
          "value": {
            "ID": "={{ $json.ID }}",
            "Input": "={{ $json.Input }}",
            "Output": "={{ $json.Output }}",
            "Decision": "={{ $json.output.decision }}",
            "Test No.": "={{ $json[\"Test No\"][\"\"] }}",
            "Reasoning": "={{ $json.output.reasoning }}",
            "AI Platform": "={{ $json[\"AI Platform\"] }}",
            "Reference Answer": "={{ $json[\"Reference Answer\"] }}"
          },
          "schema": [
            {
              "id": "ID",
              "type": "string",
              "display": true,
              "removed": false,
              "required": false,
              "displayName": "ID",
              "defaultMatch": false,
              "canBeUsedToMatch": true
            },
            {
              "id": "Test No.",
              "type": "string",
              "display": true,
              "required": false,
              "displayName": "Test No.",
              "defaultMatch": false,
              "canBeUsedToMatch": true
            },
            {
              "id": "AI Platform",
              "type": "string",
              "display": true,
              "required": false,
              "displayName": "AI Platform",
              "defaultMatch": false,
              "canBeUsedToMatch": true
            },
            {
              "id": "Input",
              "type": "string",
              "display": true,
              "required": false,
              "displayName": "Input",
              "defaultMatch": false,
              "canBeUsedToMatch": true
            },
            {
              "id": "Output",
              "type": "string",
              "display": true,
              "required": false,
              "displayName": "Output",
              "defaultMatch": false,
              "canBeUsedToMatch": true
            },
            {
              "id": "Reference Answer",
              "type": "string",
              "display": true,
              "required": false,
              "displayName": "Reference Answer",
              "defaultMatch": false,
              "canBeUsedToMatch": true
            },
            {
              "id": "Decision",
              "type": "string",
              "display": true,
              "required": false,
              "displayName": "Decision",
              "defaultMatch": false,
              "canBeUsedToMatch": true
            },
            {
              "id": "Reasoning",
              "type": "string",
              "display": true,
              "required": false,
              "displayName": "Reasoning",
              "defaultMatch": false,
              "canBeUsedToMatch": true
            }
          ],
          "mappingMode": "defineBelow",
          "matchingColumns": [
            "ID"
          ],
          "attemptToConvertTypes": false,
          "convertFieldsToString": false
        },
        "options": {},
        "operation": "appendOrUpdate",
        "sheetName": {
          "__rl": true,
          "mode": "list",
          "value": 537199982,
          "cachedResultUrl": "https://docs.google.com/spreadsheets/d/1c73be3fHkKr0DVJYIt9qlNfJcfuUV6DTShp93fa55Ig/edit#gid=537199982",
          "cachedResultName": "Results"
        },
        "documentId": {
          "__rl": true,
          "mode": "url",
          "value": "https://docs.google.com/spreadsheets/d/1c73be3fHkKr0DVJYIt9qlNfJcfuUV6DTShp93fa55Ig/edit?usp=sharing"
        }
      },
      "credentials": {
        "googleSheetsOAuth2Api": {
          "id": "04iXS2lwUVyzn6F2",
          "name": "Google Sheets account"
        }
      },
      "typeVersion": 4.5
    },
    {
      "id": "824c06fb-9104-4c65-a77f-33db0167c0f6",
      "name": "Note adhésive4",
      "type": "n8n-nodes-base.stickyNote",
      "position": [
        560,
        -20
      ],
      "parameters": {
        "color": 4,
        "height": 720,
        "content": "## 2. Execute Subworkflow\nThis node runs immediately (batching requests), but waits for the result before moving to the next step."
      },
      "typeVersion": 1
    },
    {
      "id": "3a20e99f-b183-4362-b909-2fffdd48d0d2",
      "name": "Note adhésive8",
      "type": "n8n-nodes-base.stickyNote",
      "position": [
        -680,
        160
      ],
      "parameters": {
        "width": 460,
        "height": 280,
        "content": "## Data format\nOur Tests Sheet contains the following columns:\n- ID: A unique identifier for each row\n- Test No.: The test that the LLM was given\n- AI Platform: The LLM that was given the test.\n- Input: The input prompt that the LLM was given.\n- Output: The response that the LLM gave.\n- Reference Answer: The \"gold standard\" answer to the input in question, showing how the LLM is expected to respond."
      },
      "typeVersion": 1
    },
    {
      "id": "16fe7cb7-ca24-40f1-855b-e1867bf29b56",
      "name": "Note adhésive9",
      "type": "n8n-nodes-base.stickyNote",
      "position": [
        0,
        -20
      ],
      "parameters": {
        "color": 6,
        "width": 360,
        "height": 180,
        "content": "## 1. Fetch test cases\nWe start by grabbing our list of test cases stored in a Google Sheet [here](https://docs.google.com/spreadsheets/d/1c73be3fHkKr0DVJYIt9qlNfJcfuUV6DTShp93fa55Ig/edit?usp=sharing).\n\nTo start the workflow, you should click \"Execute workflow\" button to the left of the Manual Trigger node."
      },
      "typeVersion": 1
    },
    {
      "id": "86f611e8-ca94-4b9f-a858-45d0fcbdfcfa",
      "name": "Note adhésive15",
      "type": "n8n-nodes-base.stickyNote",
      "position": [
        900,
        -20
      ],
      "parameters": {
        "color": 6,
        "width": 260,
        "height": 180,
        "content": "## 4. Update results\nWe create a new row in our output sheet, containing our original data together with the judge decision/reasoning."
      },
      "typeVersion": 1
    },
    {
      "id": "caa54653-920b-4d4f-abb6-bab54c64350b",
      "name": "Note adhésive16",
      "type": "n8n-nodes-base.stickyNote",
      "position": [
        1320,
        -20
      ],
      "parameters": {
        "color": 4,
        "width": 360,
        "height": 340,
        "content": "## 3. Judge LLM outputs\nOur prompt judges the LLM input/output and decides if the LLM passed the test, based on how well the output fits to the reference answer. \n\nWe also ask for a reason why the judge made its decision, which we can use to refine our eval later.\n\nWe're using OpenRouter here, which lets us easily tweak which LLM we want to use.\n\nThe output parser makes sure that the output is in JSON format, making the data easy to parse in the next step."
      },
      "typeVersion": 1
    },
    {
      "id": "9b22fb78-d6fa-4dad-a543-1b02828d2f2e",
      "name": "Limite",
      "type": "n8n-nodes-base.limit",
      "disabled": true,
      "position": [
        360,
        220
      ],
      "parameters": {
        "maxItems": 3
      },
      "typeVersion": 1
    },
    {
      "id": "faad2c18-defc-4644-b9a3-3650c26f5891",
      "name": "Extraire les données",
      "type": "n8n-nodes-base.set",
      "position": [
        1000,
        400
      ],
      "parameters": {
        "mode": "raw",
        "options": {},
        "jsonOutput": "={{ $json.body }}"
      },
      "typeVersion": 3.4
    },
    {
      "id": "ec8629e4-7715-410c-aa6d-560fd284a1ca",
      "name": "Obtenir les tests",
      "type": "n8n-nodes-base.googleSheets",
      "position": [
        140,
        220
      ],
      "parameters": {
        "options": {},
        "sheetName": {
          "__rl": true,
          "mode": "list",
          "value": "gid=0",
          "cachedResultUrl": "https://docs.google.com/spreadsheets/d/1c73be3fHkKr0DVJYIt9qlNfJcfuUV6DTShp93fa55Ig/edit#gid=0",
          "cachedResultName": "Tests"
        },
        "documentId": {
          "__rl": true,
          "mode": "url",
          "value": "https://docs.google.com/spreadsheets/d/1c73be3fHkKr0DVJYIt9qlNfJcfuUV6DTShp93fa55Ig/edit?usp=sharing"
        }
      },
      "credentials": {
        "googleSheetsOAuth2Api": {
          "id": "04iXS2lwUVyzn6F2",
          "name": "Google Sheets account"
        }
      },
      "typeVersion": 4.5
    },
    {
      "id": "d7160cac-8bea-4464-bfea-00c785b8ac7e",
      "name": "Exécuter le sous-workflow",
      "type": "n8n-nodes-base.httpRequest",
      "onError": "continueErrorOutput",
      "maxTries": 2,
      "position": [
        620,
        220
      ],
      "parameters": {
        "url": "https://webhook-processor-production-48f8.up.railway.app/webhook/llm-as-a-judge",
        "method": "POST",
        "options": {
          "batching": {
            "batch": {
              "batchSize": 1,
              "batchInterval": 500
            }
          }
        },
        "jsonBody": "={{ $json }}",
        "sendBody": true,
        "specifyBody": "json"
      },
      "retryOnFail": false,
      "typeVersion": 4.2
    },
    {
      "id": "6920a43b-bdbf-47c0-a644-1f75375e1127",
      "name": "Webhook",
      "type": "n8n-nodes-base.webhook",
      "position": [
        620,
        480
      ],
      "webhookId": "1cbce320-d28e-4e97-8663-bf2c6a36a358",
      "parameters": {
        "path": "llm-as-a-judge",
        "options": {},
        "httpMethod": "POST",
        "responseData": "allEntries",
        "responseMode": "lastNode"
      },
      "typeVersion": 2
    },
    {
      "id": "70cc9edd-f481-420e-bfcc-02b25f4353db",
      "name": "Chaîne LLM de base",
      "type": "@n8n/n8n-nodes-langchain.chainLlm",
      "onError": "continueErrorOutput",
      "position": [
        1380,
        340
      ],
      "parameters": {
        "text": "=INPUT:\n\n{\n  \"task\": {{ $('Extract Data').item.json['Input'] }},\n  \"answer_key\": {{ $('Extract Data').item.json['Reference Answer'] }},\n  \"output\": {{ $('Extract Data').item.json['Output'] }}\n}\n\nOUTPUT:",
        "messages": {
          "messageValues": [
            {
              "message": "=## Context\n\nYou are an evaluator of LLMs in the legal domain.\n\n## Inputs Provided for Each Task\n\n- task: The legal question or instruction.\n- answer_key: The correct answer for this task, found in the answer key column of the same Google Sheet.\n- output: The answer generated by the AI Assistant.\n\n\n## Evaluation Rules\n\nGrade the AI Assistant's output as Pass or Fail by comparing it ONLY to the answer_key for that task.\n\nDo not use or reference the original source material or any other information.\n\n## Criteria for Pass\n\n1. Factual Correctness\n- The output must accurately reflect the information in the answer_key.\n- Minor differences in paraphrasing, wording, or formatting (including clause numbering, references, or synonyms) are acceptable if the substantive information matches the answer_key.\n- If the answer key provides multiple possible correct answers (e.g., separated by \"OR\"), any output that matches any one of the alternatives is acceptable.\n\n\n2. Relevance to the Query\n- The output must directly answer the task as covered in the answer_key.\n- Do not introduce unrelated or off-topic information.\n\n\n3. Completeness\n- If the output contains extra information that does not contradict or misrepresent the answer key, it is acceptable.\n- Omitting any critical point present in the answer_key = Fail.\n\n\n## Key Rule\n- If the output materially fails any one of the three requirements compared to the answer_key, grade as Fail.\n- Minor paraphrasing or stylistic differences are acceptable if the substantive meaning is identical.\n\n\n## Required Output Format\n\nYour evaluation must be provided in JSON with two keys only:\n\n- decision: Pass or Fail\n- reasoning: A brief explanation, strictly comparing the output to the answer_key.\n\n\n### Example Input 1\n\n{\n \"task\": \"Extract the liability cap and time-based provisions from a limitation of liability clause.\",\n \"answer_key\": \"The liability cap is $1 million with a 12-month limit.\",\n \"output\": \"The liability cap is $1 million with a 12-month limit.\"\n}\n\n### Example Output 1\n\n{\n  \"output\": {\n    {\n     \"decision\": \"Pass\",\n     \"reasoning\": \"The output exactly matches the answer key, so it is factually correct, relevant, and complete.\"\n     }\n  }\n}\n\n### Example Input 2\n\n{\n \"task\": \"Extract the liability cap and time-based provisions from a limitation of liability clause.\",\n \"answer_key\": \"The liability cap is $1 million with a 12-month limit.\",\n \"output\": \"The liability cap is $2 million and there is no time limit.\"\n}\n\n### Example Output 2\n\n{\n  \"output\": {\n    {\n     \"decision\": \"Fail\",\n     \"reasoning\": \"The output gives a $2 million cap and omits the 12-month limit from the answer key. This fails both factual correctness and completeness.\"\n    }\n}\n\n### Example Input 3\n\n{\n \"task\": \"State the governing law.\",\n \"answer_key\": \"Singapore law.\",\n \"output\": \"This agreement is governed by Singapore law. All disputes will be subject to the exclusive jurisdiction of Singapore courts.\"\n}\n\n### Example Output 3\n\n{\n  \"output\": {\n    \"reasoning\": \"All required information from the answer_key is present. The extra information does not contradict or misrepresent the answer_key.\"\n \"decision\": \"Pass\",\n  }\n}\n\n### Example Input 4\n\n{\n \"task\": \"Identify the relevant clause.\",\n \"answer_key\": \"Clause 5\",\n \"output\": \"clause 5\"\n}\n\n### Example Output 4\n\n{\n  \"output\": {\n    \"reasoning\": \"The output matches the answer key despite minor formatting differences.\"\n    \"decision\": \"Pass\",\n  }\n}\n\n### Example Input 5\n\n{\n \"task\": \"Extract the parties to the contract.\",\n \"answer_key\": \"Company A and Company B OR The Buyer and the Seller\",\n \"output\": \"The Buyer and the Seller\"\n}\n\n### Example Output 5\n\n{\n  \"output\": {\n    \"reasoning\": \"The output matches one of the acceptable answer_key alternatives.\"\n    \"decision\": \"Pass\",\n  }\n}\n\n## Reminder\nAlways grade solely by comparison to the answer_key column for each task in the input data."
            }
          ]
        },
        "promptType": "define",
        "hasOutputParser": true
      },
      "typeVersion": 1.4
    },
    {
      "id": "f4ddb551-cbaa-4c2d-96ca-3769a199ce1a",
      "name": "Modèle de chat OpenRouter",
      "type": "@n8n/n8n-nodes-langchain.lmChatOpenRouter",
      "position": [
        1380,
        520
      ],
      "parameters": {
        "model": "openai/gpt-4.1",
        "options": {}
      },
      "credentials": {
        "openRouterApi": {
          "id": "ipzDVYsZqbum9bX4",
          "name": "OpenRouter account 2"
        }
      },
      "typeVersion": 1
    },
    {
      "id": "b8eedf4a-eb85-4b4a-ad4b-61d9d31984c1",
      "name": "Conserver les données d'origine",
      "type": "n8n-nodes-base.set",
      "position": [
        1480,
        820
      ],
      "parameters": {
        "mode": "raw",
        "options": {},
        "jsonOutput": "={{ $json.body }}"
      },
      "typeVersion": 3.4
    },
    {
      "id": "69c41be1-ff93-4098-8b9d-cd5cc88d9271",
      "name": "Déclencheur manuel",
      "type": "n8n-nodes-base.manualTrigger",
      "position": [
        -80,
        220
      ],
      "parameters": {},
      "typeVersion": 1
    }
  ],
  "pinData": {},
  "connections": {
    "9b22fb78-d6fa-4dad-a543-1b02828d2f2e": {
      "main": [
        [
          {
            "node": "d7160cac-8bea-4464-bfea-00c785b8ac7e",
            "type": "main",
            "index": 0
          }
        ]
      ]
    },
    "6920a43b-bdbf-47c0-a644-1f75375e1127": {
      "main": [
        [
          {
            "node": "b8eedf4a-eb85-4b4a-ad4b-61d9d31984c1",
            "type": "main",
            "index": 0
          },
          {
            "node": "faad2c18-defc-4644-b9a3-3650c26f5891",
            "type": "main",
            "index": 0
          }
        ]
      ]
    },
    "ec8629e4-7715-410c-aa6d-560fd284a1ca": {
      "main": [
        [
          {
            "node": "9b22fb78-d6fa-4dad-a543-1b02828d2f2e",
            "type": "main",
            "index": 0
          }
        ]
      ]
    },
    "faad2c18-defc-4644-b9a3-3650c26f5891": {
      "main": [
        [
          {
            "node": "70cc9edd-f481-420e-bfcc-02b25f4353db",
            "type": "main",
            "index": 0
          }
        ]
      ]
    },
    "69c41be1-ff93-4098-8b9d-cd5cc88d9271": {
      "main": [
        [
          {
            "node": "ec8629e4-7715-410c-aa6d-560fd284a1ca",
            "type": "main",
            "index": 0
          }
        ]
      ]
    },
    "70cc9edd-f481-420e-bfcc-02b25f4353db": {
      "main": [
        [
          {
            "node": "2dbc4a8a-4fb6-4679-9d96-2724f79fbac1",
            "type": "main",
            "index": 0
          }
        ]
      ]
    },
    "b8eedf4a-eb85-4b4a-ad4b-61d9d31984c1": {
      "main": [
        [
          {
            "node": "2dbc4a8a-4fb6-4679-9d96-2724f79fbac1",
            "type": "main",
            "index": 1
          }
        ]
      ]
    },
    "d7160cac-8bea-4464-bfea-00c785b8ac7e": {
      "main": [
        [
          {
            "node": "83da8236-e5fb-4847-8033-6559f575c7ff",
            "type": "main",
            "index": 0
          }
        ]
      ]
    },
    "f4ddb551-cbaa-4c2d-96ca-3769a199ce1a": {
      "ai_languageModel": [
        [
          {
            "node": "70cc9edd-f481-420e-bfcc-02b25f4353db",
            "type": "ai_languageModel",
            "index": 0
          }
        ]
      ]
    },
    "146a6af3-58ec-4555-9202-3ce87a83af28": {
      "ai_outputParser": [
        [
          {
            "node": "70cc9edd-f481-420e-bfcc-02b25f4353db",
            "type": "ai_outputParser",
            "index": 0
          }
        ]
      ]
    }
  }
}

Foire aux questions

Comment utiliser ce workflow ?

Copiez le code de configuration JSON ci-dessus, créez un nouveau workflow dans votre instance n8n et sélectionnez "Importer depuis le JSON", collez la configuration et modifiez les paramètres d'authentification selon vos besoins.

Dans quelles scénarios ce workflow est-il adapté ?

Avancé - Ingénierie, Résumé IA

Est-ce payant ?

Ce workflow est entièrement gratuit et peut être utilisé directement. Veuillez noter que les services tiers utilisés dans le workflow (comme l'API OpenAI) peuvent nécessiter un paiement de votre part.

Workflows recommandés

Benchmarking des performances de LLM sur des documents juridiques avec Google Sheets et OpenRouter

Benchmark des performances des LLM sur des documents juridiques avec Google Sheets et OpenRouter

Intelligence Artificielle

Analyse intelligente quotidienne des groupes WhatsApp : Analyse GPT-4.1 et transcription des messages vocaux

Analyse intelligente des groupes WhatsApp quotidiens : analyse GPT-4.1 et transcription des messages vocaux

52 NœudsDaniel Lianes

Divers

Génération d'e-mails de vente personnalisés à partir de données LinkedIn via Claude 3.5 Sonnet et OpenRouter

Basé sur les données LinkedIn, utiliser OpenRouter avec Claude 3.7 pour générer des e-mails de vente personnalisés

Explorer les nœuds n8n dans la bibliothèque de références visuelles

Explorer les nœuds n8n dans la base de références visuelles

01 Analyser la performance des publicités Facebook avec un acheteur d'espace médias IA et envoyer les informations dans Google Sheets

Analyser les publicités Facebook et envoyer les insights dans Google Sheets avec Gemini AI

Prospection et workflow d'e-mails

Utiliser Google Maps, SendGrid et l'IA pour automatiser le développement de prospects B2B et le marketing par e-mail

141 NœudsEzema Kingsley Chibuzo

Génération de leads

Informations sur le workflow

Niveau de difficulté

Avancé

Nombre de nœuds17

Catégorie2

Types de nœuds11

Description de la difficulté

Adapté aux utilisateurs avancés, avec des workflows complexes contenant 16+ nœuds

Auteur

Adam Janes

@adamjanes

I am a product-minded technologist with hacker DNA building things in AI automation. I have a broad and varied background - having worked in Product, Design, and Sales - combined with deep technical experience as a Senior Developer and Fractional CTO. I am also a best-selling Udemy instructor (with 25K+ students), and founder of WOOFCODE - a free coding camp for fullstack developers. I practice non-violent communication, motivational interviewing, and Tibetan Buddhist meditation.

Liens externes

Voir sur n8n.io →

Partager ce workflow

Automatisation des tests LLM : Évaluation GPT-4 et suivi sur Google Sheets

Nœuds utilisés (17)

Catégorie

Comment utiliser ce workflow ?

Dans quelles scénarios ce workflow est-il adapté ?

Est-ce payant ?

Workflows recommandés

Catégories