Parse, Normalize, Extract, and Store PDF Content for RAG in Pinecone

Name: Parse, Normalize, Extract, and Store PDF Content for RAG in Pinecone
Rating: 4.5 (10 reviews)
Author: Alok Kumar

Advanced

This is an automation workflow in the AI RAG and Multimodal AI categories. It contains 18 nodes and mainly uses the If, Code, Wait, GoogleDrive, and HttpRequest nodes. It builds a question-answering system for PDFs using LlamaIndex, OpenAI embeddings, and the Pinecone vector database.

Prerequisites

• Google Drive API credentials
• Authentication credentials for the target API may be required (the HTTP Request nodes in this workflow use a bearer token for LlamaIndex Cloud)
• OpenAI API key
• Pinecone API key

Nodes used (18)

DocumentDefaultDataLoader

TextSplitterRecursiveCharacterTextSplitter

Category

AI RAG

Multimodal AI

Workflow preview

Google Drive Trigger

Download file

Default Data Loader

Wait

Wait2

Upload to Llama Cloud

Check parse status

Extract Markdown from Llama Cloud

Normalize text

Chunk text

Generate embeddings

Store in Pinecone
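
The two wait nodes, the status-check request, and the If node in this list form a polling loop: upload the PDF, pause, check the job status, and either fetch the Markdown (status `SUCCESS`) or pause again and re-check. The sketch below mirrors that control flow in plain JavaScript. It assumes the wait amounts (30 and 60) are seconds. `waitForParse`, `checkStatus`, `sleep`, and `maxChecks` are hypothetical names introduced for illustration; the exported workflow itself has no retry limit.

```javascript
// Sketch of the polling loop formed by the wait, status-check, and If nodes.
// `checkStatus` and `sleep` are injected so the logic can run without network access;
// both are hypothetical helpers, not part of n8n or the LlamaIndex Cloud API.
async function waitForParse(jobId, checkStatus, sleep, opts = {}) {
  const { firstWaitMs = 30000, retryWaitMs = 60000, maxChecks = 20 } = opts;

  await sleep(firstWaitMs); // first wait node: pause before the first status check
  for (let attempt = 1; attempt <= maxChecks; attempt++) {
    const { status } = await checkStatus(jobId); // status-check HTTP request
    if (status === "SUCCESS") {
      // If node, true branch: the caller can now fetch the Markdown result
      return { jobId, attempts: attempt };
    }
    await sleep(retryWaitMs); // If node, false branch: second wait node, then re-check
  }
  // Assumption: give up after maxChecks instead of looping forever
  throw new Error(`Job ${jobId} did not finish after ${maxChecks} checks`);
}
```

Capping the number of checks is a design choice added here, so that a job that never reaches `SUCCESS` fails the run instead of polling forever.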

Export workflow

Copy the following JSON configuration into n8n to import and use this workflow.

{
  "id": "xDiuqZUZnShKpPzX",
  "meta": {
    "instanceId": "70273a2379644db63ce659827cfd8abac2d0b189210eafa02dd5376e3a62cd1d",
    "templateCredsSetupCompleted": true
  },
  "name": "Parse, Normalize, Extract, and Store PDF Content for RAG in Pinecone",
  "tags": [],
  "nodes": [
    {
      "id": "19b009db-a418-458c-a216-bdcc9af6fd2f",
      "name": "Google Drive Trigger",
      "type": "n8n-nodes-base.googleDriveTrigger",
      "position": [
        -1504,
        2080
      ],
      "parameters": {
        "event": "fileCreated",
        "options": {},
        "pollTimes": {
          "item": [
            {
              "mode": "everyMinute"
            }
          ]
        },
        "triggerOn": "specificFolder",
        "folderToWatch": {
          "__rl": true,
          "mode": "list",
          "value": ""
        }
      },
      "credentials": {
        "googleDriveOAuth2Api": {
          "id": "aU33fzddE6s3ZQw6",
          "name": "LearnBy-Google-Drive"
        }
      },
      "typeVersion": 1
    },
    {
      "id": "ff933f76-d719-40b5-b193-8a29e5fa2197",
      "name": "Download file",
      "type": "n8n-nodes-base.googleDrive",
      "position": [
        -1248,
        2096
      ],
      "parameters": {
        "fileId": {
          "__rl": true,
          "mode": "id",
          "value": "={{ $json.id }}"
        },
        "options": {},
        "operation": "download"
      },
      "credentials": {
        "googleDriveOAuth2Api": {
          "id": "aU33fzddE6s3ZQw6",
          "name": "LearnBy-Google-Drive"
        }
      },
      "typeVersion": 3
    },
    {
      "id": "127b41ed-ad45-4234-b87f-4f3c2b6ea531",
      "name": "Default Data Loader",
      "type": "@n8n/n8n-nodes-langchain.documentDefaultDataLoader",
      "position": [
        528,
        2192
      ],
      "parameters": {
        "options": {},
        "textSplittingMode": "custom"
      },
      "typeVersion": 1.1
    },
    {
      "id": "0316c7d4-449f-4275-a9b1-8848545beba8",
      "name": "Sticky Note",
      "type": "n8n-nodes-base.stickyNote",
      "position": [
        336,
        1712
      ],
      "parameters": {
        "width": 736,
        "height": 832,
        "content": "## Save to Vector DB"
      },
      "typeVersion": 1
    },
    {
      "id": "b06702b5-c322-4a5a-949a-855c8b97dadc",
      "name": "Sticky Note1",
      "type": "n8n-nodes-base.stickyNote",
      "position": [
        -1088,
        1720
      ],
      "parameters": {
        "color": 4,
        "width": 1392,
        "height": 656,
        "content": "## Prepare data - Parse and Normalize\n"
      },
      "typeVersion": 1
    },
    {
      "id": "05034e35-f6bf-45a6-860e-94f4da566daf",
      "name": "Wait",
      "type": "n8n-nodes-base.wait",
      "position": [
        -720,
        2088
      ],
      "webhookId": "a0518843-31f8-44f9-bd8e-1189e16de0f1",
      "parameters": {
        "amount": 30
      },
      "typeVersion": 1.1
    },
    {
      "id": "9bb49bf6-a02e-4cf7-a1d1-ca4addff2bc6",
      "name": "If",
      "type": "n8n-nodes-base.if",
      "position": [
        -272,
        2016
      ],
      "parameters": {
        "options": {},
        "conditions": {
          "options": {
            "version": 2,
            "leftValue": "",
            "caseSensitive": true,
            "typeValidation": "strict"
          },
          "combinator": "and",
          "conditions": [
            {
              "id": "7a07aec1-fc5f-4b76-94d9-6fa8f509ac8e",
              "operator": {
                "name": "filter.operator.equals",
                "type": "string",
                "operation": "equals"
              },
              "leftValue": "={{ $json.status }}",
              "rightValue": "SUCCESS"
            }
          ]
        }
      },
      "typeVersion": 2.2
    },
    {
      "id": "28654aff-e603-4080-9e7a-e706aaee47c4",
      "name": "Wait2",
      "type": "n8n-nodes-base.wait",
      "position": [
        -48,
        2184
      ],
      "webhookId": "8da5da31-1ebd-4c82-8c6a-476d5d277cdd",
      "parameters": {
        "amount": 60
      },
      "typeVersion": 1.1
    },
    {
      "id": "ba935542-d2c0-4781-b6f9-5e1e007a9740",
      "name": "Sticky Note2",
      "type": "n8n-nodes-base.stickyNote",
      "position": [
        0,
        1584
      ],
      "parameters": {
        "width": 288,
        "height": 352,
        "content": "## Normalized Content\n\n* Removes noise\n* Reduces duplication\n* Improves retrieval quality \n* Preserves context \n* Consistent format \n* Prevents wasted tokens \n\n**Note: Update the code based on your requirements**"
      },
      "typeVersion": 1
    },
    {
      "id": "45efcdf9-89a9-4638-a9ed-cac39506270f",
      "name": "Sticky Note3",
      "type": "n8n-nodes-base.stickyNote",
      "position": [
        -2080,
        1472
      ],
      "parameters": {
        "width": 464,
        "height": 1200,
        "content": "## Try It Out!  \n### This n8n template demonstrates how to normalize, index, and query insurance PDFs using AI and Pinecone for a full **RAG (Retrieval-Augmented Generation)** workflow.  \n\n### Use cases include: creating **chatbots or Q&A systems** for structured documents, extracting insights from insurance policies, or managing compliance/legal PDFs efficiently.  \n\n---\n\n## How it works\n* New PDFs are automatically detected from a **Google Drive** folder.  \n* PDFs are sent to **LlamaIndex Cloud** for parsing → returns clean Markdown text.  \n* Text is normalized to remove headers, footers, page numbers, and formatting artifacts.  \n* The normalized text is split into chunks (~1200 characters with 150-character overlap) for better embedding.  \n* **OpenAI embeddings** are generated for each chunk.  \n* Chunks and metadata are stored in **Pinecone** for semantic search.  \n* A **Chat Agent** queries Pinecone to retrieve answers from your document vector database.  \n\n---\n\n### How to use\n* Update the folder name in the Google Drive Trigger node.  \n* Place a PDF file in the same Google Drive folder.\n* Customize the `Normalized Content` Code node to adjust the regex for headers/footers specific to your documents.  \n* Adjust the chunk size in the text splitter node or the namespace in the Pinecone node to fit your project needs.  \n\n---\n\n### Requirements\n* Google Drive account for PDF source files.  \n* **LlamaIndex Cloud** account (parsing API key).  \n* **Pinecone** account for vector storage.  \n* **OpenAI** account for model and embeddings.   \n\n---\n\n### Need Help?  \nAsk in the [n8n Forum](https://community.n8n.io/)!  \n\nHappy Automating! 🚀\n"
      },
      "typeVersion": 1
    },
    {
      "id": "34f5ba5e-7f4e-4c94-a4e8-41bfbaf163a1",
      "name": "Upload to Llama Cloud",
      "type": "n8n-nodes-base.httpRequest",
      "position": [
        -944,
        2088
      ],
      "parameters": {
        "url": "https://api.cloud.llamaindex.ai/api/v1/parsing/upload",
        "method": "POST",
        "options": {},
        "sendBody": true,
        "contentType": "multipart-form-data",
        "sendHeaders": true,
        "authentication": "genericCredentialType",
        "bodyParameters": {
          "parameters": [
            {
              "name": "file",
              "parameterType": "formBinaryData",
              "inputDataFieldName": "data"
            }
          ]
        },
        "genericAuthType": "httpBearerAuth",
        "headerParameters": {
          "parameters": [
            {
              "name": "accept",
              "value": "application/json"
            },
            {
              "name": "Content-Type",
              "value": "multipart/form-data"
            }
          ]
        }
      },
      "credentials": {
        "httpBearerAuth": {
          "id": "FlAAm17M7G6as02l",
          "name": "learnby_llama_cloud"
        }
      },
      "executeOnce": false,
      "retryOnFail": true,
      "typeVersion": 4.2,
      "alwaysOutputData": false
    },
    {
      "id": "1199e4ff-1952-4225-b655-1f63875f8903",
      "name": "Check parse status",
      "type": "n8n-nodes-base.httpRequest",
      "position": [
        -496,
        2088
      ],
      "parameters": {
        "url": "=https://api.cloud.llamaindex.ai/api/parsing/job/{{ $('Upload to Llama Cloud').item.json.id }}",
        "options": {},
        "sendHeaders": true,
        "authentication": "genericCredentialType",
        "genericAuthType": "httpBearerAuth",
        "headerParameters": {
          "parameters": [
            {
              "name": "accept",
              "value": "application/json"
            }
          ]
        }
      },
      "credentials": {
        "httpBearerAuth": {
          "id": "FlAAm17M7G6as02l",
          "name": "learnby_llama_cloud"
        }
      },
      "retryOnFail": true,
      "typeVersion": 4.2
    },
    {
      "id": "cfaf9e10-0297-423a-b3f4-c25561c92078",
      "name": "Extract Markdown from Llama Cloud",
      "type": "n8n-nodes-base.httpRequest",
      "position": [
        -48,
        1968
      ],
      "parameters": {
        "url": "=https://api.cloud.llamaindex.ai/api/v1/parsing/job/{{ $json.id }}/result/markdown",
        "options": {},
        "sendHeaders": true,
        "authentication": "genericCredentialType",
        "genericAuthType": "httpBearerAuth",
        "headerParameters": {
          "parameters": [
            {
              "name": "accept",
              "value": "application/json"
            }
          ]
        }
      },
      "credentials": {
        "httpBearerAuth": {
          "id": "FlAAm17M7G6as02l",
          "name": "learnby_llama_cloud"
        }
      },
      "retryOnFail": true,
      "typeVersion": 4.2
    },
    {
      "id": "564a0930-e80b-4db6-a62a-5224248e5cd9",
      "name": "Normalize text",
      "type": "n8n-nodes-base.code",
      "position": [
        176,
        1968
      ],
      "parameters": {
        "mode": "runOnceForEachItem",
        "jsCode": "// Get the input text from the previous node\nconst input = $json.markdown || $json.text || \"\";\n\n// Remove the repeated page header (\"Car Insurance Policy <n>\" in the sample PDFs)\nlet text = input.replace(/Car Insurance Policy\\s*\\d+/gi, \"\");\n\n// Remove \"Page X\" markers\ntext = text.replace(/Page\\s*\\d+/gi, \"\");\n\n// Replace --- dividers with a single newline\ntext = text.replace(/-{3,}/g, \"\\n\");\n\n// Decode & cleanup artifacts\ntext = text.replace(/&#x26;/g, \"&\");   // fix HTML entities\ntext = text.replace(/[ⓤ]/g, \"-\");      // replace bullet symbols with dashes\n\n// Collapse whitespace\ntext = text.replace(/\\n{2,}/g, \"\\n\\n\"); // keep paragraph breaks\ntext = text.replace(/[ \\t]+/g, \" \");    // collapse spaces\n\n// Trim leading and trailing whitespace\ntext = text.trim();\n\n// Output for next node\nreturn { json: { normalizedText: text } };\n"
      },
      "typeVersion": 2
    },
    {
      "id": "fe32694a-2cbc-4ad4-88aa-4eb3dba0256c",
      "name": "Chunk text",
      "type": "@n8n/n8n-nodes-langchain.textSplitterRecursiveCharacterTextSplitter",
      "position": [
        608,
        2400
      ],
      "parameters": {
        "options": {
          "splitCode": "markdown"
        },
        "chunkSize": 1200,
        "chunkOverlap": 150
      },
      "typeVersion": 1
    },
    {
      "id": "b399c9fa-03d7-4126-9026-674d091b9ddf",
      "name": "Generate embeddings",
      "type": "@n8n/n8n-nodes-langchain.embeddingsOpenAi",
      "position": [
        400,
        2192
      ],
      "parameters": {
        "options": {}
      },
      "credentials": {
        "openAiApi": {
          "id": "Yj4Rt75fspowAEru",
          "name": "nextweb-openai"
        }
      },
      "typeVersion": 1.2
    },
    {
      "id": "1da27207-b77e-41d0-a249-5096ec8ac259",
      "name": "Store in Pinecone",
      "type": "@n8n/n8n-nodes-langchain.vectorStorePinecone",
      "position": [
        432,
        1968
      ],
      "parameters": {
        "mode": "insert",
        "options": {
          "pineconeNamespace": "rag"
        },
        "pineconeIndex": {
          "__rl": true,
          "mode": "id",
          "value": "demo"
        }
      },
      "credentials": {
        "pineconeApi": {
          "id": "uo1lZDPNWTsMAeOC",
          "name": "learnby-PineconeApi-account"
        }
      },
      "notesInFlow": false,
      "typeVersion": 1.3
    },
    {
      "id": "4f65ec8b-f936-41cb-b05c-0cc710df1c9e",
      "name": "Sticky Note4",
      "type": "n8n-nodes-base.stickyNote",
      "position": [
        -1568,
        1728
      ],
      "parameters": {
        "color": 6,
        "width": 464,
        "height": 640,
        "content": "## Extract Data"
      },
      "typeVersion": 1
    }
  ],
  "active": false,
  "pinData": {},
  "settings": {
    "executionOrder": "v1"
  },
  "versionId": "5ec0ee83-34cd-423d-8bd5-41400bde4a4a",
  "connections": {
    "9bb49bf6-a02e-4cf7-a1d1-ca4addff2bc6": {
      "main": [
        [
          {
            "node": "cfaf9e10-0297-423a-b3f4-c25561c92078",
            "type": "main",
            "index": 0
          }
        ],
        [
          {
            "node": "28654aff-e603-4080-9e7a-e706aaee47c4",
            "type": "main",
            "index": 0
          }
        ]
      ]
    },
    "05034e35-f6bf-45a6-860e-94f4da566daf": {
      "main": [
        [
          {
            "node": "1199e4ff-1952-4225-b655-1f63875f8903",
            "type": "main",
            "index": 0
          }
        ]
      ]
    },
    "28654aff-e603-4080-9e7a-e706aaee47c4": {
      "main": [
        [
          {
            "node": "1199e4ff-1952-4225-b655-1f63875f8903",
            "type": "main",
            "index": 0
          }
        ]
      ]
    },
    "fe32694a-2cbc-4ad4-88aa-4eb3dba0256c": {
      "ai_textSplitter": [
        [
          {
            "node": "127b41ed-ad45-4234-b87f-4f3c2b6ea531",
            "type": "ai_textSplitter",
            "index": 0
          }
        ]
      ]
    },
    "ff933f76-d719-40b5-b193-8a29e5fa2197": {
      "main": [
        [
          {
            "node": "34f5ba5e-7f4e-4c94-a4e8-41bfbaf163a1",
            "type": "main",
            "index": 0
          }
        ]
      ]
    },
    "564a0930-e80b-4db6-a62a-5224248e5cd9": {
      "main": [
        [
          {
            "node": "1da27207-b77e-41d0-a249-5096ec8ac259",
            "type": "main",
            "index": 0
          }
        ]
      ]
    },
    "1da27207-b77e-41d0-a249-5096ec8ac259": {
      "main": [
        []
      ]
    },
    "127b41ed-ad45-4234-b87f-4f3c2b6ea531": {
      "ai_document": [
        [
          {
            "node": "1da27207-b77e-41d0-a249-5096ec8ac259",
            "type": "ai_document",
            "index": 0
          }
        ]
      ]
    },
    "b399c9fa-03d7-4126-9026-674d091b9ddf": {
      "ai_embedding": [
        [
          {
            "node": "1da27207-b77e-41d0-a249-5096ec8ac259",
            "type": "ai_embedding",
            "index": 0
          }
        ]
      ]
    },
    "1199e4ff-1952-4225-b655-1f63875f8903": {
      "main": [
        [
          {
            "node": "9bb49bf6-a02e-4cf7-a1d1-ca4addff2bc6",
            "type": "main",
            "index": 0
          }
        ]
      ]
    },
    "19b009db-a418-458c-a216-bdcc9af6fd2f": {
      "main": [
        [
          {
            "node": "ff933f76-d719-40b5-b193-8a29e5fa2197",
            "type": "main",
            "index": 0
          }
        ]
      ]
    },
    "34f5ba5e-7f4e-4c94-a4e8-41bfbaf163a1": {
      "main": [
        [
          {
            "node": "05034e35-f6bf-45a6-860e-94f4da566daf",
            "type": "main",
            "index": 0
          }
        ]
      ]
    },
    "cfaf9e10-0297-423a-b3f4-c25561c92078": {
      "main": [
        [
          {
            "node": "564a0930-e80b-4db6-a62a-5224248e5cd9",
            "type": "main",
            "index": 0
          }
        ]
      ]
    }
  }
}
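
To tune the cleanup regexes and chunk settings before running the workflow, the text-preparation steps can be tried outside n8n. The sketch below wraps the Code node's regex steps in a plain function and adds a simplified fixed-window chunker using the configured size (1200) and overlap (150). `normalizeText` and `chunkText` are hypothetical names. The chunker is an assumption for illustration only: the workflow's Recursive Character Text Splitter also tries to break on natural separators, which this sketch does not.

```javascript
// Same regex steps as the normalization Code node, wrapped in a plain function.
function normalizeText(input) {
  let text = input.replace(/Car Insurance Policy\s*\d+/gi, ""); // sample-specific page header
  text = text.replace(/Page\s*\d+/gi, "");  // "Page X" markers
  text = text.replace(/-{3,}/g, "\n");      // --- dividers become a newline
  text = text.replace(/&#x26;/g, "&");      // HTML entity for "&"
  text = text.replace(/[ⓤ]/g, "-");         // bullet symbol becomes a dash
  text = text.replace(/\n{2,}/g, "\n\n");   // at most one blank line between paragraphs
  text = text.replace(/[ \t]+/g, " ");      // collapse runs of spaces and tabs
  return text.trim();
}

// Simplified fixed-window chunker (illustrative assumption): slides a window of
// `chunkSize` characters forward by `chunkSize - chunkOverlap` characters.
function chunkText(text, chunkSize = 1200, chunkOverlap = 150) {
  if (chunkOverlap >= chunkSize) {
    throw new RangeError("chunkOverlap must be smaller than chunkSize");
  }
  const chunks = [];
  const step = chunkSize - chunkOverlap;
  for (let start = 0; start < text.length; start += step) {
    chunks.push(text.slice(start, start + chunkSize));
    if (start + chunkSize >= text.length) break; // this window reached the end
  }
  return chunks;
}

const sample = "Car Insurance Policy 12\n# Cover\n\n\n\nTheft &#x26;   fire\n---\nPage 3";
console.log(normalizeText(sample));
console.log(chunkText("a".repeat(3000)).map((c) => c.length));
```

With the sample input, the header, page marker, and divider are removed and the text collapses to two short paragraphs. A 3000-character string yields chunks of 1200, 1200, and 900 characters, each sharing 150 characters with its predecessor.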

Frequently asked questions

How do I use this workflow?

Copy the JSON configuration above, create a new workflow in your n8n instance, select "Import from JSON", paste the configuration, and then update the credential settings as needed.

Which scenarios is this workflow suited for?

Advanced - AI RAG, Multimodal AI

Is it paid?

This workflow is completely free; you can import and use it directly. Note, however, that third-party services used in the workflow (such as the OpenAI API) may charge fees that you pay yourself.
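
Because the export carries the author's credential IDs and names, each credential has to be remapped to one of your own after import. As a rough aid, the sketch below lists which credentials a workflow export references and which nodes use them. `listCredentials` is a hypothetical helper, not an n8n API, and the `workflow` object is a trimmed stand-in for the full export.

```javascript
// Collects every credential referenced by a workflow export so each one can be
// remapped after import. `listCredentials` is a hypothetical helper, not an n8n API.
function listCredentials(workflow) {
  const byCredential = new Map();
  for (const node of workflow.nodes ?? []) {
    for (const [type, ref] of Object.entries(node.credentials ?? {})) {
      const key = `${type}: ${ref.name}`;
      if (!byCredential.has(key)) byCredential.set(key, []);
      byCredential.get(key).push(node.name); // remember which node uses it
    }
  }
  return Object.fromEntries(byCredential);
}

// Trimmed stand-in for the export: one node with a credential, one without.
const workflow = {
  nodes: [
    {
      name: "Google Drive Trigger",
      credentials: {
        googleDriveOAuth2Api: { id: "aU33fzddE6s3ZQw6", name: "LearnBy-Google-Drive" },
      },
    },
    { name: "Default Data Loader" },
  ],
};
console.log(listCredentials(workflow));
```

Run against the full export, the same function would group nodes under each Google Drive, bearer-token, OpenAI, and Pinecone credential they reference.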

Workflow information

Difficulty level

Advanced

Number of nodes: 18

Categories: 2

Node types: 11

Difficulty description

Suitable for advanced users; complex workflows with 16+ nodes

Author

Alok Kumar

@alokkumar

I am a Principal Software Engineer based in Ireland with a deep passion for AI and emerging technologies. With extensive experience in designing and implementing scalable software solutions, I focus on leveraging artificial intelligence to solve real-world problems. I enjoy exploring innovative applications of AI, from intelligent automation to data-driven insights, and I’m dedicated to building systems that are both efficient and impactful.
