Overview

Document Processing tools let you extract text content from PDF and Word (DOCX) documents, extract and update tables in Word documents, and translate PowerPoint (PPTX) presentations. These tools are essential for document analysis, content extraction, and text processing workflows.

Key Features

  • PDF_EXTRACT_TEXT: Extract text content from PDF files
  • DOCX_EXTRACT_TEXT: Extract text content from DOCX files
  • WORD_TABLE_EXTRACT: Extract table data from Word documents as JSON
  • WORD_TABLE_UPDATE: Update table data in Word documents
  • TRANSLATE_PPTX_FILE: Translate text content in PowerPoint (PPTX) files into other languages

Authentication

No authentication required. Document processing tools work directly with file URLs or base64 encoded files.
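Tools that take a `base64_file` input expect the raw file bytes encoded as a base64 string. Outside a workflow, such a payload can be produced with Python's standard library (a minimal sketch; `file_to_base64` is an illustrative helper, not part of the toolset):

```python
import base64

def file_to_base64(path: str) -> str:
    """Read a file and return its contents as a base64 string,
    suitable for a tool's base64_file input."""
    with open(path, "rb") as f:
        return base64.b64encode(f.read()).decode("ascii")

# Round-trip check on a small in-memory example instead of a real file:
encoded = base64.b64encode(b"%PDF-1.7 example").decode("ascii")
assert base64.b64decode(encoded) == b"%PDF-1.7 example"
print(encoded)
```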

Example: Basic Text Extraction

- id: extract_pdf_text
  name: extract_pdf_text
  tool: PDF_EXTRACT_TEXT
  input:
    - name: base64_file
      value: "{{steps.upload_pdf.result.base64}}"

- id: extract_docx_text
  name: extract_docx_text
  tool: DOCX_EXTRACT_TEXT
  input:
    - name: base64_file
      value: "{{steps.upload_docx.result.base64}}"

- id: analyze_extracted_text
  name: analyze_extracted_text
  tool: OPENAI_INVOKE
  config:
    - name: version
      value: gpt-4
  input:
    - name: prompt
      value: |
        Please analyze the following extracted text and provide:
        1. A summary of the main topics
        2. Key findings or important points
        3. Any action items mentioned
        
        PDF Content: {{steps.extract_pdf_text.result.text}}
        DOCX Content: {{steps.extract_docx_text.result.text}}

Example: Word Table Processing

- id: extract_word_table
  name: extract_word_table
  tool: WORD_TABLE_EXTRACT
  config:
    - name: timeout
      value: 300000
  input:
    - name: file_url
      value: "https://example.com/document.docx"
    - name: table_index
      value: 0

- id: process_table_data
  name: process_table_data
  tool: PYTHON_SANDBOX_RUN
  input:
    - name: script
      value: |
        import json
        
        # Get table data from previous step
        table_data = {{steps.extract_word_table.result.table}}
        
        # Process the table data
        processed_data = []
        headers = table_data[0] if table_data else []
        
        for row in table_data[1:]:  # Skip header row
            row_dict = {}
            for i, cell in enumerate(row):
                if i < len(headers):
                    row_dict[headers[i]] = cell
            processed_data.append(row_dict)
        
        print(json.dumps({"processed_table": processed_data}))

- id: save_to_spreadsheet
  name: save_to_spreadsheet
  tool: GOOGLE_SPREADSHEET_ADD_ROWS
  config:
    - name: credentials
      value: "{{secrets.GOOGLE_SHEETS_CREDENTIALS}}"
  input:
    - name: spreadsheet_id
      value: "your_spreadsheet_id"
    - name: sheet_name
      value: "ExtractedData"
    - name: values
      value: "{{steps.extract_word_table.result.table}}"
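Spreadsheet tools generally expect rectangular row data, while tables extracted from Word documents can be ragged (for example, when cells are merged). A hedged pre-processing sketch that could run in a PYTHON_SANDBOX_RUN step before the append; `pad_rows` is an illustrative helper, not a built-in:

```python
def pad_rows(table, fill=""):
    """Pad every row to the width of the longest row so the table
    is rectangular before appending it to a spreadsheet."""
    width = max((len(row) for row in table), default=0)
    return [row + [fill] * (width - len(row)) for row in table]

# A ragged extracted table becomes rectangular:
print(pad_rows([["Name", "Qty"], ["Widget"], ["Gadget", "3", "extra"]]))
```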

Example: Document Analysis Pipeline

- id: upload_document
  name: upload_document
  tool: INPUT_FILE
  input:
    - name: description
      value: "Document to analyze"

- id: determine_file_type
  name: determine_file_type
  tool: PYTHON_SANDBOX_RUN
  input:
    - name: script
      value: |
        import json
        import base64
        
        # Get file info
        file_info = {{steps.upload_document.result}}
        file_name = file_info.get('filename', '').lower()
        
        if file_name.endswith('.pdf'):
            file_type = 'pdf'
        elif file_name.endswith('.docx'):
            file_type = 'docx'
        else:
            file_type = 'unknown'
        
        print(json.dumps({"file_type": file_type, "filename": file_name}))

- id: extract_text_pdf
  name: extract_text_pdf
  tool: PDF_EXTRACT_TEXT
  condition: "{{steps.determine_file_type.result.file_type == 'pdf'}}"
  input:
    - name: base64_file
      value: "{{steps.upload_document.result.base64}}"

- id: extract_text_docx
  name: extract_text_docx
  tool: DOCX_EXTRACT_TEXT
  condition: "{{steps.determine_file_type.result.file_type == 'docx'}}"
  input:
    - name: base64_file
      value: "{{steps.upload_document.result.base64}}"

- id: process_extracted_text
  name: process_extracted_text
  tool: PYTHON_SANDBOX_RUN
  input:
    - name: script
      value: |
        import json
        
        # Get extracted text based on file type
        pdf_text = "{{steps.extract_text_pdf.result.text if steps.extract_text_pdf else ''}}"
        docx_text = "{{steps.extract_text_docx.result.text if steps.extract_text_docx else ''}}"
        
        # Combine and clean text
        extracted_text = pdf_text or docx_text or ""
        
        # Basic text analysis
        word_count = len(extracted_text.split())
        char_count = len(extracted_text)
        
        # Extract potential important sections
        lines = extracted_text.split('\n')
        important_lines = [line.strip() for line in lines if 
                          any(keyword in line.lower() for keyword in 
                              ['summary', 'conclusion', 'action', 'todo', 'next steps'])]
        
        result = {
            "text": extracted_text,
            "word_count": word_count,
            "character_count": char_count,
            "important_sections": important_lines[:10]  # Limit to top 10
        }
        
        print(json.dumps(result))

- id: generate_document_summary
  name: generate_document_summary
  tool: OPENAI_INVOKE
  config:
    - name: version
      value: gpt-4
  input:
    - name: prompt
      value: |
        Please create a comprehensive summary of this document:
        
        Document: {{steps.determine_file_type.result.filename}}
        Word Count: {{steps.process_extracted_text.result.word_count}}
        
        Content:
        {{steps.process_extracted_text.result.text}}
        
        Please provide:
        1. Executive Summary (2-3 sentences)
        2. Key Topics and Themes
        3. Important Facts or Data Points
        4. Action Items or Recommendations
        5. Overall Assessment

Example: Batch Document Processing

- id: get_document_list
  name: get_document_list
  tool: INPUT_JSON
  input:
    - name: value
      value: {
        "documents": [
          {"url": "https://example.com/doc1.pdf", "name": "Report1"},
          {"url": "https://example.com/doc2.docx", "name": "Report2"},
          {"url": "https://example.com/doc3.pdf", "name": "Report3"}
        ]
      }

- id: process_documents
  name: process_documents
  tool: PYTHON_SANDBOX_RUN
  input:
    - name: script
      value: |
        import json
        import requests
        import base64
        
        documents = {{steps.get_document_list.result.documents}}
        processed_docs = []
        
        for doc in documents:
            try:
                # Download document
                response = requests.get(doc['url'])
                if response.status_code == 200:
                    base64_content = base64.b64encode(response.content).decode('utf-8')
                    
                    processed_docs.append({
                        "name": doc['name'],
                        "url": doc['url'],
                        "base64": base64_content,
                        "type": "pdf" if doc['url'].endswith('.pdf') else "docx",
                        "status": "ready"
                    })
            except Exception as e:
                processed_docs.append({
                    "name": doc['name'],
                    "url": doc['url'],
                    "status": "error",
                    "error": str(e)
                })
        
        print(json.dumps({"documents": processed_docs}))

Example: PowerPoint Translation

- id: upload_presentation
  name: upload_presentation
  tool: INPUT_FILE
  input:
    - name: description
      value: "Upload PowerPoint file to translate"

- id: translate_to_spanish
  name: translate_to_spanish
  tool: TRANSLATE_PPTX_FILE
  input:
    - name: value
      value: "{{steps.upload_presentation.result.file_url}}"
    - name: target_lang
      value: "es"

- id: translate_to_french
  name: translate_to_french
  tool: TRANSLATE_PPTX_FILE
  input:
    - name: value
      value: "{{steps.upload_presentation.result.file_url}}"
    - name: target_lang
      value: "fr"

- id: translate_to_japanese
  name: translate_to_japanese
  tool: TRANSLATE_PPTX_FILE
  input:
    - name: value
      value: "{{steps.upload_presentation.result.file_url}}"
    - name: target_lang
      value: "ja"

- id: create_download_links
  name: create_download_links
  tool: PYTHON_SANDBOX_RUN
  input:
    - name: script
      value: |
        import json
        
        # Collect translation results
        original_file = "{{steps.upload_presentation.result.filename}}"
        spanish_url = "{{steps.translate_to_spanish.result.url}}"
        french_url = "{{steps.translate_to_french.result.url}}"
        japanese_url = "{{steps.translate_to_japanese.result.url}}"
        
        translations = {
            "original_file": original_file,
            "translations": [
                {"language": "Spanish (es)", "download_url": spanish_url},
                {"language": "French (fr)", "download_url": french_url},
                {"language": "Japanese (ja)", "download_url": japanese_url}
            ]
        }
        
        print("=== PowerPoint Translation Complete ===")
        print(f"Original file: {original_file}")
        print("\nTranslated versions available:")
        for trans in translations["translations"]:
            print(f"  • {trans['language']}: {trans['download_url']}")
        
        # Output as JSON for further processing
        print(f"\nJSON Output: {json.dumps(translations)}")

Example: Multilingual Presentation Workflow

- id: get_presentation_url
  name: get_presentation_url
  tool: INPUT_TEXT
  input:
    - name: description
      value: "Enter the URL of the PowerPoint file to translate"

- id: get_target_languages
  name: get_target_languages
  tool: INPUT_JSON
  input:
    - name: description
      value: "Enter target languages as JSON array"
    - name: value
      value: ["es", "fr", "de", "it", "pt", "ja", "ko", "zh"]

- id: process_translations
  name: process_translations
  tool: PYTHON_SANDBOX_RUN
  input:
    - name: script
      value: |
        import json
        import time
        
        presentation_url = "{{steps.get_presentation_url.result}}"
        target_languages = {{steps.get_target_languages.result}}
        
        # Language code to name mapping
        language_names = {
            "es": "Spanish", "fr": "French", "de": "German", "it": "Italian",
            "pt": "Portuguese", "ja": "Japanese", "ko": "Korean", "zh": "Chinese"
        }
        
        print("Starting batch translation process...")
        print(f"Source file: {presentation_url}")
        print(f"Target languages: {', '.join([language_names.get(lang, lang) for lang in target_languages])}")
        
        # Store the languages for the next steps
        batch_info = {
            "source_url": presentation_url,
            "languages": target_languages,
            "language_names": language_names,
            "total_count": len(target_languages)
        }
        
        print(f"\nJSON Output: {json.dumps(batch_info)}")

# Note: In a real workflow, you would need to create separate translation steps
# for each language or use a loop mechanism. This example shows the concept.

- id: translate_spanish
  name: translate_spanish
  tool: TRANSLATE_PPTX_FILE
  condition: "{{contains(steps.process_translations.result.languages, 'es')}}"
  input:
    - name: value
      value: "{{steps.process_translations.result.source_url}}"
    - name: target_lang
      value: "es"

- id: translate_french
  name: translate_french
  tool: TRANSLATE_PPTX_FILE
  condition: "{{contains(steps.process_translations.result.languages, 'fr')}}"
  input:
    - name: value
      value: "{{steps.process_translations.result.source_url}}"
    - name: target_lang
      value: "fr"

- id: compile_results
  name: compile_results
  tool: PYTHON_SANDBOX_RUN
  input:
    - name: script
      value: |
        import json
        
        results = []
        
        # Check if Spanish translation was completed
        if "{{steps.translate_spanish.result.url if steps.translate_spanish else ''}}":
            results.append({
                "language": "Spanish",
                "code": "es",
                "status": "completed",
                "download_url": "{{steps.translate_spanish.result.url}}"
            })
        
        # Check if French translation was completed
        if "{{steps.translate_french.result.url if steps.translate_french else ''}}":
            results.append({
                "language": "French",
                "code": "fr", 
                "status": "completed",
                "download_url": "{{steps.translate_french.result.url}}"
            })
        
        print("=== Translation Summary ===")
        print(f"Completed translations: {len(results)}")
        
        for result in results:
            print(f"  ✅ {result['language']} ({result['code']})")
            print(f"     Download: {result['download_url']}")
        
        # Create a summary report
        summary = {
            "total_completed": len(results),
            "translations": results,
            "timestamp": "{{now}}"
        }
        
        print(f"\nFinal Summary: {json.dumps(summary, indent=2)}")

- id: notify_completion
  name: notify_completion
  tool: SLACK_POST_MESSAGE
  input:
    - name: channel
      value: "#translations"
    - name: text
      value: |
        🎉 PowerPoint translation job completed!
        
        📄 Source file processed
        ✅ {{steps.compile_results.result.total_completed}} language(s) completed
        
        Results:
        {{#each steps.compile_results.result.translations}}
        • {{language}} ({{code}}): {{download_url}}
        {{/each}}

Supported Languages

The TRANSLATE_PPTX_FILE tool supports translation into a range of languages, identified by standard (ISO 639-1) language codes:
  • en: English
  • es: Spanish
  • fr: French
  • de: German
  • it: Italian
  • pt: Portuguese
  • ru: Russian
  • ja: Japanese
  • ko: Korean
  • zh: Chinese (Simplified)
  • ar: Arabic
  • hi: Hindi
  • nl: Dutch
  • sv: Swedish
  • da: Danish
  • no: Norwegian
  • fi: Finnish
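It can be useful to validate requested language codes against this list before creating translation steps, so unsupported codes are reported instead of failing mid-workflow. A sketch using the codes above; the helper and constant names are illustrative, not part of the toolset:

```python
# Supported target-language codes for TRANSLATE_PPTX_FILE (from the list above).
SUPPORTED_LANGS = {
    "en": "English", "es": "Spanish", "fr": "French", "de": "German",
    "it": "Italian", "pt": "Portuguese", "ru": "Russian", "ja": "Japanese",
    "ko": "Korean", "zh": "Chinese (Simplified)", "ar": "Arabic",
    "hi": "Hindi", "nl": "Dutch", "sv": "Swedish", "da": "Danish",
    "no": "Norwegian", "fi": "Finnish",
}

def validate_target_langs(requested):
    """Split requested codes into (valid, unknown) so a workflow can
    skip or report unsupported languages up front."""
    valid = [c for c in requested if c in SUPPORTED_LANGS]
    unknown = [c for c in requested if c not in SUPPORTED_LANGS]
    return valid, unknown

print(validate_target_langs(["es", "fr", "xx"]))
```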

Usage Notes

  • Translation preserves original formatting and slide structure
  • Text in images cannot be translated (only text boxes and shapes)
  • Complex animations and transitions are preserved
  • File size may vary slightly after translation
  • Processing time depends on presentation size and complexity

Tips and Best Practices

  • Prefer base64-encoded file input when files should not be exposed via public URLs
  • Always validate file types before processing
  • Consider file size limitations when processing large documents
  • Use appropriate timeouts for large document processing
  • Implement error handling for corrupt or unsupported files
  • Extract tables separately from text for structured data analysis
  • Consider text cleaning and preprocessing for better analysis results
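File-extension checks (as in the determine_file_type example above) can be fooled by renamed files. One common hardening step for "validate file types before processing" is to sniff the leading bytes instead: PDF files begin with b'%PDF', and DOCX/PPTX files are ZIP containers beginning with b'PK\x03\x04'. A sketch; `sniff_document_type` is an illustrative helper, and note that the ZIP signature alone cannot distinguish DOCX from PPTX:

```python
def sniff_document_type(data: bytes) -> str:
    """Guess a document type from its leading (magic) bytes rather
    than trusting the file extension."""
    if data.startswith(b"%PDF"):
        return "pdf"
    if data.startswith(b"PK\x03\x04"):
        # DOCX, PPTX, XLSX, or any other ZIP-based container.
        return "office-openxml"
    return "unknown"

print(sniff_document_type(b"%PDF-1.4 ..."))   # pdf
print(sniff_document_type(b"PK\x03\x04..."))  # office-openxml
```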