Overview
Document Processing tools extract text and table content from common document formats, including PDF and Word (DOCX) files, and can translate PowerPoint (PPTX) presentations. These tools are essential for document analysis, content extraction, and text processing workflows.
Key Features
- PDF_EXTRACT_TEXT: Extract text content from PDF files
- DOCX_EXTRACT_TEXT: Extract text content from DOCX files
- WORD_TABLE_EXTRACT: Extract table data from Word documents as JSON
- WORD_TABLE_UPDATE: Update table data in Word documents
- TRANSLATE_PPTX_FILE: Translate text content in PowerPoint (PPTX) files to different languages
Authentication
No authentication is required. Document processing tools work directly with file URLs or base64 encoded files.
Example: Basic Text Extraction
- id: extract_pdf_text
  name: extract_pdf_text
  tool: PDF_EXTRACT_TEXT
  input:
    - name: base64_file
      value: "{{steps.upload_pdf.result.base64}}"

- id: extract_docx_text
  name: extract_docx_text
  tool: DOCX_EXTRACT_TEXT
  input:
    - name: base64_file
      value: "{{steps.upload_docx.result.base64}}"

- id: analyze_extracted_text
  name: analyze_extracted_text
  tool: OPENAI_INVOKE
  config:
    - name: version
      value: gpt-4
  input:
    - name: prompt
      value: |
        Please analyze the following extracted text and provide:
        1. A summary of the main topics
        2. Key findings or important points
        3. Any action items mentioned

        PDF Content: {{steps.extract_pdf_text.result.text}}
        DOCX Content: {{steps.extract_docx_text.result.text}}
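The base64_file inputs above expect the file's raw bytes encoded as base64. If you are producing that payload yourself rather than taking it from an upload step, a minimal sketch (the file path is a hypothetical placeholder):

import base64

# Read the document and encode it for a base64_file input
# ("report.pdf" is a placeholder path)
with open("report.pdf", "rb") as f:
    base64_content = base64.b64encode(f.read()).decode("utf-8")

print(base64_content[:60], "...")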
Example: Word Table Processing
- id: extract_word_table
  name: extract_word_table
  tool: WORD_TABLE_EXTRACT
  config:
    - name: timeout
      value: 300000
  input:
    - name: file_url
      value: "https://example.com/document.docx"
    - name: table_index
      value: 0

- id: process_table_data
  name: process_table_data
  tool: PYTHON_SANDBOX_RUN
  input:
    - name: script
      value: |
        import json

        # Get table data from the previous step
        table_data = {{steps.extract_word_table.result.table}}

        # Convert data rows to dictionaries keyed by the header row
        processed_data = []
        headers = table_data[0] if table_data else []
        for row in table_data[1:]:  # Skip header row
            row_dict = {}
            for i, cell in enumerate(row):
                if i < len(headers):
                    row_dict[headers[i]] = cell
            processed_data.append(row_dict)

        print(json.dumps({"processed_table": processed_data}))

- id: save_to_spreadsheet
  name: save_to_spreadsheet
  tool: GOOGLE_SPREADSHEET_ADD_ROWS
  config:
    - name: credentials
      value: "{{secrets.GOOGLE_SHEETS_CREDENTIALS}}"
  input:
    - name: spreadsheet_id
      value: "your_spreadsheet_id"
    - name: sheet_name
      value: "ExtractedData"
    - name: values
      value: "{{steps.extract_word_table.result.table}}"
Example: Document Analysis Pipeline
- id: upload_document
  name: upload_document
  tool: INPUT_FILE
  input:
    - name: description
      value: "Document to analyze"

- id: determine_file_type
  name: determine_file_type
  tool: PYTHON_SANDBOX_RUN
  input:
    - name: script
      value: |
        import json

        # Get file info from the upload step
        file_info = {{steps.upload_document.result}}
        file_name = file_info.get('filename', '').lower()

        # Determine the file type from the extension
        if file_name.endswith('.pdf'):
            file_type = 'pdf'
        elif file_name.endswith('.docx'):
            file_type = 'docx'
        else:
            file_type = 'unknown'

        print(json.dumps({"file_type": file_type, "filename": file_name}))

- id: extract_text_pdf
  name: extract_text_pdf
  tool: PDF_EXTRACT_TEXT
  condition: "{{steps.determine_file_type.result.file_type == 'pdf'}}"
  input:
    - name: base64_file
      value: "{{steps.upload_document.result.base64}}"

- id: extract_text_docx
  name: extract_text_docx
  tool: DOCX_EXTRACT_TEXT
  condition: "{{steps.determine_file_type.result.file_type == 'docx'}}"
  input:
    - name: base64_file
      value: "{{steps.upload_document.result.base64}}"

- id: process_extracted_text
  name: process_extracted_text
  tool: PYTHON_SANDBOX_RUN
  input:
    - name: script
      value: |
        import json

        # Get extracted text based on file type (quoted so the rendered
        # template is a valid Python string)
        pdf_text = "{{steps.extract_text_pdf.result.text if steps.extract_text_pdf else ''}}"
        docx_text = "{{steps.extract_text_docx.result.text if steps.extract_text_docx else ''}}"

        # Combine and clean text
        extracted_text = pdf_text or docx_text or ""

        # Basic text analysis
        word_count = len(extracted_text.split())
        char_count = len(extracted_text)

        # Extract potentially important sections
        lines = extracted_text.split('\n')
        important_lines = [line.strip() for line in lines if
                           any(keyword in line.lower() for keyword in
                               ['summary', 'conclusion', 'action', 'todo', 'next steps'])]

        result = {
            "text": extracted_text,
            "word_count": word_count,
            "character_count": char_count,
            "important_sections": important_lines[:10]  # Limit to top 10
        }
        print(json.dumps(result))

- id: generate_document_summary
  name: generate_document_summary
  tool: OPENAI_INVOKE
  config:
    - name: version
      value: gpt-4
  input:
    - name: prompt
      value: |
        Please create a comprehensive summary of this document:

        Document: {{steps.determine_file_type.result.filename}}
        Word Count: {{steps.process_extracted_text.result.word_count}}

        Content:
        {{steps.process_extracted_text.result.text}}

        Please provide:
        1. Executive Summary (2-3 sentences)
        2. Key Topics and Themes
        3. Important Facts or Data Points
        4. Action Items or Recommendations
        5. Overall Assessment
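The keyword filter used in process_extracted_text is easy to experiment with on its own; a minimal sketch with sample text (the keyword list mirrors the step above):

# Pull out lines that look like summaries, conclusions, or action items
KEYWORDS = ['summary', 'conclusion', 'action', 'todo', 'next steps']

sample_text = """Quarterly Report
Summary: revenue grew 4% quarter over quarter.
Detailed figures follow in the appendix.
Next steps: confirm the Q3 forecast with finance."""

important_lines = [line.strip() for line in sample_text.split('\n')
                   if any(keyword in line.lower() for keyword in KEYWORDS)]

print(important_lines[:10])  # limit to the top 10 matches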
Example: Batch Document Processing
- id: get_document_list
  name: get_document_list
  tool: INPUT_JSON
  input:
    - name: value
      value: {
        "documents": [
          {"url": "https://example.com/doc1.pdf", "name": "Report1"},
          {"url": "https://example.com/doc2.docx", "name": "Report2"},
          {"url": "https://example.com/doc3.pdf", "name": "Report3"}
        ]
      }

- id: process_documents
  name: process_documents
  tool: PYTHON_SANDBOX_RUN
  input:
    - name: script
      value: |
        import json
        import base64
        import requests

        documents = {{steps.get_document_list.result.documents}}
        processed_docs = []

        for doc in documents:
            try:
                # Download the document; raise on HTTP errors so failures are
                # recorded below instead of being silently skipped
                response = requests.get(doc['url'], timeout=60)
                response.raise_for_status()
                base64_content = base64.b64encode(response.content).decode('utf-8')
                processed_docs.append({
                    "name": doc['name'],
                    "url": doc['url'],
                    "base64": base64_content,
                    "type": "pdf" if doc['url'].endswith('.pdf') else "docx",
                    "status": "ready"
                })
            except Exception as e:
                processed_docs.append({
                    "name": doc['name'],
                    "url": doc['url'],
                    "status": "error",
                    "error": str(e)
                })

        print(json.dumps({"documents": processed_docs}))
Example: PowerPoint Translation
- id: upload_presentation
  name: upload_presentation
  tool: INPUT_FILE
  input:
    - name: description
      value: "Upload PowerPoint file to translate"

- id: translate_to_spanish
  name: translate_to_spanish
  tool: TRANSLATE_PPTX_FILE
  input:
    - name: value
      value: "{{steps.upload_presentation.result.file_url}}"
    - name: target_lang
      value: "es"

- id: translate_to_french
  name: translate_to_french
  tool: TRANSLATE_PPTX_FILE
  input:
    - name: value
      value: "{{steps.upload_presentation.result.file_url}}"
    - name: target_lang
      value: "fr"

- id: translate_to_japanese
  name: translate_to_japanese
  tool: TRANSLATE_PPTX_FILE
  input:
    - name: value
      value: "{{steps.upload_presentation.result.file_url}}"
    - name: target_lang
      value: "ja"

- id: create_download_links
  name: create_download_links
  tool: PYTHON_SANDBOX_RUN
  input:
    - name: script
      value: |
        import json

        # Collect translation results
        original_file = "{{steps.upload_presentation.result.filename}}"
        spanish_url = "{{steps.translate_to_spanish.result.url}}"
        french_url = "{{steps.translate_to_french.result.url}}"
        japanese_url = "{{steps.translate_to_japanese.result.url}}"

        translations = {
            "original_file": original_file,
            "translations": [
                {"language": "Spanish (es)", "download_url": spanish_url},
                {"language": "French (fr)", "download_url": french_url},
                {"language": "Japanese (ja)", "download_url": japanese_url}
            ]
        }

        print("=== PowerPoint Translation Complete ===")
        print(f"Original file: {original_file}")
        print("\nTranslated versions available:")
        for trans in translations["translations"]:
            print(f"  • {trans['language']}: {trans['download_url']}")

        # Output as JSON for further processing
        print(f"\nJSON Output: {json.dumps(translations)}")
Example: Multilingual Presentation Workflow
- id: get_presentation_url
  name: get_presentation_url
  tool: INPUT_TEXT
  input:
    - name: description
      value: "Enter the URL of the PowerPoint file to translate"

- id: get_target_languages
  name: get_target_languages
  tool: INPUT_JSON
  input:
    - name: description
      value: "Enter target languages as JSON array"
    - name: value
      value: ["es", "fr", "de", "it", "pt", "ja", "ko", "zh"]
- id: process_translations
  name: process_translations
  tool: PYTHON_SANDBOX_RUN
  input:
    - name: script
      value: |
        import json

        presentation_url = "{{steps.get_presentation_url.result}}"
        target_languages = {{steps.get_target_languages.result}}

        # Language code to name mapping
        language_names = {
            "es": "Spanish", "fr": "French", "de": "German", "it": "Italian",
            "pt": "Portuguese", "ja": "Japanese", "ko": "Korean", "zh": "Chinese"
        }

        print("Starting batch translation process...")
        print(f"Source file: {presentation_url}")
        print(f"Target languages: {', '.join([language_names.get(lang, lang) for lang in target_languages])}")

        # Store the languages for the next steps
        batch_info = {
            "source_url": presentation_url,
            "languages": target_languages,
            "language_names": language_names,
            "total_count": len(target_languages)
        }
        print(f"\nJSON Output: {json.dumps(batch_info)}")

        # Note: In a real workflow, you would need to create a separate
        # translation step for each language or use a loop mechanism.
        # This example shows the concept.
- id: translate_spanish
  name: translate_spanish
  tool: TRANSLATE_PPTX_FILE
  condition: "{{contains(steps.process_translations.result.languages, 'es')}}"
  input:
    - name: value
      value: "{{steps.process_translations.result.source_url}}"
    - name: target_lang
      value: "es"

- id: translate_french
  name: translate_french
  tool: TRANSLATE_PPTX_FILE
  condition: "{{contains(steps.process_translations.result.languages, 'fr')}}"
  input:
    - name: value
      value: "{{steps.process_translations.result.source_url}}"
    - name: target_lang
      value: "fr"
- id: compile_results
  name: compile_results
  tool: PYTHON_SANDBOX_RUN
  input:
    - name: script
      value: |
        import json

        results = []

        # Check whether the Spanish translation completed
        if "{{steps.translate_spanish.result.url if steps.translate_spanish else ''}}":
            results.append({
                "language": "Spanish",
                "code": "es",
                "status": "completed",
                "download_url": "{{steps.translate_spanish.result.url}}"
            })

        # Check whether the French translation completed
        if "{{steps.translate_french.result.url if steps.translate_french else ''}}":
            results.append({
                "language": "French",
                "code": "fr",
                "status": "completed",
                "download_url": "{{steps.translate_french.result.url}}"
            })

        print("=== Translation Summary ===")
        print(f"Completed translations: {len(results)}")
        for result in results:
            print(f"  ✅ {result['language']} ({result['code']})")
            print(f"     Download: {result['download_url']}")

        # Create a summary report
        summary = {
            "total_completed": len(results),
            "translations": results,
            "timestamp": "{{now}}"
        }
        print(f"\nFinal Summary: {json.dumps(summary, indent=2)}")
- id: notify_completion
  name: notify_completion
  tool: SLACK_POST_MESSAGE
  input:
    - name: channel
      value: "#translations"
    - name: text
      value: |
        🎉 PowerPoint translation job completed!
        📄 Source file processed
        ✅ {{steps.compile_results.result.total_completed}} language(s) completed

        Results:
        {{#each steps.compile_results.result.translations}}
        • {{language}} ({{code}}): {{download_url}}
        {{/each}}
Supported Languages
The TRANSLATE_PPTX_FILE tool supports translation into a wide range of languages using standard language codes.
Popular Language Codes:
- en: English
- es: Spanish
- fr: French
- de: German
- it: Italian
- pt: Portuguese
- ru: Russian
- ja: Japanese
- ko: Korean
- zh: Chinese (Simplified)
- ar: Arabic
- hi: Hindi
- nl: Dutch
- sv: Swedish
- da: Danish
- no: Norwegian
- fi: Finnish
Usage Notes:
- Translation preserves original formatting and slide structure
- Text in images cannot be translated (only text boxes and shapes)
- Complex animations and transitions are preserved
- File size may vary slightly after translation
- Processing time depends on presentation size and complexity
Tips and Best Practices
- Base64 encoded file support enables secure processing of documents that should not be exposed at a public URL
- Always validate file types before processing (see the validation sketch after this list)
- Consider file size limitations when processing large documents
- Use appropriate timeouts for large document processing
- Implement error handling for corrupt or unsupported files
- Extract tables separately from text for structured data analysis
- Consider text cleaning and preprocessing for better analysis results
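For the file type validation tip above, checking a file's leading bytes is more reliable than trusting its extension; a minimal sketch (PDF files begin with %PDF-, while DOCX and PPTX files are ZIP containers that begin with PK):

import base64

def detect_type(base64_content: str) -> str:
    """Guess a document type from its magic bytes rather than its name."""
    head = base64.b64decode(base64_content)[:5]
    if head.startswith(b"%PDF-"):
        return "pdf"
    if head.startswith(b"PK"):  # DOCX/PPTX are ZIP containers
        return "zip-based (docx or pptx)"
    return "unknown"

# Example: a base64 payload such as one produced by an upload step
sample = base64.b64encode(b"%PDF-1.7 ...").decode("utf-8")
print(detect_type(sample))  # -> pdf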