Overview

Jinba Modules provide data extraction, parsing, and validation tools. They use AI and machine-learning techniques to turn documents and other unstructured sources into structured data that matches a user-defined JSON schema, and to validate that data against JSON rules.

Key Features

JINBA_MODULES_EXTRACT

  • AI-powered data extraction from various sources
  • Configurable extraction modes (FAST, BALANCED, QUALITY)
  • User-defined JSON schema support
  • Intelligent content recognition and parsing
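
A minimal extraction step is sketched below. The parameter names (task_name, file_url, data_schema, extraction_mode) follow the full examples later on this page; the input reference and the schema contents are illustrative placeholders, not a documented default.

- id: quick_extract
  name: quick_extract
  tool: JINBA_MODULES_EXTRACT
  input:
    - name: task_name
      value: "Quick Extraction"
    - name: file_url
      value: "{{input.file_url}}"
    - name: data_schema
      value: |
        {
          "type": "object",
          "properties": {
            "title": {"type": "string"},
            "date": {"type": "string", "format": "date"}
          },
          "required": ["title"]
        }
    - name: extraction_mode
      value: "BALANCED"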

JINBA_MODULES_PARSE

  • Advanced document and data parsing
  • Structure recognition and preservation
  • Multi-format support
  • Context-aware content interpretation
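
A minimal parsing step is sketched below. The input_data and parsing_options parameters, and the option flags shown, follow the batch-processing example later on this page; the upstream step reference is illustrative.

- id: quick_parse
  name: quick_parse
  tool: JINBA_MODULES_PARSE
  input:
    - name: input_data
      value: "{{steps.quick_extract.result.extracted_data}}"
    - name: parsing_options
      value: |
        {
          "preserve_structure": true,
          "normalize_dates": true
        }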

JINBA_MODULES_CHECKER_V2

  • Enhanced data validation using JSON rules
  • Complex rule engine with multiple validation types
  • Detailed validation reporting
  • Improved performance and accuracy
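
A minimal validation step is sketched below. The data_content and rules_json parameters and the rule shape follow the examples later on this page; the field name is illustrative.

- id: quick_check
  name: quick_check
  tool: JINBA_MODULES_CHECKER_V2
  input:
    - name: data_content
      value: "{{steps.quick_parse.result.parsed_data}}"
    - name: rules_json
      value: |
        {
          "validation_rules": [
            {
              "field": "title",
              "type": "required",
              "error_message": "Title is required"
            }
          ]
        }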

Authentication

No authentication required for Jinba Modules tools.

Example: Intelligent Document Extraction

- id: upload_document
  name: upload_document
  tool: INPUT_FILE
  input:
    - name: description
      value: "Upload document for intelligent extraction"

- id: extract_structured_data
  name: extract_structured_data
  tool: JINBA_MODULES_EXTRACT
  input:
    - name: task_name
      value: "Invoice Data Extraction"
    - name: file_url
      value: "{{steps.upload_document.result.file_url}}"
    - name: data_schema
      value: |
        {
          "$schema": "http://json-schema.org/draft-07/schema#",
          "type": "object",
          "properties": {
            "invoice_number": {
              "type": "string",
              "description": "Invoice number or ID"
            },
            "date": {
              "type": "string",
              "format": "date",
              "description": "Invoice date"
            },
            "vendor": {
              "type": "object",
              "properties": {
                "name": {"type": "string"},
                "address": {"type": "string"},
                "phone": {"type": "string"},
                "email": {"type": "string"}
              }
            },
            "items": {
              "type": "array",
              "items": {
                "type": "object",
                "properties": {
                  "description": {"type": "string"},
                  "quantity": {"type": "number"},
                  "unit_price": {"type": "number"},
                  "total": {"type": "number"}
                }
              }
            },
            "total_amount": {
              "type": "number",
              "description": "Total invoice amount"
            },
            "tax_amount": {
              "type": "number",
              "description": "Tax amount if present"
            }
          },
          "required": ["invoice_number", "date", "total_amount"]
        }
    - name: extraction_mode
      value: "QUALITY"  # Options: FAST, BALANCED, QUALITY

- id: validate_extracted_data
  name: validate_extracted_data
  tool: JINBA_MODULES_CHECKER_V2
  input:
    - name: file_url
      value: "{{steps.extract_structured_data.result.file_url}}"
    - name: rules_json
      value: |
        {
          "validation_rules": [
            {
              "field": "invoice_number",
              "type": "required",
              "error_message": "Invoice number is required"
            },
            {
              "field": "total_amount",
              "type": "number",
              "min": 0,
              "error_message": "Total amount must be a positive number"
            },
            {
              "field": "date",
              "type": "date",
              "format": "YYYY-MM-DD",
              "error_message": "Date must be in valid format"
            },
            {
              "field": "vendor.email",
              "type": "email",
              "required": false,
              "error_message": "Email must be valid format if provided"
            }
          ]
        }

- id: process_extraction_results
  name: process_extraction_results
  tool: PYTHON_SANDBOX_RUN
  input:
    - name: code
      value: |
        import json
        
        # Process extraction results
        extracted_data = json.loads('''{{steps.extract_structured_data.result.extracted_data}}''')
        validation_results = json.loads('''{{steps.validate_extracted_data.result.validation_results}}''')
        
        print("Document Extraction Results")
        print("=" * 35)
        
        # Display extracted data
        print("📄 Extracted Information:")
        print(f"Invoice Number: {extracted_data.get('invoice_number', 'N/A')}")
        print(f"Date: {extracted_data.get('date', 'N/A')}")
        print(f"Vendor: {extracted_data.get('vendor', {}).get('name', 'N/A')}")
        print(f"Total Amount: ${extracted_data.get('total_amount', 0):,.2f}")
        
        if 'items' in extracted_data:
            print(f"Items Count: {len(extracted_data['items'])}")
        
        print("\n🔍 Validation Results:")
        valid_count = sum(1 for r in validation_results if r.get('status') == 'valid')
        total_rules = len(validation_results)
        print(f"Valid: {valid_count}/{total_rules}")
        
        # Show any validation errors
        errors = [r for r in validation_results if r.get('status') == 'invalid']
        if errors:
            print("\n❌ Validation Errors:")
            for error in errors:
                print(f"  - {error.get('field', 'Unknown')}: {error.get('message', 'Unknown error')}")
        else:
            print("✅ All validations passed")

- id: export_processed_data
  name: export_processed_data
  tool: OUTPUT_FILE
  input:
    - name: content
      value: "{{steps.extract_structured_data.result.extracted_data}}"
    - name: filename
      value: "extracted_invoice_data_{{date | format('YYYY-MM-DD')}}.json"
    - name: fileType
      value: "json"

Example: Batch Document Processing

- id: setup_batch_processing
  name: setup_batch_processing
  tool: PYTHON_SANDBOX_RUN
  input:
    - name: code
      value: |
        # Define batch processing configuration
        batch_config = {
            "document_types": ["invoice", "receipt", "contract"],
            "extraction_schema": {
                "common_fields": ["date", "amount", "vendor", "document_type"],
                "invoice_fields": ["invoice_number", "line_items", "tax_amount"],
                "receipt_fields": ["merchant", "payment_method", "receipt_number"],
                "contract_fields": ["parties", "terms", "effective_date", "expiration_date"]
            },
            "validation_rules": {
                "amount_validation": {"type": "number", "min": 0},
                "date_validation": {"type": "date", "format": "flexible"},
                "email_validation": {"type": "email", "required": false}
            }
        }
        
        print("Batch processing configured for document types:")
        for doc_type in batch_config["document_types"]:
            print(f"  - {doc_type.title()}")

- id: process_document_batch
  name: process_document_batch
  tool: JINBA_MODULES_EXTRACT
  input:
    - name: task_name
      value: "Batch Document Processing"
    - name: file_url
      value: "{{input.batch_file_url}}"
    - name: data_schema
      value: |
        {
          "$schema": "http://json-schema.org/draft-07/schema#",
          "type": "object",
          "properties": {
            "documents": {
              "type": "array",
              "items": {
                "type": "object",
                "properties": {
                  "document_type": {"type": "string"},
                  "date": {"type": "string"},
                  "amount": {"type": "number"},
                  "vendor": {"type": "string"},
                  "metadata": {
                    "type": "object",
                    "additionalProperties": true
                  }
                },
                "required": ["document_type", "date", "amount"]
              }
            }
          }
        }
    - name: extraction_mode
      value: "BALANCED"

- id: parse_complex_structures
  name: parse_complex_structures
  tool: JINBA_MODULES_PARSE
  input:
    - name: input_data
      value: "{{steps.process_document_batch.result.extracted_data}}"
    - name: parsing_options
      value: |
        {
          "preserve_structure": true,
          "normalize_dates": true,
          "standardize_amounts": true,
          "extract_entities": true,
          "group_by_type": true
        }

- id: comprehensive_validation
  name: comprehensive_validation
  tool: JINBA_MODULES_CHECKER_V2
  input:
    - name: data_content
      value: "{{steps.parse_complex_structures.result.parsed_data}}"
    - name: rules_json
      value: |
        {
          "validation_rules": [
            {
              "field": "documents[*].document_type",
              "type": "enum",
              "values": ["invoice", "receipt", "contract"],
              "error_message": "Document type must be invoice, receipt, or contract"
            },
            {
              "field": "documents[*].amount",
              "type": "number",
              "min": 0,
              "max": 1000000,
              "error_message": "Amount must be between 0 and 1,000,000"
            },
            {
              "field": "documents[*].date",
              "type": "date",
              "min_date": "2020-01-01",
              "max_date": "2025-12-31",
              "error_message": "Date must be between 2020 and 2025"
            },
            {
              "field": "documents[*].vendor",
              "type": "string",
              "min_length": 2,
              "max_length": 200,
              "error_message": "Vendor name must be 2-200 characters"
            }
          ],
          "summary_rules": [
            {
              "rule": "document_count_check",
              "expression": "documents.length > 0",
              "error_message": "At least one document must be processed"
            },
            {
              "rule": "total_amount_check", 
              "expression": "sum(documents[*].amount) > 0",
              "error_message": "Total amount must be greater than zero"
            }
          ]
        }

- id: generate_processing_report
  name: generate_processing_report
  tool: PYTHON_SANDBOX_RUN
  input:
    - name: code
      value: |
        import json
        from datetime import datetime
        
        # Compile processing report
        extracted = json.loads('''{{steps.process_document_batch.result.extracted_data}}''')
        parsed = json.loads('''{{steps.parse_complex_structures.result.parsed_data}}''')
        validation = json.loads('''{{steps.comprehensive_validation.result.validation_results}}''')
        
        report = {
            "processing_summary": {
                "timestamp": datetime.now().isoformat(),
                "total_documents": len(extracted.get('documents', [])),
                "extraction_mode": "BALANCED",
                "validation_passed": all(r.get('status') == 'valid' for r in validation)
            },
            "document_breakdown": {},
            "validation_summary": {
                "total_rules": len(validation),
                "passed": sum(1 for r in validation if r.get('status') == 'valid'),
                "failed": sum(1 for r in validation if r.get('status') == 'invalid')
            },
            "recommendations": []
        }
        
        # Document type breakdown
        if 'documents' in extracted:
            doc_types = {}
            total_amount = 0
            for doc in extracted['documents']:
                doc_type = doc.get('document_type', 'unknown')
                doc_types[doc_type] = doc_types.get(doc_type, 0) + 1
                total_amount += doc.get('amount', 0)
            
            report['document_breakdown'] = doc_types
            report['processing_summary']['total_amount'] = total_amount
        
        # Add recommendations
        if report['validation_summary']['failed'] > 0:
            report['recommendations'].append("Review failed validations and correct data issues")
        
        if report['processing_summary']['total_documents'] > 100:
            report['recommendations'].append("Consider processing in smaller batches for better performance")
        
        print(json.dumps(report, indent=2))

- id: save_processing_report
  name: save_processing_report
  tool: OUTPUT_FILE
  input:
    - name: content
      value: "{{steps.generate_processing_report.result.stdout}}"
    - name: filename
      value: "batch_processing_report_{{date | format('YYYY-MM-DD-HHmm')}}.json"
    - name: fileType
      value: "json"

Extraction Modes

FAST Mode

  • Speed: Fastest processing
  • Accuracy: Good for simple documents
  • Use cases: High-volume, simple document processing
  • Processing time: ~1-3 seconds per document

BALANCED Mode

  • Speed: Moderate processing speed
  • Accuracy: High accuracy for most documents
  • Use cases: General-purpose document processing
  • Processing time: ~3-8 seconds per document

QUALITY Mode

  • Speed: Slower but thorough processing
  • Accuracy: Highest accuracy for complex documents
  • Use cases: Critical documents, complex layouts
  • Processing time: ~8-15 seconds per document

Data Schema Design

Basic Schema Structure

{
  "$schema": "http://json-schema.org/draft-07/schema#",
  "type": "object",
  "properties": {
    "field_name": {
      "type": "string|number|object|array",
      "description": "Clear description of the field",
      "format": "date|email|uri|etc",
      "pattern": "regex_pattern_if_needed"
    }
  },
  "required": ["list_of_required_fields"]
}

Advanced Schema Features

  • Nested objects: Complex data structures
  • Arrays: Multiple items of the same type
  • Conditional fields: Fields dependent on other values
  • Pattern matching: Regex validation
  • Format validation: Date, email, URL formats
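
The schema below is a hedged sketch combining several of these features: a nested vendor object, an items array, a regex pattern for the invoice number, email and date format validation, and a draft-07 if/then conditional that requires payment_date whenever payment_status is "paid". The field names and pattern are illustrative, not a documented Jinba schema.

{
  "$schema": "http://json-schema.org/draft-07/schema#",
  "type": "object",
  "properties": {
    "invoice_number": {
      "type": "string",
      "pattern": "^INV-[0-9]{4,}$"
    },
    "date": {"type": "string", "format": "date"},
    "vendor": {
      "type": "object",
      "properties": {
        "name": {"type": "string"},
        "email": {"type": "string", "format": "email"}
      },
      "required": ["name"]
    },
    "items": {
      "type": "array",
      "items": {
        "type": "object",
        "properties": {
          "description": {"type": "string"},
          "quantity": {"type": "number", "minimum": 0}
        }
      }
    },
    "payment_status": {"type": "string", "enum": ["paid", "unpaid"]},
    "payment_date": {"type": "string", "format": "date"}
  },
  "required": ["invoice_number", "date"],
  "if": {
    "properties": {"payment_status": {"const": "paid"}},
    "required": ["payment_status"]
  },
  "then": {
    "required": ["payment_date"]
  }
}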

Validation Rules

Field-level Validation

  • Type checking: String, number, boolean, array, object
  • Range validation: Min/max values for numbers
  • Length validation: Min/max length for strings
  • Format validation: Email, date, URL patterns
  • Enum validation: Allowed values from a list
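
A hedged sketch of field-level rules, reusing the rules_json keys that appear in the examples above (type, min, max, min_length, max_length, values, error_message); confirm the full set of supported rule types against the tool reference before relying on others.

{
  "validation_rules": [
    {
      "field": "invoice_number",
      "type": "string",
      "min_length": 1,
      "max_length": 50,
      "error_message": "Invoice number must be 1-50 characters"
    },
    {
      "field": "total_amount",
      "type": "number",
      "min": 0,
      "max": 1000000,
      "error_message": "Total amount must be between 0 and 1,000,000"
    },
    {
      "field": "status",
      "type": "enum",
      "values": ["draft", "sent", "paid"],
      "error_message": "Status must be draft, sent, or paid"
    }
  ]
}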

Document-level Validation

  • Required fields: Mandatory data presence
  • Cross-field validation: Rules spanning multiple fields
  • Business logic: Custom validation rules
  • Consistency checks: Data coherence validation
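
Document-level checks can be expressed as summary_rules alongside field rules, as in the batch example above. The sketch below reuses that expression syntax; whether aggregate comparisons such as the first rule are supported is an assumption to verify against the tool reference.

{
  "validation_rules": [
    {
      "field": "tax_amount",
      "type": "number",
      "min": 0,
      "error_message": "Tax amount cannot be negative"
    }
  ],
  "summary_rules": [
    {
      "rule": "totals_consistent",
      "expression": "sum(items[*].total) <= total_amount",
      "error_message": "Line item totals must not exceed the invoice total"
    },
    {
      "rule": "at_least_one_item",
      "expression": "items.length > 0",
      "error_message": "An invoice must contain at least one line item"
    }
  ]
}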

Use Cases

  • Invoice Processing: Automated invoice data extraction and validation
  • Document Digitization: Convert paper documents to structured data
  • Data Migration: Extract data from legacy systems
  • Compliance Checking: Validate documents against regulations
  • Research Data: Extract structured data from research documents
  • Form Processing: Automate form data extraction
  • Contract Analysis: Extract key terms from contracts
  • Financial Processing: Process financial statements and reports

Best Practices

Schema Design

  • Keep schemas simple and focused
  • Use clear, descriptive field names
  • Include comprehensive descriptions
  • Test schemas with sample data (see the sketch after this list)
  • Version your schemas for consistency
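
One way to test a schema against sample data before wiring it into a workflow is sketched below. It assumes a Python environment with the jsonschema package installed; the schema and sample record are illustrative.

from jsonschema import Draft7Validator  # assumes the jsonschema package is available

# Illustrative schema and sample record; substitute your own.
schema = {
    "$schema": "http://json-schema.org/draft-07/schema#",
    "type": "object",
    "properties": {
        "invoice_number": {"type": "string"},
        "total_amount": {"type": "number", "minimum": 0}
    },
    "required": ["invoice_number", "total_amount"]
}
sample = {"invoice_number": "INV-0001", "total_amount": 129.50}

# Report every violation rather than stopping at the first one.
validator = Draft7Validator(schema)
errors = sorted(validator.iter_errors(sample), key=lambda e: list(e.path))
if errors:
    for err in errors:
        path = ".".join(str(p) for p in err.path) or "<root>"
        print(f"{path}: {err.message}")
else:
    print("Sample record matches the schema")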

Extraction Optimization

  • Choose appropriate extraction mode for your use case
  • Provide high-quality input documents
  • Use consistent document formats when possible
  • Monitor extraction accuracy and adjust as needed

Validation Strategy

  • Implement layered validation (field → document → business)
  • Provide clear error messages
  • Log validation results for analysis
  • Continuously improve validation rules based on results

Performance Considerations

  • Batch similar documents together
  • Use FAST mode for simple, high-volume processing
  • Monitor processing times and adjust extraction modes
  • Implement error handling for failed extractions
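
A hedged sketch of the last point, written as a PYTHON_SANDBOX_RUN step in the style of the examples above: it treats an empty or unparseable extraction result as a failure instead of letting json.loads raise. The upstream step and field names are assumptions based on the invoice example.

- id: handle_extraction_errors
  name: handle_extraction_errors
  tool: PYTHON_SANDBOX_RUN
  input:
    - name: code
      value: |
        import json

        raw = '''{{steps.extract_structured_data.result.extracted_data}}'''

        # Treat empty or malformed output as a failed extraction
        try:
            data = json.loads(raw) if raw.strip() else None
        except json.JSONDecodeError as exc:
            data = None
            print(f"Extraction output could not be parsed: {exc}")

        if not data:
            print("Extraction failed - re-run in QUALITY mode or review the source document")
        else:
            print(f"Extraction succeeded with {len(data)} top-level fields")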