Ingestion API Schema for Contextual AI Assistant

This document defines the request schema and processing logic for the /ingest endpoint of the Contextual AI Assistant. The endpoint receives and processes screen data sent by client applications.

1. Endpoint Definition

POST /ingest

Client applications send screen data to the ingestion endpoint every 5 seconds. The endpoint extracts entities and relationships from each capture and stores the resulting structured information in the Mem0 memory system.
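
As a rough illustration of the 5-second cadence described above, the following Python sketch shows how a client might build a POST /ingest request and send captures on a fixed interval. The base URL, the function names, and the injectable `sleep` parameter are assumptions made for illustration; none of them are defined by this API.

```python
import json
import time
import urllib.request

INGEST_INTERVAL_S = 5.0  # capture cadence stated in this document


def build_ingest_request(base_url, body, headers):
    """Build (but do not send) a POST /ingest request.

    base_url is a hypothetical deployment address, not part of the schema.
    """
    data = json.dumps(body).encode("utf-8")
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/ingest",
        data=data,
        headers=headers,
        method="POST",
    )


def run_capture_loop(capture, send, iterations, sleep=time.sleep):
    """Call send(capture()) `iterations` times, pausing between sends.

    `sleep` is injectable so the loop can be exercised without waiting.
    """
    for i in range(iterations):
        send(capture())
        if i < iterations - 1:
            sleep(INGEST_INTERVAL_S)
```

In a real client, `send` would typically pass the built request to `urllib.request.urlopen` and handle errors and retries, which this sketch leaves out.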

2. Request Schema

2.1 Headers

Header	Description	Required	Example
`Content-Type`	Media type of the request body	Yes	`application/json`
`X-User-ID`	Unique identifier for the user	Yes	`user_123`
`X-Client-ID`	Identifier for the client application	Yes	`desktop_app_v1.2`
`X-Session-ID`	Unique identifier for the user session	No	`session_456`
`X-Timestamp`	Client-side timestamp (ISO 8601)	No	`2025-04-17T00:30:45Z`
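
The header table above could be assembled on the client roughly as follows. This is a minimal sketch: the helper names `build_headers` and `missing_required` are hypothetical, and the lookup is case-sensitive for simplicity even though HTTP header names are case-insensitive.

```python
from datetime import datetime, timezone

# Headers marked "Yes" in the Required column of the table
REQUIRED_HEADERS = ("Content-Type", "X-User-ID", "X-Client-ID")


def build_headers(user_id, client_id, session_id=None, now=None):
    """Assemble request headers; optional ones are added only if given."""
    headers = {
        "Content-Type": "application/json",
        "X-User-ID": user_id,
        "X-Client-ID": client_id,
    }
    if session_id is not None:
        headers["X-Session-ID"] = session_id
    if now is not None:
        # ISO 8601 in UTC with a trailing "Z", matching the table's example.
        # `now` must be a timezone-aware datetime.
        utc_now = now.astimezone(timezone.utc)
        headers["X-Timestamp"] = utc_now.strftime("%Y-%m-%dT%H:%M:%SZ")
    return headers


def missing_required(headers):
    """Return the required header names that are absent or empty."""
    return [name for name in REQUIRED_HEADERS if not headers.get(name)]
```

For example, `build_headers("user_123", "desktop_app_v1.2")` yields only the three required headers, and `missing_required` returns an empty list for it.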

2.2 Request Body

{
  "content": {
    "text": "string",          // Raw text content from the screen
    "html": "string",          // Optional HTML content for rich formatting
    "structure": {},           // Optional structured representation of the content
    "metadata": {}             // Optional additional metadata
  },
  "context": {
    "app": {
      "name": "string",        // Application name (e.g., "WhatsApp", "GitHub")
      "type": "string",        // Application type (e.g., "messaging", "development")
      "version": "string",     // Application version
      "window_title": "string" // Window title
    },
    "user": {
      "active": true,          // Whether the user is actively engaging with the app
      "focus_duration_ms": 0,  // Milliseconds the user has been focused on this window
      "last_input_ms": 0       // Milliseconds since the last user input
    },
    "device": {
      "type": "string",        // Device type (e.g., "desktop", "mobile")
      "os": "string",          // Operating system
      "screen_resolution": {   // Screen resolution
        "width": 0,
        "height": 0
      }
    },
    "timestamp": "string",     // ISO 8601 timestamp
    "timezone": "string"       // User's timezone
  },
  "capture": {
    "type": "string",                 // Capture type: "full", "diff", or "event"
    "sequence_id": 0,                 // Sequence number for ordering captures
    "diff_base_id": "string",         // Reference to previous capture (for diffs)
    "image": "string"                 // Optional: Base64-encoded screenshot
  }
}
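
A server or client might sanity-check a request body against the schema above along these lines. This is only a sketch under stated assumptions: the document does not say which fields are mandatory, so treating the three top-level sections and `content.text` as required, and expecting `diff_base_id` on diff captures, are guesses. The function name `validate_body` is hypothetical.

```python
CAPTURE_TYPES = {"full", "diff", "event"}  # values listed for capture.type


def validate_body(body):
    """Return a list of problems found in an /ingest body (empty if none)."""
    problems = []

    # Assumption: all three top-level sections must be present as objects.
    for section in ("content", "context", "capture"):
        if not isinstance(body.get(section), dict):
            problems.append(f"missing or non-object section: {section}")
    if problems:
        return problems

    # Assumption: text is required because it is the only content field
    # the schema does not label optional.
    if not isinstance(body["content"].get("text"), str):
        problems.append("content.text must be a string")

    capture = body["capture"]
    if capture.get("type") not in CAPTURE_TYPES:
        problems.append("capture.type must be one of: diff, event, full")

    # Assumption: a diff capture needs a reference to its base capture.
    if capture.get("type") == "diff" and not capture.get("diff_base_id"):
        problems.append("capture.diff_base_id is expected for diff captures")

    return problems
```

Returning a list of problems rather than raising on the first one lets a caller report every issue in a single response.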

2.2.1 Content Object

The content object contains the actual screen data captured by the client:

text: Plain text content extracted from the screen
html: Optional HTML representation for richer content analysis
structure: Optional structured representation of the content (application-specific)
metadata: Additional content-related metadata

2.2.2 Context Object

The context object provides information about the environment in which the screen data was captured:

app: Information about the application being used
- name: Application name (e.g., "WhatsApp", "GitHub")
- type: Application type (e.g., "messaging", "development")
- version: Application version
- window_title: Window title
user: Information about user activity
- active: Whether the user is actively engaging with the app
- focus_duration_ms: How long the user has been focused on this window, in milliseconds
- last_input_ms: Time since the last user input, in milliseconds