Tutorial

Adding Multi-Modal Support to a Chatbot Without Rebuilding the Backend

This case study shows how we integrated images, audio, and documents into a chatbot with Acontext. Unified session management and message storage reduced a 5–7 day build to just one day.

TL;DR

We added multi-modal support (images, audio, and documents) to a chatbot using Acontext, which simplified session management, message storage, and multi-format message conversion. What would have taken us 5-7 days to implement from scratch was reduced to 1 day with Acontext. In this post, we break down the process, from integrating Acontext for session management to enabling multi-modal capabilities, showing how this approach can save developers significant time and effort.

The Goal: Enabling Multi-Modal Support in a Chatbot

The objective was to add support for images, audio, and documents to the chatbot, while ensuring that the conversation history and user sessions were handled properly. These features not only enhance the chatbot's interaction capabilities but also demand a backend that can handle a variety of data types and provide seamless storage and retrieval.

Challenges We Faced:

  1. Message Format Conversion: Different LLM providers, such as OpenAI, Anthropic, and Gemini, use different message formats, which complicates integration (see the example after this list).
  2. Multi-Modal Handling: Managing images, audio, and documents across various platforms required writing custom handling code for each format.
  3. Session Management: Rolling our own session management meant dealing with databases, edge cases, and migrations.
  4. Token Management: Truncating conversation history to stay within token limits while preserving important context was a tricky problem to solve.
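
To make the first challenge concrete, here is the same image-bearing user message in OpenAI's Chat Completions format and in Anthropic's Messages format (field names follow each provider's public API; the base64 payload is a placeholder):

// The same user turn in OpenAI Chat Completions format
const openaiMessage = {
  role: "user",
  content: [
    { type: "text", text: "What's in this image?" },
    { type: "image_url", image_url: { url: "data:image/png;base64,<base64 data>" } },
  ],
};

// ...and in Anthropic Messages format: a different part type and field layout
const anthropicMessage = {
  role: "user",
  content: [
    { type: "text", text: "What's in this image?" },
    {
      type: "image",
      source: { type: "base64", media_type: "image/png", data: "<base64 data>" },
    },
  ],
};

Maintaining this mapping by hand for every provider and every content type is exactly the work we wanted to avoid.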

How Acontext Simplified the Process

Step 1: Setting Up Basic Session Management

The first task was to set up session management. Acontext made it easy by handling session storage automatically. Here’s what we did:

  • Set up the Acontext client: We integrated the Acontext client into the project to manage message persistence and session creation.
  • Replaced in-memory message handling: Instead of manually managing message history, we used Acontext sessions to automatically store and retrieve user messages (a minimal sketch follows this list).
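
As a minimal sketch of that change (the create and storeMessage calls mirror the route code later in this post; the retrieval call is an assumption and its exact name may differ):

import { acontextClient } from "@/utils/acontext/client";

// Replaces the old in-memory message array: create a session, then persist each turn.
export async function persistUserTurn(
  userId: string,
  userMessage: { role: "user"; content: string }
) {
  // Same calls as in the route handler shown later in this post
  const session = await acontextClient.sessions.create({ user: userId });
  await acontextClient.sessions.storeMessage(session.id, userMessage, { format: "openai" });

  // Retrieval is hypothetical here -- the method name is an assumption, check the Acontext docs:
  // const history = await acontextClient.sessions.getMessages(session.id, { format: "openai" });

  return session.id;
}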

Step 2: Adding Multi-Modal Support

With basic session management in place, we added multi-modal support for images, audio, and documents:

  • File upload UI: We built a simple user interface to upload images, audio, and documents.
  • Convert files to base64: The files were encoded into base64 format, making them compatible with the message format we needed for OpenAI.
  • Store multi-modal messages: Acontext handled the storage and conversion of multi-modal content, ensuring compatibility with multiple AI models.

Time Estimate for Integration

| Task | Time |
| --- | --- |
| Basic integration | ~2-3 hours |
| Multi-modal support | ~3-4 hours |
| Testing & debugging | ~2 hours |
| Total | ~1 day |

Challenges: Without Acontext vs. With Acontext

Here's a comparison of what we would have had to do manually versus how Acontext simplified each step:

| Challenge | Without Acontext | With Acontext |
| --- | --- | --- |
| Message Format Conversion | Each LLM provider (OpenAI, Anthropic, Gemini) uses a different message format, requiring manual conversion between them. This is tedious and error-prone. | Acontext handles message format conversion automatically, so messages can be stored in one format (e.g., OpenAI) and retrieved in another (e.g., Anthropic). |
| Multi-Modal Complexity | Handling images, audio, and documents requires building a different JSON structure for each LLM, making the integration complex and prone to mistakes. | Acontext simplifies multi-modal support by handling the storage and format conversion of base64-encoded images, audio, and documents. |
| Session Persistence | Rolling your own session management means building and maintaining databases, handling edge cases, and dealing with data migrations. | Acontext provides built-in session management, automatically creating sessions and storing messages without custom database schemas. |
| Token Management | Truncating conversation history while maintaining context coherence requires careful implementation, as each provider has different token limits. | Acontext provides edit_strategies with token_limit, making it easy to truncate history while keeping context coherent (see the sketch after this table). |
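
We have not shown token management code in this post, but a hypothetical sketch of the idea looks like this (the method name and option shape are assumptions based on the edit_strategies and token_limit feature names above, not a confirmed API):

import { acontextClient } from "@/utils/acontext/client";

// Hypothetical: retrieve history already truncated to a token budget.
// Method name and options are assumptions -- consult the Acontext docs for the real API.
export async function getTruncatedHistory(sessionId: string) {
  return acontextClient.sessions.getMessages(sessionId, {
    format: "openai",
    edit_strategies: [{ type: "token_limit", max_tokens: 8000 }], // assumed shape
  });
}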

Estimated Workload Without Acontext

| Step | Time |
| --- | --- |
| Database schema design | 4-6 hours |
| Message storage API | 6-8 hours |
| Multi-modal format handling | 8-12 hours |
| Multi-provider conversion | 10-15 hours |
| Token management | 4-6 hours |
| Testing & edge cases | 6-8 hours |
| Total | 5-7 days |

Key Takeaway:

What would have taken 5-7 days to build manually (session management, format conversion, multi-modal handling) was reduced to 1 day with Acontext. The biggest win was not having to worry about message format differences between LLM providers: just store and retrieve, and Acontext handles the rest.


Step-by-Step Breakdown

Current Message Flow

User Input → API /api/chat → OpenAI (streamText) → Stream Response
                ↓
        Memobase stores messages → Extract user profiles/events

  • Text-only messages: no support for images, audio, or documents
  • No session persistence: messages are not persisted through Acontext

Target Message Flow

User Input (text/image/audio/document)
        ↓
API /api/chat
        ↓
┌───────────────────────────────────────┐
│ 1. Get/Create Acontext Session        │
│ 2. Store user message to Acontext     │
│ 3. Call OpenAI with multi-modal input │
│ 4. Store assistant response           │
└───────────────────────────────────────┘
        ↓
Stream Response + Update UI

Detailed Migration Steps

Phase 1: Basic Integration (Replace Message Storage)

Step 1.1: Install Dependencies

pnpm add @acontext/acontext

Step 1.2: Create Acontext Client

New file: utils/acontext/client.ts

import { AcontextClient } from '@acontext/acontext';

export const acontextClient = new AcontextClient({
  apiKey: process.env.ACONTEXT_API_KEY!
});

Step 1.3: Add Environment Variables

Modify file: .env.example and .env

# Acontext Configuration
ACONTEXT_API_KEY=sk-ac-your-api-key

Step 1.4: Update Chat API for Acontext

Modify file: app/api/chat/route.ts

import { openai } from "@/lib/openai";
import { jsonSchema, streamText } from "ai";
import { createClient } from "@/utils/supabase/server";
import { acontextClient } from "@/utils/acontext/client";

export const maxDuration = 30;

export async function POST(req: Request) {
  const supabase = await createClient();
  const { data, error } = await supabase.auth.getUser();
  if (error || !data?.user) {
    return new Response("Unauthorized", { status: 401 });
  }

  try {
    const { messages, tools, sessionId } = await req.json();
    
    // 1. Get or create Acontext Session
    let session;
    if (sessionId) {
      session = { id: sessionId };
    } else {
      session = await acontextClient.sessions.create({ 
        user: data.user.id 
      });
    }

    // 2. Store user message to Acontext
    const lastUserMessage = messages[messages.length - 1];
    await acontextClient.sessions.storeMessage(session.id, lastUserMessage, {
      format: 'openai'
    });

    // 3. Build system prompt
    const systemPrompt = `You're Memobase Assistant, a helpful assistant that demonstrates the capabilities of Memobase Memory.`;
    
    // 4. Call LLM
    const result = streamText({
      model: openai(process.env.OPENAI_MODEL!),
      messages,
      system: systemPrompt,
      tools: Object.fromEntries(
        Object.entries<{ parameters: unknown }>(tools).map(([name, tool]) => [
          name,
          { parameters: jsonSchema(tool.parameters!) },
        ])
      ),
      // 5. Store assistant response once the stream finishes
      onFinish: async ({ text }) => {
        if (text) {
          await acontextClient.sessions.storeMessage(session.id, {
            role: 'assistant',
            content: text
          }, { format: 'openai' });
        }
      },
    });

    return result.toDataStreamResponse({
      headers: {
        "x-session-id": session.id,
      },
    });
  } catch (error) {
    console.error(error);
    return new Response("Internal Server Error", { status: 500 });
  }
}
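
For reference, here is a rough client-side sketch of calling this route and reusing the returned session id on later requests (storing the id in localStorage is just one option; the endpoint path and x-session-id header come from the code above):

// Minimal client-side sketch: send the chat request and remember the Acontext session id.
export async function sendChat(messages: unknown[], tools: Record<string, unknown> = {}) {
  const sessionId = localStorage.getItem("acontext-session-id");

  const res = await fetch("/api/chat", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ messages, tools, sessionId }),
  });

  // The route echoes the session id back so later requests reuse the same session.
  const newSessionId = res.headers.get("x-session-id");
  if (newSessionId) localStorage.setItem("acontext-session-id", newSessionId);

  return res; // the body is a streamed response
}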


Phase 2: Multi-modal Support (OpenAI)

Step 2.1: Create File Upload Component

New file: components/file-upload.tsx

"use client";

import { useRef } from "react";
import { Button } from "@/components/ui/button";
import { ImageIcon, FileIcon, MicIcon } from "lucide-react";

export type AttachmentType = 'image' | 'audio' | 'document';

export interface Attachment {
  type: AttachmentType;
  base64: string;
  mimeType: string;
  filename: string;
}

interface FileUploadProps {
  onFileSelect: (attachment: Attachment) => void;
  disabled?: boolean;
}

export function FileUpload({ onFileSelect, disabled }: FileUploadProps) {
  const imageInputRef = useRef<HTMLInputElement>(null);
  const audioInputRef = useRef<HTMLInputElement>(null);
  const docInputRef = useRef<HTMLInputElement>(null);

  const handleFileChange = async (
    e: React.ChangeEvent<HTMLInputElement>,
    type: AttachmentType
  ) => {
    const file = e.target.files?.[0];
    if (!file) return;

    const reader = new FileReader();
    reader.onload = () => {
      const base64 = (reader.result as string).split(',')[1];
      onFileSelect({
        type,
        base64,
        mimeType: file.type,
        filename: file.name,
      });
    };
    reader.readAsDataURL(file);
    
    // Reset input
    e.target.value = '';
  };

  return (
    <div className="flex gap-1">
      <input
        ref={imageInputRef}
        type="file"
        accept="image/png,image/jpeg,image/gif,image/webp"
        className="hidden"
        onChange={(e) => handleFileChange(e, 'image')}
      />
      <input
        ref={audioInputRef}
        type="file"
        accept="audio/wav,audio/mp3,audio/webm"
        className="hidden"
        onChange={(e) => handleFileChange(e, 'audio')}
      />
      <input
        ref={docInputRef}
        type="file"
        accept=".pdf"
        className="hidden"
        onChange={(e) => handleFileChange(e, 'document')}
      />
      
      <Button
        variant="ghost"
        size="icon"
        disabled={disabled}
        onClick={() => imageInputRef.current?.click()}
        title="Upload image"
      >
        <ImageIcon className="h-4 w-4" />
      </Button>
      <Button
        variant="ghost"
        size="icon"
        disabled={disabled}
        onClick={() => audioInputRef.current?.click()}
        title="Upload audio"
      >
        <MicIcon className="h-4 w-4" />
      </Button>
      <Button
        variant="ghost"
        size="icon"
        disabled={disabled}
        onClick={() => docInputRef.current?.click()}
        title="Upload document"
      >
        <FileIcon className="h-4 w-4" />
      </Button>
    </div>
  );
}
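
Using the component is a matter of keeping the selected attachments in state and passing a handler; a small sketch (the surrounding composer markup is assumed):

"use client";

import { useState } from "react";
import { FileUpload, type Attachment } from "@/components/file-upload";

// Sketch: collect attachments in state until the next message is sent.
export function ComposerAttachments() {
  const [attachments, setAttachments] = useState<Attachment[]>([]);

  return (
    <FileUpload
      onFileSelect={(attachment) => setAttachments((prev) => [...prev, attachment])}
    />
  );
}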

Step 2.2: Create Multi-modal Message Builder

New file: lib/multimodal.ts

import type { Attachment } from "@/components/file-upload";

/**
 * Build OpenAI-format multi-modal message content
 */
export function buildMultimodalContent(
  text: string,
  attachments?: Attachment[]
): string | Array<{ type: string; [key: string]: any }> {
  // If no attachments, return plain text
  if (!attachments || attachments.length === 0) {
    return text;
  }

  const content: Array<{ type: string; [key: string]: any }> = [];

  // Add text part
  if (text) {
    content.push({ type: 'text', text });
  }

  // Add attachment parts
  for (const attachment of attachments) {
    switch (attachment.type) {
      case 'image':
        content.push({
          type: 'image_url',
          image_url: {
            url: `data:${attachment.mimeType};base64,${attachment.base64}`,
            detail: 'auto'
          }
        });
        break;
        
      case 'audio':
        content.push({
          type: 'input_audio',
          input_audio: {
            data: attachment.base64,
            format: attachment.mimeType.split('/')[1] || 'wav'
          }
        });
        break;
        
      case 'document':
        // Note: OpenAI doesn't natively support PDF in chat
        // Store in Acontext for reference, but convert to text description
        content.push({
          type: 'text',
          text: `[Attached document: ${attachment.filename}]`
        });
        break;
    }
  }

  return content;
}

/**
 * Build complete OpenAI-format message with attachments
 */
export function buildUserMessage(
  text: string,
  attachments?: Attachment[]
) {
  return {
    role: 'user' as const,
    content: buildMultimodalContent(text, attachments)
  };
}
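
For example, a text prompt plus one image attachment produces an OpenAI-style content array (base64 payload shortened to a placeholder):

import { buildUserMessage } from "@/lib/multimodal";

const message = buildUserMessage("What's in this picture?", [
  { type: "image", base64: "<base64 data>", mimeType: "image/png", filename: "photo.png" },
]);

// message now looks like:
// {
//   role: "user",
//   content: [
//     { type: "text", text: "What's in this picture?" },
//     { type: "image_url", image_url: { url: "data:image/png;base64,<base64 data>", detail: "auto" } }
//   ]
// }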

Step 2.3: Update Chat API for Multi-modal

Modify file: app/api/chat/route.ts

import { openai } from "@/lib/openai";
import { jsonSchema, streamText } from "ai";
import { createClient } from "@/utils/supabase/server";
import { acontextClient } from "@/utils/acontext/client";

export const maxDuration = 30;

export async function POST(req: Request) {
  const supabase = await createClient();
  const { data, error } = await supabase.auth.getUser();
  if (error || !data?.user) {
    return new Response("Unauthorized", { status: 401 });
  }

  try {
    const { messages, tools, sessionId } = await req.json();
    
    // 1. Get or create Acontext Session
    let session;
    if (sessionId) {
      session = { id: sessionId };
    } else {
      session = await acontextClient.sessions.create({ 
        user: data.user.id 
      });
    }

    // 2. Store user message to Acontext (supports multi-modal)
    const lastUserMessage = messages[messages.length - 1];
    await acontextClient.sessions.storeMessage(session.id, lastUserMessage, {
      format: 'openai'
    });

    // 3. Build system prompt
    const systemPrompt = `You're Memobase Assistant, a helpful assistant that demonstrates the capabilities of Memobase Memory.`;
    
    // 4. Call OpenAI (GPT-4o supports vision)
    const result = streamText({
      model: openai(process.env.OPENAI_MODEL!), // Use gpt-4o for multi-modal
      messages,
      system: systemPrompt,
      tools: tools ? Object.fromEntries(
        Object.entries<{ parameters: unknown }>(tools).map(([name, tool]) => [
          name,
          { parameters: jsonSchema(tool.parameters!) },
        ])
      ) : undefined,
      // 5. Store assistant response once the stream finishes
      onFinish: async ({ text }) => {
        if (text) {
          await acontextClient.sessions.storeMessage(session.id, {
            role: 'assistant',
            content: text
          }, { format: 'openai' });
        }
      },
    });

    const lastMessage = Array.isArray(lastUserMessage.content)
      ? lastUserMessage.content.find((c: any) => c.type === 'text')?.text || ''
      : lastUserMessage.content;

    return result.toDataStreamResponse({
      headers: {
        "x-session-id": session.id,
        "x-last-user-message": encodeURIComponent(lastMessage),
      },
    });
  } catch (error) {
    console.error(error);
    return new Response("Internal Server Error", { status: 500 });
  }
}

Step 2.4: Update Frontend Page

Modify file: app/page.tsx (key changes)

// Add state for session and attachments
const [sessionId, setSessionId] = useState<string | null>(null);
const [attachments, setAttachments] = useState<Attachment[]>([]);

// Update runtime config
const runtime = useChatRuntime({
  api: `${process.env["NEXT_PUBLIC_BASE_PATH"] || ""}/api/chat`,
  body: {
    sessionId,
  },
  onResponse: (response) => {
    if (response.status !== 200) return;
    
    // Get session ID from response
    const newSessionId = response.headers.get("x-session-id");
    if (newSessionId && !sessionId) {
      setSessionId(newSessionId);
    }
    
    const message = response.headers.get("x-last-user-message") || "";
    lastUserMessageRef.current = decodeURIComponent(message);
  },
  // ... rest of config
});

// Clear attachments after sending
const handleSend = () => {
  setAttachments([]);
};
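
The remaining piece is folding the pending attachments into the outgoing message before it is sent. The exact call depends on your chat runtime, so treat this as a sketch only (sendMessage is illustrative, not the actual assistant-ui API):

// Sketch: build an OpenAI-format multi-modal message from the composer text and
// pending attachments, send it, then clear the attachment state.
import { buildUserMessage } from "@/lib/multimodal";

const sendWithAttachments = async (text: string) => {
  const userMessage = buildUserMessage(text, attachments);

  await sendMessage(userMessage); // illustrative -- substitute your runtime's send/append call

  setAttachments([]); // same cleanup as handleSend above
};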


In short, Acontext eliminates backend complexity, saving you time on tasks such as session management, format conversion, and multi-modal handling.

Save yourself days of work: Try Acontext now.