Responses API

The Stateful Responses API is a middleware layer that sits between your application and an OpenAI Responses API-compatible inference backend. It implements the OpenAI Responses API format and adds:

  • Conversation persistence: Responses are stored and linked via previous_response_id, so the server reconstructs full conversation history automatically.

  • Instructions inheritance: System prompts carry forward across turns without resending them.

  • Streaming: Real-time token delivery via Server-Sent Events (SSE).

  • Tool calling: Client-executed function tools and server-executed MCP (Model Context Protocol) tools.

  • Async jobs: Fire-and-forget background processing for long-running requests.

  • Guardrails: Input safety checks via LlamaGuard integration.
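
The first two bullets can be made concrete with a sketch of the request bodies involved. The helper names (first_turn, follow_up) and the response ID resp_123 are hypothetical; the fields model, input, instructions, and previous_response_id come from the OpenAI Responses API format. The sketch assumes a follow-up turn needs only the new input plus the ID of the prior response, since the server is described as restoring history and instructions on its own.

```python
import json

MODEL = "qwen3-32b-tool"  # model name taken from the quick example


def first_turn(user_text: str, instructions: str) -> dict:
    """Body for the opening request: this is where the instructions are sent."""
    return {"model": MODEL, "instructions": instructions, "input": user_text}


def follow_up(user_text: str, previous_response_id: str) -> dict:
    """Body for a later turn: new input plus the link to the prior response."""
    return {
        "model": MODEL,
        "input": user_text,
        "previous_response_id": previous_response_id,
    }


turn_1 = first_turn("What is the capital of Germany?", "Answer in one sentence.")
# "resp_123" is a placeholder for the "id" field returned with the first response.
turn_2 = follow_up("And of France?", "resp_123")

print(json.dumps(turn_1, indent=2))
print(json.dumps(turn_2, indent=2))
```

Each dictionary would be sent as the JSON body of a POST to /v1/responses, as in the curl example in the Quick Example section.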

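For the streaming bullet, a client reads the SSE body line by line and picks out the text fragments. The sketch below parses a captured stream offline. It assumes the deployment emits the OpenAI event type response.output_text.delta with a delta text field, which this page does not confirm; the function name iter_text_deltas and the sample lines are made up for illustration.

```python
import json
from typing import Iterable, Iterator


def iter_text_deltas(lines: Iterable[str]) -> Iterator[str]:
    """Yield text fragments from the `data:` lines of an SSE stream."""
    for line in lines:
        line = line.strip()
        if not line.startswith("data:"):
            continue  # skip `event:` lines, comments, and blank separators
        payload = line[len("data:"):].strip()
        if not payload or payload == "[DONE]":
            continue
        event = json.loads(payload)
        # Assumed event shape, following the OpenAI streaming format.
        if event.get("type") == "response.output_text.delta":
            yield event["delta"]


# Hypothetical sample of what a stream might look like.
sample = [
    "event: response.output_text.delta",
    'data: {"type": "response.output_text.delta", "delta": "Ber"}',
    "",
    "event: response.output_text.delta",
    'data: {"type": "response.output_text.delta", "delta": "lin"}',
    "",
    "event: response.completed",
    'data: {"type": "response.completed"}',
]

text = "".join(iter_text_deltas(sample))
print(text)  # prints "Berlin"
```

Keeping the parser separate from the HTTP call means the same function can be fed from any line iterator, whichever client library produces it.
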
API Compatibility

The API follows the OpenAI Responses API specification. You can interact with it using the OpenAI Python SDK, PydanticAI, LangGraph (via langchain-openai), or plain HTTP/curl; just point the client at your deployment URL.

Quick Example

  • curl

  • Python (OpenAI SDK)

  • Python (PydanticAI)

  • Python (LangGraph)

curl:

curl -X POST https://your-deployment-url/v1/responses \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $AA_TOKEN" \
  -d '{
    "model": "qwen3-32b-tool",
    "input": "What is the capital of Germany?"
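  }'

Python (OpenAI SDK):

import os

from openai import OpenAI

# Read the token from the environment, as the curl example does with $AA_TOKEN.
client = OpenAI(
    base_url="https://your-deployment-url/v1",
    api_key=os.environ["AA_TOKEN"],
)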

response = client.responses.create(
    model="qwen3-32b-tool",
    input="What is the capital of Germany?",
)

print(response.output_text)

Python (PydanticAI):

import asyncio
import os

from pydantic_ai import Agent
from pydantic_ai.models.openai import OpenAIResponsesModel
from pydantic_ai.providers.openai import OpenAIProvider

# Point the provider at the deployment; the token comes from the environment.
provider = OpenAIProvider(
    base_url="https://your-deployment-url/v1",
    api_key=os.environ["AA_TOKEN"],
)

agent = Agent(
    model=OpenAIResponsesModel("qwen3-32b-tool", provider=provider),
    system_prompt="You are a helpful assistant.",
)

async def main():
    result = await agent.run("What is the capital of Germany?")
    print(result.output)

asyncio.run(main())

Python (LangGraph):

import os
from typing import Annotated, TypedDict

from langchain_core.messages import BaseMessage, HumanMessage
from langchain_openai import ChatOpenAI
from langgraph.graph import StateGraph
from langgraph.graph.message import add_messages

llm = ChatOpenAI(
    model="qwen3-32b-tool",
    base_url="https://your-deployment-url/v1",
    api_key=os.environ["AA_TOKEN"],  # token read from the environment
    use_responses_api=True,  # use the Responses API instead of Chat Completions
)

class State(TypedDict):
    messages: Annotated[list[BaseMessage], add_messages]
    previous_response_id: str | None

def chat(state: State) -> dict:
    kwargs = {}
    if pid := state.get("previous_response_id"):
        kwargs["previous_response_id"] = pid
    ai = llm.invoke(state["messages"], **kwargs)
    return {
        "messages": [ai],
        "previous_response_id": ai.response_metadata.get("id"),
    }

graph = (
    StateGraph(State)
    .add_node("chat", chat)
    .set_entry_point("chat")
    .set_finish_point("chat")
    .compile()
)

result = graph.invoke({
    "messages": [HumanMessage("What is the capital of Germany?")],
    "previous_response_id": None,
})
print(result["messages"][-1].text)

Prerequisites

  • A valid AA_TOKEN for authentication

  • Network access to your deployment environment

  • For the Python examples: the openai, pydantic-ai, or langgraph and langchain-openai packages, depending on which client you use
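
A quick way to check the list above before running the examples is sketched below. It assumes the token is supplied through an environment variable named AA_TOKEN, as in the curl example; the function name missing_prerequisites is hypothetical, and network access to the deployment is not checked.

```python
import importlib.util
import os

# Import name -> pip package name, for the packages listed above.
PACKAGES = {
    "openai": "openai",
    "pydantic_ai": "pydantic-ai",
    "langgraph": "langgraph",
    "langchain_openai": "langchain-openai",
}


def missing_prerequisites(env=os.environ) -> list:
    """Return human-readable messages for anything that appears to be missing."""
    problems = []
    if not env.get("AA_TOKEN"):
        problems.append("environment variable AA_TOKEN is not set")
    for module, pip_name in PACKAGES.items():
        # find_spec returns None when a top-level module cannot be located.
        if importlib.util.find_spec(module) is None:
            problems.append(f"optional package {pip_name} is not installed")
    return problems


for problem in missing_prerequisites():
    print("warning:", problem)
```

Only the packages for the client you actually use need to be installed, so the package messages are warnings, not errors.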