Responses API
The Stateful Responses API is a middleware layer that sits between your application and an OpenAI Responses API-compatible inference backend. It implements the OpenAI Responses API format and adds:
-
Conversation persistence: Responses are stored and linked via
previous_response_id, so the server reconstructs full conversation history automatically. -
Instructions inheritance: System prompts carry forward across turns without resending them.
-
Streaming: Real-time token delivery via Server-Sent Events (SSE).
-
Tool calling: Client-executed function tools and server-executed MCP (Model Context Protocol) tools.
-
Async jobs: Fire-and-forget background processing for long-running requests.
-
Guardrails: Input safety checks via LlamaGuard integration.
API Compatibility
The API follows the OpenAI Responses API specification. You can use the OpenAI Python SDK, PydanticAI, LangGraph (via langchain-openai), or plain HTTP/curl to interact with it, just point the client to your deployment URL.
Quick Example
-
curl
-
Python (OpenAI SDK)
-
Python (PydanticAI)
-
Python (LangGraph)
curl -X POST https://your-deployment-url/v1/responses \
-H "Content-Type: application/json" \
-H "Authorization: Bearer $AA_TOKEN" \
-d '{
"model": "qwen3-32b-tool",
"input": "What is the capital of Germany?"
}'
from openai import OpenAI
client = OpenAI(
base_url="https://your-deployment-url/v1",
api_key=AA_TOKEN,
)
response = client.responses.create(
model="qwen3-32b-tool",
input="What is the capital of Germany?",
)
print(response.output_text)
import asyncio
from pydantic_ai import Agent
from pydantic_ai.models.openai import OpenAIResponsesModel
from pydantic_ai.providers.openai import OpenAIProvider
provider = OpenAIProvider(
base_url="https://your-deployment-url/v1",
api_key=AA_TOKEN,
)
agent = Agent(
model=OpenAIResponsesModel("qwen3-32b-tool", provider=provider),
system_prompt="You are a helpful assistant.",
)
async def main():
result = await agent.run("What is the capital of Germany?")
print(result.output)
asyncio.run(main())
from typing import Annotated, TypedDict
from langchain_core.messages import BaseMessage, HumanMessage
from langchain_openai import ChatOpenAI
from langgraph.graph import StateGraph
from langgraph.graph.message import add_messages
llm = ChatOpenAI(
model="qwen3-32b-tool",
base_url="https://your-deployment-url/v1",
api_key=AA_TOKEN,
use_responses_api=True,
)
class State(TypedDict):
messages: Annotated[list[BaseMessage], add_messages]
previous_response_id: str | None
def chat(state: State) -> dict:
kwargs = {}
if pid := state.get("previous_response_id"):
kwargs["previous_response_id"] = pid
ai = llm.invoke(state["messages"], **kwargs)
return {
"messages": [ai],
"previous_response_id": ai.response_metadata.get("id"),
}
graph = (
StateGraph(State)
.add_node("chat", chat)
.set_entry_point("chat")
.set_finish_point("chat")
.compile()
)
result = graph.invoke({
"messages": [HumanMessage("What is the capital of Germany?")],
"previous_response_id": None,
})
print(result["messages"][-1].text)