Getting Started
This guide walks you through your first interactions with the Responses API: checking the service health, making a request, and customizing model behavior with instructions.
Client Setup
All examples use these values; replace them with your deployment details:
BASE_URL = https://your-deployment-url
MODEL = qwen3-32b-tool
AA_TOKEN = <your token>
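One way to make these values available to the Python examples is to read them from environment variables. This is only a minimal sketch: the `load_settings` helper and the `MODEL` environment variable are assumptions introduced for illustration, while `BASE_URL` and `AA_TOKEN` match the names used throughout this guide.

```python
import os


def load_settings(env=None):
    """Read deployment settings from a mapping (defaults to os.environ).

    load_settings is a hypothetical helper, not part of any SDK.
    """
    env = os.environ if env is None else env
    missing = [name for name in ("BASE_URL", "AA_TOKEN") if not env.get(name)]
    if missing:
        raise RuntimeError(f"Missing required settings: {', '.join(missing)}")
    return {
        # Strip a trailing slash so f"{BASE_URL}/v1" never yields "//v1".
        "BASE_URL": env["BASE_URL"].rstrip("/"),
        "AA_TOKEN": env["AA_TOKEN"],
        # MODEL (an assumed variable name) falls back to the model used in this guide.
        "MODEL": env.get("MODEL", "qwen3-32b-tool"),
    }


# Explicit mapping so the sketch runs without real credentials.
settings = load_settings(
    {"BASE_URL": "https://your-deployment-url/", "AA_TOKEN": "your-token"}
)
BASE_URL, AA_TOKEN, MODEL = settings["BASE_URL"], settings["AA_TOKEN"], settings["MODEL"]
```

In a real script, calling `load_settings()` with no argument reads from the process environment instead.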
curl
No setup needed; just use the headers in each request:
export BASE_URL="https://your-deployment-url"
export AA_TOKEN="your-token"

Python (OpenAI SDK)
from openai import OpenAI
client = OpenAI(
    base_url=f"{BASE_URL}/v1",
    api_key=AA_TOKEN,
)

Python (PydanticAI)
PydanticAI uses async/await. Wrap calls in async def main() and asyncio.run(main()), or run them in a Jupyter notebook or an async framework.
import httpx
from pydantic_ai import Agent
from pydantic_ai.models.openai import OpenAIResponsesModel
from pydantic_ai.providers.openai import OpenAIProvider
provider = OpenAIProvider(
    base_url=f"{BASE_URL}/v1",
    api_key=AA_TOKEN,
)

Python (LangGraph)
The LangGraph examples use langchain_openai.ChatOpenAI(use_responses_api=True), which speaks the Responses API directly. Build the LLM once and reuse it across graphs.
from langchain_openai import ChatOpenAI
llm = ChatOpenAI(
    model="qwen3-32b-tool",
    base_url=f"{BASE_URL}/v1",
    api_key=AA_TOKEN,
    use_responses_api=True,
)
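The PydanticAI note above asks for calls to be wrapped in `async def main()` and `asyncio.run(main())`. A minimal sketch of that pattern follows; `fake_agent_run` is a hypothetical stand-in for `agent.run(...)` so the snippet runs without a deployment.

```python
import asyncio


async def fake_agent_run(prompt: str) -> str:
    """Hypothetical stand-in for `agent.run(prompt)`; no network call is made."""
    await asyncio.sleep(0)  # yield to the event loop, as a real awaitable would
    return f"(stubbed answer to: {prompt})"


async def main() -> str:
    # In a real script this line would await agent.run(...) instead.
    return await fake_agent_run("What is the capital of Germany?")


# asyncio.run creates an event loop, runs main() to completion, then closes the loop.
answer = asyncio.run(main())
print(answer)
```

In a Jupyter notebook an event loop is already running, so there you would `await main()` directly rather than call `asyncio.run`.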
1. Health Check
Verify the API is reachable before making LLM requests.
curl
curl $BASE_URL/health \
  -H "Authorization: Bearer $AA_TOKEN"

Python
import httpx
response = httpx.get(
    f"{BASE_URL}/health",
    headers={"Authorization": f"Bearer {AA_TOKEN}"},
)
print(response.json())
Response:
{
"status": "healthy"
}
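A script can gate on this payload before sending LLM requests. The sketch below is one possible approach: `is_healthy` and `wait_until_healthy` are hypothetical helpers, and treating only `"healthy"` as success is an assumption based on the response shown above. The `fetch` argument is any callable that returns the parsed JSON, for example a function wrapping the `httpx.get` call.

```python
import time
from typing import Callable


def is_healthy(payload: dict) -> bool:
    """True only for the payload shape shown above: {"status": "healthy"}."""
    return payload.get("status") == "healthy"


def wait_until_healthy(fetch: Callable[[], dict], attempts: int = 3, delay: float = 0.0) -> bool:
    """Poll `fetch` up to `attempts` times; return True on the first healthy payload."""
    for attempt in range(attempts):
        try:
            if is_healthy(fetch()):
                return True
        except Exception:
            pass  # treat transport or parsing errors as "not healthy yet"
        if attempt < attempts - 1:
            time.sleep(delay)
    return False


# Stubbed fetcher: the first reply uses a made-up status value, the second is healthy.
_replies = iter([{"status": "starting"}, {"status": "healthy"}])
print(wait_until_healthy(lambda: next(_replies)))  # True
```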
2. Making Your First Request
The POST /v1/responses endpoint is the core of the API. At minimum you need model and input.
curl
curl -X POST $BASE_URL/v1/responses \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $AA_TOKEN" \
  -d '{
    "model": "qwen3-32b-tool",
    "input": "What is the capital of Germany?"
  }'

Python (OpenAI SDK)
response = client.responses.create(
    model="qwen3-32b-tool",
    input="What is the capital of Germany?",
)
print(response.id)           # "resp_abc123..."
print(response.status)       # "completed"
print(response.output_text)  # "The capital of Germany is Berlin."

Python (PydanticAI)
agent = Agent(
    model=OpenAIResponsesModel("qwen3-32b-tool", provider=provider),
    system_prompt="You are a helpful assistant.",
)
result = await agent.run("What is the capital of Germany?")
print(result.output)  # "The capital of Germany is Berlin."

Python (LangGraph)
Build a one-node graph whose state carries the message history plus the last response id, so chaining via previous_response_id is just another field in the state.
from typing import Annotated, TypedDict
from langchain_core.messages import BaseMessage, HumanMessage
from langgraph.graph import StateGraph
from langgraph.graph.message import add_messages

class State(TypedDict):
    messages: Annotated[list[BaseMessage], add_messages]
    previous_response_id: str | None

def chat(state: State) -> dict:
    kwargs = {}
    if pid := state.get("previous_response_id"):
        kwargs["previous_response_id"] = pid
    ai = llm.invoke(state["messages"], **kwargs)
    return {
        "messages": [ai],
        "previous_response_id": ai.response_metadata.get("id"),
    }

graph = (
    StateGraph(State)
    .add_node("chat", chat)
    .set_entry_point("chat")
    .set_finish_point("chat")
    .compile()
)
result = graph.invoke({
    "messages": [HumanMessage("What is the capital of Germany?")],
    "previous_response_id": None,
})
print(result["messages"][-1].text)     # "The capital of Germany is Berlin."
print(result["previous_response_id"])  # "resp_abc123...", pass to the next turn
Understanding the Response
The response object contains:
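| Field | Description |
|---|---|
| `id` | Unique identifier (e.g. `resp_abc123`). |
| `object` | Always `response`. |
| `created_at` | Unix timestamp of creation. |
| `model` | The model that generated the response. |
| `status` | Response status (e.g. `completed`). |
| `output` | Array of output items; may contain `message` items such as the one in the example response. |
| `usage` | Token counts: `input_tokens`, `output_tokens`, `total_tokens`. |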
Example response (JSON):
{
"id": "resp_abc123",
"object": "response",
"created_at": 1711000000,
"model": "qwen3-32b-tool",
"status": "completed",
"output": [
{
"type": "message",
"role": "assistant",
"content": [
{
"type": "output_text",
"text": "The capital of Germany is Berlin."
}
]
}
],
"usage": {
"input_tokens": 12,
"output_tokens": 8,
"total_tokens": 20
}
}
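The OpenAI SDK exposes the answer as `response.output_text`. When working with the raw JSON (for example from curl), the same text can be collected by walking `output`. A minimal sketch, assuming the response has the shape of the example above; `collect_output_text` is a hypothetical helper.

```python
def collect_output_text(response: dict) -> str:
    """Concatenate every output_text part found in message items of `output`."""
    parts = []
    for item in response.get("output", []):
        if item.get("type") != "message":
            continue  # skip any non-message output items
        for part in item.get("content", []):
            if part.get("type") == "output_text":
                parts.append(part.get("text", ""))
    return "".join(parts)


# Trimmed copy of the example response above.
example = {
    "id": "resp_abc123",
    "status": "completed",
    "output": [
        {
            "type": "message",
            "role": "assistant",
            "content": [
                {"type": "output_text", "text": "The capital of Germany is Berlin."}
            ],
        }
    ],
    "usage": {"input_tokens": 12, "output_tokens": 8, "total_tokens": 20},
}

print(collect_output_text(example))  # The capital of Germany is Berlin.

usage = example["usage"]
# In this example, total_tokens equals input_tokens + output_tokens (12 + 8 = 20).
print(usage["input_tokens"] + usage["output_tokens"] == usage["total_tokens"])  # True
```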
3. Adding Instructions (System Prompt)
The instructions field sets a system prompt that guides the model’s behavior: its persona, output format, or constraints.
curl
curl -X POST $BASE_URL/v1/responses \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $AA_TOKEN" \
  -d '{
    "model": "qwen3-32b-tool",
    "input": "Explain what a neural network is.",
    "instructions": "You are a helpful assistant that explains concepts in simple terms, using at most 2 sentences."
  }'

Python (OpenAI SDK)
response = client.responses.create(
    model="qwen3-32b-tool",
    input="Explain what a neural network is.",
    instructions="You are a helpful assistant that explains concepts in simple terms, using at most 2 sentences.",
)
print(response.output_text)

Python (PydanticAI)
agent = Agent(
    model=OpenAIResponsesModel("qwen3-32b-tool", provider=provider),
    system_prompt="You are a helpful assistant that explains concepts in simple terms, using at most 2 sentences.",
)
result = await agent.run("Explain what a neural network is.")
print(result.output)

Python (LangGraph)
Pass a SystemMessage first in the state and the server lifts it into the instructions field for you.
from langchain_core.messages import HumanMessage, SystemMessage
result = graph.invoke({
    "messages": [
        SystemMessage(
            "You are a helpful assistant that explains concepts in simple terms, "
            "using at most 2 sentences."
        ),
        HumanMessage("Explain what a neural network is."),
    ],
    "previous_response_id": None,
})
print(result["messages"][-1].text)
Instructions are inherited automatically when you continue a conversation with previous_response_id; you don’t need to resend them on every turn. See Conversations for details.
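As a sketch of what this means for request bodies, the helper below builds the JSON payload for each turn: the first turn carries `instructions`, and later turns carry only `previous_response_id`. `build_turn` is a hypothetical helper, the follow-up prompt is made up, and the response id is the placeholder value from the example response rather than a real one.

```python
import json
from typing import Optional


def build_turn(model: str, user_input: str, *,
               instructions: Optional[str] = None,
               previous_response_id: Optional[str] = None) -> dict:
    """Build a POST /v1/responses body, omitting optional fields that are unset."""
    payload = {"model": model, "input": user_input}
    if instructions is not None:
        payload["instructions"] = instructions
    if previous_response_id is not None:
        payload["previous_response_id"] = previous_response_id
    return payload


first = build_turn(
    "qwen3-32b-tool",
    "Explain what a neural network is.",
    instructions="You are a helpful assistant that explains concepts in simple terms.",
)
# "resp_abc123" stands in for the id returned by the first request.
second = build_turn(
    "qwen3-32b-tool",
    "Give me an everyday example.",
    previous_response_id="resp_abc123",
)

print(json.dumps(second, indent=2))
```

The second payload contains no `instructions` key; per the note above, the server is expected to carry them over from the chained response.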
4. Structured Input
Instead of a plain string, you can pass structured input as an array of message objects:
curl
curl -X POST $BASE_URL/v1/responses \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $AA_TOKEN" \
  -d '{
    "model": "qwen3-32b-tool",
    "input": [
      {"type": "message", "role": "user", "content": "What is 5 + 3?"}
    ]
  }'

Python (OpenAI SDK)
response = client.responses.create(
    model="qwen3-32b-tool",
    input=[
        {"type": "message", "role": "user", "content": "What is 5 + 3?"},
    ],
)
print(response.output_text)  # "8"

Python (LangGraph)
LangGraph already speaks message objects. Pass typed message instances or plain dicts in the same shape:
result = graph.invoke({
    "messages": [
        {"type": "message", "role": "user", "content": "What is 5 + 3?"},
    ],
    "previous_response_id": None,
})
print(result["messages"][-1].text)  # "8"
This is useful when you need to pass specific message types like function_call_output for tool calling flows. See Tool Calling for examples.
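When a client starts with plain strings and later needs to append other item types, it can help to normalize everything to the array form first. The sketch below does that; `to_input_items` is a hypothetical helper, and treating a plain string as equivalent to a single `user` message is an assumption inferred from the examples in this guide, not confirmed server behavior.

```python
from typing import Union


def to_input_items(user_input: Union[str, list]) -> list:
    """Return `input` in array form; a string becomes a single user message item."""
    if isinstance(user_input, str):
        return [{"type": "message", "role": "user", "content": user_input}]
    # Already structured: return a shallow copy so callers can append safely.
    return list(user_input)


items = to_input_items("What is 5 + 3?")
print(items)
```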
Request Parameters Reference
| Parameter | Type | Required | Description |
|---|---|---|---|
| `model` | string | Yes | The LLM model to use (e.g. `qwen3-32b-tool`) |
| `input` | string or array | Yes | User prompt: plain string or structured input items |
| `instructions` | string | No | System prompt / instructions for the model |
| `previous_response_id` | string | No | Chain onto a previous response for multi-turn conversations |
| … | boolean | No | Enable SSE streaming (default: …) |
| … | boolean | No | Whether to persist the response for later retrieval (default: …) |
| … | object | No | Key-value string pairs for tagging responses (max 16 keys, 64-char keys, 512-char values) |
| … | string or object | No | Group this response into a conversation by ID, accepts … |
| … | number | No | Sampling temperature (0.0–2.0) |
| … | number | No | Nucleus sampling parameter (0.0–1.0) |
| … | integer | No | Maximum tokens to generate |
| … | array of strings | No | Up to 4 sequences where the model will stop generating |
| … | array | No | Function or MCP tool definitions |
| … | string | No | … |
| … | boolean | No | Whether the model may call multiple tools in parallel (default: …) |
| … | integer | No | Maximum number of tool calls the model may make |
| … | boolean | No | Run as async job (default: …) |
| … | string | No | Controls how input is truncated when exceeding the context window |
| … | object | No | Configuration for reasoning / chain-of-thought behavior |
| … | object | No | Configuration for text output (e.g. format constraints) |
| … | number | No | Penalizes tokens based on whether they appear in the text so far (-2.0–2.0) |
| … | number | No | Penalizes tokens based on their frequency in the text so far (-2.0–2.0) |
| … | integer | No | Number of most likely tokens to return at each position (0–20) |
| … | array of strings | No | Extra data to include in the response (e.g. …) |
| … | string | No | The service tier to use for this request |
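Several limits in the table can be checked on the client before a request is sent. The sketch below covers the tagging pairs (max 16 keys, 64-character keys, 512-character values) and the sampling temperature range (0.0–2.0). `check_tags` and `check_temperature` are hypothetical helpers, and the server remains the authority on what it accepts.

```python
def check_tags(tags: dict) -> list:
    """Return a list of limit violations for key-value tagging pairs (empty if valid)."""
    problems = []
    if len(tags) > 16:
        problems.append(f"too many keys: {len(tags)} > 16")
    for key, value in tags.items():
        if not isinstance(key, str) or not isinstance(value, str):
            problems.append(f"non-string pair: {key!r}")
            continue
        if len(key) > 64:
            problems.append(f"key longer than 64 chars: {key[:10]}...")
        if len(value) > 512:
            problems.append(f"value longer than 512 chars for key {key!r}")
    return problems


def check_temperature(value: float) -> bool:
    """True when the sampling temperature lies in the documented 0.0-2.0 range."""
    return 0.0 <= value <= 2.0


print(check_tags({"team": "docs", "env": "staging"}))  # []
print(check_tags({"k" * 65: "v"}))                     # one key-length violation
print(check_temperature(2.5))                          # False
```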