Getting Started

This guide walks you through your first interactions with the Responses API: checking the service health, making a request, and customizing model behavior with instructions.

Client Setup

All examples use these values; replace them with your deployment details:

BASE_URL = https://your-deployment-url
MODEL    = qwen3-32b-tool
AA_TOKEN = <your token>
  • curl

No client setup is needed; export the values once and send the Authorization header with each request:

export BASE_URL="https://your-deployment-url"
export AA_TOKEN="your-token"

  • Python (OpenAI SDK)

from openai import OpenAI

client = OpenAI(
    base_url=f"{BASE_URL}/v1",
    api_key=AA_TOKEN,
)

  • Python (PydanticAI)

PydanticAI uses async/await. Wrap calls in an async def main() and run it with asyncio.run(main()), or run them in a Jupyter notebook or another async framework.

import httpx
from pydantic_ai import Agent
from pydantic_ai.models.openai import OpenAIResponsesModel
from pydantic_ai.providers.openai import OpenAIProvider

provider = OpenAIProvider(
    base_url=f"{BASE_URL}/v1",
    api_key=AA_TOKEN,
)

  • Python (LangGraph)

The LangGraph examples use langchain_openai.ChatOpenAI(use_responses_api=True), which speaks the Responses API directly. Build the LLM once and reuse it across graphs.

from langchain_openai import ChatOpenAI

llm = ChatOpenAI(
    model="qwen3-32b-tool",
    base_url=f"{BASE_URL}/v1",
    api_key=AA_TOKEN,
    use_responses_api=True,
)
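
The Python snippets above assume BASE_URL and AA_TOKEN already exist as Python variables. One way to provide them, sketched here as an assumption rather than a required setup, is to read the same environment variables that the curl tab exports. The load_settings helper and the Settings type are hypothetical names introduced for illustration; MODEL falls back to the example model used in this guide.

```python
import os
from typing import Mapping, NamedTuple


class Settings(NamedTuple):
    base_url: str
    token: str
    model: str


def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    """Read deployment details from environment variables (hypothetical helper)."""
    missing = [name for name in ("BASE_URL", "AA_TOKEN") if not env.get(name)]
    if missing:
        raise RuntimeError(f"Missing environment variables: {', '.join(missing)}")
    return Settings(
        base_url=env["BASE_URL"].rstrip("/"),      # avoid "//v1" when joining paths
        token=env["AA_TOKEN"],
        model=env.get("MODEL", "qwen3-32b-tool"),  # example model from this guide
    )
```

Calling BASE_URL, AA_TOKEN, MODEL = load_settings() before the client setup code makes the remaining examples runnable as written.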

1. Health Check

Verify the API is reachable before making LLM requests.

  • curl

curl "$BASE_URL/health" \
  -H "Authorization: Bearer $AA_TOKEN"

  • Python

import httpx

response = httpx.get(
    f"{BASE_URL}/health",
    headers={"Authorization": f"Bearer {AA_TOKEN}"},
)
print(response.json())

Response:

{
  "status": "healthy"
}
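
If the service may still be starting up, a small retry loop around the health check can help. The sketch below is one possible approach, not part of the API: wait_until_healthy is a hypothetical helper that takes any zero-argument callable returning the parsed JSON body, so it can be exercised without a network connection. It only assumes the {"status": "healthy"} body shown above.

```python
import time
from typing import Callable


def wait_until_healthy(
    fetch_status: Callable[[], dict],
    attempts: int = 5,
    delay_seconds: float = 1.0,
) -> bool:
    """Return True as soon as fetch_status() reports {"status": "healthy"}."""
    for attempt in range(attempts):
        try:
            if fetch_status().get("status") == "healthy":
                return True
        except Exception:
            pass  # treat connection or parsing errors like an unhealthy response
        if attempt < attempts - 1:
            time.sleep(delay_seconds)  # wait before the next attempt
    return False
```

With httpx, fetch_status could be a lambda that calls httpx.get on f"{BASE_URL}/health" with the Authorization header and returns .json().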

2. Making Your First Request

The POST /v1/responses endpoint is the core of the API. At minimum you need model and input.

  • curl

curl -X POST "$BASE_URL/v1/responses" \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $AA_TOKEN" \
  -d '{
    "model": "qwen3-32b-tool",
    "input": "What is the capital of Germany?"
  }'

  • Python (OpenAI SDK)

response = client.responses.create(
    model="qwen3-32b-tool",
    input="What is the capital of Germany?",
)

print(response.id)           # "resp_abc123..."
print(response.status)       # "completed"
print(response.output_text)  # "The capital of Germany is Berlin."

  • Python (PydanticAI)

agent = Agent(
    model=OpenAIResponsesModel("qwen3-32b-tool", provider=provider),
    system_prompt="You are a helpful assistant.",
)

result = await agent.run("What is the capital of Germany?")
print(result.output)  # "The capital of Germany is Berlin."

  • Python (LangGraph)

Build a one-node graph whose state carries the message history plus the last response id, so chaining via previous_response_id is just another field in the state.

from typing import Annotated, TypedDict

from langchain_core.messages import BaseMessage, HumanMessage
from langgraph.graph import StateGraph
from langgraph.graph.message import add_messages

class State(TypedDict):
    messages: Annotated[list[BaseMessage], add_messages]
    previous_response_id: str | None

def chat(state: State) -> dict:
    kwargs = {}
    if pid := state.get("previous_response_id"):
        kwargs["previous_response_id"] = pid
    ai = llm.invoke(state["messages"], **kwargs)
    return {
        "messages": [ai],
        "previous_response_id": ai.response_metadata.get("id"),
    }

graph = (
    StateGraph(State)
    .add_node("chat", chat)
    .set_entry_point("chat")
    .set_finish_point("chat")
    .compile()
)

result = graph.invoke({
    "messages": [HumanMessage("What is the capital of Germany?")],
    "previous_response_id": None,
})
print(result["messages"][-1].text)  # "The capital of Germany is Berlin."
print(result["previous_response_id"])  # "resp_abc123...", pass to the next turn

Understanding the Response

The response object contains:

| Field | Description |
|---|---|
| id | Unique identifier (e.g. resp_abc123). Use this as previous_response_id to continue the conversation. |
| object | Always "response". |
| created_at | Unix timestamp of creation. |
| model | The model that generated the response. |
| status | "completed", "in_progress", or "incomplete". |
| output | Array of output items, may contain reasoning (chain-of-thought) and message blocks. |
| usage | Token counts: input_tokens, output_tokens, total_tokens. |

Example response (JSON):

{
  "id": "resp_abc123",
  "object": "response",
  "created_at": 1711000000,
  "model": "qwen3-32b-tool",
  "status": "completed",
  "output": [
    {
      "type": "message",
      "role": "assistant",
      "content": [
        {
          "type": "output_text",
          "text": "The capital of Germany is Berlin."
        }
      ]
    }
  ],
  "usage": {
    "input_tokens": 12,
    "output_tokens": 8,
    "total_tokens": 20
  }
}
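
When working with the raw JSON (for example from curl or httpx) rather than the SDK's output_text convenience property, the answer text has to be collected from the output array. A minimal sketch follows, assuming items other than messages (such as reasoning items, assumed here to carry "type": "reasoning") should be skipped; extract_text is a hypothetical helper name.

```python
def extract_text(response: dict) -> str:
    """Concatenate all output_text parts from message items in a response."""
    parts = []
    for item in response.get("output", []):
        if item.get("type") != "message":
            continue  # skip non-message items, e.g. reasoning
        for block in item.get("content", []):
            if block.get("type") == "output_text":
                parts.append(block.get("text", ""))
    return "".join(parts)
```

Applied to the example response above, this returns the single output_text string from the assistant message.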

3. Adding Instructions (System Prompt)

The instructions field sets a system prompt that guides the model’s behavior: its persona, output format, or constraints.

  • curl

curl -X POST "$BASE_URL/v1/responses" \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $AA_TOKEN" \
  -d '{
    "model": "qwen3-32b-tool",
    "input": "Explain what a neural network is.",
    "instructions": "You are a helpful assistant that explains concepts in simple terms, using at most 2 sentences."
  }'

  • Python (OpenAI SDK)

response = client.responses.create(
    model="qwen3-32b-tool",
    input="Explain what a neural network is.",
    instructions="You are a helpful assistant that explains concepts in simple terms, using at most 2 sentences.",
)

print(response.output_text)

  • Python (PydanticAI)

agent = Agent(
    model=OpenAIResponsesModel("qwen3-32b-tool", provider=provider),
    system_prompt="You are a helpful assistant that explains concepts in simple terms, using at most 2 sentences.",
)

result = await agent.run("Explain what a neural network is.")
print(result.output)

  • Python (LangGraph)

Pass a SystemMessage first in the state and the server lifts it into the instructions field for you.

from langchain_core.messages import HumanMessage, SystemMessage

result = graph.invoke({
    "messages": [
        SystemMessage(
            "You are a helpful assistant that explains concepts in simple terms, "
            "using at most 2 sentences."
        ),
        HumanMessage("Explain what a neural network is."),
    ],
    "previous_response_id": None,
})
print(result["messages"][-1].text)

Instructions are inherited automatically when you continue a conversation with previous_response_id; you don’t need to resend them on every turn. See Conversations for details.
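
To make the inheritance concrete, the sketch below builds the JSON bodies for a first turn and a follow-up turn. build_request is a hypothetical helper, and "resp_abc123" stands in for whatever id the first response actually returned. The follow-up body carries previous_response_id and leaves out instructions, relying on the inheritance described above.

```python
import json
from typing import Optional


def build_request(
    user_input: str,
    model: str = "qwen3-32b-tool",
    instructions: Optional[str] = None,
    previous_response_id: Optional[str] = None,
) -> dict:
    """Build a POST /v1/responses body, omitting optional fields that are unset."""
    body = {"model": model, "input": user_input}
    if instructions is not None:
        body["instructions"] = instructions
    if previous_response_id is not None:
        body["previous_response_id"] = previous_response_id
    return body


# First turn: instructions are sent explicitly.
first = build_request(
    "Explain what a neural network is.",
    instructions="Explain concepts in simple terms, using at most 2 sentences.",
)

# Follow-up turn: only the new input and the id of the previous response.
follow_up = build_request("Give an example.", previous_response_id="resp_abc123")
print(json.dumps(follow_up, indent=2))
```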

4. Structured Input

Instead of a plain string, you can pass structured input as an array of message objects:

  • curl

curl -X POST "$BASE_URL/v1/responses" \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $AA_TOKEN" \
  -d '{
    "model": "qwen3-32b-tool",
    "input": [
      {"type": "message", "role": "user", "content": "What is 5 + 3?"}
    ]
  }'

  • Python (OpenAI SDK)

response = client.responses.create(
    model="qwen3-32b-tool",
    input=[
        {"type": "message", "role": "user", "content": "What is 5 + 3?"},
    ],
)

print(response.output_text)  # "8"

  • Python (LangGraph)

LangGraph already speaks message objects. Pass typed message instances or plain dicts in the same shape:

result = graph.invoke({
    "messages": [
        {"type": "message", "role": "user", "content": "What is 5 + 3?"},
    ],
    "previous_response_id": None,
})

print(result["messages"][-1].text)  # "8"

This is useful when you need to pass specific message types like function_call_output for tool calling flows. See Tool Calling for examples.
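
As a rough illustration of such an item, the sketch below wraps a tool result as a structured input list. It assumes this deployment accepts the same function_call_output shape as the OpenAI Responses API (type, call_id, and a string output); the tool_result_input helper and the "call_123" id are hypothetical, and the Tool Calling page remains the authoritative description.

```python
import json


def tool_result_input(call_id: str, result: object) -> list:
    """Wrap a tool result as a structured input list with one function_call_output item."""
    return [
        {
            "type": "function_call_output",
            "call_id": call_id,            # should match the call_id of the model's function call
            "output": json.dumps(result),  # the tool result is serialized to a string
        }
    ]


items = tool_result_input("call_123", {"sum": 8})
print(json.dumps(items, indent=2))
```

The resulting list would be passed as input, typically together with previous_response_id so the server can match it to the pending function call.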

Request Parameters Reference

| Parameter | Type | Required | Description |
|---|---|---|---|
| model | string | Yes | The LLM model to use (e.g. qwen3-32b-tool) |
| input | string or array | Yes | User prompt: plain string or structured input items |
| instructions | string | No | System prompt / instructions for the model |
| previous_response_id | string | No | Chain onto a previous response for multi-turn conversations |
| stream | boolean | No | Enable SSE streaming (default: false) |
| store | boolean | No | Whether to persist the response for later retrieval (default: true). See Opting Out of Storage. |
| metadata | object | No | Key-value string pairs for tagging responses (max 16 keys, 64-char keys, 512-char values) |
| conversation | string or object | No | Group this response into a conversation by ID, accepts "conv_id" or {"id": "conv_id"} |
| temperature | number | No | Sampling temperature (0.0–2.0) |
| top_p | number | No | Nucleus sampling parameter (0.0–1.0) |
| max_output_tokens | integer | No | Maximum tokens to generate |
| stop | array of strings | No | Up to 4 sequences where the model will stop generating |
| tools | array | No | Function or MCP tool definitions |
| tool_choice | string | No | "auto", "required", or "none" |
| parallel_tool_calls | boolean | No | Whether the model may call multiple tools in parallel (default: true) |
| max_tool_calls | integer | No | Maximum number of tool calls the model may make |
| background | boolean | No | Run as async job (default: false) |
| truncation | string | No | Controls how input is truncated when exceeding the context window |
| reasoning | object | No | Configuration for reasoning / chain-of-thought behavior |
| text | object | No | Configuration for text output (e.g. format constraints) |
| presence_penalty | number | No | Penalizes tokens based on whether they appear in the text so far (-2.0–2.0) |
| frequency_penalty | number | No | Penalizes tokens based on their frequency in the text so far (-2.0–2.0) |
| top_logprobs | integer | No | Number of most likely tokens to return at each position (0–20) |
| include | array of strings | No | Extra data to include in the response (e.g. "message.output_text.logprobs") |
| service_tier | string | No | The service tier to use for this request |
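
Some of the limits in this reference can be checked on the client before a request is sent. The sketch below covers only a few of them (required fields, metadata sizes, the number of stop sequences, and the temperature and top_p ranges). validate_request is a hypothetical helper, and the server remains the authority on what is accepted.

```python
def validate_request(body: dict) -> list:
    """Return a list of problems found in a request body (empty if none)."""
    problems = []
    for name in ("model", "input"):
        if name not in body:
            problems.append(f"missing required parameter: {name}")

    metadata = body.get("metadata") or {}
    if len(metadata) > 16:
        problems.append("metadata: more than 16 keys")
    for key, value in metadata.items():
        if not isinstance(key, str) or not isinstance(value, str):
            problems.append(f"metadata[{key!r}]: keys and values must be strings")
            continue
        if len(key) > 64:
            problems.append(f"metadata key starting {key[:10]!r}: longer than 64 characters")
        if len(value) > 512:
            problems.append(f"metadata[{key!r}]: value longer than 512 characters")

    if len(body.get("stop") or []) > 4:
        problems.append("stop: more than 4 sequences")

    # Range checks for the sampling parameters listed in the table.
    for name, low, high in (("temperature", 0.0, 2.0), ("top_p", 0.0, 1.0)):
        value = body.get(name)
        if value is not None and not low <= value <= high:
            problems.append(f"{name}: outside {low}-{high}")
    return problems
```

An empty list means none of the checked limits were violated; it does not guarantee that the server will accept the request.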