Getting Started

In this article:

Client Setup
1. Health Check
2. Making Your First Request
3. Adding Instructions (System Prompt)
4. Structured Input
Request Parameters Reference

This guide walks you through your first interactions with the Responses API: checking the service health, making a request, and customizing model behavior with instructions.

Client Setup

All examples use these values; replace them with your deployment details:

BASE_URL = https://your-deployment-url
MODEL    = qwen3-32b-tool
AA_TOKEN = <your token>
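
The Python examples in this guide refer to BASE_URL and AA_TOKEN as ordinary variables. One way to define them, assuming you exported the values as environment variables with the same names, is the minimal sketch below. The helper name load_settings is illustrative and not part of any SDK.

```python
import os


def load_settings(env=None):
    """Read deployment settings from environment-style variables.

    Assumes BASE_URL and AA_TOKEN are set; MODEL falls back to the
    model name used throughout this guide.
    """
    env = os.environ if env is None else env
    missing = [name for name in ("BASE_URL", "AA_TOKEN") if not env.get(name)]
    if missing:
        raise RuntimeError(f"Missing settings: {', '.join(missing)}")
    return {
        # Strip a trailing slash so f"{BASE_URL}/v1" never yields "//v1".
        "BASE_URL": env["BASE_URL"].rstrip("/"),
        "AA_TOKEN": env["AA_TOKEN"],
        "MODEL": env.get("MODEL", "qwen3-32b-tool"),
    }
```

Call settings = load_settings() once at startup, then assign BASE_URL = settings["BASE_URL"] and AA_TOKEN = settings["AA_TOKEN"] before running the snippets.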

The setup for each client follows in this order: curl, Python (OpenAI SDK), Python (PydanticAI), Python (LangGraph).

curl needs no client setup; export the values once and send the Authorization header with each request:

export BASE_URL="https://your-deployment-url"
export AA_TOKEN="your-token"

from openai import OpenAI

client = OpenAI(
    base_url=f"{BASE_URL}/v1",
    api_key=AA_TOKEN,
)

PydanticAI uses async/await. Wrap calls in an async def main() and run it with asyncio.run(main()), or run them in a Jupyter notebook or an async framework.

import httpx
from pydantic_ai import Agent
from pydantic_ai.models.openai import OpenAIResponsesModel
from pydantic_ai.providers.openai import OpenAIProvider

provider = OpenAIProvider(
    base_url=f"{BASE_URL}/v1",
    api_key=AA_TOKEN,
)
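
The async wrapping pattern for a plain script can be sketched as follows. To keep the example runnable without a deployment, fake_agent_run is a hypothetical stand-in for await agent.run(...); it is not part of PydanticAI.

```python
import asyncio


async def fake_agent_run(prompt: str) -> str:
    """Hypothetical stand-in for `await agent.run(prompt)`; makes no network call."""
    await asyncio.sleep(0)  # yield to the event loop, as a real async call would
    return f"(model reply to: {prompt})"


async def main() -> str:
    # In real code this line would be: result = await agent.run("...")
    reply = await fake_agent_run("What is the capital of Germany?")
    print(reply)
    return reply


if __name__ == "__main__":
    asyncio.run(main())
```

In a notebook or an async framework an event loop is already running, so you would await main() directly instead of calling asyncio.run.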

The LangGraph examples use langchain_openai.ChatOpenAI(use_responses_api=True), which speaks the Responses API directly. Build the LLM once and reuse it across graphs.

from langchain_openai import ChatOpenAI

llm = ChatOpenAI(
    model="qwen3-32b-tool",
    base_url=f"{BASE_URL}/v1",
    api_key=AA_TOKEN,
    use_responses_api=True,
)

1. Health Check

Verify the API is reachable before making LLM requests.

The examples below appear in this order: curl, Python.

curl $BASE_URL/health \
  -H "Authorization: Bearer $AA_TOKEN"

import httpx

response = httpx.get(
    f"{BASE_URL}/health",
    headers={"Authorization": f"Bearer {AA_TOKEN}"},
)
print(response.json())

Response:

{
  "status": "healthy"
}
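
If your service may still be starting up, a small polling helper can wait for the healthy status before you send requests. This is a sketch, not part of the API: the helper names, the retry count, and the delay are assumptions, and fetch is any callable you supply that returns the parsed /health body.

```python
import time


def is_healthy(payload) -> bool:
    """Return True only for the documented {"status": "healthy"} body."""
    return isinstance(payload, dict) and payload.get("status") == "healthy"


def wait_until_healthy(fetch, attempts=5, delay=1.0, sleep=time.sleep) -> bool:
    """Call `fetch()` up to `attempts` times until it reports healthy.

    `fetch` is any zero-argument callable that returns the parsed JSON
    body, or raises if the service is unreachable.
    """
    for attempt in range(attempts):
        try:
            if is_healthy(fetch()):
                return True
        except Exception:
            pass  # treat connection errors like an unhealthy response
        if attempt < attempts - 1:
            sleep(delay)  # pause between attempts, but not after the last one
    return False
```

With httpx, fetch could be lambda: httpx.get(f"{BASE_URL}/health", headers={"Authorization": f"Bearer {AA_TOKEN}"}).json().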

2. Making Your First Request

The POST /v1/responses endpoint is the core of the API. At minimum you need model and input.

The examples below appear in this order: curl, Python (OpenAI SDK), Python (PydanticAI), Python (LangGraph).

curl -X POST $BASE_URL/v1/responses \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $AA_TOKEN" \
  -d '{
    "model": "qwen3-32b-tool",
    "input": "What is the capital of Germany?"
  }'

response = client.responses.create(
    model="qwen3-32b-tool",
    input="What is the capital of Germany?",
)

print(response.id)           # "resp_abc123..."
print(response.status)       # "completed"
print(response.output_text)  # "The capital of Germany is Berlin."

agent = Agent(
    model=OpenAIResponsesModel("qwen3-32b-tool", provider=provider),
    system_prompt="You are a helpful assistant.",
)

result = await agent.run("What is the capital of Germany?")
print(result.output)  # "The capital of Germany is Berlin."

Build a one-node graph whose state carries the message history plus the last response id, so chaining via previous_response_id is just another field in the state.

from typing import Annotated, TypedDict

from langchain_core.messages import BaseMessage, HumanMessage
from langgraph.graph import StateGraph
from langgraph.graph.message import add_messages

class State(TypedDict):
    messages: Annotated[list[BaseMessage], add_messages]
    previous_response_id: str | None

def chat(state: State) -> dict:
    kwargs = {}
    if pid := state.get("previous_response_id"):
        kwargs["previous_response_id"] = pid
    ai = llm.invoke(state["messages"], **kwargs)
    return {
        "messages": [ai],
        "previous_response_id": ai.response_metadata.get("id"),
    }

graph = (
    StateGraph(State)
    .add_node("chat", chat)
    .set_entry_point("chat")
    .set_finish_point("chat")
    .compile()
)

result = graph.invoke({
    "messages": [HumanMessage("What is the capital of Germany?")],
    "previous_response_id": None,
})
print(result["messages"][-1].text)  # "The capital of Germany is Berlin."
print(result["previous_response_id"])  # "resp_abc123...", pass to the next turn

Understanding the Response

The response object contains:

Field	Description
`id`	Unique identifier (e.g. `resp_abc123`). Use this as `previous_response_id` to continue the conversation.
`object`	Always `"response"`.
`created_at`	Unix timestamp of creation.
`model`	The model that generated the response.
`status`	`"completed"`, `"in_progress"`, or `"incomplete"`.
`output`	Array of output items; may contain `reasoning` (chain-of-thought) and `message` blocks.
`usage`	Token counts: `input_tokens`, `output_tokens`, `total_tokens`.

Example response (JSON):

{
  "id": "resp_abc123",
  "object": "response",
  "created_at": 1711000000,
  "model": "qwen3-32b-tool",
  "status": "completed",
  "output": [
    {
      "type": "message",
      "role": "assistant",
      "content": [
        {
          "type": "output_text",
          "text": "The capital of Germany is Berlin."
        }
      ]
    }
  ],
  "usage": {
    "input_tokens": 12,
    "output_tokens": 8,
    "total_tokens": 20
  }
}
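
When you work with the raw JSON rather than an SDK object, you can collect the answer text yourself by walking the output array. The sketch below does roughly what the OpenAI SDK's output_text convenience does; extract_text is an illustrative name, and the sample dict mirrors the example response above.

```python
def extract_text(response: dict) -> str:
    """Concatenate every output_text part found in the message items."""
    parts = []
    for item in response.get("output", []):
        if item.get("type") != "message":
            continue  # skip reasoning blocks and any other non-message items
        for part in item.get("content", []):
            if part.get("type") == "output_text":
                parts.append(part.get("text", ""))
    return "".join(parts)


example = {
    "id": "resp_abc123",
    "status": "completed",
    "output": [
        {
            "type": "message",
            "role": "assistant",
            "content": [
                {"type": "output_text", "text": "The capital of Germany is Berlin."}
            ],
        }
    ],
    "usage": {"input_tokens": 12, "output_tokens": 8, "total_tokens": 20},
}

print(extract_text(example))             # The capital of Germany is Berlin.
print(example["usage"]["total_tokens"])  # 20
```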

3. Adding Instructions (System Prompt)

The instructions field sets a system prompt that guides the model’s behavior: its persona, output format, or constraints.

The examples below appear in this order: curl, Python (OpenAI SDK), Python (PydanticAI), Python (LangGraph).

curl -X POST $BASE_URL/v1/responses \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $AA_TOKEN" \
  -d '{
    "model": "qwen3-32b-tool",
    "input": "Explain what a neural network is.",
    "instructions": "You are a helpful assistant that explains concepts in simple terms, using at most 2 sentences."
  }'

response = client.responses.create(
    model="qwen3-32b-tool",
    input="Explain what a neural network is.",
    instructions="You are a helpful assistant that explains concepts in simple terms, using at most 2 sentences.",
)

print(response.output_text)

agent = Agent(
    model=OpenAIResponsesModel("qwen3-32b-tool", provider=provider),
    system_prompt="You are a helpful assistant that explains concepts in simple terms, using at most 2 sentences.",
)

result = await agent.run("Explain what a neural network is.")
print(result.output)

Pass a SystemMessage first in the state and the server lifts it into the instructions field for you.

from langchain_core.messages import HumanMessage, SystemMessage

result = graph.invoke({
    "messages": [
        SystemMessage(
            "You are a helpful assistant that explains concepts in simple terms, "
            "using at most 2 sentences."
        ),
        HumanMessage("Explain what a neural network is."),
    ],
    "previous_response_id": None,
})
print(result["messages"][-1].text)

Instructions are inherited automatically when you continue a conversation with previous_response_id; you don’t need to resend them on every turn. See Conversations for details.
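
To illustrate the inheritance described above, the sketch below builds the JSON bodies for a first turn and a follow-up turn. The helper names are illustrative, and resp_abc123 is the placeholder id used elsewhere in this guide; in practice you would take the id from the first response.

```python
import json


def first_turn(model: str, user_input: str, instructions: str) -> dict:
    """Request body for the opening turn, which carries the instructions."""
    return {"model": model, "input": user_input, "instructions": instructions}


def follow_up(model: str, user_input: str, previous_response_id: str) -> dict:
    """Request body for a later turn.

    No "instructions" key is sent: per the note above, they are
    inherited through previous_response_id.
    """
    return {
        "model": model,
        "input": user_input,
        "previous_response_id": previous_response_id,
    }


turn_1 = first_turn(
    "qwen3-32b-tool",
    "Explain what a neural network is.",
    "You are a helpful assistant that explains concepts in simple terms, "
    "using at most 2 sentences.",
)
turn_2 = follow_up("qwen3-32b-tool", "And what is a layer?", "resp_abc123")

print(json.dumps(turn_1, indent=2))
print(json.dumps(turn_2, indent=2))
```

Either body can be sent with POST /v1/responses, for example as the -d payload in the curl command.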

4. Structured Input

Instead of a plain string, you can pass structured input as an array of message objects:

The examples below appear in this order: curl, Python (OpenAI SDK), Python (LangGraph).

curl -X POST $BASE_URL/v1/responses \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $AA_TOKEN" \
  -d '{
    "model": "qwen3-32b-tool",
    "input": [
      {"type": "message", "role": "user", "content": "What is 5 + 3?"}
    ]
  }'

response = client.responses.create(
    model="qwen3-32b-tool",
    input=[
        {"type": "message", "role": "user", "content": "What is 5 + 3?"},
    ],
)

print(response.output_text)  # "8"

LangGraph already speaks message objects. Pass typed message instances or plain dicts in the same shape:

result = graph.invoke({
    "messages": [
        {"type": "message", "role": "user", "content": "What is 5 + 3?"},
    ],
    "previous_response_id": None,
})

print(result["messages"][-1].text)  # "8"

This is useful when you need to pass specific message types like function_call_output for tool calling flows. See Tool Calling for examples.
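
As a rough sketch of what such a structured input can look like, the helpers below build a user message item and a function_call_output item. The call_id and output field names follow the OpenAI Responses API convention and are assumed to apply to this deployment as well; call_123 and the weather payload are made-up values. Treat the Tool Calling page as authoritative.

```python
import json


def user_message(text: str) -> dict:
    """A message item in the shape shown in the examples above."""
    return {"type": "message", "role": "user", "content": text}


def function_call_output(call_id: str, output: str) -> dict:
    """Result of a tool call, returned to the model as an input item.

    Field names are an assumption based on the OpenAI Responses API.
    """
    return {"type": "function_call_output", "call_id": call_id, "output": output}


request_body = {
    "model": "qwen3-32b-tool",
    # Continue the turn in which the model requested the tool call.
    "previous_response_id": "resp_abc123",
    "input": [
        function_call_output("call_123", json.dumps({"temperature_c": 21})),
    ],
}

print(json.dumps(request_body, indent=2))
```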

Request Parameters Reference

Parameter	Type	Required	Description
`model`	string	Yes	The LLM model to use (e.g. `qwen3-32b-tool`)
`input`	string or array	Yes	User prompt: plain string or structured input items
`instructions`	string	No	System prompt / instructions for the model
`previous_response_id`	string	No	Chain onto a previous response for multi-turn conversations
`stream`	boolean	No	Enable SSE streaming (default: `false`)
`store`	boolean	No	Whether to persist the response for later retrieval (default: `true`). See Opting Out of Storage.
`metadata`	object	No	Key-value string pairs for tagging responses (max 16 keys, 64-char keys, 512-char values)
`conversation`	string or object	No	Group this response into a conversation by ID; accepts `"conv_id"` or `{"id": "conv_id"}`
`temperature`	number	No	Sampling temperature (0.0–2.0)
`top_p`	number	No	Nucleus sampling parameter (0.0–1.0)
`max_output_tokens`	integer	No	Maximum tokens to generate
`stop`	array of strings	No	Up to 4 sequences where the model will stop generating
`tools`	array	No	Function or MCP tool definitions
`tool_choice`	string	No	`"auto"`, `"required"`, or `"none"`
`parallel_tool_calls`	boolean	No	Whether the model may call multiple tools in parallel (default: `true`)
`max_tool_calls`	integer	No	Maximum number of tool calls the model may make
`background`	boolean	No	Run as async job (default: `false`)
`truncation`	string	No	Controls how input is truncated when exceeding the context window
`reasoning`	object	No	Configuration for reasoning / chain-of-thought behavior
`text`	object	No	Configuration for text output (e.g. format constraints)
`presence_penalty`	number	No	Penalizes tokens based on whether they appear in the text so far (-2.0–2.0)
`frequency_penalty`	number	No	Penalizes tokens based on their frequency in the text so far (-2.0–2.0)
`top_logprobs`	integer	No	Number of most likely tokens to return at each position (0–20)
`include`	array of strings	No	Extra data to include in the response (e.g. `"message.output_text.logprobs"`)
`service_tier`	string	No	The service tier to use for this request
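
The limits in the table can be checked on the client before a request is sent. The validator below is a minimal sketch, not the server's own validation: validate_request is an illustrative name, the ranges are assumed to be inclusive, and only the limits stated in the table are covered.

```python
def validate_request(params: dict) -> list[str]:
    """Return a list of problems found in a request body (empty if none)."""
    errors = []
    for name in ("model", "input"):
        if name not in params:
            errors.append(f"{name} is required")

    # Numeric ranges from the table, treated as inclusive (an assumption).
    ranges = {
        "temperature": (0.0, 2.0),
        "top_p": (0.0, 1.0),
        "presence_penalty": (-2.0, 2.0),
        "frequency_penalty": (-2.0, 2.0),
        "top_logprobs": (0, 20),
    }
    for name, (low, high) in ranges.items():
        if name in params and not (low <= params[name] <= high):
            errors.append(f"{name} must be between {low} and {high}")

    if len(params.get("stop", [])) > 4:
        errors.append("stop accepts at most 4 sequences")

    metadata = params.get("metadata", {})
    if len(metadata) > 16:
        errors.append("metadata accepts at most 16 keys")
    for key, value in metadata.items():
        if len(key) > 64:
            errors.append(f"metadata key {key[:16]!r}... exceeds 64 characters")
        if len(str(value)) > 512:
            errors.append(f"metadata value for {key[:16]!r} exceeds 512 characters")
    return errors


print(validate_request({"model": "qwen3-32b-tool", "input": "Hi", "temperature": 0.7}))
print(validate_request({"input": "Hi", "temperature": 3.0}))
```

The first call prints an empty list; the second reports the missing model and the out-of-range temperature.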
