Streaming

In this article:

Basic Streaming
SSE Event Types
Errors During Streaming
Streaming with Conversation History
Retrieving a Streamed Response

By default, the API waits for the full response before returning it. With streaming enabled (stream: true in the request body), tokens are delivered in real time via Server-Sent Events (SSE) as the model generates them.

Streamed responses are stored server-side just like non-streamed ones; you can retrieve or continue them later using previous_response_id.

Basic Streaming

The examples below appear in this order: curl, Python (OpenAI SDK), Python (PydanticAI), Python (LangGraph).

curl -N -X POST $BASE_URL/v1/responses \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $AA_TOKEN" \
  -d '{
    "model": "qwen3-32b-tool",
    "input": "Write a short poem about the ocean.",
    "instructions": "You are a creative writer. Keep responses short.",
    "stream": true
  }'

The response is a stream of SSE "data:" lines, each carrying one JSON event and separated by a blank line:

data: {"type":"response.created","response":{"id":"resp_abc123",...}}

data: {"type":"response.output_item.added","item":{"type":"message",...}}

data: {"type":"response.output_text.delta","delta":"The "}

data: {"type":"response.output_text.delta","delta":"ocean "}

data: {"type":"response.output_text.delta","delta":"waves..."}

data: {"type":"response.completed","response":{...}}

data: [DONE]
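For clients that read the raw stream instead of using an SDK, the data: lines above can be decoded with standard-library Python. This is a minimal sketch, assuming each event arrives on a single data: line as in the sample output; iter_sse_events is a hypothetical helper name, not part of the API.

```python
import json
from typing import Iterable, Iterator


def iter_sse_events(lines: Iterable[str]) -> Iterator[dict]:
    """Yield decoded JSON events from SSE 'data:' lines, stopping at [DONE]."""
    for raw in lines:
        line = raw.strip()
        if not line.startswith("data:"):
            continue  # skip blank separators and any non-data fields
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            return  # terminal signal: nothing after this is read
        yield json.loads(payload)


# Sample input modelled on the output shown above.
sample = [
    'data: {"type":"response.output_text.delta","delta":"The "}',
    "",
    'data: {"type":"response.output_text.delta","delta":"ocean "}',
    "",
    "data: [DONE]",
    'data: {"type":"ignored"}',
]

text = "".join(
    event["delta"]
    for event in iter_sse_events(sample)
    if event["type"] == "response.output_text.delta"
)
print(text)
```

In a real client the lines would come from the HTTP response body, read line by line, rather than from a list.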

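import os
from openai import OpenAI

# Assumes BASE_URL and AA_TOKEN are exported in the environment, as in the curl example.
client = OpenAI(base_url=f"{os.environ['BASE_URL']}/v1", api_key=os.environ["AA_TOKEN"])

stream = client.responses.create(
    model="qwen3-32b-tool",
    input="Write a short poem about the ocean.",
    instructions="You are a creative writer. Keep responses short.",
    stream=True,
)

response_id = None
for event in stream:
    if event.type == "response.output_text.delta":
        # Print each text chunk as soon as it arrives.
        print(event.delta, end="", flush=True)
    elif event.type == "response.completed":
        # Keep the id so a later turn can pass it as previous_response_id.
        response_id = event.response.id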

PydanticAI handles streaming internally; agent.run() returns the complete result. You don’t need to manage SSE events yourself:

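from pydantic_ai import Agent
from pydantic_ai.models.openai import OpenAIResponsesModel

# `provider` is assumed to be an OpenAI-compatible provider already configured
# with your base URL and token.
agent = Agent(
    model=OpenAIResponsesModel("qwen3-32b-tool", provider=provider),
    system_prompt="You are a creative writer. Keep responses short.",
)

# Run this inside an async function; in synchronous code, use agent.run_sync(...) instead.
result = await agent.run("Write a short poem about the ocean.")
print(result.output)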

PydanticAI uses streaming under the hood when communicating with the Responses API, but exposes a single awaitable call that returns the finished result. For fine-grained SSE control, use the OpenAI SDK directly.

Stream tokens directly from the underlying ChatOpenAI and capture the final response id from the last chunk. Use this id to chain the next turn (streamed or not).

from langchain_core.messages import HumanMessage, SystemMessage

messages = [
    SystemMessage("You are a creative writer. Keep responses short."),
    HumanMessage("Write a short poem about the ocean."),
]

# `llm` is assumed to be a ChatOpenAI instance already configured for the Responses API.
response_id = None
for chunk in llm.stream(messages):
    print(chunk.text, end="", flush=True)
    # The final chunk carries the completed response metadata.
    if chunk.response_metadata.get("id"):
        response_id = chunk.response_metadata["id"]

print()  # newline after the stream
# Pass `response_id` as `previous_response_id` on the next call to chain.

To stream from inside a multi-node graph instead of the LLM directly, use graph.stream(state, stream_mode="messages"); it yields (AIMessageChunk, metadata) tuples, where the metadata identifies the node that emitted each chunk.

SSE Event Types

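Core events (independent of model capabilities and tool usage; the terminal events appear only when the corresponding outcome occurs):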

Event Type	Description
`response.created`	Stream started. Contains the initial response object with `id`.
`response.in_progress`	Response processing has begun.
`response.output_item.added`	A new output item (message, reasoning, tool call) was added.
`response.content_part.added`	A new content part was added to a message.
`response.output_text.delta`	A text chunk. The `delta` field contains the incremental text.
`response.completed`	Stream finished successfully. Contains the full response object.
`response.incomplete`	Terminal event. The response ended early but not with an error, e.g. it hit the token limit or has outstanding client-side tool calls. Contains the response object.
`response.failed`	Terminal event. The backend marked the response as `failed`. Contains the response object.
`error`	Terminal error. Emitted if the backend fails after streaming has started; carries `code` and `message` (see Errors During Streaming). Followed by `[DONE]`.
`[DONE]`	Terminal signal: the stream is closed.
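The core events above can be folded into a single summary of the stream. The sketch below is illustrative only: it works on already-decoded event dictionaries, assumes the fields shown in the sample output (response.id on response.created, delta on text chunks), and uses the hypothetical name summarize_stream.

```python
from typing import Iterable, Optional

# Event types the table above describes as ending the response.
TERMINAL = {"response.completed", "response.incomplete", "response.failed", "error"}


def summarize_stream(events: Iterable[dict]) -> dict:
    """Fold decoded events into the response id, the full text, and the outcome."""
    response_id: Optional[str] = None
    parts = []
    outcome: Optional[str] = None
    for event in events:
        kind = event["type"]
        if kind == "response.created":
            response_id = event["response"]["id"]
        elif kind == "response.output_text.delta":
            parts.append(event["delta"])
        elif kind in TERMINAL:
            outcome = kind
            break  # nothing meaningful follows a terminal event
        # Other event types (in_progress, output_item.added, ...) are ignored here.
    return {"id": response_id, "text": "".join(parts), "outcome": outcome}


events = [
    {"type": "response.created", "response": {"id": "resp_abc123"}},
    {"type": "response.in_progress"},
    {"type": "response.output_text.delta", "delta": "The "},
    {"type": "response.output_text.delta", "delta": "ocean"},
    {"type": "response.completed", "response": {"id": "resp_abc123"}},
]
summary = summarize_stream(events)
print(summary)
```

An outcome of None means the stream closed without any terminal event, which a client may want to treat as a failure.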

Additional events (present depending on model capabilities and tool usage):

Event Type	Description
`response.reasoning_text.delta`	A reasoning/chain-of-thought chunk (if the model supports it).
`response.function_call_arguments.delta`	Incremental function call arguments (for function tool calling).
`response.mcp_call_arguments.delta`	Incremental MCP tool call arguments.
`response.mcp_call.completed`	An MCP tool call finished executing.
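When these additional events are present, a client usually wants to keep reasoning text, answer text, and tool-call arguments apart. The following is a rough sketch under stated assumptions: the argument-delta events are assumed to carry an item_id field identifying the tool call they belong to (as in the OpenAI Responses API; this page does not confirm it), and route_deltas is a hypothetical name.

```python
import json
from collections import defaultdict
from typing import Iterable

ARGUMENT_DELTAS = (
    "response.function_call_arguments.delta",
    "response.mcp_call_arguments.delta",
)


def route_deltas(events: Iterable[dict]) -> dict:
    """Separate answer text, reasoning text, and per-call argument fragments."""
    answer, reasoning = [], []
    tool_args = defaultdict(list)
    for event in events:
        kind = event["type"]
        if kind == "response.output_text.delta":
            answer.append(event["delta"])
        elif kind == "response.reasoning_text.delta":
            reasoning.append(event["delta"])
        elif kind in ARGUMENT_DELTAS:
            # ASSUMPTION: `item_id` says which tool call this fragment extends.
            tool_args[event["item_id"]].append(event["delta"])
    return {
        "answer": "".join(answer),
        "reasoning": "".join(reasoning),
        "tool_args": {call: "".join(chunks) for call, chunks in tool_args.items()},
    }


# Hypothetical events; the ids and arguments are made up for illustration.
events = [
    {"type": "response.reasoning_text.delta", "delta": "Need the weather. "},
    {"type": "response.function_call_arguments.delta", "item_id": "fc_1", "delta": '{"city": '},
    {"type": "response.function_call_arguments.delta", "item_id": "fc_1", "delta": '"Oslo"}'},
    {"type": "response.output_text.delta", "delta": "Checking."},
]
routed = route_deltas(events)
arguments = json.loads(routed["tool_args"]["fc_1"])
print(arguments)
```

Argument fragments are only valid JSON once the call is complete, so parsing is deferred until all deltas for that call have been joined.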

Errors During Streaming

A streaming request returns HTTP 200 as soon as the first bytes are sent, so failures that happen mid-stream (e.g. the model’s context length is exceeded, or the inference backend becomes unavailable) can no longer be signalled with an HTTP status code. Instead, the API forwards the failure in-band as a terminal error event, followed by [DONE]:

data: {"type": "error", "sequence_number": 1, "code": "inference_error", "message": "This model's maximum context length is 32768 tokens...", "param": null}

data: [DONE]

Field	Description
`type`	Always `"error"`.
`sequence_number`	Position of the event in the stream.
`code`	An internal error-code string: `inference_error` for backend/inference failures, `server_error` for unexpected errors.
`message`	Human-readable failure reason (forwarded from the inference backend when available).
`param`	Reserved for the offending parameter; currently always `null`.

A stream may therefore terminate with an error event instead of response.completed. Clients should handle this case explicitly rather than assuming every stream ends with a completed response:

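response_id = None
for event in stream:
    if event.type == "response.output_text.delta":
        print(event.delta, end="", flush=True)
    elif event.type == "response.completed":
        response_id = event.response.id
    elif event.type in ("response.failed", "response.incomplete"):
        # Terminal, but not a success; the response object is still attached.
        response_id = event.response.id
    elif event.type == "error":
        # In-band failure: the HTTP status was already sent as 200.
        raise RuntimeError(f"stream failed ({event.code}): {event.message}")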

Errors raised before streaming begins (e.g. an invalid request body) are returned as a normal HTTP error response with a non-200 status, not as an error event.
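For code that works on decoded event dictionaries rather than SDK objects, the same handling can be wrapped in a dedicated exception so callers can branch on the error code. This is a sketch, not the API's own client: StreamError, final_response_id, and the local code no_terminal_event are hypothetical names introduced here.

```python
from typing import Iterable


class StreamError(RuntimeError):
    """Raised when a stream ends with anything other than response.completed."""

    def __init__(self, code, message):
        super().__init__(f"stream failed ({code}): {message}")
        self.code = code
        self.message = message


def final_response_id(events: Iterable[dict]) -> str:
    """Return the id of the completed response, or raise StreamError."""
    for event in events:
        kind = event["type"]
        if kind == "response.completed":
            return event["response"]["id"]
        if kind in ("response.failed", "response.incomplete"):
            # Reuse the event type as the code; these events carry no `code` field.
            raise StreamError(kind, "response ended without completing")
        if kind == "error":
            raise StreamError(event.get("code"), event.get("message"))
    # Local, made-up code for a stream that closed with no terminal event at all.
    raise StreamError("no_terminal_event", "stream closed without a terminal event")


error_stream = [
    {"type": "response.created", "response": {"id": "resp_abc123"}},
    {
        "type": "error",
        "sequence_number": 1,
        "code": "inference_error",
        "message": "This model's maximum context length is 32768 tokens...",
        "param": None,
    },
]

try:
    final_response_id(error_stream)
except StreamError as exc:
    print(exc.code)
```

Keeping the code on the exception lets a caller retry on server_error while surfacing inference_error to the user, if that distinction suits the application.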

Streaming with Conversation History

Streaming can be combined with previous_response_id in both directions: a streamed request can continue an earlier response, and a streamed response is stored so that later turns (streamed or not) can chain from it.

The examples below appear in this order: curl, Python (OpenAI SDK).

# Turn 1 (non-streaming)
curl -X POST $BASE_URL/v1/responses \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $AA_TOKEN" \
  -d '{
    "model": "qwen3-32b-tool",
    "input": "My favorite color is blue.",
    "instructions": "You are a helpful assistant."
  }'
# → {"id": "resp_001", ...}

# Turn 2 (streaming), chains from Turn 1
curl -N -X POST $BASE_URL/v1/responses \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $AA_TOKEN" \
  -d '{
    "model": "qwen3-32b-tool",
    "input": "What is my favorite color?",
    "previous_response_id": "resp_001",
    "stream": true
  }'
# Streams: "Your favorite color is blue."

# Turn 1 (non-streaming)
response1 = client.responses.create(
    model="qwen3-32b-tool",
    input="My favorite color is blue.",
)

# Turn 2 (streaming), chains from Turn 1
stream = client.responses.create(
    model="qwen3-32b-tool",
    input="What is my favorite color?",
    previous_response_id=response1.id,
    stream=True,
)

response_id_2 = None
for event in stream:
    if event.type == "response.output_text.delta":
        print(event.delta, end="", flush=True)
    elif event.type == "response.completed":
        response_id_2 = event.response.id

# Turn 3 can chain from the streamed response
response3 = client.responses.create(
    model="qwen3-32b-tool",
    input="Can you remind me what we discussed about colors?",
    previous_response_id=response_id_2,
)
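The per-turn loop above can be factored into one helper that streams a turn and returns both the text and the id to chain from. The sketch below is one possible shape, not a prescribed pattern: stream_turn is a hypothetical name, and FakeResponses is an offline stand-in for client.responses so the example runs without a server.

```python
from types import SimpleNamespace


def stream_turn(client, model, user_input, previous_response_id=None):
    """Stream one turn; return (full_text, response_id) for chaining."""
    kwargs = {"model": model, "input": user_input, "stream": True}
    if previous_response_id is not None:
        kwargs["previous_response_id"] = previous_response_id
    parts, response_id = [], None
    for event in client.responses.create(**kwargs):
        if event.type == "response.output_text.delta":
            parts.append(event.delta)
        elif event.type == "response.completed":
            response_id = event.response.id
    return "".join(parts), response_id


class FakeResponses:
    """Stand-in for client.responses: records calls and returns canned events."""

    def __init__(self):
        self.calls = []

    def create(self, **kwargs):
        self.calls.append(kwargs)
        new_id = f"resp_{len(self.calls):03d}"
        return [
            SimpleNamespace(type="response.output_text.delta", delta="ok"),
            SimpleNamespace(type="response.completed", response=SimpleNamespace(id=new_id)),
        ]


fake = SimpleNamespace(responses=FakeResponses())
_, first_id = stream_turn(fake, "qwen3-32b-tool", "My favorite color is blue.")
text, second_id = stream_turn(fake, "qwen3-32b-tool", "What is my favorite color?", first_id)
print(first_id, second_id)
```

With a real OpenAI SDK client in place of fake, the same helper would be called once per turn, passing the previous turn's id each time.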

Retrieving a Streamed Response

Streamed responses are persisted just like non-streamed ones. You can retrieve them as JSON or replay them as SSE:

# Retrieve as JSON
curl $BASE_URL/v1/responses/resp_abc123 \
  -H "Authorization: Bearer $AA_TOKEN"

# Replay as SSE stream
curl -N "$BASE_URL/v1/responses/resp_abc123?stream=true" \
  -H "Authorization: Bearer $AA_TOKEN"
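The same two requests can be prepared from Python with only the standard library. This sketch builds the requests without sending them; build_retrieve_request is a hypothetical helper, and the base URL and token in the example are placeholders.

```python
import urllib.request


def build_retrieve_request(base_url, token, response_id, replay=False):
    """Build a GET request for a stored response, as JSON or as an SSE replay."""
    url = f"{base_url.rstrip('/')}/v1/responses/{response_id}"
    if replay:
        url += "?stream=true"  # ask the server to replay the stored events as SSE
    return urllib.request.Request(url, headers={"Authorization": f"Bearer {token}"})


# Placeholder values; substitute your own base URL and token.
as_json = build_retrieve_request("https://api.example.invalid/", "TOKEN", "resp_abc123")
as_sse = build_retrieve_request("https://api.example.invalid/", "TOKEN", "resp_abc123", replay=True)
print(as_json.full_url)
print(as_sse.full_url)
```

Passing either request to urllib.request.urlopen would send it; for the replay variant the body can then be read line by line as an SSE stream.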