Streaming Requests & Realtime API in vLLM
Large language model inference has traditionally operated on a simple premise: the user submits a complete prompt (request), the model processes it, and returns a response (either streamed or all at once). This paradigm works well for text-based chatbots and batch processing workloads, but it falls short for realtime applications, such as streaming audio or video.
vLLM has recently added support for streamable inputs in its engine, as well as a Realtime WebSocket API built on top of it that exposes a new /v1/realtime endpoint in the server.
In this post, we motivate the need for realtime inference and introduce the two new features in vLLM that unlock these capabilities: streaming input support and the Realtime WebSocket API.
Why Realtime is Needed
The Traditional Batch Paradigm in vLLM
Traditional LLM inference assumes that complete prompts are available upfront. The user submits their full request, e.g. via a ChatCompletionRequest, waits for the model to process it entirely, and then receives the complete response. While vLLM has long supported output streaming (emitting tokens as they are generated), the input side was always fixed: you had to provide the entire request before inference could begin.
This approach is sufficient for most applications. Text-based chatbots, document summarization, and code generation all fit naturally into this model. But a growing class of applications cannot wait for complete input before processing begins.
The Importance of Streaming
Consider a voice assistant used to control a computer or phone. Instead of a keyboard, mouse, or touchpad, all actions are driven by voice. Speech is captured by a microphone and sent as a continuous audio stream to an LLM that acts as the voice assistant model. The LLM needs to continuously process the audio stream and generate actions in realtime. For such applications, latency matters: a user does not want to wait more than a second to open an application, type text into a search bar, etc. And for the assistant to feel natural and human-like, it must be able to listen and speak at the same time, i.e., the LLM needs to process the audio stream and generate actions simultaneously.
A natural question is whether streaming behavior can be approximated using non-streaming LLMs by processing the input in chunks. In principle, audio can be buffered into segments, each segment processed independently, and the resulting outputs concatenated. In practice, this approach introduces several limitations. Achieving sub-second latency requires highly performant chunk detection, i.e., accurately determining when to segment the audio stream such that no relevant information is lost. Poor segmentation can introduce additional latency or degrade model performance by fragmenting meaningful temporal context. Chunk-based processing also precludes true bidirectional interaction: each chunk must be fully processed before a response can be generated, preventing listening and speaking from occurring concurrently. This results in a turn-based interaction model rather than the continuous, overlapping communication characteristic of human conversation.
This problem appears across many domains:
- Voice assistants require sub-second response times to feel natural, as described above
- Live transcription services need to display text as speech is recognized
- Robotics and embodied AI need to process continuous sensor streams (cameras, microphones, LIDAR) and generate control actions with minimal delay to interact safely with the physical world
For these applications, the traditional batch paradigm introduces unacceptable delays. We need infrastructure that can process input incrementally and begin generating output before all input has arrived.
Architectural Requirements for Streaming
Not all models can support true streaming inference. Two key requirements must be met: the right attention pattern and training for incremental processing.
Attention Patterns
The attention mechanism determines whether a model can process input incrementally or must wait for the entire sequence.
- Causal attention (uni-directional mask) restricts each position $t$ to attend only to tokens at positions $\leq t$. Because future tokens are excluded, the model's output token at time $t$ is final once token $t$ arrives. This makes true streaming possible: each new token can be processed immediately, and earlier outputs never need to be revised.
- Bidirectional attention (full mask) allows every position to attend to both past and future tokens. As a result, the model's output token at position $t$ is conditioned on tokens that may not have arrived yet. Until the full input sequence is known, the model cannot compute a stable output for any position, because future tokens could change how earlier tokens are interpreted.
For this reason, bidirectional attention inherently requires access to the complete input sequence before producing outputs, which makes it incompatible with streaming or online processing.
For long-running or infinite streaming, standard causal attention is not enough. If each token attends to the entire past, computation and memory grow without bound, which is impractical. In practice, past context must be truncated. A common architectural solution is sliding-window attention, where each token attends only to a fixed-size window of recent tokens, keeping compute and memory bounded while supporting streaming. Hence, causal attention with a sliding window is often the architecture of choice for modern streaming models.
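To make the difference concrete, here is a minimal NumPy sketch of the two mask shapes (an illustration only, not vLLM's attention implementation):

```python
import numpy as np

def causal_mask(seq_len: int) -> np.ndarray:
    """Position t may attend to all positions s <= t."""
    return np.tril(np.ones((seq_len, seq_len), dtype=bool))

def sliding_window_mask(seq_len: int, window: int) -> np.ndarray:
    """Position t may attend only to the last `window` positions (including itself)."""
    idx = np.arange(seq_len)
    # allowed iff s <= t and t - s < window
    return (idx[None, :] <= idx[:, None]) & (idx[:, None] - idx[None, :] < window)

print(causal_mask(4).astype(int))
# [[1 0 0 0]
#  [1 1 0 0]
#  [1 1 1 0]
#  [1 1 1 1]]
print(sliding_window_mask(4, window=2).astype(int))
# [[1 0 0 0]
#  [1 1 0 0]
#  [0 1 1 0]
#  [0 0 1 1]]
```

With the sliding window, each row only references the most recent positions, so compute and KV-cache size stay fixed no matter how long the stream runs.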
Training for Streaming Inputs
However, having a fully streamable architecture is not sufficient on its own: the model must also be trained to support true streaming input.
Let $X = (x_0, x_1, \ldots, x_T)$ denote the input sequence and $Y = (y_0, y_1, \ldots, y_{T'})$ the output sequence. In streaming applications, the model should generate the output $y_t$ corresponding to input $x_t$ at time step $t$, with as little latency as possible. Concretely, one can think of $y_t$ as the transcription of an audio frame $x_t$ that is streamed into the model at time $t$.
The standard next-token training objective typically conditions the distribution of the next token on the entire input sequence:
\[P(y_i \mid y_{i-1}, \ldots, y_0, x_T, x_{T-1}, \ldots, x_0).\]

This formulation is unsuitable for streaming, because generating $y_i$ requires processing the full input sequence $X$, which is not available in real time.
Instead, a streaming model must be able to predict $y_i$ using only past inputs and, optionally, a small amount of future context:
\[P(y_i \mid y_{i-1}, \ldots, y_0, x_{i+\delta}, \ldots, x_i, \ldots, x_0),\]

where $\delta$ is a lookahead parameter that should be as small as possible. In theory, $\delta$ could be set to zero; in practice, a small delay is usually necessary to achieve reasonable performance.
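For example, with a lookahead of $\delta = 1$, the model predicts

\[P(y_3 \mid y_2, y_1, y_0, x_4, x_3, \ldots, x_0),\]

i.e., the transcription of frame 3 may peek exactly one frame ahead, adding a single frame of latency to the stream.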
As a result, training a streaming model requires:
- i) aligning input and output sequences such that $T' = T$, and each $y_i$ is the correct output corresponding to $x_i$;
- ii) using an architecture that can process a new input $x_{i+1}$ incrementally, given that previous inputs $x_i, \ldots, x_0$ have already been processed.
An intuitive architecture, as pioneered by Delayed Streams Modeling and picked up by Voxtral-Realtime, sum-pools input embeddings (e.g., speech embeddings) and output embeddings (e.g., text embeddings) into a single sequence of embeddings. The model then predicts
\[P(y_i \mid y'_{i-1}, \ldots, y'_0),\]

where

\[y'_k = y_k + x_{k+\delta}.\]

This distinction is important for deployment: one cannot simply take an arbitrary causal model and expect it to perform well in a streaming setting. To be fully streamable, the model must be explicitly trained with the above alignment and architectural constraints, ensuring that both conditions i) and ii) are satisfied.
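A minimal sketch of this sum-pooling, under the simplifying assumption that input and output embeddings are plain NumPy arrays (real models such as Voxtral-Realtime use learned speech and text embeddings):

```python
import numpy as np

def sum_pool_streams(x: np.ndarray, y: np.ndarray, delta: int) -> np.ndarray:
    """Combine aligned input and output embeddings: y'_k = y_k + x_{k+delta}.

    x: input (e.g. speech) embeddings, shape [T, d]
    y: output (e.g. text) embeddings, shape [T, d]
    delta: lookahead, i.e. how far ahead of y_k the input stream may run
    """
    T, d = y.shape
    y_prime = np.empty((T - delta, d))
    for k in range(T - delta):
        y_prime[k] = y[k] + x[k + delta]
    # The model then autoregressively predicts y_{k+1} from y'_k, ..., y'_0.
    return y_prime
```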
Why Model Architecture Matters for Serving
vLLM can serve any model, but true streaming requires architecturally causal models. Models like Voxtral are designed from the ground up for streaming, using causal attention mechanisms that support incremental processing.
Equally important, the serving infrastructure must support incremental input. Even with a streaming-capable model, if the server requires the complete prompt before beginning inference, you lose the latency benefits. This is why vLLM now supports streaming input alongside its existing output streaming capabilities.
Further Reading
- Transformer Transducer is one of the most well-known and successful modeling approaches for training streamable speech recognition systems.
- Streaming Sequence-to-Sequence Learning with Delayed Streams Modeling by the Kyutai folks is a great read for diving deeper into the streaming architectures explained above.
- Streaming Simultaneous Speech Translation with Augmented Memory Transformer covers streaming speech translation. Translation is not as “monotonic” as speech recognition, which makes performant streaming more difficult.
- Voxtral-Realtime (TODO) add link
Streaming Input Support in vLLM
With PR #28973, vLLM now supports streaming input for inference. This enables the incremental processing described above, where input arrives over time and output is generated continuously.
The StreamingInput Interface
The core abstraction is the StreamingInput dataclass:
```python
from dataclasses import dataclass

from vllm.inputs import PromptType
from vllm.sampling_params import SamplingParams


@dataclass
class StreamingInput:
    prompt: PromptType
    sampling_params: SamplingParams | None = None
```
Rather than passing a fixed prompt to AsyncLLM.generate(), you can now pass an AsyncGenerator that yields StreamingInput objects over time. Each StreamingInput contains the next input chunk to be appended to a cumulative prompt. Here is an example of how it can be used:
```python
import asyncio

from vllm.inputs.data import StreamingInput
from vllm.v1.engine.async_llm import AsyncLLM
from vllm.sampling_params import SamplingParams


async def streaming_input_example():
    async_llm = AsyncLLM.from_engine_args(...)

    # Input queue can consume inputs in a separate async task
    input_queue = asyncio.Queue[list[int]]()

    async def input_generator():
        # Loop until an empty list is encountered => input finished
        while new_tokens := await input_queue.get():
            yield StreamingInput(prompt=new_tokens)

    output_generator = async_llm.generate(
        prompt=input_generator(),
        sampling_params=SamplingParams(temperature=0.0, max_tokens=1),
    )

    # Consume outputs
    async for output in output_generator:
        ...  # process each output as it arrives


asyncio.run(streaming_input_example())
```
You can wait until the output corresponding to the last input has completed before sending the next input, but this is not required (input chunks are queued internally). Termination of the input stream is indicated by returning from the async input generator or closing it via its aclose() method. The returned output generator won't complete until all received inputs have been processed and the input generator has completed.
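For completeness, a hypothetical producer task feeding the queue from the example above could look like this (the empty list acts as the end-of-stream sentinel, matching the generator's loop condition):

```python
async def produce_inputs(input_queue: asyncio.Queue[list[int]]):
    # Push token chunks as they become available, e.g. from an audio encoder.
    for chunk in ([1, 2, 3], [4, 5], [6, 7]):
        await input_queue.put(chunk)
        await asyncio.sleep(0.1)  # simulate realtime arrival
    # An empty list signals that the input stream is finished.
    await input_queue.put([])
```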
How It Works
Conceptually, vLLM handles streaming input by extending a single request's cumulative prompt with each new chunk. As new chunks arrive, the engine:
- Extends the prompt with the new content
- Reuses cached KV values for the prefix
- Generates output tokens based on the current cumulative prompt
- Optionally discards output when new input arrives
This design means that output tokens generated between input chunks may be revised as more context becomes available. The final output reflects the complete input.
More concretely, vLLM implements streaming input through a sticky session mechanism. The first input chunk creates an anchor request that persists throughout the session. Subsequent chunks with the same internal request ID are queued and processed in order.
The Anchor Request Pattern
```
┌─────────────────────────────────────────────────────────────────────────────┐
│ STREAMING SESSION │
├─────────────────────────────────────────────────────────────────────────────┤
│ │
│ User's AsyncGenerator Scheduler │
│ ═══════════════════ ═════════ │
│ │
│ ┌──────────────┐ │
│ │ Chunk 1 │ ──────────────► Add ANCHOR REQUEST │
│ │ [A, B, C] │ ┌────────────────────────────────┐ │
│ └──────────────┘ │ Request (id="session_1") │ │
│ │ ├── resumable: true │ │
│ │ ├── streaming_queue: deque() │ │
│ │ ├── status: RUNNING │ │
│ │ └── prompt_token_ids: [A,B,C] │ │
│ └────────────────────────────────┘ │
│ │ │
│ ▼ │
│ ┌──────────────┐ ┌────────────────┐ │
│ │ Chunk 2 │ │ ENGINE │ Generating... │
│ │ [D, E] │ ─────┐ │ Processing │ ──► Output: [X, Y] │
│ └──────────────┘ │ └────────────────┘ │
│ │ │
│ ▼ Anchor busy? Queue it! │
│ ┌──────────────┐ │ ┌────────────────────────────────┐ │
│ │ Chunk 3 │ └────────► │ streaming_queue: │ │
│ │ [F, G] │ ─────────────► │ ┌───────┐ ┌───────┐ │ │
│ └──────────────┘ │ │[D, E] │→│[F, G] │→ ... │ │
│ │ └───────┘ └───────┘ │ │
│ └────────────────────────────────┘ │
│ │
├─────────────────────────────────────────────────────────────────────────────┤
│ WHEN ANCHOR FINISHES CURRENT CHUNK │
├─────────────────────────────────────────────────────────────────────────────┤
│ │
│ Engine signals: chunk complete (stopped = True) │
│ │ │
│ ▼ │
│ ┌────────────────────────────────────────────────────────────────┐ │
│ │ _handle_stopped_request() pops first item from queue │ │
│ │ │ │
│ │ streaming_queue: [[D,E], [F,G]] ──► [[F,G]] │ │
│ │ ▲ │ │
│ │ │ │ │
│ │ pop! │ │
│ └──────────────────────┬─────────────────────────────────────────┘ │
│ │ │
│ ▼ │
│ ┌────────────────────────────────────────────────────────────────┐ │
│ │ _update_request_as_session(anchor, update=[D, E]) │ │
│ │ │ │
│ │ BEFORE: AFTER: │ │
│ │ ┌───────────────────────┐ ┌───────────────────────────┐ │ │
│ │ │ prompt_token_ids: │ │ prompt_token_ids: │ │ │
│ │ │ [A, B, C] │ │ [A, B, C, X, D, E] │ │ │
│ │ │ _output_token_ids: │ ──► │ _output_token_ids: │ │ │
│ │ │ [X, Y] │ │ [] │ │ │
│ │ │ _all_token_ids: │ │ _all_token_ids: │ │ │
│ │ │ [A, B, C, X, Y] │ │ [A, B, C, X, D, E] │ │ │
│ │ │ num_computed_tokens: 4│ │ num_computed_tokens: 4 │ │ │
│ │ │ status: RUNNING │ │ status: WAITING │ │ │
│ │ └───────────────────────┘ └───────────────────────────┘ │ │
│ │ │ │
│ │ Note: Y is DISCARDED (last sampled token, not yet computed) │ │
│ │ Only X is kept (num_computed_tokens = 4, so [A,B,C,X]) │ │
│ └────────────────────────────────────────────────────────────────┘ │
│ │ │
│ ▼ │
│ ┌────────────────────────────────────────────────────────────────┐ │
│ │ Anchor returns to waiting queue → scheduled again → ENGINE │ │
│ └────────────────────────────────────────────────────────────────┘ │
│ │
└─────────────────────────────────────────────────────────────────────────────┘
```
Why is the last token (Y) discarded?
The last sampled token hasn’t been processed as input to the model yet—it was just output from the most recent forward pass. Since the KV cache only contains entries for tokens that have been processed, this token has no KV cache entry. Discarding it is essentially “free”: we’re not invalidating any cached state, and it would need to be recomputed anyway if we kept it.
Caveat: some models emit special stop tokens that are required for the model to properly continue generation. In such cases, the scheduling logic needs to accommodate one extra token to recompute the stop token before processing the new input chunk.
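In pseudocode, the session update shown in the diagram boils down to roughly the following (a simplified sketch, not vLLM's actual scheduler code):

```python
def update_anchor_with_chunk(
    all_token_ids: list[int],  # prompt + outputs so far, e.g. [A, B, C, X, Y]
    num_computed_tokens: int,  # tokens that have KV-cache entries, e.g. 4
    chunk: list[int],          # next streaming input chunk, e.g. [D, E]
) -> list[int]:
    # The last sampled token (Y) has no KV-cache entry yet, so dropping it
    # invalidates nothing; only computed tokens are kept.
    kept = all_token_ids[:num_computed_tokens]  # [A, B, C, X]
    # The new cumulative prompt: [A, B, C, X, D, E]
    return kept + chunk
```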
Example Flow
Consider a voice assistant receiving speech incrementally:
```
Input chunks: [A1, B1, C1], [A2, B2], [A3, B3]
1. First chunk [A1, B1, C1] arrives
-> Model generates [D1]
2. Second chunk [A2, B2] arrives
-> Cumulative prompt: [A1, B1, C1, A2, B2] (D1 discarded)
-> Model generates [C2, D2, E2]
3. Third chunk [A3, B3] arrives
-> Cumulative prompt: [A1, B1, C1, A2, B2, C2, D2, A3, B3] (E2 discarded)
-> Model generates [C3, D3]
Output stream: D1, C2, D2, E2, C3, D3
```
The key insight is that early output tokens provide immediate feedback to the user, even though they may be revised as more context arrives. This dramatically reduces perceived latency.
Realtime API with WebSockets
While streaming input support provides the core capability, production applications need a convenient API for real-time communication. PR #33187 introduces a WebSocket-based Realtime API inspired by OpenAI’s Realtime API.
Architecture
The Realtime API provides a WebSocket endpoint that enables bidirectional streaming between clients and the vLLM server. Clients send audio data, and the server responds with transcribed text and model outputs.
The architecture consists of:
- WebSocket Client: Captures audio from microphone, sends chunks to server
- Realtime Handler: Receives WebSocket messages, converts to StreamingInput
- AsyncLLM: Processes streaming input, generates output
- Response Stream: Sends generated tokens back through WebSocket
Server Setup
Starting a vLLM server with Realtime API support:
```bash
vllm serve mistralai/Voxtral-Mini-4B-Realtime-2602 --enforce-eager
```
The server exposes a WebSocket endpoint at ws://localhost:8000/v1/realtime.
Client Example
Here’s a basic client that streams an audio file and receives transcription:
```python
import asyncio
import base64
import json

import librosa
import numpy as np
import websockets

# Model name must match the model being served (see the `vllm serve` command above)
MODEL = "mistralai/Voxtral-Mini-4B-Realtime-2602"


def load_audio_as_pcm16(audio_path: str) -> bytes:
    """Load audio file and convert to PCM16 @ 16kHz."""
    audio, _ = librosa.load(audio_path, sr=16000, mono=True)
    return (audio * 32767).astype(np.int16).tobytes()


async def stream_audio_file(audio_path: str, server_url: str = "ws://localhost:8000/v1/realtime"):
    async with websockets.connect(server_url) as ws:
        # Wait for the initial session event from the server
        response = json.loads(await ws.recv())

        # Load and convert audio to PCM16
        pcm_audio = load_audio_as_pcm16(audio_path)

        # Validate model
        await ws.send(json.dumps({"type": "session.update", "model": MODEL}))

        # Signal start of audio stream
        await ws.send(json.dumps({"type": "input_audio_buffer.commit"}))

        # Stream audio in 4KB chunks
        for i in range(0, len(pcm_audio), 4096):
            chunk = pcm_audio[i:i + 4096]
            await ws.send(json.dumps({
                "type": "input_audio_buffer.append",
                "audio": base64.b64encode(chunk).decode()
            }))

        # Signal end of audio stream
        await ws.send(json.dumps({"type": "input_audio_buffer.commit", "final": True}))

        # Receive transcription
        async for message in ws:
            data = json.loads(message)
            if data["type"] == "transcription.delta":
                print(data["delta"], end="", flush=True)
            elif data["type"] == "transcription.done":
                break


asyncio.run(stream_audio_file("audio.wav"))
```
This example demonstrates the core workflow for realtime audio streaming:
- Load and convert audio: The audio file is loaded and converted to PCM16 format at 16kHz, which is the expected input format for the realtime API
- Establish WebSocket connection: Connect to the server's `/v1/realtime` endpoint and send a `session.update` message to validate the model
- Stream audio in chunks: The audio is sent in 4KB chunks using `input_audio_buffer.append` messages, with `input_audio_buffer.commit` signals to mark the start and end of the stream
- Receive transcription incrementally: The server responds with `transcription.delta` messages containing partial transcriptions, which are printed in real time until `transcription.done` is received
- Note on realtime behavior: While this example sends all audio before listening for transcriptions (for simplicity), the WebSocket protocol enables fully asynchronous communication: audio chunks can be sent and transcriptions received simultaneously. In a production realtime service, transcription would begin immediately as the first audio chunk arrives, with both sending and receiving happening concurrently for true low-latency speech recognition
Message Types
The Realtime API uses a message-based protocol. Key message types include:
Client to Server:
- `session.create`: Initialize a new session
- `input_audio_buffer.append`: Send audio data
- `input_audio_buffer.commit`: Signal end of audio input
- `response.create`: Request model response
Server to Client:
- `session.created`: Session initialization confirmed
- `response.text.delta`: Incremental text output
- `response.audio.delta`: Incremental audio output (for TTS models)
- `response.done`: Response complete
- `error`: Error occurred
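As an illustration, a client receive loop might dispatch on these message types roughly as follows (a sketch; the payload fields shown for the transcription events match the client example above, while the exact fields of other event types may differ):

```python
import json

async def handle_events(ws):
    async for message in ws:
        event = json.loads(message)
        match event["type"]:
            case "session.created":
                print("session ready")
            case "transcription.delta":
                print(event["delta"], end="", flush=True)
            case "transcription.done" | "response.done":
                break
            case "error":
                raise RuntimeError(f"server error: {event}")
```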
Example Scripts
The vLLM repository includes ready-to-use example clients:
- `examples/online_serving/openai_realtime_client.py`: Basic WebSocket client
- `examples/online_serving/openai_realtime_microphone_client.py`: Microphone integration
These examples demonstrate how to capture audio from the system microphone and stream it to vLLM in real time.
Performance Considerations
An advantage of using the dedicated AsyncGenerator-based session interface over sending separate requests is that the KV cache for the session is preserved as-is. This is preferable to relying on vLLM's automatic prefix caching because:
- It ensures that the corresponding cache blocks won't be evicted while waiting for the next input chunk
- Prefix caching works at a block-level granularity (typically 16 tokens), meaning that a small number of existing tokens would otherwise be re-computed for each new input (see the sketch below)
However, this also means that additional care must be taken to avoid holding sessions open unnecessarily, since idle sessions block the corresponding memory from being used by other requests, potentially harming overall capacity/throughput. Currently, vLLM will not preempt "idle" streaming input sessions; this behavior will be improved in a future update.
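To make the block-granularity point concrete, here is a back-of-the-envelope sketch (assuming the typical 16-token block size mentioned above; the exact figure depends on configuration):

```python
BLOCK_SIZE = 16  # typical KV-cache block size

def recomputed_tokens(cached_len: int) -> int:
    """Tokens re-computed per chunk when relying on prefix caching alone."""
    # Only fully-filled blocks can be reused; the partial tail block is lost.
    reusable = (cached_len // BLOCK_SIZE) * BLOCK_SIZE
    return cached_len - reusable

print(recomputed_tokens(100))  # 4: full blocks cover 96 tokens, 4 are re-computed
```

With the session interface, no such re-computation occurs because the request's KV cache is kept intact between chunks.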
Current Limitations
…
Future Directions
We are excited about the potential for streaming input support in vLLM. As more model providers open-source fully streamable model weights that are compatible with our input streaming design, we expect the ecosystem of realtime applications to grow significantly.
Since streaming input is still a novel capability in LLM serving, we anticipate adapting and extending our implementation to support as many different architectures and use cases as possible. This includes exploring tighter integration with various audio and video encoders, optimizing the anchor request pattern for different latency requirements, and expanding support for multi-modal streaming scenarios.
Get Involved
We encourage you to try out vLLM’s input streaming functionality and Realtime API. Your feedback is invaluable in helping us improve these features. Please share your experiences, report issues, or suggest enhancements on the vLLM GitHub repository.
We welcome feedback and contributions as we continue to develop vLLM’s real-time capabilities.
Acknowledgements
Streaming input support and the Realtime API were made possible through collaborative efforts across multiple teams:
Meta: Joshua Deng, Jiatong Zhou, Zhuohan Li, Yu Luo, Jeremy Teboul
Mistral AI: Patrick von Platen, Andy Lo
vLLM Team: Nick Hill, Roger Wang, Cyrus Leung, Nicolò Lucchesi, Woosuk Kwon
We would also like to acknowledge other implementations of streaming input in vLLM: Tao He (Alibaba Qwen), Edward Wibowo (Brown University), Deepti Raghavan (Brown University), and Luis Gaspar Schroeder (UC Berkeley).