Key Takeaways
- Amazon SageMaker AI Async Inference now accepts an optional Body parameter, allowing payloads up to 128,000 bytes to be sent directly in the InvokeEndpointAsync request, eliminating the mandatory S3 upload step.
- The new inline‑payload path reduces latency, simplifies architecture, cuts costs, and provides immediate synchronous validation errors for size or mutual‑exclusivity violations.
- For payloads larger than 128,000 bytes, the existing InputLocation (S3‑based) approach remains the recommended method; mixed workloads can branch on size to use the optimal path for each request.
- No changes are required to existing async endpoints, model containers, or output S3 configurations; the feature is fully backward‑compatible.
- The feature is available in 31 commercial AWS Regions and requires only an updated AWS SDK (Boto3) and the appropriate IAM permission for sagemaker:InvokeEndpointAsync.
Background: How async inference worked before
Amazon SageMaker AI Async Inference has long enabled customers to queue inference requests and process them asynchronously, making it ideal for workloads with large payloads, variable traffic, or tolerance for seconds‑to‑minutes latency. As the article explains, “Until now, the workflow required two steps on every invocation: upload the input payload to an Amazon S3 bucket, then invoke the endpoint, passing the S3 object URI as InputLocation.” This two‑step pattern works well for multi‑megabyte files such as images or audio, but it adds unnecessary overhead for smaller inputs that still benefit from async processing (e.g., JSON prompts or structured data).
What’s new: Inline payload via the Body parameter
The launch introduces a Body parameter to the InvokeEndpointAsync API. When present, the payload is transmitted inline in the HTTP request body, with a hard limit of 128,000 bytes (raw bytes). The documentation notes that “Body and InputLocation are mutually exclusive. The API rejects requests that set both.” Output behavior remains unchanged—results are still written to the configured S3 OutputLocation—and existing async endpoints require no model or container modifications. Validation errors for size or mutual‑exclusivity are returned synchronously, giving developers immediate feedback.
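The two validation rules described above can also be checked on the client before any network call is made. The following is a minimal sketch, not part of the announcement: `build_async_request` and `MAX_INLINE_BYTES` are hypothetical names, and treating "neither Body nor InputLocation set" as invalid is an assumption. The service remains the authority and performs its own synchronous validation.

```python
from typing import Optional

# Inline limit stated in the announcement (raw bytes).
MAX_INLINE_BYTES = 128_000


def build_async_request(
    endpoint_name: str,
    body: Optional[bytes] = None,
    input_location: Optional[str] = None,
    content_type: str = "application/json",
) -> dict:
    """Return keyword arguments for invoke_endpoint_async, or raise ValueError.

    Hypothetical helper: it mirrors the documented rules locally so that
    obviously invalid requests fail before they reach the API.
    """
    # Body and InputLocation are mutually exclusive. Requiring exactly one
    # of them is an assumption made for this sketch.
    if (body is None) == (input_location is None):
        raise ValueError("set exactly one of body or input_location")

    request = {"EndpointName": endpoint_name, "ContentType": content_type}
    if body is not None:
        # The limit applies to the raw byte length, not the character count.
        if len(body) > MAX_INLINE_BYTES:
            raise ValueError(
                f"inline payload is {len(body)} bytes; limit is {MAX_INLINE_BYTES}"
            )
        request["Body"] = body
    else:
        request["InputLocation"] = input_location
    return request
```

The returned dictionary could then be passed as `sagemaker_runtime.invoke_endpoint_async(**request)`.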
Before and after: The customer experience
The practical impact is clearest when comparing code snippets.
Before (S3 upload required):
python
import json
import uuid

import boto3

s3 = boto3.client("s3")
sagemaker_runtime = boto3.client("sagemaker-runtime")

# Step 1: upload the input payload to S3 under a unique key.
payload = json.dumps({"inputs": "your prompt here"}).encode("utf-8")
input_key = f"async-input/{uuid.uuid4()}.json"
s3.put_object(Bucket="my-async-bucket", Key=input_key, Body=payload)
input_location = f"s3://my-async-bucket/{input_key}"

# Step 2: invoke the endpoint, pointing it at the uploaded object.
response = sagemaker_runtime.invoke_endpoint_async(
    EndpointName="my-async-endpoint",
    InputLocation=input_location,
    ContentType="application/json",
)
print(response["OutputLocation"])
After (inline payload):
python
import json

import boto3

sagemaker_runtime = boto3.client("sagemaker-runtime")

# Single step: send the payload inline (up to 128,000 bytes).
payload = json.dumps({"inputs": "your prompt here"}).encode("utf-8")
response = sagemaker_runtime.invoke_endpoint_async(
    EndpointName="my-async-endpoint",
    Body=payload,
    ContentType="application/json",
)
print(response["OutputLocation"])
The “after” version removes the need for an S3 client, UUID‑based key generation, IAM s3:PutObject permissions, and any cleanup logic for stale input objects. As the article succinctly puts it, “No S3 client, no uuid, no input bucket, no IAM grants on the input path, no stale‑object cleanup.”
Customer benefits
Sending the payload inline yields five concrete advantages:
- Reduced latency – One network round‑trip and one S3 PUT are eliminated per request; for fan‑out workloads the savings compound.
- Simpler architecture – No input bucket provisioning, lifecycle policies, cross‑account access patterns, or associated IAM permissions.
- Fewer error paths – The request is a single API call that either enqueues or fails, reducing troubleshooting surface area.
- Lower cost – Eliminates the S3 PUT charge for each input upload.
- Immediate validation feedback – Size and mutual‑exclusivity errors are returned synchronously, allowing faster iteration.
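To illustrate the fan‑out case mentioned in the first benefit, here is a rough sketch that enqueues one inline request per prompt from a thread pool. `fan_out` is a hypothetical helper, the `{"inputs": ...}` payload shape is borrowed from the article's example, and the worker count is an arbitrary assumption. The client is passed in so the function can be exercised with a stub.

```python
import json
from concurrent.futures import ThreadPoolExecutor


def fan_out(client, endpoint_name, prompts, max_workers=8):
    """Enqueue one async request per prompt; return output locations in input order.

    Hypothetical helper: `client` is expected to behave like
    boto3.client("sagemaker-runtime").
    """

    def enqueue(prompt):
        # Each prompt becomes one inline request: a single API call, no S3 PUT.
        payload = json.dumps({"inputs": prompt}).encode("utf-8")
        response = client.invoke_endpoint_async(
            EndpointName=endpoint_name,
            Body=payload,
            ContentType="application/json",
        )
        return response["OutputLocation"]

    # pool.map preserves the order of `prompts` in its results.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(enqueue, prompts))
```

With the S3‑based path, the same loop would issue two calls per prompt (one upload, one invocation), which is where the per‑request saving adds up.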
When to use each approach
The guidance helps developers pick the right path based on payload size and operational needs:
- Payload ≤ 128,000 bytes (e.g., JSON prompts, structured data) → Inline Body. Simpler, avoids one network round‑trip and S3 PUT charges.
- Payload > 128,000 bytes (e.g., images, audio, large documents) → InputLocation. Upload to S3 first.
- Mixed workload with variable payload sizes → Branch on size; use Body for small payloads, InputLocation for large ones.
- Need to retain input data in S3 for audit or replay → InputLocation, as it preserves the raw input in your bucket.
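The "branch on size" guidance for mixed workloads might look like the sketch below. `invoke_async` and `MAX_INLINE_BYTES` are hypothetical names, and the `async-input/` key prefix simply follows the article's "Before" example. Both clients are passed in as arguments (in practice `boto3.client("sagemaker-runtime")` and `boto3.client("s3")`).

```python
import uuid

# Inline limit stated in the announcement (raw bytes).
MAX_INLINE_BYTES = 128_000


def invoke_async(runtime, s3, endpoint_name, payload, input_bucket,
                 content_type="application/json"):
    """Send small payloads inline and large ones through S3.

    Hypothetical helper: `runtime` and `s3` are expected to behave like
    the boto3 "sagemaker-runtime" and "s3" clients.
    """
    if len(payload) <= MAX_INLINE_BYTES:
        # Small payload: one API call, no input object to manage.
        return runtime.invoke_endpoint_async(
            EndpointName=endpoint_name,
            Body=payload,
            ContentType=content_type,
        )

    # Large payload: upload first, then reference the object by its S3 URI.
    key = f"async-input/{uuid.uuid4()}"
    s3.put_object(Bucket=input_bucket, Key=key, Body=payload)
    return runtime.invoke_endpoint_async(
        EndpointName=endpoint_name,
        InputLocation=f"s3://{input_bucket}/{key}",
        ContentType=content_type,
    )
```

If inputs must be retained for audit or replay, the size check could be bypassed so that every request takes the InputLocation branch.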
Getting started
To begin using inline payloads, customers should:
- Update the AWS SDK – Install or upgrade Boto3 (pip install --upgrade boto3).
- Verify installation – Run pip show boto3 and check the reported version.
- Replace invocation code – Substitute the S3‑upload + InputLocation pattern with the direct Body parameter, as shown in the “After” example.
- Test the call – Invoke InvokeEndpointAsync with the Body parameter and confirm the response contains an OutputLocation.
- Monitor output – Poll the S3 OutputLocation, or subscribe to SNS notifications, to retrieve results.
No modifications are needed to the endpoint configuration, model container, or output S3 setup. The article reminds users that “Following this guide uses billable AWS resources… Follow the cleanup steps after completing the tutorial to avoid ongoing charges.”
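For the "Monitor output" step, one possible polling approach is sketched here. `split_s3_uri` and `wait_for_output` are hypothetical helpers, and the timeout and interval defaults are arbitrary assumptions. The sketch uses the S3 `list_objects_v2` and `get_object` calls, so the caller would need read and list permissions on the output bucket; SNS notifications avoid polling altogether.

```python
import time
from urllib.parse import urlparse


def split_s3_uri(uri):
    """Split 's3://bucket/key' into (bucket, key)."""
    parsed = urlparse(uri)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError(f"not an S3 URI: {uri!r}")
    return parsed.netloc, parsed.path.lstrip("/")


def wait_for_output(s3, output_location, timeout_s=300.0, interval_s=2.0):
    """Poll until the output object exists, then return its bytes.

    Hypothetical helper: `s3` is expected to behave like boto3.client("s3").
    Raises TimeoutError if the object has not appeared within `timeout_s`.
    """
    bucket, key = split_s3_uri(output_location)
    deadline = time.monotonic() + timeout_s
    while True:
        # Listing by prefix avoids exception handling for a missing key.
        listing = s3.list_objects_v2(Bucket=bucket, Prefix=key, MaxKeys=1)
        if any(obj["Key"] == key for obj in listing.get("Contents", [])):
            return s3.get_object(Bucket=bucket, Key=key)["Body"].read()
        if time.monotonic() >= deadline:
            raise TimeoutError(f"no output at {output_location} after {timeout_s}s")
        time.sleep(interval_s)
```

A typical call would pass the `OutputLocation` value from the invocation response, for example `wait_for_output(boto3.client("s3"), response["OutputLocation"])`.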
Clean up
To avoid lingering costs, delete any test resources:
- Remove the SageMaker endpoint: aws sagemaker delete-endpoint --endpoint-name my-async-endpoint
- Delete the output S3 bucket (if no longer needed): aws s3 rb s3://my-output-bucket --force
- Remove any IAM policies created solely for the tutorial.
Conclusion
Inline payload support for SageMaker AI Async Inference removes a common friction point, the mandatory S3 upload for every request, so developers can make a single API call for payloads up to 128,000 bytes. The feature is backward‑compatible; existing InputLocation workflows continue unchanged, and both input paths are processed identically once the request is accepted. By updating the AWS SDK and adopting the Body parameter, customers can enjoy lower latency, simpler architecture, reduced cost, and immediate error feedback. For further details, consult the Amazon SageMaker AI Async Inference documentation.
https://aws.amazon.com/blogs/machine-learning/amazon-sagemaker-ai-async-inference-now-supports-inline-request-payloads/

