Running AI Inference
OmniTensor's decentralized inference network provides a robust, scalable way to run AI inference tasks across a distributed network of GPU nodes. This guide walks you through using that infrastructure for efficient, cost-effective inference.
Prerequisites
OmniTensor SDK installed
Valid API key for OmniTensor services
Familiarity with AI models and inference concepts
Overview
OmniTensor's decentralized inference network lets developers run inference tasks on GPU nodes contributed by the community. This distributed approach offers several advantages:
Scalability - Dynamically scale inference capacity based on demand
Cost-efficiency - Pay only for the compute resources used
Redundancy - Increased fault tolerance through distributed processing
Low latency - Geographically distributed nodes reduce network latency
Supported Models
OmniTensor supports a wide range of pre-trained models, including:
Large Language Models (LLMs): GPT-3, BERT, T5
Computer Vision: YOLO, ResNet, EfficientNet
Speech Recognition: DeepSpeech, Wav2Vec
Custom models: Deploy your own fine-tuned or proprietary models
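For instance, selecting a catalog model and registering a custom model could look like the sketch below. Model.from_catalog is used later in this guide; Model.from_file and client.register_model are hypothetical names shown for illustration only, not confirmed SDK methods:

from omnitensor import Client, Model

client = Client(api_key="YOUR_API_KEY")

# Select a pre-trained model from the catalog (shown again in the example below)
catalog_model = Model.from_catalog("gpt-3-small")

# Register a custom fine-tuned model (hypothetical API: method names
# and arguments are assumptions, not part of the documented SDK)
custom_model = Model.from_file("models/my-finetuned-model.onnx")
client.register_model(custom_model, name="my-finetuned-model")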
Basic Inference Workflow
Initialize OmniTensor client
Select or upload AI model
Prepare input data
Submit inference request
Retrieve and process results
Code Example
Here's a basic example of running inference using the OmniTensor SDK:
from omnitensor import Client, Model, InferenceRequest
# Initialize client
client = Client(api_key="YOUR_API_KEY")
# Select pre-trained model
model = Model.from_catalog("gpt-3-small")
# Prepare input
input_text = "Translate the following English text to French: 'Hello, world!'"
# Create inference request
request = InferenceRequest(
    model=model,
    inputs={"text": input_text},
    output_keys=["translated_text"]
)
# Submit request and get results
result = client.run_inference(request)
print(result.outputs["translated_text"])
Advanced Configuration
Node Selection Strategy
OmniTensor allows you to specify node selection criteria for your inference tasks:
request = InferenceRequest(
    # ... other parameters ...
    node_preferences={
        "min_gpu_memory": 8,    # GB
        "max_latency": 50,      # ms
        "geographical_region": "europe-west"
    }
)
Batching and Streaming
For improved performance, you can batch multiple inputs or use streaming for real-time inference:
# Batched inference
batch_request = InferenceRequest(
    model=model,
    inputs={"texts": ["Hello", "World", "OmniTensor"]},
    batch_size=3
)
# Streaming inference
with client.stream_inference(model) as stream:
    while True:
        input_text = input("Enter text (or 'q' to quit): ")
        if input_text.lower() == 'q':
            break
        result = stream.process(input_text)
        print(result)
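The batched request above only constructs the request object. Submitting it and reading per-item results might look like the following sketch; the output layout (a list under the same key as the inputs) is an assumption based on the single-input example, not a documented guarantee:

# Submit the batched request and iterate over per-item outputs (assumed layout)
batch_result = client.run_inference(batch_request)

for text, output in zip(["Hello", "World", "OmniTensor"], batch_result.outputs["texts"]):
    print(f"{text} -> {output}")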
Monitoring and Optimization
OmniTensor provides real-time metrics for monitoring inference performance:
metrics = client.get_inference_metrics(request_id)
print(f"Inference time: {metrics.latency_ms} ms")
print(f"GPU utilization: {metrics.gpu_utilization}%")
Use these metrics to optimize your inference pipeline and make informed decisions about resource allocation.
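As a simple illustration, the reported latency can drive node selection on subsequent requests. The sketch below reuses only the calls shown in this guide; the 100 ms threshold is an arbitrary example value, not a recommendation:

# If observed latency is too high, prefer faster or closer nodes next time
metrics = client.get_inference_metrics(request_id)

if metrics.latency_ms > 100:  # arbitrary threshold for illustration
    request = InferenceRequest(
        model=model,
        inputs={"text": input_text},
        node_preferences={
            "max_latency": 50,    # ms
            "geographical_region": "europe-west"
        }
    )
    result = client.run_inference(request)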
Error Handling and Retries
OmniTensor's SDK includes built-in error handling and retry mechanisms:
from omnitensor.exceptions import NodeFailureError, TimeoutError
try:
    result = client.run_inference(request)
except NodeFailureError:
    # Automatic retry on a different node
    result = client.run_inference(request, retry_strategy="auto")
except TimeoutError:
    # Handle timeout scenario
    pass
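For transient node failures you may also want a bounded retry loop with backoff around run_inference. The sketch below uses only the SDK calls shown above; the attempt count and delay are arbitrary example values:

import time

from omnitensor.exceptions import NodeFailureError

MAX_ATTEMPTS = 3        # arbitrary example values
BACKOFF_SECONDS = 2

result = None
for attempt in range(1, MAX_ATTEMPTS + 1):
    try:
        result = client.run_inference(request, retry_strategy="auto")
        break
    except NodeFailureError:
        if attempt == MAX_ATTEMPTS:
            raise
        time.sleep(BACKOFF_SECONDS * attempt)  # linear backoff between attempts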