Running AI Inference

OmniTensor's decentralized inference network runs AI inference tasks across a distributed network of GPU nodes. This guide walks you through using that infrastructure to run inference efficiently and cost-effectively.

Prerequisites

OmniTensor SDK installed
Valid API key for OmniTensor services
Familiarity with AI models and inference concepts

Overview

OmniTensor's decentralized inference network lets developers run inference tasks on GPU nodes contributed by the community rather than on a single provider's servers. This approach offers several advantages:

Scalability - Dynamically scale inference capacity based on demand
Cost-efficiency - Pay only for the compute resources used
Redundancy - Increased fault tolerance through distributed processing
Low latency - Geographically distributed nodes reduce network latency

Supported Models

OmniTensor supports a wide range of pre-trained models, including:

Large Language Models (LLMs): GPT-3, BERT, T5
Computer Vision: YOLO, ResNet, EfficientNet
Speech Recognition: DeepSpeech, Wav2Vec
Custom models: Deploy your own fine-tuned or proprietary models

Basic Inference Workflow

Initialize OmniTensor client
Select or upload AI model
Prepare input data
Submit inference request
Retrieve and process results

Code Example

Here's a basic example of running inference using the OmniTensor SDK:

from omnitensor import Client, Model, InferenceRequest

# Initialize client
client = Client(api_key="YOUR_API_KEY")

# Select pre-trained model
model = Model.from_catalog("gpt-3-small")

# Prepare input
input_text = "Translate the following English text to French: 'Hello, world!'"

# Create inference request
request = InferenceRequest(
    model=model,
    inputs={"text": input_text},
    output_keys=["translated_text"]
)

# Submit request and get results
result = client.run_inference(request)

print(result.outputs["translated_text"])
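
The example above embeds the API key directly in the source. As a minimal sketch of an alternative, the key could be read from the environment before the client is created. The variable name OMNITENSOR_API_KEY and the helper load_api_key are assumptions made for this illustration and are not part of the OmniTensor SDK.

```python
import os


def load_api_key(env=None, var_name="OMNITENSOR_API_KEY"):
    """Return the API key stored in an environment variable.

    `var_name` is a hypothetical name chosen for this sketch; use whatever
    variable your deployment actually defines.
    """
    env = os.environ if env is None else env
    key = env.get(var_name, "").strip()
    if not key:
        # Fail early with a clear message instead of sending an empty key.
        raise RuntimeError(f"Set {var_name} before creating the client")
    return key
```

The returned string would then be passed as Client(api_key=load_api_key()).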

Advanced Configuration

Node Selection Strategy

OmniTensor allows you to specify node selection criteria for your inference tasks:

request = InferenceRequest(
    # ... other parameters ...
    node_preferences={
        "min_gpu_memory": 8,  # GB
        "max_latency": 50,    # ms
        "geographical_region": "europe-west"
    }
)
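
This guide does not describe how the network applies these preferences. One plausible reading is that each preference acts as a hard filter on candidate nodes. The sketch below illustrates that reading only: the Node class, its field names, and eligible_nodes are hypothetical and do not come from the SDK.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Node:
    # Hypothetical node description; field names are assumptions for this sketch.
    node_id: str
    gpu_memory_gb: int
    latency_ms: float
    region: str


def eligible_nodes(nodes, preferences):
    """Return the nodes that satisfy every given preference, lowest latency first."""
    min_mem = preferences.get("min_gpu_memory", 0)
    max_latency = preferences.get("max_latency", float("inf"))
    region = preferences.get("geographical_region")  # None means any region

    matches = [
        n for n in nodes
        if n.gpu_memory_gb >= min_mem
        and n.latency_ms <= max_latency
        and (region is None or n.region == region)
    ]
    return sorted(matches, key=lambda n: n.latency_ms)


nodes = [
    Node("a", 16, 35.0, "europe-west"),
    Node("b", 4, 20.0, "europe-west"),   # too little GPU memory
    Node("c", 24, 80.0, "europe-west"),  # latency above the limit
    Node("d", 12, 25.0, "us-east"),      # outside the requested region
    Node("e", 8, 30.0, "europe-west"),
]
prefs = {"min_gpu_memory": 8, "max_latency": 50, "geographical_region": "europe-west"}
selected = [n.node_id for n in eligible_nodes(nodes, prefs)]
```

Tighter preferences shrink the candidate pool, so very strict values may leave no eligible node at all.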

Batching and Streaming

For improved performance, you can batch multiple inputs or use streaming for real-time inference:

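# Batched inference: send several inputs in a single request
batch_request = InferenceRequest(
    model=model,
    inputs={"texts": ["Hello", "World", "OmniTensor"]},
    batch_size=3
)
# Submit the batch the same way as a single request
batch_result = client.run_inference(batch_request)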

# Streaming inference
with client.stream_inference(model) as stream:
    while True:
        input_text = input("Enter text (or 'q' to quit): ")
        if input_text.lower() == 'q':
            break
        result = stream.process(input_text)
        print(result)
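
When there are more inputs than fit in one batch, they can be split on the client side before the requests are built. The guide does not say whether the SDK does this automatically, so the helper below is a plain-Python sketch. The name chunked is an assumption for this illustration, not an SDK function.

```python
def chunked(items, batch_size):
    """Yield consecutive slices of `items`, each at most `batch_size` long."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


texts = ["Hello", "World", "OmniTensor", "decentralized", "inference"]
batches = list(chunked(texts, 2))
```

Each element of batches could then be used as the "texts" input of its own InferenceRequest.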

Monitoring and Optimization

OmniTensor provides real-time metrics for monitoring inference performance:

# request_id identifies a previously submitted inference request
metrics = client.get_inference_metrics(request_id)
print(f"Inference time: {metrics.latency_ms} ms")
print(f"GPU utilization: {metrics.gpu_utilization}%")

Use these metrics to spot slow or underutilized requests, then adjust settings such as node preferences or batch size accordingly.
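
A single reading says little on its own, so it often helps to summarize latency across many requests. The sketch below assumes you have collected the latency_ms values into a plain list. The function summarize_latencies is hypothetical and works on ordinary numbers, independent of the SDK.

```python
import statistics


def summarize_latencies(latencies_ms):
    """Return the mean, nearest-rank 95th percentile, and maximum latency."""
    if not latencies_ms:
        raise ValueError("need at least one latency sample")
    ordered = sorted(latencies_ms)
    # Nearest-rank percentile: rank = ceil(0.95 * n), computed with integers
    # to avoid floating-point rounding surprises.
    rank = (95 * len(ordered) + 99) // 100
    return {
        "mean_ms": statistics.fmean(ordered),
        "p95_ms": ordered[rank - 1],
        "max_ms": ordered[-1],
    }


summary = summarize_latencies([10, 20, 30, 40, 100])
```

The tail value (p95_ms) is usually a better guide than the mean when choosing a max_latency preference, because a few slow nodes can dominate the user experience.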

Error Handling and Retries

OmniTensor's SDK includes built-in error handling and retry mechanisms:

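# Alias the SDK's TimeoutError so it does not shadow Python's built-in TimeoutError
from omnitensor.exceptions import NodeFailureError, TimeoutError as InferenceTimeoutError

try:
    result = client.run_inference(request)
except NodeFailureError:
    # The node failed: resubmit and let the SDK retry on a different node
    result = client.run_inference(request, retry_strategy="auto")
except InferenceTimeoutError:
    # Handle the timeout scenario; no result is available on this path
    result = None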
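
If you want explicit control over how often and how quickly a failed request is resubmitted, a generic retry wrapper with exponential backoff is one option. This is a client-side sketch, not a description of what retry_strategy="auto" does internally. The function run_with_retries and its parameters are assumptions for this illustration.

```python
import time


def run_with_retries(task, retryable=(Exception,), max_attempts=3,
                     base_delay=0.5, sleep=time.sleep):
    """Call `task()` until it succeeds or `max_attempts` is reached.

    The wait doubles after each failed attempt: base_delay, 2 * base_delay, ...
    Exceptions not listed in `retryable` propagate immediately.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            return task()
        except retryable:
            if attempt == max_attempts:
                raise  # out of attempts: surface the last error
            sleep(base_delay * 2 ** (attempt - 1))
```

With the SDK, the call might look like run_with_retries(lambda: client.run_inference(request), retryable=(NodeFailureError,)). The sleep parameter exists so that tests can replace the real delay.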

