Running AI Inference

OmniTensor's decentralized inference network runs AI inference tasks across a distributed network of GPU nodes. This guide walks through using that infrastructure with the OmniTensor SDK, from a first inference request to node selection, batching, streaming, monitoring, and error handling.

Prerequisites

  • OmniTensor SDK installed

  • Valid API key for OmniTensor services

  • Familiarity with AI models and inference concepts

Overview

The decentralized inference network lets developers run inference tasks on GPU nodes contributed by the community. This approach offers several advantages:

  • Scalability - Dynamically scale inference capacity based on demand

  • Cost-efficiency - Pay only for the compute resources used

  • Redundancy - Increased fault tolerance through distributed processing

  • Low latency - Geographically distributed nodes reduce network latency

Supported Models

OmniTensor supports a wide range of pre-trained models, including:

  • Large Language Models (LLMs): GPT-3, BERT, T5

  • Computer Vision: YOLO, ResNet, EfficientNet

  • Speech Recognition: DeepSpeech, Wav2Vec

  • Custom models: Deploy your own fine-tuned or proprietary models

Basic Inference Workflow

  1. Initialize OmniTensor client

  2. Select or upload AI model

  3. Prepare input data

  4. Submit inference request

  5. Retrieve and process results

Code Example

Here's a basic example of running inference using the OmniTensor SDK:

from omnitensor import Client, Model, InferenceRequest

# Initialize client
client = Client(api_key="YOUR_API_KEY")

# Select pre-trained model
model = Model.from_catalog("gpt-3-small")

# Prepare input
input_text = "Translate the following English text to French: 'Hello, world!'"

# Create inference request
request = InferenceRequest(
    model=model,
    inputs={"text": input_text},
    output_keys=["translated_text"]
)

# Submit request and get results
result = client.run_inference(request)

print(result.outputs["translated_text"])

Advanced Configuration

Node Selection Strategy

OmniTensor allows you to specify node selection criteria for your inference tasks:

request = InferenceRequest(
    # ... other parameters ...
    node_preferences={
        "min_gpu_memory": 8,  # GB
        "max_latency": 50,    # ms
        "geographical_region": "europe-west"
    }
)
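
To make the meaning of these preferences concrete, the following sketch filters a list of candidate nodes locally. It is not the network's actual scheduler: the Node dataclass, its fields, and select_nodes are hypothetical names introduced for illustration, and the sketch assumes that each preference key acts as a hard constraint, which this guide does not confirm.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Node:
    """Hypothetical description of a GPU node (not an SDK type)."""
    node_id: str
    gpu_memory_gb: int
    latency_ms: float
    region: str


def select_nodes(nodes, preferences):
    """Return the nodes that satisfy every preference, lowest latency first.

    Assumes each preference is a hard constraint; a missing key means
    "no constraint".
    """
    min_mem = preferences.get("min_gpu_memory", 0)
    max_latency = preferences.get("max_latency", float("inf"))
    region = preferences.get("geographical_region")

    eligible = [
        n for n in nodes
        if n.gpu_memory_gb >= min_mem
        and n.latency_ms <= max_latency
        and (region is None or n.region == region)
    ]
    return sorted(eligible, key=lambda n: n.latency_ms)


# Made-up candidate nodes for illustration
candidates = [
    Node("a", 16, 35.0, "europe-west"),
    Node("b", 4, 20.0, "europe-west"),    # too little GPU memory
    Node("c", 24, 80.0, "europe-west"),   # latency above the limit
    Node("d", 12, 30.0, "us-east"),       # wrong region
    Node("e", 8, 25.0, "europe-west"),
]
prefs = {
    "min_gpu_memory": 8,
    "max_latency": 50,
    "geographical_region": "europe-west",
}
print([n.node_id for n in select_nodes(candidates, prefs)])  # ['e', 'a']
```

Tighter constraints shrink the pool of eligible nodes, so very strict preferences may trade availability for performance.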

Batching and Streaming

For improved performance, you can batch multiple inputs or use streaming for real-time inference:

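# Batched inference: submit several inputs in one request
batch_request = InferenceRequest(
    model=model,
    inputs={"texts": ["Hello", "World", "OmniTensor"]},
    batch_size=3  # number of inputs processed together
)
batch_result = client.run_inference(batch_request)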

# Streaming inference
with client.stream_inference(model) as stream:
    while True:
        input_text = input("Enter text (or 'q' to quit): ")
        if input_text.lower() == 'q':
            break
        result = stream.process(input_text)
        print(result)
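
When the input list is longer than a single batch, it has to be split before submission. The following is a minimal sketch of that splitting step in plain Python; chunked is a hypothetical helper, not part of the SDK.

```python
from typing import Iterator, List, Sequence, TypeVar

T = TypeVar("T")


def chunked(items: Sequence[T], batch_size: int) -> Iterator[List[T]]:
    """Yield consecutive slices of items with at most batch_size elements."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield list(items[start:start + batch_size])


texts = ["Hello", "World", "OmniTensor", "decentralized", "inference"]
batches = list(chunked(texts, batch_size=3))
print(batches)
# [['Hello', 'World', 'OmniTensor'], ['decentralized', 'inference']]
```

Each batch could then be wrapped in its own InferenceRequest, with batch_size set to the length of that batch.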

Monitoring and Optimization

OmniTensor provides real-time metrics for monitoring inference performance:

# request_id: identifier of a previously submitted inference request
metrics = client.get_inference_metrics(request_id)
print(f"Inference time: {metrics.latency_ms} ms")
print(f"GPU utilization: {metrics.gpu_utilization}%")

Use these metrics to optimize your inference pipeline and make informed decisions about resource allocation.
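
One way to act on these numbers is to collect the latency of many requests and look at percentiles rather than single values. The sketch below uses only the Python standard library; the sample latencies are made-up illustrative values, and summarize_latencies is a hypothetical helper, not an SDK function.

```python
import statistics


def summarize_latencies(latencies_ms):
    """Return mean, median and 95th percentile of a list of latencies (ms)."""
    if len(latencies_ms) < 2:
        raise ValueError("need at least two samples")
    # quantiles(n=20) returns 19 cut points; the last one is the 95th percentile.
    p95 = statistics.quantiles(latencies_ms, n=20, method="inclusive")[-1]
    return {
        "mean_ms": statistics.fmean(latencies_ms),
        "median_ms": statistics.median(latencies_ms),
        "p95_ms": p95,
    }


# Made-up values standing in for metrics.latency_ms from several requests
samples = [42.0, 38.5, 40.1, 120.3, 39.9, 41.2, 43.7, 37.8]
summary = summarize_latencies(samples)
print(summary)
```

A mean well above the median, as with the single slow request in this sample, suggests that a few outlier nodes dominate the tail.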

Error Handling and Retries

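The SDK raises dedicated exceptions for node failures and timeouts, and run_inference accepts a retry_strategy argument for automatic retries:

from omnitensor.exceptions import NodeFailureError
# Alias avoids shadowing Python's built-in TimeoutError
from omnitensor.exceptions import TimeoutError as InferenceTimeoutError

try:
    result = client.run_inference(request)
except NodeFailureError:
    # Resubmit and let the SDK retry on a different node
    result = client.run_inference(request, retry_strategy="auto")
except InferenceTimeoutError:
    # No result was produced; fall back, alert, or re-raise as appropriate
    result = None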
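
Where more control over retries is needed, a client-side wrapper with exponential backoff is a common complement. The sketch below is generic Python: NodeFailureError is redefined locally as a stand-in so the example runs without the SDK, and retry_with_backoff and flaky_inference are hypothetical names.

```python
import time


class NodeFailureError(Exception):
    """Local stand-in for the SDK exception so the sketch runs on its own."""


def retry_with_backoff(func, max_attempts=3, base_delay=0.01, sleep=time.sleep):
    """Call func(); on NodeFailureError wait base_delay * 2**attempt and retry.

    Re-raises the last error once max_attempts calls have failed.
    """
    for attempt in range(max_attempts):
        try:
            return func()
        except NodeFailureError:
            if attempt == max_attempts - 1:
                raise
            sleep(base_delay * 2 ** attempt)


# Simulated inference call that fails twice before succeeding
calls = {"count": 0}


def flaky_inference():
    calls["count"] += 1
    if calls["count"] < 3:
        raise NodeFailureError("node dropped out")
    return "ok"


outcome = retry_with_backoff(flaky_inference)
print(outcome, calls["count"])  # ok 3
```

In real use, func would be a small function or lambda that calls client.run_inference(request), and base_delay would be tuned to the latency of the workload.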
