We at DeepInfra are excited to announce our integration with LlamaIndex, a library for indexing and searching documents with a language model and an embedding model of your choice. In this blog post, we will show you how to chat with books using DeepInfra and LlamaIndex.
We will fetch the text of "Crime and Punishment" by Fyodor Dostoevsky from Project Gutenberg, then use the Meta Llama 3 70B Instruct language model and the all-MiniLM-L12-v2 embedding model to chat with the book.
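To give a feel for what "index and search" means, here is a toy sketch of embedding-based retrieval. It is only an illustration: the `embed` function below counts words, whereas a real embedding model such as MiniLM returns a dense vector, and the helper names are made up for this example rather than taken from LlamaIndex.

```python
import math
import re
from collections import Counter


def embed(text):
    """Toy embedding: lowercase word counts (a real model returns a dense vector)."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def top_chunk(chunks, question):
    """Return the chunk whose toy embedding is most similar to the question's."""
    q = embed(question)
    return max(chunks, key=lambda chunk: cosine(embed(chunk), q))
```

In the real pipeline, the retrieved passages are handed to the language model together with the question, so the answer is grounded in the book's text.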
First, let's create a virtual environment and activate it:
python3 -m venv venv
source venv/bin/activate
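Here are the required packages to install (the script below also imports requests and python-dotenv, so they are listed as well):
llama-index
llama-index-llms-deepinfra
llama-index-embeddings-deepinfra
python-dotenv
requests
Let's install them:
pip install llama-index llama-index-llms-deepinfra llama-index-embeddings-deepinfra python-dotenv requests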
Before getting started, you also need a DeepInfra API key, which you can create from your DeepInfra account.
Create a .env file in the root directory of the project and add the following line:
DEEPINFRA_API_TOKEN=YOUR_DEEPINFRA_API_KEY
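A missing or placeholder key usually shows up only as an authentication error on the first request. As a minimal sketch, assuming the key lives in the DEEPINFRA_API_TOKEN environment variable as in the .env file above, a small check can fail earlier with a clearer message. The helper name `get_api_token` is hypothetical and not part of either library.

```python
import os


def get_api_token(env=os.environ):
    """Return the DeepInfra token from `env`, or raise a clear error."""
    token = env.get("DEEPINFRA_API_TOKEN", "").strip()
    # Treat the placeholder from the example .env file as "not configured".
    if not token or token == "YOUR_DEEPINFRA_API_KEY":
        raise RuntimeError(
            "DEEPINFRA_API_TOKEN is not set; add your key to the .env file"
        )
    return token
```

Call it once after the .env file has been loaded, before creating the LLM or the embedding model.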
Here's a Python script to chat with the book "Crime and Punishment":
import re

import requests
from dotenv import find_dotenv, load_dotenv

# Load DEEPINFRA_API_TOKEN from the .env file before importing the integrations.
_ = load_dotenv(find_dotenv())

from llama_index.core import Document, VectorStoreIndex
from llama_index.embeddings.deepinfra import DeepInfraEmbeddingModel
from llama_index.llms.deepinfra import DeepInfraLLM

LLM = "meta-llama/Meta-Llama-3-70B-Instruct"
EMBEDDING = "sentence-transformers/all-MiniLM-L12-v2"
BOOK_TITLE = "Crime and Punishment"


def maybe_get_gutenberg_book_id(title):
    """Return the Project Gutenberg ID of the first book matching `title`, or None."""
    # Passing the title through `params` URL-encodes the spaces in it.
    response = requests.get("https://gutendex.com/books/", params={"search": title})
    response.raise_for_status()
    for book in response.json()["results"]:
        if title.lower() in book["title"].lower():
            return book["id"]
    return None


def get_document(book_id):
    """Download the plain-text book and wrap it in a LlamaIndex Document."""
    url = f"https://www.gutenberg.org/files/{book_id}/{book_id}-0.txt"
    response = requests.get(url)
    response.raise_for_status()
    # Drop non-ASCII characters (curly quotes, accented letters, and so on).
    text = re.sub(r"[^\x00-\x7F]+", "", response.text)
    return Document(text=text)


if __name__ == "__main__":
    llm = DeepInfraLLM(LLM, max_tokens=1000)
    embed_model = DeepInfraEmbeddingModel(EMBEDDING)

    book_id = maybe_get_gutenberg_book_id(BOOK_TITLE)
    if book_id is None:
        raise SystemExit(f"No Project Gutenberg book found for {BOOK_TITLE!r}")
    document = get_document(book_id)

    # Embed the book and build an in-memory vector index over it.
    index = VectorStoreIndex.from_documents([document], embed_model=embed_model)
    chat_engine = index.as_chat_engine(
        llm=llm, embed_model=embed_model, max_iterations=20
    )
    response = chat_engine.chat(
        "Summarize the discussion between Raskolnikov and Pyotr Petrovich"
    )
    print(response)
    # The conversation between Raskolnikov and Pyotr Petrovich takes place at the office of...
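One thing to keep in mind: the downloaded file also contains Project Gutenberg's license header and footer, so the script indexes that text along with the novel. Here is a minimal sketch of one way to trim it, assuming the file uses the usual `*** START OF THE PROJECT GUTENBERG EBOOK ... ***` and `*** END OF THE PROJECT GUTENBERG EBOOK ... ***` marker lines. The function name `strip_gutenberg_boilerplate` is our own and not part of any library.

```python
import re

# Marker lines look like "*** START OF THE PROJECT GUTENBERG EBOOK <TITLE> ***".
# Older files say "THIS" instead of "THE", so both are accepted.
START_RE = re.compile(
    r"^\*\*\* ?START OF (?:THE|THIS) PROJECT GUTENBERG EBOOK.*$", re.MULTILINE
)
END_RE = re.compile(
    r"^\*\*\* ?END OF (?:THE|THIS) PROJECT GUTENBERG EBOOK.*$", re.MULTILINE
)


def strip_gutenberg_boilerplate(text):
    """Return the text between the START and END markers, if both are found."""
    start = START_RE.search(text)
    end = END_RE.search(text)
    if not start or not end or end.start() <= start.end():
        # Markers missing or out of order: leave the text untouched.
        return text
    return text[start.end():end.start()].strip()
```

If you want to try it, apply it to the downloaded text inside `get_document` before the `Document` is created.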
Voila! You have successfully chatted with the book "Crime and Punishment" using DeepInfra and LlamaIndex. To chat with another Project Gutenberg book, change BOOK_TITLE and ask your own questions. Enjoy reading!
For more information on using LlamaIndex with DeepInfra, please visit our LLM documentation and embedding documentation.
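The script asks a single question, but the chat engine can be called repeatedly, and LlamaIndex chat engines generally keep the conversation history between calls. A minimal sketch of asking several questions in a row follows. The helper `ask_all` is hypothetical, and it only assumes the engine exposes the same `chat` method used in the script.

```python
def ask_all(chat_engine, questions):
    """Send each question in order and collect (question, answer) pairs."""
    transcript = []
    for question in questions:
        # str() gives the same text that print(response) shows in the script.
        answer = str(chat_engine.chat(question))
        transcript.append((question, answer))
    return transcript
```

For example, `ask_all(chat_engine, ["Who is Sonia?", "How does she know Raskolnikov?"])` returns a list of question and answer pairs that you can print or save.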
Feel free to experiment with other books and questions to explore the capabilities of DeepInfra. See you in the next blog post!
Happy chatting! 📚🦙