6.1. LLM
6.1.1. Compressing Prompts With No Loss with llmlingua
Here is how to reduce the costs of working with LLMs.
When working with LLMs, we often encounter problems like exceeding token limits, forgetting context, or paying much more for usage than expected.
Researchers from Microsoft try to solve these problems with llmlingua.
llmlingua compresses your prompt by using a small trained LLM to detect and drop unimportant tokens.
They claim to achieve up to 20x compression with no or minimal performance loss.
I tried it out myself and noticed no performance loss at all, but I would be cautious in critical applications.
!pip install llmlingua
from llmlingua import PromptCompressor
prompt = "<YOUR_PROMPT>"
llm_lingua = PromptCompressor("lgaalves/gpt2-dolly")
compressed_prompt = llm_lingua.compress_prompt(prompt, instruction="", question="", target_token=200)
# {'compressed_prompt': 'are- that turns into formatting & with like "[]" best it.......',
# 'origin_tokens': 2430,
# 'compressed_tokens': 261,
# 'ratio': '9.3x',
# 'saving': 'Saving $0.1 in GPT-4.'}
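The compressed_prompt field of the returned dict is what you then send to your target model. A minimal sketch with the OpenAI client (the client setup and target model name are assumptions, not part of llmlingua):
from openai import OpenAI
client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment
response = client.chat.completions.create(
    model="gpt-4",  # hypothetical target model
    messages=[{"role": "user", "content": compressed_prompt["compressed_prompt"]}],
)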
6.1.2. One-Function Call to Any LLM with litellm
Do you want a one-function call to any LLM in Python? Try litellm.
litellm is a Python package to call any LLM in a consistent format and return a consistent output.
You only need to set the API key of the provider and the model name.
It also supports async calls and streaming the model's response.
!pip install litellm
import os
from litellm import completion
os.environ["OPENAI_API_KEY"] = "your-api-key"
os.environ["ANTHROPIC_API_KEY"] = "your-api-key"
os.environ['MISTRAL_API_KEY'] = "your-api-key"
messages = [{"content": "Hello, how are you?", "role": "user"}]
# OpenAI
response = completion(model="gpt-3.5-turbo", messages=messages)
# Anthropic
response = completion(model="claude-instant-1", messages=messages)
# Mistral
response = completion(model="mistral/mistral-tiny", messages=messages)
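The streaming and async support mentioned above is one keyword or function away. A minimal sketch reusing the messages from above (the chunk handling is an assumption):
# Streaming: pass stream=True and iterate over the chunks
response = completion(model="gpt-3.5-turbo", messages=messages, stream=True)
for chunk in response:
    print(chunk.choices[0].delta.content or "", end="")

# Async: await acompletion inside a coroutine
import asyncio
from litellm import acompletion

async def main():
    return await acompletion(model="gpt-3.5-turbo", messages=messages)

asyncio.run(main())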
6.1.3. Safeguard Your LLMs with LLMGuard
Safeguarding your LLMs against unwanted behavior is critical.
LLMGuard, a Python package, ensures safe interaction between the user and the LLM.
It checks prompts and outputs for:
Sensitive information like credit card numbers, which it sanitizes
Toxic or harmful language
Prompt injections
Use LLMGuard to make your LLMs safer.
!pip install llm-guard
from openai import OpenAI
from llm_guard import scan_output, scan_prompt
from llm_guard.input_scanners import Anonymize, PromptInjection, Toxicity
from llm_guard.vault import Vault
client = OpenAI(api_key="OPENAIKEY")
vault = Vault()
input_scanners = [Anonymize(vault), Toxicity(), PromptInjection()]
prompt = "Make an SQL insert statement to add a new user to our database. Name is John Doe. \
Email is test@test.com Phone number is 555-123-4567 and the IP address is 192.168.1.100. \
And credit card number is 4567-8901-2345-6789."
sanitized_prompt, results_valid, results_score = scan_prompt(input_scanners, prompt)
response = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": sanitized_prompt},
    ],
)
# Sanitized Prompt:
# Make an SQL insert statement to add a new user to our database.
# Name is [REDACTED_PERSON_1]. Email is [REDACTED_EMAIL_ADDRESS_1] Phone number is
# [REDACTED_PHONE_NUMBER_1] and the IP address is [REDACTED_IP_ADDRESS_1].
# And credit card number is [REDACTED_CREDIT_CARD_RE_1].
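Model outputs can be checked the same way with scan_output, which is already imported above. A minimal sketch using the library's output scanners (the scanner selection here is an assumption):
from llm_guard.output_scanners import Deanonymize, NoRefusal
output_scanners = [Deanonymize(vault), NoRefusal()]
response_text = response.choices[0].message.content
# Restores redacted entities from the vault and checks for refusals
sanitized_response, results_valid, results_score = scan_output(
    output_scanners, sanitized_prompt, response_text
)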
6.1.4. Evaluate LLMs with uptrain
Evaluating LLMs can be tricky.
Luckily, uptrain offers a neat library to do that.
uptrain is a Python library to evaluate LLMs with 20+ preconfigured checks.
These include the quality of the responses (completeness, validity, …), language proficiency (tonality, conciseness, …), and code hallucination.
It supports the biggest providers like OpenAI, Mistral, Claude, and Ollama.
!pip install uptrain
from uptrain import EvalLLM, Evals
import json
OPENAI_API_KEY = "*******"
data = [{
    'question': 'Which is the most popular global sport?',
    'context': "The popularity of sports can be measured in various ways, including TV viewership...",
    'response': 'Football is the most popular sport with around 4 billion followers worldwide'
}]
eval_llm = EvalLLM(openai_api_key=OPENAI_API_KEY)
results = eval_llm.evaluate(
    data=data,
    checks=[Evals.CONTEXT_RELEVANCE, Evals.FACTUAL_ACCURACY, Evals.RESPONSE_COMPLETENESS]
)
print(json.dumps(results, indent=3))
6.1.5. Embed Any Type of File
These days, everything is about embeddings and LLMs.
The Python library embed-anything makes it easy to generate embeddings from multiple sources like images, videos, or audio.
It's built in Rust, so it executes fast.
!pip install embed-anything
import numpy as np
import embed_anything
data = embed_anything.embed_file("filename.pdf", embeder="Bert")
embeddings = np.array([item.embedding for item in data])
data = embed_anything.embed_directory("test_files", embeder="Clip")
embeddings = np.array([item.embedding for item in data])
6.1.6. Structured LLM Output with outlines
Are you annoyed by the unstructured outputs of LLMs? Try outlines.
outlines is a Python library for structured generation with your LLMs.
There are multiple ways to enforce the output format of the model, like choices, type constraints, JSON output, and more.
!pip install outlines
import outlines
model = outlines.models.transformers("mistralai/Mistral-7B-Instruct-v0.2")
prompt = """You are a sentiment-labelling assistant.
Is the following review positive or negative?
Review: This restaurant is just awesome!
"""
generator = outlines.generate.choice(model, ["Positive", "Negative"])
answer = generator(prompt)
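The JSON output mentioned above works by passing a Pydantic model as the schema. A minimal sketch reusing the model and prompt from above (the schema itself is an assumption):
from pydantic import BaseModel

class Sentiment(BaseModel):
    label: str
    confidence: float

# Generation is constrained to valid JSON matching the schema
json_generator = outlines.generate.json(model, Sentiment)
result = json_generator(prompt)  # returns a validated Sentiment instance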
6.1.7. Evaluating RAG Pipelines with Ragas
How do you evaluate your RAG application?
Sure, you can manually look over your responses and check if they are what you want.
But that's not scalable.
Instead, use Ragas in Python.
Ragas is a library providing evaluation techniques and metrics for your RAG pipeline, like context precision/recall, faithfulness, and answer relevancy.
See below how easy it is to run Ragas.
!pip install ragas
from datasets import Dataset
import os
from ragas import evaluate
from ragas.metrics import faithfulness, answer_correctness
os.environ["OPENAI_API_KEY"] = "your-openai-key"
data_samples = {
    'question': ['When was the first super bowl?', 'Who won the most super bowls?'],
    'answer': ['The first superbowl was held on Jan 15, 1967', 'The most super bowls have been won by The New England Patriots'],
    'contexts': [['The First AFL–NFL World Championship Game was an American football game played on January 15, 1967, at the Los Angeles Memorial Coliseum in Los Angeles,'],
        ['The Green Bay Packers...Green Bay, Wisconsin.', 'The Packers compete...Football Conference']],
    'ground_truth': ['The first superbowl was held on January 15, 1967', 'The New England Patriots have won the Super Bowl a record six times']
}
dataset = Dataset.from_dict(data_samples)
score = evaluate(dataset, metrics=[faithfulness, answer_correctness])
score.to_pandas()
6.1.8. Unified Reranker API with rerankers
An essential element of your RAG system is reranking.
Reranking uses a model that outputs a relevance score for each retrieved document given the user query.
The rerankers library gives you a unified API for popular vendors and models such as Cohere, Jina, or T5.
The perfect API to easily test and swap between many methods.
!pip install rerankers
from rerankers import Reranker
ranker = Reranker("t5")
results = ranker.rank(query="I love you", docs=["I hate you", "I really like you"], doc_ids=[0,1])
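The returned object holds the ranked documents with their scores; for example, top_k gives you the best matches (a minimal sketch):
# Inspect the highest-scoring document
print(results.top_k(1))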
6.1.9. Create Embeddings on your CPU with fastembed
My favourite library for creating embeddings: fastembed, developed by Qdrant.
fastembed is a lightweight and fast library for using popular embedding models, without using your GPU.
It also integrates seamlessly with Qdrant's vector database.
I would like to see more supported models though, as fastembed has so much potential.
!pip install fastembed
from fastembed import TextEmbedding
documents = [
    "This is some",
    "example document",
]
embedding_model = TextEmbedding(model_name="jinaai/jina-embeddings-v2-small-en")
embeddings = list(embedding_model.embed(documents))
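The Qdrant integration mentioned above can even handle the embedding step for you. A minimal sketch, assuming qdrant-client is installed with its fastembed extra:
from qdrant_client import QdrantClient

qdrant = QdrantClient(":memory:")  # throwaway in-memory instance for testing
qdrant.add(collection_name="demo", documents=documents)  # embeds via fastembed
hits = qdrant.query(collection_name="demo", query_text="example document")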
6.1.10. Convert Files to Markdown & JSON with docling
Preparing your data for LLMs is a crucial step in RAG applications.
docling simplifies this step for you by converting popular document formats like PDF or PPT to Markdown or JSON.
It uses two models, a layout analysis model and a table structure recognition model, to process the files.
!pip install docling
from docling.document_converter import DocumentConverter
source = "https://arxiv.org/pdf/2408.09869"
converter = DocumentConverter()
result = converter.convert(source)
print(result.document.export_to_markdown())
# Output: "## Docling Technical Report[...]"
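The JSON side mentioned above works the same way; export_to_dict returns a serializable dict (a minimal sketch):
import json
print(json.dumps(result.document.export_to_dict())[:100])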
6.1.11. Simple Chunking Library with chonkie
Having a great chunking library without installing 500 MB of subdependencies has been my childhood dream.
Luckily, chonkie provides you with the most important chunking strategies.
Currently, it supports:
Token chunker
Word chunker
Sentence chunker
Semantic chunker
Semantic Double-Pass Merge chunker
!pip install chonkie
from chonkie import SemanticChunker
chunker = SemanticChunker(
    embedding_model="all-minilm-l6-v2",
    chunk_size=512,
    similarity_threshold=0.7
)
chunks = chunker.chunk("Some text with semantic meaning to chunk appropriately.")
for chunk in chunks:
    print(f"Chunk: {chunk.text}")
    print(f"Number of semantic sentences: {len(chunk.sentences)}")
6.1.12. Type-Safe Agentic AI with pydantic-ai
A type-safe Python agent framework was on my Christmas wishlist.
Luckily, PydanticAI came out just in time.
PydanticAI bridges the gap between LLMs and structured data validation.
Type safety is my favourite feature, and it makes structured responses and custom models easy to use.
Definitely something I have to dig much deeper into.
!pip install pydantic-ai
from pydantic_ai import Agent
agent = Agent(
    'gemini-1.5-flash',
    system_prompt='Be concise, reply with one sentence.',
)
result = agent.run_sync('Where does "hello world" come from?')
print(result.data)
"""
The first known use of "hello, world" was in a 1974 textbook about the C programming language.
"""
6.1.13. Convert Files into Markdown with markitdown
Preparing your data for LLMs can be hard.
Different file formats like PPT, Excel, or audio files need different preprocessing steps.
Luckily, with markitdown from Microsoft, this is easy.
markitdown converts various file formats to Markdown.
Currently, it supports:
PDF
PPT
Word
Excel
Images
Audio
HTML
JSON, XML
ZIP files
What I really like is that for image descriptions, you can use LLMs and even adjust the prompt for it.
Thanks to Jimi Vaubien for showing me this gem.
!pip install markitdown
from markitdown import MarkItDown
from openai import OpenAI
client = OpenAI()
md = MarkItDown(llm_client=client, llm_model="gpt-4o", llm_prompt="Describe the image without going into details like colors or shapes.")
result = md.convert("example.jpg")
print(result.text_content)
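For text-based formats, no LLM client is needed at all; a minimal sketch (the filename is hypothetical):
md_plain = MarkItDown()
print(md_plain.convert("report.xlsx").text_content)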
6.1.14. Route Queries Intelligently with RouteLLM
Using LLMs like o1/o3 or Claude Sonnet for every task is a waste of money.
For simple tasks, you can switch to a much simpler and more lightweight model to save on costs and compute time.
With RouteLLM, queries will be routed intelligently to the right LLM for the job.
RouteLLM acts as a traffic controller for your LLM workflows.
Instead of blindly sending all requests to the most expensive model, it dynamically routes queries based on:
Complexity: simple tasks go to smaller, cheaper models
Accuracy needs: critical tasks go to heavyweight models
And it’s open-source!
See below for a simple example:
We use GPT-4 as our strong model and Mixtral 8x7B as our weak model.
The router model router-mf-0.11593 is the default routing model and sufficient for basic query routing.
The threshold 0.11593 can be calibrated by yourself, e.g. so that ~30% of queries are routed to the strong model.
!pip install "routellm[serve,eval]"
import os
from routellm.controller import Controller
os.environ["OPENAI_API_KEY"] = "sk-XXXXXX"
os.environ["ANYSCALE_API_KEY"] = "esecret_XXXXXX"
client = Controller(
    routers=["mf"],
    strong_model="gpt-4-1106-preview",
    weak_model="anyscale/mistralai/Mixtral-8x7B-Instruct-v0.1",
)
response = client.chat.completions.create(
    model="router-mf-0.11593",
    messages=[
        {"role": "user", "content": "Hello!"}
    ]
)