6.1. LLM
6.1.1. Compressing Prompts With No Loss with llmlingua
Here is how to reduce the costs of working with LLMs.
When working with LLMs, we often encounter problems like exceeding token limits, forgetting context, or paying much more for usage than expected.
Researchers from Microsoft try to solve these problems with llmlingua.
llmlingua compresses your prompt by using a small trained LLM to detect and drop unimportant tokens.
They claim to achieve up to 20x compression with no or minimal performance loss.
I tried it out myself and noticed no performance loss at all, but I would be cautious in critical applications.
!pip install llmlingua
from llmlingua import PromptCompressor
prompt = "<YOUR_PROMPT>"
llm_lingua = PromptCompressor("lgaalves/gpt2-dolly")
compressed_prompt = llm_lingua.compress_prompt(prompt, instruction="", question="", target_token=200)
# {'compressed_prompt': 'are- that turns into formatting & with like "[]" best it.......',
# 'origin_tokens': 2430,
# 'compressed_tokens': 261,
# 'ratio': '9.3x',
# 'saving': 'Saving $0.1 in GPT-4.'}
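The compressed_prompt field of the returned dict is what you then send to your target model. A minimal sketch with the OpenAI client (the client setup and target model name are assumptions, not part of llmlingua):
from openai import OpenAI
client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment
response = client.chat.completions.create(
    model="gpt-4",  # hypothetical target model
    messages=[{"role": "user", "content": compressed_prompt["compressed_prompt"]}],
)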
6.1.2. One-Function Call to Any LLM with litellm
Do you want a one-function call to any LLM in Python? Try litellm.
litellm is a Python package to call any LLM in a consistent format and return a consistent output.
You only need to set the API key of the provider and the model name.
It also supports async calls and streaming the model's response.
!pip install litellm
import os
from litellm import completion
os.environ["OPENAI_API_KEY"] = "your-api-key"
os.environ["ANTHROPIC_API_KEY"] = "your-api-key"
os.environ['MISTRAL_API_KEY'] = "your-api-key"
messages = [{"content": "Hello, how are you?", "role": "user"}]
# OpenAI
response = completion(model="gpt-3.5-turbo", messages=messages)
# Anthropic
response = completion(model="claude-instant-1", messages=messages)
# Mistral
response = completion(model="mistral/mistral-tiny", messages=messages)
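The streaming and async support mentioned above is one keyword or function away. A minimal sketch reusing the messages from above (the chunk handling is an assumption):
# Streaming: pass stream=True and iterate over the chunks
response = completion(model="gpt-3.5-turbo", messages=messages, stream=True)
for chunk in response:
    print(chunk.choices[0].delta.content or "", end="")

# Async: await acompletion inside a coroutine
import asyncio
from litellm import acompletion

async def main():
    return await acompletion(model="gpt-3.5-turbo", messages=messages)

asyncio.run(main())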
6.1.3. Safeguard Your LLMs with LLMGuard
Safeguarding your LLMs against unwanted behavior is critical.
LLMGuard, a Python package, ensures safe interaction between the user and the LLM.
It checks prompts and outputs for:
Sensitive information like credit card numbers, which it sanitizes
Toxic or harmful language
Prompt injections
Use LLMGuard to make your LLMs safer.
!pip install llm-guard
from openai import OpenAI
from llm_guard import scan_output, scan_prompt
from llm_guard.input_scanners import Anonymize, PromptInjection, Toxicity
from llm_guard.vault import Vault
client = OpenAI(api_key="OPENAIKEY")
vault = Vault()
input_scanners = [Anonymize(vault), Toxicity(), PromptInjection()]
prompt = "Make an SQL insert statement to add a new user to our database. Name is John Doe. \
Email is test@test.com Phone number is 555-123-4567 and the IP address is 192.168.1.100. \
And credit card number is 4567-8901-2345-6789."
sanitized_prompt, results_valid, results_score = scan_prompt(input_scanners, prompt)
response = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": sanitized_prompt},
    ],
)
# Sanitized Prompt:
# Make an SQL insert statement to add a new user to our database.
# Name is [REDACTED_PERSON_1]. Email is [REDACTED_EMAIL_ADDRESS_1] Phone number is
# [REDACTED_PHONE_NUMBER_1] and the IP address is [REDACTED_IP_ADDRESS_1].
# And credit card number is [REDACTED_CREDIT_CARD_RE_1].
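Model outputs can be checked the same way with scan_output, which is already imported above. A minimal sketch using the library's output scanners (the scanner selection here is an assumption):
from llm_guard.output_scanners import Deanonymize, NoRefusal
output_scanners = [Deanonymize(vault), NoRefusal()]
response_text = response.choices[0].message.content
# Restores redacted entities from the vault and checks for refusals
sanitized_response, results_valid, results_score = scan_output(
    output_scanners, sanitized_prompt, response_text
)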
6.1.4. Evaluate LLMs with uptrain
Evaluating LLMs can be tricky.
Luckily, uptrain offers a neat library to do that.
uptrain is a Python library to evaluate LLMs with 20+ preconfigured checks.
These include the quality of the responses (completeness, validity, …), language proficiency (tonality, conciseness, …), and code hallucination.
It supports the biggest providers like OpenAI, Mistral, Claude, and Ollama.
!pip install uptrain
from uptrain import EvalLLM, Evals
import json
OPENAI_API_KEY = "*******"
data = [{
    'question': 'Which is the most popular global sport?',
    'context': "The popularity of sports can be measured in various ways, including TV viewership...",
    'response': 'Football is the most popular sport with around 4 billion followers worldwide'
}]
eval_llm = EvalLLM(openai_api_key=OPENAI_API_KEY)
results = eval_llm.evaluate(
    data=data,
    checks=[Evals.CONTEXT_RELEVANCE, Evals.FACTUAL_ACCURACY, Evals.RESPONSE_COMPLETENESS]
)
print(json.dumps(results, indent=3))
6.1.5. Embed Any Type of File
These days, everything is about embeddings and LLMs.
The Python library embed-anything makes it easy to generate embeddings from multiple sources like images, videos, or audio.
It's built in Rust, so it executes fast.
!pip install embed-anything
import numpy as np
import embed_anything
data = embed_anything.embed_file("filename.pdf", embeder="Bert")
embeddings = np.array([item.embedding for item in data])
data = embed_anything.embed_directory("test_files", embeder="Clip")
embeddings = np.array([item.embedding for item in data])
6.1.6. Structured LLM Output with outlines
Are you annoyed by the unstructured outputs of LLMs? Try outlines.
outlines is a Python library for structured generation with your LLMs.
There are multiple ways to enforce the output format of the model, like choices, type constraints, JSON output, and more.
!pip install outlines
import outlines
model = outlines.models.transformers("mistralai/Mistral-7B-Instruct-v0.2")
prompt = """You are a sentiment-labelling assistant.
Is the following review positive or negative?
Review: This restaurant is just awesome!
"""
generator = outlines.generate.choice(model, ["Positive", "Negative"])
answer = generator(prompt)
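The JSON output mentioned above works by passing a Pydantic model as the schema. A minimal sketch reusing the model and prompt from above (the schema itself is an assumption):
from pydantic import BaseModel

class Sentiment(BaseModel):
    label: str
    confidence: float

# Generation is constrained to valid JSON matching the schema
json_generator = outlines.generate.json(model, Sentiment)
result = json_generator(prompt)  # returns a validated Sentiment instance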
6.1.7. Evaluating RAG Pipelines with Ragas
How do you evaluate your RAG application?
Sure, you can manually look over your responses and check if they are what you want.
But that's not scalable.
Instead, use Ragas in Python.
Ragas is a library providing evaluation techniques and metrics for your RAG pipeline, like context precision/recall, faithfulness, and answer relevancy.
See below how easy it is to run Ragas.
!pip install ragas
from datasets import Dataset
import os
from ragas import evaluate
from ragas.metrics import faithfulness, answer_correctness
os.environ["OPENAI_API_KEY"] = "your-openai-key"
data_samples = {
    'question': ['When was the first super bowl?', 'Who won the most super bowls?'],
    'answer': ['The first superbowl was held on Jan 15, 1967', 'The most super bowls have been won by The New England Patriots'],
    'contexts': [['The First AFL–NFL World Championship Game was an American football game played on January 15, 1967, at the Los Angeles Memorial Coliseum in Los Angeles,'],
        ['The Green Bay Packers...Green Bay, Wisconsin.', 'The Packers compete...Football Conference']],
    'ground_truth': ['The first superbowl was held on January 15, 1967', 'The New England Patriots have won the Super Bowl a record six times']
}
dataset = Dataset.from_dict(data_samples)
score = evaluate(dataset, metrics=[faithfulness, answer_correctness])
score.to_pandas()
6.1.8. Unified Reranker API with rerankers
An essential element of your RAG system is reranking.
Reranking uses a model that outputs a relevance score for each retrieved document given the user query.
The rerankers library gives you a unified API for popular vendors and models such as Cohere, Jina, or T5.
The perfect API to easily test and swap between many methods.
!pip install rerankers
from rerankers import Reranker
ranker = Reranker("t5")
results = ranker.rank(query="I love you", docs=["I hate you", "I really like you"], doc_ids=[0,1])
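The returned object holds the ranked documents with their scores; for example, top_k gives you the best matches (a minimal sketch):
# Inspect the highest-scoring document
print(results.top_k(1))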
6.1.9. Create Embeddings on your CPU with fastembed
My favourite library for creating embeddings: fastembed, developed by Qdrant.
fastembed is a lightweight and fast library for using popular embedding models, without using your GPU.
It also integrates seamlessly with Qdrant's vector database.
I would like to see more supported models though, as fastembed has so much potential.
!pip install fastembed
from fastembed import TextEmbedding
documents = [
    "This is some",
    "example document",
]
embedding_model = TextEmbedding(model_name="jinaai/jina-embeddings-v2-small-en")
embeddings = list(embedding_model.embed(documents))
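The Qdrant integration mentioned above can even handle the embedding step for you. A minimal sketch, assuming qdrant-client is installed with its fastembed extra:
from qdrant_client import QdrantClient

qdrant = QdrantClient(":memory:")  # throwaway in-memory instance for testing
qdrant.add(collection_name="demo", documents=documents)  # embeds via fastembed
hits = qdrant.query(collection_name="demo", query_text="example document")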
6.1.10. Convert Files to Markdown & JSON with docling
Preparing your data for LLMs is a crucial step in RAG applications.
docling simplifies this step for you by converting popular document formats like PDF or PPT to Markdown or JSON.
It uses two models, a layout analysis model and a table structure recognition model, to process the files.
!pip install docling
from docling.document_converter import DocumentConverter
source = "https://arxiv.org/pdf/2408.09869"
converter = DocumentConverter()
result = converter.convert(source)
print(result.document.export_to_markdown())
# Output: "## Docling Technical Report[...]"
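The JSON side mentioned above works the same way; export_to_dict returns a serializable dict (a minimal sketch):
import json
print(json.dumps(result.document.export_to_dict())[:100])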
6.1.11. Simple Chunking Library with chonkie
Having a great chunking library without installing 500 MB of subdependencies has been my childhood dream.
Luckily, chonkie provides you with the most important chunking strategies.
Currently, it supports:
Token chunker
Word chunker
Sentence chunker
Semantic chunker
Semantic Double-Pass Merge chunker
!pip install chonkie
from chonkie import SemanticChunker
chunker = SemanticChunker(
    embedding_model="all-minilm-l6-v2",
    chunk_size=512,
    similarity_threshold=0.7
)
chunks = chunker.chunk("Some text with semantic meaning to chunk appropriately.")
for chunk in chunks:
    print(f"Chunk: {chunk.text}")
    print(f"Number of semantic sentences: {len(chunk.sentences)}")
6.1.12. Type-Safe Agentic AI with pydantic-ai
A type-safe Python agent framework was on my Christmas wishlist.
Luckily, PydanticAI came out just in time.
PydanticAI bridges the gap between LLMs and structured data validation.
Type safety is my favourite feature, and it makes structured responses and custom models easy to use.
Definitely something I have to dig much deeper into.
!pip install pydantic-ai
from pydantic_ai import Agent
agent = Agent(
    'gemini-1.5-flash',
    system_prompt='Be concise, reply with one sentence.',
)
result = agent.run_sync('Where does "hello world" come from?')
print(result.data)
"""
The first known use of "hello, world" was in a 1974 textbook about the C programming language.
"""
6.1.13. Convert Files into Markdown with markitdown
Preparing your data for LLMs can be hard.
Different file formats like PPT, Excel, or audio files need different preprocessing steps.
Luckily, with markitdown from Microsoft, this is easy.
markitdown converts various file formats to Markdown.
Currently, it supports:
PDF
PPT
Word
Excel
Images
Audio
HTML
JSON, XML
ZIP files
What I really like is that for image descriptions, you can use LLMs and even adjust the prompt for it.
Thanks to Jimi Vaubien for showing me this gem.
!pip install markitdown
from markitdown import MarkItDown
from openai import OpenAI
client = OpenAI()
md = MarkItDown(llm_client=client, llm_model="gpt-4o", llm_prompt="Describe the image without going into details like colors or shapes.")
result = md.convert("example.jpg")
print(result.text_content)
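For text-based formats, no LLM client is needed at all; a minimal sketch (the filename is hypothetical):
md_plain = MarkItDown()
print(md_plain.convert("report.xlsx").text_content)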
6.1.14. Route Queries Intelligently with RouteLLM
Using LLMs like o1/o3 or Claude Sonnet for every task is a waste of money.
For simple tasks, you can switch to a much simpler and more lightweight model to save on costs and compute time.
With RouteLLM, queries will be routed intelligently to the right LLM for the job.
RouteLLM acts as a traffic controller for your LLM workflows.
Instead of blindly sending all requests to the most expensive model, it dynamically routes queries based on:
Complexity: simple tasks go to smaller, cheaper models
Accuracy needs: critical tasks go to heavyweight models
And it’s open-source!
See below for a simple example:
We use GPT-4 as our strong model and Mixtral 8x7B as our weak model.
The router model router-mf-0.11593 is the default routing model and sufficient for basic query routing.
The threshold 0.11593 can be calibrated by yourself, e.g. so that ~30% of queries are routed to the strong model.
!pip install "routellm[serve,eval]"
import os
from routellm.controller import Controller
os.environ["OPENAI_API_KEY"] = "sk-XXXXXX"
os.environ["ANYSCALE_API_KEY"] = "esecret_XXXXXX"
client = Controller(
    routers=["mf"],
    strong_model="gpt-4-1106-preview",
    weak_model="anyscale/mistralai/Mixtral-8x7B-Instruct-v0.1",
)
response = client.chat.completions.create(
    model="router-mf-0.11593",
    messages=[
        {"role": "user", "content": "Hello!"}
    ]
)