6.1. LLM#

6.1.1. Compressing Prompts With No Loss with llmlingua#

Here is how to reduce the costs of working with LLMs.

When working with LLMs, we often encounter problems like exceeding token limits, forgetting context, or paying much more for usage than expected.

Researchers from Microsoft try to solve these problems with llmlingua.

llmlingua compresses your prompt by using a small trained LLM to detect and remove unimportant tokens.

They claim to achieve up to 20x compression with no or minimal performance loss.

I tried it out myself and noticed no performance loss at all, but I would still be cautious for critical applications.

!pip install llmlingua

from llmlingua import PromptCompressor

prompt = "<YOUR_PROMPT>"

# Use a small model (here a GPT-2 variant fine-tuned on Dolly) to compress the prompt
llm_lingua = PromptCompressor("lgaalves/gpt2-dolly")

compressed_prompt = llm_lingua.compress_prompt(prompt, instruction="", question="", target_token=200)

# {'compressed_prompt': 'are- that turns into formatting & with like "[]" best it.......',
#  'origin_tokens': 2430,
#  'compressed_tokens': 261,
#  'ratio': '9.3x',
#  'saving': 'Saving $0.1 in GPT-4.'}
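
compress_prompt returns a dictionary, so the compressed text sits under the compressed_prompt key. Here is a minimal sketch of passing it on to the OpenAI client (API key and model name are placeholders):

from openai import OpenAI

client = OpenAI(api_key="<YOUR_OPENAI_KEY>")

# Send the compressed prompt instead of the original to save tokens
response = client.chat.completions.create(
    model="gpt-4",
    messages=[{"role": "user", "content": compressed_prompt["compressed_prompt"]}],
)
print(response.choices[0].message.content)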

6.1.2. One-Function Call to Any LLM with litellm#

Do you want a one-function call to any LLM in Python?

Try litellm.

litellm is a Python package to call any LLM with a consistent input format and get back a consistent output.

You only need to set the API key of the provider and the model name.

It also supports async calls and streaming the model's response (see the sketch after the example below).

!pip install litellm
import os
from litellm import completion

os.environ["OPENAI_API_KEY"] = "your-api-key" 
os.environ["ANTHROPIC_API_KEY"] = "your-api-key"
os.environ['MISTRAL_API_KEY'] = "your-api-key"

messages = [{ "content": "Hello, how are you?","role": "user"}]

# OpenAI
response = completion(model="gpt-3.5-turbo", messages=messages)

# Anthropic
response = completion(model="claude-instant-1", messages=messages)

# Mistral
response = completion(model="mistral/mistral-tiny", messages=messages)

6.1.3. Safeguard Your LLMs with LLMGuard#

Safeguarding your LLMs against unwanted behavior is critical.

LLMGuard is a Python package that ensures safe interactions between the user and the LLM.

It checks prompts and outputs for:

  • Sensitive information like credit card numbers, which it sanitizes

  • Toxic or harmful language

  • Prompt injections

Use LLMGuard to make your LLMs safer.

!pip install llm-guard
from openai import OpenAI
from llm_guard import scan_output, scan_prompt
from llm_guard.input_scanners import Anonymize, PromptInjection, Toxicity
from llm_guard.vault import Vault

client = OpenAI(api_key="OPENAIKEY")
vault = Vault()
input_scanners = [Anonymize(vault), Toxicity(), PromptInjection()]

prompt = "Make an SQL insert statement to add a new user to our database. Name is John Doe. \
Email is test@test.com Phone number is 555-123-4567 and the IP address is 192.168.1.100. \
And credit card number is 4567-8901-2345-6789."

sanitized_prompt, results_valid, results_score = scan_prompt(input_scanners, prompt)

response = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": sanitized_prompt},
    ],
)

# Sanitized Prompt: 
# Make an SQL insert statement to add a new user to our database. 
# Name is [REDACTED_PERSON_1]. Email is [REDACTED_EMAIL_ADDRESS_1] Phone number is 
# [REDACTED_PHONE_NUMBER_1] and the IP address is [REDACTED_IP_ADDRESS_1]. 
# And credit card number is [REDACTED_CREDIT_CARD_RE_1].
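
You can check the model's response the same way with scan_output; a minimal sketch using a few of the library's output scanners (the scanner selection here is illustrative):

from llm_guard.output_scanners import Deanonymize, NoRefusal, Relevance, Sensitive

output_scanners = [Deanonymize(vault), NoRefusal(), Relevance(), Sensitive()]

# Scan the LLM response before showing it to the user
response_text = response.choices[0].message.content
sanitized_response, results_valid, results_score = scan_output(
    output_scanners, sanitized_prompt, response_text
)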

6.1.4. Evaluate LLMs with uptrain#

Evaluating LLMs can be tricky.

Luckily, uptrain offers a neat library to do that.

uptrain is a Python library to evaluate LLMs with 20+ preconfigured checks.

This includes the quality of the responses (completeness, validity, …), language proficiency (tonality, conciseness, …), and code hallucination.

It supports the major providers and models, like OpenAI, Mistral, Claude, and Ollama.

!pip install uptrain
from uptrain import EvalLLM, Evals
import json

OPENAI_API_KEY = "*******"

data = [{
    'question': 'Which is the most popular global sport?',
    'context': "The popularity of sports can be measured in various ways, including TV viewership...",
    'response': 'Football is the most popular sport with around 4 billion followers worldwide'
}]

eval_llm = EvalLLM(openai_api_key=OPENAI_API_KEY)

results = eval_llm.evaluate(
    data=data,
    checks=[Evals.CONTEXT_RELEVANCE, Evals.FACTUAL_ACCURACY, Evals.RESPONSE_COMPLETENESS]
)

print(json.dumps(results, indent=3))

6.1.5. Embed Any Type of File#

These days, everything is about embeddings and LLMs.

The Python library embed-anything makes it easy to generate embeddings from multiple sources like images, videos, or audio.

It’s built in Rust so it executes fast.

!pip install embed-anything
import numpy as np
import embed_anything

# Embed a single file with a BERT-based text embedder
data = embed_anything.embed_file("filename.pdf", embeder="Bert")
embeddings = np.array([item.embedding for item in data])

# Embed a whole directory (e.g. of images) with CLIP
data = embed_anything.embed_directory("test_files", embeder="Clip")
embeddings = np.array([item.embedding for item in data])

6.1.6. Structured LLM Output with outlines#

Are you annoyed by the unstructured outputs of LLMs?

Try outlines.

outlines is a Python library for structured text generation with LLMs.

There are multiple ways to constrain the model's output, like multiple choices, type constraints, JSON output, and more.

!pip install outlines
import outlines

model = outlines.models.transformers("mistralai/Mistral-7B-Instruct-v0.2")

prompt = """You are a sentiment-labelling assistant.
Is the following review positive or negative?

Review: This restaurant is just awesome!
"""

generator = outlines.generate.choice(model, ["Positive", "Negative"])
answer = generator(prompt)
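
Constraining the output to JSON works similarly with outlines.generate.json; a minimal sketch (the Pydantic schema here is made up for illustration):

from pydantic import BaseModel

class Review(BaseModel):
    sentiment: str
    rating: int

# Generation is constrained to valid JSON matching the Review schema
json_generator = outlines.generate.json(model, Review)
review = json_generator("Review: This restaurant is just awesome! Extract the sentiment and a 1-5 rating.")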

6.1.7. Evaluating RAG Pipelines with Ragas#

How do you evaluate your RAG application?

Sure, you can manually look over your responses and check whether they are what you want.

But that's not scalable.

Instead, use Ragas in Python.

Ragas is a library providing evaluation techniques and metrics for your RAG pipeline, like context precision/recall, faithfulness, and answer relevancy.

See below how easy it is to run Ragas.

!pip install ragas

from datasets import Dataset 
import os
from ragas import evaluate
from ragas.metrics import faithfulness, answer_correctness

os.environ["OPENAI_API_KEY"] = "your-openai-key"

data_samples = {
    'question': ['When was the first super bowl?', 'Who won the most super bowls?'],
    'answer': ['The first superbowl was held on Jan 15, 1967', 'The most super bowls have been won by The New England Patriots'],
    'contexts' : [['The First AFL–NFL World Championship Game was an American football game played on January 15, 1967, at the Los Angeles Memorial Coliseum in Los Angeles,'], 
    ['The Green Bay Packers...Green Bay, Wisconsin.','The Packers compete...Football Conference']],
    'ground_truth': ['The first superbowl was held on January 15, 1967', 'The New England Patriots have won the Super Bowl a record six times']
}

dataset = Dataset.from_dict(data_samples)

score = evaluate(dataset, metrics=[faithfulness, answer_correctness])
score.to_pandas()

6.1.8. Unified Reranker API with rerankers#

An essential element of your RAG systems is reranking.

Reranking uses a reranking model that outputs a similarity score between the user query and each retrieved document.

The rerankers library gives you a unified API to use with popular vendors and models such as Cohere, Jina or T5.

The perfect API to easily test and swap between different reranking methods.

!pip install rerankers
from rerankers import Reranker

# Load a T5-based reranking model
ranker = Reranker("t5")

# Score each candidate document against the query
results = ranker.rank(query="I love you", docs=["I hate you", "I really like you"], doc_ids=[0, 1])
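
The returned RankedResults object can then be inspected; to my knowledge it exposes a top_k helper for the highest-scoring documents (a minimal sketch):

# Fetch the single best match for the query
best = results.top_k(1)
print(best)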