
Building Confidential and Secure AI Solutions for Education, Engineering and Business

published: Dec. 31, 2024, 3:41 p.m.

After last week’s article about the Local AI Assistant, I found myself discussing AI opportunities with a friend. Christmas is a good time for friends and family, and he brought up a question about using ChatGPT and the security of his data.

We all know about ChatGPT. OpenAI also provides APIs that let you integrate chat functionality into your applications, for example as a chatbot. But my friend saw two problems:

  • Training Costs: To be useful, a chatbot should work with business data (for example, operating hours, prices, etc.), which means he would need to train it. Everyone knows that training large language models (LLMs) costs a lot of money.
  • Data Security: He was willing to use his proprietary data for his own chatbot but wasn’t comfortable with OpenAI, or any other company, incorporating that know-how into their large language models. AI providers typically need more data to train their models, and he doesn’t want his data ending up in someone else’s product.

At that moment, I realized that despite AI’s popularity, people who don’t work with AI technologies often don’t have enough information about how they can use it. That’s why, after answering his questions, I decided to publish my answers.

How to Teach AI to Use Your Data

Let’s say you have some kind of database - perhaps FAQs on your website, knowledge tests for students, technical support data, or engineering data like standards. If you decide to train a model using this data, you’ll face three problems:

  • Time and Money: Training can be expensive.
  • Difficult Data Updates: Once you’ve trained your model, you can’t easily add to or change its knowledge without retraining it. For example, you can’t correct an answer if you find out it was wrong (my friend hadn’t considered this).
  • Rapid AI Evolution: AI is progressing at a very fast pace. If you decide to switch to a more efficient model next month, you’ll have to retrain it.

However, you don’t actually need to train the model! You can use RAG (Retrieval-Augmented Generation) technology instead. With this approach, you don’t store your knowledge inside the LLM.

Instead, you:

  1. Store your data - FAQ entries (questions + answers), for example - in a database that supports semantic search (often called a “vector database”).
  2. When a user asks a question, you use an embedding model to convert the question into an embedding (a numerical vector).
  3. Perform a semantic search in your vector database to find the most relevant FAQ entries.
  4. Provide these relevant entries as context to the LLM.
  5. The LLM uses that context to formulate an answer.

The advantages of this approach are:

  • Scalability: You can update your information at any moment by simply adding new questions and answers to the database, or by changing existing ones.
  • Accuracy: You reduce hallucinations by grounding the model’s answers in your business data.
  • Flexibility: You can easily swap out the model you use (it might be a local model or any LLM provider - OpenAI, Anthropic, or whichever you prefer); see the short sketch after this list.
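Here is a minimal sketch of that flexibility, assuming a Hugging Face transformers pipeline (the GENERATION_MODEL setting and model names are just illustrations): because retrieval is kept separate from generation, switching models is a one-line configuration change, while the vector database and search code stay untouched.

from transformers import pipeline

# Hypothetical configuration value - could come from an env var or a settings file.
GENERATION_MODEL = "google/flan-t5-base"  # swap to "google/flan-t5-large" or another model

generator = pipeline("text2text-generation", model=GENERATION_MODEL)

def answer_with_context(question: str, context: str) -> str:
    # Only the generation step changes when you switch models;
    # the retrieval layer (vector database + semantic search) is unaffected.
    prompt = f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"
    return generator(prompt, max_new_tokens=128)[0]["generated_text"]

# Example call (the context string would normally come from your vector search):
print(answer_with_context(
    "Are you open on Monday?",
    "Our working hours are 9 AM to 5 PM, Monday to Friday."))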

RAG technology answers my friend’s first question about how to teach AI to use his data.

Security: Hosted vs. Local Models

My friend’s second question was about data security, which can be either simple or tricky, depending on your viewpoint. You have two main options:

  • Sign an Agreement with an LLM Provider: You trust they’ll handle your data responsibly and avoid leaks.
  • Run a Local LLM on Your Own Server: You have full control over your data, minimizing the risk of it being shared.

At first glance, using an LLM provider seems preferable. They have modern hardware, ample computing power, and can run huge models that deliver top-tier results. However, for many real business tasks, like answering 5,000-10,000 questions, your model doesn’t need to know everything about the world. It just needs to understand the question and generate a proper answer.

This means it’s possible to use a smaller model that doesn’t require a lot of resources. Even a model with 1-3 billion parameters, which you can run on a laptop, should work pretty well, while models with 7B or 12B parameters can perform extremely well on a server. Models with 22B or 70B parameters will probably be overkill for many tasks. Of course, if you plan to run a model locally for a real business workload, you’ll need to estimate how many requests you’ll get per hour and allocate server capacity accordingly (a rough calculation is sketched below). You also have the option to rent servers in the cloud (e.g., AWS), which can still feel more secure than sharing your data with an LLM provider.
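As a back-of-the-envelope illustration (all numbers below are assumptions, not measurements - benchmark your own model before sizing hardware):

# Illustrative capacity estimate - replace the assumed numbers with your own measurements.
requests_per_hour = 600        # assumed peak load
avg_response_tokens = 150      # assumed average answer length
tokens_per_second = 40         # assumed throughput of a mid-size model on one GPU

seconds_per_request = avg_response_tokens / tokens_per_second    # ~3.75 s per answer
busy_seconds_per_hour = requests_per_hour * seconds_per_request  # ~2250 s of compute
utilization = busy_seconds_per_hour / 3600                       # fraction of one instance

print(f"Single-instance utilization: {utilization:.0%}")
# Under these assumptions one instance handles the load with some headroom;
# at roughly double the request rate you would add a second instance or pick a smaller model.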

Example: Retrieval-Augmented Generation (RAG) with FAISS and a Small LLM

Below is a piece of Python code illustrating how the RAG approach might be implemented. It’s a simple example to demonstrate the idea; don’t copy it directly into real solutions.


#!/usr/bin/env python
# coding: utf-8

"""
Example: Retrieval-Augmented Generation (RAG) with FAISS and a Small LLM

This code snippet illustrates:
1. How to embed a small Q&A dataset (e.g., FAQs) using a BERT-like model.
2. How to index and query those embeddings with FAISS for semantic search.
3. How to use a small LLM (T5 in this case) to generate a final answer
   based on the retrieved context and an additional note.

Dependencies:
- transformers
- torch
- faiss (install via the faiss-cpu or faiss-gpu package)
- numpy

Important:
This is a simplified example. In a production setting, you'll want to handle
larger datasets, more robust error checking, and optimize for speed.
"""

import torch
from transformers import AutoTokenizer, AutoModel, AutoModelForSeq2SeqLM
import numpy as np
import faiss

# -----------------------------------------------------
# Step 1: Load a Transformer Model and Tokenizer
# -----------------------------------------------------
model_name = "distilbert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)

# -----------------------------------------------------
# Step 2: Prepare a Small Q&A Dataset
# -----------------------------------------------------
# Example data (could be from Education, Engineering, or Business)
business_questions = [
    "What are your working hours?",
    "Where are you located?",
]
business_answers = [
    "Our working hours are 9 AM to 5 PM from Monday to Friday.",
    "Our address is 123456 Main Street, SE, Calgary, Alberta, Canada.",
]
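
# In a real deployment, these lists would come from your own FAQ database
# or an exported file rather than being hard-coded.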

# -----------------------------------------------------
# Step 3: Encode the Questions into Embeddings
# -----------------------------------------------------
data_embeddings = []
for question in business_questions:
    inputs = tokenizer(
        question, 
        return_tensors="pt", 
        truncation=True, 
        padding=True, 
        max_length=128)

    with torch.no_grad():
        # Use mean pooling of the last hidden state as the embedding
        embedding = model(**inputs).last_hidden_state.mean(dim=1)
    data_embeddings.append(embedding.squeeze().numpy())

data_embeddings = np.array(data_embeddings).astype("float32")  # FAISS expects float32 vectors

# -----------------------------------------------------
# Step 4: Create a FAISS Index for Semantic Search
# -----------------------------------------------------
dimension = data_embeddings.shape[1]  # e.g., 768 for DistilBERT
index = faiss.IndexFlatL2(dimension)  # L2 distance
index.add(data_embeddings)

# Store the original questions/answers for retrieval
stored_datas = [
    {"question": q, "answer": a}
    for q, a in zip(business_questions, business_answers)
]

# (Optional) Save the index for later reuse
faiss.write_index(index, "data_index.faiss")
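# If another process reloads this index, persist stored_datas as well (e.g., as JSON)
# so the integer IDs returned by FAISS can be mapped back to questions and answers.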

# -----------------------------------------------------
# Step 5: Define a Function to Query the FAISS Index
# -----------------------------------------------------
def query_data(query, top_k=3):
    """
    Encode the query into an embedding, search the FAISS index,
    and return the top-k matched Q&A pairs.
    """
    # Convert query to embedding
    inputs = tokenizer(
        query,
        return_tensors="pt",
        truncation=True,
        padding=True,
        max_length=128)

    with torch.no_grad():
        query_embedding = model(**inputs).last_hidden_state.mean(dim=1).squeeze().numpy()

    # Search for top_k matches (capped at the number of stored entries,
    # otherwise FAISS pads the results with -1 indices)
    distances, indices = index.search(np.array([query_embedding]), k=min(top_k, index.ntotal))

    # Retrieve and rank results
    results = []
    for i in range(len(indices[0])):
        idx = indices[0][i]
        results.append({
            "question": stored_datas[idx]["question"],
            "answer": stored_datas[idx]["answer"],
            "distance": distances[0][i]
        })

    # FAISS already returns results sorted by distance; this sort is only a
    # safeguard in case you add custom ranking logic later
    results.sort(key=lambda x: x["distance"])
    return results

# -----------------------------------------------------
# Step 6: Load a Small LLM for Response Generation
# -----------------------------------------------------
response_model_name = "google/flan-t5-large"
response_tokenizer = AutoTokenizer.from_pretrained(response_model_name)
response_model = AutoModelForSeq2SeqLM.from_pretrained(response_model_name)

# -----------------------------------------------------
# Step 7: Generate a Context-Aware Response
# -----------------------------------------------------
def generate_response_with_llm(user_query, top_results):
    """
    Use a small, modern LLM to generate a conversational response
    based on the user query and the retrieved knowledge.
    """

    # Additional domain/business-specific info
    context_note = (
        "Additional note: You represent PhotoInPrint, a service providing "
        "professional Fine Art Photography Printing using high-quality papers "
        "like Ilford, Hahnemühle, and Red River. Orders can be placed online "
        "from anywhere, including the United States. No photobooks or acrylic.\n\n"
    )

    # Construct context from top retrieved results
    context = context_note
    context += "Analyze the following information to answer the user's question:\n"
    for i, result in enumerate(top_results, 1):
        context += f"{i}. Question: {result['question']} - Answer: {result['answer']}\n"

    # Instruction for the model
    context += (
        f"\nYour task: Using only the answers above and the additional note, "
        f"generate a clear, detailed, and helpful response to the user's question: "
        f"\"{user_query}\".\n"
        "Combine relevant answers where appropriate to create a smooth and "
        "comprehensive response. Do not use information not explicitly provided. "
        "If no relevant information is found, politely inform the user and ask for "
        "clarification.\n"
    )

    # Tokenize the input context
    inputs = response_tokenizer(
        context,
        return_tensors="pt",
        truncation=True,
        max_length=512)

    # Generate the response
    outputs = response_model.generate(
        **inputs,
        max_length=140,
        temperature=0.7,
        top_p=0.9,
        do_sample=True
    )

    # Decode and return the response text
    response = response_tokenizer.decode(outputs[0], skip_special_tokens=True).strip()
    return response

# -----------------------------------------------------
# Step 8: Example Usage
# -----------------------------------------------------
if __name__ == "__main__":
    user_query = "Are you working at 2pm on Monday?"
    top_results = query_data(user_query, top_k=3)
    final_response = generate_response_with_llm(user_query, top_results)
    print("User Query:", user_query)
    print("Generated Response:\n", final_response)

How It Works:

  1. Embedding: We use a BERT-like model (DistilBERT) to convert each question in our knowledge base into a vector embedding.
  2. Indexing: We store these embeddings in a FAISS index for quick, semantic search.
  3. Retrieval: When a user submits a query, we embed that query and retrieve the most relevant database entries from FAISS.
  4. Response Generation: We then feed the user query and retrieved information into a small T5 model (google/flan-t5-large) to produce a final answer referencing only the provided context.

RAG-based solutions, paired with local or controlled cloud LLMs, offer a secure, cost-effective, and flexible way to infuse AI into various domains—without the headaches of ongoing retraining or the risks of data exposure.

In real-life AI deployments, you do need some engineering effort, especially for setting up a vector database, connecting it to your knowledge base, and optimizing runtime. However, it’s still feasible even for relatively small businesses.
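
As one small piece of that engineering effort, here is a hedged sketch of persisting the FAISS index together with its matching Q&A entries so a separate chatbot process can reload them later. It builds on the index and stored_datas objects from the example above; the file names are arbitrary.

import json

import faiss

# Persist the index and the entries it points to - the integer IDs returned
# by FAISS are positions in the stored_datas list, so both must stay in sync.
faiss.write_index(index, "data_index.faiss")
with open("data_entries.json", "w", encoding="utf-8") as f:
    json.dump(stored_datas, f, ensure_ascii=False, indent=2)

# Later, in the service that answers user questions:
index = faiss.read_index("data_index.faiss")
with open("data_entries.json", encoding="utf-8") as f:
    stored_datas = json.load(f)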

To summarize:

The Retrieval-Augmented Generation approach gives you the flexibility to switch from one LLM to another, reduces costs, keeps business knowledge in a simple database, and allows you to update that knowledge at any time. This is an easy and cost-effective way for Education, Engineering and Business to integrate AI into day-to-day operations and solutions.