Implementing Local RAG Service: Integrating Open WebUI, Ollama, and Qwen2.5

Introduction#

When building information retrieval and generative AI applications, the Retrieval-Augmented Generation (RAG) model is increasingly favored by developers for its powerful ability to retrieve relevant information from knowledge bases and generate accurate answers. However, implementing an end-to-end local RAG service requires not only a suitable model but also the integration of a robust user interface and an efficient inference framework.

Utilizing an easily deployable Docker approach can greatly simplify model management and service integration when constructing a local RAG service. Here, we rely on Open WebUI for the user interface and document-handling pipeline and on Ollama for model inference, introducing the bge-m3 embedding model to vectorize documents for retrieval and thereby help Qwen2.5 generate more precise answers.

In this article, we will discuss how to quickly launch Open WebUI via Docker, connect it to Ollama, and implement an efficient document retrieval and generation system in conjunction with the Qwen2.5 model.

Project Overview#

This project will use the following key tools:

  1. Open WebUI: Provides a web interface for user interaction with the model.
  2. Ollama: Used for managing embedding and large language model inference tasks. The bge-m3 model in Ollama will be used for document retrieval, while Qwen2.5 will be responsible for answer generation.
  3. Qwen2.5: The Qwen2.5 series of large language models launched by Alibaba, which provides the natural language generation component of the retrieval-augmented generation service.

To implement the RAG service, we need the following steps:

  1. Deploy Open WebUI as the user interaction interface.
  2. Configure Ollama to efficiently schedule the Qwen2.5 series models.
  3. Configure the bge-m3 embedding model in Ollama to vectorize documents for retrieval.

Deploying Open WebUI#

Open WebUI provides a streamlined Docker solution, allowing users to start the web interface without manually configuring numerous dependencies.

First, ensure that Docker is installed on the server. If not installed, you can quickly install it using the following command:
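
A common option is Docker's official convenience script (see the Docker documentation for a method suited to your distribution):

```bash
# Download and run Docker's official convenience install script
curl -fsSL https://get.docker.com -o get-docker.sh
sudo sh get-docker.sh
```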

Then create a directory to save Open WebUI's data, so that the data will not be lost after project updates:
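
For example, this article uses /DATA/open-webui as the data directory:

```bash
# Create a persistent data directory for Open WebUI
mkdir -p /DATA/open-webui
```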

Next, we can start Open WebUI with the following command:
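
Following the official Open WebUI Docker instructions, and using the data directory created above, the command looks roughly like this:

```bash
# Expose the web UI on host port 3000, persist data in /DATA/open-webui,
# and allow the container to reach the host's Ollama service via host.docker.internal
docker run -d \
  -p 3000:8080 \
  --add-host=host.docker.internal:host-gateway \
  -v /DATA/open-webui:/app/backend/data \
  --name open-webui \
  --restart always \
  ghcr.io/open-webui/open-webui:main
```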

If you want to run Open WebUI with Nvidia GPU support, you can use the following command:
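
Assuming the NVIDIA Container Toolkit is installed on the host, the CUDA-enabled image can be started in much the same way:

```bash
# Same as above, but with GPU access and the CUDA image
docker run -d \
  -p 3000:8080 \
  --gpus all \
  --add-host=host.docker.internal:host-gateway \
  -v /DATA/open-webui:/app/backend/data \
  --name open-webui \
  --restart always \
  ghcr.io/open-webui/open-webui:cuda
```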

Here, we expose the Open WebUI service on port 3000 of the machine, which can be accessed via a browser at http://localhost:3000 (for remote access, use the public IP and open port 3000). /DATA/open-webui is the data storage directory, and you can adjust this path as needed.

Of course, in addition to the Docker installation method, you can also install Open WebUI via pip, source compilation, Podman, etc. For more installation methods, please refer to the Open WebUI official documentation.

Basic Setup#

  1. Register an account and set a strong password.

Important

The first registered user will be automatically set as the system administrator, so please ensure you are the first registered user.

  2. Click on the avatar in the lower left corner and select the admin panel.
  3. Click on Settings in the panel.
  4. Disable new user registration (optional).
  5. Click Save in the lower right corner.

(Screenshot: disabling new user registration in the admin panel settings)

Configuring Ollama and Qwen2.5#

Deploying Ollama#

Install Ollama on the local server. Ollama currently offers a variety of installation methods; please refer to Ollama's official documentation to download and install the latest version 0.3.11 (Qwen2.5 is only supported starting from this version). Installation details can be found in a previous article I wrote: Ollama: From Beginner to Advanced.

Start the Ollama service (if started via Docker, this is not necessary, but port 11434 must be exposed):
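
```bash
# Start the Ollama server (listens on port 11434 by default)
ollama serve
```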

Once the Ollama service is started, you can connect to it by accessing http://localhost:11434.
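
For example, you can confirm the service is up and list the locally installed models via the REST API:

```bash
# Query the Ollama API for locally available models
curl http://localhost:11434/api/tags
```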

The Ollama Library provides semantic vector models (bge-m3) as well as various text generation models (including Qwen2.5). Next, we will configure Ollama to meet the project's needs for document retrieval and question-answer generation.

Downloading the Qwen2.5 Model#

To install Qwen2.5 via Ollama, you can directly run the ollama pull command in the command line to download the Qwen2.5 model. For example, to download the 72B model of Qwen2.5, you can use the following command:
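
```bash
# Pull the Qwen2.5 72B model from the Ollama library
ollama pull qwen2.5:72b
```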

This command will fetch the Qwen2.5 model from Ollama's model repository and prepare the runtime environment.

Qwen2.5 offers various model sizes, including 72B, 32B, 14B, 7B, 3B, 1.5B, and 0.5B. You can choose an appropriate model based on your needs and GPU memory. I am using a server with 4x V100, so I can directly choose the 72B model. If you need fast output and can accept a minor loss in quality, you can use the quantized version qwen2.5:72b-instruct-q4_0; if slower output is acceptable, qwen2.5:72b-instruct-q5_K_M preserves more quality. On the 4x V100 server, token generation with the q5_K_M model is noticeably sluggish, but I still chose it to test Qwen2.5's performance.

For personal computers with less memory, it is recommended to use the 14B or 7B models, which can be downloaded using the following commands:
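
```bash
# 14B model
ollama pull qwen2.5:14b
```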

or
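
```bash
# 7B model
ollama pull qwen2.5:7b
```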

If you have both Open WebUI and Ollama services running, you can also download the model from the admin panel.

(Screenshot: downloading models from the Open WebUI admin panel)

Downloading the bge-m3 Model#

Download the bge-m3 model in Ollama, which is used for document vectorization. Run the following command in the command line to download the model (or download it in the Open WebUI interface):
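
```bash
# Pull the bge-m3 embedding model used for document vectorization
ollama pull bge-m3
```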

At this point, we have completed the configuration of Ollama, and next we will configure the RAG service in Open WebUI.

RAG Integration and Configuration#

Configuring Ollama's RAG Interface in Open WebUI#

Accessing the Open WebUI Admin Interface#

After starting Open WebUI, you can directly access the service address via a web browser, log in to your admin account, and then enter the admin panel.

Setting the Ollama Interface#

In the admin panel of Open WebUI, click on Settings, and you will see options for external connections. Ensure that the Ollama API address is set to http://host.docker.internal:11434 (this hostname lets the Open WebUI container reach the Ollama service running on the host), then click the verify connection button on the right to confirm that the Ollama service is reachable.

(Screenshot: verifying the Ollama API connection in the admin panel)

Setting the Semantic Vector Model#

In the admin panel of Open WebUI, click on Settings, then click on Documents, and complete the following steps:

  1. Set the semantic vector model engine to Ollama.
  2. Set the semantic vector model to bge-m3:latest.
  3. The remaining settings can be kept at their defaults; here I set the maximum file upload size to 10 MB, the maximum number of uploads to 3, Top K to 5, chunk size and chunk overlap to 1500 and 100 respectively, and enabled PDF image processing.
  4. Click save in the lower right corner.

(Screenshot: semantic vector model settings in the Documents panel)

Testing the RAG Service#

Now you have a complete local RAG system. You can enter any natural language question in the main interface of Open WebUI and upload the corresponding document. The system will call the semantic vector model to vectorize the document, retrieve the most relevant chunks, and then use the Qwen2.5 model to generate an answer and return it to the user.

In the user chat interface of Open WebUI, upload the document you want to retrieve, then enter your question and click send. Open WebUI will call Ollama's bge-m3 model for document vectorization, and then call the Qwen2.5 model for question-answer generation.

Here, I uploaded a simple txt file (text generated by GPT) with the following content:

Then I asked three questions:

  1. What strange creature did Evan encounter in the forest?
  2. What was inscribed on the ancient stone tablet that Evan found in the cave?
  3. What treasure did Evan discover at the center of the altar?

The following image shows the answers:

(Screenshot: answers generated for the three questions)

Summary#

With the help of Open WebUI and Ollama, we can easily build an efficient and intuitive local RAG system. Using the bge-m3 semantic vector model for text vectorization, combined with the Qwen2.5 generation model, users can perform document retrieval and retrieval-augmented generation in a unified web interface. This not only protects data privacy but also significantly enhances the localization capabilities of generative AI applications.
