Private GPT
Models
Taiyi-CLIP-Roberta-102M-Chinese, a Chinese CLIP model
Tested it with images of power-grid equipment; the results were unsatisfactory.
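For reference, a minimal sketch of scoring one image against a Chinese query with this model, following the usage pattern on the model card: the Taiyi text encoder is paired with the OpenAI CLIP ViT-B/32 image encoder. The image path and query are illustrative; verify the exact classes against the model card.
import torch
from PIL import Image
from transformers import BertForSequenceClassification, BertTokenizer, CLIPModel, CLIPProcessor

text_tokenizer = BertTokenizer.from_pretrained("IDEA-CCNL/Taiyi-CLIP-Roberta-102M-Chinese")
text_encoder = BertForSequenceClassification.from_pretrained(
    "IDEA-CCNL/Taiyi-CLIP-Roberta-102M-Chinese").eval()
clip_model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

query = text_tokenizer(["戴着安全帽高空作业"], return_tensors='pt', padding=True)['input_ids']
image = processor(images=Image.open('data/images/test.jpg'), return_tensors='pt')

with torch.no_grad():
    text_features = text_encoder(query).logits
    image_features = clip_model.get_image_features(**image)
    # Normalize both sides, then use cosine similarity as the match score
    text_features = text_features / text_features.norm(dim=1, keepdim=True)
    image_features = image_features / image_features.norm(dim=1, keepdim=True)
    print((image_features @ text_features.t()).item())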
Downloading from Hugging Face
- The pipeline API
 - using pipelines with a local model
 - How to download hugging face sentiment-analysis pipeline to use it offline?
 - How to Download Hugging Face Sentiment-Analysis Pipeline for Offline Use
 - pipeline does not load from local folder, instead, it always downloads models from the internet.
 - Download files from the Hub
 - Download models for local loading
 - How to download Huggingface Transformers model?
 - Download Huggingface models
 - huggingface transformers预训练模型如何下载至本地,并使用?
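Summarizing the links above, a minimal sketch of downloading a model once and then loading it purely from disk. The model id is the default sentiment-analysis checkpoint; the local folder name is illustrative.
from huggingface_hub import snapshot_download
from transformers import pipeline

# Download the full repository once while online
snapshot_download(repo_id="distilbert-base-uncased-finetuned-sst-2-english",
                  local_dir="models/sst2")

# Later, load the pipeline from the local folder with no network access
classifier = pipeline("sentiment-analysis", model="models/sst2")
print(classifier("I love this!"))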
 
Qwen-7B
chat(…, stream=True)
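The note above refers to Qwen's streaming chat. A minimal sketch, assuming the custom chat code shipped with the Qwen/Qwen-7B-Chat checkpoint (trust_remote_code=True); depending on the checkpoint version, streaming is exposed as chat(..., stream=True) or as a separate chat_stream() method, so check the model's README.
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen-7B-Chat", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen-7B-Chat", device_map="auto", trust_remote_code=True
).eval()

# Non-streaming: returns the full reply and the updated history
response, history = model.chat(tokenizer, "你好", history=None)

# Streaming: yields incrementally longer partial responses
for partial in model.chat_stream(tokenizer, "你好", history=None):
    print(partial)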
Vector Databases
Chroma
The Chroma vector database depends on sqlite3 and requires sqlite3 >= 3.35.0.
The images were built on python:3.10 and python:3.10-slim, where the newest sqlite3 installable via apt install sqlite3 is 3.34.1, so the following error occurs.
    import chromadb
  File "/usr/local/lib/python3.10/site-packages/chromadb/__init__.py", line 69, in <module>
    raise RuntimeError(
RuntimeError: Your system has an unsupported version of sqlite3. Chroma requires sqlite3 >= 3.35.0.
Please visit https://docs.trychroma.com/troubleshooting#sqlite to learn how to upgrade.
Here sqlite3 is installed by compiling it from source.
wget https://www.sqlite.org/2023/sqlite-autoconf-3430000.tar.gz
tar -zxvf sqlite-autoconf-3430000.tar.gz
cd sqlite-autoconf-3430000
./configure --prefix=/usr/local
make install
make[1]: Entering directory '/sqlite-autoconf-3430000'
 /bin/mkdir -p '/usr/local/lib'
 /bin/bash ./libtool   --mode=install /usr/bin/install -c   libsqlite3.la '/usr/local/lib'
libtool: install: /usr/bin/install -c .libs/libsqlite3.so.0.8.6 /usr/local/lib/libsqlite3.so.0.8.6
libtool: install: (cd /usr/local/lib && { ln -s -f libsqlite3.so.0.8.6 libsqlite3.so.0 || { rm -f libsqlite3.so.0 && ln -s libsqlite3.so.0.8.6 libsqlite3.so.0; }; })
libtool: install: (cd /usr/local/lib && { ln -s -f libsqlite3.so.0.8.6 libsqlite3.so || { rm -f libsqlite3.so && ln -s libsqlite3.so.0.8.6 libsqlite3.so; }; })
libtool: install: /usr/bin/install -c .libs/libsqlite3.lai /usr/local/lib/libsqlite3.la
libtool: install: /usr/bin/install -c .libs/libsqlite3.a /usr/local/lib/libsqlite3.a
libtool: install: chmod 644 /usr/local/lib/libsqlite3.a
libtool: install: ranlib /usr/local/lib/libsqlite3.a
libtool: finish: PATH="/usr/local/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/sbin" ldconfig -n /usr/local/lib
----------------------------------------------------------------------
Libraries have been installed in:
   /usr/local/lib
If you ever happen to want to link against installed libraries
in a given directory, LIBDIR, you must either use libtool, and
specify the full pathname of the library, or use the '-LLIBDIR'
flag during linking and do at least one of the following:
   - add LIBDIR to the 'LD_LIBRARY_PATH' environment variable
     during execution
   - add LIBDIR to the 'LD_RUN_PATH' environment variable
     during linking
   - use the '-Wl,-rpath -Wl,LIBDIR' linker flag
   - have your system administrator add LIBDIR to '/etc/ld.so.conf'
See any operating system documentation about shared libraries for
more information, such as the ld(1) and ld.so(8) manual pages.
----------------------------------------------------------------------
 /bin/mkdir -p '/usr/local/bin'
  /bin/bash ./libtool   --mode=install /usr/bin/install -c sqlite3 '/usr/local/bin'
libtool: install: /usr/bin/install -c sqlite3 /usr/local/bin/sqlite3
 /bin/mkdir -p '/usr/local/include'
 /usr/bin/install -c -m 644 sqlite3.h sqlite3ext.h '/usr/local/include'
 /bin/mkdir -p '/usr/local/share/man/man1'
 /usr/bin/install -c -m 644 sqlite3.1 '/usr/local/share/man/man1'
 /bin/mkdir -p '/usr/local/lib/pkgconfig'
 /usr/bin/install -c -m 644 sqlite3.pc '/usr/local/lib/pkgconfig'
make[1]: Leaving directory '/sqlite-autoconf-3430000'
Then overwrite the system library with the newly built one; the path depends on the platform:

- x86

  cp /usr/local/lib/libsqlite3.so.0.8.6 /usr/lib/x86_64-linux-gnu/libsqlite3.so.0

- arm64

  cp /usr/local/lib/libsqlite3.so.0.8.6 /usr/lib/aarch64-linux-gnu/libsqlite3.so.0

- SQLite Download Page
 - How to Install SQLite3 from Source on Linux (With a Sample Database)
 - Load embedding from disk - Langchain Chroma DB
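Alternatively, the Chroma troubleshooting page documents a pure-Python workaround: pip install pysqlite3-binary and swap the module in before chromadb is imported. Note the prebuilt wheel may not be available for arm64, in which case compiling from source as above is still needed.
# Workaround from the Chroma troubleshooting docs: substitute the bundled
# pysqlite3 for the system sqlite3 module before importing chromadb
__import__('pysqlite3')
import sys
sys.modules['sqlite3'] = sys.modules.pop('pysqlite3')
import chromadb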
 
Milvus
Image Search
import os
import numpy as np
from pymilvus import FieldSchema, CollectionSchema, Collection, DataType, connections, utility
from app.models.search_image import SearchImageModel
from app.config import Config
config = Config()
model = SearchImageModel()
model.load(config)
def get_images(path):
    # Collect image filenames (not full paths) in the given directory
    image_paths = []
    ext_names = ['.png', '.jpg', '.jpeg']
    for filename in os.listdir(path):
        _, ext_name = os.path.splitext(filename.lower())
        if ext_name in ext_names:
            image_paths.append(filename)
    return image_paths
# Sanity check: a single image should embed into a 512-dimensional vector
image_path = 'data/images/20190128155421222575013.jpg'
image_features = model.get_image_features_with_path(image_path)
print('*' * 100, image_features.shape)
COLLECTION_NAME = 'PrivateGPTImage'  # Collection name
connections.connect(host='localhost', port=19530)
if utility.has_collection(COLLECTION_NAME):
    utility.drop_collection(COLLECTION_NAME)
fields = [
    FieldSchema(name='path', dtype=DataType.VARCHAR, description='Image path', is_primary=True, auto_id=False, max_length=1024),
    FieldSchema(name='embedding', dtype=DataType.FLOAT_VECTOR, description='Embedding vectors', dim=512)
]
schema = CollectionSchema(fields=fields, description='Image Collection')
collection = Collection(name=COLLECTION_NAME, schema=schema)
images = get_images('data/images')
for image in images:
    file_path = f'data/images/{image}'
    image_features = model.get_image_features_with_path(file_path)
    # L2-normalize so that L2 distance ranks results like cosine similarity
    image_features /= np.linalg.norm(image_features)
    # image_features = image_features / image_features.norm(dim=-1, keepdim=True)
    collection.insert([[file_path], image_features.numpy()])
index_params = {
    'index_type': 'IVF_FLAT',
    'metric_type': 'L2',
    'params': {'nlist': 512}
}
collection.create_index(field_name="embedding", index_params=index_params)
collection.load()
# Reconnect to the existing collection (e.g., from a separate search process)
collection = Collection(COLLECTION_NAME)
collection.load()
def search_image(text):
    # Embed the query text and L2-normalize it, matching the image vectors
    data = model.get_text_features(text)
    data /= np.linalg.norm(data)
    # data /= data.norm(dim=-1, keepdim=True)
    search_param = {
        "data": data.numpy(),
        "anns_field": "embedding",
        "param": {"metric_type": "L2", "offset": 1},
        "limit": 10,
        "output_fields": ["path"],
    }
    results = collection.search(**search_param)
    ret = []
    for hit in results[0]:
        # Collect the id, distance, and path of each hit
        ret.append([hit.id, hit.score, hit.entity.get('path')])
    return ret
results = search_image('Working at heights wearing a helmet') # 戴着安全帽高空作业
for result in results:
    print(result)
utility.drop_collection(COLLECTION_NAME)
Text Search
from langchain.vectorstores import Milvus
from langchain.text_splitter import CharacterTextSplitter
from langchain.document_loaders import TextLoader
EMBEDDING_MODEL_NAME='BAAI/bge-base-zh'
EMBEDDING_MODEL_CACHE_DIRECTORY='models/embeddings'
from langchain.embeddings import HuggingFaceEmbeddings
embeddings = HuggingFaceEmbeddings(model_name=EMBEDDING_MODEL_NAME,
                                   cache_folder=EMBEDDING_MODEL_CACHE_DIRECTORY)
vector_store = Milvus(embedding_function=embeddings,
                      collection_name="PrivateGPT",
                      connection_args={"host": 'localhost', "port": 19530},
                      drop_old=True)
loader = TextLoader('data/docs/test.txt')
documents = loader.load()
text_splitter = CharacterTextSplitter(chunk_size=200, chunk_overlap=0)
texts = text_splitter.split_documents(documents)
vector_store.add_documents(texts)
docs = vector_store.similarity_search("有多少张图片")
for i, doc in enumerate(docs):
    print(f'{i} - {len(doc.page_content)}', doc.page_content[:100])
- Milvus Docs
 - Question Answering Using Milvus and Hugging Face
 - Similarity Search with Milvus and OpenAI
 - Build a Milvus Powered Text-Image Search Engine in Minutes
 - Deep Dive into Text-Image Search Engine with Towhee
 - 基于Milvus的向量搜索实践(一)
 - 基于Milvus的向量搜索实践(二)
 - 基于Milvus的向量搜索实践(三)
 - 向量检索:如何取舍 Milvus 索引实现搜索优化?
 - 笔记︱几款多模态向量检索引擎:Faiss 、milvus、Proxima、vearch、Jina等
 - PyMilvus
 - A purposeful rendezvous with Milvus — the vector database
 - Install Milvus Standalone with Docker Compose (CPU)
 
Faiss
Duplicate File Detection
Redis
Run the Redis service.
docker run --name redis -it -p 6379:6379 -v $(pwd)/data/redis:/data redis redis-server --save 60 1
Install the redis Python package.
pip install redis
Test the connection.
import redis
from redis.exceptions import ConnectionError
try:
    r = redis.Redis(host='localhost', port=6379, decode_responses=True)
    r.ping()
    print('Connected!')
except ConnectionError as ex:
    print('Error:', ex)
    raise
r.set('hello', 'world')
r.get('hello')
Use Redis to store the MD5 hashes of images.
import os, hashlib

dir = 'images'
for filename in os.listdir(dir):
    file_path = f'{dir}/{filename}'
    # Hash the file contents; identical images share the same MD5
    with open(file_path, 'rb') as f:
        file_hash = hashlib.md5(f.read()).hexdigest()
    # Only record the first file seen with a given hash
    if not r.get(file_hash):
        r.set(file_hash, file_path)
        print(file_hash, file_path)
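A follow-up sketch in the spirit of the "removing them" link below, reusing r and dir from above: report (and optionally delete) later files whose hash is already registered under a different path.
# Variation: detect files whose hash is already registered under another
# path; deletion is left commented out for safety
for filename in os.listdir(dir):
    file_path = f'{dir}/{filename}'
    with open(file_path, 'rb') as f:
        file_hash = hashlib.md5(f.read()).hexdigest()
    existing = r.get(file_hash)
    if existing and existing != file_path:
        print('duplicate:', file_path, '->', existing)
        # os.remove(file_path)  # uncomment to delete the duplicate
    else:
        r.set(file_hash, file_path)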
- Redis
 - Redis Docker Hub
 - Finding duplicate files and removing them
 - Finding Duplicate Files with Python
 
Gunicorn
Running the service with multiple worker processes hit a race condition at startup.
Executing the following command produced an error.
gunicorn --worker-class uvicorn.workers.UvicornWorker --config app/gunicorn_conf.pyc app.main:app
Error message:
Traceback (most recent call last):
  File "/usr/local/bin/gunicorn", line 8, in <module>
    sys.exit(run())
  File "/usr/local/lib/python3.10/site-packages/gunicorn/app/wsgiapp.py", line 67, in run
    WSGIApplication("%(prog)s [OPTIONS] [APP_MODULE]").run()
  File "/usr/local/lib/python3.10/site-packages/gunicorn/app/base.py", line 236, in run
    super().run()
  File "/usr/local/lib/python3.10/site-packages/gunicorn/app/base.py", line 72, in run
    Arbiter(self).run()
  File "/usr/local/lib/python3.10/site-packages/gunicorn/arbiter.py", line 229, in run
    self.halt(reason=inst.reason, exit_status=inst.exit_status)
  File "/usr/local/lib/python3.10/site-packages/gunicorn/arbiter.py", line 342, in halt
    self.stop()
  File "/usr/local/lib/python3.10/site-packages/gunicorn/arbiter.py", line 396, in stop
    time.sleep(0.1)
  File "/usr/local/lib/python3.10/site-packages/gunicorn/arbiter.py", line 242, in handle_chld
    self.reap_workers()
  File "/usr/local/lib/python3.10/site-packages/gunicorn/arbiter.py", line 530, in reap_workers
    raise HaltServer(reason, self.WORKER_BOOT_ERROR)
gunicorn.errors.HaltServer: <HaltServer 'Worker failed to boot.' 3>
Adding the --preload flag resolves this problem.
gunicorn --worker-class uvicorn.workers.UvicornWorker --config app/gunicorn_conf.pyc --preload app.main:app
Running too many worker processes causes an out-of-memory error.
MAX_WORKERS=10 gunicorn --worker-class uvicorn.workers.UvicornWorker --config app/gunicorn_conf.pyc --preload app.main:app
Error message:
[2023-09-02 09:14:07 +0000] [1] [ERROR] Worker (pid:212) was sent SIGKILL! Perhaps out of memory?
Use the MAX_WORKERS environment variable to reduce the number of workers.
MAX_WORKERS=3 gunicorn --worker-class uvicorn.workers.UvicornWorker --config app/gunicorn_conf.pyc --preload app.main:app
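MAX_WORKERS is not a built-in Gunicorn setting; it only takes effect because the gunicorn_conf.py passed to --config consults it. A minimal sketch of what such a config might contain (names are illustrative; the tiangolo uvicorn-gunicorn base images follow this pattern):
import multiprocessing
import os

# Default: two workers per CPU core plus one
default_workers = multiprocessing.cpu_count() * 2 + 1
max_workers = os.getenv("MAX_WORKERS")
# Cap the worker count when MAX_WORKERS is set in the environment
workers = min(default_workers, int(max_workers)) if max_workers else default_workers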
Python
Converting a string to bool
>>> eval('True')
True
>>> eval('False')
False
Calling the bool function directly does not raise an error, but it gives the wrong result, because any non-empty string is truthy.
>>> bool('True')
True
>>> bool('False')
True
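Note that eval() executes arbitrary code; the standard library's ast.literal_eval performs the same conversion safely:
>>> import ast
>>> ast.literal_eval('True')
True
>>> ast.literal_eval('False')
False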
Gradio
Gallery
- Cannot drag and drop image from Gallery to Image
 - Specifying Gallery’s height causes unexpected display of images
 - Adjust width / height of image preview in the Image Component?
 
mount_gradio_app
- gradio/demo/custom_path/run.py
 - mount_gradio_app causing reload loop
 - Build a demo with Gradio
 - Gradio Controlling Layout
 - LoRA the Explorer
 - LoraTheExplorer/app.py
 - LoraTheExplorer/custom.css
 - Gradio tutorial (Build machine learning applications)
 - Gradio File
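A minimal sketch of mounting a Gradio interface inside a FastAPI app, following the linked custom_path demo:
import gradio as gr
from fastapi import FastAPI

app = FastAPI()

@app.get("/")
def read_main():
    return {"message": "This is your main app"}

# Serve the Gradio UI under /gradio of the existing FastAPI app
io = gr.Interface(lambda x: "Hello, " + x + "!", "textbox", "textbox")
app = gr.mount_gradio_app(app, io, path="/gradio")
# Run with: uvicorn run:app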
 
Shell
Finding the absolute path behind a symlink
ls -l /usr/lib/aarch64-linux-gnu/libsqlite3.so.0
lrwxrwxrwx 1 root root 19 Feb 24  2021 /usr/lib/aarch64-linux-gnu/libsqlite3.so.0 -> libsqlite3.so.0.8.6
readlink -f /usr/lib/aarch64-linux-gnu/libsqlite3.so.0
/usr/lib/aarch64-linux-gnu/libsqlite3.so.0.8.6
Building the Image
Dockerfile
FROM python:3.10-slim AS builder
# Build sqlite3 from source (the slim image ships without wget or a compiler toolchain)
RUN apt-get update \
    && apt-get install -y --no-install-recommends wget ca-certificates build-essential \
    && rm -rf /var/lib/apt/lists/*
RUN wget https://www.sqlite.org/2023/sqlite-autoconf-3430000.tar.gz \
    && tar -zxvf sqlite-autoconf-3430000.tar.gz \
    && cd sqlite-autoconf-3430000 \
    && ./configure --prefix=/usr/local \
    && make \
    && make install \
    && cd .. \
    && rm -rf sqlite-autoconf-3430000 \
    && rm -rf sqlite-autoconf-3430000.tar.gz

FROM python:3.10-slim
ARG SQLITE3_PATH
ENV APP_HOME=/private-gpt
WORKDIR ${APP_HOME}
# Copy the compiled sqlite3 library over the platform-specific system path
COPY --from=builder /usr/local/lib/libsqlite3.so.0.8.6 ${SQLITE3_PATH}
Script for building multi-platform images
build_image() {
    local dockerfile=$1
    local app_name=$2
    local platforms=($3)
    local platform_sqlite3_paths=($4)
    for ((i=0; i<${#platforms[@]}; ++i))
    do
        echo "🐳 Building $app_name:${platforms[i]}, Sqlite3 Path: ${platform_sqlite3_paths[i]}"
        docker buildx build --progress=plain --platform=linux/${platforms[i]} --rm -f $dockerfile \
            --build-arg SQLITE3_PATH=${platform_sqlite3_paths[i]} \
            -t wangjunjian/$app_name:${platforms[i]} "."
        echo "💯\n"
    done
}
APP_NAME=private-gpt
PLATFORMS=(amd64 arm64)
PLATFORM_SQLITE3_PATHS=(/usr/lib/x86_64-linux-gnu/libsqlite3.so.0 /usr/lib/aarch64-linux-gnu/libsqlite3.so.0)
build_image Dockerfile $APP_NAME "${PLATFORMS[*]}" "${PLATFORM_SQLITE3_PATHS[*]}"
Test the image
docker run --rm -it -p 8000:80 -e MAX_WORKERS=1 wangjunjian/private-gpt:arm64
Push the image
docker push wangjunjian/private-gpt:amd64
Pull the image
docker pull wangjunjian/private-gpt:amd64
Run the image
docker run -d --name private-gpt -p 8888:80 -v $(pwd)/storage:/private-gpt/storage -e MAX_WORKERS=1 wangjunjian/private-gpt:amd64
References
- privateGPT walkthrough: Creating your own offline GPT Q&A system
 - OpenAI CLIP
 - Hugging Face CLIP
 - How do I persist to disk a temporary file using Python?
 - Making Neural Search Queries Accessible to Everyone with Gradio — Deploying Haystack’s Semantic Document Search with Hugging Face models in Gradio in Three Easy Steps
 - Towhee
 - Visualize nearest neighbor search on reverse image search
 - Fine-Grained Image Similarity Detection Using Facebook AI Similarity Search(FAISS)
 - Building an image search engine with Python and Faiss
 - Fast and Simple Image Search with Foundation Models
 - 250+ Free Machine Learning Datasets for Instant Download
 - Image search with 🤗 datasets
 - FAISS (Facebook AI Similarity Search)
 - Deep Lake Docs
 - Weaviate
 - Weaviate GitHub
 - Milvus makes it easy to add similarity search to your applications
 - Milvus
 - 8 Best Vector Databases to Unleash the True Potential of AI
 - 12 Vector Databases For 2023: A Review
 - HuggingFaceEmbeddings
 - How to Use FAISS to Build Your First Similarity Search
 - LangChain Vector stores
 - Introduction to Facebook AI Similarity Search (Faiss)
 - Faiss: A library for efficient similarity search
 - DocArray
 - Welcome to DocArray!
 - Qwen-7B-Chat
 - Private GPT
 - 基于localGPT 和 streamlit 打造个人知识库问答机器人
 - gpt4-pdf-chatbot-langchain
 - Knowledge QA LLM
 - Knowledge-QA-LLM: 基于本地知识库+LLM的开源问答系统
 - 闻达:一个大规模语言模型调用平台
 - Building a FastAPI App with the Gradio Python Client
 - How to use your own data with Dolly
 - Using Langchain, Chroma, and GPT for document-based retrieval-augmented generation
 - face_recognition
 - Chat completions API
 - ImageSearcher/image_searcher/embedders/face_embedder.py
 - GPT best practices
 - Jina
 - PromptPerfect 专业一流的提示词工程开发工具
 - Implement unified text and image search with a CLIP model using Amazon SageMaker and Amazon OpenSearch Service
 - 中文CLIP模型开源
 - LangChain Tutorial in Python - Crash Course
 - Qwen-7B ReAct Prompting 示例
 - LangChain - 打造自己的GPT(五)拥有本地高效、安全的Sentence Embeddings For Chinese & English
 - 想自己利用OpenAI做一个文档问答的话
 - LangChain及LangFlow使用指南
 - Query Your Own Documents with LlamaIndex and LangChain
 - 分词 – 从源码解读LangChain-ChatGLM(二)
 - LlamaIndex Node Parser