Build Your Own Search Engine: Configuring searx

By 墨香-15607781945 · Published August 13, 2022 · Updated October 26, 2022

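Every time I search with Baidu, the first few results are always ads, and some are even fake "official sites" or fake "hospitals". I have to check each result for the "ad" label before clicking, so I end up wasting a lot of time without finding anything useful. That made me want to run my own search engine with no ads and no tracking. Writing one from scratch is out of the question, but fortunately I found a metasearch tool on GitHub: SearX.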

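Searx is a free internet metasearch engine that aggregates results from more than 70 search services. Users are neither tracked nor profiled. Searx can also be used over Tor for online anonymity.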

List of public instances https://searx.space/

Official settings documentation https://searx.github.io/searx/admin/settings.html
Official administrator documentation https://searx.github.io/searx/admin/
Official website https://searx.github.io/searx/index.html
Project dashboard (Open Hub) https://www.openhub.net/p/searx
Official source repository https://github.com/searx/searx
Official issue tracker https://github.com/searx/searx/issues
Official wiki https://github.com/searx/searx/wiki

Mijisou (秘迹搜索) engine sources https://github.com/entropage/mijisou/tree/devel/searx/engines
Mijisou source repository https://github.com/entropage/mijisou
Mijisou http://mijisou.com/

Local test instance http://192.168.1.194:8888/

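How to add a custom search engine

1. Add a .py file to the engines directory (searx/engines) and implement two functions: request, which builds the request parameters, and response, which formats the returned results. Sending the HTTP request itself is handled by searx. Each search service (for example Baidu, Google, or Bing) is integrated this way. The template is as follows:

Tutorial source (Jianshu): https://www.jianshu.com/p/7ea4cf05e589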

  
categories = ['general']  # optional


def request(query, params):
    '''pre-request callback
    params<dict>:
      method  : POST/GET
      headers : {}
      data    : {} # if method == POST
      url     : ''
      category: 'search category'
      pageno  : 1 # number of the requested page
    '''

    params['url'] = 'https://host/%s' % query

    return params


def response(resp):
    '''post-response callback
    resp: requests response object
    '''
    return [{'url': '', 'title': '', 'content': ''}]

2. Add the engine's configuration to the settings.yml file.
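
To make the two callbacks from step 1 more concrete, here is a minimal sketch of an engine module. It assumes a hypothetical JSON search API at https://search.example.com/api that returns a "results" list whose items have "link", "title", and "snippet" fields; the endpoint and the field names are illustrative assumptions, not a real service.

```python
import json
from urllib.parse import urlencode

# Hypothetical JSON search API; replace with the real service's endpoint.
base_url = 'https://search.example.com/api'
categories = ['general']


def request(query, params):
    """Fill in the outgoing request URL for one query and page."""
    args = urlencode({'q': query, 'page': params.get('pageno', 1)})
    params['url'] = base_url + '?' + args
    return params


def response(resp):
    """Convert the (assumed) JSON payload into url/title/content dicts."""
    results = []
    for item in json.loads(resp.text).get('results', []):
        results.append({
            'url': item['link'],
            'title': item['title'],
            'content': item.get('snippet', ''),  # snippet is optional here
        })
    return results
```

The query is URL-encoded instead of being interpolated with %s, so spaces and non-ASCII search terms produce a valid URL.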

Source code analysis

1. Fetching search results with multiple threads

    def search_multiple_requests(self, requests):
        search_id = uuid4().__str__()

        for engine_name, query, request_params in requests:
            th = threading.Thread(
                target=PROCESSORS[engine_name].search,
                args=(query, request_params, self.result_container, self.start_time, self.actual_timeout, engine_name),
                name=search_id,
            )
            th._timeout = False
            th._engine_name = engine_name
            th.start()
        for th in threading.enumerate():
            if th.name == search_id:
                remaining_time = max(0.0, self.actual_timeout - (time() - self.start_time))
                th.join(remaining_time)
                if th.is_alive():
                    th._timeout = True
                    self.result_container.add_unresponsive_engine(th._engine_name, 'timeout')
                    logger.warning('engine timeout: {0}'.format(th._engine_name))
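
The pattern above (start one thread per engine, then join each thread with whatever time is left before a shared deadline) can be tried outside searx. The following is a simplified sketch under stated assumptions: run_with_deadline and its (name, function) task tuples are hypothetical names, and it tracks its own list of threads rather than filtering threading.enumerate() by name as the excerpt does.

```python
import threading
import time
from uuid import uuid4


def run_with_deadline(tasks, timeout):
    """Run each (name, func) task in its own thread until a shared deadline.

    Returns (results, timed_out): results maps name -> return value for
    tasks that finished in time; timed_out lists the names still running.
    """
    batch_id = str(uuid4())
    start = time.time()
    results = {}
    lock = threading.Lock()

    def worker(name, func):
        value = func()
        with lock:
            results[name] = value

    threads = []
    for name, func in tasks:
        th = threading.Thread(target=worker, args=(name, func),
                              name=batch_id, daemon=True)
        th.task_name = name
        th.start()
        threads.append(th)

    timed_out = []
    for th in threads:
        # Every join shares one deadline, so the total wait never exceeds timeout.
        remaining = max(0.0, timeout - (time.time() - start))
        th.join(remaining)
        if th.is_alive():
            timed_out.append(th.task_name)

    with lock:
        # Copy, so a task finishing late cannot change what the caller sees.
        return dict(results), timed_out
```

Because the remaining time is recomputed before each join, a slow first task uses up the budget for all later joins, which is what keeps the overall wait bounded by the timeout.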

2. Loading a search engine

def load_engine(engine_data):
    engine_name = engine_data['name']
    if '_' in engine_name:
        logger.error('Engine name contains underscore: "{}"'.format(engine_name))
        sys.exit(1)

    if engine_name.lower() != engine_name:
        logger.warn('Engine name is not lowercase: "{}", converting to lowercase'.format(engine_name))
        engine_name = engine_name.lower()
        engine_data['name'] = engine_name

    engine_module = engine_data['engine']

    try:
        # load the engine module
        engine = load_module(engine_module + '.py', engine_dir)
    except (SyntaxError, KeyboardInterrupt, SystemExit, SystemError, ImportError, RuntimeError):
        logger.exception('Fatal exception in engine "{}"'.format(engine_module))
        sys.exit(1)
    except:
        logger.exception('Cannot load engine "{}"'.format(engine_module))
        return None

3. load_module, the function that loads an engine (a useful reference for plugin-style requirements)


def load_module(filename, module_dir):
    modname = splitext(filename)[0]
    if modname in sys.modules:
        del sys.modules[modname]
    filepath = join(module_dir, filename)
    # see https://docs.python.org/3/library/importlib.html#importing-a-source-file-directly
    spec = importlib.util.spec_from_file_location(modname, filepath)
    module = importlib.util.module_from_spec(spec)
    sys.modules[modname] = module
    spec.loader.exec_module(module)
    return module
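
As a way to see this loader at work outside searx, the sketch below repeats the same importlib calls and adds a small demo that writes a throwaway plugin file and loads it. The plugin name hello_plugin and its greet function are made up for illustration.

```python
import importlib.util
import sys
import tempfile
from os.path import join, splitext


def load_module(filename, module_dir):
    """Import filename from module_dir as a fresh module, replacing any cached copy."""
    modname = splitext(filename)[0]
    sys.modules.pop(modname, None)  # drop a previously loaded copy, if any
    spec = importlib.util.spec_from_file_location(modname, join(module_dir, filename))
    module = importlib.util.module_from_spec(spec)
    sys.modules[modname] = module
    spec.loader.exec_module(module)
    return module


def demo():
    """Write a throwaway plugin file, load it by path, and call into it."""
    source = "categories = ['general']\n\ndef greet(name):\n    return 'hello ' + name\n"
    with tempfile.TemporaryDirectory() as plugin_dir:
        with open(join(plugin_dir, 'hello_plugin.py'), 'w', encoding='utf-8') as fh:
            fh.write(source)
        plugin = load_module('hello_plugin.py', plugin_dir)
        return plugin.categories, plugin.greet('searx')
```

Loading by file path means the plugin directory does not have to be on sys.path, which is what makes this approach convenient for plugin folders.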

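4. Ideas worth borrowing
Every engine processor must inherit from this class and implement the search method; search is the method that gets called when an engine is queried.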

class EngineProcessor(ABC):
    @abstractmethod
    def search(self, query, params, result_container, start_time, timeout_limit):
        pass
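
A short sketch of what the abstract base class enforces: a subclass that implements search can be instantiated, while one that does not raises TypeError at construction time. EchoProcessor and BrokenProcessor are hypothetical classes, and a plain list stands in for the result container purely for illustration.

```python
from abc import ABC, abstractmethod


class EngineProcessor(ABC):
    @abstractmethod
    def search(self, query, params, result_container, start_time, timeout_limit):
        """Run one query and add its results to result_container."""


class EchoProcessor(EngineProcessor):
    """Hypothetical processor that echoes the query back as a single result."""

    def search(self, query, params, result_container, start_time, timeout_limit):
        # A plain list stands in for the real result container here.
        result_container.append({
            'url': 'https://example.com/?q=' + query,
            'title': query,
            'content': '',
        })


class BrokenProcessor(EngineProcessor):
    """Does not implement search, so instantiating it raises TypeError."""
```

The benefit is that a missing search method fails when the processor is created, not later in the middle of a user's query.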

start_new_thread(gc.collect, tuple()) runs garbage collection in a new thread to guard against memory leaks.

    def search_standard(self):
        """
        Update self.result_container, self.actual_timeout
        """
        requests, self.actual_timeout = self._get_requests()
        print(f"zsq multithreaded search start {time()}")
        # send all search-request
        if requests:
            self.search_multiple_requests(requests)
            # start a new thread to run garbage collection
            start_new_thread(gc.collect, tuple())
        print(f"multithreaded search end {time()}")
        # return results, suggestions, answers and infoboxes
        return True

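First-time Searx setup in detail

The following is an installation guide for Debian/Ubuntu using virtualenv. On Ubuntu, make sure the universe repository is enabled.

Commands commonly used to start searx once it is configured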
sudo service nginx restart
sudo service uwsgi restart
cd /usr/local/searx
sudo python3 searx/webapp.py

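Searx is a very good search engine, so I set up my own instance. At first I followed other people's steps and it did not work, because I had forgotten to configure the firewall.
Project: https://github.com/asciimoo/searx
(the repository includes setup notes)
Environment: Ubuntu 16 to 19
Install Git first: apt install git
Fetch the source: git clone https://github.com/asciimoo/searx.git
Install dependencies: cd searx to enter the install directory
Then run: ./manage.sh update_packages
When that finishes, edit the configuration file: vi searx/settings.yml
(Only these three settings need attention:
port: the listening port, 8888 by default; change it if you like.
bind_address: the listening address, 127.0.0.1 by default; change it to 0.0.0.0 so the instance is reachable from outside.
secret_key: the encryption key; set your own value, which can be generated in an SSH session with openssl rand -hex 16.)
Everything up to here is simple. Next, check whether your system ships with screen and install it if not; Ubuntu 19 includes it.
Install screen: apt install screen -y (skip this if it is already present; check the system's output)
That is basically it, and searx can now be started.
Run searx: screen -dmS searx python searx/webapp.py
(From the searx install directory you can also run python searx/webapp.py directly for a temporary session; the program stops when the SSH session is closed.)
If all went well, the instance is now reachable. If it is not, the firewall is probably not configured correctly.
(At first I could not reach mine at all. Others had installed it the same way and I had confirmed the earlier steps were fine, so I figured it had to be the firewall.)
Check the firewall status: systemctl status firewalld
(dead means the firewall is not running; running means it is.)
Start the firewall: systemctl start firewalld
(no output means it started successfully)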

Basic installation: a guide for Debian/Ubuntu with virtualenv

Install the required packages:
sudo apt-get install git build-essential libxslt-dev python-dev python-virtualenv python-babel zlib1g-dev libffi-dev libssl-dev

Install searx:
cd /usr/local
sudo git clone https://github.com/asciimoo/searx.git
sudo useradd searx -d /usr/local/searx
sudo chown searx:searx -R /usr/local/searx

Install the dependencies in a virtualenv:
sudo -u searx -i
cd /usr/local/searx
virtualenv searx-ve
. ./searx-ve/bin/activate
./manage.sh update_packages

Configuration

sed -i -e "s/ultrasecretkey/`openssl rand -hex 16`/g" searx/settings.yml
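
For readers who prefer to script this step in Python, here is a rough equivalent of the sed command above. The set_secret_key helper is a hypothetical name; secrets.token_hex(16) yields 32 hexadecimal characters, the same format that openssl rand -hex 16 produces.

```python
import secrets


def set_secret_key(settings_text, placeholder='ultrasecretkey'):
    """Replace every occurrence of placeholder with one fresh random key.

    Returns (new_text, key) so the caller can log or verify the key if needed.
    """
    key = secrets.token_hex(16)  # 16 random bytes -> 32 hex characters
    return settings_text.replace(placeholder, key), key
```

Applying it to a real installation is then a matter of reading searx/settings.yml, passing its text through the function, and writing the result back.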

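Edit searx/settings.yml under /usr/local/searx as needed; the language, bind address, and port are all set in that file.

Check

Start searx:
python searx/webapp.py

Open http://localhost:8888 in a browser.
If everything works, the debug option can be disabled in settings.yml: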

sed -i -e "s/debug : True/debug : False/g" searx/settings.yml

Using searx with uwsgi

Install the required packages:
sudo apt-get install uwsgi uwsgi-plugin-python

Create the configuration file /etc/uwsgi/apps-available/searx.ini with the following content:

[uwsgi]
# Who will run the code
uid = searx
gid = searx
 
# disable logging for privacy
disable-logging = true
 
# Number of workers (usually CPU count)
workers = 4
 
# The right granted on the created socket
chmod-socket = 666
 
# Plugin to use and interpreter config
single-interpreter = true
master = true
plugin = python
lazy-apps = true
enable-threads = true
 
# Module to import
module = searx.webapp
 
# Virtualenv and python path
virtualenv = /usr/local/searx/searx-ve/
pythonpath = /usr/local/searx/
chdir = /usr/local/searx/searx/


Enable the uwsgi application and restart:
cd /etc/uwsgi/apps-enabled
ln -s ../apps-available/searx.ini
/etc/init.d/uwsgi restart

Web server: nginx

Install Nginx with the following command:

sudo apt-get install nginx

Serving searx at the root path (/)

Create the configuration file /etc/nginx/sites-available/searx with the following content:

server {
    listen 80;
    server_name searx.example.com;
    root /usr/local/searx;
 
    location / {
            include uwsgi_params;
            uwsgi_pass unix:/run/uwsgi/app/searx/socket;
    }
}


Restart the services:
sudo service nginx restart
sudo service uwsgi restart

Serving searx at a sub-path (/searx)

Add the following to /etc/nginx/sites-enabled/default:

location = /searx { rewrite ^ /searx/; }
location /searx {
        try_files $uri @searx;
}
location @searx {
        uwsgi_param SCRIPT_NAME /searx;
        include uwsgi_params;
        uwsgi_modifier1 30;
        uwsgi_pass unix:/run/uwsgi/app/searx/socket;
}

Alternatively, use a reverse proxy (suitable for single-user or low-traffic instances):
location /searx {
    proxy_pass http://127.0.0.1:8888;
    proxy_set_header Host $host;
    proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
    proxy_set_header X-Scheme $scheme;
    proxy_set_header X-Script-Name /searx;
    proxy_buffering off;
}

Set base_url in searx/settings.yml:
base_url : http://your.domain.tld/searx/
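
The X-Script-Name and X-Scheme headers set in the reverse-proxy block only help if the application reads them. The sketch below shows the common WSGI middleware pattern for doing so. ProxyPathFix and echo_app are hypothetical names, and this is a generic illustration, not necessarily how searx itself handles these headers.

```python
class ProxyPathFix:
    """WSGI middleware sketch that honours X-Script-Name and X-Scheme headers."""

    def __init__(self, app):
        self.app = app

    def __call__(self, environ, start_response):
        script_name = environ.get('HTTP_X_SCRIPT_NAME', '')
        if script_name:
            # Tell the app where it is mounted and strip that prefix from the path.
            environ['SCRIPT_NAME'] = script_name
            path_info = environ.get('PATH_INFO', '')
            if path_info.startswith(script_name):
                environ['PATH_INFO'] = path_info[len(script_name):]
        scheme = environ.get('HTTP_X_SCHEME', '')
        if scheme:
            environ['wsgi.url_scheme'] = scheme
        return self.app(environ, start_response)


def echo_app(environ, start_response):
    """Tiny demo app that reports how it sees its own mount point."""
    start_response('200 OK', [('Content-Type', 'text/plain')])
    body = '{}://host{}|{}'.format(
        environ['wsgi.url_scheme'], environ['SCRIPT_NAME'], environ['PATH_INFO'])
    return [body.encode('utf-8')]
```

With the wrapper in place, a request that nginx forwards as /searx/search is seen by the application as /search mounted under /searx, so generated links keep the prefix.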


Restart the services:
sudo service nginx restart
sudo service uwsgi restart

For better privacy, logging can be disabled by adding the following below the uwsgi_pass line in /etc/nginx/sites-available/default:
access_log /dev/null;
error_log /dev/null;

Restart the service:
sudo service nginx restart

Web server: Apache

Add the uwsgi module:
sudo apt-get install libapache2-mod-uwsgi
sudo a2enmod uwsgi

Add the following to /etc/apache2/apache2.conf:
<Location />
    Options FollowSymLinks Indexes
    SetHandler uwsgi-handler
    uWSGISocket /run/uwsgi/app/searx/socket
</Location>

Note: if your searx instance is not deployed at the root path, change the <Location /> directive accordingly, for example to <Location /searx>.

Restart Apache:
sudo /etc/init.d/apache2 restart



Disabling logs

Back in /etc/apache2/apache2.conf, add the following above the <Location /> directive:
CustomLog /dev/null combined

Restart Apache:
sudo /etc/init.d/apache2 restart

How to update

cd /usr/local/searx
sudo -u searx -i
. ./searx-ve/bin/activate
git stash
git pull origin master
git stash apply
./manage.sh update_packages
sudo service uwsgi restart

Deploying searx with Docker

Make sure Docker is installed, then deploy searx with the following commands:
docker pull wonderfall/searx
docker run -d --name searx -p $PORT:8888 wonderfall/searx

Open http://localhost:$PORT in a browser.
See Docker Hub for more help.
You can also build searx from its Dockerfile:
git clone https://github.com/asciimoo/searx.git
cd searx
docker build -t whatever/searx .

Reference https://asciimoo.github.io/searx/dev/install/installation.html#id13
Searx – About me  https://about.okhin.fr/posts/Searx/

Demo http://ma-so.com

Source https://blog.csdn.net/weixin_45461896/article/details/125398990
