I. Official Site

RAG-Anything official site

II. Environment Setup

# Create a virtual environment
conda create -n raganything python=3.10

# List the existing virtual environments
conda env list

# Activate the virtual environment
conda activate raganything

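III. Dependencies

  • 1. LibreOffice

https://zh-cn.libreoffice.org/

# The version I downloaded:
https://mirrors.nju.edu.cn/tdf/libreoffice/stable/25.2.4/win/x86_64/LibreOffice_25.2.4_Win_x86-64.msi

Note: the MinerU feature used later converts PDF to Markdown, so Office documents must first be converted to PDF. LibreOffice handles that conversion.

Once the download finishes, run the installer.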

  • 2. RAG-Anything
# Install RAG-Anything
pip install raganything

# Install additional packages
pip install pycocotools timm
pip install detectron2 --extra-index-url https://myhloli.github.io/wheels/

# Download the models
mineru-models-download
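
After the installs above, a quick check that the packages are visible from the active environment can save a debugging round later. The following is a minimal sketch, not part of RAG-Anything: the helper name `missing_modules` is hypothetical, and the import names are assumed to match the pip packages listed above.

```python
import importlib.util
import sys

# Import names assumed to correspond to the pip packages installed above.
REQUIRED_MODULES = ["raganything", "pycocotools", "timm", "detectron2"]


def missing_modules(names):
    """Return the module names that cannot be found in this environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    # Show which interpreter is running, to confirm the conda env is active.
    print("Python", sys.version.split()[0], "at", sys.prefix)
    missing = missing_modules(REQUIRED_MODULES)
    print("Missing modules:", missing or "none")
```

If the printed prefix is not the raganything environment, the environment was probably not activated in that shell.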

IV. Debugging the Code

  • Working around Hugging Face downloads failing on networks in mainland China
# File to modify
D:\anaconda3\envs\raganything\Lib\site-packages\huggingface_hub\constants.py

# Change these values
HUGGINGFACE_CO_URL_HOME = "https://hf-mirror.com/"
_HF_DEFAULT_ENDPOINT = "https://hf-mirror.com"
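
An edit inside site-packages is lost whenever the package is reinstalled or upgraded. As a possible alternative, huggingface_hub reads the `HF_ENDPOINT` environment variable when it is imported, so the mirror can be selected without touching the file. A minimal sketch, assuming the same mirror URL as above; the helper name `use_hf_mirror` is hypothetical and not part of any library.

```python
import os

# Mirror URL taken from the edit above.
HF_MIRROR = "https://hf-mirror.com"


def use_hf_mirror(endpoint: str = HF_MIRROR) -> str:
    """Point huggingface_hub at a mirror via the HF_ENDPOINT environment variable.

    Call this before anything imports huggingface_hub, because the library
    reads the variable once at import time.
    """
    os.environ["HF_ENDPOINT"] = endpoint.rstrip("/")
    return os.environ["HF_ENDPOINT"]


if __name__ == "__main__":
    print("HF_ENDPOINT =", use_hf_mirror())
```

Processes started afterwards with `subprocess` inherit the variable by default. Setting `HF_ENDPOINT` system-wide before launching Python has the same effect.
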
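  • Add the directory containing soffice.exe to the PATH environment variable

    C:\Program Files\LibreOffice\program
    

  • Configure the environment variable before opening PyCharm to debug. I found that if the variable is added while PyCharm is already open, code run inside PyCharm does not see it (a running process keeps the environment it started with).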
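
To see what the running process actually has on PATH, and to patch it for that process only, something like the following can be run from inside PyCharm. This is a sketch under assumptions: `ensure_on_path` is a hypothetical helper, and the directory is the install location given above.

```python
import os
import shutil

# Install directory taken from the step above; adjust if LibreOffice lives elsewhere.
LIBREOFFICE_DIR = r"C:\Program Files\LibreOffice\program"


def ensure_on_path(directory: str) -> None:
    """Prepend a directory to PATH for the current process if it is not already listed."""
    entries = [e for e in os.environ.get("PATH", "").split(os.pathsep) if e]
    if directory not in entries:
        os.environ["PATH"] = os.pathsep.join([directory] + entries)


if __name__ == "__main__":
    # shutil.which returns None when the command is not found on PATH.
    print("soffice before:", shutil.which("soffice"))
    ensure_on_path(LIBREOFFICE_DIR)
    print("soffice after: ", shutil.which("soffice"))
```

The change affects only the current Python process and the child processes it starts; the system setting is untouched.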

  • The stock code triggers a soffice.exe version-check pop-up window, which makes it unusable in production as is, so I patched the code by hand

    D:\anaconda3\envs\raganything\Lib\site-packages\raganything\mineru_parser.py
    

    Changes:

    # Check if LibreOffice is available
                #libreoffice_available = False
                working_libreoffice_cmd = 'soffice'
                # try:
                #     result = subprocess.run(
                #         ["libreoffice", "--version"],
                #         capture_output=True,
                #         check=True,
                #         timeout=10,
                #         encoding="utf-8",
                #         errors="ignore",
                #     )
                #     libreoffice_available = True
                #     working_libreoffice_cmd = "libreoffice"
                #     print(f"LibreOffice detected: {result.stdout.strip()}")
                # except (
                #     subprocess.CalledProcessError,
                #     FileNotFoundError,
                #     subprocess.TimeoutExpired,
                # ):
                #     pass
                #
                # # Try alternative commands for LibreOffice
                # if not libreoffice_available:
                #     for cmd in ["soffice", "libreoffice"]:
                #         try:
                #             result = subprocess.run(
                #                 [cmd, "--version"],
                #                 capture_output=True,
                #                 check=True,
                #                 timeout=10,
                #                 encoding="utf-8",
                #                 errors="ignore",
                #             )
                #             libreoffice_available = True
                #             working_libreoffice_cmd = cmd
                #             print(
                #                 f"LibreOffice detected with command '{cmd}': {result.stdout.strip()}"
                #             )
                #             break
                #         except (
                #             subprocess.CalledProcessError,
                #             FileNotFoundError,
                #             subprocess.TimeoutExpired,
                #         ):
                #             continue
                #
                # if not libreoffice_available:
                #     raise RuntimeError(
                #         "LibreOffice is required for Office document conversion but was not found.\n"
                #         "Please install LibreOffice:\n"
                #         "- Windows: Download from https://www.libreoffice.org/download/download/\n"
                #         "- macOS: brew install --cask libreoffice\n"
                #         "- Ubuntu/Debian: sudo apt-get install libreoffice\n"
                #         "- CentOS/RHEL: sudo yum install libreoffice\n"
                #         "Alternatively, convert the document to PDF manually.\n"
                #         "MinerU 2.0 no longer includes built-in Office document conversion."
                #     )
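
The patch above hard-codes `soffice` and drops the availability check entirely. If the pop-up comes from running the `--version` command, a lookup that only searches PATH and never launches the program might keep the check without the window. A minimal sketch under that assumption; `find_libreoffice_cmd` is a hypothetical helper, not a function in raganything.

```python
import shutil

# Candidate command names, in the same order the original check tries them.
CANDIDATES = ("libreoffice", "soffice")


def find_libreoffice_cmd(candidates=CANDIDATES, which=shutil.which):
    """Return the first candidate command found on PATH without executing it.

    Raises RuntimeError if none is found. `which` is injectable so the lookup
    can be exercised on a machine without LibreOffice installed.
    """
    for cmd in candidates:
        if which(cmd) is not None:
            return cmd
    raise RuntimeError(
        "LibreOffice is required for Office document conversion "
        "but was not found on PATH."
    )
```

In the patched file, `working_libreoffice_cmd = find_libreoffice_cmd()` would then replace the hard-coded string, at the cost of no longer verifying that the executable actually runs.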
    
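  • On the first run, the code executes a command similar to the following

mineru -p C:\\Users\\ADMINI~1\\AppData\\Local\\Temp\\tmpt2sl2vd1\\驿来特平台安全.pdf -o output -m auto -b pipeline --source huggingface

The download takes a while, but the program itself shows no progress. I assumed it had hung until I traced the code and found that it was downloading models.

I later modified the source: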

D:\anaconda3\envs\raganything\Lib\site-packages\raganything\mineru_parser.py
# Around line 107
try:
    result = subprocess.run(
        cmd,
        # capture_output=True,  # commented out so MinerU's output is printed to the console
        text=True,
        check=True,
        encoding="utf-8",
        errors="ignore",
    )
    print("MinerU command executed successfully")
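
With `capture_output` commented out, the child process writes straight to the console and nothing is kept for later inspection. If both live output and a captured copy are wanted, one option is to read the output line by line with `subprocess.Popen`. This is a sketch of that general technique, not the library's own code; `run_streaming` is a hypothetical helper and the command in the example is a stand-in for the MinerU command list.

```python
import subprocess
import sys


def run_streaming(cmd):
    """Run a command, echoing each output line as it arrives; return the lines."""
    lines = []
    with subprocess.Popen(
        cmd,
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,  # merge stderr so progress messages are not lost
        text=True,
        encoding="utf-8",
        errors="ignore",
    ) as proc:
        for line in proc.stdout:
            print(line, end="", flush=True)
            lines.append(line.rstrip("\n"))
    # Leaving the with-block waits for the process, so returncode is set here.
    if proc.returncode != 0:
        raise subprocess.CalledProcessError(proc.returncode, cmd)
    return lines


if __name__ == "__main__":
    # Stand-in command; in practice this would be the MinerU command list.
    run_streaming([sys.executable, "-c", "print('step 1'); print('step 2')"])
```

Raising `CalledProcessError` on a non-zero exit code mirrors the `check=True` behaviour of the original call.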

magic-pdf.json

{
    "bucket_info":{
        "bucket-name-1":["ak", "sk", "endpoint"],
        "bucket-name-2":["ak", "sk", "endpoint"]
    },
    "temp-output-dir":"/tmp",
    "models-dir":"C:/Users/Administrator/.cache/modelscope/hub/models/OpenDataLab/PDF-Extract-Kit-1___0/models",
    "device-mode":"cpu",
    "layout-config": {
        "model": "doclayout_yolo"
    },
    "formula-config": {
        "mfd_model": "yolo_v8_mfd",
        "mfr_model": "unimernet_small",
        "enable": true 
    },
    "table-config": {
        "model": "rapid_table",  
        "enable": false, 
        "max_time": 400
    }
}
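
A wrong `models-dir` is easy to miss until a run fails, so it may be worth checking the file before starting a long job. A minimal sketch, assuming only the keys shown in the config above; `check_config` is a hypothetical helper and not part of MinerU or Magic-PDF.

```python
import json
from pathlib import Path

# Keys taken from the magic-pdf.json shown above.
EXPECTED_KEYS = (
    "models-dir",
    "device-mode",
    "layout-config",
    "formula-config",
    "table-config",
)


def check_config(text: str) -> list[str]:
    """Parse config JSON text and return a list of human-readable problems."""
    config = json.loads(text)
    problems = [f"missing key: {key}" for key in EXPECTED_KEYS if key not in config]
    models_dir = config.get("models-dir")
    if models_dir is not None and not Path(models_dir).is_dir():
        problems.append(f"models-dir does not exist: {models_dir}")
    return problems
```

Passing the file's contents, for example `check_config(Path("magic-pdf.json").read_text(encoding="utf-8"))`, returns an empty list when nothing looks wrong; the file location in that call is an assumption.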

V. Related Resources

# Tool for converting PDF to Markdown
https://github.com/opendatalab/MinerU
# Magic-PDF, which MinerU depends on
https://github.com/papayalove/Magic-PDF/blob/master/README_zh-CN.md
# PDF-Extract-Kit, which MinerU depends on
https://github.com/opendatalab/PDF-Extract-Kit/blob/main/README_zh-CN.md

# MinerU official site
https://mineru.net/
# MinerU v2.0: the VLM model breaks through the ceiling on parsing quality
https://blog.csdn.net/qq1198768105/article/details/148678967
# MinerU, Magic-PDF, Magic-Doc
https://blog.csdn.net/lovechris00/article/details/140584728

# MinerU tutorial, part 2: a hand-holding, step-by-step guide to deploying MinerU locally
https://zhuanlan.zhihu.com/p/1908942870666282723