diff --git a/dsRagAnything/Doc/配置过程.md b/dsRagAnything/Doc/配置过程.md
new file mode 100644
index 00000000..9dd9fdbb
--- /dev/null
+++ b/dsRagAnything/Doc/配置过程.md
@@ -0,0 +1,233 @@
+#### 1. Official Site
+
+[RAG-Anything official repository](https://github.com/HKUDS/RAG-Anything)
+
+#### 2. Environment Setup
+
+```cmd
+# Create a virtual environment
+conda create -n raganything python=3.10
+
+# List the existing virtual environments
+conda env list
+
+# Activate the virtual environment
+conda activate raganything
+```
+
+#### 3. Dependencies
+
+- (1) LibreOffice
+
+https://zh-cn.libreoffice.org/
+
+```sh
+# The version I downloaded:
+https://mirrors.nju.edu.cn/tdf/libreoffice/stable/25.2.4/win/x86_64/LibreOffice_25.2.4_Win_x86-64.msi
+```
+
+> **Note**: The MinerU feature used later converts PDF to Markdown, so a separate tool is needed to convert Office documents to PDF first.
+
+Once the download finishes, run the installer.
+
+
+
+- (2) RAG-Anything
+
+```cmd
+# Install RAG-Anything
+pip install raganything
+
+# Install additional packages
+pip install pycocotools timm
+pip install detectron2 --extra-index-url https://myhloli.github.io/wheels/
+
+# Download the models
+mineru-models-download
+```
+
+![](https://dsideal.obs.cn-north-1.myhuaweicloud.com/HuangHai/BlogImages/%7Byear%7D/%7Bmonth%7D/%7Bmd5%7D.%7BextName%7D/20250706152846354.png)
+
+#### 4. Debugging the Code
+
+- Work around Hugging Face downloads failing on networks in mainland China
+
+```python
+# File to modify:
+# D:\anaconda3\envs\raganything\Lib\site-packages\huggingface_hub\constants.py
+
+# Set these two values to the mirror:
+HUGGINGFACE_CO_URL_HOME = "https://hf-mirror.com/"
+_HF_DEFAULT_ENDPOINT = "https://hf-mirror.com"
+```
+
+- Add the directory containing soffice.exe to the PATH environment variable
+
+  ```
+  C:\Program Files\LibreOffice\program
+  ```
+
+  ![](https://dsideal.obs.cn-north-1.myhuaweicloud.com/HuangHai/BlogImages/%7Byear%7D/%7Bmonth%7D/%7Bmd5%7D.%7BextName%7D/20250706155805724.png)
+
+
+
+- **Note**: Set the environment variable before launching PyCharm. A process inherits its environment at launch, so if the variable is added while PyCharm is already open, code run from PyCharm will not see it until PyCharm is restarted.
+
+
+
+- The stock code runs a soffice.exe version check that pops up a dialog, so it cannot be used in production as-is. I patched the code by hand:
+
+  ```cmd
+  D:\anaconda3\envs\raganything\Lib\site-packages\raganything\mineru_parser.py
+  ```
+
+  The change:
+
+  ```python
+  # Check if LibreOffice is available (detection disabled below)
+  # libreoffice_available = False
+  working_libreoffice_cmd = 'soffice'  # hard-coded; assumes soffice is on PATH
+  # try:
+  #     result = subprocess.run(
+  #         ["libreoffice", "--version"],
+  #         capture_output=True,
+  #         check=True,
+  #         timeout=10,
+  #         encoding="utf-8",
+  #         errors="ignore",
+  #     )
+  #     libreoffice_available = True
+  #     working_libreoffice_cmd = "libreoffice"
+  #     print(f"LibreOffice detected: {result.stdout.strip()}")
+  # except (
+  #     subprocess.CalledProcessError,
+  #     FileNotFoundError,
+  #     subprocess.TimeoutExpired,
+  # ):
+  #     pass
+  #
+  # # Try alternative commands for LibreOffice
+  # if not libreoffice_available:
+  #     for cmd in ["soffice", "libreoffice"]:
+  #         try:
+  #             result = subprocess.run(
+  #                 [cmd, "--version"],
+  #                 capture_output=True,
+  #                 check=True,
+  #                 timeout=10,
+  #                 encoding="utf-8",
+  #                 errors="ignore",
+  #             )
+  #             libreoffice_available = True
+  #             working_libreoffice_cmd = cmd
+  #             print(
+  #                 f"LibreOffice detected with command '{cmd}': {result.stdout.strip()}"
+  #             )
+  #             break
+  #         except (
+  #             subprocess.CalledProcessError,
+  #             FileNotFoundError,
+  #             subprocess.TimeoutExpired,
+  #         ):
+  #             continue
+  #
+  # if not libreoffice_available:
+  #     raise RuntimeError(
+  #         "LibreOffice is required for Office document conversion but was not found.\n"
+  #         "Please install LibreOffice:\n"
+  #         "- Windows: Download from https://www.libreoffice.org/download/download/\n"
+  #         "- macOS: brew install --cask libreoffice\n"
+  #         "- Ubuntu/Debian: sudo apt-get install libreoffice\n"
+  #         "- CentOS/RHEL: sudo yum install libreoffice\n"
+  #         "Alternatively, convert the document to PDF manually.\n"
+  #         "MinerU 2.0 no longer includes built-in Office document conversion."
+  #     )
+  ```
+
+- On the first run, the code executes a command similar to the following:
+
+```cmd
+mineru -p C:\\Users\\ADMINI~1\\AppData\\Local\\Temp\\tmpt2sl2vd1\\驿来特平台安全.pdf -o output -m auto -b pipeline --source huggingface
+```
+
+![](https://dsideal.obs.cn-north-1.myhuaweicloud.com/HuangHai/BlogImages/%7Byear%7D/%7Bmonth%7D/%7Bmd5%7D.%7BextName%7D/20250706161400967.png)
+
+The download takes a while, but the program itself shows no progress. I assumed it had hung until I traced the code and found that it was downloading models.
+
+
+
+I later patched the source so that the output is visible:
+
+```cmd
+D:\anaconda3\envs\raganything\Lib\site-packages\raganything\mineru_parser.py
+```
+
+```python
+# Line 107 (excerpt; the rest of the try statement is unchanged)
+try:
+    result = subprocess.run(
+        cmd,
+        # capture_output=True,  # commented out so MinerU's output is printed to the console
+        text=True,
+        check=True,
+        encoding="utf-8",
+        errors="ignore",
+    )
+    print("MinerU command executed successfully")
+```
+
+
+
+
+
+
+
+magic-pdf.json
+
+```json
+{
+    "bucket_info":{
+        "bucket-name-1":["ak", "sk", "endpoint"],
+        "bucket-name-2":["ak", "sk", "endpoint"]
+    },
+    "temp-output-dir":"/tmp",
+    "models-dir":"C:/Users/Administrator/.cache/modelscope/hub/models/OpenDataLab/PDF-Extract-Kit-1___0/models",
+    "device-mode":"cpu",
+    "layout-config": {
+        "model": "doclayout_yolo"
+    },
+    "formula-config": {
+        "mfd_model": "yolo_v8_mfd",
+        "mfr_model": "unimernet_small",
+        "enable": true
+    },
+    "table-config": {
+        "model": "rapid_table",
+        "enable": false,
+        "max_time": 400
+    }
+}
+```
+
+#### 5. Related Resources
+
+```sh
+# Tool for converting PDF to Markdown
+https://github.com/opendatalab/MinerU
+# Magic-PDF, which MinerU depends on
+https://github.com/papayalove/Magic-PDF/blob/master/README_zh-CN.md
+# PDF-Extract-Kit, which MinerU depends on
+https://github.com/opendatalab/PDF-Extract-Kit/blob/main/README_zh-CN.md
+
+# MinerU official site
+https://mineru.net/
+# MinerU v2.0: the VLM model breaks through the ceiling of parsing quality!
+https://blog.csdn.net/qq1198768105/article/details/148678967
+# MinerU, Magic-PDF, Magic-Doc
+https://blog.csdn.net/lovechris00/article/details/140584728
+
+# MinerU tutorial, part 2: a step-by-step, hand-holding guide to deploying MinerU locally
+https://zhuanlan.zhihu.com/p/1908942870666282723
+```
+
+
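As a pre-flight check for the debugging steps in the guide, the sketch below confirms that `soffice` resolves on `PATH` without launching it (so no version dialog can appear) and builds an environment in which `HF_ENDPOINT` points at the mirror. `find_soffice` and `mirror_env` are hypothetical helper names, not part of RAG-Anything or MinerU. `HF_ENDPOINT` is the environment variable `huggingface_hub` reads for its endpoint; it is offered here as an assumed alternative to editing `constants.py`, and whether it covers every download path in this particular setup has not been verified.

```python
import os
import shutil


def find_soffice(candidates=("soffice", "libreoffice")):
    """Return the path of the first LibreOffice launcher found on PATH, or None.

    shutil.which only searches PATH; it never starts the program,
    so no version dialog can pop up.
    """
    for name in candidates:
        path = shutil.which(name)
        if path:
            return path
    return None


def mirror_env(endpoint="https://hf-mirror.com", base=None):
    """Return a copy of the environment with HF_ENDPOINT set to the mirror.

    The copy leaves os.environ (or the supplied mapping) untouched.
    """
    env = dict(os.environ if base is None else base)
    env["HF_ENDPOINT"] = endpoint
    return env


if __name__ == "__main__":
    soffice = find_soffice()
    print("soffice:", soffice or "not found on PATH")
    print("HF_ENDPOINT:", mirror_env()["HF_ENDPOINT"])
```

The dictionary returned by `mirror_env()` could be passed as `env=` to `subprocess.run` when invoking `mineru`. For in-process use, the variable would need to be set before `huggingface_hub` is imported, since the endpoint is read at import time.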