diff --git a/dsRagAnything/Doc/配置过程.md b/dsRagAnything/Doc/RagAnything配置.md
similarity index 82%
rename from dsRagAnything/Doc/配置过程.md
rename to dsRagAnything/Doc/RagAnything配置.md
index 9dd9fdbb..bb34d4f7 100644
--- a/dsRagAnything/Doc/配置过程.md
+++ b/dsRagAnything/Doc/RagAnything配置.md
@@ -2,6 +2,8 @@
 [RAG-Anything official site](https://github.com/HKUDS/RAG-Anything)
+
+
 #### 2. Environment Setup
 ```cmd
@@ -147,68 +149,51 @@ _HF_DEFAULT_ENDPOINT = "https://hf-mirror.com"
 - On the first run, the code executes a command similar to the following
 ```cmd
-mineru -p C:\\Users\\ADMINI~1\\AppData\\Local\\Temp\\tmpt2sl2vd1\\驿来特平台安全.pdf -o output -m auto -b pipeline --source huggingface
+mineru -p C:\\Users\\ADMINI~1\\AppData\\Local\\Temp\\tmpt2sl2vd1\\驿来特平台安全.pdf -o output -m auto -b pipeline --source modelscope
 ```
 ![](https://dsideal.obs.cn-north-1.myhuaweicloud.com/HuangHai/BlogImages/%7Byear%7D/%7Bmonth%7D/%7Bmd5%7D.%7BextName%7D/20250706161400967.png)
 The download takes a while, but the program itself shows no progress. I assumed it had hung; only after tracing the code did I realize it was downloading the models.
-
-
-I later modified the source code:
+- Modify the source code:
 ```cmd
 D:\anaconda3\envs\raganything\Lib\site-packages\raganything\mineru_parser.py
 ```
 ```python
-    # Line 107
-    try:
-        result = subprocess.run(
-            cmd,
-            #capture_output=True, # comment out this line so the output is printed
-            text=True,
-            check=True,
-            encoding="utf-8",
-            errors="ignore",
-        )
-        print("MinerU command executed successfully")
+# Line 62
+@staticmethod
+def _run_mineru_command(
+    input_path: Union[str, Path],
+    output_dir: Union[str, Path],
+    method: str = "auto",
+    lang: Optional[str] = None,
+    backend: str = "pipeline",
+    start_page: Optional[int] = None,
+    end_page: Optional[int] = None,
+    formula: bool = True,
+    table: bool = True,
+    device: Optional[str] = None,
+    source: str = "modelscope", # 'huggingface' --> 'modelscope'
+) -> None:
+
+# Line 107
+try:
+    result = subprocess.run(
+        cmd,
+        #capture_output=True, # comment out this line so the output is printed
+        text=True,
+        check=True,
+        encoding="utf-8",
+        errors="ignore",
+    )
+    print("MinerU command executed successfully")
 ```
-
-
-
-
-magic-pdf.json
-
-```json
-{
-    "bucket_info":{
-        "bucket-name-1":["ak", "sk", "endpoint"],
-        "bucket-name-2":["ak", "sk", "endpoint"]
-    },
-    "temp-output-dir":"/tmp",
-    "models-dir":"C:/Users/Administrator/.cache/modelscope/hub/models/OpenDataLab/PDF-Extract-Kit-1___0/models",
-    "device-mode":"cpu",
-    "layout-config": {
-        "model": "doclayout_yolo"
-    },
-    "formula-config": {
-        "mfd_model": "yolo_v8_mfd",
-        "mfr_model": "unimernet_small",
-        "enable": true
-    },
-    "table-config": {
-        "model": "rapid_table",
-        "enable": false,
-        "max_time": 400
-    }
-}
-```
-
 #### 5. Related Resources
 ```sh
@@ -230,4 +215,3 @@ https://blog.csdn.net/lovechris00/article/details/140584728
 https://zhuanlan.zhihu.com/p/1908942870666282723
 ```
-
diff --git a/dsRagAnything/Doc/文档.txt b/dsRagAnything/Doc/文档.txt
deleted file mode 100644
index 6469478b..00000000
--- a/dsRagAnything/Doc/文档.txt
+++ /dev/null
@@ -1,51 +0,0 @@
-# Official site
-https://github.com/HKUDS/RAG-Anything
-
-# Create a virtual environment
-conda create -n raganything python=3.10
-
-# Activate the virtual environment
-conda activate raganything
-
-# Library to test next
-https://github.com/HKUDS/VideoRAG
-
-# Add to PATH
-C:\Program Files\LibreOffice\program
-
-# Office document parsing test (MinerU only)
-python examples/office_document_test.py --file path/to/document.docx
-
-# Check LibreOffice installation
-python examples/office_document_test.py --check-libreoffice --file dummy
-
-
-# End-to-end processing
-python examples/raganything_example.py path/to/document.pdf --api-key YOUR_API_KEY
-
-# Direct modal processing
-python examples/modalprocessors_example.py --api-key YOUR_API_KEY
-
-# Image format parsing test (MinerU only)
-python examples/image_format_test.py --file path/to/image.bmp
-
-# Text format parsing test (MinerU only)
-python examples/text_format_test.py --file path/to/document.md
-
-
-# Check PIL/Pillow installation
-python examples/image_format_test.py --check-pillow --file dummy
-
-# Check ReportLab installation
-python examples/text_format_test.py --check-reportlab --file dummy
-
-# MinerU
-https://github.com/opendatalab/MinerU/blob/master/README_zh-CN.md
-
-# SiliconFlow vision models
-https://cloud.siliconflow.cn/sft-b86b3myzge/models?tags=%E8%A7%86%E8%A7%89
-# Free models
-
-# API endpoint
-https://api.siliconflow.cn/v1/chat/completions
-model: GLM-4.1V-9B-Thinking
\ No newline at end of file
diff --git a/dsRagAnything/Doc/文档解析.txt b/dsRagAnything/Doc/文档解析.txt
deleted file mode 100644
index 102355c2..00000000
--- a/dsRagAnything/Doc/文档解析.txt
+++ /dev/null
@@ -1,131 +0,0 @@
-# Create a virtual environment
-conda create -n py310 python=3.10
-
-# List the existing virtual environments
-conda env list
-conda info -e
-
-# Activate the virtual environment
-conda activate py310
-
-# RAG-Anything official site
-https://github.com/HKUDS/RAG-Anything
-
-# LibreOffice official site
-https://zh-cn.libreoffice.org/
-
-# Github
-https://github.com/opendatalab/MinerU
-https://github.com/papayalove/Magic-PDF/blob/master/README_zh-CN.md
-https://github.com/opendatalab/PDF-Extract-Kit/blob/main/README_zh-CN.md
-
-# MinerU official site
-https://mineru.net/
-
-# MinerU v2.0: VLM models break through the ceiling of parsing quality!
-https://blog.csdn.net/qq1198768105/article/details/148678967
-
-# MinerU, Magic-PDF, Magic-Doc
-https://blog.csdn.net/lovechris00/article/details/140584728
-
-
-# Install
-pip install raganything
-
-
-Office documents (.doc, .docx, .ppt, .pptx, .xls, .xlsx) require LibreOffice installation
-Download from LibreOffice official website
-Windows: Download installer from official website
-macOS: brew install --cask libreoffice
-Ubuntu/Debian: sudo apt-get install libreoffice
-CentOS/RHEL: sudo yum install libreoffice
-
-# MinerU tutorial, part 2: a step-by-step, hand-holding guide to deploying MinerU locally
-https://zhuanlan.zhihu.com/p/1908942870666282723
-
-pip install modelscope
-curl -o download_models.py https://gcore.jsdelivr.net/gh/opendatalab/MinerU@master/scripts/download_models.py
-python download_models.py
-
-# MinerU local deployment tutorial: an essential tool for building an AI knowledge base site
-https://blog.csdn.net/mzl87/article/details/147904238
-
-pip install -U magic-pdf[full] --extra-index-url https://wheels.myhloli.com -i https://mirrors.aliyun.com/pypi/simple
-
-magic-pdf --version
-
-pip install modelscope
-
-cd C:\Users\Administrator\PycharmProjects\PythonProject\MinerU
-python
-
-magic-pdf -p D:\python\小乔证件\黄琬乔2023蓝桥杯省赛准考证.pdf -o ./output
-
-
-(py310) PS C:\Users\Administrator> magic-pdf -p D:\python\小乔证件\黄琬乔2023蓝桥杯省赛准考证.pdf -o ./output
-2025-07-05 19:09:59.132 | ERROR | magic_pdf.tools.cli:parse_doc:134 - C:\Users\Administrator\magic-pdf.json not found
-Traceback (most recent call last):
-
-  File "D:\anaconda3\envs\py310\lib\runpy.py", line 196, in _run_module_as_main
-    return _run_code(code, main_globals, None,
-
- # Model download command for the new version
- mineru-models-download
-
- pip install pycocotools timm
- pip install detectron2 --extra-index-url https://myhloli.github.io/wheels/
-
- # https://github.com/opendatalab/MinerU/blob/master/README_zh-CN.md
- https://github.com/opendatalab/MinerU/issues/2357
-
- https://gitee.com/bibi100/MinerU/blob/master/README_zh-CN.md#Magic-PDF
-
- # OSError: We couldn't connect to 'https://huggingface.co' to load this file
- https://blog.csdn.net/qq_38683460/article/details/145661150
-
- D:\anaconda3\envs\py310\Lib\site-packages\huggingface_hub\constants.py
-
-Modify the file
-HUGGINGFACE_CO_URL_HOME = "https://hf-mirror.com/"
-_HF_DEFAULT_ENDPOINT = "https://hf-mirror.com"
-
-
-mineru -p D:\\python\\小乔证件\\黄琬乔2023蓝桥杯省赛准考证.pdf -o output -m auto -b pipeline --source modelscope
-
-
-(base) PS C:\Users\Administrator> conda activate py310
-(py310) PS C:\Users\Administrator> mineru -p D:\\python\\小乔证件\\黄琬乔2023蓝桥杯省赛准考证.pdf -o output -m auto -b pipeline --source modelscope
-2025-07-05 20:51:56.963 | WARNING | mineru.backend.vlm.predictor::35 - sglang is not installed. If you are not using sglang, you can ignore this warning.
-2025-07-05 20:52:02.387 | INFO | mineru.backend.pipeline.pipeline_analyze:doc_analyze:124 - Batch 1/1: 2 pages/2 pages
-2025-07-05 20:52:02.388 | INFO | mineru.backend.pipeline.model_init:__init__:137 - DocAnalysis init, this may take some times......
-Downloading Model from https://www.modelscope.cn to directory: C:\Users\Administrator\.cache\modelscope\hub\models\OpenDataLab\PDF-Extract-Kit-1.0
-2025-07-05 20:52:06,696 - modelscope - INFO - Target directory already exists, skipping creation.
-Downloading Model from https://www.modelscope.cn to directory: C:\Users\Administrator\.cache\modelscope\hub\models\OpenDataLab\PDF-Extract-Kit-1.0
-2025-07-05 20:52:10,003 - modelscope - INFO - Target directory already exists, skipping creation.
-Downloading Model from https://www.modelscope.cn to directory: C:\Users\Administrator\.cache\modelscope\hub\models\OpenDataLab\PDF-Extract-Kit-1.0
-2025-07-05 20:52:14,257 - modelscope - INFO - Target directory already exists, skipping creation.
-Downloading Model from https://www.modelscope.cn to directory: C:\Users\Administrator\.cache\modelscope\hub\models\OpenDataLab\PDF-Extract-Kit-1.0
-2025-07-05 20:52:17,799 - modelscope - INFO - Target directory already exists, skipping creation.
-Downloading Model from https://www.modelscope.cn to directory: C:\Users\Administrator\.cache\modelscope\hub\models\OpenDataLab\PDF-Extract-Kit-1.0
-2025-07-05 20:52:20,709 - modelscope - INFO - Target directory already exists, skipping creation.
-Downloading Model from https://www.modelscope.cn to directory: C:\Users\Administrator\.cache\modelscope\hub\models\OpenDataLab\PDF-Extract-Kit-1.0
-2025-07-05 20:52:24,269 - modelscope - INFO - Target directory already exists, skipping creation.
-2025-07-05 20:52:24.405 | INFO | mineru.backend.pipeline.model_init:__init__:182 - DocAnalysis init done!
-2025-07-05 20:52:24.405 | INFO | mineru.backend.pipeline.pipeline_analyze:custom_model_init:64 - model init cost: 22.017439603805542
-Layout Predict: 100%|████████████████████████████████████████████████████████████████████| 2/2 [00:02<00:00, 1.05s/it]
-MFD Predict: 100%|███████████████████████████████████████████████████████████████████████| 2/2 [00:05<00:00, 2.63s/it]
-MFR Predict: 100%|███████████████████████████████████████████████████████████████████████| 4/4 [00:02<00:00, 1.50it/s]
-Downloading Model from https://www.modelscope.cn to directory: C:\Users\Administrator\.cache\modelscope\hub\models\OpenDataLab\PDF-Extract-Kit-1.0
-2025-07-05 20:52:37,273 - modelscope - INFO - Target directory already exists, skipping creation.
-Downloading Model from https://www.modelscope.cn to directory: C:\Users\Administrator\.cache\modelscope\hub\models\OpenDataLab\PDF-Extract-Kit-1.0
-2025-07-05 20:52:40,451 - modelscope - INFO - Target directory already exists, skipping creation.
-OCR-det ch: 100%|██████████████████████████████████████████████████████████████████████| 16/16 [00:02<00:00, 5.88it/s]
-Table Predict: 0%| | 0/1 [00:00