main
HuangHai 2 weeks ago
parent 49e1fe9f3b
commit a4a07b90b5

@ -2,6 +2,8 @@
[RAG-Anything 官网](https://github.com/HKUDS/RAG-Anything)
#### 二、环境配置
```cmd
@ -147,68 +149,51 @@ _HF_DEFAULT_ENDPOINT = "https://hf-mirror.com"
- 首次运行时,代码会执行下面的类似命令
```cmd
mineru -p C:\\Users\\ADMINI~1\\AppData\\Local\\Temp\\tmpt2sl2vd1\\驿来特平台安全.pdf -o output -m auto -b pipeline --source huggingface
mineru -p C:\\Users\\ADMINI~1\\AppData\\Local\\Temp\\tmpt2sl2vd1\\驿来特平台安全.pdf -o output -m auto -b pipeline --source modelscope
```
![](https://dsideal.obs.cn-north-1.myhuaweicloud.com/HuangHai/BlogImages/%7Byear%7D/%7Bmonth%7D/%7Bmd5%7D.%7BextName%7D/20250706161400967.png)
下载需要等待,但程序本身不显示进度,我一直以为卡住了,后来跟踪代码,才知道它是在下载模型。
我后面修改了一下源码:
- 修改源码:
```cmd
D:\anaconda3\envs\raganything\Lib\site-packages\raganything\mineru_parser.py
```
```python
# 107行
try:
result = subprocess.run(
cmd,
#capture_output=True, #注释掉这句,可以把输出打印出来
text=True,
check=True,
encoding="utf-8",
errors="ignore",
)
print("MinerU command executed successfully")
# 62行
@staticmethod
def _run_mineru_command(
input_path: Union[str, Path],
output_dir: Union[str, Path],
method: str = "auto",
lang: Optional[str] = None,
backend: str = "pipeline",
start_page: Optional[int] = None,
end_page: Optional[int] = None,
formula: bool = True,
table: bool = True,
device: Optional[str] = None,
source: str = "modelscope", # 'huggingface' --> 'modelscope'
) -> None:
# 107行
try:
result = subprocess.run(
cmd,
#capture_output=True, #注释掉这句,可以把输出打印出来
text=True,
check=True,
encoding="utf-8",
errors="ignore",
)
print("MinerU command executed successfully")
```
magic-pdf.json
```json
{
"bucket_info":{
"bucket-name-1":["ak", "sk", "endpoint"],
"bucket-name-2":["ak", "sk", "endpoint"]
},
"temp-output-dir":"/tmp",
"models-dir":"C:/Users/Administrator/.cache/modelscope/hub/models/OpenDataLab/PDF-Extract-Kit-1___0/models",
"device-mode":"cpu",
"layout-config": {
"model": "doclayout_yolo"
},
"formula-config": {
"mfd_model": "yolo_v8_mfd",
"mfr_model": "unimernet_small",
"enable": true
},
"table-config": {
"model": "rapid_table",
"enable": false,
"max_time": 400
}
}
```
#### 五、相关资料
```sh
@ -230,4 +215,3 @@ https://blog.csdn.net/lovechris00/article/details/140584728
https://zhuanlan.zhihu.com/p/1908942870666282723
```

@ -1,51 +0,0 @@
# 官网
https://github.com/HKUDS/RAG-Anything
# 创建虚拟环境
conda create -n raganything python=3.10
# 激活虚拟环境
conda activate raganything
# 下一步需要测试的库
https://github.com/HKUDS/VideoRAG
# 添加到PATH
C:\Program Files\LibreOffice\program
# Office document parsing test (MinerU only)
python examples/office_document_test.py --file path/to/document.docx
# Check LibreOffice installation
python examples/office_document_test.py --check-libreoffice --file dummy
# End-to-end processing
python examples/raganything_example.py path/to/document.pdf --api-key YOUR_API_KEY
# Direct modal processing
python examples/modalprocessors_example.py --api-key YOUR_API_KEY
# Image format parsing test (MinerU only)
python examples/image_format_test.py --file path/to/image.bmp
# Text format parsing test (MinerU only)
python examples/text_format_test.py --file path/to/document.md
# Check PIL/Pillow installation
python examples/image_format_test.py --check-pillow --file dummy
# Check ReportLab installation
python examples/text_format_test.py --check-reportlab --file dummy
# MinerU
https://github.com/opendatalab/MinerU/blob/master/README_zh-CN.md
# 硅基流动的视觉模型
https://cloud.siliconflow.cn/sft-b86b3myzge/models?tags=%E8%A7%86%E8%A7%89
# 免费的模型
# 调用地址
https://api.siliconflow.cn/v1/chat/completions
model:GLM-4.1V-9B-Thinking

@ -1,131 +0,0 @@
# 创建虚拟环境
conda create -n py310 python=3.10
# 查看当前存在哪些虚拟环境
conda env list
conda info -e
# 激活虚拟环境
conda activate py310
# RAG-Anything 官网
https://github.com/HKUDS/RAG-Anything
# libreoffice 官网
https://zh-cn.libreoffice.org/
# Github
https://github.com/opendatalab/MinerU
https://github.com/papayalove/Magic-PDF/blob/master/README_zh-CN.md
https://github.com/opendatalab/PDF-Extract-Kit/blob/main/README_zh-CN.md
# mineru 官网
https://mineru.net/
# MinerU v2.0VLM模型捅破解析效果天花板
https://blog.csdn.net/qq1198768105/article/details/148678967
# MinerU、Magic-PDF、Magic-Doc
https://blog.csdn.net/lovechris00/article/details/140584728
# 安装
pip install raganything
Office documents (.doc, .docx, .ppt, .pptx, .xls, .xlsx) require LibreOffice installation
Download from LibreOffice official website
Windows: Download installer from official website
macOS: brew install --cask libreoffice
Ubuntu/Debian: sudo apt-get install libreoffice
CentOS/RHEL: sudo yum install libreoffice
# MinerU教程第二弹丨MinerU 本地部署保姆级“喂饭”教程
https://zhuanlan.zhihu.com/p/1908942870666282723
pip install modelscope
curl -o download_models.py https://gcore.jsdelivr.net/gh/opendatalab/MinerU@master/scripts/download_models.py
python download_models.py
# MinerU本地化部署教程——一款AI知识库建站的必备工具
https://blog.csdn.net/mzl87/article/details/147904238
pip install -U magic-pdf[full] --extra-index-url https://wheels.myhloli.com -i https://mirrors.aliyun.com/pypi/simple
magic-pdf --version
pip install modelscope
cd C:\Users\Administrator\PycharmProjects\PythonProject\MinerU
python
magic-pdf -p D:\python\小乔证件\黄琬乔2023蓝桥杯省赛准考证.pdf -o ./output
(py310) PS C:\Users\Administrator> magic-pdf -p D:\python\小乔证件\黄琬乔2023蓝桥杯省赛准考证.pdf -o ./output
2025-07-05 19:09:59.132 | ERROR | magic_pdf.tools.cli:parse_doc:134 - C:\Users\Administrator\magic-pdf.json not found
Traceback (most recent call last):
File "D:\anaconda3\envs\py310\lib\runpy.py", line 196, in _run_module_as_main
return _run_code(code, main_globals, None,
# 新版本的模型下载命令
mineru-models-download
pip install pycocotools timm
pip install detectron2 --extra-index-url https://myhloli.github.io/wheels/
# https://github.com/opendatalab/MinerU/blob/master/README_zh-CN.md
https://github.com/opendatalab/MinerU/issues/2357
https://gitee.com/bibi100/MinerU/blob/master/README_zh-CN.md#Magic-PDF
# OSError: We couldnt connect to https://huggingface.co to load this file
https://blog.csdn.net/qq_38683460/article/details/145661150
D:\anaconda3\envs\py310\Lib\site-packages\huggingface_hub\constants.py
修改文件
HUGGINGFACE_CO_URL_HOME = "https://hf-mirror.com/"
_HF_DEFAULT_ENDPOINT = "https://hf-mirror.com"
mineru -p D:\\python\\小乔证件\\黄琬乔2023蓝桥杯省赛准考证.pdf -o output -m auto -b pipeline --source modelscope
(base) PS C:\Users\Administrator> conda activate py310
(py310) PS C:\Users\Administrator> mineru -p D:\\python\\小乔证件\\黄琬乔2023蓝桥杯省赛准考证.pdf -o output -m auto -b pipeline --source modelscope
2025-07-05 20:51:56.963 | WARNING | mineru.backend.vlm.predictor:<module>:35 - sglang is not installed. If you are not using sglang, you can ignore this warning.
2025-07-05 20:52:02.387 | INFO | mineru.backend.pipeline.pipeline_analyze:doc_analyze:124 - Batch 1/1: 2 pages/2 pages
2025-07-05 20:52:02.388 | INFO | mineru.backend.pipeline.model_init:__init__:137 - DocAnalysis init, this may take some times......
Downloading Model from https://www.modelscope.cn to directory: C:\Users\Administrator\.cache\modelscope\hub\models\OpenDataLab\PDF-Extract-Kit-1.0
2025-07-05 20:52:06,696 - modelscope - INFO - Target directory already exists, skipping creation.
Downloading Model from https://www.modelscope.cn to directory: C:\Users\Administrator\.cache\modelscope\hub\models\OpenDataLab\PDF-Extract-Kit-1.0
2025-07-05 20:52:10,003 - modelscope - INFO - Target directory already exists, skipping creation.
Downloading Model from https://www.modelscope.cn to directory: C:\Users\Administrator\.cache\modelscope\hub\models\OpenDataLab\PDF-Extract-Kit-1.0
2025-07-05 20:52:14,257 - modelscope - INFO - Target directory already exists, skipping creation.
Downloading Model from https://www.modelscope.cn to directory: C:\Users\Administrator\.cache\modelscope\hub\models\OpenDataLab\PDF-Extract-Kit-1.0
2025-07-05 20:52:17,799 - modelscope - INFO - Target directory already exists, skipping creation.
Downloading Model from https://www.modelscope.cn to directory: C:\Users\Administrator\.cache\modelscope\hub\models\OpenDataLab\PDF-Extract-Kit-1.0
2025-07-05 20:52:20,709 - modelscope - INFO - Target directory already exists, skipping creation.
Downloading Model from https://www.modelscope.cn to directory: C:\Users\Administrator\.cache\modelscope\hub\models\OpenDataLab\PDF-Extract-Kit-1.0
2025-07-05 20:52:24,269 - modelscope - INFO - Target directory already exists, skipping creation.
2025-07-05 20:52:24.405 | INFO | mineru.backend.pipeline.model_init:__init__:182 - DocAnalysis init done!
2025-07-05 20:52:24.405 | INFO | mineru.backend.pipeline.pipeline_analyze:custom_model_init:64 - model init cost: 22.017439603805542
Layout Predict: 100%|████████████████████████████████████████████████████████████████████| 2/2 [00:02<00:00, 1.05s/it]
MFD Predict: 100%|███████████████████████████████████████████████████████████████████████| 2/2 [00:05<00:00, 2.63s/it]
MFR Predict: 100%|███████████████████████████████████████████████████████████████████████| 4/4 [00:02<00:00, 1.50it/s]
Downloading Model from https://www.modelscope.cn to directory: C:\Users\Administrator\.cache\modelscope\hub\models\OpenDataLab\PDF-Extract-Kit-1.0
2025-07-05 20:52:37,273 - modelscope - INFO - Target directory already exists, skipping creation.
Downloading Model from https://www.modelscope.cn to directory: C:\Users\Administrator\.cache\modelscope\hub\models\OpenDataLab\PDF-Extract-Kit-1.0
2025-07-05 20:52:40,451 - modelscope - INFO - Target directory already exists, skipping creation.
OCR-det ch: 100%|██████████████████████████████████████████████████████████████████████| 16/16 [00:02<00:00, 5.88it/s]
Table Predict: 0%| | 0/1 [00:00<?, ?it/s]Downloading Model from https://www.modelscope.cn to directory: C:\Users\Administrator\.cache\modelscope\hub\models\OpenDataLab\PDF-Extract-Kit-1.0
2025-07-05 20:52:46,921 - modelscope - INFO - Target directory already exists, skipping creation.
Table Predict: 100%|█████████████████████████████████████████████████████████████████████| 1/1 [00:04<00:00, 4.51s/it]
Processing pages: 0%| | 0/2 [00:00<?, ?it/s]Downloading Model from https://www.modelscope.cn to directory: C:\Users\Administrator\.cache\modelscope\hub\models\OpenDataLab\PDF-Extract-Kit-1.0
2025-07-05 20:52:51,389 - modelscope - INFO - Target directory already exists, skipping creation.
Processing pages: 100%|██████████████████████████████████████████████████████████████████| 2/2 [00:04<00:00, 2.06s/it]
OCR-rec Predict: 100%|███████████████████████████████████████████████████████████████████| 2/2 [00:00<00:00, 22.94it/s]
2025-07-05 20:52:52.566 | INFO | mineru.cli.common:_process_output:156 - local output dir is output\黄琬乔2023蓝桥杯省赛准考证\auto
(py310) PS C:\Users\Administrator>
Loading…
Cancel
Save