vLLM: an open-source accelerated inference engine for large language models
Deployment
GPU deployment
The official wheels target GPU deployment by default, so a plain pip install is enough:
pip install vllm
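Once installed, you can sanity-check the GPU build with vLLM's offline-inference Python API; a minimal sketch (the model name below is only an example, any local path or Hugging Face model id works):
# Minimal offline-inference sanity check using vLLM's Python API.
# "Qwen/Qwen2-0.5B" is only an example model id.
from vllm import LLM, SamplingParams

llm = LLM(model="Qwen/Qwen2-0.5B")
params = SamplingParams(temperature=0.7, top_p=0.8, max_tokens=64)

for out in llm.generate(["Why is the sky blue?"], params):
    print(out.outputs[0].text)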
CPU deployment
The CPU backend has to be built from source.
## Download the source code
git clone https://github.com/vllm-project/vllm.git
## Create a conda virtual environment
conda create --name vLLM python=3.10 -y
conda activate vLLM
## Install the GCC compiler needed to build the source:
sudo apt-get update -y
sudo apt-get install -y gcc-12 g++-12 libnuma-dev
sudo update-alternatives --install /usr/bin/gcc gcc /usr/bin/gcc-12 10 --slave /usr/bin/g++ g++ /usr/bin/g++-12
## Install the build dependencies
pip install --upgrade pip
pip install wheel packaging ninja "setuptools>=49.4.0" numpy
pip install -v -r requirements-cpu.txt --extra-index-url https://download.pytorch.org/whl/cpu
## Build and install
VLLM_TARGET_DEVICE=cpu python setup.py install
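Once the build finishes, a quick import in the same conda environment confirms the package was installed (a minimal check):
# Confirm the freshly built CPU package is importable.
import vllm

print(vllm.__version__)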
Run
Run Qwen2-0.5B
python -m vllm.entrypoints.openai.api_server --model /data/llm/Qwen2-0.5B --port 8000 --host 0.0.0.0
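Before sending a chat request, you can confirm the server is up by listing the served models (a sketch using the requests library; host and port match the launch command above):
# Quick health check: the OpenAI-compatible server exposes GET /v1/models.
import requests

resp = requests.get("http://localhost:8000/v1/models", timeout=5)
resp.raise_for_status()
for model in resp.json()["data"]:
    print(model["id"])  # should list /data/llm/Qwen2-0.5B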
Test
curl http://localhost:8000/v1/chat/completions -H "Content-Type: application/json" -d '{
  "model": "/data/llm/Qwen2-0.5B",
  "messages": [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Why is the sky blue?"}
  ],
  "stream": true,
  "temperature": 0.7,
  "top_p": 0.8,
  "repetition_penalty": 1.05,
  "max_tokens": 512
}'
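The same request can be sent with the official openai Python client, which also makes it easy to consume the streamed chunks (a sketch; the api_key value is arbitrary because the server above was started without an API key):
# Equivalent of the curl request above, using the openai client with streaming.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

stream = client.chat.completions.create(
    model="/data/llm/Qwen2-0.5B",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Why is the sky blue?"},
    ],
    stream=True,
    temperature=0.7,
    top_p=0.8,
    max_tokens=512,
    extra_body={"repetition_penalty": 1.05},  # vLLM-specific sampling parameter
)
for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)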
References
https://docs.vllm.ai/en/latest/getting_started/cpu-installation.html
https://blog.csdn.net/obullxl/article/details/141105850
https://github.com/vllm-project/vllm/issues/7722
https://docs.vllm.ai/en/latest/getting_started/openvino-installation.html
Troubleshooting
Installing cmake
Running VLLM_TARGET_DEVICE=cpu python setup.py install fails with:
running build_ext
Traceback (most recent call last):
File "/data/vllm/setup.py", line 215, in build_extensions
subprocess.check_output(['cmake', '--version'])
Install cmake:
apt update
apt install cmake
It still fails:
running build_ext
CMake Error at CMakeLists.txt:1 (cmake_minimum_required):
CMake 3.26 or higher is required. You are running version 3.25.1
The apt package (3.25.1) is older than the required 3.26, so download and install cmake 3.26 directly:
wget https://cmake.org/files/v3.26/cmake-3.26.0-linux-x86_64.sh
bash cmake-3.26.0-linux-x86_64.sh --prefix=/usr/local --skip-license
CMake Error at CMakeLists.txt:11 (include):
include could not find requested file:
/data/vllm/cmake/utils.cmake
CMake Error at CMakeLists.txt:46 (find_python_from_executable):
Unknown CMake command "find_python_from_executable".
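Both of these follow-up errors point at the same cause: /data/vllm/cmake/utils.cmake cannot be found, and that file, which ships with the vLLM repository, is what defines find_python_from_executable. The checkout is most likely incomplete or was partially cleaned; restoring the file (for example with git checkout -- cmake/ or a fresh clone) should clear both messages.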
Request errors
After the server started, a chat completion request failed with the following output:
lse, length_penalty=1.0, early_stopping=False, stop=[], stop_token_ids=[], include_stop_str_in_output=False, ignore_eos=False, max_tokens=131048, min_tokens=0, logprobs=None, prompt_logprobs=None, skip_special_tokens=True, spaces_between_special_tokens=True, truncate_prompt_tokens=None), prompt_token_ids: [151644, 8948, 198, 56568, 101909, 100168, 110498, 13, 151645, 198, 151644, 872, 198, 101916, 100678, 20412, 105681, 9370, 11319, 151645, 198, 151644, 77091, 198], lora_request: None, prompt_adapter_request: None.
INFO 09-04 16:16:07 async_llm_engine.py:206] Added request chat-f73a117b4d314441a07b47449ce1b976.
INFO 09-04 16:16:07 metrics.py:351] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%.
(VllmWorkerProcess pid=304274) ERROR 09-04 16:16:07 _custom_ops.py:37] Error in calling custom op rms_norm: '_OpNamespace' '_C' object has no attribute 'rms_norm'
(VllmWorkerProcess pid=304274) ERROR 09-04 16:16:07 _custom_ops.py:37] Possibly you have built or installed an obsolete version of vllm.
(VllmWorkerProcess pid=304274) ERROR 09-04 16:16:07 _custom_ops.py:37] Please try a clean build and install of vllm,or remove old built files such as vllm/*cpython*.so and build/ .
(VllmWorkerProcess pid=304274) ERROR 09-04 16:16:07 multiproc_worker_utils.py:226] Exception in worker VllmWorkerProcess while processing method execute_model: '_OpNamespace' '_C' object has no attribute 'rms_norm', Traceback (most recent call last):
(VllmWorkerProcess pid=304274) ERROR 09-04 16:16:07 multiproc_worker_utils.py:226] File "/data/vllm/vllm/executor/multiproc_worker_utils.py", line 223, in _run_worker_process
(VllmWorkerProcess pid=304274) ERROR 09-04 16:16:07 multiproc_worker_utils.py:226] output = executor(*args, **kwargs)
(VllmWorkerProcess pid=304274) ERROR 09-04 16:16:07 multiproc_worker_utils.py:226] File "/data/vllm/vllm/worker/worker_base.py", line 327, in execute_model
(VllmWorkerProcess pid=304274) ERROR 09-04 16:16:07 multiproc_worker_utils.py:226] output = self.model_runner.execute_model(
(VllmWorkerProcess pid=304274) ERROR 09-04 16:16:07 multiproc_worker_utils.py:226] File "/root/miniconda3/envs/vLLM/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
(VllmWorkerProcess pid=304274) ERROR 09-04 16:16:07 multiproc_worker_utils.py:226] return func(*args, **kwargs)
(VllmWorkerProcess pid=304274) ERROR 09-04 16:16:07 multiproc_worker_utils.py:226] File "/data/vllm/vllm/worker/cpu_model_runner.py", line 373, in execute_model
(VllmWorkerProcess pid=304274) ERROR 09-04 16:16:07 multiproc_worker_utils.py:226] hidden_states = model_executable(**execute_model_kwargs)
(VllmWorkerProcess pid=304274) ERROR 09-04 16:16:07 multiproc_worker_utils.py:226] File "/root/miniconda3/envs/vLLM/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
(VllmWorkerProcess pid=304274) ERROR 09-04 16:16:07 multiproc_worker_utils.py:226] return self._call_impl(*args, **kwargs)
(VllmWorkerProcess pid=304274) ERROR 09-04 16:16:07 multiproc_worker_utils.py:226] File "/root/miniconda3/envs/vLLM/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
(VllmWorkerProcess pid=304274) ERROR 09-04 16:16:07 multiproc_worker_utils.py:226] return forward_call(*args, **kwargs)
(VllmWorkerProcess pid=304274) ERROR 09-04 16:16:07 multiproc_worker_utils.py:226] File "/data/vllm/vllm/model_executor/models/qwen2.py", line 361, in forward
(VllmWorkerProcess pid=304274) ERROR 09-04 16:16:07 multiproc_worker_utils.py:226] hidden_states = self.model(input_ids, positions, kv_caches,
(VllmWorkerProcess pid=304274) ERROR 09-04 16:16:07 multiproc_worker_utils.py:226] File "/root/miniconda3/envs/vLLM/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
(VllmWorkerProcess pid=304274) ERROR 09-04 16:16:07 multiproc_worker_utils.py:226] return self._call_impl(*args, **kwargs)
(VllmWorkerProcess pid=304274) ERROR 09-04 16:16:07 multiproc_worker_utils.py:226] File "/root/miniconda3/envs/vLLM/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
(VllmWorkerProcess pid=304274) ERROR 09-04 16:16:07 multiproc_worker_utils.py:226] return forward_call(*args, **kwargs)
(VllmWorkerProcess pid=304274) ERROR 09-04 16:16:07 multiproc_worker_utils.py:226] File "/data/vllm/vllm/model_executor/models/qwen2.py", line 277, in forward
(VllmWorkerProcess pid=304274) ERROR 09-04 16:16:07 multiproc_worker_utils.py:226] hidden_states, residual = layer(
(VllmWorkerProcess pid=304274) ERROR 09-04 16:16:07 multiproc_worker_utils.py:226] File "/root/miniconda3/envs/vLLM/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
(VllmWorkerProcess pid=304274) ERROR 09-04 16:16:07 multiproc_worker_utils.py:226] return self._call_impl(*args, **kwargs)
(VllmWorkerProcess pid=304274) ERROR 09-04 16:16:07 multiproc_worker_utils.py:226] File "/root/miniconda3/envs/vLLM/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
(VllmWorkerProcess pid=304274) ERROR 09-04 16:16:07 multiproc_worker_utils.py:226] return forward_call(*args, **kwargs)
(VllmWorkerProcess pid=304274) ERROR 09-04 16:16:07 multiproc_worker_utils.py:226] File "/data/vllm/vllm/model_executor/models/qwen2.py", line 206, in forward
(VllmWorkerProcess pid=304274) ERROR 09-04 16:16:07 multiproc_worker_utils.py:226] hidden_states = self.input_layernorm(hidden_states)
(VllmWorkerProcess pid=304274) ERROR 09-04 16:16:07 multiproc_worker_utils.py:226] File "/root/miniconda3/envs/vLLM/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
(VllmWorkerProcess pid=304274) ERROR 09-04 16:16:07 multiproc_worker_utils.py:226] return self._call_impl(*args, **kwargs)
(VllmWorkerProcess pid=304274) ERROR 09-04 16:16:07 multiproc_worker_utils.py:226] File "/root/miniconda3/envs/vLLM/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
(VllmWorkerProcess pid=304274) ERROR 09-04 16:16:07 multiproc_worker_utils.py:226] return forward_call(*args, **kwargs)
(VllmWorkerProcess pid=304274) ERROR 09-04 16:16:07 multiproc_worker_utils.py:226] File "/data/vllm/vllm/model_executor/custom_op.py", line 14, in forward
(VllmWorkerProcess pid=304274) ERROR 09-04 16:16:07 multiproc_worker_utils.py:226] return self._forward_method(*args, **kwargs)
(VllmWorkerProcess pid=304274) ERROR 09-04 16:16:07 multiproc_worker_utils.py:226] File "/data/vllm/vllm/model_executor/custom_op.py", line 39, in forward_cpu
(VllmWorkerProcess pid=304274) ERROR 09-04 16:16:07 multiproc_worker_utils.py:226] return self.forward_cuda(*args, **kwargs)
(VllmWorkerProcess pid=304274) ERROR 09-04 16:16:07 multiproc_worker_utils.py:226] File "/data/vllm/vllm/model_executor/layers/layernorm.py", line 62, in forward_cuda
(VllmWorkerProcess pid=304274) ERROR 09-04 16:16:07 multiproc_worker_utils.py:226] ops.rms_norm(
(VllmWorkerProcess pid=304274) ERROR 09-04 16:16:07 multiproc_worker_utils.py:226] File "/data/vllm/vllm/_custom_ops.py", line 38, in wrapper
(VllmWorkerProcess pid=304274) ERROR 09-04 16:16:07 multiproc_worker_utils.py:226] raise e
(VllmWorkerProcess pid=304274) ERROR 09-04 16:16:07 multiproc_worker_utils.py:226] File "/data/vllm/vllm/_custom_ops.py", line 29, in wrapper
(VllmWorkerProcess pid=304274) ERROR 09-04 16:16:07 multiproc_worker_utils.py:226] return fn(*args, **kwargs)
(VllmWorkerProcess pid=304274) ERROR 09-04 16:16:07 multiproc_worker_utils.py:226] File "/data/vllm/vllm/_custom_ops.py", line 156, in rms_norm
(VllmWorkerProcess pid=304274) ERROR 09-04 16:16:07 multiproc_worker_utils.py:226] torch.ops._C.rms_norm(out, input, weight, epsilon)
(VllmWorkerProcess pid=304274) ERROR 09-04 16:16:07 multiproc_worker_utils.py:226] File "/root/miniconda3/envs/vLLM/lib/python3.10/site-packages/torch/_ops.py", line 1170, in __getattr__
(VllmWorkerProcess pid=304274) ERROR 09-04 16:16:07 multiproc_worker_utils.py:226] raise AttributeError(
(VllmWorkerProcess pid=304274) ERROR 09-04 16:16:07 multiproc_worker_utils.py:226] AttributeError: '_OpNamespace' '_C' object has no attribute 'rms_norm'
(VllmWorkerProcess pid=304274) ERROR 09-04 16:16:07 multiproc_worker_utils.py:226]
ERROR 09-04 16:16:07 async_llm_engine.py:63] Engine background task failed
ERROR 09-04 16:16:07 async_llm_engine.py:63] Traceback (most recent call last):
ERROR 09-04 16:16:07 async_llm_engine.py:63] File "/data/vllm/vllm/engine/async_llm_engine.py", line 53, in _log_task_completion
ERROR 09-04 16:16:07 async_llm_engine.py:63] return_value = task.result()
ERROR 09-04 16:16:07 async_llm_engine.py:63] File "/data/vllm/vllm/engine/async_llm_engine.py", line 939, in run_engine_loop
ERROR 09-04 16:16:07 async_llm_engine.py:63] result = task.result()
ERROR 09-04 16:16:07 async_llm_engine.py:63] File "/data/vllm/vllm/engine/async_llm_engine.py", line 868, in engine_step
ERROR 09-04 16:16:07 async_llm_engine.py:63] request_outputs = await self.engine.step_async(virtual_engine)
ERROR 09-04 16:16:07 async_llm_engine.py:63] File "/data/vllm/vllm/engine/async_llm_engine.py", line 345, in step_async
ERROR 09-04 16:16:07 async_llm_engine.py:63] output = await self.model_executor.execute_model_async(
ERROR 09-04 16:16:07 async_llm_engine.py:63] File "/data/vllm/vllm/executor/cpu_executor.py", line 305, in execute_model_async
ERROR 09-04 16:16:07 async_llm_engine.py:63] output = await make_async(self.execute_model
ERROR 09-04 16:16:07 async_llm_engine.py:63] File "/root/miniconda3/envs/vLLM/lib/python3.10/concurrent/futures/thread.py", line 58, in run
ERROR 09-04 16:16:07 async_llm_engine.py:63] result = self.fn(*self.args, **self.kwargs)
ERROR 09-04 16:16:07 async_llm_engine.py:63] File "/data/vllm/vllm/executor/cpu_executor.py", line 223, in execute_model
ERROR 09-04 16:16:07 async_llm_engine.py:63] output = self.driver_method_invoker(self.driver_worker,
ERROR 09-04 16:16:07 async_llm_engine.py:63] File "/data/vllm/vllm/executor/cpu_executor.py", line 361, in _async_driver_method_invoker
ERROR 09-04 16:16:07 async_llm_engine.py:63] return driver.execute_method(method, *args, **kwargs).get()
ERROR 09-04 16:16:07 async_llm_engine.py:63] File "/data/vllm/vllm/executor/multiproc_worker_utils.py", line 58, in get
ERROR 09-04 16:16:07 async_llm_engine.py:63] raise self.result.exception
ERROR 09-04 16:16:07 async_llm_engine.py:63] AttributeError: '_OpNamespace' '_C' object has no attribute 'rms_norm'
Exception in callback functools.partial(<function _log_task_completion at 0x7efbfca83ac0>, error_callback=<bound method AsyncLLMEngine._error_callback of <vllm.engine.async_llm_engine.AsyncLLMEngine object at 0x7efbfa9ca590>>)
handle: <Handle functools.partial(<function _log_task_completion at 0x7efbfca83ac0>, error_callback=<bound method AsyncLLMEngine._error_callback of <vllm.engine.async_llm_engine.AsyncLLMEngine object at 0x7efbfa9ca590>>)>
Traceback (most recent call last):
File "/data/vllm/vllm/engine/async_llm_engine.py", line 53, in _log_task_completion
return_value = task.result()
File "/data/vllm/vllm/engine/async_llm_engine.py", line 939, in run_engine_loop
result = task.result()
File "/data/vllm/vllm/engine/async_llm_engine.py", line 868, in engine_step
request_outputs = await self.engine.step_async(virtual_engine)
File "/data/vllm/vllm/engine/async_llm_engine.py", line 345, in step_async
output = await self.model_executor.execute_model_async(
File "/data/vllm/vllm/executor/cpu_executor.py", line 305, in execute_model_async
output = await make_async(self.execute_model
File "/root/miniconda3/envs/vLLM/lib/python3.10/concurrent/futures/thread.py", line 58, in run
result = self.fn(*self.args, **self.kwargs)
File "/data/vllm/vllm/executor/cpu_executor.py", line 223, in execute_model
output = self.driver_method_invoker(self.driver_worker,
File "/data/vllm/vllm/executor/cpu_executor.py", line 361, in _async_driver_method_invoker
return driver.execute_method(method, *args, **kwargs).get()
File "/data/vllm/vllm/executor/multiproc_worker_utils.py", line 58, in get
raise self.result.exception
AttributeError: '_OpNamespace' '_C' object has no attribute 'rms_norm'
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "uvloop/cbhandles.pyx", line 63, in uvloop.loop.Handle._run
File "/data/vllm/vllm/engine/async_llm_engine.py", line 65, in _log_task_completion
raise AsyncEngineDeadError(
vllm.engine.async_llm_engine.AsyncEngineDeadError: Task finished unexpectedly. This should never happen! Please open an issue on Github. See stack trace above for the actual cause.
ERROR 09-04 16:16:07 client.py:266] Got Unhealthy response from RPC Server
ERROR 09-04 16:16:07 client.py:412] AsyncEngineDeadError('Background loop is stopped.')
ERROR 09-04 16:16:07 client.py:412] Traceback (most recent call last):
ERROR 09-04 16:16:07 client.py:412] File "/data/vllm/vllm/entrypoints/openai/rpc/client.py", line 409, in generate
ERROR 09-04 16:16:07 client.py:412] await self.check_health(socket=socket)
ERROR 09-04 16:16:07 client.py:412] File "/data/vllm/vllm/entrypoints/openai/rpc/client.py", line 429, in check_health
ERROR 09-04 16:16:07 client.py:412] await self._send_one_way_rpc_request(
ERROR 09-04 16:16:07 client.py:412] File "/data/vllm/vllm/entrypoints/openai/rpc/client.py", line 267, in _send_one_way_rpc_request
ERROR 09-04 16:16:07 client.py:412] raise response
ERROR 09-04 16:16:07 client.py:412] vllm.engine.async_llm_engine.AsyncEngineDeadError: Background loop is stopped.
INFO: 127.0.0.1:38558 - "POST /v1/chat/completions HTTP/1.1" 500 Internal Server Error
ERROR: Exception in ASGI application
Traceback (most recent call last):
File "/root/miniconda3/envs/vLLM/lib/python3.10/site-packages/uvicorn/protocols/http/httptools_impl.py", line 401, in run_asgi
result = await app( # type: ignore[func-returns-value]
File "/root/miniconda3/envs/vLLM/lib/python3.10/site-packages/uvicorn/middleware/proxy_headers.py", line 70, in __call__
return await self.app(scope, receive, send)
File "/root/miniconda3/envs/vLLM/lib/python3.10/site-packages/fastapi/applications.py", line 1054, in __call__
await super().__call__(scope, receive, send)
File "/root/miniconda3/envs/vLLM/lib/python3.10/site-packages/starlette/applications.py", line 113, in __call__
await self.middleware_stack(scope, receive, send)
File "/root/miniconda3/envs/vLLM/lib/python3.10/site-packages/starlette/middleware/errors.py", line 187, in __call__
raise exc
File "/root/miniconda3/envs/vLLM/lib/python3.10/site-packages/starlette/middleware/errors.py", line 165, in __call__
await self.app(scope, receive, _send)
File "/root/miniconda3/envs/vLLM/lib/python3.10/site-packages/starlette/middleware/cors.py", line 85, in __call__
await self.app(scope, receive, send)
File "/root/miniconda3/envs/vLLM/lib/python3.10/site-packages/starlette/middleware/exceptions.py", line 62, in __call__
await wrap_app_handling_exceptions(self.app, conn)(scope, receive, send)
File "/root/miniconda3/envs/vLLM/lib/python3.10/site-packages/starlette/_exception_handler.py", line 62, in wrapped_app
raise exc
File "/root/miniconda3/envs/vLLM/lib/python3.10/site-packages/starlette/_exception_handler.py", line 51, in wrapped_app
await app(scope, receive, sender)
File "/root/miniconda3/envs/vLLM/lib/python3.10/site-packages/starlette/routing.py", line 715, in __call__
await self.middleware_stack(scope, receive, send)
File "/root/miniconda3/envs/vLLM/lib/python3.10/site-packages/starlette/routing.py", line 735, in app
await route.handle(scope, receive, send)
File "/root/miniconda3/envs/vLLM/lib/python3.10/site-packages/starlette/routing.py", line 288, in handle
await self.app(scope, receive, send)
File "/root/miniconda3/envs/vLLM/lib/python3.10/site-packages/starlette/routing.py", line 76, in app
await wrap_app_handling_exceptions(app, request)(scope, receive, send)
File "/root/miniconda3/envs/vLLM/lib/python3.10/site-packages/starlette/_exception_handler.py", line 62, in wrapped_app
raise exc
File "/root/miniconda3/envs/vLLM/lib/python3.10/site-packages/starlette/_exception_handler.py", line 51, in wrapped_app
await app(scope, receive, sender)
File "/root/miniconda3/envs/vLLM/lib/python3.10/site-packages/starlette/routing.py", line 73, in app
response = await f(request)
File "/root/miniconda3/envs/vLLM/lib/python3.10/site-packages/fastapi/routing.py", line 297, in app
raw_response = await run_endpoint_function(
File "/root/miniconda3/envs/vLLM/lib/python3.10/site-packages/fastapi/routing.py", line 210, in run_endpoint_function
return await dependant.call(**values)
File "/data/vllm/vllm/entrypoints/openai/api_server.py", line 286, in create_chat_completion
generator = await openai_serving_chat.create_chat_completion(
File "/data/vllm/vllm/entrypoints/openai/serving_chat.py", line 191, in create_chat_completion
return await self.chat_completion_full_generator(
File "/data/vllm/vllm/entrypoints/openai/serving_chat.py", line 441, in chat_completion_full_generator
async for res in result_generator:
File "/data/vllm/vllm/utils.py", line 432, in iterate_with_cancellation
item = await awaits[0]
File "/data/vllm/vllm/entrypoints/openai/rpc/client.py", line 416, in generate
raise request_output
AttributeError: '_OpNamespace' '_C' object has no attribute 'rms_norm'
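As the error text itself suggests, the Python sources and the compiled _C extension are out of sync (an obsolete or incomplete build). The fix recommended by the message is a clean rebuild: remove stale artifacts such as vllm/*cpython*.so and the build/ directory, then re-run VLLM_TARGET_DEVICE=cpu python setup.py install. After rebuilding, a small check (a sketch; it assumes importing vllm._custom_ops loads the compiled extension, as the traceback above indicates) confirms that rms_norm is registered:
# Check that the compiled custom ops are registered after a clean rebuild.
import torch
import vllm._custom_ops  # noqa: F401  # importing this loads the compiled _C extension

# torch op namespaces raise AttributeError for missing ops, so hasattr is a safe probe
print(hasattr(torch.ops._C, "rms_norm"))  # expect True after a successful CPU build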
