./main -m ./models/llama-2-7b-langchain-chat-GGUF/llama-2-7b-langchain-chat-q4_0.gguf -p "What color is the sun?" -n 1024

What color is the sun?
nobody knows. It's not a specific color, more a range of colors. Some people say it's yellow; some say orange, while others believe it to be red or white. Ultimately, we can only imagine what color the sun might be because we can't see its exact color from this planet due to its immense distance away!
It's fascinating how something so fundamental to our daily lives remains a mystery even after decades of scientific inquiry into its properties and behavior." [end of text]

llama_print_timings:        load time =   376.57 ms
llama_print_timings:      sample time =    56.40 ms /   105 runs   (   0.54 ms per token,  1861.77 tokens per second)
llama_print_timings: prompt eval time =   366.68 ms /     7 tokens (  52.38 ms per token,    19.09 tokens per second)
llama_print_timings:        eval time = 15946.81 ms /   104 runs   ( 153.33 ms per token,     6.52 tokens per second)
llama_print_timings:       total time = 16401.43 ms
Of course, you can also run inference with the quantized model produced above.
./main -m ./models/Llama-2-7b-chat-hf/ggml-model-q4_0.gguf -p "What color is the sun?" -n 1024

What color is the sun?
sierp 10, 2017 at 12:04 pm - Reply
The sun does not have a color because it emits light in all wavelengths of the visible spectrum and beyond. However, due to our atmosphere's scattering properties, the sun appears yellow or orange from Earth. This is known as Rayleigh scattering and is why the sky appears blue during the daytime. [end of text]

llama_print_timings:        load time = 90612.21 ms
llama_print_timings:      sample time =    52.31 ms /    91 runs   (   0.57 ms per token,  1739.76 tokens per second)
llama_print_timings: prompt eval time =   523.38 ms /     7 tokens (  74.77 ms per token,    13.37 tokens per second)
llama_print_timings:        eval time = 15266.91 ms /    90 runs   ( 169.63 ms per token,     5.90 tokens per second)
llama_print_timings:       total time = 15911.47 ms
The earlier build step also produces a server executable in the root of the llama.cpp project. Run the following command to start an API service.
./server -m ./models/llama-2-7b-langchain-chat-GGUF/llama-2-7b-langchain-chat-q4_0.gguf --host 0.0.0.0 --port 8080

llm_load_tensors: mem required = 3647.96 MB (+ 256.00 MB per state)
..................................................................................................
llama_new_context_with_model: kv self size = 256.00 MB
llama_new_context_with_model: compute buffer total size = 71.97 MB
llama server listening at http://0.0.0.0:8080
{"timestamp":1693789480,"level":"INFO","function":"main","line":1593,"message":"HTTP server listening","hostname":"0.0.0.0","port":8080}
This starts an API service, which you can test with curl.
curl --request POST \
  --url http://localhost:8080/completion \
  --header "Content-Type: application/json" \
  --data '{"prompt": "What color is the sun?","n_predict": 512}'

{"content":".....","generation_settings":{"frequency_penalty":0.0,"grammar":"","ignore_eos":false,"logit_bias":[],"mirostat":0,"mirostat_eta":0.10000000149011612,"mirostat_tau":5.0,......}}
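The same request can also be made from code. Below is a minimal Python sketch of a client for the server's /completion endpoint, using only the standard library; the URL, the `prompt`/`n_predict` request fields, and the `content` response field all mirror the curl example above, while the helper names (`build_payload`, `complete`) are my own.

```python
# Minimal client sketch for the llama.cpp server /completion endpoint.
# Host/port match the --host/--port flags used to start ./server above.
import json
import urllib.request

SERVER_URL = "http://localhost:8080/completion"


def build_payload(prompt: str, n_predict: int = 512) -> bytes:
    """Serialize the JSON request body the /completion endpoint expects."""
    return json.dumps({"prompt": prompt, "n_predict": n_predict}).encode("utf-8")


def complete(prompt: str, n_predict: int = 512) -> str:
    """POST a prompt and return the generated text from the 'content' field."""
    req = urllib.request.Request(
        SERVER_URL,
        data=build_payload(prompt, n_predict),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["content"]


# Example (requires the server started above to be running):
#   print(complete("What color is the sun?"))
```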
curl -X 'POST' \
  'http://localhost:8000/v1/chat/completions' \
  -H 'accept: application/json' \
  -H 'Content-Type: application/json' \
  -d '{
  "messages": [
    {
      "content": "You are a helpful assistant.",
      "role": "system"
    },
    {
      "content": "Write a poem for Chinese?",
      "role": "user"
    }
  ]
}'

{"id":"chatcmpl-c3eec466-6073-41e2-817f-9d1e307ab55f","object":"chat.completion","created":1693829165,"model":"./models/llama-2-7b-langchain-chat-GGUF/llama-2-7b-langchain-chat-q4_0.gguf","choices":[{"index":0,"message":{"role":"assistant","content":"I am not programmed to write poems in different languages. How about I"},"finish_reason":"length"}],"usage":{"prompt_tokens":26,"completion_tokens":16,"total_tokens":42}}
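The OpenAI-compatible chat endpoint can likewise be called from Python. This is a sketch under the same assumptions as the curl example above (host localhost, port 8000, path /v1/chat/completions, the `messages` request format, and the `choices[0].message.content` response shape); the helper names are illustrative.

```python
# Minimal client sketch for the OpenAI-compatible /v1/chat/completions
# endpoint shown in the curl example above. Uses only the standard library.
import json
import urllib.request

CHAT_URL = "http://localhost:8000/v1/chat/completions"


def build_chat_body(system: str, user: str) -> bytes:
    """Build the 'messages' array in the format the endpoint expects."""
    return json.dumps({
        "messages": [
            {"role": "system", "content": system},
            {"role": "user", "content": user},
        ]
    }).encode("utf-8")


def chat(system: str, user: str) -> str:
    """Send one chat turn and return the assistant's reply text."""
    req = urllib.request.Request(
        CHAT_URL,
        data=build_chat_body(system, user),
        headers={"Content-Type": "application/json", "accept": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["choices"][0]["message"]["content"]


# Example (requires the chat server on port 8000 to be running):
#   print(chat("You are a helpful assistant.", "Write a poem for Chinese?"))
```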