HuggingFace provides a complete ecosystem for AI researchers and developers through services such as shared models, datasets, and hosted Spaces. This article covers how to use HuggingFace models and datasets.
1. Model Operations and Usage
1.1 Customizing the Storage Directory
To store models and datasets somewhere other than the default location, set the HF_HOME environment variable:

```
export HF_HOME=/Volumes/Data/HuggingFace
```

Otherwise everything is stored under the default directory, ~/.cache/huggingface.
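Note that HF_HOME must be set before the HuggingFace libraries are imported. As a small sketch (the path below is just a placeholder), it can also be set from Python:

```python
import os

# Set the cache root before importing transformers/datasets;
# the path is a placeholder for illustration only.
os.environ["HF_HOME"] = "/tmp/hf_home_demo"

print(os.environ["HF_HOME"])
```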
1.2 Downloading Models
Method 1: download from the model page.
Visit https://huggingface.co/LinkSoul/Chinese-Llama-2-7b/tree/main and click the download icon next to each file in the file list.
Method 2: download with Git LFS.
After installing git-lfs, clone the repository to download the model locally:

```
git clone https://huggingface.co/LinkSoul/Chinese-Llama-2-7b
```
Method 3: download with huggingface_hub.

```
pip install huggingface_hub
```

```python
from huggingface_hub import snapshot_download

snapshot_download(repo_id="LinkSoul/Chinese-Llama-2-7b")
```
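snapshot_download also accepts optional arguments such as revision and allow_patterns to limit what gets fetched. As a small offline sketch, huggingface_hub can also construct the direct download URL for a single file without touching the network (config.json here is just an example filename):

```python
from huggingface_hub import hf_hub_url

# Build the direct download URL for one file in the repo;
# no network access is needed to construct the URL.
url = hf_hub_url(repo_id="LinkSoul/Chinese-Llama-2-7b", filename="config.json")
print(url)
```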
Method 4: download on first use via transformers.

```
pip install transformers
```

```python
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("LinkSoul/Chinese-Llama-2-7b")
```
1.3 Working with Models
Load a model:

```python
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("LinkSoul/Chinese-Llama-2-7b")
```

Save a loaded model to a custom path:

```python
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("LinkSoul/Chinese-Llama-2-7b")
model.save_pretrained("/Volumes/Data/HuggingFace/Chinese-Llama-2-7b-v2")
```
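The save_pretrained / from_pretrained round trip works for other HuggingFace objects as well, not just full models. A minimal offline sketch using a tiny, hypothetical Llama configuration (the sizes are made up for illustration; no weights are downloaded):

```python
import tempfile

from transformers import LlamaConfig

# Build a tiny config locally (hypothetical sizes, illustration only)
cfg = LlamaConfig(hidden_size=64, num_hidden_layers=2, num_attention_heads=4)

with tempfile.TemporaryDirectory() as tmp:
    cfg.save_pretrained(tmp)              # writes config.json
    reloaded = LlamaConfig.from_pretrained(tmp)
    hidden = reloaded.hidden_size
    layers = reloaded.num_hidden_layers

print(hidden, layers)
```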
1.4 Using Models
```
pip install transformers torch
```

```python
from transformers import EncoderDecoderModel, AutoTokenizer

model_id = "raynardj/wenyanwen-chinese-translate-to-ancient"
model = EncoderDecoderModel.from_pretrained(model_id)
tokenizer = AutoTokenizer.from_pretrained(model_id)

def chat(text):
    input_ids = tokenizer.encode(text, return_tensors='pt')
    output = model.generate(input_ids, max_length=40)
    return tokenizer.decode(output[0], skip_special_tokens=True)

chat("你好")
```
2. Dataset Operations and Usage
2.1 Downloading Datasets
Start IPython and load a dataset:

```python
In [1]: import datasets

In [2]: remote_datasets = datasets.load_dataset("fka/awesome-chatgpt-prompts")
```
The dataset is downloaded to the $HF_HOME/datasets directory. As with models, datasets can also be downloaded from the web page or via Git LFS, which is not repeated here.
```
tree -L 2 $HF_HOME/datasets
/Volumes/Data/HuggingFace/datasets
├── _Volumes_Data_HuggingFace_datasets_fka___awesome-chatgpt-prompts_default-18237255be23cc62_0.0.0_eea64c71ca8b46dd3f537ed218fc9bf495d5707789152eb2764f5c78fa66d59d.lock
├── downloads
│   ├── 7528ed6bf521cf4a58ed283bfa5ba864e12c7203ad53ea3495ba45326e30768a
│   ├── 7528ed6bf521cf4a58ed283bfa5ba864e12c7203ad53ea3495ba45326e30768a.json
│   ├── 7528ed6bf521cf4a58ed283bfa5ba864e12c7203ad53ea3495ba45326e30768a.lock
│   ├── f41fd13f9d4e803c35d9543c56b1d887676f17d84d10e3a428ad1e46bcce6c78.8fbabec58cee4e6f69e20f509619af34f2b4ed0052c2c39ca0d73a47e1035a8b
│   ├── f41fd13f9d4e803c35d9543c56b1d887676f17d84d10e3a428ad1e46bcce6c78.8fbabec58cee4e6f69e20f509619af34f2b4ed0052c2c39ca0d73a47e1035a8b.json
│   └── f41fd13f9d4e803c35d9543c56b1d887676f17d84d10e3a428ad1e46bcce6c78.8fbabec58cee4e6f69e20f509619af34f2b4ed0052c2c39ca0d73a47e1035a8b.lock
└── fka___awesome-chatgpt-prompts
    └── default-18237255be23cc62
```
Note that the storage path is not simply fka/awesome-chatgpt-prompts, so you cannot load this cached copy with datasets.load_from_disk("fka/awesome-chatgpt-prompts"); load_from_disk is meant for datasets obtained by direct download, Git LFS, and similar methods.
2.2 Working with Datasets
```python
In [3]: remote_datasets
DatasetDict({
    train: Dataset({
        features: ['act', 'prompt'],
        num_rows: 153
    })
})
```
There are 153 rows in total, each with two fields: act and prompt.
```python
In [4]: remote_datasets["train"][0]
Out[4]: {'act': 'Linux Terminal',
 'prompt': 'I want you to act as a linux terminal. I will type commands and you will reply with what the terminal should show. I want you to only reply with the terminal output inside one unique code block, and nothing else. do not write explanations. do not type commands unless I instruct you to do so. when i need to tell you something in english, i will do so by putting text inside curly brackets {like this}. my first command is pwd'}
```
Take the first ten rows with select:

```python
In [5]: remote_datasets["train"].select(range(10))
Out[5]:
Dataset({
    features: ['act', 'prompt'],
    num_rows: 10
})
```
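Conceptually, select(range(10)) keeps the rows at the given indices, in order, much like slicing. A plain-Python sketch of the semantics (hypothetical rows, not the library implementation):

```python
# Hypothetical rows standing in for the 153-row dataset
rows = [{"act": f"role-{i}", "prompt": f"prompt-{i}"} for i in range(153)]

# select(range(10)) keeps the rows at the given indices, in order
subset = [rows[i] for i in range(10)]
print(len(subset))
```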
Rename a column with rename_column:

```python
In [5]: new_datasets = remote_datasets.rename_column("act", "actor")

In [6]: new_datasets
Out[6]:
DatasetDict({
    train: Dataset({
        features: ['actor', 'prompt'],
        num_rows: 153
    })
})
```
Keep only matching rows with filter:

```python
In [7]: new_datasets.filter(lambda x: "Linux" in x["actor"])
Out[7]:
DatasetDict({
    train: Dataset({
        features: ['actor', 'prompt'],
        num_rows: 1
    })
})
```
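filter keeps the rows for which the predicate returns True. In plain Python the same logic is a list comprehension (hypothetical rows for illustration):

```python
rows = [
    {"actor": "Linux Terminal"},
    {"actor": "Translator"},
    {"actor": "Linux Admin"},
]

# Keep rows whose predicate is True, as filter() does
kept = [r for r in rows if "Linux" in r["actor"]]
print(len(kept))
```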
Transform every row with map:

```python
In [8]: new_datasets.map(lambda x: {"actor": x["actor"].upper(), "prompt": x["prompt"]})["train"][0]
Out[8]:
{'actor': 'LINUX TERMINAL',
 'prompt': 'I want you to act as a linux terminal. I will type commands and you will reply with what the terminal should show. I want you to only reply with the terminal output inside one unique code block, and nothing else. do not write explanations. do not type commands unless I instruct you to do so. when i need to tell you something in english, i will do so by putting text inside curly brackets {like this}. my first command is pwd'}
```
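map applies the function to each row and builds a new dataset from the returned dicts. A plain-Python sketch of the same transformation (one hypothetical row):

```python
rows = [{"actor": "Linux Terminal", "prompt": "act as a linux terminal"}]

# Apply the transform to every row, as map() does
mapped = [{"actor": r["actor"].upper(), "prompt": r["prompt"]} for r in rows]
print(mapped[0]["actor"])
```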
Sort by a column with sort:

```python
In [9]: new_datasets.sort("actor")["train"][0]
Out[9]:
{'actor': 'AI Assisted Doctor',
 'prompt': 'I want you to act as an AI assisted doctor. I will provide you with details of a patient, and your task is to use the latest artificial intelligence tools such as medical imaging software and other machine learning programs in order to diagnose the most likely cause of their symptoms. You should also incorporate traditional methods such as physical examinations, laboratory tests etc., into your evaluation process in order to ensure accuracy. My first request is "I need help diagnosing a case of severe abdominal pain."'}
```
Shuffle with a fixed seed for reproducibility:

```python
In [10]: new_datasets.shuffle(seed=42)["train"][0]
Out[10]:
{'actor': 'Tech Reviewer:',
 'prompt': 'I want you to act as a tech reviewer. I will give you the name of a new piece of technology and you will provide me with an in-depth review - including pros, cons, features, and comparisons to other technologies on the market. My first suggestion request is "I am reviewing iPhone 11 Pro Max".'}
```
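Passing seed=42 makes the shuffle deterministic: the same seed always produces the same row order. The idea in plain Python with the standard library:

```python
import random

rows = list(range(153))

# Two shuffles seeded identically produce identical orderings
a, b = rows[:], rows[:]
random.Random(42).shuffle(a)
random.Random(42).shuffle(b)
print(a == b)
```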
These operations can be chained, e.g. shuffle followed by select:

```python
In [11]: new_datasets.shuffle(seed=42)["train"].select(range(10))
Out[11]:
Dataset({
    features: ['actor', 'prompt'],
    num_rows: 10
})
```
Finally, persist the processed dataset:

```python
new_datasets.save_to_disk("fka_awesome-chatgpt-prompts_2")
```

The data is saved to fka_awesome-chatgpt-prompts_2 under the IPython working directory, from which it can later be reloaded with load_from_disk.