Set Up Zhihu Data with Hugging Face Datasets
We hope to inject Zhihu's data into Phi-3-mini. The first step is to import the Zhihu-KOL dataset.
Note: Please create your notebook (download_hf_zhihu_datasets.ipynb) in the datasets folder.
0. Confirm Your Environment
- Replace <YOUR_DATASET_DIR> in the commands below with the path to your datasets folder.
pip install datasets -U
pip install transformers -U
cd datasets/
huggingface-cli download --repo-type dataset --resume-download wangrui6/Zhihu-KOL --local-dir <YOUR_DATASET_DIR>
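Before moving on, it can help to confirm that both packages are visible in the active environment. The following is a minimal sketch that uses only the Python standard library; the helper name `installed_version` is hypothetical and not part of the lab.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package: str):
    """Return the installed version string of a package, or None if it is missing."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


# Report the two packages installed with pip above.
for name in ("datasets", "transformers"):
    found = installed_version(name)
    print(f"{name}: {found if found else 'not installed'}")
```

If either package is reported as not installed, the notebook kernel is probably not using the environment in which you ran pip.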
Open VS Code on your local machine
- Install the VS Code SSH plugin, as shown in the following figure.
- Connect to the remote server via the SSH plugin.
- Click File -> Open directory, select zhihu, and click the confirm icon.
- Navigate to the 01.download_phi3_mini.ipynb file, click "select kernel" in the right corner, and choose hfdev (Python 3.10.12).
1. Load the data into CSV and save it as JSON
import csv
import json

from datasets import load_dataset

dataset = load_dataset('<YOUR_DATASET_DIR>')

# This is a lab environment, so take only the first 2000 rows of Zhihu-KOL
dataset['train'].take(2000).to_csv('zhihu_dataset_train.csv')

fieldnames = ("INSTRUCTION", "RESPONSE", "SOURCE", "METADATA")

# Convert the CSV file to one JSON object per line.
# The with statement closes both files, so the output is fully written
# before the next step reads it.
with open('zhihu_dataset_train.csv', 'r', encoding="utf8", newline='') as csvfile, \
        open('zhihu_dataset_train.json', 'w', encoding="utf8") as jsonfile:
    reader = csv.DictReader(csvfile, fieldnames)
    next(reader, None)  # skip the header row
    for row in reader:
        print(row)
        try:
            json.dump(row, jsonfile, ensure_ascii=False)
            jsonfile.write('\n')
        except ValueError:
            continue
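The code above skips the first row because `csv.DictReader` treats the header line as ordinary data whenever field names are passed explicitly. The sketch below illustrates this on a small in-memory sample instead of the real dataset; the sample rows and the helper name `csv_to_jsonl` are made up for illustration.

```python
import csv
import io
import json

FIELDNAMES = ("INSTRUCTION", "RESPONSE", "SOURCE", "METADATA")


def csv_to_jsonl(csv_text: str) -> str:
    """Convert CSV text that starts with a header row into JSON Lines text."""
    reader = csv.DictReader(io.StringIO(csv_text, newline=""), FIELDNAMES)
    # With explicit field names the header arrives as a data row, so skip it.
    next(reader, None)
    out = io.StringIO()
    for row in reader:
        json.dump(row, out, ensure_ascii=False)
        out.write("\n")
    return out.getvalue()


# Hypothetical two-row sample using the same column layout as Zhihu-KOL.
sample = (
    "INSTRUCTION,RESPONSE,SOURCE,METADATA\n"
    "问题一,回答一,Zhihu,{}\n"
    "问题二,回答二,Zhihu,{}\n"
)
jsonl = csv_to_jsonl(sample)
print(jsonl)
```

Passing `ensure_ascii=False` keeps the Chinese text readable in the output file instead of escaping it as `\uXXXX` sequences.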
2. Clean your data
data = []
with open('zhihu_dataset_train.json', 'r', encoding="utf8") as file:
    for line in file:
        try:
            data.append(json.loads(line))
        except ValueError:
            # Skip lines that are not valid JSON
            continue
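The loop above only drops lines that are not valid JSON. If you also want to discard records with an empty question or answer, one possible extension is sketched below. The rule itself (requiring non-empty `INSTRUCTION` and `RESPONSE` fields) and the helper name `is_usable` are assumptions, not part of the original lab.

```python
def is_usable(record: dict) -> bool:
    """Return True when both INSTRUCTION and RESPONSE contain non-whitespace text."""
    for key in ("INSTRUCTION", "RESPONSE"):
        value = record.get(key)
        if not isinstance(value, str) or not value.strip():
            return False
    return True


# Hypothetical records standing in for the parsed `data` list.
sample_data = [
    {"INSTRUCTION": "问题一", "RESPONSE": "回答一", "SOURCE": "Zhihu", "METADATA": "{}"},
    {"INSTRUCTION": "问题二", "RESPONSE": "   ", "SOURCE": "Zhihu", "METADATA": "{}"},
    {"INSTRUCTION": "", "RESPONSE": "回答三", "SOURCE": "Zhihu", "METADATA": "{}"},
]
cleaned = [record for record in sample_data if is_usable(record)]
print(len(cleaned))
```

In the notebook, the same list comprehension could be applied to `data` before saving it.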
3. Save your data
import json

# Write every record except the first (index 0), one JSON object per line
with open('datasets.json', 'w', encoding="utf8") as f:
    for i in range(len(data)):
        if i > 0:
            json.dump(data[i], f, ensure_ascii=False)
            f.write('\n')
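To check the saved file, you can read it back and count the records. This is a minimal sketch that writes to a temporary file so that it runs on its own; in the lab you would pass 'datasets.json' instead. The helper name `read_jsonl` and the sample records are hypothetical.

```python
import json
import os
import tempfile


def read_jsonl(path: str) -> list:
    """Read a JSON Lines file and return its records, skipping blank lines."""
    records = []
    with open(path, "r", encoding="utf8") as f:
        for line in f:
            if line.strip():
                records.append(json.loads(line))
    return records


# Write two hypothetical records, then read them back.
records_in = [
    {"INSTRUCTION": "问题一", "RESPONSE": "回答一"},
    {"INSTRUCTION": "问题二", "RESPONSE": "回答二"},
]
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "datasets.json")
    with open(path, "w", encoding="utf8") as f:
        for record in records_in:
            json.dump(record, f, ensure_ascii=False)
            f.write("\n")
    records_out = read_jsonl(path)

print(len(records_out))
```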
Congratulations!
Your data has been successfully loaded. Next, configure your data and the related algorithms with Microsoft Olive, as described in E2E_LoRA&QLoRA_Config_With_Olive.md.