Set Up Zhihu Data with Hugging Face Datasets

We want to fine-tune Phi-3-mini with Zhihu's data. The first step is to import Zhihu's KOL data.

Note: Please create your notebook (download_hf_zhihu_datasets.ipynb) in the datasets folder.

0. Confirm Your Env

  • Change to the datasets folder.

pip install datasets -U

pip install transformers -U

cd datasets/

huggingface-cli download --repo-type dataset --resume-download wangrui6/Zhihu-KOL --local-dir <YOUR_DATASET_DIR> 

Open VS Code on your local machine

  • Install the VS Code SSH plugin, as shown in the following figure: vscode01

  • Connect to the remote server via the SSH plugin. vscode02 vscode03

  • Click File -> Open Directory, select the zhihu folder, and click the confirm icon. vscode05

  • Navigate to the 01.download_phi3_mini.ipynb file, click Select Kernel in the right corner, and choose `hfdev (python 3.10.12)`.
    vscode04
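
Once the kernel is selected, it can help to confirm that the packages installed above are visible from that kernel. The following is a minimal sketch using only the standard library; the helper name `installed_versions` is hypothetical and not part of the original lab.

```python
from importlib import metadata


def installed_versions(packages):
    """Return {package: version string, or None if the package is not installed}."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


# The two packages installed with pip in this step.
for name, version in installed_versions(["datasets", "transformers"]).items():
    print(f"{name}: {version or 'NOT INSTALLED'}")
```

If either package is reported as not installed, the notebook is probably running on a different kernel than the environment where pip was run.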

1. Load the data into CSV and save it as JSON


import csv
import json

from datasets import load_dataset

dataset = load_dataset('<YOUR_DATASET_DIR>')


# This is a lab environment, so keep only the first 2000 records from Zhihu-KOL
dataset['train'].take(2000).to_csv('zhihu_dataset_train.csv')


fieldnames = ("INSTRUCTION", "RESPONSE", "SOURCE", "METADATA")

# newline='' lets the csv module handle line breaks embedded in quoted fields
with open('zhihu_dataset_train.csv', 'r', encoding="utf8", newline='') as csvfile, \
        open('zhihu_dataset_train.json', 'w', encoding="utf8") as jsonfile:
    reader = csv.DictReader(csvfile, fieldnames)
    next(reader, None)  # skip the header row written by to_csv
    for row in reader:
        # print(row)  # uncomment to inspect each record
        try:
            # Write one JSON object per line
            json.dump(row, jsonfile, ensure_ascii=False)
            jsonfile.write('\n')
        except ValueError:
            continue
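
To see what this conversion produces without downloading the dataset, the same idea can be tried on a tiny in-memory sample. This is only a sketch: the function name `csv_to_jsonl` and the sample rows are made up for illustration and are not taken from Zhihu-KOL.

```python
import csv
import io
import json

FIELDNAMES = ("INSTRUCTION", "RESPONSE", "SOURCE", "METADATA")


def csv_to_jsonl(csv_stream, jsonl_stream, fieldnames=FIELDNAMES):
    """Copy CSV rows to JSON Lines, skipping the header row. Returns rows written."""
    reader = csv.DictReader(csv_stream, fieldnames)
    next(reader, None)  # the first row is the header, not data
    written = 0
    for row in reader:
        # ensure_ascii=False keeps Chinese text readable instead of \uXXXX escapes
        json.dump(row, jsonl_stream, ensure_ascii=False)
        jsonl_stream.write("\n")
        written += 1
    return written


# Made-up sample rows, for illustration only.
sample_csv = io.StringIO(
    "INSTRUCTION,RESPONSE,SOURCE,METADATA\n"
    '问题一,"回答一, 含逗号",Zhihu,{}\n'
    "问题二,回答二,Zhihu,{}\n"
)
sample_jsonl = io.StringIO()
count = csv_to_jsonl(sample_csv, sample_jsonl)
print(count)
print(sample_jsonl.getvalue())
```

Each output line is a standalone JSON object, which is why the next step can read the file back one line at a time.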


2. Clean your data



# Read the file back line by line, keeping only the lines that parse as valid JSON
data = []
with open('zhihu_dataset_train.json', 'r', encoding="utf8") as file:
    for line in file:
        try:
            data.append(json.loads(line))
        except ValueError:
            continue
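
The loop above only drops lines that are not valid JSON. If stricter cleaning is wanted, one possible extension is to also drop records whose question or answer is blank. The sketch below assumes that rule; it is not part of the original lab, and `load_clean_records` is a hypothetical helper name.

```python
import json


def load_clean_records(lines, required=("INSTRUCTION", "RESPONSE")):
    """Parse JSON Lines, dropping malformed lines and records with blank required fields."""
    records = []
    for line in lines:
        try:
            record = json.loads(line)
        except ValueError:
            continue  # as in the loop above: skip lines that are not valid JSON
        if not isinstance(record, dict):
            continue
        # Assumed extra rule: every required field must hold non-blank text.
        if all(str(record.get(key) or "").strip() for key in required):
            records.append(record)
    return records


# Made-up sample lines, for illustration only.
sample_lines = [
    '{"INSTRUCTION": "Q1", "RESPONSE": "A1"}',
    '{"INSTRUCTION": "Q2", "RESPONSE": ""}',
    'not valid json',
]
print(load_clean_records(sample_lines))
```

In the notebook, the same function could be given the open file object instead of `sample_lines`, since iterating over a file yields its lines.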


3. Save your data



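import json

# Write every cleaned record to datasets.json, one JSON object per line.
# The CSV header row was already skipped in step 1, so no record is dropped here.
with open('datasets.json', 'w', encoding="utf8") as f:
    for record in data:
        json.dump(record, f, ensure_ascii=False)
        f.write('\n')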
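
Before moving on, it may be worth checking that the saved file is well formed. The following sketch counts how many lines parse as JSON objects containing the four expected fields; `count_valid_records` is a hypothetical helper, and the demonstration runs on a throwaway file rather than the real datasets.json.

```python
import json
import os
import tempfile


def count_valid_records(path, required=("INSTRUCTION", "RESPONSE", "SOURCE", "METADATA")):
    """Return (valid, total) counts over the non-empty lines of a JSON Lines file."""
    valid = total = 0
    with open(path, "r", encoding="utf8") as f:
        for line in f:
            if not line.strip():
                continue
            total += 1
            try:
                record = json.loads(line)
            except ValueError:
                continue
            if isinstance(record, dict) and all(key in record for key in required):
                valid += 1
    return valid, total


# Demonstration on a throwaway file; in the notebook, pass 'datasets.json' instead.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo.json")
    with open(demo_path, "w", encoding="utf8") as f:
        f.write('{"INSTRUCTION": "Q", "RESPONSE": "A", "SOURCE": "Zhihu", "METADATA": "{}"}\n')
        f.write("broken line\n")
    demo_counts = count_valid_records(demo_path)
    print(demo_counts)
```

If the two counts differ on the real file, some lines were written incorrectly and the earlier steps should be rerun.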


Congratulations!

Your data has been successfully loaded. Next, configure your data and the related algorithms with Microsoft Olive, as described in E2E_LoRA&QLoRA_Config_With_Olive.md.