DVC is a command-line tool, similar to Git, that helps researchers version and manage their data
environment
python -m pip install dvc
workflow
#initialize DVC in the repo; this generates the .dvc folder
dvc init
#disable anonymous usage analytics
dvc config core.analytics false
#add a default remote storage location for DVC
dvc remote add -d remote_storage path/to/your/dvc_remote
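At this point the remote is recorded in .dvc/config. Assuming the commands above, the file should look roughly like this (DVC may rewrite a local path relative to the .dvc/ folder):

[core]
    analytics = false
    remote = remote_storage
['remote "remote_storage"']
    url = path/to/your/dvc_remote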
#track files/folders with DVC
dvc add data/raw/train
dvc add data/raw/val
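Each dvc add call writes a small metafile next to the data (e.g. data/raw/train.dvc) that gets committed to Git in place of the data itself. Its content looks roughly like this; the hash and sizes below are placeholders and the exact fields vary by DVC version:

outs:
- md5: 1a2b3c4d5e6f7890abcdef1234567890.dir
  size: 123456
  nfiles: 42
  path: train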
#upload to remote
dvc push
#pull data from remote
dvc pull
#check dvc status
dvc status
Share a Development Machine
#point DVC at a shared cache folder
dvc cache dir path/to/shared_cache
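On a shared machine it usually also helps to link workspace files out of the cache instead of copying them, and to make the cache group-writable. Both of the following are standard DVC config options, assuming all users belong to the same Unix group:

#link workspace files to the shared cache instead of copying them
dvc config cache.type symlink
#give group members write access to cache files
dvc config cache.shared group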
DVC pipeline
A pipeline stage consists of three parts: inputs (dependencies), outputs, and the command that produces the outputs.
# remove the manually added .dvc files to avoid confusion
dvc remove data/prepared/train.csv.dvc \
data/prepared/test.csv.dvc \
model/model.joblib.dvc --outs

#### dvc run
dvc run needs:
Dependencies: prepare.py and the data in data/raw
Outputs: train.csv and test.csv
Command: python prepare.py
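The split logic itself lives in the script. A minimal hypothetical src/prepare.py is sketched below; the raw file name, column handling, and split ratio are assumptions, not part of the original setup:

#src/prepare.py (hypothetical sketch): read raw data, write train/test CSVs
from pathlib import Path

import pandas as pd
from sklearn.model_selection import train_test_split

def main():
    out_dir = Path("data/prepared")
    out_dir.mkdir(parents=True, exist_ok=True)
    #assumed raw file name; adapt to the real layout of data/raw
    df = pd.read_csv("data/raw/data.csv")
    train, test = train_test_split(df, test_size=0.2, random_state=42)
    train.to_csv(out_dir / "train.csv", index=False)
    test.to_csv(out_dir / "test.csv", index=False)

if __name__ == "__main__":
    main()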
#-n: stage name; -d: dependencies; -o: outputs
#DVC will create two files, dvc.yaml and dvc.lock
dvc run -n prepare \
-d src/prepare.py -d data/raw \
-o data/prepared/train.csv -o data/prepared/test.csv \
python src/prepare.py
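After this command dvc.yaml should contain a stage entry roughly like the one below, while dvc.lock additionally records the hash of every dependency and output:

stages:
  prepare:
    cmd: python src/prepare.py
    deps:
    - data/raw
    - src/prepare.py
    outs:
    - data/prepared/test.csv
    - data/prepared/train.csv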
#define evaluation stage
#-M: mark an output as a metric file used to measure model performance
dvc run -n evaluate \
-d src/evaluate.py -d model/model.joblib \
-M metrics/accuracy.json \
python src/evaluate.py

#### dvc repro
dvc repro checks all the dependencies of the entire pipeline to determine what has changed and which commands need to be executed again.
dvc repro evaluate
#show the tracked metrics (here, accuracy)
dvc metrics show
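dvc metrics show reads the metric files declared with -M, here metrics/accuracy.json. A minimal hypothetical src/evaluate.py that produces such a file is sketched below; the test data path and the assumption that the last column is the label are illustrative only:

#src/evaluate.py (hypothetical sketch): score the model and write a JSON metric
import json
from pathlib import Path

import joblib
import pandas as pd

def main():
    model = joblib.load("model/model.joblib")
    test = pd.read_csv("data/prepared/test.csv")
    #assumed: last column is the label, the rest are features
    X, y = test.iloc[:, :-1], test.iloc[:, -1]
    Path("metrics").mkdir(exist_ok=True)
    with open("metrics/accuracy.json", "w") as f:
        json.dump({"accuracy": float(model.score(X, y))}, f, indent=2)

if __name__ == "__main__":
    main()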