HDF5 big data

python vaex

Posted by neverset on June 13, 2020

vaex uses lazy processing, means that read fields from file when needed most advantages when dealing with HDF5 or Apache Arrow format

conversion

if the original file is not in hdf we can convert it into hdf for calculation efficiency

dv = vaex.from_csv(file_path, convert=True, chunk_size=5_000_000)
#to open hdf5 directly
dv = vaex.open('big_file.csv.hdf5')

calculation

sum
suma = dv.col1.sum()
plot
dv.plot1d(dv.col2, figsize=(110.10))
adding coloumn (virtual) dv[‘col1_plus_col2’] = dv.col1 + dv.col2
filter
dvv = dv[dv.col1 > 90]
aggregations dv[‘col1_50’] = dv.col1 >= 50
join v_join = dv.join(dv_group, on=’new_coloumn’)