python vaex

Posted by neverset on June 13, 2020

vaex uses lazy processing: fields are read from the file only when they are needed. This is most advantageous when working with the HDF5 or Apache Arrow formats.
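To make "read only when needed" concrete, here is a minimal sketch that uses only the Python standard library. It is not vaex code and does not reproduce vaex internals; it only illustrates memory mapping, the general mechanism that lets a tool open a large column file without loading it. The file name `col1.bin` and the row count are made up for the demo.

```python
import mmap
import os
import tempfile
from array import array

N_ROWS = 1_000_000  # hypothetical column length

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "col1.bin")  # hypothetical file name

    # Store one float64 column on disk (8 bytes per value).
    with open(path, "wb") as f:
        array("d", range(N_ROWS)).tofile(f)

    # Map the file instead of loading it; the OS fetches pages only when they are touched.
    with open(path, "rb") as f, mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
        with memoryview(mm) as raw, raw.cast("d") as column:
            n_values = len(column)                      # known from the file size alone
            window = column[500_000:500_005].tolist()   # touches only a few bytes of the file

print(n_values, window)
```

Opening the mapping is cheap regardless of file size; the cost is paid only for the slice that is actually read, which is the behaviour the lazy approach relies on.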

conversion

If the original file is not in HDF5 format, we can convert it to HDF5 so that calculations run more efficiently.

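import vaex

# convert the CSV to HDF5 in chunks; the converted file is written as <file_path>.hdf5
dv = vaex.from_csv(file_path, convert=True, chunk_size=5_000_000)
# to open the HDF5 file directly
dv = vaex.open('big_file.csv.hdf5')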

calculation

  • sum
    suma = dv.col1.sum()
  • plot
    dv.plot1d(dv.col2, figsize=(10, 10))
  • adding column (virtual)
    dv['col1_plus_col2'] = dv.col1 + dv.col2
  • filter
    dvv = dv[dv.col1 > 90]
  • aggregations
    dv['col1_50'] = dv.col1 >= 50
  • join
    v_join = dv.join(dv_group, on='new_coloumn')