netcdf - Writing a NETCDF4 file is 6 times slower than writing a NETCDF3_CLASSIC file, and the file is 8 times as big?
I am using the netCDF4 library in Python and came across the issue stated in the title. At first I was blaming groups for this, but it turns out it is a difference between the NETCDF4 and NETCDF3_CLASSIC formats (edit: and it appears to be related to our Linux installation of the netCDF libraries).
In the program below, I am creating a simple time-series netCDF file of the same data in two different ways: 1) as a NETCDF3_CLASSIC file, and 2) as a flat NETCDF4 file (creating groups in the NETCDF4 file doesn't make much of a difference). What I find with a simple timing and the ls command is:
    1) NETCDF3          1.3483 seconds      1922704 bytes
    2) NETCDF4 flat     8.5920 seconds     15178689 bytes

It's the same routine that creates 1) and 2); the only difference is the format argument in the netCDF4.Dataset method. Is this a bug or a feature?

Thanks, Martin
EDIT: I have found out that this must have to do with our local installation of the netCDF library on our Linux computer. When I use the program version below (trimmed down to the essentials) on my Windows laptop, I get similar file sizes, and NETCDF4 is 2 times as fast as NETCDF3! When I run the same program on our Linux system, I can reproduce the old results. Thus, the question is apparently not related to Python.

Sorry for the confusion.

New code:
import datetime as dt
import numpy as np
import netCDF4 as nc


def write_to_netcdf_single(filename, data, series_info, format='NETCDF4'):
    vname = 'testvar'
    t0 = dt.datetime.now()
    with nc.Dataset(filename, "w", format=format) as f:
        # define dimensions and variables
        dim = f.createDimension('time', None)
        time = f.createVariable('time', 'f8', ('time',))
        time.units = "days since 1900-01-01 00:00:00"
        time.calendar = "gregorian"
        param = f.createVariable(vname, 'f4', ('time',))
        param.units = "kg"
        # define global attributes
        for k, v in sorted(series_info.items()):
            setattr(f, k, v)
        # store the data values
        time[:] = nc.date2num(data.time, units=time.units, calendar=time.calendar)
        param[:] = data.value
    t1 = dt.datetime.now()
    print "Writing file %s took %10.4f seconds." % (filename, (t1 - t0).total_seconds())


if __name__ == "__main__":
    # create an array with 1 million values and datetime instances
    time = np.array([dt.datetime(2000, 1, 1) + dt.timedelta(hours=v) for v in range(1000000)])
    values = np.arange(0., 1000000.)
    data = np.array(zip(time, values), dtype=[('time', dt.datetime), ('value', 'f4')])
    data = data.view(np.recarray)
    series_info = {'attr1': 'dummy', 'attr2': 'dummy2'}
    filename = "testnc4.nc"
    write_to_netcdf_single(filename, data, series_info)
    filename = "testnc3.nc"
    write_to_netcdf_single(filename, data, series_info, format='NETCDF3_CLASSIC')

[Old code deleted because it had unnecessary stuff]
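As a sanity check, the actual on-disk format and size of the two output files can be read back afterwards. This is only a minimal sketch, not part of the original script; Dataset.data_model is a standard attribute of netCDF4-python, and the file names are the ones the script above produces:

import os
import netCDF4 as nc

for fname in ("testnc4.nc", "testnc3.nc"):
    with nc.Dataset(fname, "r") as f:
        # data_model reports the underlying format, e.g. 'NETCDF4' or 'NETCDF3_CLASSIC'
        print("%s: format=%s, size=%d bytes" % (fname, f.data_model, os.path.getsize(fname)))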
The two file formats do have different characteristics. The classic file format is dead simple (well, more simple than the new format: http://www.unidata.ucar.edu/software/netcdf/docs/netcdf/classic-format-spec.html#classic-format-spec ): a small header describes all the data, and then (since you have 3 record variables) the 3 record variables get interleaved.
Nice and simple, but you only get one unlimited dimension, there's no facility for parallel I/O, and no way to manage data in groups.
Enter the new HDF5-based back-end, introduced in netCDF-4.
In exchange for new features, more flexibility, and fewer restrictions on file and variable size, you have to pay a bit of a price. For large datasets the costs are amortized, but your variables are (relatively speaking) kind of small.
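One common way to amortize those costs in netCDF4-python is to give the record variable an explicit chunk size, optionally with cheap zlib compression. This is only a sketch of the idea, not something from the original post (the file name is made up); chunksizes, zlib and complevel are documented keywords of createVariable:

import numpy as np
import netCDF4 as nc

f = nc.Dataset("testnc4_chunked.nc", "w", format="NETCDF4")
f.createDimension('time', None)
# larger chunks mean fewer HDF5 chunk-metadata entries per data value
param = f.createVariable('testvar', 'f4', ('time',),
                         chunksizes=(65536,), zlib=True, complevel=1)
param.units = "kg"
param[:] = np.arange(0., 1000000., dtype='f4')
f.close()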
I think the file size discrepancy is exacerbated by your use of record variables. In order to support arrays that are grow-able in N dimensions, there is more metadata associated with each record entry in the netCDF-4 format.
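If the time axis doesn't actually need to grow, declaring it with a fixed size avoids most of that per-record bookkeeping. A minimal sketch under that assumption (the file name is made up):

import numpy as np
import netCDF4 as nc

n = 1000000
f = nc.Dataset("testnc4_fixed.nc", "w", format="NETCDF4")
f.createDimension('time', n)   # fixed size instead of None (unlimited)
param = f.createVariable('testvar', 'f4', ('time',))
param[:] = np.arange(0., float(n), dtype='f4')
f.close()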
HDF5 uses the "reader makes right" convention, too. Classic netCDF says "all data will be big-endian", whereas HDF5 encodes a bit of information about how the data is actually stored. If the reader process has the same architecture as the writer process (which is common, as on your laptop, or when restarting a simulation from a checkpoint), then no conversion needs to be carried out.
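netCDF4-python exposes that choice through the endian keyword of createVariable. A small sketch (variable and file names are made up): 'native', the default, keeps the writer's byte order, while 'big' mimics the classic always-big-endian rule:

import numpy as np
import netCDF4 as nc

f = nc.Dataset("testnc4_endian.nc", "w", format="NETCDF4")
f.createDimension('time', 1000)
v_native = f.createVariable('v_native', 'f4', ('time',), endian='native')
v_big = f.createVariable('v_big', 'f4', ('time',), endian='big')   # classic-style byte order
v_native[:] = np.arange(1000, dtype='f4')
v_big[:] = np.arange(1000, dtype='f4')
f.close()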