netcdf - Writing a NETCDF4 file is 6 times slower than writing a NETCDF3_CLASSIC file, and the file is 8 times as big?
I am using the netCDF4 library in Python and came across the issue stated in the title. At first I was blaming groups for this, but it turns out it is a difference between the NETCDF4 and NETCDF3_CLASSIC formats (edit: and it appears to be related to our Linux installation of the netCDF libraries).
In the program below, I am creating a simple time-series netCDF file with the same data in two different ways: 1) as a NETCDF3_CLASSIC file, 2) as a flat NETCDF4 file (creating groups in the NETCDF4 file doesn't make much of a difference). What I find with simple timing and the ls command is:
1) NETCDF3_CLASSIC: 1.3483 seconds, 1922704 bytes
2) NETCDF4 (flat):  8.5920 seconds, 15178689 bytes
It is exactly the same routine that creates 1) and 2); the only difference is the format argument to the netCDF4.Dataset method. Is this a bug or a feature?
Thanks, Martin
Edit: I have found that this must have to do with our local installation of the netCDF library on our Linux computers. When I use the program version below (trimmed down to the essentials) on my Windows laptop, I get similar file sizes, and NETCDF4 is actually twice as fast as NETCDF3! When I run the same program on our Linux system, I can reproduce the old results. Thus, the question is apparently not related to Python.
Sorry for the confusion.
New code:
import datetime as dt
import numpy as np
import netCDF4 as nc

def write_to_netcdf_single(filename, data, series_info, format='NETCDF4'):
    vname = 'testvar'
    t0 = dt.datetime.now()
    with nc.Dataset(filename, "w", format=format) as f:
        # define dimensions and variables
        dim = f.createDimension('time', None)
        time = f.createVariable('time', 'f8', ('time',))
        time.units = "days since 1900-01-01 00:00:00"
        time.calendar = "gregorian"
        param = f.createVariable(vname, 'f4', ('time',))
        param.units = "kg"
        # define global attributes
        for k, v in sorted(series_info.items()):
            setattr(f, k, v)
        # store data values
        time[:] = nc.date2num(data.time, units=time.units, calendar=time.calendar)
        param[:] = data.value
    t1 = dt.datetime.now()
    print("Writing file %s took %10.4f seconds." % (filename, (t1 - t0).total_seconds()))

if __name__ == "__main__":
    # create an array with 1 million values and datetime instances
    time = np.array([dt.datetime(2000, 1, 1) + dt.timedelta(hours=v) for v in range(1000000)])
    values = np.arange(0., 1000000.)
    data = np.array(list(zip(time, values)), dtype=[('time', dt.datetime), ('value', 'f4')])
    data = data.view(np.recarray)
    series_info = {'attr1': 'dummy', 'attr2': 'dummy2'}

    filename = "testnc4.nc"
    write_to_netcdf_single(filename, data, series_info)

    filename = "testnc3.nc"
    write_to_netcdf_single(filename, data, series_info, format='NETCDF3_CLASSIC')
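For reference, here is a quick way to confirm which format each file actually ended up in (a small sketch with the netCDF4 Python bindings; it assumes the two files written by the script above exist):

import netCDF4 as nc

for fname in ("testnc3.nc", "testnc4.nc"):
    with nc.Dataset(fname) as f:
        # data_model reports the on-disk format, e.g. NETCDF3_CLASSIC or NETCDF4
        print(fname, f.data_model)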
[Old code deleted because it contained unnecessary stuff.]
The two file formats have different characteristics. The classic file format is dead simple (well, simpler than the new format: http://www.unidata.ucar.edu/software/netcdf/docs/netcdf/classic-format-spec.html#classic-format-spec): a small header describes all the data, and then (since you have 3 record variables) the three record variables are interleaved.
Nice and simple, but you only get one unlimited dimension, there's no facility for parallel I/O, and no way to manage data with groups.
Enter the new HDF5-based back-end, introduced in netCDF-4.
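As a rough illustration (a minimal sketch with the netCDF4 Python bindings; the file and group names are made up), the new back-end lifts both of those restrictions at once:

import netCDF4 as nc

with nc.Dataset("groups_demo.nc", "w", format="NETCDF4") as f:
    grp = f.createGroup("series1")       # groups have no NETCDF3_CLASSIC equivalent
    grp.createDimension("time", None)    # one unlimited dimension...
    grp.createDimension("depth", None)   # ...and a second one, also unlimited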
In exchange for new features, more flexibility, and fewer restrictions on file and variable size, you have to pay a bit of a price. For large datasets, the costs are amortized, but your variables are (relatively speaking) kind of small.
I think the file size discrepancy is exacerbated by your use of record variables. In order to support arrays that are growable in N dimensions, there is more metadata associated with each record entry in the netCDF-4 format.
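If that per-record overhead is what you are paying for, one thing worth experimenting with (a sketch, not tested against your data; the chunk size 2**16 is an arbitrary guess) is passing explicit chunksizes along the unlimited dimension:

import numpy as np
import netCDF4 as nc

with nc.Dataset("testnc4_chunked.nc", "w", format="NETCDF4") as f:
    f.createDimension('time', None)
    # fewer, larger chunks along the unlimited dimension mean less
    # per-chunk bookkeeping in the HDF5 layer
    time = f.createVariable('time', 'f8', ('time',), chunksizes=(2**16,))
    param = f.createVariable('testvar', 'f4', ('time',), chunksizes=(2**16,))
    time[:] = np.arange(1000000, dtype='f8')
    param[:] = np.arange(1000000, dtype='f4')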
HDF5 uses the "reader makes right" convention, too. Classic netCDF says "all data will be big-endian", but HDF5 encodes a bit of information about how the data was actually stored. If the reader process has the same architecture as the writer process (which is common, as on your laptop, or when restarting a simulation from a checkpoint), then no conversion needs to be carried out.
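The netCDF4 Python bindings expose this per variable through the endian keyword of createVariable (a minimal sketch; the file and variable names are made up):

import numpy as np
import netCDF4 as nc

with nc.Dataset("endian_demo.nc", "w", format="NETCDF4") as f:
    f.createDimension('x', 4)
    # 'native' keeps the writer's byte order on disk, so a reader on the
    # same architecture does no conversion ("reader makes right")
    v = f.createVariable('v', 'f4', ('x',), endian='native')
    v[:] = np.arange(4, dtype='f4')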