netcdf - Writing a NETCDF4 file is 6 times slower than writing a NETCDF3_CLASSIC file and the file is 8 times as big?

I am using the netCDF4 library in Python and came across the issue stated in the title. At first I was blaming groups for this, but it turns out to be a difference between the NETCDF4 and NETCDF3_CLASSIC formats (edit: and it appears to be related to our Linux installation of the netCDF libraries).

In the program below, I am creating a simple time series netCDF file of the same data in 2 different ways: 1) as a NETCDF3_CLASSIC file, 2) as a NETCDF4 flat file (creating groups in the NETCDF4 file doesn't make much of a difference). What I find with a simple timing and the ls command is:

1) netcdf3          1.3483 seconds      1922704 bytes
2) netcdf4 flat     8.5920 seconds     15178689 bytes

It's the same routine that creates 1) and 2); the only difference is the format argument passed to netCDF4.Dataset. Is this a bug or a feature?

Thanks, Martin

Edit: I have found that this must have something to do with our local installation of the netCDF library on the Linux computer. When I use the program version below (trimmed down to the essentials) on my Windows laptop, I get similar file sizes, and NETCDF4 is actually 2 times as fast as NETCDF3! When I run the same program on our Linux system, I can reproduce the old results. Thus, this question is apparently not related to Python.

Sorry for the confusion.

New code:

    import datetime as dt
    import numpy as np
    import netCDF4 as nc


    def write_to_netcdf_single(filename, data, series_info, format='NETCDF4'):
        vname = 'testvar'
        t0 = dt.datetime.now()
        with nc.Dataset(filename, "w", format=format) as f:
            # define dimensions and variables
            dim = f.createDimension('time', None)
            time = f.createVariable('time', 'f8', ('time',))
            time.units = "days since 1900-01-01 00:00:00"
            time.calendar = "gregorian"
            param = f.createVariable(vname, 'f4', ('time',))
            param.units = "kg"
            # define global attributes
            for k, v in sorted(series_info.items()):
                setattr(f, k, v)
            # store data values
            time[:] = nc.date2num(data.time, units=time.units, calendar=time.calendar)
            param[:] = data.value
        t1 = dt.datetime.now()
        print("Writing file %s took %10.4f seconds." % (filename, (t1-t0).total_seconds()))


    if __name__ == "__main__":
        # create an array with 1 mio values and datetime instances
        time = np.array([dt.datetime(2000,1,1)+dt.timedelta(hours=v) for v in range(1000000)])
        values = np.arange(0., 1000000.)
        # list(...) keeps this working where zip returns an iterator
        data = np.array(list(zip(time, values)), dtype=[('time', dt.datetime), ('value', 'f4')])
        data = data.view(np.recarray)
        series_info = {'attr1':'dummy', 'attr2':'dummy2'}
        filename = ""  # output path is blank in the source; set one before running
        write_to_netcdf_single(filename, data, series_info)
        filename = ""  # output path is blank in the source; set one before running
        write_to_netcdf_single(filename, data, series_info, format='NETCDF3_CLASSIC')

[old code deleted because it had too much unnecessary stuff]

The two file formats do have different characteristics. The classic file format was dead simple (well, simpler than the new format): a small header described all the data, and then (since you have 3 record variables) the 3 record variables get interleaved.
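To make the idea of interleaved records concrete, here is a minimal sketch of how such records could be packed. It is a toy illustration and not the real classic file format: the header and padding rules are left out, and the helper names `pack_records` and `unpack_records` as well as the two example variables (an 8-byte time and a 4-byte value) are assumptions made for this example.

```python
import struct

# Toy layout only: one big-endian double ("time") followed by one
# big-endian float ("value") per record. The real classic format also
# has a header and padding rules that are not modelled here.
RECORD = struct.Struct(">df")


def pack_records(times, values):
    """Interleave the two variables record by record (hypothetical helper)."""
    return b"".join(RECORD.pack(t, v) for t, v in zip(times, values))


def unpack_records(blob):
    """Recover the (time, value) pairs from the interleaved bytes."""
    return [RECORD.unpack_from(blob, i) for i in range(0, len(blob), RECORD.size)]


if __name__ == "__main__":
    blob = pack_records([0.0, 1.0, 2.0], [10.0, 11.0, 12.0])
    print(len(blob), "bytes for 3 records of", RECORD.size, "bytes each")
```

In this toy layout, appending a record just means appending one more fixed-size block, which is part of why the classic layout is cheap for a single growing dimension.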

Nice and simple, but you only get one unlimited dimension, there's no facility for parallel I/O, and no way to organize data into groups.

Enter the new HDF5-based back-end, introduced in NetCDF-4.

In exchange for new features, more flexibility, and fewer restrictions on file and variable size, you have to pay a bit of a price. For large datasets, the costs get amortized, but your variables are (relatively speaking) kind of small.

I think the file size discrepancy is exacerbated by your use of record variables. In order to support arrays that are grow-able in N dimensions, there is more metadata associated with each record entry in the NetCDF-4 format.
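The amortization argument can be sketched with simple arithmetic. The numbers below are purely hypothetical (the 64 bytes of bookkeeping per entry is an assumption for illustration, not a measured HDF5 figure, and `overhead_ratio` is a made-up helper). The point is only that a fixed cost per stored entry matters a lot when entries are small and very little when they are large.

```python
def overhead_ratio(n_values, value_size, values_per_entry, overhead_per_entry):
    """Return total bytes divided by raw data bytes.

    All parameters are illustrative; overhead_per_entry is a made-up
    stand-in for whatever bookkeeping a file format keeps per entry.
    """
    n_entries = -(-n_values // values_per_entry)  # ceiling division
    data_bytes = n_values * value_size
    return (data_bytes + n_entries * overhead_per_entry) / float(data_bytes)


if __name__ == "__main__":
    for per_entry in (1, 1000, 100000):
        ratio = overhead_ratio(1000000, 4, per_entry, 64)
        print("values per entry %7d -> file is %.3fx the raw data" % (per_entry, ratio))
```

With one value per entry the bookkeeping dominates the file; with many values per entry it all but disappears.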

HDF5 uses the "reader makes right" convention, too. Classic NetCDF says "all data will be big-endian", but HDF5 encodes a bit of information about how the data was stored. If the reader process has the same architecture as the writer process (which is common, as it would be on your laptop or when restarting from a simulation checkpoint), then no conversion needs to be done.
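A small sketch of the "reader makes right" idea, using only the standard library. This is not how HDF5 actually encodes its data: the one-byte order tag and the helper names `write_native` and `read_any` are assumptions invented for this example. The writer stores values in its own byte order and records which order that was; the reader only has to swap bytes when its own order differs.

```python
import struct
import sys


def write_native(values):
    """Store floats in the writer's own byte order, tagged with that order."""
    tag = b">" if sys.byteorder == "big" else b"<"
    return tag + struct.pack(tag.decode() + "%df" % len(values), *values)


def read_any(blob):
    """Decode using the stored tag; returns (values, swapped).

    swapped is True only when the stored order differs from this machine's.
    """
    tag = blob[:1].decode()
    count = (len(blob) - 1) // 4
    values = struct.unpack(tag + "%df" % count, blob[1:])
    native = ">" if sys.byteorder == "big" else "<"
    return list(values), tag != native


if __name__ == "__main__":
    values, swapped = read_any(write_native([1.0, 2.5]))
    print(values, "byte swap needed:", swapped)
```

A fixed big-endian rule, by contrast, forces a conversion on every write and every read on a little-endian machine, even when writer and reader are the same computer.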

