Exposing latest netcdf 4.9.x library functionality: quantize, zstandard #725
There is ongoing discussion on how Charlie Z's approach might get standardized under CF. See cf-convention/cf-conventions#403. Not sure if the above is advocating that or not.
I was able to add
@mauzey1 great to see progress here! Do you need to have an updated version of
Yes, I think Charlie could save us much time if he can respond. He's always been very helpful in the past. If we don't hear from him in, say, a week, let's discuss further.
Hi All. @mauzey1 the
Below is the output of
Below is the output of
That all looks nominal. The last requirements to check are the contents of
@czender this is the issue, right?
The environment variable
@durack1 Good catch. I was focused on the plugin libraries for compression/decompression because that is usually the issue. I suppose the reported lack of quantization support could also mess things up. However, I'm a bit skeptical that quantization is the issue, because that should impact netCDF creation, not reading. (Quantized data are in IEEE format, so no special software is needed to read them.) And I thought @mauzey1 was just demonstrating that Zstandard was not working. @mauzey1, is your test dataset also supposedly quantized? None of the variables' metadata includes the quantization attribute (e.g.,

However, back to the plugin directory contents. @mauzey1 please retry your
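Charlie's parenthetical deserves emphasis: quantization in the BitGroom/BitRound family zeroes or rounds low-order mantissa bits, so the output is still an ordinary IEEE-754 double that any netCDF reader can interpret; only its compressibility changes. A toy, stdlib-only sketch of the truncation idea (the `bitround` helper here is illustrative, not the netCDF implementation, which handles rounding more carefully):

```python
import struct

def bitround(value: float, keep_bits: int) -> float:
    """Zero all but `keep_bits` explicit mantissa bits of a float64.

    The result is still a valid IEEE-754 double, just with a long run of
    trailing zero bits, which lossless codecs compress far better.
    """
    # Reinterpret the float's bytes as a 64-bit unsigned integer.
    bits = struct.unpack("<Q", struct.pack("<d", value))[0]
    mantissa_bits = 52  # float64 carries 52 explicit mantissa bits
    mask = ~((1 << (mantissa_bits - keep_bits)) - 1) & 0xFFFFFFFFFFFFFFFF
    return struct.unpack("<d", struct.pack("<Q", bits & mask))[0]

# Keeping ~12 mantissa bits preserves ~3-4 significant decimal digits,
# and the quantized value reads back as a plain float.
print(bitround(3.141592653589793, 12))
```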
@czender I did try setting
As for the quantize issue, I haven't seen any errors arise from it. I'm currently just debugging errors caused by Zstandard. The issue first came up in a test that was reading a netCDF file in order to append more data to it. The error message was the same one that I saw in the ncdump output.
I edited the test to run only the part before the "appending" portion, to get the file it was appending to. That's where I saw the error in the ncdump output.
@czender How do you use NetCDF with Zstandard in your code? Did you need to pass certain flags to the compiler? I tried rebuilding CMOR with the

How do you install NetCDF and other libraries? I'm using an Anaconda environment on macOS. All of my packages are from conda-forge. Here's the branch of CMOR with the quantize/zstandard changes: https://github.com/PCMDI/cmor/tree/725_expose_netcdf_quantize_and_zstandard
@mauzey1 FWIW my conda-forge installation of netCDF 4.9.2 also reports

The library names in my installation appear identical to yours, yet the dates and sizes differ. Mine are newer and larger than yours:
Any idea why this would be the case? Are you using conda-forge for all packages (otherwise library conflicts can easily arise)? In any case, please report the results of the
I tried creating a new environment to reinstall the libraries, hoping it would upgrade everything. However, I got the same error as before, and it appears that I've gotten the same version of the HDF5 plugins. I wonder if dependencies in my environment setup are giving me older versions of the libraries. @czender If you try creating the environments described in our source build instructions, do you get versions of the plugins different from those in your current environment?
@mauzey1 Our messages crossed in the ether. To expand on the above, the
Here is an example of that conda-forge netCDF dumping a Zstandard-compressed, quantized file:
So the issue you have encountered is perplexing. Any ideas?
@czender Could you try building and running the tests of the CMOR branch I posted? Ideally using the environment described in the source build instructions. Alternatively, could you send me a Zstandard-compressed file that I could try opening with my ncdump setup?
@mauzey1 It would be easier for me if you would first try (and add to your CMOR instructions) this:
Here is a sample file. First rename it with
That file works for me without issue. Using my current dev environment, I got the following output.
I'm trying to run |
@mauzey1 Since you can dump the file I sent but not your original test file, my hunch is now that your original test file was somehow generated in a corrupted state. Do you agree? Put it somewhere on acme1 where I can read it and I'll take a look.
@czender Can you get it here?
The file appears to be fine. Both
I tried reading the same file on a Linux machine with a conda environment made with
Output of
Packages installed in this environment:
All of our CI jobs, running on both Linux and Mac, are failing on the same test that produced that file when testing the zstandard/quantize branch. Here is said test: https://github.com/PCMDI/cmor/blob/725_expose_netcdf_quantize_and_zstandard/Test/test_python_appending.py
@mauzey1 My bad. You are not crazy :) I had only been dumping metadata, under the mistaken impression that the issue would show itself when printing metadata. I can reproduce the error you encounter when printing the full data with
Using the latest netCDF snapshot the error comes from a different source line:
NCO's ncks gives a more informative error message:
Based on this error message, I looked more carefully at the metadata for
Next, I verified that ncks can dump the entire file except
A few points about all this:

Given all this, my hunch is that the small chunksize of
I tried disabling the use of Zstandard compression on the bounds variables, and ncdump read the generated file without issue. I also tried disabling DEFLATE, and that likewise produced a file that ncdump read without issue. I noticed that some files generated by the tests didn't give ncdump issues; I'll look at the other files to see what could be causing the problem. Thank you for helping debug this issue.
@mauzey1 If it is easy to do, I would be curious to see what the values of the time_bnds are in the case where you disabled Zstandard compression. Could you print those out? This would help answer the question "Are you sure that all the values written are 'normal'?" Of course, if the same input bounds are used for other tests that are successfully written, this would be unlikely to be the cause of any problems.
@taylor13 Here is the ncdump output with the time bounds displayed.
O.K., thanks. Nothing fishy about those numbers.
It seems to be due to an interaction of "double compression" (i.e., two lossless codecs) with microscopic chunk sizes.
Perhaps, given @czender's comment, one option would be to simply forbid "double" compression, which really doesn't buy much, does impact performance, and in this case messes up the file. If we forbid this configuration, will we solve all our problems?
I did an experiment with the test that generated the file that had problems with the time bounds axis. I reduced the size of the time, latitude, and longitude of the data generated. The resulting file didn't have the problem that the original one did.
@czender, is this an API problem? I.e., does it enable this combination of compression to actually happen, generating a file that may have valid entries but that the libraries have no way to interpret in order to reinflate the data?
@durack1 This is not an API problem in the sense that netCDF4/HDF5 filters are designed to be "chainable" without limit. The library keeps track of the order in which the filters are applied, and reinflates during reads by applying the inverse filters in reverse order. My impression is that this is a bug in the HDF filter code that is exposed by using multiple filters at the extreme limit of small chunksizes. Remember that chunks are the unit of compression, and must be read in their entirety. So a chunksize of two integers = 8 bytes means that the overhead of the metadata describing the compression characteristics of a chunk is probably much larger than the chunk itself. Compressing data in units of two integers at a time makes no sense. This may stress the codecs in unforeseen/untested ways. People debate what the optimal chunksize is. (FWIW, NCO avoids compressing variables with chunksizes < 2*filesystem_block_size or 8192 B, whichever is smaller.)
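The NCO heuristic mentioned at the end translates directly into a guard clause. The sketch below is a hypothetical illustration of that rule, not NCO's actual source:

```python
def worth_compressing(chunk_bytes: int, fs_block_bytes: int = 4096) -> bool:
    """Skip compression when chunks are smaller than
    min(2 * filesystem_block_size, 8192) bytes: at that size the per-chunk
    codec metadata outweighs any conceivable savings."""
    threshold = min(2 * fs_block_bytes, 8192)
    return chunk_bytes >= threshold

# A chunk of two 4-byte integers is hopeless; an 8 KiB chunk is reasonable.
print(worth_compressing(8))     # False
print(worth_compressing(8192))  # True
```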
Given the "double compression" issue, should we just go with exposing the quantize function in CMOR 3.9 and save the Zstandard function for future CMOR development? Quantization doesn't seem to break anything, but we should create tests to confirm that.
@czender I'd appreciate your take on this; it seems we might have work ahead of us to get both of these working, or do you see a tangible path forward?
Yes, let's get Charlie's advice on this before proceeding. Is there a way we can get good compression without running into this problem, perhaps by preventing users from "micro-chunking" the data?
First, the quantize functionality seems orthogonal to all of this. No one has reported any issues with it. So it seems like CMOR can continue to test and then implement quantize functionality in its own branch (branch #1). Second, the results of @mauzey1's tests show the HDF bug is only triggered by variables that are "doubly compressed", but no one knows exactly what triggers it. In fact the bug might occur with any two lossless compressors (e.g., bzip2 and DEFLATE) operating on small chunks and may have nothing to do with Zstandard per se. Since "double compression" is a silly, slow, and non-productive waste of computer resources, there's no reason for CMOR to do it. So my suggestion is to implement the Zstandard functionality in a CMOR branch that has been modified to prevent "double compressing" files. E.g., when CMOR receives a DEFLATE'd file to process, the new behavior should be to fully decompress it before re-compressing it with another codec (e.g., Zstandard), rather than just applying a new codec on top of the old. This would be branch #2. Orthogonal to that, I suggest creating and testing a CMOR branch that always applies the Shuffle filter prior to lossless compression unless explicitly instructed not to. This would be branch #3. Branches 1, 2, and 3 are orthogonal and could be released in sequence or combined into one release if the testing goes smoothly.
Looking at the slides again I saw this point.
So we should not be applying quantization and Zstandard compression to the coordinates, bounds, z-factors, etc. I ended up adding the quantize and zstandard functions to the places in CMOR where deflate was being applied, which included the "grid variables." Maybe restricting quantization and Zstandard compression to the dataset variable will lessen the chance of the HDF error occurring. I'll still follow @czender's advice in #725 (comment).
Nicely spotted, @mauzey1. Hopefully (and naively, coming from me), this will solve the problem.
Thank you for reading the slides carefully, @mauzey1. The intent is to prevent the quantization of "grid variables", not to prevent their lossless compression. As you say, preventing their compression would reduce the chance of an HDF error. Moreover, many grid variables (lat, lon, ...) for rectangular grids are too small to benefit from compression. However, others (e.g., cell_measures = area) can be as large as any 2D geophysical field and thus definitely do benefit from compression. And for unstructured grids, many grid variables (lat, lon, ...) can be as large as the 2D geophysical fields, if not larger (cell bounds arrays for polygonal meshes). So be careful not to throw out the baby with the bathwater :)
Yes, @czender is quite right about the size of these grid-related fields. Still, for CMIP, we almost invariably request multiple (100's of) time samples and often multiple (10's of) model levels, which multiplies the 2-d spatial dimension by a factor of more than 100 (perhaps 1000's) to give the size of the data array of interest. That means that relative to the data array of interest, the benefit of compressing these "coordinate fields" is small (typically less than 1%). Would we really care if we threw out the baby with the bathwater? (oooh that sounds really mean; don't take it literally)
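The "less than 1%" figure is straightforward arithmetic; with illustrative (made-up) CMIP-like dimensions:

```python
# Hypothetical 1-degree grid: one time-invariant 2-D grid variable
# (e.g. cell area) versus a full timeseries of one geophysical field.
nlat, nlon = 180, 360
grid_bytes = nlat * nlon * 8                  # one float64 grid variable

ntime, nlev = 100, 10                         # 100 time samples, 10 levels
field_bytes = ntime * nlev * nlat * nlon * 8  # the data array of interest

fraction = grid_bytes / (grid_bytes + field_bytes)
print(f"grid variable share of file: {fraction:.3%}")  # well under 1%
```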
@taylor13 When you put it that way, I agree with you. The "grid variables" may be comparable to a 2D geophysical field in the horizontal dimensions, but they are (so far, at least) constant in time and so usually tiny by comparison to full geophysical timeseries. So chuck out the baby too :)
When applying Zstandard compression, is there a zstandard level that applies no compression, similar to how setting the deflate level to 0 applies no compression? I tried using 0 as the zstandard level, but that also applies compression. I'm planning to keep deflate at level 1 as the default and not have Zstandard enabled at the beginning.
@mauzey1 Good to hear you're making progress on this.
Not to my knowledge
Same happened to me
FWIW, somewhere in the Zstandard filter code I read that a good default level for Zstandard is 3, so that's what NCO uses as default.
I was thinking of keeping the current default of deflate level 1 without shuffle. If the user wants Zstandard, they can enable it by setting the zstandard level while also setting the deflate level to 0 and enabling shuffle. In my testing, having both Zstandard and deflate enabled adversely affects the compression efficiency. If Zstandard compression were on by default, then there should be a way to disable it if you want to use deflate instead, and I do not know a way of disabling it once the zstandard level has been set.
Not knowing much about this, my instincts say that "no compression" should be the default when files are written. Turning on compression should, by default, invoke it at some level that balances read/write performance against file size for the kinds of files we usually deal with. Of course, the user should be able to set the compression/shuffle parameters to any acceptable values.
Agreed. The default should be no compression at all. If the user specifies a compression level without naming the codec, then NCO assumes the codec is DEFLATE and applies it with Shuffle. (Shuffle increases the compression ratio by ~15% for both DEFLATE and Zstandard.) If the user specifies a codec but no compression level, then NCO uses a default compression level of 1 for DEFLATE and 3 for Zstandard. The user must explicitly specify multiple lossless codecs, because that's usually a bad idea. When copying files, NCO preserves the codecs in the source file unless the copy command includes an explicit compression option, in which case the input compression codecs are discarded and only the new codecs are applied. What I intended to convey yesterday is that the Zstandard source code recommends a default compression level of 3 as offering a good tradeoff between speed and efficiency. Other organizations (e.g., Amazon: https://docs.aws.amazon.com/athena/latest/ug/compression-support-zstd-levels.html#:~:text=We%20recommend%20using%20the%20default,speed%20is%20not%20a%20concern.) also default to Zstandard level 3, though I'd guess that level 1 would not be too different: slightly faster compression yielding a slightly larger file.
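The ~15% Shuffle gain Charlie cites comes from transposing the bytes of fixed-width values so that the slowly varying high-order bytes of neighboring values become long runs that DEFLATE or Zstandard can exploit. A pure-Python sketch of the effect (HDF5's actual Shuffle filter is implemented in C, but performs the same byte transpose):

```python
import struct
import zlib

def shuffle(data: bytes, itemsize: int) -> bytes:
    """Byte-shuffle: emit byte 0 of every element, then byte 1, and so on,
    grouping the nearly constant high-order bytes of smooth data."""
    return b"".join(data[k::itemsize] for k in range(itemsize))

# Smoothly varying doubles, as in a coordinate variable or model field.
vals = [i / 1000.0 for i in range(20_000)]
raw = struct.pack(f"<{len(vals)}d", *vals)

plain = zlib.compress(raw, 6)
shuffled = zlib.compress(shuffle(raw, 8), 6)
print(len(plain), len(shuffled))  # the shuffled stream compresses better
```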
The latest versions of libnetcdf include new functions to further squash data using lossy compression; see Charlie Zender's Why & How to Increase Dataset Compression in CMIP7, in particular the quantize and zstandard operations.
How easy is it to expose this in CMOR 3.9?
ping @taylor13 @matthew-mizielinski @sashakames @piotrflorek @czender
Also see discussion #724