
I have several comma-separated data files that I want to load into an xarray dataset. Each row in each file represents a different spatial value of a field on a fixed grid, and each file represents a different point in time. The grid is fixed and unchanging in time, but its spacing is not uniform. The ultimate goal is to compute max_{x, y} { std_t[ value(x, y, t) * sqrt(x**2 + y**2) ] }, where sqrt is the square root, std_t is the standard deviation with respect to time, and max_{x, y} is the maximum across all space.
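To make the target concrete, here is a sketch of that computation, assuming the data were already loaded into a DataArray (the dummy grid below is a stand-in, not my real data):

import numpy as np
import xarray as xr

# hypothetical placeholder grid; the real da would come from the CSV files
da = xr.DataArray(
    np.random.rand(7, 10, 10),
    dims=('time', 'x', 'y'),
    coords={'time': np.arange(7), 'x': np.arange(10.0), 'y': np.arange(10.0)},
)

r = np.sqrt(da.x ** 2 + da.y ** 2)           # sqrt(x**2 + y**2) over the grid
weighted = da * r                            # value(x, y, t) * sqrt(x**2 + y**2)
print(weighted.std('time').max(('x', 'y')))  # std over time, then max over space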

I am having trouble loading the data. It is not clear to me how one is supposed to load several CSV files into an xarray dataset. There is an open_mfdataset function, which is designed for loading several data files into a dataset, but seems to expect hdf5 or netcdf files.

It seems that there is no way to load plain CSV files directly into an xarray dataset, and that preprocessing the data is necessary. In my example, I decided to convert the CSV files to HDF5 files beforehand, to make use of the h5netcdf engine. This has created what appears to be an HDF5-specific problem for me.

Below is my best attempt at loading the data so far. Unfortunately, it results in an empty xarray dataset. I tried several options in the open_mfdataset function; the following code is only one realization of several attempts at using it.

How can I load these csv files into a single xarray dataset, to set myself up to find the maximum across space of the standard deviation in time of the value of interest?

import xarray as xr
import numpy as np
import pandas as pd
# Create example files:
# - Each file contains a spatially dependent value, f(x, y)
# - Each file represents a different point in time, f(x, y, t)
for ii in range(7):
   # create csv file
   fl = open('exampleFile%i.dat' % ii, 'w')
   fl.write('time x1 x2 value\n')
   for xx in range(10):
      for yy in range(10):
         fl.write('%i %i %i %i\n' % 
                  (ii, xx, yy, (xx - yy) * np.exp(ii)))
   fl.close()
   # convert csv to hdf5
   dat = pd.read_csv('exampleFile%i.dat' % ii)
   dat.to_hdf('exampleFile%i.hdf5' % ii, 'data', mode='w')
# Read all files into an xarray dataset
#    (the ultimate goal is to find the
#       maximum across space of
#       the standard deviation across time
#       of the "value" column)
result = xr.open_mfdataset('exampleFile*.hdf5', engine='h5netcdf', combine='nested')

When I run the code, the result variable does not appear to contain the desired data:

In: result
<xarray.Dataset>
Dimensions:  ()
Data variables:
    *empty*
Attributes:
    PYTABLES_FORMAT_VERSION:  2.1
    TITLE:                    Empty(dtype=dtype('S1'))
    VERSION:                  1.0

An answer was posted that assumes a uniformly spaced spatial grid. Here is a slightly modified example that does not assume an evenly-spaced grid of spatial points.

The example also assumes three spatial dimensions. That is more true to my real problem, and I realized that might be an important detail in this simple example.

import xarray as xr
import numpy as np
import pandas as pd
# Create example files:
# - Each file contains a spatially dependent value, f(x, y, z)
# - Each file represents a different point in time, f(x, y, z, t)
for ii in range(7):
   # create csv file
   fl = open('exampleFile%i.dat' % ii, 'w')
   fl.write('time x y z value\n')
   for xx in range(10):
      for yy in range(int(10 + xx // 2)):
         for zz in range(int(10 + xx //3 + yy // 3)):
            fl.write('%i %f %f %f %f\n' % 
                    (ii, xx * np.exp(- 1 * yy * zz) , yy * np.exp(xx - zz), zz * np.exp(xx * yy), (xx - yy) * np.exp(ii)))
   fl.close()
   # convert csv to hdf5
   dat = pd.read_csv('exampleFile%i.dat' % ii)
   dat.to_hdf('exampleFile%i.hdf5' % ii, 'data', mode='w')
# Read all files into an xarray dataset
#    (the ultimate goal is to find the
#       maximum across space of
#       the standard deviation across time
#       of the "value" column)
result = xr.open_mfdataset('exampleFile*.hdf5', engine='h5netcdf', combine='nested')
@Val Fundamentally, no. I think of this space-time data as being a 3D data frame, so I thought xarray would be most appropriate. A functional pandas solution would certainly be illuminating and would resolve my underlying problem.
– kilojoules, Dec 31, 2020 at 22:01

@kilojoules that's actually the heart of your issue: xarray isn't really good at manipulating irregular point clouds within ND grids; it's designed for manipulating regular grids. So you have two options. If you have the same (or similar) number of points in each observation set and you'll be doing a lot of ops across one dim at a time (like this example of the spatial std), then you could treat obs_id and time as your coords, with x and y being non-indexing coordinates (essentially metadata) on obs_id, as sketched below. Otherwise, I'd recommend pandas and groupby ops for this.
– Michael Delgado, Jan 5, 2021 at 16:22
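A minimal sketch of the comment's suggestion (the obs_id dimension, the random data, and the variable names here are illustrative, not from the original code):

import numpy as np
import xarray as xr

# Hypothetical point cloud: the same number of points per time step,
# with x and y stored as non-indexing coordinates along obs_id.
n_time, n_obs = 7, 100
ds = xr.Dataset(
    {'value': (('time', 'obs_id'), np.random.rand(n_time, n_obs))},
    coords={
        'time': np.arange(n_time),
        'obs_id': np.arange(n_obs),
        'x': ('obs_id', np.random.rand(n_obs)),  # metadata, not an index
        'y': ('obs_id', np.random.rand(n_obs)),  # metadata, not an index
    },
)

# Reductions across one dim at a time still work on the irregular cloud:
print(ds['value'].std('time').max('obs_id'))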

My approach would be to create a parsing function that converts the CSVs into xarray.Datasets.

This way you can use xarray.concat to combine them into a final dataset, on which you can perform your computations.

The following works with your example data:

import pandas as pd
import xarray as xr
from glob import glob

def csv2xr(csv, sep=" "):
    df = pd.read_csv(csv, sep=sep)
    x = df.x1.unique()
    y = df.x2.unique()
    # reshape the flat "value" column onto the (time, x, y) grid
    pix = df.value.values.reshape(1, x.size, y.size)
    ds = xr.Dataset({
        "value": xr.DataArray(
            pix,
            dims=['time', 'x', 'y'],
            coords={"time": df.time.unique(), "x": x, "y": y})
    })
    return ds

csvs = glob("*dat")
ds_full = xr.concat([csv2xr(f) for f in csvs], dim="time")
print(ds_full)
#<xarray.Dataset>
# Dimensions:  (time: 7, x: 10, y: 10)
# Coordinates:
#   * time     (time) int64 4 3 2 0 1 6 5
#   * x        (x) int64 0 1 2 3 4 5 6 7 8 9
#   * y        (y) int64 0 1 2 3 4 5 6 7 8 9
# Data variables:
#     value    (time, x, y) int64 0 -54 -109 -163 -218 -272 ... 593 445 296 148 0

Then to get the max of the std over time:

ds_full.std("time").max()
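To fold in the sqrt(x**2 + y**2) weighting from the question, a sketch building on ds_full above:

import numpy as np

# weight each grid point by its distance from the origin, then reduce
r = np.sqrt(ds_full.x ** 2 + ds_full.y ** 2)
print((ds_full['value'] * r).std('time').max(['x', 'y']))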
You can also merge everything into a single dataframe and then use the .to_xarray() method to convert it to a dataset, as sketched below.
– iury simoes-sousa, Dec 30, 2020 at 9:11
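A minimal sketch of that suggestion, using the column names from the question's first example:

import pandas as pd
from glob import glob

# read every CSV into one long DataFrame, index by the coordinate
# columns, then convert the whole thing to an xarray Dataset
df = pd.concat(pd.read_csv(f, sep=' ') for f in sorted(glob('exampleFile*.dat')))
ds = df.set_index(['time', 'x1', 'x2']).to_xarray()
print(ds)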
@Val It looks like this answer assumes a structured and evenly-spaced grid of points in space. I cannot make this work for my problem, since I have unstructured spatial data points. Thanks very much; I think your answer is still very useful.
– kilojoules, Dec 31, 2020 at 2:30

I hope I understood your problem. See if this works for you.

When defining the keyword arguments for read_csv, note that it is better to use delim_whitespace=True instead of sep=" ". This avoids creating spurious columns if you have double spaces somewhere.

I am telling read_csv that time, x, y, and z are all index columns, and then converting to xarray. It will automatically structure your unstructured data and fill the holes with NaN. Then I concatenate all the xarray objects into a single object along time.

from glob import glob
fnames = glob('*.dat')
fnames.sort()
kw = dict(delim_whitespace=True,index_col=['time','x','y','z'])
ds = xr.concat([pd.read_csv(fname,**kw).to_xarray() for fname in fnames],'time')

The final result is an xarray Dataset with dimensions time, x, y, and z, with NaN wherever a grid point has no data. Now you can do everything with this object.

ds.max(['x','y','z']).std('time') will return the standard deviation in time of the spatial maximum value for all variables (in this case, only the value column). Beware that you may sometimes have to pass skipna=True to avoid NaN outputs from your analysis, as shown below.
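For the ordering stated in the question (the maximum across space of the standard deviation in time), that would instead be, with skipna handling the NaN fill values:

ds['value'].std('time', skipna=True).max(['x', 'y', 'z'], skipna=True)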

Please let me know if that solves your problem, and I would be glad to adapt it if it does not tackle some specific issue you are having with your data.
