I have several comma-separated data files that I want to load into an xarray dataset. Each row in each file represents a different spatial value of a field in a fixed grid, and every file represents a different point in time. The grid spacing is fixed and unchanging in time. The spacing of the grid is not uniform. The ultimate goal is to compute
max_{x, y} { std_t[ value(x, y, t) * sqrt(x ** 2 + y ** 2) ] }
where sqrt is the square root, std_t is the standard deviation with respect to time, and max_{x, y} is the maximum across all space.
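To make the target quantity concrete, here is a minimal NumPy sketch on a toy grid. The array names, the grid values and the choice of the population standard deviation (ddof=0) are assumptions for illustration only, not part of the real data.

```python
import numpy as np

# Hypothetical toy grid: 2 x-points and 2 y-points (not uniformly spaced).
x = np.array([0.0, 3.0])
y = np.array([0.0, 4.0])

# Hypothetical field with 3 time steps, shape (time, x, y).
# At every grid point the value is simply t = 0, 1, 2.
value = np.arange(3.0).reshape(3, 1, 1) * np.ones((3, 2, 2))

# sqrt(x**2 + y**2) on the grid, shape (x, y); broadcasts against (time, x, y).
radius = np.sqrt(x[:, None] ** 2 + y[None, :] ** 2)

# Standard deviation over time (axis 0), then the maximum over all of space.
std_t = (value * radius).std(axis=0)
result = std_t.max()
print(result)
```

The largest radius on this grid is 5 (at x=3, y=4), so the result is 5 times the standard deviation of [0, 1, 2].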
I am having trouble loading the data. It is not clear to me how one is supposed to load several CSV files into an xarray dataset. There is an open_mfdataset function, which is designed for loading several data files into a dataset, but it seems to expect hdf5 or netcdf files.
It seems like there is no way to load regular CSV files into an xarray dataset, and that preprocessing the data is necessary. In my example, I decided to preprocess the csv files to hdf5 files beforehand, to make use of the h5netcdf engine. This has created what appears to be an hdf5-specific problem for me.
Below is my best attempt at loading the data so far. Unfortunately, it results in an empty xarray dataset. I tried several options in the open_mfdataset function, and the following code is only one of several attempts at using it.
How can I load these csv files into a single xarray dataset, to set myself up to find the maximum across space of the standard deviation in time of the value of interest?
import xarray as xr
import numpy as np
import pandas as pd
# Create example files
# - Each file contains a spatially dependent value, f(x, y)
# - Each file represents a different point in time, f(x, y, t)
for ii in range(7):
    # create csv file
    fl = open('exampleFile%i.dat' % ii, 'w')
    fl.write('time x1 x2 value\n')
    for xx in range(10):
        for yy in range(10):
            fl.write('%i %i %i %i\n' %
                     (ii, xx, yy, (xx - yy) * np.exp(ii)))
    fl.close()
    # convert csv to hdf5 (the files are space-delimited, so pass sep explicitly)
    dat = pd.read_csv('exampleFile%i.dat' % ii, sep=' ')
    dat.to_hdf('exampleFile%i.hdf5' % ii, key='data', mode='w')

# Read all files into an xarray dataset
# (the ultimate goal is to find the
# maximum across space of
# the standard deviation across time
# of the "value" column)
result = xr.open_mfdataset('exampleFile*.hdf5', engine='h5netcdf', combine='nested')
... When I run the code, the result variable does not appear to contain the desired data:
In: result
<xarray.Dataset>
Dimensions:  ()
Data variables:
    *empty*
Attributes:
    PYTABLES_FORMAT_VERSION:  2.1
    TITLE:                    Empty(dtype=dtype('S1'))
    VERSION:                  1.0
An answer was posted that assumes a uniformly spaced spatial grid. Here is a slightly modified example that does not assume an evenly-spaced grid of spatial points.
The example also assumes three spatial dimensions. That is more true to my real problem, and I realized that might be an important detail in this simple example.
import xarray as xr
import numpy as np
import pandas as pd
# Create example files
# - Each file contains a spatially dependent value, f(x, y, z)
# - Each file represents a different point in time, f(x, y, z, t)
for ii in range(7):
    # create csv file
    fl = open('exampleFile%i.dat' % ii, 'w')
    fl.write('time x y z value\n')
    for xx in range(10):
        for yy in range(int(10 + xx // 2)):
            for zz in range(int(10 + xx // 3 + yy // 3)):
                fl.write('%i %f %f %f %f\n' %
                         (ii, xx * np.exp(-1 * yy * zz), yy * np.exp(xx - zz),
                          zz * np.exp(xx * yy), (xx - yy) * np.exp(ii)))
    fl.close()
    # convert csv to hdf5 (the files are space-delimited, so pass sep explicitly)
    dat = pd.read_csv('exampleFile%i.dat' % ii, sep=' ')
    dat.to_hdf('exampleFile%i.hdf5' % ii, key='data', mode='w')

# Read all files into an xarray dataset
# (the ultimate goal is to find the
# maximum across space of
# the standard deviation across time
# of the "value" column)
result = xr.open_mfdataset('exampleFile*.hdf5', engine='h5netcdf', combine='nested')
My approach would be to write a parsing function that converts each CSV into an xarray.Dataset. This way you can use xarray.concat to combine them into a final dataset, on which you can perform your computations.
The following works with your example data:
from glob import glob

import pandas as pd
import xarray as xr

def csv2xr(csv, sep=" "):
    df = pd.read_csv(csv, sep=sep)
    # unique() keeps the order of first appearance, matching the row order
    x = df.x1.unique()
    y = df.x2.unique()
    # rows are written with x varying slowest, so reshape to (time, x, y)
    pix = df.value.values.reshape(1, x.size, y.size)
    ds = xr.Dataset({
        "value": xr.DataArray(
            data=pix,
            dims=['time', 'x', 'y'],
            coords={"time": df.time.unique(), "x": x, "y": y})
    })
    return ds

csvs = glob("*dat")
ds_full = xr.concat([csv2xr(x) for x in csvs], dim="time")
print(ds_full)
#<xarray.Dataset>
# Dimensions:  (time: 7, x: 10, y: 10)
# Coordinates:
#   * time     (time) int64 4 3 2 0 1 6 5
#   * x        (x) int64 0 1 2 3 4 5 6 7 8 9
#   * y        (y) int64 0 1 2 3 4 5 6 7 8 9
# Data variables:
#     value    (time, x, y) int64 0 -54 -109 -163 -218 -272 ... 593 445 296 148 0
Then, to get the max of the std over time:
ds_full.std("time").max()
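The expression above takes the standard deviation of the raw value. The quantity in the question also weights each point by sqrt(x ** 2 + y ** 2) first. A minimal sketch of that weighting in xarray follows, using a small hypothetical dataset (ds_demo, with made-up coordinates and values) as a stand-in for ds_full, and assuming xarray's default population standard deviation (ddof=0).

```python
import numpy as np
import xarray as xr

# Hypothetical stand-in for ds_full: 4 time steps on a 2 x 2 grid,
# where the value at every grid point is simply t = 0, 1, 2, 3.
ds_demo = xr.Dataset(
    {"value": (("time", "x", "y"),
               np.arange(4.0).reshape(4, 1, 1) * np.ones((4, 2, 2)))},
    coords={"time": [0, 1, 2, 3], "x": [0.0, 6.0], "y": [0.0, 8.0]},
)

# The x and y coordinates broadcast against each other to an (x, y) array.
radius = np.sqrt(ds_demo.x ** 2 + ds_demo.y ** 2)

# Weight the field, take the std over time, then the max over all of space.
weighted = ds_demo.value * radius
result = weighted.std("time").max()
print(float(result))
```

Because xarray aligns by dimension name, the (x, y) weight is applied to every time step without any manual reshaping.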
I hope I understood your problem. See if this works for you.
When defining the keyword arguments for read_csv, note that it is better to use delim_whitespace=True instead of sep=" ". This avoids spurious extra columns wherever the file contains double spaces. (In recent pandas releases, where delim_whitespace is deprecated, sep=r"\s+" is the equivalent.)
I am telling read_csv (via index_col) that time, x, y and z are all coordinates, and then converting the result to xarray. This automatically structures your unstructured data and fills the holes with NaN. Then I concatenate all the xarray objects into a single object along time.
from glob import glob
fnames = glob('*.dat')
fnames.sort()
kw = dict(delim_whitespace=True,index_col=['time','x','y','z'])
ds = xr.concat([pd.read_csv(fname,**kw).to_xarray() for fname in fnames],'time')
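To see the NaN filling on a small scale, here is a self-contained sketch of the same read_csv plus to_xarray step. It reads hypothetical in-memory text instead of files and uses sep=r"\s+" as the whitespace separator; the three sample rows are made up for illustration.

```python
import io

import pandas as pd

# Hypothetical file content: the (x=1, y=1) grid point is missing.
text = (
    "time x y value\n"
    "0 0.0 0.0 1.0\n"
    "0 0.0 1.0 2.0\n"
    "0 1.0 0.0 3.0\n"
)

# The index columns become the dimensions of the resulting Dataset.
df = pd.read_csv(io.StringIO(text), sep=r"\s+", index_col=["time", "x", "y"])
ds = df.to_xarray()

# The full time/x/y product grid is built; missing combinations become NaN.
n_missing = int(ds.value.isnull().sum())
print(dict(ds.sizes), n_missing)
```

On a strongly non-uniform grid the product of all unique x, y and z values can be much larger than the number of rows, so the resulting arrays may be mostly NaN and memory-hungry.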
The final result is a single xarray object containing the data from all files. Now you can do everything with this object.
ds.max(['x','y','z']).std('time')
will return the standard deviation in time of the spatial maximum value for all variables (in this case, only the value column). Beware that you may sometimes have to pass skipna=True to avoid NaN outputs from your analysis.
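Note that the order of the two reductions matters: the standard deviation in time of the spatial maximum is generally not the same as the spatial maximum of the standard deviation in time, which is what the question asks for. A minimal sketch with a hypothetical toy array (the name da and its values are made up for illustration):

```python
import xarray as xr

# Hypothetical toy field: 3 time steps at 2 spatial points.
# Point x=0 varies in time; point x=1 is constant but always the largest.
da = xr.DataArray(
    [[0.0, 10.0], [2.0, 10.0], [4.0, 10.0]],
    dims=["time", "x"],
    coords={"time": [0, 1, 2], "x": [0.0, 1.0]},
)

# Spatial maximum first, then std in time: the maximum is always 10.
std_of_max = da.max("x").std("time")

# Std in time first, then spatial maximum: picks up the variation at x=0.
max_of_std = da.std("time").max("x")

print(float(std_of_max), float(max_of_std))
```

For the ds object above, the second ordering would read ds.std('time').max(['x','y','z']).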
Please let me know if that solves your problem. I would be glad to adapt it if it does not address some specific issue you are having with your data.