WARNING! THIS PACKAGE IS IN ACTIVE DEVELOPMENT AND IS NOT YET STABLE!

PAL-flavoured Datatree#

The xarray Datatree is used as the core data structure for SwarmPAL. You can think of it as a file directory (a tree) containing an arbitrary number of related xarray datasets. Data can be fetched from different sources (including VirES) and stored in a Datatree.

PalDataItem provides tools to construct an xarray.Dataset from different sources (VirES, HAPI, etc.). create_paldata helps construct a Datatree from a set of those datasets.

import datetime as dt

Fetching data#

from swarmpal.io import create_paldata, PalDataItem

from VirES API#

# Set of options which are passed to viresclient
data_params = dict(
    collection="SW_OPER_MAGA_LR_1B",
    measurements=["B_NEC"],
    models=["IGRF"],
    start_time="2016-01-01T00:00:00",
    end_time="2016-01-01T03:00:00",
    # start_time=dt.datetime(2016, 1, 1),  # Can use ISO string or datetime
    # end_time=dt.datetime(2016, 1, 1, 3),
    server_url="https://vires.services/ows",
    options=dict(asynchronous=False, show_progress=False),
)
# create_paldata takes an arbitrary number of args & kwargs
# If using args, dataset names will be used as tree names
# If using kwargs, user specifies the tree name/path
data = create_paldata(PalDataItem.from_vires(**data_params))
print(data)
DataTree('paldata', parent=None)
└── DataTree('SW_OPER_MAGA_LR_1B')
        Dimensions:     (Timestamp: 10800, NEC: 3)
        Coordinates:
          * Timestamp   (Timestamp) datetime64[ns] 2016-01-01 ... 2016-01-01T02:59:59
          * NEC         (NEC) <U1 'N' 'E' 'C'
        Data variables:
            Spacecraft  (Timestamp) object 'A' 'A' 'A' 'A' 'A' ... 'A' 'A' 'A' 'A' 'A'
            B_NEC_IGRF  (Timestamp, NEC) float64 -1.578e+03 -1.031e+04 ... -2.564e+04
            Radius      (Timestamp) float64 6.834e+06 6.834e+06 ... 6.833e+06 6.833e+06
            Latitude    (Timestamp) float64 -72.5 -72.56 -72.63 ... -44.9 -44.97 -45.03
            B_NEC       (Timestamp, NEC) float64 -1.581e+03 -1.049e+04 ... -2.564e+04
            Longitude   (Timestamp) float64 92.79 92.82 92.85 ... 41.83 41.83 41.83
        Attributes:
            Sources:         ['SW_OPER_AUX_IGR_2__19000101T000000_20241231T235959_010...
            MagneticModels:  ['IGRF = IGRF(max_degree=13,min_degree=1)']
            AppliedFilters:  []
            PAL_meta:        {"analysis_window": ["2016-01-01T00:00:00", "2016-01-01T...
# Interactive view of the datatree
data
<xarray.DatasetView>
Dimensions:  ()
Data variables:
    *empty*
# Refer to a branch of the tree like:
data["SW_OPER_MAGA_LR_1B"]
<xarray.DatasetView>
Dimensions:     (Timestamp: 10800, NEC: 3)
Coordinates:
  * Timestamp   (Timestamp) datetime64[ns] 2016-01-01 ... 2016-01-01T02:59:59
  * NEC         (NEC) <U1 'N' 'E' 'C'
Data variables:
    Spacecraft  (Timestamp) object 'A' 'A' 'A' 'A' 'A' ... 'A' 'A' 'A' 'A' 'A'
    B_NEC_IGRF  (Timestamp, NEC) float64 -1.578e+03 -1.031e+04 ... -2.564e+04
    Radius      (Timestamp) float64 6.834e+06 6.834e+06 ... 6.833e+06 6.833e+06
    Latitude    (Timestamp) float64 -72.5 -72.56 -72.63 ... -44.9 -44.97 -45.03
    B_NEC       (Timestamp, NEC) float64 -1.581e+03 -1.049e+04 ... -2.564e+04
    Longitude   (Timestamp) float64 92.79 92.82 92.85 ... 41.83 41.83 41.83
Attributes:
    Sources:         ['SW_OPER_AUX_IGR_2__19000101T000000_20241231T235959_010...
    MagneticModels:  ['IGRF = IGRF(max_degree=13,min_degree=1)']
    AppliedFilters:  []
    PAL_meta:        {"analysis_window": ["2016-01-01T00:00:00", "2016-01-01T...
# Note that the above is actually a Datatree object
# To get a view of the Dataset:
data["SW_OPER_MAGA_LR_1B"].ds
<xarray.DatasetView>
Dimensions:     (Timestamp: 10800, NEC: 3)
Coordinates:
  * Timestamp   (Timestamp) datetime64[ns] 2016-01-01 ... 2016-01-01T02:59:59
  * NEC         (NEC) <U1 'N' 'E' 'C'
Data variables:
    Spacecraft  (Timestamp) object 'A' 'A' 'A' 'A' 'A' ... 'A' 'A' 'A' 'A' 'A'
    B_NEC_IGRF  (Timestamp, NEC) float64 -1.578e+03 -1.031e+04 ... -2.564e+04
    Radius      (Timestamp) float64 6.834e+06 6.834e+06 ... 6.833e+06 6.833e+06
    Latitude    (Timestamp) float64 -72.5 -72.56 -72.63 ... -44.9 -44.97 -45.03
    B_NEC       (Timestamp, NEC) float64 -1.581e+03 -1.049e+04 ... -2.564e+04
    Longitude   (Timestamp) float64 92.79 92.82 92.85 ... 41.83 41.83 41.83
Attributes:
    Sources:         ['SW_OPER_AUX_IGR_2__19000101T000000_20241231T235959_010...
    MagneticModels:  ['IGRF = IGRF(max_degree=13,min_degree=1)']
    AppliedFilters:  []
    PAL_meta:        {"analysis_window": ["2016-01-01T00:00:00", "2016-01-01T...

swarmpal accessor#

The behaviour of the datatree is extended by an “accessor” that exposes SwarmPAL functionality under the .swarmpal namespace, e.g.:

# Metadata related to the SwarmPAL framework
data.swarmpal.pal_meta
{'.': {},
 'SW_OPER_MAGA_LR_1B': {'analysis_window': ['2016-01-01T00:00:00',
   '2016-01-01T03:00:00'],
  'magnetic_models': {'IGRF': 'IGRF(max_degree=13,min_degree=1)'}}}
data.swarmpal.magnetic_model_name
'IGRF'

The above properties are constructed from metadata stored within the datatree itself (as a JSON string):

data["SW_OPER_MAGA_LR_1B"].attrs["PAL_meta"]
'{"analysis_window": ["2016-01-01T00:00:00", "2016-01-01T03:00:00"], "magnetic_models": {"IGRF": "IGRF(max_degree=13,min_degree=1)"}}'
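Since "PAL_meta" is stored as a plain JSON string in the attributes, it can be decoded with the standard library. A minimal sketch, using the string shown above:

```python
import json

# The "PAL_meta" JSON string as it appears in the datatree attrs above
pal_meta_raw = (
    '{"analysis_window": ["2016-01-01T00:00:00", "2016-01-01T03:00:00"], '
    '"magnetic_models": {"IGRF": "IGRF(max_degree=13,min_degree=1)"}}'
)
pal_meta = json.loads(pal_meta_raw)
print(pal_meta["analysis_window"])  # ['2016-01-01T00:00:00', '2016-01-01T03:00:00']
print(pal_meta["magnetic_models"])  # {'IGRF': 'IGRF(max_degree=13,min_degree=1)'}
```

This is, in effect, what the .swarmpal.pal_meta property shown above provides across the whole tree.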

It is possible to add more complex methods that work on the datasets:

data["SW_OPER_MAGA_LR_1B"].swarmpal.magnetic_residual()
<xarray.DataArray (Timestamp: 10800, NEC: 3)>
array([[  -3.50616551, -184.14602906,  -75.42492058],
       [  -4.01971892, -185.35397795,  -74.77706063],
       [  -3.76127931, -185.80363109,  -74.31929219],
       ...,
       [ -12.14993446,    6.86291637,    0.72439786],
       [ -12.1937803 ,    6.89781389,    1.02280304],
       [ -12.21376055,    6.82121016,    1.26699041]])
Coordinates:
  * Timestamp  (Timestamp) datetime64[ns] 2016-01-01 ... 2016-01-01T02:59:59
  * NEC        (NEC) <U1 'N' 'E' 'C'
Attributes:
    units:        nT
    description:  Magnetic field vector data-model residual, NEC frame

Defining and running a PalProcess#

Processes that act on datatrees obtained as above are defined by subclassing the abstract PalProcess class.

from swarmpal.io import PalProcess
help(PalProcess)
Help on class PalProcess in module swarmpal.io._paldata:

class PalProcess(abc.ABC)
 |  PalProcess(config: 'dict | None' = None, active_tree: 'str' = '/', inplace: 'bool' = True)
 |  
 |  Abstract class to define processes to act on datatrees
 |  
 |  Method resolution order:
 |      PalProcess
 |      abc.ABC
 |      builtins.object
 |  
 |  Methods defined here:
 |  
 |  __call__(self, datatree) -> 'DataTree'
 |      Run the process, defined in _call, to update the datatree
 |  
 |  __init__(self, config: 'dict | None' = None, active_tree: 'str' = '/', inplace: 'bool' = True)
 |      Initialize self.  See help(type(self)) for accurate signature.
 |  
 |  set_config(self, **kwargs) -> 'None'
 |  
 |  ----------------------------------------------------------------------
 |  Readonly properties defined here:
 |  
 |  active_tree
 |      Defines which branch of the datatree will be used
 |  
 |  process_name
 |  
 |  ----------------------------------------------------------------------
 |  Data descriptors defined here:
 |  
 |  __dict__
 |      dictionary for instance variables (if defined)
 |  
 |  __weakref__
 |      list of weak references to the object (if defined)
 |  
 |  config
 |      Dictionary that configures the process behaviour
 |  
 |  ----------------------------------------------------------------------
 |  Data and other attributes defined here:
 |  
 |  __abstractmethods__ = frozenset({'_call', 'process_name', 'set_config'...

Here is an example of defining a process. Still subject to change!

Three abstract methods must be implemented:

  • process_name identifies the process, and is used to update the "PAL_meta" attribute in the datatree when the process is applied.

  • set_config takes keyword arguments and stores them as a dict in the config property.

  • _call defines the behaviour of the process itself; it should accept the input datatree and return a modified datatree.

When a process object is instantiated, the user optionally provides two arguments which are set as properties of the process:

  • active_tree (str) selects which branch of the tree is to be used

  • config (dict) provides parameters to control the behaviour of the process

The config can also be provided using .set_config() after the process object is created. This lets the process provide and document default configurations, as well as allowing the IDE to hint at the available configuration options.
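The two configuration routes can be sketched with a stand-alone toy class (MyDemoProcess is hypothetical and does no real work; actual processes subclass PalProcess):

```python
class MyDemoProcess:
    """Toy stand-in illustrating the two configuration routes"""

    def __init__(self, config=None):
        self.config = config or {}

    def set_config(self, dataset="SW_OPER_MAGA_LR_1B", parameter="B_NEC"):
        # Defaults in the signature document the configuration
        # and give the IDE something to hint with
        self.config = dict(dataset=dataset, parameter=parameter)


# Route 1: pass the config dict at instantiation
p1 = MyDemoProcess(config={"dataset": "SW_OPER_MAGA_LR_1B", "parameter": "B_NEC"})
# Route 2: call set_config afterwards, relying on the documented defaults
p2 = MyDemoProcess()
p2.set_config()
assert p1.config == p2.config
```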

from datatree import DataTree
from xarray import Dataset


class MyProcess(PalProcess):
    """Compute the first differences on a given variable"""

    @property
    def process_name(self):
        return "MyProcess"

    def set_config(self, dataset="SW_OPER_MAGA_LR_1B", parameter="B_NEC"):
        self.config = dict(dataset=dataset, parameter=parameter)

    def _call(self, datatree):
        # Identify inputs for algorithm
        subtree = datatree[f"{self.config.get('dataset')}"]
        dataset = subtree.ds
        parameter = self.config.get("parameter")
        # Apply the algorithm
        output_data = dataset[parameter].diff(dim="Timestamp")
        # Create an output dataset
        data_out = Dataset(
            data_vars={
                f"d/dt ({parameter})": output_data,
            }
        )
        # Write the output into a new path in the datatree and return it
        subtree["output"] = DataTree(data=data_out)
        return datatree

The process can now be created with some configuration:

process = MyProcess(
    config={"dataset": "SW_OPER_MAGA_LR_1B", "parameter": "B_NEC"},
)

…and there is a tool to apply this process to the datatree:

data = data.swarmpal.apply(process)
print(data)
DataTree('paldata', parent=None)
│   Dimensions:  ()
│   Data variables:
│       *empty*
│   Attributes:
│       PAL_meta:  {"MyProcess": {"dataset": "SW_OPER_MAGA_LR_1B", "parameter": "...
└── DataTree('SW_OPER_MAGA_LR_1B')
    │   Dimensions:     (Timestamp: 10800, NEC: 3)
    │   Coordinates:
    │     * Timestamp   (Timestamp) datetime64[ns] 2016-01-01 ... 2016-01-01T02:59:59
    │     * NEC         (NEC) <U1 'N' 'E' 'C'
    │   Data variables:
    │       Spacecraft  (Timestamp) object 'A' 'A' 'A' 'A' 'A' ... 'A' 'A' 'A' 'A' 'A'
    │       B_NEC_IGRF  (Timestamp, NEC) float64 -1.578e+03 -1.031e+04 ... -2.564e+04
    │       Radius      (Timestamp) float64 6.834e+06 6.834e+06 ... 6.833e+06 6.833e+06
    │       Latitude    (Timestamp) float64 -72.5 -72.56 -72.63 ... -44.9 -44.97 -45.03
    │       B_NEC       (Timestamp, NEC) float64 -1.581e+03 -1.049e+04 ... -2.564e+04
    │       Longitude   (Timestamp) float64 92.79 92.82 92.85 ... 41.83 41.83 41.83
    │   Attributes:
    │       Sources:         ['SW_OPER_AUX_IGR_2__19000101T000000_20241231T235959_010...
    │       MagneticModels:  ['IGRF = IGRF(max_degree=13,min_degree=1)']
    │       AppliedFilters:  []
    │       PAL_meta:        {"analysis_window": ["2016-01-01T00:00:00", "2016-01-01T...
    └── DataTree('output')
            Dimensions:       (Timestamp: 10799, NEC: 3)
            Coordinates:
              * Timestamp     (Timestamp) datetime64[ns] 2016-01-01T00:00:01 ... 2016-01-...
              * NEC           (NEC) <U1 'N' 'E' 'C'
            Data variables:
                d/dt (B_NEC)  (Timestamp, NEC) float64 -26.66 0.796 3.476 ... -10.19 -12.09

The resulting data can be interrogated with the usual tools (in this case we added a new dataset to the tree under "SW_OPER_MAGA_LR_1B/output"):

data["SW_OPER_MAGA_LR_1B/output"].ds["d/dt (B_NEC)"].plot.line(x="Timestamp");
[Figure: line plot of the three d/dt (B_NEC) components against Timestamp]

… and the datatree carries with it the metadata about the process which has been applied:

data.swarmpal.pal_meta
{'.': {'MyProcess': {'dataset': 'SW_OPER_MAGA_LR_1B', 'parameter': 'B_NEC'}},
 'SW_OPER_MAGA_LR_1B': {'analysis_window': ['2016-01-01T00:00:00',
   '2016-01-01T03:00:00'],
  'magnetic_models': {'IGRF': 'IGRF(max_degree=13,min_degree=1)'}},
 'SW_OPER_MAGA_LR_1B/output': {}}

More tricks with create_paldata#

Fetching data from HAPI#

Two differences from using VirES:

  • Parameters follow the scheme in hapiclient
    Example: http://hapi-server.org/servers/#server=VirES-for-Swarm&dataset=SW_OPER_MAGA_LR_1B&parameters=B_NEC&start=2016-01-01T00:00:00&stop=2016-01-01T03:00:00&return=script&format=python

  • The output dataset is not identical to that retrieved from VirES (the variables and their contents are the same, but there is less metadata, etc.)

data_params = dict(
    server="https://vires.services/hapi",
    dataset="SW_OPER_MAGA_LR_1B",
    parameters="B_NEC",
    start="2016-01-01T00:00:00",
    stop="2016-01-01T03:00:00",
)
data_hapi = create_paldata(alpha_hapi=PalDataItem.from_hapi(**data_params))
print(data_hapi)
DataTree('paldata', parent=None)
└── DataTree('alpha_hapi')
        Dimensions:    (Timestamp: 10800, B_NEC_dim1: 3)
        Coordinates:
          * Timestamp  (Timestamp) datetime64[ns] 2016-01-01 ... 2016-01-01T02:59:59
        Dimensions without coordinates: B_NEC_dim1
        Data variables:
            B_NEC      (Timestamp, B_NEC_dim1) float64 -1.581e+03 ... -2.564e+04
        Attributes:
            PAL_meta:  {"analysis_window": ["2016-01-01T00:00:00", "2016-01-01T03:00:...

Time padding#

A tuple of timedelta objects can be given as the extra pad_times parameter. This extends the retrieved time interval, while the original time interval is stored as "analysis_window" within the "PAL_meta" attribute.
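The effect of the padding on the fetch window can be sketched with plain datetime arithmetic (assuming pad_times simply widens the interval on each side, consistent with the output below):

```python
import datetime as dt

start = dt.datetime(2016, 1, 1, 0, 0, 0)
stop = dt.datetime(2016, 1, 1, 3, 0, 0)
pad_before, pad_after = dt.timedelta(hours=1), dt.timedelta(hours=1)

# The analysis window (start, stop) is preserved in "PAL_meta",
# while the data are fetched over the wider interval:
fetch_start = start - pad_before
fetch_stop = stop + pad_after
print(fetch_start, fetch_stop)  # 2015-12-31 23:00:00 2016-01-01 04:00:00
# 5 hours of 1 Hz data -> 18000 samples
```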

data_params = dict(
    server="https://vires.services/hapi",
    dataset="SW_OPER_MAGA_LR_1B",
    parameters="B_NEC",
    start="2016-01-01T00:00:00",
    stop="2016-01-01T03:00:00",
    pad_times=(dt.timedelta(hours=1), dt.timedelta(hours=1)),
)
data_hapi = create_paldata(alpha_hapi=PalDataItem.from_hapi(**data_params))
print(data_hapi)
DataTree('paldata', parent=None)
└── DataTree('alpha_hapi')
        Dimensions:    (Timestamp: 18000, B_NEC_dim1: 3)
        Coordinates:
          * Timestamp  (Timestamp) datetime64[ns] 2015-12-31T23:00:00 ... 2016-01-01T...
        Dimensions without coordinates: B_NEC_dim1
        Data variables:
            B_NEC      (Timestamp, B_NEC_dim1) float64 2.08e+04 -2.121e+03 ... 4.618e+04
        Attributes:
            PAL_meta:  {"analysis_window": ["2016-01-01T00:00:00", "2016-01-01T03:00:...