Data

Converters

RML2016

Data loaders for the RML2016.10x open source datasets provided by DeepSig, Inc.

rfml.data.converters.rml_2016.load_RML201610A_dataset(path: str = None) → rfml.data.dataset.Dataset

Load the RadioML2016.10A Dataset provided by DeepSig Inc.

This dataset is licensed under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 License (CC BY-NC-SA 4.0) by DeepSig Inc.

Parameters: path (str, optional) – Path to the dataset, which must already have been downloaded from DeepSig Inc., saved locally, and extracted (tar xjf). If not provided, the loader attempts to download the dataset from the internet and save it locally; subsequent calls then read from that cached copy.
Raises: ValueError – If path is provided but does not exist.
Returns: A Dataset that has been loaded with the data from RML2016.10A
Return type: Dataset

License: https://creativecommons.org/licenses/by-nc-sa/4.0/legalcode
Download Location: https://www.deepsig.io/datasets
Citation: T. J. O’Shea and N. West, “Radio machine learning dataset generation with GNU Radio” in Proceedings of the GNU Radio Conference, vol. 1, 2016.

rfml.data.converters.rml_2016.load_RML201610B_dataset(path: str = None) → rfml.data.dataset.Dataset

Load the RadioML2016.10B Dataset provided by DeepSig Inc.

This dataset is licensed under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 License (CC BY-NC-SA 4.0) by DeepSig Inc.

Parameters: path (str, optional) – Path to the dataset, which must already have been downloaded from DeepSig Inc., saved locally, and extracted (tar xjf). If not provided, the loader attempts to download the dataset from the internet and save it locally; subsequent calls then read from that cached copy.
Raises: ValueError – If path is provided but does not exist.
Returns: A Dataset that has been loaded with the data from RML2016.10B
Return type: Dataset

License: https://creativecommons.org/licenses/by-nc-sa/4.0/legalcode
Download Location: https://www.deepsig.io/datasets
Citation: T. J. O’Shea and N. West, “Radio machine learning dataset generation with GNU Radio” in Proceedings of the GNU Radio Conference, vol. 1, 2016.
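
Both loaders expect path to point at a dataset that has already been extracted with tar xjf. For readers who would rather stay in Python, the following is a minimal sketch of an equivalent extraction using the standard library's tarfile module. The helper name extract_bz2_archive and the file names in the demonstration are hypothetical, and nothing here describes how rfml itself handles archives.

```python
import tarfile
import tempfile
from pathlib import Path


def extract_bz2_archive(archive: Path, dest: Path) -> list:
    """Extract a bzip2-compressed tarball, similar to `tar xjf archive -C dest`."""
    dest.mkdir(parents=True, exist_ok=True)
    # Use the safer "data" extraction filter when this Python version has it.
    kwargs = {"filter": "data"} if hasattr(tarfile, "data_filter") else {}
    with tarfile.open(archive, mode="r:bz2") as tar:
        names = tar.getnames()
        tar.extractall(path=dest, **kwargs)
    return names


# Demonstration on a throwaway archive (all file names are hypothetical).
with tempfile.TemporaryDirectory() as tmp:
    tmp_path = Path(tmp)
    payload = tmp_path / "example.dat"
    payload.write_bytes(b"placeholder")

    archive = tmp_path / "example.tar.bz2"
    with tarfile.open(archive, mode="w:bz2") as tar:
        tar.add(payload, arcname="example.dat")

    extracted = extract_bz2_archive(archive, tmp_path / "out")
    extracted_ok = (tmp_path / "out" / "example.dat").read_bytes() == b"placeholder"
```

The location of the extracted data can then be passed as path to either loader. Only extract archives obtained from a source you trust.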

Dataset Builder

Provide a builder pattern for the creation of a dataset.

class rfml.data.dataset_builder.DatasetBuilder(n: int = None, keys: Set[str] = None, defaults: Dict[str, Union[str, int, float]] = {})

Builder pattern for programmatic creation of a Dataset

Parameters

n (int, optional) – Length of the time window (number of samples) that each entry in the dataset should have. If it is not provided, then it is inferred from the first added example. Defaults to None.
keys (Set[str], optional) – A set of column headers that will be included as metadata for all examples. If it is not provided, then it is inferred from the first added example. Subsequent examples that are added must either have all of these keys provided as metadata or they must be defined in the defaults below. Defaults to None.
defaults (Dict[str, Union[str, int, float]], optional) – A mapping of default metadata values that will be included for each example if they aren’t overridden. Defaults to dict().

Examples

>>> import numpy as np
>>> from rfml.data.dataset_builder import DatasetBuilder
>>> iq = np.zeros((2, 1024))
>>> db = DatasetBuilder()
>>> db.add(iq, Modulation="BPSK")
>>> db.add(iq, Modulation="QPSK")
>>> dataset = db.build()

Raises

ValueError – If both keys and defaults are provided, but defaults contains keys that are not in keys.
ValueError – If n is negative or 0.

See also

rfml.data.Dataset

add(iq: numpy.ndarray, **kwargs) → rfml.data.dataset_builder.DatasetBuilder

Add a new example to the Dataset that is being built.

Parameters

iq (np.ndarray) – A (2xN) array of IQ samples.
**kwargs – Each key=value pair is included as metadata for this example.

Returns

The builder itself (self), so that calls can be chained.

Return type

DatasetBuilder

Raises

ValueError – If the IQ data does not match the expected shape. It should be (2xN), where N was either provided during construction of this builder or inferred from the first example added.
ValueError – If any of the necessary metadata values is missing from kwargs and has no default. The necessary metadata keys are either provided during construction of this builder or inferred from the first example added.

build() → rfml.data.dataset.Dataset

Build the Dataset based on the examples that have been added.

Returns: A compiled dataset consisting of the added examples.
Return type: Dataset
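
The validation and inference rules documented for the builder can be illustrated with a small standard-library stand-in. This is a sketch only: SketchBuilder, the "IQ" column name, and the use of plain lists instead of NumPy arrays are assumptions made for illustration and do not reflect the actual rfml implementation.

```python
from typing import Any, Dict, List, Optional, Set, Tuple

IQ = Tuple[List[float], List[float]]  # (I samples, Q samples), i.e. 2xN


class SketchBuilder:
    """Illustrative stand-in for the documented builder rules (not rfml code)."""

    def __init__(
        self,
        n: Optional[int] = None,
        keys: Optional[Set[str]] = None,
        defaults: Optional[Dict[str, Any]] = None,
    ):
        if n is not None and n <= 0:
            raise ValueError("n must be positive")
        self.n = n
        self.keys = keys
        self.defaults = dict(defaults or {})
        if keys is not None and not set(self.defaults) <= keys:
            raise ValueError("defaults contains keys that are not in keys")
        self.rows: List[Dict[str, Any]] = []

    def add(self, iq: IQ, **metadata: Any) -> "SketchBuilder":
        if len(iq) != 2 or len(iq[0]) != len(iq[1]):
            raise ValueError("iq must be 2xN")
        if self.n is None:
            self.n = len(iq[0])  # infer N from the first example
        if len(iq[0]) != self.n:
            raise ValueError("iq must have N samples per row")

        row = {**self.defaults, **metadata}  # explicit values override defaults
        if self.keys is None:
            self.keys = set(row)  # infer the keys from the first example
        missing = self.keys - set(row)
        if missing:
            raise ValueError(f"missing metadata: {sorted(missing)}")

        row["IQ"] = iq  # hypothetical column name for the samples
        self.rows.append(row)
        return self  # returning self is what enables chaining

    def build(self) -> List[Dict[str, Any]]:
        return list(self.rows)


iq = ([0.0] * 1024, [0.0] * 1024)
rows = SketchBuilder().add(iq, Modulation="BPSK").add(iq, Modulation="QPSK").build()

# A later example that omits an inferred key is rejected.
try:
    SketchBuilder().add(iq, Modulation="BPSK").add(iq)
    raised = False
except ValueError:
    raised = True
```

Because add returns the builder, the whole dataset can be assembled in one chained expression, as in the second-to-last statement above.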

Dataset

Wrap a premade dataset inside a Pandas DataFrame.

Provide a wrapper around a Pandas DataFrame for a premade dataset that splits the classes and other distinguishing factors evenly for training, testing, and validation sets. Additionally, this module facilitates data loading from file and transformation into the format needed by Keras and PyTorch.

By using Pandas masking functionality, this module can be used to subselect parts of a dataset (e.g. training only on examples with no frequency offset, or only on a subset of modulations).

class rfml.data.dataset.Dataset(df: pandas.core.frame.DataFrame)

Provide a wrapper around a Pandas DataFrame containing a dataset.

Parameters: df (pd.DataFrame) – Pandas DataFrame that represents the dataset

as_numpy(le: rfml.data.encoder.Encoder, mask: pandas.core.generic.NDFrame.mask = None) → Tuple[numpy.ndarray, numpy.ndarray]

Encode the Dataset as a machine learning <X, Y> pair in NumPy format.

Parameters

le (Encoder) – Label encoder used to translate the label column into a format the neural network will understand (such as an index). The name of the label column is stored in the encoder.
mask (pd.DataFrame.mask, optional) – Mask to apply before creating the Machine Learning pairs. Defaults to None.

Returns

x, y

Return type

Tuple[np.ndarray, np.ndarray]

The X matrix is returned in the format (batch, channel, iq, time). The Y matrix is returned in the format (batch).

Batch corresponds to the number of examples in the dataset, channel is always 1, IQ is always 2, and time is variable length depending on how the underlying data has been sliced.
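
The (batch, channel, iq, time) layout described above can be made concrete with a short sketch. Plain nested lists stand in for NumPy arrays here so that the example has no third-party dependencies; the helper names nested_shape and to_batch_channel_iq_time, the example count, and the window length are all hypothetical.

```python
from typing import List, Tuple


def nested_shape(x) -> Tuple[int, ...]:
    """Return the shape of a regularly nested list."""
    shape = []
    while isinstance(x, list):
        shape.append(len(x))
        x = x[0]
    return tuple(shape)


def to_batch_channel_iq_time(examples) -> List:
    """Wrap each (2xN) example in a singleton channel axis."""
    return [[[list(i), list(q)]] for i, q in examples]


# Three hypothetical examples, each with 128 time samples.
examples = [([0.0] * 128, [0.0] * 128) for _ in range(3)]
x = to_batch_channel_iq_time(examples)
y = [0, 1, 0]  # one encoded label per example

x_shape = nested_shape(x)  # (3, 1, 2, 128)
```

The first axis of X and the length of Y always agree, since each example contributes exactly one label.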

Note

NumPy is the format used by Keras. Other machine learning frameworks (such as PyTorch) require a separate method for getting the data ready.

See also

rfml.data.Encoder, rfml.data.Dataset.as_torch

as_torch(le: rfml.data.encoder.Encoder, mask: pandas.core.generic.NDFrame.mask = None) → torch.utils.data.dataset.TensorDataset

Encode the Dataset as machine learning <X, Y> pairs in PyTorch format.

Parameters

le (Encoder) – Label encoder used to translate the label column into a format the neural network will understand (such as an index). The name of the label column is stored in the encoder.
mask (pd.DataFrame.mask, optional) – Mask to apply before creating the Machine Learning pairs. Defaults to None.

Returns

Dataset to be used in training or testing loops.

Return type

TensorDataset

The X matrix is returned in the format (batch, channel, iq, time). The Y matrix is returned in the format (batch).

Batch corresponds to the number of examples in the dataset, channel is always 1, IQ is always 2, and time is variable length depending on how the underlying data has been sliced.

Note

TensorDataset is the format used by PyTorch and allows for iteration in batches. For other machine learning frameworks, such as Keras, ensure you call the correct method.

See also

rfml.data.Encoder, rfml.data.Dataset.as_numpy

property columns

Return a list of the columns that are represented in the underlying DataFrame.

Returns: Column names
Return type: List[str]

property df

Directly return the underlying Pandas DataFrame containing the data.

This can then be used for mask creation.

Returns: Pandas DataFrame that represents the dataset
Return type: pd.DataFrame

get_examples_per_class(label: str = 'Modulation') → Dict[str, int]

Count the number of examples per class in this Dataset.

Parameters: label (str, optional) – Column that is used as the class label. Defaults to “Modulation”.
Returns: Count of examples (value) per label (key).
Return type: Dict[str, int]
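
As a rough illustration of the returned mapping, the count can be sketched with collections.Counter over plain dictionaries. The function name examples_per_class and the sample rows are hypothetical; the real method operates on the underlying DataFrame.

```python
from collections import Counter
from typing import Any, Dict, Iterable, Mapping


def examples_per_class(
    rows: Iterable[Mapping[str, Any]], label: str = "Modulation"
) -> Dict[str, int]:
    """Count how many rows carry each value of the label column."""
    return dict(Counter(row[label] for row in rows))


rows = [
    {"Modulation": "BPSK"},
    {"Modulation": "QPSK"},
    {"Modulation": "BPSK"},
]
counts = examples_per_class(rows)
```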

is_balanced(label: str = 'Modulation', margin: int = 0) → bool

Check if the data contained in this dataset is evenly represented by a categorical label.

Parameters

label (str, optional) – The column of the data to verify is balanced. Defaults to “Modulation”.
margin (int, optional) – Difference between the expected balance and the true balance before this check would fail. This can be useful for checking for a “fuzzy balance” that would occur if the Dataset was previously split and therefore the length of the Dataset is no longer divisible by the number of categorical labels. Defaults to 0.

Returns

True if the Dataset is balanced, False otherwise.

Return type

bool
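
One plausible reading of the margin parameter is sketched below: every class count must lie within margin of the ideal per-class share. This interpretation, and the function name balanced_within, are assumptions made for illustration; the library may define the tolerance differently.

```python
from typing import Mapping


def balanced_within(counts: Mapping[str, int], margin: int = 0) -> bool:
    """Return True if every class count is within `margin` of the ideal share."""
    if not counts:
        return True  # nothing to compare
    expected = sum(counts.values()) / len(counts)
    return all(abs(count - expected) <= margin for count in counts.values())


exact = balanced_within({"BPSK": 100, "QPSK": 100})
strict = balanced_within({"BPSK": 70, "QPSK": 69})            # off by one example
fuzzy = balanced_within({"BPSK": 70, "QPSK": 69}, margin=1)   # tolerated
```

The second and third calls show the "fuzzy balance" case: after a split, the total may no longer divide evenly among the labels, so a margin of 1 accepts what a margin of 0 rejects.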

split(frac: float = 0.3, on: Tuple[str] = None, mask: pandas.core.generic.NDFrame.mask = None) → Tuple[rfml.data.dataset.Dataset, rfml.data.dataset.Dataset]

Split this Dataset into two, placing the given fraction of the examples in the second.

Parameters

frac (float, optional) – Fraction of the Dataset to put into the second set. Defaults to 0.3.
on (Tuple[str], optional) – Collection of column names, with categorical values, to evenly split amongst the two Datasets. If provided, each categorical value will have an equal percentage representation in the returned Datasets. Defaults to None.
mask (pd.DataFrame.mask, optional) – Mask to apply before performing the split. Defaults to None.

Raises

ValueError – If frac is not in the interval (0, 1).

Returns

Two Datasets (such as train/validate)

Return type

Tuple[Dataset, Dataset]

Warning

Not providing anything for the on parameter may lead to unintended results. For instance, the resulting Datasets may have a class imbalance. This may be desired in some cases, but it's likely one would want to specify this explicitly rather than rely on randomness.

See also

Dataset.subsample
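
The effect of the on parameter can be sketched as a stratified split: rows are grouped by the listed columns and each group is divided with the same fraction. The following standard-library sketch is an illustration under assumptions (the function name stratified_split, the seed argument, and the rounding rule are hypothetical), not the library's implementation.

```python
import random
from collections import defaultdict
from typing import Any, Dict, List, Sequence, Tuple

Row = Dict[str, Any]


def stratified_split(
    rows: Sequence[Row],
    frac: float = 0.3,
    on: Tuple[str, ...] = ("Modulation",),
    seed: int = 0,
) -> Tuple[List[Row], List[Row]]:
    """Split rows in two so that each group defined by `on` keeps the same ratio."""
    if not 0.0 < frac < 1.0:
        raise ValueError("frac must be between 0 and 1 (exclusive)")

    # Group the rows by the combination of values in the `on` columns.
    groups = defaultdict(list)
    for row in rows:
        groups[tuple(row[col] for col in on)].append(row)

    rng = random.Random(seed)
    first: List[Row] = []
    second: List[Row] = []
    for members in groups.values():
        members = members[:]  # shuffle a copy, not the caller's data
        rng.shuffle(members)
        cut = round(len(members) * frac)
        second.extend(members[:cut])
        first.extend(members[cut:])
    return first, second


rows = [{"Modulation": m, "id": i} for m in ("BPSK", "QPSK") for i in range(10)]
train, test = stratified_split(rows, frac=0.3)
bpsk_in_test = sum(row["Modulation"] == "BPSK" for row in test)
```

With 10 examples per class and frac=0.3, each class contributes 3 examples to the second set, so neither output is skewed toward one modulation.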

Encoder

Simple helper class for encoding/decoding the labels for classification.

Note

While many packages like sklearn and keras provide similar functionality, they were all quite annoying and did not play well with others. Since this functionality is so simple, it's easier to just write our own implementation.

class rfml.data.encoder.Encoder(labels: Tuple[str], label_name: str)

Encode the labels as the index of the “one-hot” position, which is the format used by PyTorch.

Parameters

labels (Tuple[str]) – A collection of human readable labels that could be encountered
label_name (str) – Name of the label column in the dataset that is being categorically encoded.

Examples

"WBFM" -> 1
"QAM16" -> 6

decode(encoding: Tuple[int]) → Tuple[str]

Decode a list of machine readable labels into human readable labels.

Parameters: encoding (Tuple[int]) – A collection of machine readable labels.
Returns: A collection of human readable labels.
Return type: Tuple[str]

encode(labels: Tuple[str]) → Tuple[int]

Encode a list of human readable labels into machine readable labels.

Parameters: labels (Tuple[str]) – Human readable labels to encode.
Returns: A collection of machine readable labels.
Return type: Tuple[int]
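
The encode/decode round trip can be sketched with a plain dictionary lookup. The class SketchEncoder below is a hypothetical stand-in, and it assumes that each label is encoded as its position in the labels collection; the actual index assignment used by the library is not stated here.

```python
from typing import Sequence, Tuple


class SketchEncoder:
    """Illustrative label <-> index mapping (not the rfml implementation)."""

    def __init__(self, labels: Sequence[str], label_name: str):
        self.labels = tuple(labels)
        self.label_name = label_name
        # Assumption: a label's index is its position in `labels`.
        self._index = {label: i for i, label in enumerate(self.labels)}

    def encode(self, labels: Sequence[str]) -> Tuple[int, ...]:
        return tuple(self._index[label] for label in labels)

    def decode(self, encoding: Sequence[int]) -> Tuple[str, ...]:
        return tuple(self.labels[i] for i in encoding)


encoder = SketchEncoder(("BPSK", "QPSK", "WBFM"), label_name="Modulation")
encoded = encoder.encode(("WBFM", "BPSK"))
decoded = encoder.decode(encoded)
```

Decoding an encoded collection returns the original labels, which is what makes the indices usable for plotting or logging model predictions.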

property label_name: The name of the column in the dataset that is categorically encoded by this class.

property labels: A collection of human readable labels that could be encountered. This allows another object to extract these labels in order to plot or log them.

Factory

Simplistic factory pattern for swapping of datasets.

rfml.data.factory.build_dataset(dataset_name: str, test_pct: float = 0.3, val_pct: float = 0.05, path: str = None) → Tuple[rfml.data.dataset.Dataset, rfml.data.dataset.Dataset, rfml.data.dataset.Dataset, rfml.data.encoder.Encoder]

Opinionated factory method that allows easy loading of different datasets.

This method makes an assumption about the labels to use for each dataset – if you need more extensive control then you can call the underlying method directly.

Parameters

dataset_name (str) – Name of the dataset to load. Currently supported values are RML2016.10A and RML2016.10B.
test_pct (float, optional) – Fraction (between 0 and 1) of the entire Dataset that should be withheld as a test set. Defaults to 0.3.
val_pct (float, optional) – Fraction (between 0 and 1) of the non-testing Dataset that should be split out to use for validation in an early stopping procedure. Defaults to 0.05.
path (str, optional) – If provided, this is directly passed to the dataset converters so that they do not download the dataset from the internet (a costly operation) if you have already downloaded it yourself. Defaults to None.

Raises

ValueError – If test_pct or val_pct are not between 0 and 1 (non-inclusive).
ValueError – If the dataset_name is unknown.

Returns

train, validation, test, encoder

Return type

Tuple[Dataset, Dataset, Dataset, Encoder]
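
Note that val_pct applies to what remains after the test set is removed, not to the whole Dataset. The arithmetic can be sketched as follows; the function name partition_sizes, the total of 1000 examples, and the rounding rule are assumptions made for illustration.

```python
from typing import Tuple


def partition_sizes(
    total: int, test_pct: float = 0.3, val_pct: float = 0.05
) -> Tuple[int, int, int]:
    """Return (train, validation, test) sizes for a two-stage split."""
    for name, pct in (("test_pct", test_pct), ("val_pct", val_pct)):
        if not 0.0 < pct < 1.0:
            raise ValueError(f"{name} must be between 0 and 1 (non-inclusive)")

    n_test = round(total * test_pct)           # taken from the entire dataset
    n_val = round((total - n_test) * val_pct)  # taken from the non-test remainder
    n_train = total - n_test - n_val
    return n_train, n_val, n_test


sizes = partition_sizes(1000)  # (665, 35, 300)
```

With the default values, a hypothetical dataset of 1000 examples yields 300 test examples, 35 validation examples (5% of the remaining 700), and 665 training examples.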