Data¶
Converters¶
RML2016¶
Data loaders for the RML2016.10x open source datasets provided by DeepSig, Inc.
-
rfml.data.converters.rml_2016.
load_RML201610A_dataset
(path: str = None) → rfml.data.dataset.Dataset[source]¶ Load the RadioML2016.10A Dataset provided by DeepSig Inc.
This dataset is licensed under Creative Commons Attribution - NonCommercial - ShareAlike 4.0 License (CC BY-NC-SA 4.0) by DeepSig Inc.
- Parameters
path (str, optional) – Path to the dataset, which must already have been downloaded from DeepSig Inc., saved locally, and extracted (tar xjf). If not provided, the loader attempts to download the dataset from the internet and cache it locally; subsequent calls read from that cached copy.
- Raises
ValueError – If path is provided but does not exist.
- Returns
A Dataset that has been loaded with the data from RML2016.10A
- Return type
Dataset
- License
- Download Location
- Citation
T. J. O’Shea and N. West, “Radio machine learning dataset generation with GNU Radio” in Proceedings of the GNU Radio Conference, vol. 1, 2016.
-
rfml.data.converters.rml_2016.
load_RML201610B_dataset
(path: str = None) → rfml.data.dataset.Dataset[source]¶ Load the RadioML2016.10B Dataset provided by DeepSig Inc.
This dataset is licensed under Creative Commons Attribution - NonCommercial - ShareAlike 4.0 License (CC BY-NC-SA 4.0) by DeepSig Inc.
- Parameters
path (str, optional) – Path to the dataset, which must already have been downloaded from DeepSig Inc., saved locally, and extracted (tar xjf). If not provided, the loader attempts to download the dataset from the internet and cache it locally; subsequent calls read from that cached copy.
- Raises
ValueError – If path is provided but does not exist.
- Returns
A Dataset that has been loaded with the data from RML2016.10B
- Return type
Dataset
- License
- Download Location
- Citation
T. J. O’Shea and N. West, “Radio machine learning dataset generation with GNU Radio” in Proceedings of the GNU Radio Conference, vol. 1, 2016.
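The path handling described for both loaders can be sketched as follows. This is only an illustration of the documented contract, not the rfml implementation: resolve_dataset_path, fake_download, and the cache file name are hypothetical, and the real loaders may store the cached copy elsewhere.

```python
import os
import tempfile
from typing import Callable, List, Optional


def resolve_dataset_path(
    path: Optional[str],
    cache_path: str,
    download: Callable[[str], None],
) -> str:
    """Hypothetical helper mirroring the documented ``path`` contract."""
    if path is not None:
        # An explicit path must already exist; nothing is downloaded.
        if not os.path.exists(path):
            raise ValueError("Provided path does not exist: {}".format(path))
        return path
    # No path given: fetch once, then reuse the cached copy on later calls.
    if not os.path.exists(cache_path):
        download(cache_path)
    return cache_path


downloads: List[str] = []


def fake_download(destination: str) -> None:
    # Stand-in for the real network fetch; writes a placeholder file.
    downloads.append(destination)
    with open(destination, "w") as handle:
        handle.write("placeholder")


with tempfile.TemporaryDirectory() as tmp:
    cache = os.path.join(tmp, "dataset.pkl")  # hypothetical cache file name
    first = resolve_dataset_path(None, cache, fake_download)   # triggers the fetch
    second = resolve_dataset_path(None, cache, fake_download)  # reads the cache
    try:
        resolve_dataset_path(os.path.join(tmp, "missing.pkl"), cache, fake_download)
        raised_for_missing = False
    except ValueError:
        raised_for_missing = True
```

Passing an explicit path skips the download entirely, which is the reason to supply it when the archive is already on disk.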
Dataset Builder¶
Provide a builder pattern for the creation of a dataset.
-
class
rfml.data.dataset_builder.
DatasetBuilder
(n: int = None, keys: Set[str] = None, defaults: Dict[str, Union[str, int, float]] = {})[source]¶ Builder pattern for programmatic creation of a Dataset
- Parameters
n (int, optional) – Length of the time window (number of samples) that each entry in the dataset should have. If it is not provided, then it is inferred from the first added example. Defaults to None.
keys (Set[str], optional) – A set of column headers that will be included as metadata for all examples. If it is not provided, then it is inferred from the first added example. Subsequent examples that are added must either have all of these keys provided as metadata or they must be defined in the defaults below. Defaults to None.
defaults (Dict[str, Union[str, int, float]], optional) – A mapping of default metadata values that will be included for each example if they aren’t overridden. Defaults to dict().
Examples
>>> import numpy as np
>>> from rfml.data.dataset_builder import DatasetBuilder
>>> iq = np.zeros((2, 1024))
>>> db = DatasetBuilder()
>>> db.add(iq, Modulation="BPSK")
>>> db.add(iq, Modulation="QPSK")
>>> dataset = db.build()
- Raises
ValueError – If both keys and defaults are provided but the defaults contain keys that are not in keys.
ValueError – If n is negative or 0.
See also
rfml.data.Dataset
-
add
(iq: numpy.ndarray, **kwargs) → rfml.data.dataset_builder.DatasetBuilder[source]¶ Add a new example to the Dataset that is being built.
- Parameters
iq (np.ndarray) – A (2xN) array of IQ samples.
**kwargs – Each key=value pair is included as metadata for this example.
- Returns
The builder itself (self), so that calls can be chained.
- Return type
DatasetBuilder
- Raises
ValueError – If the IQ data does not match the expected shape – It should be (2xN) where N has been provided during construction of this builder or inferred from the first example added.
ValueError – If all of the necessary metadata values are not provided in kwargs. The necessary metadata values are either provided during construction of this builder or inferred from the first example added.
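The validation and chaining rules above can be sketched with a small stand-in class. MiniBuilder is hypothetical and is not the rfml implementation: it uses nested lists instead of NumPy arrays, stores rows in a plain list, and the "IQ" column name is an assumption made for illustration.

```python
from typing import Any, Dict, List, Optional, Sequence, Set


class MiniBuilder:
    """Hypothetical stand-in for DatasetBuilder; not the rfml implementation."""

    def __init__(
        self,
        n: Optional[int] = None,
        keys: Optional[Set[str]] = None,
        defaults: Optional[Dict[str, Any]] = None,
    ) -> None:
        if n is not None and n <= 0:
            raise ValueError("n must be positive")
        self._n = n
        self._keys = keys
        self._defaults = dict(defaults or {})
        if keys is not None and not set(self._defaults) <= keys:
            raise ValueError("defaults contain keys that were not provided")
        self._rows: List[Dict[str, Any]] = []

    def add(self, iq: Sequence[Sequence[float]], **kwargs: Any) -> "MiniBuilder":
        # Infer the window length and metadata keys from the first example.
        if self._n is None:
            self._n = len(iq[0])
        if self._keys is None:
            self._keys = set(kwargs) | set(self._defaults)
        if len(iq) != 2 or any(len(row) != self._n for row in iq):
            raise ValueError("iq must have shape (2, {})".format(self._n))
        meta = {**self._defaults, **kwargs}  # explicit values override defaults
        missing = self._keys - set(meta)
        if missing:
            raise ValueError("missing metadata: {}".format(sorted(missing)))
        self._rows.append({"IQ": iq, **meta})
        return self  # returning self is what makes chaining possible

    def build(self) -> List[Dict[str, Any]]:
        return list(self._rows)


iq = [[0.0] * 8, [0.0] * 8]
rows = MiniBuilder().add(iq, Modulation="BPSK").add(iq, Modulation="QPSK").build()

try:
    MiniBuilder(n=8).add([[0.0] * 4, [0.0] * 4], Modulation="BPSK")
    shape_rejected = False
except ValueError:
    shape_rejected = True

try:
    MiniBuilder(keys={"Modulation", "SNR"}).add(iq, Modulation="BPSK")
    metadata_rejected = False
except ValueError:
    metadata_rejected = True
```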
Dataset¶
Wrap a premade dataset inside a Pandas DataFrame.
Provide a wrapper around a Pandas DataFrame for a premade dataset that splits the classes and other distinguishing factors evenly for training, testing, and validation sets. Additionally, this module facilitates data loading from file and transformation into the format needed by Keras and PyTorch.
By using Pandas masking functionality, this module can be used to subselect parts of a dataset (e.g. training only on examples with no frequency offset, or on a subset of modulations).
-
class
rfml.data.dataset.
Dataset
(df: pandas.core.frame.DataFrame)[source]¶ Provide a wrapper around a Pandas DataFrame containing a dataset.
- Parameters
df (pd.DataFrame) – Pandas DataFrame that represents the dataset
-
as_numpy
(le: rfml.data.encoder.Encoder, mask: pandas.core.generic.NDFrame.mask = None) → Tuple[numpy.ndarray, numpy.ndarray][source]¶ Encode the Dataset as a machine learning <X, Y> pair in NumPy format.
- Parameters
le (Encoder) – Label encoder used to translate the label column into a format the neural network will understand (such as an index). The label column is embedded within this class.
mask (pd.DataFrame.mask, optional) – Mask to apply before creating the Machine Learning pairs. Defaults to None.
- Returns
x, y
- Return type
Tuple[np.ndarray, np.ndarray]
The X matrix is returned in the format (batch, channel, iq, time). The Y matrix is returned in the format (batch).
Batch corresponds to the number of examples in the dataset, channel is always 1, IQ is always 2, and time is variable length depending on how the underlying data has been sliced.
Note
Numpy is the format used by Keras. Other machine learning frameworks (such as PyTorch) require a separate method for getting the data ready.
See also
rfml.data.Encoder, rfml.data.Dataset.as_torch
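The (batch, channel, iq, time) layout can be illustrated without any framework. The sketch below uses nested lists in place of NumPy arrays; nested_shape and the label_to_index mapping are hypothetical helpers for illustration, and the actual method returns np.ndarray objects.

```python
from typing import Tuple


def nested_shape(data) -> Tuple[int, ...]:
    """Return the shape of a regular (non-ragged) nested list."""
    shape = []
    while isinstance(data, list):
        shape.append(len(data))
        data = data[0]
    return tuple(shape)


# Three hypothetical examples, each a (2 x N) IQ window with N = 4.
examples = [[[0.0] * 4, [0.0] * 4] for _ in range(3)]
labels = ["BPSK", "QPSK", "BPSK"]
label_to_index = {"BPSK": 0, "QPSK": 1}  # the role an Encoder plays

# Insert a singleton channel axis: (batch, iq, time) -> (batch, channel, iq, time).
x = [[example] for example in examples]
y = [label_to_index[label] for label in labels]

x_shape = nested_shape(x)  # (3, 1, 2, 4)
y_shape = nested_shape(y)  # (3,)
```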
-
as_torch
(le: rfml.data.encoder.Encoder, mask: pandas.core.generic.NDFrame.mask = None) → torch.utils.data.dataset.TensorDataset[source]¶ Encode the Dataset as machine learning <X, Y> pairs in PyTorch format.
- Parameters
le (Encoder) – Label encoder used to translate the label column into a format the neural network will understand (such as an index). The label column is embedded within this class.
mask (pd.DataFrame.mask, optional) – Mask to apply before creating the Machine Learning pairs. Defaults to None.
- Returns
Dataset to be used in training or testing loops.
- Return type
TensorDataset
The X matrix is returned in the format (batch, channel, iq, time). The Y matrix is returned in the format (batch).
Batch corresponds to the number of examples in the dataset, channel is always 1, IQ is always 2, and time is variable length depending on how the underlying data has been sliced.
Note
TensorDataset is the format used by PyTorch and allows for iteration in batches. For other machine learning frameworks, such as Keras, ensure you call the correct method.
See also
rfml.data.Encoder, rfml.data.Dataset.as_numpy
-
property
columns
¶ Return a list of the columns that are represented in the underlying DataFrame.
- Returns
Column names
- Return type
List[str]
-
property
df
¶ Directly return the underlying Pandas DataFrame containing the data.
This can then be used for mask creation.
- Returns
Pandas DataFrame that represents the dataset
- Return type
pd.DataFrame
-
get_examples_per_class
(label: str = 'Modulation') → Dict[str, int][source]¶ Count the number of examples per class in this Dataset.
- Parameters
label (str, optional) – Column that is used as the class label. Defaults to “Modulation”.
- Returns
Count of examples (value) per label (key).
- Return type
Dict[str, int]
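The returned mapping is a per-label tally. A minimal sketch of that idea, assuming rows are plain dictionaries rather than a DataFrame (examples_per_class is a hypothetical free function, not the library method):

```python
from collections import Counter
from typing import Any, Dict, List


def examples_per_class(rows: List[Dict[str, Any]], label: str = "Modulation") -> Dict[str, int]:
    """Count how many rows carry each value of the ``label`` column."""
    return dict(Counter(row[label] for row in rows))


rows = [
    {"Modulation": "BPSK", "SNR": 0},
    {"Modulation": "QPSK", "SNR": 0},
    {"Modulation": "BPSK", "SNR": 10},
]
counts = examples_per_class(rows)  # {"BPSK": 2, "QPSK": 1}
```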
-
is_balanced
(label: str = 'Modulation', margin: int = 0) → bool[source]¶ Check if the data contained in this dataset is evenly represented by a categorical label.
- Parameters
label (str, optional) – The column of the data to verify is balanced. Defaults to “Modulation”.
margin (int, optional) – Difference between the expected balance and the true balance before this check would fail. This can be useful for checking for a “fuzzy balance” that would occur if the Dataset was previously split and therefore the length of the Dataset is no longer divisible by the number of categorical labels. Defaults to 0.
- Returns
True if the Dataset is balanced, False otherwise.
- Return type
bool
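One plausible reading of the margin parameter is sketched below: every class count must lie within margin of the mean count. This interpretation is an assumption; the library may compute the expected balance differently, and is_balanced here is a hypothetical free function operating on a precomputed count mapping.

```python
from typing import Dict


def is_balanced(counts: Dict[str, int], margin: int = 0) -> bool:
    """Check that each class count is within ``margin`` of the mean count."""
    expected = sum(counts.values()) / len(counts)
    return all(abs(count - expected) <= margin for count in counts.values())


exact = is_balanced({"BPSK": 100, "QPSK": 100})          # True
strict = is_balanced({"BPSK": 100, "QPSK": 99})          # False: off by 0.5 each
fuzzy = is_balanced({"BPSK": 100, "QPSK": 99}, margin=1)  # True
```

A non-zero margin is what allows a Dataset that was split earlier, and so is no longer evenly divisible by the number of classes, to still count as balanced.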
-
split
(frac: float = 0.3, on: Tuple[str] = None, mask: pandas.core.generic.NDFrame.mask = None) → Tuple[rfml.data.dataset.Dataset, rfml.data.dataset.Dataset][source]¶ Split this Dataset into two based on fractional availability.
- Parameters
frac (float, optional) – Fraction of the Dataset to put into the second set. Defaults to 0.3.
on (Tuple[str], optional) – Collection of column names, with categorical values, to evenly split amongst the two Datasets. If provided, each categorical value will have an equal percentage representation in the returned Dataset. Defaults to None.
mask (pd.DataFrame.mask, optional) – Mask to apply before performing the split. Defaults to None.
- Raises
ValueError – If frac is not strictly between 0 and 1.
- Returns
Two Datasets (such as train/validate)
- Return type
Warning
Not providing anything for the on parameter may lead to incorrect behavior. For instance, you may have a class imbalance in the datasets. This may be desired in some cases, but, its likely one would want to explicitly specify this and not rely on randomness.
See also
Dataset.subsample
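The effect of the on parameter can be sketched as a stratified split: rows are grouped by the listed columns and the same fraction is taken from every group. The function below is a hypothetical illustration over plain dictionaries, not the library's DataFrame-based implementation; the seed argument is an added assumption to make the example repeatable.

```python
import random
from collections import defaultdict
from typing import Any, Dict, List, Tuple

Row = Dict[str, Any]


def stratified_split(
    rows: List[Row],
    frac: float = 0.3,
    on: Tuple[str, ...] = ("Modulation",),
    seed: int = 0,
) -> Tuple[List[Row], List[Row]]:
    """Split rows in two, taking ``frac`` of every group into the second set."""
    if not 0.0 < frac < 1.0:
        raise ValueError("frac must be strictly between 0 and 1")
    groups: Dict[Tuple[Any, ...], List[Row]] = defaultdict(list)
    for row in rows:
        groups[tuple(row[column] for column in on)].append(row)
    rng = random.Random(seed)
    first: List[Row] = []
    second: List[Row] = []
    for members in groups.values():
        shuffled = members[:]
        rng.shuffle(shuffled)
        cut = round(len(shuffled) * frac)
        second.extend(shuffled[:cut])
        first.extend(shuffled[cut:])
    return first, second


rows = [{"Modulation": mod, "Index": i} for mod in ("BPSK", "QPSK") for i in range(10)]
train, test = stratified_split(rows, frac=0.3)
test_bpsk = sum(1 for row in test if row["Modulation"] == "BPSK")
test_qpsk = sum(1 for row in test if row["Modulation"] == "QPSK")
```

Because the fraction is applied per group, both modulations contribute the same number of rows to each side, which a purely random split does not guarantee.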
Encoder¶
Simple helper class for encoding/decoding the labels for classification
Note
While many packages like sklearn and keras provide similar functionality, they were all quite annoying and did not play well with others. Since this functionality is so simple, it’s easier to just write our own implementation.
-
class
rfml.data.encoder.
Encoder
(labels: Tuple[str], label_name: str)[source]¶ Encode each label as an integer index (the position of its “one-hot” entry), which is the format used by PyTorch.
- Parameters
labels (Tuple[str]) – A collection of human readable labels that could be encountered
label_name (str) – Name of the label column in the dataset that is being categorically encoded.
Examples
>>> "WBFM" -> 1
>>> "QAM16" -> 6
-
decode
(encoding: Tuple[int]) → Tuple[str][source]¶ Decode a list of machine readable labels into human readable labels.
- Parameters
encoding (Tuple[int]) – A collection of machine readable labels.
- Returns
A collection of human readable labels.
- Return type
Tuple[str]
-
encode
(labels: Tuple[str]) → Tuple[int][source]¶ Encode a list of human readable labels into machine readable labels.
- Parameters
labels (Tuple[str]) – Human readable labels to encode.
- Returns
A collection of machine readable labels.
- Return type
Tuple[int]
-
property
label_name
¶ The name of the column in the dataset that is categorically encoded by this class.
-
property
labels
¶ A collection of human readable labels that could be encountered – This allows the extraction of these labels by another object in order to plot or log.
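The encode and decode round trip can be sketched with a small stand-in. MiniEncoder is hypothetical and assumes that each label's index is simply its position in the labels tuple; the actual ordering used by the library is not stated here.

```python
from typing import Dict, Tuple


class MiniEncoder:
    """Hypothetical stand-in for Encoder; index = position in ``labels``."""

    def __init__(self, labels: Tuple[str, ...], label_name: str) -> None:
        self._labels = tuple(labels)
        self._label_name = label_name
        self._index: Dict[str, int] = {label: i for i, label in enumerate(self._labels)}

    @property
    def labels(self) -> Tuple[str, ...]:
        return self._labels

    @property
    def label_name(self) -> str:
        return self._label_name

    def encode(self, labels: Tuple[str, ...]) -> Tuple[int, ...]:
        # Human readable labels -> machine readable indices.
        return tuple(self._index[label] for label in labels)

    def decode(self, encoding: Tuple[int, ...]) -> Tuple[str, ...]:
        # Machine readable indices -> human readable labels.
        return tuple(self._labels[i] for i in encoding)


encoder = MiniEncoder(labels=("BPSK", "QPSK", "WBFM"), label_name="Modulation")
encoded = encoder.encode(("WBFM", "BPSK"))  # (2, 0)
decoded = encoder.decode(encoded)           # ("WBFM", "BPSK")
```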
Factory¶
Simplistic factory pattern for swapping of datasets.
-
rfml.data.factory.
build_dataset
(dataset_name: str, test_pct: float = 0.3, val_pct: float = 0.05, path: str = None) → Tuple[rfml.data.dataset.Dataset, rfml.data.dataset.Dataset, rfml.data.dataset.Dataset, rfml.data.encoder.Encoder][source]¶ Opinionated factory method that allows easy loading of different datasets.
This method makes an assumption about the labels to use for each dataset – if you need more extensive control then you can call the underlying method directly.
- Parameters
dataset_name (str) – Name of the dataset to load. Currently supported values are RML2016.10A and RML2016.10B.
test_pct (float, optional) – Fraction of the entire Dataset that should be withheld as a test set. Defaults to 0.3.
val_pct (float, optional) – Fraction of the non-testing Dataset that should be split out to use for validation in an early stopping procedure. Defaults to 0.05.
path (str, optional) – If provided, this is directly passed to the dataset converters so that they do not download the dataset from the internet (a costly operation) if you have already downloaded it yourself. Defaults to None.
- Raises
ValueError – If test_pct or val_pct are not between 0 and 1 (non-inclusive).
ValueError – If the dataset_name is unknown.
- Returns
train, validation, test, encoder
- Return type
Tuple[Dataset, Dataset, Dataset, Encoder]
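Because val_pct applies to what remains after the test set is withheld, the two values compose rather than add. The arithmetic can be sketched as follows; split_sizes is a hypothetical helper, the total of 1000 examples is made up, and the rounding is an assumption (the library may round differently).

```python
from typing import Tuple


def split_sizes(total: int, test_pct: float = 0.3, val_pct: float = 0.05) -> Tuple[int, int, int]:
    """Return (train, validation, test) sizes for the documented defaults."""
    if not 0.0 < test_pct < 1.0 or not 0.0 < val_pct < 1.0:
        raise ValueError("test_pct and val_pct must be strictly between 0 and 1")
    n_test = round(total * test_pct)            # taken from the whole dataset
    n_val = round((total - n_test) * val_pct)   # taken from the non-test remainder
    n_train = total - n_test - n_val
    return n_train, n_val, n_test


sizes = split_sizes(1000)  # (665, 35, 300)
```

With the defaults, validation ends up as 3.5% of the whole dataset (5% of the remaining 70%), leaving 66.5% for training.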