Tutorial

In this tutorial, we will walk through a simple example of how pysmo can be used in a project. This will provide insight into:

How to define a new seismogram class that is tailored to a specific use case (and why you would want to do that).
How to use this new class together with pysmo.
Writing new code that is potentially reusable by other projects or classes.

Custom seismogram class

The purpose of a class in Python is generally to store state (i.e. data). As the exact type of data can vary between use cases, it is often useful to define a new class that is tailored to the specific scenario.

We will assume that our use case here is to store seismogram data that is to be used in an ambient noise cross-correlation project. In this context, we don't need to store event information in our seismograms. In fact, we are probably more interested in knowing whether or not earthquake signals are present in the seismogram data. We would therefore want our class to have an attribute to store this particular piece of information. A simple class definition could look like this:

noise_seismogram.py

from dataclasses import dataclass, field
from datetime import datetime, timedelta
import numpy as np


@dataclass  # (1)!
class NoiseSeismogram:
    begin_time: datetime  # (2)!
    delta: timedelta = timedelta(seconds=0.01)  # (3)!
    data: np.ndarray = field(default_factory=lambda: np.array([]))  # (4)!

    def __len__(self) -> int:  # (5)!
        return len(self.data)

    @property
    def end_time(self) -> datetime:  # (6)!
        if len(self) == 0:
            return self.begin_time
        return self.begin_time + self.delta * (len(self) - 1)

    contains_earthquake: bool = False  # (7)!

dataclass is a decorator that automatically generates special methods for the class, such as __init__, __repr__, and __eq__, based on the class attributes. This makes it easier to create classes that are primarily used to store data.
Instance attributes are defined simply by declaring them in the class body with type annotations.
Attributes can have default values too.
Care must be given if default values are mutable types (like lists or dictionaries). In such cases, you need to use field(default_factory=...) to ensure that each instance of the class gets its own separate copy of the mutable object.
A __len__ method is a special method that lets us use the built-in len() function, defined here as the number of samples in the data array.
We add a read-only property end_time that computes the end time of the seismogram based on the start time, number of samples, and sampling rate.
Finally, we add an attribute that lets us know if our seismogram contains earthquake signals or not.

A real world example would likely have more attributes (such as station coordinates), but to keep things simple we will stick to this minimal example.

We are now ready to use this new class in our project. To test this run an interactive Python session and create an instance of the class:

$ python -i noise_seismogram.py
>>> begin_time = datetime(2023, 1, 1, 0, 0, 0)
>>> data = np.random.randn(1000)  # Simulated noise data
>>> noise_seis = NoiseSeismogram(begin_time=begin_time, data=data)
>>>

Key observations

Creating a purpose-built class whose primary purpose is to store data is simple and straightforward with the dataclass decorator.
With the exception of the __len__ method (which allows using len on NoiseSeismogram instances), we have not defined any methods in our class.
All attributes of the class are non-optional. If they were, we would have typed them with e.g. bool | None instead of just bool. While this is not strictly a requirement, it does avoid having to deal with the possibility of multiple instances of the class having different amounts of data (e.g. some instances having information in the contains_earthquake attribute and some not). This simplifies things later on, as we don't need to include checks for the presence of data in our functions.

Functions that operate on the new class

Now that we have our new class to store seismogram data, we likely want to do some form of processing on the data. In this example, we will write two functions:

check_for_earthquakes(): A function that checks if earthquake signals are present in the seismogram data.
detrend(): A function that detrends the seismogram data.

A first version of these functions could look like this:

functions_v1.py

from noise_seismogram import NoiseSeismogram
import scipy


def check_for_earthquakes(seismogram: NoiseSeismogram) -> None:
    if seismogram.contains_earthquake is True:
        print("Seismogram contains an earthquake.")
    elif seismogram.contains_earthquake is False:
        print("Seismogram does not contain an earthquake.")
    else:
        print("Seismogram earthquake status is unknown.")


def detrend(seismogram: NoiseSeismogram) -> None:
    seismogram.data = scipy.signal.detrend(seismogram.data)

These functions are already pretty good. They are easy to understand and are properly annotated with type hints. We can verify this by checking the file with mypy:

$ python -m mypy functions_v1.py
Success: no issues found in 1 source file

Checking code with mypy is a great way to prevent errors from sneaking into your codebase. Most modern code editors have the same type of checking built in, so it is really easy to make use of type hints in this way. If we really commit to checking all our code while writing it, we can simplify the functions by stripping out the bits that are unreachable. Running mypy again, but with an extra flag we can see where this is the case:

$ python -m mypy --warn-unreachable functions_v1.py
functions_v1.py:11: error: Statement is unreachable  [unreachable]
Found 1 error in 1 file (checked 1 source file)

We get this error because the contains_earthquake attribute is non-optional, so it can only be True or False. This means that the else branch of the function (highlighted in the code block above) is never used and can be removed:

functions_v2.py

from noise_seismogram import NoiseSeismogram
import scipy


def check_for_earthquakes(seismogram: NoiseSeismogram) -> None:
    if seismogram.contains_earthquake is True:
        print("Seismogram contains an earthquake.")
    else:  # (1)!
        print("Seismogram does not contain an earthquake.")


def detrend(seismogram: NoiseSeismogram) -> None:
    seismogram.data = scipy.signal.detrend(seismogram.data)

At this point we know that seismogram.contains_earthquake can only be False, so we don't need an elif check anymore.

Mypy is also happy with this version of the code:

$ python -m mypy --warn-unreachable functions_v2.py
Success: no issues found in 1 source file

Key observations

Writing functions for our bespoke class is also pretty straightforward.
Because we have correctly annotated both our class and our functions with type hints, we can easily check our code for errors with mypy.
Our purpose-built class avoids optional attributes, which allows us to avoid redundant checks in our functions. This makes our code simpler and likely easier to read.
Note that we are relying on checking our code before running it! If you want checking at runtime, you may need to look into using a library like beartype or pydantic.

Reusing functions in other contexts

Comparing the two functions, check_for_earthquakes() and detrend(), we notice that only the former relies on the contains_earthquake attribute of our class. Recall that this attribute was the only one introduced for the specific use case of ambient noise cross-correlation. This means that the remaining attributes form a sort of "baseline" seismogram, that we are likely to encounter in other use cases as well. Put another way, we ought to be able to reuse the detrend() function in other contexts.

To illustrate this, let's consider a different project, where we (for some reason) want to store the season together with seismogram data. Again we write a bespoke class for this:

season_seismogram.py

from dataclasses import dataclass, field
from datetime import datetime, timedelta
from enum import StrEnum
import numpy as np


class Season(StrEnum):  # (1)!
    SPRING = "spring"
    SUMMER = "summer"
    AUTUMN = "autumn"
    WINTER = "winter"


@dataclass
class SeasonSeismogram:
    begin_time: datetime
    delta: timedelta = timedelta(seconds=0.01)
    data: np.ndarray = field(default_factory=lambda: np.array([]))

    def __len__(self) -> int:
        return len(self.data)

    @property
    def end_time(self) -> datetime:
        if len(self) == 0:
            return self.begin_time
        return self.begin_time + self.delta * (len(self) - 1)

    season: Season = Season.SUMMER  # (2)!

StrEnum are a great way to limit the values strings can take.
Much like with NoiseSeismogram, we have just one project specific attribute (season).

Next we write a script that uses this new class together with the detrend() function from earlier:

season_detrend_v1.py

from season_seismogram import SeasonSeismogram, Season
from functions_v2 import detrend
from datetime import datetime
import numpy as np


def main() -> None:
    # Create a sample SeasonSeismogram instance with random data
    begin_time = datetime(2023, 1, 1, 0, 0, 0)
    data = np.random.randn(1000)
    season_seismogram = SeasonSeismogram(
        begin_time=begin_time, data=data, season=Season.WINTER
    )

    # Use the season_seismogram with the detrend function
    detrend(season_seismogram)


if __name__ == "__main__":
    main()

This script will actually run without a hitch:

$ python season_detrend_v1.py && echo "success\!"
success!

However, because we have a mismatch in our type annotations (detrend() expects a NoiseSeismogram but we are passing it a SeasonSeismogram), mypy will complain:

$ python -m mypy season_detrend_v1.py
season_detrend_v1.py:16: error: Argument 1 to "detrend" has incompatible type "SeasonSeismogram"; expected "NoiseSeismogram"  [arg-type]

The reason for this discrepancy is that type annotations prevent us from using non-existent attributes in our functions, but that doesn't mean we have to use all available attributes. In this case, the detrend() function only relies on the data attribute, which happens to exist in both NoiseSeismogram and SeasonSeismogram. In other words, we just got lucky this time.

To fix this, we need to amend the type annotations of the detrend() function:

functions_v3.py

from noise_seismogram import NoiseSeismogram
from season_seismogram import SeasonSeismogram  # (1)!
import scipy


def check_for_earthquakes(seismogram: NoiseSeismogram) -> None:
    if seismogram.contains_earthquake is True:
        print("Seismogram contains an earthquake.")
    else:
        print("Seismogram does not contain an earthquake.")


def detrend(seismogram: NoiseSeismogram | SeasonSeismogram) -> None:
    seismogram.data = scipy.signal.detrend(seismogram.data)

We need to import SeasonSeismogram to be able to use it in our type annotations.

With these changes in place, mypy is happy again:

$ python -m mypy season_detrend_v2.py
Success: no issues found in 1 source file

Key observations

We have successfully reused the detrend() function in a different context.
However, it did require changing the type annotations of the function.
While the changes were small, making them every time we want to reuse the function is cumbersome.
The check_for_earthquakes() function is not reusable at all, as it relies on the contains_earthquake attribute that only exists in NoiseSeismogram. Thus we can identify two types of functions: those that are reusable and those that are not. This is also reflected in their respective type annotations.

Introducing pysmo

If we are sold on the idea of writing custom seismogram classes for particular use cases, we end up with a bit of a maintenance nightmare if we have a lot of "shared" type functions. As shown by the detrend() function, whenever we want existing functions to work smoothly with a new class, we need to use it to annotate the function. Doing this also introduces a dependency between the functions and classes. It is therefore conceivable that sometime in the future one of the classes may change in a way that breaks functions.

This is not a new problem in programming, and is typically solved by defining interfaces that sit between the functions and the classes (or more generally between different parts of code). This means that we can write a function to be compatible with the interface rather than multiple different (seismogram) classes. Of course, the classes need to be written in a way that conforms to the interface too.

Pysmo provides such an interface for seismogram (and other) classes. These interfaces make use of Python's Protocol class. Below is the actual implementation of pysmo's Seismogram interface:

@runtime_checkable
class Seismogram(Protocol):
    """Protocol class to define the `Seismogram` type.

    Examples:
        Usage for a function that takes a Seismogram compatible class instance as
        argument and returns the begin time in isoformat:

        ```python
        >>> from pysmo import Seismogram
        >>> from pysmo.classes import SAC  # SAC is a class that "speaks" Seismogram
        >>>
        >>> def example_function(seis_in: Seismogram) -> str:
        ...     return seis_in.begin_time.isoformat()
        ...
        >>> sac = SAC.from_file("example.sac")
        >>> seismogram = sac.seismogram
        >>> example_function(seismogram)
        '2005-03-01T07:23:02.160000+00:00'
        >>>
        ```
    """

    begin_time: datetime
    """Seismogram begin time."""

    data: np.ndarray
    """Seismogram data."""

    delta: timedelta
    """The sampling interval.

    Should be a positive `timedelta` instance.
    """

    def __len__(self) -> int:
        """The length of the Seismogram.

        Returns:
            Number of samples in the data array.
        """
        return len(self.data)

    @property
    def end_time(self) -> datetime:
        """Seismogram end time."""
        if len(self) == 0:
            return self.begin_time
        return self.begin_time + self.delta * (len(self) - 1)

If you strip away the extra docstrings, you will notice that this looks remarkably similar to the common bits of the NoiseSeismogram and SeasonSeismogram classes we defined above. Important to mention at this point is that the __len__ method and end_time property do not really need to be implemented here. Typically the __len__ method would look more like this:

def __len__(self) -> int: ...

That is because the purpose of Seismogram is mainly to provide type information - it is actually impossible to create an instance from a Protocol class directly. However, it is possible to inherit from them, so we could rewrite the SeasonSeismogram as follows:

season_seismogram_short.py

from season_seismogram import Season
from pysmo import Seismogram
from dataclasses import dataclass, field
from datetime import datetime, timedelta
import numpy as np


@dataclass
class SeasonSeismogram(Seismogram):  # (1)!
    begin_time: datetime
    delta: timedelta = timedelta(seconds=0.01)
    data: np.ndarray = field(default_factory=lambda: np.array([]))
    season: Season = Season.SUMMER

__len__ and end_time are inherited from Seismogram, so we don't need to write an implementation anymore.

The NoiseSeismogram class could be shortened in the same manner.

Note

Python Protocol classes are used almost exclusively in type annotations. We will therefore refer to the ones shipped with pysmo as types rather than protocols or interfaces.

Python considers classes that have the same structure as a protocol class to be subclasses of the protocol class. For our particular case, this means that instances of NoiseSeismogram and SeasonSeismogram are also instances of Seismogram.

Using pysmo types to annotate our functions, even if they are to be used with loads of different seismogram classes, is now a breeze:

functions_v4.py

from noise_seismogram import NoiseSeismogram
from pysmo import Seismogram  # (1)!
import scipy


def check_for_earthquakes(seismogram: NoiseSeismogram) -> None:
    if seismogram.contains_earthquake is True:
        print("Seismogram contains an earthquake.")
    else:
        print("Seismogram does not contain an earthquake.")


def detrend(seismogram: Seismogram) -> None:  # (2)!
    seismogram.data = scipy.signal.detrend(seismogram.data)

Instead of importing SeasonSeismogram, we now import Seismogram.
We no longer need to list all the seismogram classes we want to use in the detrend function. We simply specify Seismogram and are done.

Key observations

The detrend() function now uses pysmo types in its annotations.
Because NoiseSeismogram and SeasonSeismogram are subclasses of Seismogram, type checkers will accept instances of NoiseSeismogram and SeasonSeismogram as valid inputs for detrend().
Future custom seismogram classes will also be accepted as inputs without any changes to the detrend() function, provided the structure prescribed by the Seismogram type is adhered to.
The check_for_earthquakes() is still annotated with NoiseSeismogram, because it uses the contains_earthquake attribute that doesn't exist in the Seismogram type.

Conclusion

This tutorial doesn't actually show much "pysmo code". Instead it introduces core concepts of pysmo. These are:

Pysmo is not centred around a single seismogram class that is used as the core of the library. Often these kinds of single purpose classes restrict users to the use cases envisioned by the library authors, rather than being well suited for new applications.
Custom seismogram classes are suited for new applications. However, they introduce challenges when trying to write reusable code.
Pysmo provides a solution for this by focusing on what the different custom classes have in common, and defining an interface that can be targeted when writing code.
These interfaces are referred to as pysmo types. One of their most defining characteristics is that they are very simple (very few attributes, and methods are almost entirely avoided).

Pysmo itself is guided by these same principles. Thus, modules packaged with pysmo are easily reusable in other projects.