Introduction#
Rationale#
Anyone who has taken a course on programming (in any language) will most likely first have been taught about the different data types available. These typically consist of simple types like integers, floats, strings, as well as the more complicated arrays, dictionaries, etc. These types are typically very clearly defined and intuitive to use. For example, we don't need to think very hard whether or not it makes sense to pass the string "hello world" as input to a program that calculates the square root of numbers. That simply doesn't make much sense, and not just within a piece of code, but also when using pen and paper.
As one moves from general-purpose programming towards more specialised problem solving (e.g. within a particular scientific field), the expected input (and output) for a piece of code is likely to become more complex. We may therefore quite easily reach a point where we no longer have that intuitive feeling about the data we are working with. In seismology this effect may be compounded by the fact that different file formats (and thus definitions) for seismograms exist.
Pysmo aims to mitigate this by defining types that make sense for seismological problems the same way the built-in types make sense for general-purpose programming. This is implemented by differentiating between how data are stored, and how they are processed. As a matter of fact, the focus is pretty much exclusively on how they are processed. The two main reasons for this are:
- The vast majority of code we write deals with processing data, not storing.
- When we are processing data, we are typically thinking of data more in terms of what they represent in the physical world. This is less so when storing data, as we are then less interested in capturing only the essence of e.g. a seismogram, but rather want to add as much useful metadata as possible, combine different data sources into one large object, etc.
A Software Contract#
When we write software, we often need to pass information around between
different parts of our code. To do this, the different parts need to be in
an agreement about the type of data they are going to exchange. One could say
that a contract exists that needs to be followed for our code to work properly.
In Python we can use type hints to describe such a contract. In the simplest
case, the information that is being exchanged consists of a built-in type such
as a float:
def times_two(input_variable: float) -> float:
    """Returns input variable multiplied by 2."""
    return input_variable * 2
The above function makes it quite clear what the contract for its usage is: it
expects a float as input, and it returns a float as output. If we try to call
this function in other parts of our code using e.g. a str as input, our code
editor will notify us that we made a mistake before we even run our program!
The same is true for the function output - if we use the output as if it were a
str, we would again get an error message from our code editor.
Things get more challenging when the information passed around is not just a built-in type, but instead a piece of code itself. To illustrate this, let's imagine a hypothetical function that is used to sign up a new user for our service, and sends them an email for confirmation. The function does not itself send the email, but instead uses a 3rd party module passed as an argument to the function. Our code may look something like this:
from emailprovider import EmailProvider


def signup(username: str, email_address: str, emailer: EmailProvider) -> None:
    """Send a welcome email to a new user using 3rd party email service."""
    formatted_message = f"Hello {username}, welcome to the pysmo service!"
    emailer.send_email(email_address, formatted_message)
The contract here, again expressed using type hints (and by using the
send_email method of the EmailProvider class...), is between our signup
function and the 3rd party module. The contract is also quite specifically with
this particular email service provider. This means our function is strongly
coupled with this 3rd party module. An easy way to tell this would be to remove
the first line from the above code - your code editor would straight away
complain that EmailProvider is not defined. Should the email provider go out
of business, or even just change their API (i.e. how their module works), our
code would break. This is likely an easy fix if we use this email provider in
just one place, but if we have multiple functions that make use of this module,
they will all need to be refactored too. This would potentially be a lot of
work, which can be avoided by writing the contract outside of the function(s).
This contract would only need to be written once. If you remember the
first-steps section, you'll likely guess that this way of defining the contract
is by using Protocol classes:
from typing import Protocol


class EmailSender(Protocol):
    def send_email(self, email_address: str, message: str) -> None: ...


def signup(username: str, email_address: str, emailer: EmailSender) -> None:
    """Send a welcome email to a new user using 3rd party email service."""
    formatted_message = f"Hello {username}, welcome to the pysmo service!"
    emailer.send_email(email_address, formatted_message)
Now there exists a contract between the signup function and the EmailSender
class. Any third party email provider we might want to use will need to fulfill
a contract with the EmailSender class instead of the signup function. Here,
the EmailSender class doesn't do anything by itself. Instead it describes
what a generic "class that can send emails" should look like in order to work
with the signup function. The EmailSender class is only used by the type
checker in your code editor, so it cannot be used in parts of the code that
actually do things (i.e. you cannot create an instance of the EmailSender
class). Instead you use a generic class that matches what EmailSender
prescribes to actually perform the desired send_email action:
from emailprovider import EmailProvider #(1)!
from emailprovider2 import EmailProvider2 #(2)!
from signup import signup as signup1
from signup2 import signup as signup2
emailer1 = EmailProvider()
emailer2 = EmailProvider2()
new_user = "joe"
new_user_email = "[email protected]"
signup1(new_user, new_user_email, emailer1) #(3)!
signup1(new_user, new_user_email, emailer2) #(4)!
signup2(new_user, new_user_email, emailer1) #(5)!
signup2(new_user, new_user_email, emailer2) #(6)!
- This is the email provider that is also used in signup.py.
- This is another 3rd party email provider. We assume it also has a send_email method.
- Our type checker will be happy with this - all contracts are fulfilled.
- Type checker will complain - signup1 expects EmailProvider, but got EmailProvider2.
- Type checker is happy with this - signup2 doesn't expect any particular email provider class, but whichever one is provided must have a send_email method.
- Type checker is happy with this - EmailProvider2 has a send_email method.
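The example above relies on the emailprovider and emailprovider2 modules, which are placeholders and cannot be imported. As a self-contained sketch of the same idea, the two classes below (ProviderA and ProviderB, both hypothetical stand-ins that merely record messages instead of sending them, and the example.com address) are unrelated to each other and do not inherit from EmailSender. Both can still be passed to signup, because each has a matching send_email method. The last few lines also show that the protocol class itself cannot be instantiated.

```python
from typing import Protocol


class EmailSender(Protocol):
    def send_email(self, email_address: str, message: str) -> None: ...


class ProviderA:
    """Hypothetical provider: stores messages in a list instead of sending."""

    def __init__(self) -> None:
        self.outbox: list[tuple[str, str]] = []

    def send_email(self, email_address: str, message: str) -> None:
        self.outbox.append((email_address, message))


class ProviderB:
    """A second, unrelated hypothetical provider with the same method."""

    def __init__(self) -> None:
        self.sent: dict[str, str] = {}

    def send_email(self, email_address: str, message: str) -> None:
        self.sent[email_address] = message


def signup(username: str, email_address: str, emailer: EmailSender) -> None:
    """Send a welcome email using any object that matches EmailSender."""
    formatted_message = f"Hello {username}, welcome to the pysmo service!"
    emailer.send_email(email_address, formatted_message)


provider_a = ProviderA()
provider_b = ProviderB()

# Both calls satisfy the contract described by EmailSender.
signup("joe", "joe@example.com", provider_a)
signup("joe", "joe@example.com", provider_b)

# The protocol only describes an interface; creating an instance fails.
protocol_error = ""
try:
    EmailSender()  # type: ignore
except TypeError as error:
    protocol_error = str(error)

print(provider_a.outbox)
print(provider_b.sent)
print(protocol_error)
```

Swapping one provider for the other requires no change to signup, which is the decoupling the protocol is meant to provide.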
To summarise, we can describe Python's typing system as contracts with the following two important properties:
- They are only used by type checkers that scan code rather than executing it. They are ignored at runtime.
- We can define intermediary contracts with Protocol classes. These serve as interfaces that describe how information is exchanged (hence the name "protocol").
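The first property can be checked directly. The following minimal sketch reuses times_two from above and calls it with a str. A type checker flags that call, but the interpreter runs it without complaint, because the annotations are not enforced at runtime. The type: ignore comment is only there to silence the checker for this demonstration.

```python
def times_two(input_variable: float) -> float:
    """Returns input variable multiplied by 2."""
    return input_variable * 2


# Fulfils the contract: a float goes in, a float comes out.
doubled = times_two(1.5)

# Breaks the contract: a type checker reports an incompatible argument here,
# yet the call still runs, because str also supports multiplication by 2.
repeated = times_two("ab")  # type: ignore

print(doubled)
print(repeated)
```

The second call returns a str even though the signature promises a float, which is exactly the kind of mistake a type checker catches before the program is run.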
Tip
If you are still not sure what protocols are (or why they are so useful), consider how ubiquitous email is despite the existence of a myriad of alternatives for communicating in today's digital world. The likely reason for this success is interoperability: we can freely email each other using different work or private email addresses, and we can do so using various email clients or webmail. This is only possible because the way emails are sent and received is prescribed by protocols. These protocols define a common standard that allows using email everywhere, without needing to be concerned about things such as differences between the inner workings of different types of email servers.
What does pysmo do?#
What we haven't discussed so far is how Protocol classes allow us to be very
specific about what we need (and also don't need) for our code to run. Consider
the two email providers we used in the example above: perhaps EmailProvider
can only send emails, while EmailProvider2 is also able to receive emails using
a receive_email method. However, as this method is not specified in our
EmailSender class, we can safely ignore this difference and use both email
providers interchangeably. Should we at some point in the future need to
receive emails, we would just create an EmailReceiver protocol for that
purpose. We would then have two smaller contracts with EmailProvider2 instead
of one big one.
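As a rough sketch of that idea, the two small contracts could look like the code below. The EmailReceiver protocol, the receive_email signature and both provider classes are assumptions made up for illustration. The runtime_checkable decorator is added only so that the structural match can be shown with isinstance; it checks that the methods exist, not that their signatures agree.

```python
from typing import Protocol, runtime_checkable


@runtime_checkable
class EmailSender(Protocol):
    def send_email(self, email_address: str, message: str) -> None: ...


@runtime_checkable
class EmailReceiver(Protocol):
    def receive_email(self) -> list[str]: ...


class SendOnlyProvider:
    """Hypothetical provider that can only send."""

    def send_email(self, email_address: str, message: str) -> None:
        print(f"Sending to {email_address}: {message}")


class SendReceiveProvider:
    """Hypothetical provider that can send and receive."""

    def __init__(self) -> None:
        self.inbox: list[str] = ["Welcome!"]

    def send_email(self, email_address: str, message: str) -> None:
        print(f"Sending to {email_address}: {message}")

    def receive_email(self) -> list[str]:
        return list(self.inbox)


def count_messages(receiver: EmailReceiver) -> int:
    """Depends only on the receiving contract, not on a provider class."""
    return len(receiver.receive_email())


send_only = SendOnlyProvider()
send_receive = SendReceiveProvider()

print(isinstance(send_only, EmailSender), isinstance(send_only, EmailReceiver))
print(isinstance(send_receive, EmailSender), isinstance(send_receive, EmailReceiver))
print(count_messages(send_receive))
```

Code that only sends emails keeps depending on EmailSender alone, so adding the second protocol does not affect it.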
In seismology, a seismogram is typically a time series with some metadata attached to it. What these metadata are often varies between different file types or use cases. Consider for example the SAC file format: it contains over 150 different header fields, of which only six are required (granted, some of those 150 are unused). This probably makes a lot of sense for storing data, or using SAC files with the SAC application. However, it also means we can never be certain about exactly what information is contained in a given SAC file. This is not dissimilar to the example with two different email providers. There, we dealt with it by defining ourselves what an email sender should look like.
In pysmo we do exactly the same thing, but for types relevant to seismology. Pysmo forms the contract between areas where we want to add lots of detail to our data (e.g. when storing data, or within an application) and general-purpose processing of data where we typically don't care about metadata. For example, one of the most basic operations is applying a filter to a seismogram. Rather than writing an implementation for it over and over again for different types of seismograms, we can define a common interface for them and then use the same implementation all the time.
At its core, pysmo is simply a set of contracts that allow writing code for a very simple and narrow definition of e.g. a seismogram, and then re-using that same code for a seismogram that is highly specific in its implementation.
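To make this more concrete, here is a minimal sketch of the idea. It does not use pysmo's actual classes: MiniSeismogram, its data and delta attributes, the two storage classes and the remove_mean function are all hypothetical names chosen for illustration. The processing function is written once against the narrow contract and then works for both a bare and a metadata-rich seismogram class.

```python
from dataclasses import dataclass, field
from statistics import fmean
from typing import Protocol


class MiniSeismogram(Protocol):
    """Hypothetical narrow contract: samples plus sampling interval."""

    data: list[float]
    delta: float


@dataclass
class BareSeismogram:
    """Stores only what the contract asks for."""

    data: list[float]
    delta: float


@dataclass
class RichSeismogram:
    """Stores extra metadata that the processing code never looks at."""

    data: list[float]
    delta: float
    station: str = "UNKNOWN"
    headers: dict[str, str] = field(default_factory=dict)


def remove_mean(seismogram: MiniSeismogram) -> None:
    """Subtract the mean from the samples, in place."""
    mean = fmean(seismogram.data)
    seismogram.data = [sample - mean for sample in seismogram.data]


bare = BareSeismogram(data=[1.0, 2.0, 3.0], delta=0.5)
rich = RichSeismogram(data=[2.0, 4.0, 6.0], delta=0.01, station="ABC")

# The same implementation is reused for both storage types.
remove_mean(bare)
remove_mean(rich)

print(bare.data)
print(rich.data, rich.station)
```

Neither storage class inherits from MiniSeismogram; they match it simply by having the attributes it describes, and the extra metadata in RichSeismogram is left untouched.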
Tip
Above we used the terms contract, type and protocol somewhat interchangeably. Hopefully it is clear that pieces of software don't actually sign contracts between them; we just used the term to express how they depend on each other and are thus coupled. That leaves us with types and (protocol) classes. Let's explore their relationship using the built-in float type as an example:
>>> a = 1.2 #(1)!
>>> type(a) #(2)!
<class 'float'>
>>> type(float) #(3)!
<class 'type'>
>>>
- We first assign a float to the variable a.
- Then we verify it is indeed a float using the type command.
- The type of the float class is...
Remember, in Python everything is an object. So in the above snippet we
created an object called a of the float class (objects are instances of a
class). Where it gets interesting is when we query what type our variable a
is using the type command; instead of returning simply "float", the Python
interpreter tells us the type of a is <class 'float'>. In other words, the
float class is itself a type (which we verify in the last line). Simply put
then, every time we define a class in Python, we also define a type.
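The same holds for classes we define ourselves, which can be checked with a short sketch (the Station class is a made-up example):

```python
class Station:
    """Made-up example class."""

    def __init__(self, name: str) -> None:
        self.name = name


station = Station("ABC")

# The type of the object is the class we just defined.
object_type_is_class = type(station) is Station

# The class itself is an instance of type, just like float.
class_is_a_type = type(Station) is type

print(object_type_is_class)
print(class_is_a_type)
```

Both checks evaluate to True, so Station can be used in type hints in the same way as float or str.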
Use Cases#
There is actually no specific use case for pysmo; the main purpose of pysmo is to serve as a library when writing new code, which means when and how to use pysmo is essentially up to you to decide. To help with this, pysmo has exactly two priorities:
- Focus on making the coding experience as intuitive and pleasant as possible (proper typing, autocompletion, etc.).
- Ensure code can be easily reused.
The first priority is probably fairly obvious. As for the second, consider for example two applications that store data in different ways internally (i.e. they define their own types), but share some common steps in their processing flows. If you ensure the application-specific types match pysmo types, you only need to write the common steps once for both applications. If you've been using pysmo for a while already, you may even find you unwittingly already wrote a function that does just what you need (or someone else may have written it)!
Note
If an application or framework that solves your problem already exists, there is little reason to use pysmo (though you might find it useful to use existing frameworks in combination with pysmo). Note also that from pysmo's perspective, there is little difference between an application and a framework - they both typically use their own particular structures for storing and processing data.