mht.wtf

Computer science, programming, and whatnot.

How Much Abstraction Is Too Much?

June 21, 2017 back to posts

Let's talk about abstraction.As we know from RFC 1925 it is easier to move a problem around than it is to solve it. This directly suggests Abstraction Based Development. It goes like this:

  1. Have a problem
  2. Solve 5% of the problem
  3. Invent an abstraction to solve the remaining 95%
  4. Recucrse on the abstraction

After only 91 steps (more or less) we have reduced the problem to only 1% of its original size, making the problem trivial to solve, since it can be solved with a one-liner in Python (this is the unit of problem hardness in CS1).

On a more serious note, abstractions are useful. Abstractions are everywhere. Dynamically-sized arrays are an abstraction. Iterators are an abstraction. Abstractions are all about hiding the stuff we do not care about, and reduce it to the stuff we do care about. We don't really care about the fact that arrays are of fixed size in memory, and that we have to resize it when we need more, we just want to push stuff onto our Vec. It is easy to infer from this that abstractions are about simplification: we do not want the details, but only the big picture.

What often follows from abstraction is indirection. Dynamic dispatch. Compile time generics. The magic stuff the compiler does for us when we only kinda say what we want. Calling iterator.next()? Ah compiler, you understand what I mean. But to someone reading the code it is not always obvious what happens. In the conceptual sense, we understand that the iterator will produce the next value in the set of values it is iterating over. That part is alright. But what exactly is happening? Where is that code? Is this operation cache friendly? Is it computaitonally complex? How confident can we be in that the implementation is correct? What can we do if we suspect something is wrong? These questions are not always simple to answer2.

I will try to argue that abstractions have a very real and very serious downside, that (seemingly) is often overlooked: complexity.

But first of all, the code in this post is real code written by real people. This should go without saying, but just to be absolutely clear: I do not mean to talk down on neither the code nor the authors of the code, and I do not think this is bad code. It is just a good example.

A Motivating Example: String::contains

Maybe you have just learned about string searching algorithms. Aho-Corasick, Boyer-Moore, Knuth-Morris-Pratt, you name it. You, a curious person and a Rust programmer, start to wonder. How is String::contains implemented in the Rust standard library? Let us take a look. First we need to find the method:

$ rg "struct String"
src/liballoc/string.rs
262:pub struct String {
...
...

Okay, String is defined in liballoc, which is kind of weird? But alright.

We enter the file and search for fn contains, but nothing shows up. Strange, isn't it listed under String in the docs? After scrolling up 21 methods in the docs, we can find our issue: Methods from Deref<Target = str>

Yes, of course. The function is not a String method, but a str method (we know the difference between String and &str, but what was str again? Oh, maybe str becomes &str in (&self) methods). rg "struct str" and rg "struct Str " gives us nothing. No worries, we have fuzzy file search in our editor. Besides, str sounds fundamental enough that it should be in libcore. And we do find libcore/str/mod.rs.

Again we search for fn contains, and now we get three matches:

fn contains_nonascii(x: usize) -> bool {

...

/// Methods for string slices
pub trait StrExt {
    // NB there are no docs here are they're all located on the StrExt trait in
    // liballoc, not here.

    #[stable(feature = "core", since = "1.6.0")]
    fn contains<'a, P: Pattern<'a>>(&'a self, pat: P) -> bool;

...

#[stable(feature = "core", since = "1.6.0")]
impl StrExt for str {
    #[inline]
    fn contains<'a, P: Pattern<'a>>(&'a self, pat: P) -> bool {
        pat.is_contained_in(self)
    }

Here they are. The function is generic over Pattern. What is a Pattern anyways? If we read the docs of contains, we clearly see that Pattern is the argument, even though the examples would suggest that the argument is a &str. So we click on Pattern. Aha, it is just an abstraction that allows us to use different types as the pattern - both char, String, &str, and more3. Personally I would think that searching for a char and matching a &str are rather different problems4, but maybe this turned out to be a convenient way to handle Rusts lack of function overloading.

Back to contains. Pattern::is_contained_in is called. Maybe this is used to allow the types that implements Pattern to choose how they want to search themselves. Sounds reasonable, since we then are in the same situation as if we had function overloading. We are mostly concerned about &str (or String, or str?).

pub trait Pattern<'a>: Sized {
    ...
    /// Checks whether the pattern matches anywhere in the haystack
    #[inline]
    fn is_contained_in(self, haystack: &'a str) -> bool {
        self.into_searcher(haystack).next_match().is_some()
    }
    ...
}

So we make the pattern into a Searcher, and the haystack is the text we are searching in. The searcher seemingly iterates over all matches, but we are only interested if it is there at all, so we take the first and see if we got something.

We search for Searcher (a little ironic, don't you think?), and get this, in code form. We are getting closer. next_match seems alright: if we get a match between a and b, we got a match. If we are done without getting a match, we didn't get a match. Otherwise, we continue to call next (it is not really clear what the remaining case is, but maybe we get some information about mismatches). So what does next do? And where is its implementation for &str? Let's search for &str:

/////////////////////////////////////////////////////////////////////////////
// Impl for &str
/////////////////////////////////////////////////////////////////////////////

/// Non-allocating substring search.
///
/// Will handle the pattern `""` as returning empty matches at each character
/// boundary.
impl<'a, 'b> Pattern<'a> for &'b str {
    type Searcher = StrSearcher<'a, 'b>;

    #[inline]
    fn into_searcher(self, haystack: &'a str) -> StrSearcher<'a, 'b> {
        StrSearcher::new(haystack, self)
    }

Oh, okay so this is how we got the implementor of Searcher in the first place. Here we find StrSearcher, which sounds promising. The struct has members, there is an enum here, and another struct with something fw and bw (forwards and backwards?). No need to worry, we can try to understand all this stuff when it comes up. Let us look at the Searcher implementation.

unsafe impl<'a, 'b> Searcher<'a> for StrSearcher<'a, 'b> {
    ...
    fn next(&mut self) -> SearchStep {
        match self.searcher {
            StrSearcherImpl::Empty(ref mut searcher) => {
                // empty needle rejects every char and matches every empty string between them
                let is_match = searcher.is_match_fw;
                searcher.is_match_fw = !searcher.is_match_fw;
                let pos = searcher.position;
                match self.haystack[pos..].chars().next() {
                    _ if is_match => SearchStep::Match(pos, pos),
                    None => SearchStep::Done,
                    Some(ch) => {
                        searcher.position += ch.len_utf8();
                        SearchStep::Reject(pos, searcher.position)
                    }
                }
            }
            StrSearcherImpl::TwoWay(ref mut searcher) => {
                // TwoWaySearcher produces valid *Match* indices that split at char boundaries
                // as long as it does correct matching and that haystack and needle are
                // valid UTF-8
                // *Rejects* from the algorithm can fall on any indices, but we will walk them
                // manually to the next character boundary, so that they are utf-8 safe.
                if searcher.position == self.haystack.len() {
                    return SearchStep::Done;
                }
                let is_long = searcher.memory == usize::MAX;
                match searcher.next::<RejectAndMatch>(self.haystack.as_bytes(),
                                                      self.needle.as_bytes(),
                                                      is_long)
                {
                    SearchStep::Reject(a, mut b) => {
                        // skip to next char boundary
                        while !self.haystack.is_char_boundary(b) {
                            b += 1;
                        }
                        searcher.position = cmp::max(b, searcher.position);
                        SearchStep::Reject(a, b)
                    }
                    otherwise => otherwise,
                }
            }
        }
    }

We start of by matching on self.searcher, which is either Empty or TwoWay (what about OneWay?), and by the comment in the first case, we understand what is happening: StrSearchImpl::Empty is actually an empty pattern (this is confirmed by StrSearcher::new above). Not a very interesting case for us, so we move on to TwoWay.

First we check if we are Done. If so, we return Done. Then we check something about searcher.memory, but it is not clear what memory is, so maybe we should check that out. We find struct TwoWaySearcher (which is the type that TwoWay contains), and lo and behold: A comment describing the algorithm! Well, some background information anyways, but the code in TwoWaySearcher, which turns out to be the place where the real stuff happens, is well documented. Natually, the algorithm is rather convoluted (hard problems are hard --- who knew?), but we found out what we wanted to.

Let us try to sum up our journey. We wanted to know which string searching algorithm String::contains was. This method is from str, as String Derefs to str, and we would like to call contains on strings we don't own. Then our search string becomes a Pattern which we transform into a Searcher, which takes a haystack, which is our original text, and this searcher does the searching. Simple, right?

So what is the point?

I think that this journey was not trivial. We have skimmed a lot of code (I have, anyways), jumped between files and modules, read inline comments and markdown docs, and finally, at the bottom of the rabbit hole, we actually found out what we wanted. Why does something seemingly so simple have to be behind so many layers5?

Some of these layers have benefits --- there is no doubt about that. Maybe we would like to write s.contains('a'), s.contains("ab") or |s: &[char]| "ayyy".contains(s), and since we don't have function overloading, we need Pattern to abstract over the variations. Maybe we would like to have a common argument type for &str methods: &str have 20 methods that takes a Pattern, including contains, find, split, and replace.

But all of the layers? If we list all structs, enums, and Traits one needs to understand in order to dig through something rather simple like this6, can we explain why they all have to be there? Are some of them there because we might need them in the future7? Is it possible that we made this more convoluted than strictly needed? I don't claim to have an answer to any of these questions, but I think these are important questions, and I do not think they are asked often enough.

Of course, this is not to say that the str module is overengineered, or that I think there is anything wrong with this implementation. The only reason I brought it up as an example is because I did try to find out how contains works some months ago, but I gave up because I could not understand the whole system (admittedly I didn't spend too much time on it). There were simply too much stuff! It is ironic that we can simplify a system with abstractions so far as to end up with an even more complex system.

We like short functions8, and we like to introduce types to ensure type safety9. We like flexible solutions, and generalized interfaces. But it is easy to overlook what we are giving up by building a tower of abstractions, namely simplicity.

I think simplicity is something we, as developers, should strive for, and I think it is often something that is forgotten. I don't think that there is an inherit tradeoff between complex and flexible, and simple but inflexible. I think we can get both. But, as always, the best solution is the hardest to find.


Footnotes

  1. And don't let your algorithm professor/friend/family relative tell you otherwise!

  2. Documentation is a great tool here, but docs might be (1) misleading, (2) out of date, (3) lacking, or (4) non-existent.

  3. ... and impl<'a, F> Pattern<'a> for F where F: FnMut(char) -> bool? This is seemingly the same as s.chars().any(f).

  4. In the implementation sense, not the conceptual sense.

  5. Is str::contains actually an orge?

  6. You might think "Aha, but this isn't rather simple, because such and such", but I do think that something like this should be simple. I did not want to understand how the algorithm works, I just wanted to find it!

  7. In which case why are they there today?

  8. Although why functions and methods should be short is not always explained.

  9. For instance, see Pascal Hertleif's talk Writing Idiomatic Libraries in Rust [10:38]

This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License