Week 8: A Recap
Living In A Materialized World
The issues surrounding my choice of an underlying data store are well-documented here. My initial plan was to postpone a final decision until I had gained more insight into the exact requirements of my backend. But over the last week, I came to realize that how to store user-submitted data is one of the central (and probably the most crucial) aspects of the backend design. So it only made sense to keep my focus on that: it would have felt really wasteful to write a whole makeshift persistence layer, only to rip it out in two weeks for a complete rewrite.
Initially, I explored my options from a high-level perspective, researching different database types, with the book Seven Databases in Seven Weeks (Pragmatic Bookshelf, 2012) providing some great guidance and introductions. What piqued my curiosity were graph databases like Neo4j, because they seem to solve my hierarchy problems quite well. Like my underlying data structure, graph databases emphasize relationships (edges) between different datasets (nodes). Since the 200 OK data stored for each API resembles a tree structure (which for this purpose is just a graph with restrictive rules), they provide an interesting alternative. But I decided against exploring this option after seeing a whole new ecosystem previously unknown to me, with its own set of best practices and a unique query language called Cypher.¹
Plus, I discovered another approach, usable with almost any regular relational or non-relational database, that provides an even better solution to my core problem.
Nested API resource URIs like /users/42/images/23/comments/2 are already a (URL-friendly) way of representing a tree hierarchy. So in hindsight it's pretty obvious that this can be saved as meta information alongside the actual resource data to represent the exact relationship of said data. This concept bears the funky name materialized paths, and it's a pretty good fit for 200 OK as it makes having a relational schema unnecessary. Now, I was a bit hesitant about following schema-less persistence all the way through, but it's really hard to translate the volatile nature of my data into a proper SQL structure. This might be an expression of my lack of knowledge regarding SQL (something I really need to change at some point), but it doesn't feel right to create SQL tables during runtime when I'm really only dumping unstructured JSON into them, negating the main advantage that SQL brings to the table (pun not intended).
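To make that concrete, here's a minimal sketch of the decomposition. The helper name splitResourceUri is my own invention, not something from the actual codebase:

```javascript
// Hypothetical helper (the name is mine, not from the 200 OK codebase):
// split a nested resource URI into the materialized path of its parent
// collection and the trailing numeric item id.
function splitResourceUri(uri) {
  const segments = uri.split('/').filter(Boolean); // drop empty segments
  return {
    path: segments.slice(0, -1).join('/'), // everything before the last segment
    id: Number(segments[segments.length - 1]), // the trailing item id
  };
}

// splitResourceUri('/users/42/images/23/comments/2')
//   → { path: 'users/42/images/23/comments', id: 2 }
```

The path part is the materialized path: it encodes the item's entire ancestry in a single string, which is the whole trick.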
Here is my current solution with materialized paths: the store of choice is MongoDB, where each user-created API gets exactly one collection. Every resource item is stored as a separate document, with two fields dedicated to uniquely identifying it: a path field (holding 'users/42/images/23/comments' for the example above) and an id field (2 for the example). Together these two form a compound index that makes data retrieval lookups cheap and easy, because it essentially means I can pipe the request URL almost 1-to-1 into the MongoDB driver. And while this way of storing hierarchy might not hold up for huge collections, that won't be an issue for me, since I intend to restrict the overall number of items for each API anyway, keeping all collections relatively small in size.
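As a rough illustration of that lookup, here's the compound-key retrieval with the MongoDB collection stood in for by a plain array, so it runs without a database. The field names path and id are my assumptions, not necessarily the ones used in 200 OK:

```javascript
// With the real Node MongoDB driver, the one-time setup and the lookup
// would be roughly (field names assumed):
//   await collection.createIndex({ path: 1, id: 1 }, { unique: true });
//   const doc = await collection.findOne({ path, id });
// Below, a plain array stands in for the collection.
const docs = [
  { path: 'users/42/images', id: 23, url: 'cat.jpg' },
  { path: 'users/42/images/23/comments', id: 1, body: 'first!' },
  { path: 'users/42/images/23/comments', id: 2, body: 'nice shot' },
];

// In-memory equivalent of findOne({ path, id }): the pair uniquely
// identifies one document, no relational schema required.
function findByPathAndId(collection, path, id) {
  return collection.find((d) => d.path === path && d.id === id) ?? null;
}
```

Making the compound index unique doubles as a guarantee that no two items in one API can ever share the same path/id pair.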
One actual downside of this approach is the lack of a proper auto-increment for each resource item's id. I could rely on MongoDB's globally unique ObjectId generated for each document, but I feel that having an incrementing integer instead of a 24-character hexadecimal string makes for easier debugging and reasoning about your response data.
The lack of auto-generated integer ids is not specific to materialized paths, but a general problem whenever you don't dedicate a whole overarching structure (like a table or collection) to each resource, or don't have a database that allows a custom method to implement this (and I haven't found a way for MongoDB). So I'll have to implement at least one other collection that stores, for each resource path, the highest id handed out so far. That increases the number of database queries by one for each POST request, but I don't see a way to circumvent that. Plus, it's the only disadvantage I can think of right now, which seems like a worthy trade-off.
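A sketch of what that counter collection could look like, assuming the classic MongoDB counter pattern (findOneAndUpdate with $inc and upsert). All names here — counters, seq, nextId — are illustrative, and below the logic is simulated with a Map so it runs without a database:

```javascript
// With the real Node driver, the counter bump would be roughly:
//   const result = await counters.findOneAndUpdate(
//     { _id: resourcePath },          // one counter document per resource path
//     { $inc: { seq: 1 } },           // atomically bump the highest id
//     { upsert: true, returnDocument: 'after' }
//   );
//   // depending on driver version, the new id is result.seq or result.value.seq
// In-memory simulation of the same logic:
function makeCounters() {
  const store = new Map(); // resource path → highest id handed out so far
  return {
    nextId(resourcePath) {
      const next = (store.get(resourcePath) ?? 0) + 1; // "upsert" at 0, then increment
      store.set(resourcePath, next);
      return next;
    },
  };
}
```

The upsert means new resource paths need no setup step, and because the real increment happens atomically inside the database, two concurrent POSTs can't be handed the same id.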
Talking And Reading About Code
While the decision for data storage sounds like a confident choice, it absolutely wasn't. As a novice programmer working alone, I lack any confidence in my decisions because of a lingering fear of missing obvious, glaring problems in my solution, or of not realizing that there might be a better way.
My confidence in following through with this NoSQL solution stems mostly from having been able to talk about it at length with my sister Christina, a battle-hardened software engineer currently working at the conversational AI startup Mercury.ai. And she's an all-around awesome person (plus, she publicly shares her "notes on nuclear technology and code". Nuclear Technology and Code, how cool is that?), so I was glad to be able to talk her through my project and especially the data store issues behind it. Just hearing that my ideas are not completely terrible has boosted my confidence enormously and will allow me to finally focus on actually implementing stuff.
Fueled by my desire for external validation, I was also looking for authoritative resources on other topics where I wasn't sure what the best practices were. Hunting after a digital copy of Web Development with Node and Express (O'Reilly, 2019), I stumbled upon O'Reilly's subscription learning platform again and started a 10-day trial out of curiosity (and because they had the book I was looking for, of course). It was only after searching for other books on the platform that I realized the extent of the back catalogue available: not just O'Reilly titles, but also a sizeable portion of titles from Manning, the Pragmatic Bookshelf, Packt and a few more. Granted, the subscription fee is pretty hefty ($49/month), but having paid once, I now have a month to skim through at least a dozen interesting titles, picking up snippets of knowledge from books that hold at most a few interesting chapters for me and that I wouldn't have bought otherwise — and they're all easily searchable now.
But yeah, before all that comes a week or two of actually churning out some code. I have the soft goal of finishing the backend API server by the beginning of March, so I find myself in need of digging really deep into my codebase. More on that next week.
Time spent this week: 41 hours
¹ A database called Neo, a language called Cypher: having blatantly obvious Matrix references makes anything immediately more likeable to me.