On The Value of Decentralized Querying

Thursday, 2nd of May 2019 · by Mohamed ElSeidy

In our previous post in TD Research, Dr. Mohamed Fouda discussed our enthusiasm for the next wave of crypto companies that will build their offerings around blockchain-data querying and analytics services. One of the rising companies in this sector is The Graph, which is building a decentralized protocol for querying blockchain data and storage networks. In this article, we take a closer look at the reasons behind building the new protocol, its design, and its progressive approach towards decentralization.

In general, I believe we can divide the blockchain data-query sector into two main categories: 1) queries that require simple and fast data access and transformations, commonly used by front-end dApps; and 2) queries that require complex and expressive data analytics, e.g., SQL, commonly used by data scientists for number crunching and harvesting deep insights from large data corpora. Companies like Infura, Etherscan, Bloxy, and Amberdata target the first category of queries, whereas companies like Coinmetrics, TokenAnalyst, Alethio, and Chainalysis target the latter.

The Graph falls under the first category, and the team takes several positive steps towards decentralizing this direction: 1) The protocol is designed from the ground up with front-end engineers in mind, as their needs and preferences will be the driving force in the widespread adoption of the decentralized web. This allows teams to focus on their dApps' core functionality rather than worry about data storage, curation, and management. 2) The team takes a pragmatic and progressive approach to design and implementation, i.e., a smooth transition from a simple centralized solution towards trustless decentralization. This allows them to remain agile and flexible and to enable community adoption along the way.

The Evolution: Web 2.0 ⟶ 3.0

Web 2.0

For almost three decades, the web has provided us with a flourishing infrastructure for interaction and communication. Along the way, its design has undergone continuous evolution until it converged on the client-server architecture we know today.

However, in the current digital age, the manner in which people store, curate, query, and share information directly influences the structure of society and each individual's experience in life. A client-server architecture favours centralization by giving a great deal of power to whoever runs the server, causing an imbalance in information power. For example, a server administrator has ultimate power to grant and revoke access and to control the data, whereas end-users have no say in how things function. This is why only a few companies, e.g., Facebook and Google, run the show in the software and tech world, i.e., a highly skewed distribution of power.

Web 2.0: the client-server architecture. Image taken from The Graph.

Monopolies make it tougher for other participants to contribute their skills. Competition is suppressed, and freedom of choice disappears. The current transition to the new design of web 3.0 paves the way to unique opportunities that shift the balance of power away from monopolies towards a less centralized design offering more capacity and choice to individuals.

Web 3.0

We are currently experiencing an architectural shift that could enable a more decentralized experience on a massive scale. To realize it, a new set of fundamental protocols is being designed to let a new wave of decentralized experiences flourish and to change how software is built and deployed globally.

In this new wave, decentralized applications will treat users as first-class citizens, giving them control of their own data. That is, dApps are built on data that is either managed by the community or controlled privately by the user, rather than by a centralized entity. This way, many products and services can be engineered on pluggable datasets, empowering countless developers to thrive and users to choose and control how things work. Among these fundamental protocols, blockchains like Ethereum and storage networks like IPFS are expected to be central.

Why a Decentralized Query Protocol?

Now an important question arises: how do dApps efficiently query the data in blockchains and storage networks? That is the question The Graph mainly addresses. Current dApps build their own centralized indexing servers or contract with centralized data-query providers for querying functionality. These servers pull data from Ethereum, store it in their own databases, and expose it over an API. Suddenly we're not far from where we started with web 2.0: users still need to trust these project owners to keep operating their services correctly.
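To make that pattern concrete, here is a minimal sketch of such a centralized indexer, assuming ethers.js (v5) and Express; the RPC endpoint, token address, and in-memory store are hypothetical stand-ins for a real deployment.

```typescript
// A minimal sketch of the centralized indexing pattern that dApps
// build today: pull events from Ethereum, store them locally, and
// expose them over a bespoke HTTP API.
import { ethers } from "ethers";
import express from "express";

// Hypothetical RPC endpoint and token address.
const provider = new ethers.providers.JsonRpcProvider("https://mainnet.example-rpc.io");
const abi = ["event Transfer(address indexed from, address indexed to, uint256 value)"];
const token = new ethers.Contract("0x...TokenAddress", abi, provider);

// Naive in-memory "database"; a real service would use Postgres or similar.
const transfers: { from: string; to: string; value: string }[] = [];

// 1) Pull data from Ethereum as it arrives.
token.on("Transfer", (from: string, to: string, value: ethers.BigNumber) => {
  transfers.push({ from, to, value: value.toString() });
});

// 2) Expose the indexed data over a centralized API.
const app = express();
app.get("/transfers/:address", (req, res) => {
  res.json(transfers.filter((t) => t.to === req.params.address));
});
app.listen(3000);
```

Every dApp that repeats this exercise reintroduces a trusted operator. Quoting Yaniv Tal: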

The ultimate goal is to build a global API that all applications can use to query data. If one company controlled that API it would introduce an unacceptable level of risk for the system. They would be able to choose what data sources to surface, the validity of data, and unilaterally fix prices.

A decentralized data layer with an open marketplace will allow for healthy competition, where data vendors can challenge the accuracy and validity of one another's data in real time. Such a data ecosystem provides a robust and efficient infrastructure layer for dApps, which can significantly improve dApp performance and increase mainstream adoption.

The Decentralized Data-Query Layer

While blockchains and storage networks are essential elements of the stack, their data is not stored in a format optimized for direct consumption by applications. Applications typically need to transform the data before using it, e.g., filtering, projection, and ranking.
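A hypothetical example of such a transformation: raw transfer events, as a node might deliver them, reshaped into the view a UI actually renders.

```typescript
// Raw events from a node, reshaped for the UI: filter, project, rank.
interface Transfer {
  from: string;
  to: string;
  value: bigint;
  blockNumber: number;
}

function topIncoming(transfers: Transfer[], user: string, n: number) {
  return transfers
    .filter((t) => t.to === user)                // filtering: this user's transfers
    .map(({ from, value }) => ({ from, value })) // projection: drop unused fields
    .sort((a, b) =>                              // ranking: largest first
      b.value > a.value ? 1 : b.value < a.value ? -1 : 0)
    .slice(0, n);
}
```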

The Graph: a decentralized intermediate protocol that bridges dApps and decentralized storage layers. Image taken from The Graph.

The Legacy: Proprietary APIs

Today's web is connected via proprietary Application Programming Interfaces (APIs), which predominantly exist as an external interface to the workings of internal database management systems. With the advent of blockchains, however, we are on the verge of a new paradigm where data, not proprietary APIs, is the fundamental foundation for interoperability.

Proprietary APIs act as a protective encapsulation around a central database; they limit interactions with the raw data. Moreover, with the microservices pattern, data is encapsulated even further within organizations, such that services built by different teams must go through one another's APIs. This proliferation of APIs has been a massive success for software design and productivity, yet it still has its drawbacks:

  1. APIs are rigid and costly to maintain: APIs are designed with specific use cases in mind, such as a particular feature in a web application. The further you get from those use cases, the harder they become to use. This leads to an ever-expanding API surface in organizations and, in turn, technical debt: additional engineering work is needed to extend the API for each new feature.
  2. The current model is inefficient: The rigidity of the API model leads to a proliferation of databases and APIs, which often results in unnecessary indirections and redundancies, within and across data pipelines, to store the same data in another format or behind different API semantics. These inefficiencies require maintaining additional infrastructure and engineering resources.
  3. Proprietary APIs result in data monopolies: Placing data in silos leads to the centralization of power. It has become a common narrative in the tech industry for a company to revoke access to its APIs after previously encouraging developers to build on its platform. The damaging practices of such monopolies are a natural outcome of a software architecture paradigm that requires data to sit behind custodians.

The Future: GraphQL on Public Blockchain Data

Since blockchains and content-addressed networks are designed to be decentralized and public, there is no need to hide them behind protective layers to keep them secure and robust. Proprietary APIs are no longer an architectural necessity: processes can interact directly with decentralized data as a shared substrate for interoperability.

GraphQL: Querying Decentralized Data

GraphQL for dApp Data-Querying. Image taken from The Graph.

The Graph adopts GraphQL as a declarative language for data querying. GraphQL is both a query language and an interface definition language, invented and open-sourced by Facebook. It was designed to overcome the rigidity and inefficiency of traditional APIs by exposing a powerful and ergonomic query language directly to the consumers of an API. Applications built on traditional APIs might make hundreds of round-trip network calls; GraphQL, by contrast, can express all the data an application requires in a single query.
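As an illustration, here is a minimal sketch of such a single query, assuming graphql-request; the endpoint and the schema (user, posts, comments) are hypothetical. A REST API would typically need a separate round trip per level of nesting.

```typescript
import { request, gql } from "graphql-request";

// One query fetches a user, their latest posts, and each post's comments.
const query = gql`
  {
    user(id: "0xabc") {
      name
      posts(first: 10, orderBy: createdAt, orderDirection: desc) {
        title
        comments {
          id
        }
      }
    }
  }
`;

// Hypothetical subgraph endpoint.
request("https://api.example.com/subgraphs/example", query).then((data) =>
  console.log(data)
);
```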

GraphQL vs SQL

SQL has global adoption, is familiar to millions of developers, and is much more expressive than GraphQL as a query language. However, according to The Graph founders, Yaniv and Brandon, SQL will not be the preferred language of dApps building on the blockchain.

I tend to agree with this vision, but in context: GraphQL and SQL can co-exist to address different use cases, as each is designed for specific tasks. GraphQL will probably win on front-end developer productivity and ease of use; it is more convenient for the fast, well-defined, indexed data access most dApp developers need. SQL, on the other hand, will remain the more powerful choice for complex queries, advanced data analysis, and large-scale number crunching, e.g., Coinmetrics, TokenAnalyst, Chainalysis. Each targets specific use cases, services, scenarios, and audiences, i.e., there is no one-size-fits-all.
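As a concrete contrast, consider a hypothetical question: total transfer volume per day. In SQL it is an ad-hoc aggregation over raw rows; a stock GraphQL endpoint can answer it only if an aggregate entity was maintained at indexing time. Both queries below assume hypothetical schemas.

```typescript
// SQL: an ad-hoc aggregation, expressible on the fly over a raw table.
const sqlQuery = `
  SELECT DATE(block_time) AS day, SUM(value) AS volume
  FROM transfers
  GROUP BY day
  ORDER BY day DESC;
`;

// GraphQL: the same answer requires a pre-indexed aggregate entity,
// here a hypothetical dayData entity populated at indexing time.
const graphqlQuery = `
  {
    dayDatas(orderBy: date, orderDirection: desc) {
      date
      volume
    }
  }
`;
```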

GraphQL is suitable for dApp development for the following reasons:

  1. GraphQL is powerful enough: Even though the GraphQL query language isn't natively as expressive as SQL, a well-designed GraphQL endpoint can give you most of the querying capabilities you expect from an SQL query interface. For example, OpenCRUD exposes aggregation functionality. Some functionality, however, will likely remain out of reach, e.g., arbitrary data joins across entities.
  2. GraphQL is more ergonomic for front-end developers: What you sacrifice in expressiveness, you gain back in ergonomics. For example, GraphQL has a JSON-like syntax, and JSON is the most commonly used format for transmitting data on the web, so the syntax is approachable for the majority of web developers. Moreover, the GraphQL ecosystem has advanced tooling, such as React-Apollo, which makes it remarkably simple to integrate data fetched via GraphQL directly into the UI components of web applications; a minimal sketch follows this list. Finally, it is worth mentioning that The Graph plans to support SQL-like capabilities in the future: the design of the query tree, its intermediate representation, is language agnostic.
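To illustrate the ergonomics point, here is a minimal sketch of feeding a GraphQL query into a React component with React-Apollo; the posts schema is hypothetical, and an ApolloClient/ApolloProvider configured with the endpoint is assumed to wrap the app.

```typescript
import React from "react";
import gql from "graphql-tag";
import { Query } from "react-apollo";

// Hypothetical subgraph query: the ten most recent posts.
const RECENT_POSTS = gql`
  {
    posts(first: 10, orderBy: createdAt, orderDirection: desc) {
      id
      title
    }
  }
`;

// React-Apollo pipes query results straight into the component tree.
const RecentPosts = () => (
  <Query query={RECENT_POSTS}>
    {({ loading, error, data }: any) => {
      if (loading) return <p>Loading…</p>;
      if (error) return <p>Something went wrong.</p>;
      return (
        <ul>
          {data.posts.map((p: { id: string; title: string }) => (
            <li key={p.id}>{p.title}</li>
          ))}
        </ul>
      );
    }}
  </Query>
);

export default RecentPosts;
```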

A Journey Towards Decentralization

So, how far are we from the final product? The team is progressing quickly along its timeline. In July 2018, they announced a roadmap that takes progressive steps towards decentralization, following an approach that encourages the developer community to come on board along the way from a centralized solution to a decentralized one.

Graph Protocol design and implementation timeline. Currently at the stage of building a Hybrid Network.

The goal is a trustless, decentralized network that allows anyone to run a Graph node and contribute indexing, caching, validation, and query processing to the network. An efficient marketplace will be set up so that nodes can earn fees for their services and projects can have a low-cost, dependable, decentralized indexing solution.

However, to reach that goal, the team has been taking pragmatic steps:

  1. Local Node ✓: The first version of the software is a centralized standalone indexing server for Ethereum and IPFS. The node subscribes to events on Ethereum, executes user-provided scripts to transform the data, indexes it, and makes it available over GraphQL (a sketch of such a mapping script follows this list). The scripts compile to WASM, so processing is fast and results are deterministic.
  2. Hosted Service ✓: Next, they started hosting their own centralized service within the network, which kickstarts the community and makes it easier for teams to build on The Graph. At Graph Day they launched the Hosted Service and the Graph Explorer, which make it easy for developers to discover all the data being indexed on The Graph and to pull it into their dApps.
  3. The Hybrid Network: This is the current step of the Graph protocol-design specification, bridging the gap between the hosted service described above and the fully decentralized goal. It keeps the team agile and flexible, allowing them to prototype and evaluate many of the protocol's mechanisms, economics, and architectural decisions, and to incorporate feedback from the community, e.g., on dispute resolution, governance, and name registration. Systems design and implementation are adaptive, iterative processes, and the team's philosophy of progressive development speaks to a responsible and mature strategy for achieving their goals.
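As promised above, here is a hedged sketch of the kind of user-provided mapping script the Local Node executes, in the style of The Graph's AssemblyScript mappings (a TypeScript subset compiled to WASM); the Transfer event class and TransferRecord entity are assumed to be generated from a hypothetical subgraph manifest and schema.

```typescript
// Sketch of a mapping handler compiled to WASM and run by a Graph node.
// Transfer and TransferRecord are assumed generated artifacts from a
// hypothetical subgraph manifest and GraphQL schema.
import { Transfer } from "./generated/Token/Token";
import { TransferRecord } from "./generated/schema";

export function handleTransfer(event: Transfer): void {
  // Deterministic by construction: same event in, same entity out.
  let record = new TransferRecord(event.transaction.hash.toHex());
  record.from = event.params.from;
  record.to = event.params.to;
  record.value = event.params.value;
  record.save();
}
```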

Conclusion

We are currently witnessing exciting times in the development of the new decentralized web. We believe that data will be a first-class citizen in this new era, and we are confident that an entire ecosystem will be built around data querying and analytics.

The blockchain data-query landscape. Most of the current products are centralized; The Graph and Fluence are pioneers on the decentralized side. Note: this is not a comprehensive list.

The blockchain data-query ecosystem can be divided into four main quadrants, as shown in the figure above. The X-axis ranges from complex queries to simpler ones, whereas the Y-axis ranges from centralization to decentralization. The majority of current companies and products are centralized; only a few, namely The Graph and Fluence, target a decentralized infrastructure for dApp developers. dApps are decentralized, and hence they ought to be built upon decentralized, trust-minimized services.

The Graph Protocol is designed from the ground up with front-end engineers in mind to motivate widespread adoption. In a paradigm where dApps primarily consist of browser apps interfacing with smart contracts running on a blockchain, the needs and preferences of front-end engineers will be a driving force in determining the query technology stack of the decentralized web. Will The Graph succeed in this vision? Only time will tell. In my next post, I will delve into the second and exciting category of complex analytics in this space.

Thanks to Mohamed Fouda & Yaniv Tal for their feedback and insights on this article.

Author

Mohamed ElSeidy
Research Partner @tokendaily | PhD @EPFL Switzerland | Distributed Systems & Data Analytics | Former Senior Engineer @Verisign | Former Researcher @Microsoft