Descent Context

"DeSceNt aims to ease the writing of distributed programs on a federation of plug computers." 
 
Plug computers are a generation of low-cost computers, such as the Raspberry Pi ($25), HummingBoard, Banana Pi, BeagleBone, MinnowBoard Max, Odroid-U3, Udoo, Radxa Rock, and low-cost smartphones. They offer a cheap and readily available infrastructure on which to deploy domestic on-line software.
 
A federation interconnects a large number of socially organized, autonomous and heterogeneous participants that agree on a minimal set of services. The concept of federation already exists in various domains: federated databases, federated social networks, and federations of clouds.
 
Why a federation of plugs? 
 
A federation of plug computers opens the opportunity for everyone to create cheap nano-clusters of domestic servers, host data and services, and federate these resources with friends, colleagues, and family based on social links. This project's vision is that plug computers can form the core component of a social infrastructure organized around the concept of federation.
 
Compared to existing services, a federation of plugs allows users to keep better control over their data. Instead of spreading information across multiple isolated providers and wasting time synchronizing it, data can be consolidated on personal plugs and made available for global search. This vision is clearly promoted by personal information systems, with companies and organizations selling home servers such as YunoHost, Amahi, ArkOS, OwnCloud, CozyCloud, Tonido, and FreedomBox.
 
Use-cases
 
A federation of plugs should be able to deploy services like Google Drive (with Google Docs and Google Slides), Dropbox, distributed social networks (Diaspora, MediaGoblin), distributed recommendation systems (WhatsUp), and federated stores.
 
 

What we need and Organization

To be able to build a federation of plugs, we need several basic components:

  • Reliable communications between plugs:
    • Publish-subscribe and multicast are basic communication primitives for programming on a federated infrastructure.
    • In the context of plugs, we should be able to foster and delineate small-scale sub-topologies (e.g. a node and its 2-hop neighbors) and optimize communication costs.
  • Federated data structures and consistency criteria to write programs 
    • Distributed infrastructures have their data structures: DHTs for P2P networks, NoSQL stores for cloud computing. Federations need their own data structures.
    • Consistency in distributed systems is ruled by the CAP theorem (Consistency, Availability and Partition tolerance). Our constraints push for keeping Availability and Partition tolerance.
    • If strong consistency has to be sacrificed, what kind of consistency can we reach? On which data structures?
  • We need security enforcement and usage control for this social infrastructure 
    • Switching from a centralized architecture to a distributed one does not solve privacy issues; it transforms them. In a way, we switch from one Big Brother to thousands of potential attackers. How can we monitor federations and at least detect malicious behavior?
    • A basic motivation for federations is control over data usage, so we should be able to know how the data on a plug is really used.

These needs motivated the organization of Descent into the following tasks:

  • Task 1: Federated Social Infrastructure. We want to provide federated program developers with a communication component handling reliable communication on a federation of plugs. Such a component should be able to foster and delineate small-scale sub-topologies (e.g. a node and its 2-hop neighbors) that demonstrate a high level of locality, either in the data space (similar users, similar changes, similar interests), on the network plane (nodes close to each other), or geographically.
  • Task 2: Quasi-causality and quasi-CRDT. Given the federated social infrastructure produced by Task 1, the objective is to provide probabilistic causal delivery and probabilistic Conflict-free Replicated Data Type (CRDT) structures such as sequences, sets and graphs.
  • Task 3: Non-monotonic disorderly programming. Given the quasi-CRDT data structures produced by Task 2, the objective is to deliver a language able to compose quasi-CRDTs and verify properties such as confluence. We aim to integrate quasi-CRDTs into dedicated languages such as BloomL and go beyond monotonicity.
  • Task 4: Securing federations of plugs. Given the federated social infrastructure produced by Task 1, the objective is to secure the federation of plugs by monitoring the evolution of divergence between streams on each node.
  • Task 5: Usage control in federations of plugs. Given the federated social infrastructure produced by Task 1, the objective is to attach usage control policies to each piece of data retrieved from the federation and to enforce these policies at the plug level.

Task 1: Federated Social Infrastructure

Polystyrene: The Decentralized Data Shape that Never Dies
Simon Bouget, Hoel Kervadec, Anne-Marie Kermarrec, François Taiani. Proceedings of the 34th IEEE International Conference on Distributed Computing Systems (ICDCS 2014), June 30 - July 3 2014, Madrid, Spain, pp. 288-297. doi: http://dx.doi.org/10.1109/ICDCS.2014.37.

The goal of Task 1 is to construct overlay topologies that connect plug computers so as to support federations of plug computers for personal cloud services. One key building block is the construction of emerging topologies created using decentralized topology construction protocols. Such protocols organize nodes along a predefined topology (e.g. a torus, ring, or hypercube) and can be used in many contexts, ranging from routing and storage systems to publish-subscribe and event dissemination. These protocols, however, typically assume no correlation between the physical location of nodes and their positions in the topology and, as a result, do not handle catastrophic failures well, in which a whole region of the topology disappears. When this occurs, the overall shape of the system typically gets lost. This is highly problematic in applications such as the ones considered in the Descent project, in which overlay nodes are used to map a virtual data space, be it for routing, indexing or storage. In this work, we proposed a novel decentralized approach that can recover from catastrophic correlated failures and reform the system's original topology, even if a large (contiguous) portion of the topology fails. Our approach relies on the dynamic decoupling between physical nodes and virtual ones, enabling a fast reshaping. Our results show that a 51,200-node torus converges back to a full torus in only 10 rounds after 50% of the nodes have crashed. Our protocol is both simple and flexible and provides a novel form of collective survivability that goes beyond the current state of the art.

More precisely, the shape-preserving decentralized protocol we propose (called Polystyrene), comes in the form of an add-on layer that can be plugged into any decentralized topology construction algorithm. The simple intuition behind our work consists in decoupling the positions of the nodes in the topology from the nodes themselves. As in existing epidemic topology construction protocols (e.g. T-Man, Vicinity), each Polystyrene node starts with one position. However, contrary to traditional topology construction systems, we allow Polystyrene nodes to change their positions (i.e. migrate) when nodes fail, and to redistribute themselves around the target shape. As a result, the original shape is maintained, albeit at a lower sampling density, resulting from the lower number of surviving nodes.

We enable this migratory behavior by storing two sets of data points per node: a first set of guest data points holds the points the node is in charge of, either as an initial assignment or as a result of failures. We say that the node is the primary holder of these guest data points. A second set of ghost data points contains copies of data held elsewhere in the network. When Polystyrene starts, there are no ghosts and only one guest data point per node: the node's original position. At any given time, the guest data points are used to derive a node's actual position, which is then fed to the underlying topology construction protocol. We use a simple projection mechanism, but this is an independent piece of our protocol that can easily be adapted to more complex situations.

Decoupling nodes from data points allows us to implement the migration we need to redistribute nodes around the target shape when catastrophic failures occur. Nodes migrate by periodically exchanging guest data points in order to reach a density-aware tessellation of the data space, i.e. a partition of data points across physical nodes that seeks to maximize locality. After each exchange, each node recomputes the position it provides to the underlying topology construction algorithm, moving in effect around the shape. In other words, nodes migrate by following the migration of their data points.
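The guest-point mechanism described above can be sketched in a few lines of Python. This is a toy illustration under our own simplifications, not the actual Polystyrene implementation: the class name, the centroid projection, and the balancing rule are assumptions made for the example.

```python
def dist2(a, b):
    """Squared Euclidean distance between two 2-D points."""
    return (a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2

class PolystyreneNode:
    """Toy sketch of a Polystyrene-style node: the node's position is
    derived from the guest data points it currently holds."""

    def __init__(self, initial_position):
        self.guests = [initial_position]  # points this node is primary holder of
        self.ghosts = []                  # copies of points held elsewhere (unused here)

    def position(self):
        # Simple projection mechanism: the centroid of the guest points.
        n = len(self.guests)
        return (sum(p[0] for p in self.guests) / n,
                sum(p[1] for p in self.guests) / n)

    def adopt(self, orphans):
        # Take over the guest points of a crashed neighbour: the target
        # shape survives, at a lower sampling density.
        self.guests.extend(orphans)

    def exchange(self, other):
        # Periodic balancing step: each node keeps the guests closest to
        # its own position, approximating a density-aware tessellation.
        pool = self.guests + other.guests
        me, them = self.position(), other.position()
        keep = [p for p in pool if dist2(p, me) <= dist2(p, them)]
        give = [p for p in pool if dist2(p, me) > dist2(p, them)]
        if keep and give:  # never leave a node without guest points
            self.guests, other.guests = keep, give
```

When a neighbour crashes, a surviving node adopts its guest points; recomputing the centroid then moves the survivor toward the vacated region, which is the migration the text describes.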

The result is a decentralized collectively resilient overlay that unlike existing algorithms can maintain its systemic topological properties even in the face of extreme failures. The resulting protocol is further highly scalable, showing a logarithmic convergence time in the size of the system: for instance, a 51,200-node torus converges back to a full torus in only 10 rounds after 50% of the nodes have crashed.

 

Task 2: Quasi-causality and quasi-CRDT

Nédelec, B., Molli, P., Mostefaoui, A., & Desmontils, E. (2013, September). LSEQ: an adaptive structure for sequences in distributed collaborative editing. In Proceedings of the 2013 ACM symposium on Document engineering (pp. 37-46). ACM. (Best Student Paper)

Google Docs made distributed collaborative editors popular: there were more than 190M Google Drive users in 2015. While such a service is really useful, it raises issues regarding privacy, economic intelligence and service limitations. Currently, Google Docs is limited to 50 users in real-time editing. In this paper, we aim to build an editor that can scale to millions of simultaneous users. This raises the issue of concurrently editing a sequence of elements in real time. Real-time constraints force us to replicate the shared sequence of characters on each participant. The main problem is to maintain the consistency of this massively replicated sequence of elements.

Distributed collaborative editors follow the optimistic replication approach: even with a high number of participants, each participant has a copy of the sequence of characters.

  • Each site freely updates its copy locally, with no lock and no communication with any other site.
  • Then, it broadcasts its operations to the others. Thanks to Task 1, we make the hypothesis that all operations eventually arrive.
  • Each site integrates the remote operations.

The system is correct if it ensures eventual consistency (every site sees the same values when the system is idle) and preserves intentions (effects observed at generation time are preserved at integration time, i.e. if 'x' is inserted between 'a' and 'b', then it will be integrated between 'a' and 'b' whatever the concurrent operations). Consistency can be formally defined thanks to Task 3.
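The replication loop above can be sketched with a deliberately simple replicated object, a grow-only set, whose operations commute and therefore converge whatever the delivery order. This is our own toy example, not code from an actual editor:

```python
class GSetReplica:
    """Toy sketch of the optimistic replication loop: a grow-only set
    whose 'add' operations commute, so every delivery order yields the
    same converged state."""

    def __init__(self):
        self.state = set()
        self.outbox = []

    def local_add(self, x):
        # Step 1: update the local copy freely, with no lock.
        self.state.add(x)
        self.outbox.append(("add", x))

    def flush(self):
        # Step 2: broadcast pending operations (here: just return them).
        ops, self.outbox = self.outbox, []
        return ops

    def integrate(self, ops):
        # Step 3: integrate remote operations.
        for kind, x in ops:
            if kind == "add":
                self.state.add(x)
```

Because 'add' is commutative and idempotent, eventual delivery of all operations (the hypothesis provided by Task 1) is enough for eventual consistency here; sequences are much harder precisely because insertions do not naturally commute.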

Conflict-free replicated data types can be used to build collaborative editors. The general principle is to attach to every element a unique identifier encoding the position of the element in the sequence, much like line numbers identify lines in a BASIC program. The complexity of the sequence type is mainly determined by the function that computes these identifiers.

Imagine you want to type "QWERTY" and internally have a sequence of 32 slots. Each time a character is typed, you have to choose an index at which to store it. If the user starts by typing 'Q', then a simple function can allocate index 1 for 'Q', 8 for 'W', and so on, up to 21 for 'Y'. But imagine you started by typing 'Y' and the function allocated index 1 to 'Y'; then 'T' has to be inserted before it, but there is no room. A naive function can just create a new sequence of 32 slots to extend the sequence at this position. But 'T' will take index 1 in this new sequence, and 'R' is coming next. This allocation function produces identifiers whose size is linear in the size of the sequence. Consequently, a sequence CRDT built on it will not scale in space.
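The naive allocation described above is easy to reproduce with identifiers represented as paths of digits compared lexicographically (the `alloc_between` helper below is our own illustrative construction, not the paper's code): repeatedly inserting at the head of the document adds one digit per insertion, hence linear growth.

```python
BASE = 32  # size of each level of the identifier tree

def alloc_between(p, q):
    """Allocate an identifier strictly between p and q (identifiers are
    lists of digits, compared lexicographically). Naive strategy: when
    there is no room at the current level, descend and open a new level
    of 32 slots, always allocating next to the left bound."""
    path, depth = [], 0
    while True:
        lo = p[depth] if depth < len(p) else 0
        hi = q[depth] if depth < len(q) else BASE
        if hi - lo > 1:
            return path + [lo + 1]
        path.append(lo)
        depth += 1
```

Typing right to left (every insertion before the current head) yields identifiers of length 1, 2, 3, ... so the n-th inserted element carries an identifier of n digits.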

The problem is to craft a function that creates small identifiers whatever the order in which the sequence is created.

The LSEQ function proposed in the paper relies on an exponential tree and random allocation. Each time a new allocation space has to be created, the size of this new space is doubled, and the allocation strategy in this space (incremental or decremental) is chosen randomly. We demonstrate that this function allocates identifiers of size O((log n)²) in the average case. Consequently, an editor built with LSEQ does not require costly procedures to rebalance identifiers.
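The following sketch conveys the idea of LSEQ-style allocation with a doubling base and a randomly chosen per-depth strategy (boundary+ allocates close to the left bound, boundary- close to the right one). It illustrates the principle only; the function name, the boundary value, and the encoding are our assumptions, not the exact algorithm of the paper.

```python
import random

def lseq_alloc(p, q, strategies, boundary=10):
    """Allocate an identifier strictly between p and q (lists of digits,
    compared lexicographically). The base doubles at each depth of the
    exponential tree; each depth keeps a randomly chosen strategy in
    `strategies`: '+' (allocate near p) or '-' (allocate near q)."""
    base = 32
    path, depth = [], 0
    while True:
        lo = p[depth] if depth < len(p) else 0
        hi = q[depth] if depth < len(q) else base
        if hi - lo > 1:
            if depth not in strategies:
                strategies[depth] = random.choice(("+", "-"))
            step = random.randint(1, min(boundary, hi - lo - 1))
            digit = lo + step if strategies[depth] == "+" else hi - step
            return path + [digit]
        path.append(lo)
        depth += 1
        base *= 2  # exponential tree: each new level doubles in size
```

Mixing the two strategies is what keeps identifiers short on average whether the user types left to right or right to left, since neither editing pattern can systematically exhaust a level.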

Based on this contribution, we plan to demonstrate how it is possible for 1M people to edit a shared document in real time. The prototype CRATE allows testing LSEQ in an editor; CRATE relies on the SPRAY gossip algorithm produced by Task 1. You can try it online.

Video demo

Task 3: Disorderly Programming

Matthieu Perrin, Achour Mostéfaoui, and Claude Jard. Brief Announcement: Update Consistency in Partitionable Systems. DISC 2014, p. 546.
 
Reliability of large-scale systems is a big challenge when building massive distributed applications over the Internet. At this scale, data replication is essential in order to ensure availability and convergence. Formal specifications of shared objects are a requirement for any attempt at verification. This problem is made harder by the CAP theorem (see Slide 2): it is impossible to provide the three following properties together for a shared object:
  • strong consistency: the object behaves as if there was a unique physical instance of it in the network,
  • availability: the methods of the object always return when they are called,
  • partition tolerance: separate parts of the network may be unable to communicate with each other for an unbounded amount of time.
As partition tolerance and availability are required in decentralized federations, dealing with weak consistency cannot be avoided. But if shared objects do not behave as their sequential counterparts, tools developed for sequential specifications cannot be used unchanged in large-scale distributed systems.
 
Our general approach is to split concurrent specifications into two complementary aspects: 
  • an abstract data type, defined by a transition system, which produces a sequential specification (see Slide 3),
  • a consistency criterion that transforms distributed histories (i.e. partially ordered sets of events, see Slide 4) into sequential histories in order to match them with the sequential specification. 
One advantage of doing so is that sequential specifications are easy to understand and to produce: they are already used in object-oriented programming languages, and they are at the base of most verification tools. Consistency criteria are relatively few and independent of the application: they can be explained and understood once and for all.
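This split can be made concrete in a few lines of Python: a sequential specification given as a transition system, and a checker that tells whether a totally ordered history is admitted by it. The names and the encoding are ours, chosen for illustration only.

```python
def register_step(state, op):
    """Sequential specification of a read/write register as a transition
    system: step(state, op) -> (new_state, return_value)."""
    kind, arg = op
    if kind == "write":
        return arg, None
    return state, state  # a read leaves the state unchanged and returns it

def admitted(history, init, step):
    """Check that a totally ordered history (a list of (operation,
    returned_value) pairs) is admitted by the sequential specification."""
    state = init
    for op, returned in history:
        state, expected = step(state, op)
        if expected != returned:
            return False
    return True
```

A consistency criterion then says how the partially ordered events of a distributed history may be rearranged into such a totally ordered history.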
 
Sequential consistency (see Slide 5) is an example of a strong consistency criterion. All the events of a history admitted by a sequentially consistent object can be totally ordered with an ordering that respects the partial ordering of the distributed history (lin(H)) as well as the sequential specification of the object (L(T)).
 
Eventual consistency (see Slide 6) is much weaker: it only requires that, if all the participants stop updating, they eventually converge to a common state. What it does not specify is to which states the object can converge. Because the definition of eventual consistency hardly mentions the sequential specification, fully specifying eventually consistent shared objects requires additional, counterintuitive and error-prone concurrent specification techniques. 
 
We propose a new consistency criterion, *update consistency* (see Slide 7), that strengthens eventual consistency by imposing that the common state be obtained by a linearization of the history. In other words, if we remove the reads made before convergence, the history is sequentially consistent. We also proved that any abstract data type has an update-consistent implementation.
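The generic construction behind this last result can be sketched as follows: tag every update with a (logical clock, replica id) pair, eventually exchange all updates, and have every replica replay the full update set in the same total order from the initial state. This is a simplified illustration of the principle, with names of our own choosing, not the paper's exact protocol.

```python
class UpdateConsistentObject:
    """Toy sketch of an update-consistent object: reads replay all known
    updates in the same total order on every replica."""

    def __init__(self, replica_id, initial, apply_update):
        self.replica_id = replica_id
        self.initial = initial
        self.apply_update = apply_update
        self.clock = 0
        self.updates = set()  # {(clock, replica_id, operation)}

    def update(self, op):
        # Tag the update so that (clock, replica_id) totally orders updates.
        self.clock += 1
        self.updates.add((self.clock, self.replica_id, op))

    def merge(self, remote_updates):
        # Anti-entropy: after enough merges, all replicas share the set.
        self.updates |= remote_updates
        self.clock = max([self.clock] + [c for c, _, _ in self.updates])

    def read(self):
        # Replaying in (clock, replica_id) order yields the same
        # linearization of the updates on every replica.
        state = self.initial
        for _, _, op in sorted(self.updates):
            state = self.apply_update(state, op)
        return state
```

Once two replicas hold the same update set, their reads agree, and the common state is by construction the result of a linearization of the updates, which is exactly what update consistency demands.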
 
As future work, we plan to extend the analysis to other weak consistency criteria, such as pipelined consistency and causal consistency. The goal is to build a complete map of the consistency criteria available in weak systems, together with their relative strengths (see Slide 8 for a preliminary version). 

Task 4: Securing & Usage Control in Federation of Plugs

E. Anceaume and Y. Busnel, "Deviation Estimation between Distributed Data Streams", Proceedings of the 10th European Dependable Computing Conference (EDCC), May 2014.

 

The performance of many complex monitoring applications, including Internet monitoring, data mining, sensor networks, and network intrusion/anomaly detection, depends on the detection of correlated events. For instance, detecting correlated network anomalies should drastically reduce the number of false positive or false negative alerts that network operators currently face when using network management tools such as SNMP or NetFlow. Indeed, to cope with the complexity and the amount of raw data, current network management tools analyze their input streams in isolation. Diagnosing flooding attacks through the detection of correlated flows should improve intrusion detection tools.


The point is that, in all these monitoring applications, data streams arrive at nodes at a very high rate and may contain up to several billions of data items per day. Thus, computing statistics with traditional methods is impractical due to constraints on both available processing capacity and memory. The problem tackled in this paper is the on-line estimation of the correlation between data streams. More precisely, we propose a distributed algorithm that approximates, in a single pass and with guaranteed error bounds, the linear relation between massive distributed sequences of data.

Two main approaches exist to monitor massive data streams in real time. The first one consists in regularly sampling the input streams so that only a limited number of data items is kept locally. This makes it possible to compute functions exactly on these samples. However, the accuracy of this computation with respect to the stream in its entirety fully depends on the volume of data items that has been sampled and on their order in the stream. Furthermore, an adversary may easily take advantage of the sampling policy to hide its attacks among data items that are not sampled, or in a way that prevents its "malicious" data items from being correlated.
In contrast, the streaming approach consists in scanning each piece of data of the input stream on the fly, and in locally keeping only compact synopses, or sketches, that contain the most important information about the data. This approach makes it possible to derive some statistics over the data streams with guaranteed error bounds, without making any assumption on the order in which data items are received at the nodes. Most of the research done so far with this approach has focused on computing functions or statistical measures with very small error, using space sublinear in the size of the item domain.
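The sketching approach can be illustrated with an AMS-style synopsis: each stream item is mapped to a deterministic ±1 sign, and the inner product of two synopses estimates how the two streams co-vary, using memory independent of the stream length. This is a textbook-style illustration of sketching in general, not the sketch* metric proposed in the paper; the class name and parameters are ours.

```python
import hashlib

class StreamSketch:
    """AMS-style synopsis (illustrative): the synopsis is `depth` running
    sums; the average of the row-wise products of two synopses estimates
    the inner product of the two streams' frequency vectors."""

    def __init__(self, depth=64):
        self.depth = depth
        self.counters = [0] * depth

    def _sign(self, row, item):
        # Deterministic hash-based sign: the same (row, item) pair gives
        # the same sign on every site, so synopses built independently
        # on distant nodes remain comparable.
        digest = hashlib.sha256(f"{row}:{item}".encode()).digest()
        return 1 if digest[0] % 2 == 0 else -1

    def add(self, item, count=1):
        # Single pass, constant memory: update every row on the fly.
        for row in range(self.depth):
            self.counters[row] += self._sign(row, item) * count

    def inner_product(self, other):
        # Estimator of how the two streams co-vary.
        return sum(a * b for a, b in zip(self.counters, other.counters)) / self.depth
```

Because only the small counter vectors need to be shipped to a coordinator, such synopses also fit the functional monitoring model discussed below, where the goal is to minimize communicated bits.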

On the other hand, very few works have tackled the distributed streaming model, also called the functional monitoring problem, which combines features of both the streaming model and communication complexity models. As in the streaming model, the input data is read on the fly and processed with minimal workspace and time.
In the communication complexity model, each node receives an input data stream, performs some local computation, and communicates only with a coordinator who wishes to continuously compute or estimate a given function of the union of all the input streams. The challenging issue in this model is for the coordinator to compute the given function while minimizing the number of communicated bits.

In this task, we go a step further by studying the dispersion matrix of distributed streams.
Specifically, we have proposed a novel metric that makes it possible to approximate in real time the correlation between distributed and massive streams. This metric, called the sketch* metric, allows us to quantify how observed data items change together, and in what proportion. We are convinced that such a network-wide traffic monitoring tool should allow monitoring applications to obtain significant information on changes in traffic behavior, and to subsequently inform more detailed detection tools about where DDoS attacks are currently active.

 

Conclusion & Perspectives

Descent achieved important results:

  1. An experimentation platform based on 48 Raspberry Pis is operational. It is to be extended to 150 Raspberry Pis. We successfully deployed the LSEQ prototype on the platform.
  2. Task 1 on the federated social infrastructure demonstrated how it is possible to establish reliable communications between federated participants. Thanks to Polystyrene, the communication layer is able to recover its topology even in the case of catastrophic failures.
  3. Task 2 on quasi-CRDTs showed how it is possible to build probabilistic data structures that support the constraints of federations. LSEQ allows building sequences that can be updated by millions of participants.
  4. Task 3 formalized and mapped consistency criteria that are weaker than strong consistency. It proposed a new criterion, "update consistency", that can be implemented for any data structure.
  5. Tasks 4 & 5 proposed stream-based approaches to detect malicious behaviors in a federation. The different contributions are compatible with the limited memory and low CPU capacities of plug computers.

As perspectives:

  • Task 1 is working, with one PhD student, on optimizations of the communication layer. It aims to take advantage of small-scale sub-topologies.
  • Task 2 is working on causality tracking with an original approach to anti-entropy.
  • Task 3 is continuing the exploration of weak consistency models.
  • Task 4 is working on windowing techniques for continuous monitoring.
  • Task 5 is working on an original approach to usage control.

In order to demonstrate Descent's achievements, we plan to set up a million-editor challenge. The objective is to deploy a social infrastructure and an editing environment able to support 1 million users in real time. Why do that? Because it is hard:

“We choose to go to the moon in this decade and do the other things, not because they are easy, but because they are hard.” (Kennedy, 1962)