What we need and Organization
To build a federation of plugs, we need several basic components:
Reliable communications between plugs:
- Publish-subscribe and multicast are basic communication primitives for programming on a federated infrastructure.
- In the context of plugs, we should be able to foster and delineate small-scale sub-topologies (e.g. a node and its 2-hop neighbors) and optimize communication costs.
Federated data structures and consistency criteria to write programs:
- Distributed infrastructures have their own data structures: DHTs for P2P networks, NoSQL stores for cloud computing. Federations need their own data structures as well.
- Consistency in distributed systems is ruled by the CAP theorem (Consistency, Availability and Partition tolerance). Our constraints push for keeping Availability and Partition tolerance.
- If strong consistency has to be sacrificed, what kind of consistency can we reach? On which data structures?
We need security enforcement and usage control for this social infrastructure:
- Switching from a centralized architecture to a distributed one does not solve privacy issues; it transforms them. In a way, we move from one big brother to thousands of potential attackers. How can we monitor federations and at least detect malicious behavior?
- A basic motivation for federations is control over data usage. So we should be able to know how data on a plug is really used.
These needs motivated organizing Descent into the following tasks:
- Task 1: Federated Social Infrastructure. We want to provide federated program developers with a communication component handling reliable communication over a federation of plugs. Such a component should be able to foster and delineate small-scale sub-topologies (e.g. a node and its 2-hop neighbors) that demonstrate a high level of locality, either in the data space (similar users, similar changes, similar interests), on the network plane (low latency between nodes), or geographically (physically close).
- Task 2: Quasi-causality and quasi-CRDT. Given the federated social infrastructure produced by Task 1, the objective is to provide probabilistic causal delivery and probabilistic Conflict-free Replicated Data Type (CRDT) structures such as sequences, sets and graphs.
- Task 3: Non-monotonic disorderly programming. Given the quasi-CRDT data structures produced by Task 2, the objective is to deliver a language able to compose quasi-CRDTs and verify properties such as confluence. We aim to integrate quasi-CRDTs into dedicated languages such as BloomL and go beyond monotonicity.
- Task 4: Securing federations of plugs. Given the federated social infrastructure produced by Task 1, the objective is to secure the federation of plugs by monitoring the divergence evolution of streams on each node.
- Task 5: Usage control in federations of plugs. Given the federated social infrastructure produced by Task 1, the objective is to attach usage control policies to each piece of data retrieved from the federation and to enforce these policies at the plug level.
Task 1: Federated Social Infrastructure
Polystyrene: The Decentralized Data Shape that Never Dies
Simon Bouget, Hoel Kervadec, Anne-Marie Kermarrec, François Taiani. In Proceedings of the 34th IEEE International Conference on Distributed Computing Systems (ICDCS 2014), Madrid, Spain, 30 June - 3 July 2014, pp. 288-297. doi: http://dx.doi.org/10.1109/ICDCS.2014.37
The goal of Task 1 is to construct overlay topologies that connect plug computers so as to support federations of plug computers for personal cloud services. One key building block is the construction of emerging topologies created using decentralized topology construction protocols. Such protocols organize nodes along a predefined topology (e.g. a torus, ring, or hypercube) and can be used in many contexts, ranging from routing and storage systems to publish-subscribe and event dissemination. These protocols however typically assume no correlation between the physical location of nodes and their positions in the topology, and, as a result, do not handle catastrophic failures well, in which a whole region of the topology disappears. When this occurs, the overall shape of the system typically gets lost. This is highly problematic in applications such as the ones considered in the Descent project, in which overlay nodes are used to map a virtual data space, be it for routing, indexing or storage. In this work, we proposed a novel decentralized approach that can recover from catastrophic correlated failures and reform the system's original topology, even if a large (contiguous) portion of the topology fails. Our approach relies on dynamically decoupling physical nodes from virtual ones, enabling fast reshaping. Our results show that a 51,200-node torus converges back to a full torus in only 10 rounds after 50% of the nodes have crashed. Our protocol is both simple and flexible and provides a novel form of collective survivability that goes beyond the current state of the art.
More precisely, the shape-preserving decentralized protocol we propose, called Polystyrene, comes in the form of an add-on layer that can be plugged into any decentralized topology construction algorithm. The simple intuition behind our work consists in decoupling the positions of the nodes in the topology from the nodes themselves. As in existing epidemic topology construction protocols (e.g. T-Man, Vicinity), each Polystyrene node starts with one position. However, contrary to traditional topology construction systems, we allow Polystyrene nodes to change their positions (i.e. migrate) when nodes fail, and to redistribute themselves around the target shape. As a result, the original shape is maintained, albeit at a lower sampling density, resulting from the lower number of surviving nodes.
We enable this migratory behavior by storing two sets of data points per node: a first set of guest data points holds the points the node is in charge of, either as an initial assignment or as a result of failures. We say that the node is a primary holder of these guest data points. A second set of ghost data points contains copies of data held elsewhere in the network. When Polystyrene starts, there are no ghosts, and only one guest data point per node: the node's original position. At any given time, guest data points are used to derive a node's actual position, which is then fed to the underlying topology construction protocol. We use a simple projection mechanism, but this is an independent piece of our protocol that can easily be adapted to more complex situations.
Decoupling nodes from data points allows us to implement the migration we need to redistribute nodes around the target shape when catastrophic failures occur. Nodes migrate by periodically exchanging guest data points in order to reach a density-aware tessellation of the data space, i.e. a partition of data points across physical nodes that seeks to maximize locality. After each exchange, each node recomputes the position it provides to the underlying topology construction algorithm, moving in effect around the shape. In other words, nodes migrate by following the migration of their data points.
The result is a decentralized, collectively resilient overlay that, unlike existing algorithms, can maintain its systemic topological properties even in the face of extreme failures. The resulting protocol is furthermore highly scalable, showing a convergence time logarithmic in the size of the system: for instance, a 51,200-node torus converges back to a full torus in only 10 rounds after 50% of the nodes have crashed.
Task 2: Quasi-causality and quasi-CRDT
Nédelec, B., Molli, P., Mostefaoui, A., & Desmontils, E. (2013, September). LSEQ: an adaptive structure for sequences in distributed collaborative editing. In Proceedings of the 2013 ACM symposium on Document engineering (pp. 37-46). ACM. (Best Student Paper)
Google Docs made distributed collaborative editors popular; there were more than 190 million Google Drive users in 2015. While such a service is very useful, it raises issues of privacy, economic intelligence and service limitations. Currently, Google Docs is limited to 50 users in real-time editing. In this paper, we aim to build an editor that can scale to millions of simultaneous users. This raises the issue of concurrently editing a sequence of elements in real time. Real-time constraints force us to replicate the shared sequence of characters on each participant. The main problem is to maintain consistency of this massively replicated sequence of elements.
Distributed collaborative editors follow the optimistic replication approach. Given a high number of participants, each participant has a copy of the sequence of characters.
- Each site freely updates its copy locally, with no lock and no communication with any other site.
- Then, it broadcasts its operations to the others. Thanks to Task 1, we make the hypothesis that all operations eventually arrive.
- Each site integrates remote operations.
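The three steps above can be sketched with a toy op-based replica. This is illustrative code under our own naming, not the project's implementation; a grow-only set stands in for the sequence to keep the sketch short.

```python
class Replica:
    """Toy op-based replica illustrating the three steps:
    local update, broadcast, remote integration (grow-only set)."""

    def __init__(self):
        self.state = set()
        self.outbox = []

    def local_insert(self, x):
        # Step 1: update the local copy freely, no lock, no coordination.
        self.state.add(x)
        self.outbox.append(("insert", x))

    def drain(self):
        # Step 2: hand pending operations to the broadcast layer (Task 1).
        ops, self.outbox = self.outbox, []
        return ops

    def integrate(self, ops):
        # Step 3: apply operations received from remote sites.
        for op, x in ops:
            if op == "insert":
                self.state.add(x)

a, b = Replica(), Replica()
a.local_insert("x"); b.local_insert("y")          # concurrent local edits
ops_a, ops_b = a.drain(), b.drain()
a.integrate(ops_b); b.integrate(ops_a)            # assumes reliable delivery
assert a.state == b.state == {"x", "y"}           # replicas converge when idle
```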
The system is correct if it ensures eventual consistency (every site sees the same values when the system is idle) and preserves intentions (effects observed at generation time are preserved at integration time, i.e. if 'x' is inserted between 'a' and 'b', then it will be integrated between 'a' and 'b' whatever the concurrent operations). Consistency can be formally defined thanks to Task 3.
Conflict-free Replicated Data Types can be used to build collaborative editors. The general principle is to attach to every element a unique identifier encoding the position of the element in the sequence, like line numbers in a BASIC program. The complexity of the sequence type is mainly determined by the function that computes these identifiers.
Imagine you want to type "QWERTY" and internally have a sequence of 32 slots. Each time a character is typed, you have to choose an index to store it. If the user starts by typing 'Q', then a simple function can assign index 1 to 'Q', index 8 to 'W', and so on up to index 21 for 'Y'. But imagine you started by typing 'Y' and the function allocated index 1 to 'Y'; then 'T' has to be inserted before it, but there is no room. A naive function can just create a new sequence of 32 slots to extend the sequence at this position. But 'T' will take index 1 in this new sequence, and 'R' is coming next. This allocation function produces identifiers whose size grows linearly with the size of the sequence. Consequently, a sequence CRDT built on it will not scale in space.
The problem is to craft a function that creates small identifiers whatever the insertion order.
The LSEQ function proposed in the paper relies on an exponential tree and random allocation. Each time a new allocation space has to be created, the size of this new space is doubled, and the allocation strategy in this space (incremental or decremental) is chosen randomly. We demonstrate that this function allocates identifiers of size O((log n)^2) in the average case. Consequently, an editor built with LSEQ does not require costly procedures to rebalance identifiers.
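A simplified sketch of such an allocation function is given below. It is an assumption-laden illustration, not the paper's algorithm: the initial base of 16, the BOUNDARY value of 10, and the helper names are our own choices; only the two core ideas (a base that doubles at each tree level, and a per-level strategy drawn at random between "pack left" and "pack right") come from LSEQ.

```python
import random

BOUNDARY = 10            # max step from either end (hypothetical tuning value)
strategy = {}            # depth -> "b+" or "b-", drawn once then reused

def base(depth):
    # Exponential tree: the number of slots doubles at each level.
    return 2 ** (4 + depth)          # 16, 32, 64, ...

def prefix(ident, depth):
    # Interpret the first `depth` digits of an identifier as one integer.
    value = 0
    for d in range(depth):
        digit = ident[d] if d < len(ident) else 0
        value = value * base(d) + digit
    return value

def digits(value, depth):
    # Inverse of `prefix`: split an integer back into per-level digits.
    out = []
    for d in reversed(range(depth)):
        out.append(value % base(d))
        value //= base(d)
    return out[::-1]

def alloc(p, q):
    """Allocate an identifier strictly between p and q (LSEQ-style sketch)."""
    depth, interval = 0, 0
    while interval < 1:              # descend until there is room at this depth
        depth += 1
        interval = prefix(q, depth) - prefix(p, depth) - 1
    step = min(BOUNDARY, interval)
    if strategy.setdefault(depth, random.choice(["b+", "b-"])) == "b+":
        value = prefix(p, depth) + random.randint(1, step)   # pack left
    else:
        value = prefix(q, depth) - random.randint(1, step)   # pack right
    return digits(value, depth)

left, right = [0], [15]
a = alloc(left, right)               # e.g. first character typed
b = alloc(left, a)                   # a character inserted before it
assert left < b < a < right          # identifiers stay totally ordered
```

Identifiers compare lexicographically, so inserting at either end of the sequence only appends short digits instead of forcing a rebalance.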
Based on this contribution, we plan to demonstrate how 1 million people can edit a shared document in real time. The CRATE prototype allows testing LSEQ in an editor; CRATE relies on the SPRAY gossip algorithm produced by Task 1. You can try it online.
Task 3 results: Disorderly Programming
The CAP theorem states that a shared object cannot offer at the same time:
- strong consistency: the object behaves as if there were a unique physical instance of it in the network;
- availability: the methods of the object always return when they are called;
- partition tolerance: separate parts of the network may be unable to communicate with each other for an unbounded amount of time.
Since Descent gives up strong consistency, Task 3 characterizes weaker consistency criteria using two ingredients:
- an abstract data type defined by a transition system, which produces a sequential specification (see Slide 3);
- a consistency criterion that maps distributed histories (i.e. partially ordered sets of events, see Slide 4) onto sequential histories in order to match them against the sequential specification.
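As a toy illustration of matching a distributed history against a sequential specification (the read/write register spec and the event names are our own, not the paper's formalism): a history is consistent when at least one linear extension of its partial order is a valid sequential execution.

```python
from itertools import permutations

# A distributed history: events plus a partial order (happens-before pairs).
# Sequential spec: a register whose reads return the last written value.
events = {
    "w1": ("write", 1),
    "w2": ("write", 2),
    "r":  ("read", 2),
}
happens_before = {("w1", "r")}        # w1 precedes r; w2 is concurrent

def respects(order):
    # Does this total order extend the history's partial order?
    pos = {e: i for i, e in enumerate(order)}
    return all(pos[a] < pos[b] for a, b in happens_before)

def sequentially_valid(order):
    # Replay the order against the sequential register specification.
    last = None
    for e in order:
        op, val = events[e]
        if op == "write":
            last = val
        elif last != val:             # a read must return the last write
            return False
    return True

# The history is consistent iff some linear extension matches the spec.
witnesses = [o for o in permutations(events)
             if respects(o) and sequentially_valid(o)]
assert witnesses                      # e.g. ('w1', 'w2', 'r') is a witness
```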
Task 4: Securing & Usage Control in Federation of Plugs
E. Anceaume and Y. Busnel, "Deviation Estimation between Distributed Data Streams", Proceedings of the 10th European Dependable Computing Conference (EDCC), May 2014.
The performance of many complex monitoring applications, including Internet monitoring, data mining, sensor networks, and network intrusion/anomaly detection, depends on the detection of correlated events. For instance, detecting correlated network anomalies should drastically reduce the number of false positive or negative alerts that network operators currently face when using network management tools such as SNMP or NetFlow. Indeed, to cope with the complexity and the amount of raw data, current network management tools analyze their input streams in isolation. Diagnosing flooding attacks through the detection of correlated flows should improve intrusion detection tools.
The point is that, in all these monitoring applications, data streams arrive at nodes at a very high rate and may contain up to several billion data items per day. Computing statistics with traditional methods is thus impractical due to constraints on both available processing capacity and memory. The problem tackled in this paper is the on-line estimation of data stream correlation. More precisely, we propose a distributed algorithm that approximates, with guaranteed error bounds and in a single pass, the linear relation between massive distributed sequences of data.
Two main approaches exist to monitor massive data streams in real time. The first consists in regularly sampling the input streams so that only a limited amount of data items is kept locally. This makes it possible to compute functions exactly on these samples. However, the accuracy of this computation with respect to the stream in its entirety fully depends on the volume of data items that has been sampled and their order in the stream. Furthermore, an adversary may easily take advantage of the sampling policy to hide its attacks among data items that are not sampled, or in a way that prevents its "malicious" data items from being correlated.
In contrast, the streaming approach consists in scanning each piece of data of the input stream on the fly, and in locally keeping only compact synopses, or sketches, that contain the most important information about these data. This approach makes it possible to derive some data stream statistics with guaranteed error bounds without making any assumption on the order in which data items are received at nodes. Most of the research done so far with this approach has focused on computing functions or statistical measures with very small error using space sublinear in the item domain size.
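As a concrete example of such a synopsis (a generic one from the sketch family, not the sketch* metric of this work), a Count-Min sketch estimates per-item frequencies in space sublinear in the item domain size, with a one-sided error: it may overestimate a count but never underestimates it.

```python
import random

class CountMinSketch:
    """Generic streaming synopsis: estimates item frequencies in
    O(width * depth) space, independently of the item domain size.
    Illustrative only; not the sketch* metric of the paper."""

    def __init__(self, width=256, depth=4, seed=42):
        rng = random.Random(seed)
        self.width = width
        # One salted hash row per depth level.
        self.salts = [rng.random() for _ in range(depth)]
        self.table = [[0] * width for _ in range(depth)]

    def _cells(self, item):
        for row, salt in enumerate(self.salts):
            yield row, hash((salt, item)) % self.width

    def add(self, item):
        # Scan on the fly: each item touches one counter per row.
        for row, col in self._cells(item):
            self.table[row][col] += 1

    def estimate(self, item):
        # Collisions only inflate counters, so the minimum over the
        # rows upper-bounds the true frequency from above.
        return min(self.table[row][col] for row, col in self._cells(item))

cms = CountMinSketch()
for _ in range(1000):
    cms.add("10.0.0.1")          # a heavy flow
cms.add("10.0.0.2")              # a light flow
assert cms.estimate("10.0.0.1") >= 1000   # never underestimated
```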
On the other hand, very few works have tackled the distributed streaming model, also called the functional monitoring problem, which combines features of both the streaming model and communication complexity models. As in the streaming model, the input data is read on the fly and processed with minimal workspace and time.
In the communication complexity model, each node receives an input data stream, performs some local computation, and communicates only with a coordinator who wishes to continuously compute or estimate a given function of the union of all the input streams. The challenging issue in this model is for the coordinator to compute the given function by minimizing the number of communicated bits.
In this task, we go a step further by studying the dispersion matrix of distributed streams.
Specifically, we have proposed a novel metric that makes it possible to approximate in real time the correlation between distributed and massive streams. This metric, called the sketch* metric, quantifies how observed data items change together, and in what proportion. We are convinced that such a network-wide traffic monitoring tool should allow monitoring applications to obtain significant information on traffic behavior changes and subsequently inform more detailed detection tools on where DDoS attacks are currently active.
Conclusion & Perspectives
Descent achieved important results:
- An experimentation platform based on 48 Raspberry Pis is operational. It is to be extended to 150 Raspberry Pis. We successfully deployed the LSEQ prototype on the platform.
- Task 1 on the federated social infrastructure demonstrated how to establish reliable communications between federated participants. Thanks to Polystyrene, the communication layer is able to recover its topology even in case of catastrophic failures.
- Task 2 on quasi-CRDTs showed how to build probabilistic data structures that support the constraints of federations. LSEQ allows building sequences that can be updated by millions of participants.
- Task 3 formalized and mapped consistency criteria that are weaker than strong consistency. It proposed a new criterion, "update consistency", that can be implemented for any data structure.
- Tasks 4 & 5 proposed stream-based approaches to detect malicious behaviors in a federation. The different contributions are compatible with the limited memory and low CPU capacity of plug computers.
Work is ongoing on all tasks:
- Task 1 is working, with one PhD student, on optimizations of the communication layer. It aims to take advantage of small-scale sub-topologies.
- Task 2 is working on causality tracking with an original approach to anti-entropy.
- Task 3 is continuing the exploration of weak consistency models.
- Task 4 is working on windowing techniques for continuous monitoring.
- Task 5 is working on an original approach to usage control.
To demonstrate Descent's achievements, we plan to build a million-editor challenge. The objective is to deploy a social infrastructure and an editing environment able to support 1 million users in real time. Why do that? Because it is hard:
"We choose to go to the moon in this decade and do the other things, not because they are easy, but because they are hard." (Kennedy, 1962)