Add kademlia slides

2026-02-06 19:26:41 +01:00 · 2020-03-10 22:23:46 +01:00
parent df47c73acd
commit 9e3f55568f
2 changed files with 164 additions and 0 deletions
--- a/kademlia.pdf
+++ b/kademlia.pdf
--- a/src/kademlia/kademlia.md
+++ b/src/kademlia/kademlia.md
@@ -0,0 +1,164 @@
+---
+marp: true
+---
+
+
+# Kademlia
+
+
+<br><br><br>
+[@arnaucube](https://twitter.com/arnaucube)
+
+2019-04-26
+
+---
+
+### Overview
+- each node assigns itself a random unique ID (UUID)
+- nodes are grouped in `neighbourhoods` determined by the `node ID` distance
+- Kademlia uses a `distance` calculation between two nodes
+	- distance is computed as XOR (exclusive or) of the two `node ID`s
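A minimal sketch of the distance calculation, assuming `node ID`s are held as plain Python integers (the function name `xor_distance` is made up for illustration):

```python
def xor_distance(id_a: int, id_b: int) -> int:
    """Kademlia distance: the XOR of the two IDs, read as an unsigned integer."""
    return id_a ^ id_b


# 3-bit IDs, written in binary for readability
near = xor_distance(0b110, 0b111)  # only the last bit differs
far = xor_distance(0b110, 0b010)   # the first bit differs
print(near, far)
```

The earlier (more significant) the first differing bit, the larger the distance.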
+
+---
+
+- XOR acts as the distance function between all `node ID`s. Why:
+	- distance between a node and itself is zero
+	- it is symmetric: the distance from A to B is the same as from B to A
+	- it satisfies the `triangle inequality`
+		- given A, B, C vertices (points) of a triangle
+		- AB <= (AC + CB)
+			- the distance from A to B is less than or equal to the sum of the distance from A to C plus the distance from C to B
+		- so, going directly is never longer than a detour through a third node
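The three properties can be checked by brute force on a small ID space. This is only an illustrative sketch: the 4-bit ID size (`BITS`) is an assumption chosen so the check is exhaustive, and the helper `d` is a made-up name.

```python
from itertools import product

BITS = 4  # assumed tiny ID size, so every combination can be tested
ids = range(2 ** BITS)


def d(a: int, b: int) -> int:
    """XOR distance between two IDs."""
    return a ^ b


# distance between a node and itself is zero
identity = all(d(a, a) == 0 for a in ids)
# distance from A to B equals distance from B to A
symmetry = all(d(a, b) == d(b, a) for a, b in product(ids, repeat=2))
# AB <= AC + CB for every triple
triangle = all(d(a, b) <= d(a, c) + d(c, b) for a, b, c in product(ids, repeat=3))

print(identity, symmetry, triangle)
```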
+---
+- these 3 properties ensure that XOR
+	- captures all of the essential & important features of a "real" distance function
+	- is simple and cheap to calculate
+- each search iteration comes one bit closer to the target
+	- in a basic Kademlia network with `2^n` nodes, a lookup takes at most `n` steps (worst case) to find a node
+
+---
+
+### Routing tables
+- each node has a routing table that consists of one `list` for each bit of the `node ID`
+	- each entry holds the necessary data to locate another node
+		- IP address, port, `node ID`, etc
+	- each entry corresponds to a specific distance from the node
+		- for example, a node in the Nth `list` must differ from this node's `node ID` in the Nth bit (the first N-1 bits match)
+		- so, with a 128-bit `node ID`, the lists classify the other nodes in the network into 128 distances
+---
+- as nodes are encountered on the network, they are added to the `lists`
+	- this happens during store and retrieval operations
+	- and even when helping other nodes to find a key
+	- every node encountered will be considered for inclusion in the lists
+	- this keeps the network view constantly updated
+		- adding resilience to failures and attacks
+---
+- `k-buckets`
+	- `k` is a system wide number
+	- every `k-bucket` is a `list` having up to `k` entries inside
+	- example:
+		- network with `k=20`
+		- each node will have `lists` containing up to 20 nodes for a particular bit
+- the number of possible nodes for each `k-bucket` decreases quickly as the distance shrinks
+	- as there will be very few nodes that are that close
+- since the quantity of possible IDs is much larger than any node population, some of the `k-buckets` corresponding to very short distances will remain empty
+---
+- example:
+![k-buckets](https://upload.wikimedia.org/wikipedia/commons/6/63/Dht_example_SVG.svg "k-buckets")
+	- network size: 2^3
+	- max nodes: 8, current nodes: 7
+	- let's take node 6 (110) (black leaf)
+	- 3 `k-buckets` for each node in the network (gray circles)
+		- nodes 0, 1, 2 (000, 001, 010) are in the farthest `k-bucket`
+		- node 3 (011) is not participating in the network
+		- middle `k-bucket` contains the nodes 4 and 5 (100, 101)
+		- last `k-bucket` can only contain node 7 (111)
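The grouping in this example can be reproduced with a short sketch. It assumes 3-bit integer IDs and numbers the buckets from 0 (farthest) to 2 (closest); `bucket_index` is a made-up helper name.

```python
BITS = 3
me = 0b110
# nodes currently in the network; node 3 (011) is not participating
present = [0b000, 0b001, 0b010, 0b100, 0b101, 0b110, 0b111]


def bucket_index(own_id: int, other_id: int) -> int:
    """Index of the first differing bit, counted from the most significant bit."""
    distance = own_id ^ other_id
    return BITS - distance.bit_length()


buckets = {i: [] for i in range(BITS)}
for node in present:
    if node != me:  # a node does not store itself
        buckets[bucket_index(me, node)].append(node)

print(buckets)
```

Bucket 0 ends up with nodes 0, 1 and 2, bucket 1 with nodes 4 and 5, and bucket 2 with node 7 only.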
+
+---
+
+- Each node knows its neighbourhood well and has contact with a few nodes far away which can help locate other nodes far away.
+- Kademlia prioritizes keeping long-connected nodes in the `k-buckets`
+	- as the nodes that have been connected for a long time in a network will probably remain connected for a long time in the future
+---
+- when a `k-bucket` is full and a new node is discovered for that `k-bucket`
+	- the node sends a ping to the least recently seen node in the `k-bucket`
+	- if that node is still alive, the new node is stored in a secondary list (a replacement cache)
+		- replacement cache is used if a node in the `k-bucket` stops responding
+	- basically, new nodes are used only when older nodes disappear
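The update rule above could be sketched as follows. This is not a reference implementation: the class name `KBucket`, the `ping` callable and the choice to admit the newcomer directly when the old node is dead are assumptions made for illustration.

```python
from collections import deque


class KBucket:
    def __init__(self, k, ping):
        self.k = k
        self.ping = ping             # hypothetical callable: node -> True if alive
        self.nodes = deque()         # least recently seen node on the left
        self.replacements = deque()  # replacement cache

    def seen(self, node):
        """Consider a node we just heard from for inclusion in the bucket."""
        if node in self.nodes:
            self.nodes.remove(node)
            self.nodes.append(node)  # now the most recently seen
        elif len(self.nodes) < self.k:
            self.nodes.append(node)
        elif self.ping(self.nodes[0]):
            # the old node is alive: keep it and park the newcomer
            self.nodes.rotate(-1)    # old node becomes most recently seen
            if node not in self.replacements:
                self.replacements.append(node)
        else:
            # the old node stopped responding: evict it (assumed policy)
            self.nodes.popleft()
            self.nodes.append(node)


alive = {"a": True, "b": True}
bucket = KBucket(k=2, ping=lambda n: alive[n])
for name in ("a", "b", "c"):
    bucket.seen(name)
after_full = (list(bucket.nodes), list(bucket.replacements))

alive["b"] = False
bucket.seen("d")
after_failure = list(bucket.nodes)
print(after_full, after_failure)
```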
+
+---
+
+### Protocol messages
+- PING
+- STORE
+- FIND_NODE
+- FIND_VALUE
+
+Each `rpc` message includes a random value from the initiator, to ensure that the response corresponds to the request
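A rough sketch of how the random value can tie a response to its request. The `Rpc` dataclass, its field names and the 128-bit size of the random value are assumptions for illustration, not part of any particular implementation.

```python
import secrets
from dataclasses import dataclass, field


@dataclass
class Rpc:
    kind: str       # "PING", "STORE", "FIND_NODE" or "FIND_VALUE"
    sender_id: int
    # random value chosen by the initiator (size is an assumption)
    rpc_id: int = field(default_factory=lambda: secrets.randbits(128))


def matches(request: Rpc, response: Rpc) -> bool:
    """A response is accepted only if it echoes the request's random value."""
    return response.rpc_id == request.rpc_id


request = Rpc("PING", sender_id=0b110)
reply = Rpc("PING", sender_id=0b111, rpc_id=request.rpc_id)  # echoes the value
unrelated = Rpc("PING", sender_id=0b111)                     # fresh random value
print(matches(request, reply), matches(request, unrelated))
```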
+
+---
+
+### Locating nodes
+- node lookups can proceed asynchronously
+	- `α` denotes the number of simultaneous lookups
+	- `α` is typically 3
+- the node sends a FIND_NODE request to the `α` nodes in its own `k-buckets` that are closest to the desired key
+---
+- when the recipient nodes receive the request, they will look in their `k-buckets` and return the `k` closest nodes to the desired key that they know
+- the requester will update a results list with the results (`node ID`s) that it receives
+	- keeping the `k` best ones (the `k` nodes that are closest to the searched key)
+- the requester node will select these `k` best results and issue the request to them
+- the process is repeated again and again, getting closer to the searched key
+---
+- iterations continue until no nodes are returned that are closer than the best previous results
+	- when iterations stop, the best `k` nodes in the results list are the ones in the whole network that are the closest to the desired key
+- node information can be augmented with RTT (round trip times)
+	- the RTT is used to choose a time-out for each queried node; when a query times out, another query can be initiated
+	- the number of queries in flight is always <= `α` (quantity of simultaneous lookups)
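The iterative lookup can be simulated in memory. This is a simplified sketch under several assumptions: the network is a plain dict from `node ID` to known contacts, the `α` queries of a round run one after another instead of in parallel, time-outs are ignored, and the loop stops once every node in the results list has been queried. All names (`K`, `ALPHA`, `find_node`, `lookup`) are made up for illustration.

```python
K = 3      # size of the results list (assumed small for the demo)
ALPHA = 2  # queries issued per round


def closest(ids, target, count):
    """The `count` IDs with the smallest XOR distance to the target."""
    return sorted(ids, key=lambda n: n ^ target)[:count]


def find_node(network, recipient, target):
    """The recipient's answer: the K closest contacts it knows."""
    return closest(network[recipient], target, K)


def lookup(network, start, target):
    results = closest(network[start], target, K)
    queried = {start}
    while True:
        pending = [n for n in results if n not in queried][:ALPHA]
        if not pending:  # every one of the K best has been asked
            return results
        for node in pending:
            queried.add(node)
            learned = set(find_node(network, node, target)) - {start}
            results = closest(set(results) | learned, target, K)


# toy 4-bit network: each node knows the 4 nodes that differ from it in one bit
network = {n: {n ^ (1 << i) for i in range(4)} for n in range(16)}
found = lookup(network, start=0b0000, target=0b1101)
print(found)
```

In this toy network the lookup from node 0 ends with nodes 13, 12 and 15, which are the three closest IDs to the target 13.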
+
+---
+
+### Locating resources
+- data (values) are located by mapping them to a key
+	- typically a hash is used for the mapping
+- locating data follows the same procedure as locating the closest nodes to a key
+	- except the search terminates when a node has the requested value in its store and returns this value
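A minimal sketch of the value-to-key mapping. The hash function (SHA-256 truncated to 128 bits), the constant `ID_BITS` and the names `key_for` and `storers` are assumptions; the slides only say that a hash is typically used.

```python
import hashlib

ID_BITS = 128  # assumed to match the size of the node IDs


def key_for(value: bytes) -> int:
    """Map a value to a key in the same ID space as the nodes."""
    digest = hashlib.sha256(value).digest()
    return int.from_bytes(digest[: ID_BITS // 8], "big")


def storers(node_ids, key, k):
    """The k nodes whose IDs are closest (by XOR) to the key."""
    return sorted(node_ids, key=lambda n: n ^ key)[:k]


key = key_for(b"hello kademlia")
nodes = [key_for(f"node-{i}".encode()) for i in range(10)]  # made-up node IDs
print(hex(key), len(storers(nodes, key, k=3)))
```

Because keys and `node ID`s share one ID space, the same XOR distance decides which nodes are responsible for a value.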
+
+---
+
+#### Data replicating & caching
+- values are stored at several nodes (k of them)
+- the node that stores a value
+	- periodically explores the network to find the k nodes closest to the key
+	- to replicate the value onto them
+	- this compensates for nodes that have disappeared
+---
+- avoiding "hot spots"
+	- for popular values (which might receive many requests)
+	- nearby nodes outside the k closest ones also store the value
+		- this additional storage is called a `cache`
+		- caching nodes will drop the value after a certain time
+			- depending on their distance from the key
+	- in this way the value is stored farther away from the key
+		- depending on the quantity of requests
+	- allows popular searches to find a storer more quickly
+	- alleviates possible "hot spots"
+- not all implementations of Kademlia have these functionalities (replicating & caching)
+	- in order to remove old information quickly from the system
+
+---
+
+### Joining the network
+- to join the net, a node must first go through a `bootstrap` process
+- `bootstrap` process
+	- the joining node needs to know the IP address & port of another node (bootstrap node)
+	- it computes a random unique `node ID` number
+	- it inserts the bootstrap node into one of its k-buckets
+---
+- `bootstrap` process [...]
+	- perform a node lookup of its own `node ID` against the bootstrap node
+		- this populates other nodes' `k-buckets` with the new `node ID`
+		- and populates the joining node's `k-buckets` with the nodes in the path between that node and the bootstrap node
+	- refresh all `k-buckets` further away than the `k-bucket` the bootstrap node falls in
+		- this refresh is a lookup of a random key that is within that `k-bucket` range
+	- initially nodes have one `k-bucket`
+		- when it is full, it can be split