Changes

akwizgran · 8a91561f
--- a/Analysis-of-Human-Mobility-Datasets.md
+++ b/Analysis-of-Human-Mobility-Datasets.md
 [[_TOC_]]
-## Abstract
+# Abstract
 This report describes the Briar team's ongoing research into human mobility patterns and the implications of these patterns for the design of delay-tolerant networks.
 We analyse three published datasets that record encounters between people carrying short-range wireless devices in different enviroments: an urban university, an office building and a rural village.
@@ -16,13 +16,13 @@ For many sender-recipient pairs, delivery latencies are on the order of days or
 The consistency of these results across three different environments (university, office building and village) suggests that the findings are robust.
 These findings will inform the Briar team's strategy for enabling people to communicate securely without access to the Internet.
-## Research Goals
+# Research Goals
 The goal of this research is to investigate whether offline communication using delay-tolerant networks made up of smartphones or similar mobile devices is feasible.
 We analyse several human mobility datasets to find bounds on the performance that can be achieved by any network of this kind, given realistic mobility patterns.
-## Human Mobility Datasets
+# Human Mobility Datasets
 The datasets analysed in this research were all produced by similar methods: the researchers asked a group of participants to carry or wear small electronic devices for several days as they went about their lives.
 The devices were capable of communicating wirelessly over a range of a few metres.
@@ -30,7 +30,7 @@ Each device would periodically detect the presence of other participants' device
 Detection events indicate the participants' mobility patterns only indirectly, but for our purposes this is fine: in designing delay-tolerant networks we're less interested in the precise details of how people move around than in the question of whether wireless devices carried by people can detect and communicate with nearby devices, and this is exactly what the datasets record.
-### Haggle Content Dataset
+## Haggle Content Dataset
 https://ieee-dataport.org/open-access/crawdad-cambridgehaggle-v-2009-05-29
@@ -74,7 +74,7 @@ All contacts with devices not involved in the experiment were ignored.
 This is the version of the dataset that we used for our analysis.
 Specifically, we used the file `haggle-one-cambridge-city-complete.tsv` from the archive `haggle-one-cambridge-city-complete.zip`, published at https://ieee-dataport.org/open-access/crawdad-uoihaggle.
-### SocioPatterns Office Dataset
+## SocioPatterns Office Dataset
 http://www.sociopatterns.org/datasets/contacts-in-a-workplace/
@@ -99,7 +99,7 @@ However, due to the much shorter scan interval used in this experiment, the diff
 Our analysis is based on the file `tij_InVS.dat` from the archive `tij_InVS.zip`, published by the SocioPatterns project at http://www.sociopatterns.org/datasets/contacts-in-a-workplace/.
-### SocioPatterns Village Dataset
+## SocioPatterns Village Dataset
 http://www.sociopatterns.org/datasets/contact-patterns-in-a-village-in-rural-malawi/
@@ -119,7 +119,7 @@ We translated the detection events into contacts in the same way as for the offi
 Our analysis is based on the file `tnet_malawi_pilot.csv`, published by the SocioPatterns project at http://www.sociopatterns.org/datasets/contact-patterns-in-a-village-in-rural-malawi/.
-### HYCCUPS 2012 Dataset
+## HYCCUPS 2012 Dataset
 https://ieee-dataport.org/open-access/crawdad-upbhyccups
@@ -150,7 +150,7 @@ We don't know how to reconcile these numbers.
 After trying to contact the researchers and receiving no reply, we reluctantly decided to exclude this dataset from our research, as we don't understand the data well enough to analyse it.
-## Directed and Undirected Contacts
+# Directed and Undirected Contacts
 The detection events recorded in the datasets are *directed*, meaning that an event in which device `A` detects device `B` is distinct from an event in which `B` detects `A`.
@@ -171,17 +171,15 @@ Therefore we use the less restrictive definition: there's an undirected contact
 The difference between these definitions is less significant for the SocioPatterns datasets, with their 20-second scan intervals, than for the Haggle dataset, where the scan interval of most devices is 10 minutes.
-## Initial Analysis of Contact Patterns
+# Initial Analysis of Contact Patterns
 Before moving on to analysing the emergent properties of the datasets we'll discuss some characteristics that are apparent in the raw data.
-### Contact Events per Hour
+## Contact Events per Hour
 Counting the number of contact events per hour reveals clear daily patterns in all the datasets.
 (Here we use "contact event" as shorthand for the detection event that begins a contact between two devices.)
---
 ![haggle-contact-frequency.png](uploads/img/haggle-contact-frequency.png)
 **Contact events per hour in the Haggle dataset.**
@@ -220,13 +218,11 @@ The office and village datasets were collected using the same experimental setup
 The difference may be due in part to the fact that contacts in the office dataset only occur during office hours (roughly a quarter of all hours in the week).
-### Number of Connected Devices
+## Number of Connected Devices
 We refer to a device as "connected" when it's involved in one or more undirected contacts.
 All three datasets show daily patterns in the number of connected devices, with similar patterns to the number of contact events per hour.
---
 ![haggle-density.png](uploads/img/haggle-density.png)
 **Number of connected devices for the Haggle dataset.**
@@ -252,15 +248,13 @@ The Haggle and office datasets have clear differences between weekdays and weeke
 Both of the SocioPatterns datasets show a lot of rapid variation: even within each daily peak, the number of connected devices frequently drops to zero for short periods, whereas in the Haggle dataset the peaks are smoother.
 This may be due in part to the greater detection range of the devices used in the Haggle experiment (5-10 metres versus 1.5 metres), and the fact that the devices used in the SocioPatterns experiments were designed to detect only face-to-face contacts.
-### Time Spent Connected
+## Time Spent Connected
 The three datasets differ considerably in the amount of time devices spend connected.
 The following figure shows the distribution for each dataset.
 The boxes show the interquartile range, while the whiskers show the maximum and minimum values.
---
 ![time-active.png](uploads/img/time-active.png)
 **Fraction of time spent connected, compared across datasets.**
@@ -274,7 +268,7 @@ The difference between the Haggle dataset and the others may be due in part to t
 Contacts in the office dataset only occur during office hours, which are roughly a quarter of all hours in the week.
 But even after accounting for this there's still a large difference between the office and village datasets, indicating that face-to-face contacts in the office environment are sparser than in the village environment.
-## Static Contact Graphs
+# Static Contact Graphs
 The following figures show the *static contact graphs* for the three datasets.
@@ -283,8 +277,6 @@ An edge connects two nodes if the corresponding devices were in contact at any t
 The thickness of the edge represents the total duration the devices spent in contact, in proportion to the maximum duration any two devices spent in contact.
 Contact durations are calculated using the definition of an undirected contact given above.
---
 ![haggle-contact-graph.png](uploads/img/haggle-contact-graph.png)
 **Static contact graph for the Haggle dataset.**
@@ -314,7 +306,7 @@ The village dataset's static contact graph is sparser than the others: most node
 Several cliques of three to five nodes are visible, perhaps representing households.
 A small component of two nodes is disconnected from the rest of the graph.
-### Isolated Nodes and Small Components
+## Isolated Nodes and Small Components
 The static contact graph for the Haggle dataset has two isolated nodes representing devices that didn't make contact with any other participating devices during the experiment, while the static contact graph for the village dataset has a small component representing two devices that made contact with each other but not with any other devices.
@@ -324,7 +316,7 @@ In the real world there are many people who are physically isolated from others,
 These people are less likely to be well-served by delay-tolerant networks than people who are more mobile.
 By including isolated nodes and small components in our analysis, rather than analysing only the main component of the graph, we aim to ensure that this aspect of reality is reflected in our results.
-### Measuring Static Reachability
+## Measuring Static Reachability
 One way to measure the connectivity of a dataset is to count how many devices in the dataset's static contact graph are reachable from a given source device by following the edges of the contact graph.
 Averaging over all possible source devices gives a metric that we call the *mean static reachability* of the dataset.
@@ -347,7 +339,7 @@ For example, given the mobility patterns recorded in the Haggle dataset, no dela
 Next we'll see how we can tighten this upper bound by including information about the order in which contacts happened.
-## Dynamic Contact Graphs
+# Dynamic Contact Graphs
 The static contact graph captures information about which devices made contact with each other during the experiment, but it doesn't include any information about the order in which contacts happened.
 To capture this information as well, we can construct a *dynamic contact graph* for each dataset.
@@ -384,21 +376,19 @@ A dynamic contact graph can be built from directed or undirected contacts, resul
 As discussed above, for the purposes of this research we're mainly interested in undirected contacts, because the delay-tolerant networks we're interested in building will use two-way communication to exchange messages.
 But it's possible to imagine delay-tolerant networks that use one-way broadcasts to propagate messages, and such networks can be modeled using dynamic contact graphs with directed contacts - that is, with directed edges within each layer.
-### Events With the Same Timestamp
+## Events With the Same Timestamp
 The Haggle and SocioPatterns datasets use timestamps with one-second precision and contain many events with the same timestamp.
 When building a dynamic contact graph, if two or more events have the same timestamp then we process them in the order in which they occurred in the original dataset.
 In particular, an event indicating the end of a contact is always processed after the event indicating the beginning of the same contact, even if they have the same timestamp.
 (This is common, because as discussed above, many contacts have recorded durations of zero, and so the events indicating the beginning and end of such a contact have the same timestamp.)
-### Example
+## Example
 The following figures show how to construct a dynamic contact graph for a simple dataset with three devices: `A`, `B` and `C`.
 The dataset starts at time `t1`.
 There's a contact between `A` and `B` beginning at `t2` and ending at `t3`, and later there's a contact between `B` and `C` beginning at `t4` and ending at `t5`.
---
 ![dynamic-contact-graph-1.png](uploads/img/dynamic-contact-graph-1.png)
 **The bottom layer of the contact graph has a node for each device.**
@@ -430,7 +420,7 @@ Thus, unlike a static contact graph, the dynamic contact graph captures the fact
 If we collapsed all the layers of the dynamic contact graph into a single layer then we'd recover the static contact graph.
-## Dynamic Reachability Analysis
+# Dynamic Reachability Analysis
 By building a dynamic contact graph for each dataset we can begin to study which devices are reachable from which others at which times, which tells us which devices would potentially be able to send messages to each other in a delay-tolerant network.
@@ -447,7 +437,7 @@ The upper bound is also useful as a way of assessing whether *any* delay-toleran
 If the upper bound for delivery performance is too low to be useful then we don't need to investigate the performance of specific designs.
 A negative result at this stage could save us a lot of effort designing and deploying real networks, if we can see that no network could possibly achieve the level of performance we need.
-### Measuring Dynamic Reachability
+## Measuring Dynamic Reachability
 Just as we measured static reachability in a static contact graph by counting how many devices are reachable from a given source device, in a dynamic contact graph we can measure *dynamic reachability* by counting how many devices are reachable *from a given source device, starting at a given time*.
 Averaging over all possible source devices gives a metric that we call the *mean dynamic reachability* at the given time.
@@ -461,8 +451,6 @@ The figures below show the mean dynamic reachability for each dataset as a funct
 The dashed line in each figure shows the mean static reachability, for comparison.
 Static reachability is always at least as high as dynamic reachability, as static reachability ignores the order in which contacts occur.
---
 ![haggle-reachability.png](uploads/img/haggle-reachability.png)
 **Mean dynamic reachability for the Haggle dataset.**
@@ -504,8 +492,6 @@ If the hypothesis of a boundary effect is wrong and the observed decline in reac
 On the other hand, if the hypothesis of a boundary effect is right then cutting off the end of the dataset will reduce the dynamic reachability at times before the cutoff time.
 By measuring how far back from the cutoff time the change in reachability extends, we'll be able to estimate how far back from the end of the full dataset the boundary effect extends.
---
 ![haggle-reachability-cutoff-time.png](uploads/img/haggle-reachability-cutoff-time.png)
 **Mean dynamic reachability for the Haggle dataset with different cutoff times.**
@@ -531,7 +517,7 @@ Simulations based on these datasets would not yield meaningful results: messages
 So we have a problem.
 How can we prevent the boundary effect from distorting our results?
-## Looping the Data
+# Looping the Data
 In an ideal world we'd like to go back and repeat the experiments over much longer durations, capturing several week of data.
 We would measure the extent of the boundary effect by truncating the data, as we did above, and then run the reachability analysis for all start times up to the beginning of the boundary effect.
@@ -548,8 +534,6 @@ It's easy to find suitable start and end times for the office dataset, which has
 The following figures show the part of each dataset that we selected for looping.
---
 ![haggle-loop-points.png](uploads/img/haggle-loop-points.png)
 **Data selected to be looped for the Haggle dataset.**
@@ -568,15 +552,13 @@ The following figures show the part of each dataset that we selected for looping
 ---
-## Static Contact Graphs for Looped Data
+# Static Contact Graphs for Looped Data
 The static contact graphs for the looped data are somewhat sparser than the graphs for the full datasets.
 Recall that in a static contact graph, each node represents a device and an edge connects two nodes if the corresponding devices were in contact at any time.
 The thickness of the edge represents the total duration the devices spent in contact, in proportion to the maximum duration any two devices spent in contact.
---
 ![haggle-contact-graph-looped.png](uploads/img/haggle-contact-graph-looped.png)
 **Static contact graph for the looped Haggle dataset.**
@@ -613,12 +595,10 @@ Due to these isolated nodes and small components, static reachability is lower f
 | Office  |                          100.0% |                             95.7% |
 | Village |                           95.4% |                             82.2% |
-## Dynamic Reachability Analysis for Looped Data
+# Dynamic Reachability Analysis for Looped Data
 Repeating the dynamic reachability analysis with five weeks of looped data confirms that the decline in reachability seen with the original data is due to a boundary effect.
---
 ![haggle-loop-5-reachability.png](uploads/img/haggle-loop-5-reachability.png)
 **Mean dynamic reachability for the looped Haggle dataset.**
@@ -645,14 +625,12 @@ For the office dataset, the extent of the boundary effect is 14 days, while for
 This leaves us with at least two weeks of "clean" data at the start of each looped dataset.
 The rest of the analysis of the looped data will focus on these two weeks of "clean" data.
-### Time-to-Live
+## Time-to-Live
 If our understanding of the boundary effect caused by the end of the dataset is right then some paths take many days to reach their destinations.
 We can explore this further by imposing a *time-to-live*: paths that take longer than this to reach their destinations are ignored when measuring reachability.
 This lets us see how many source-destination pairs can reach each other within a given amount of time.
---
 ![haggle-loop-5-reachability-timeout-short.png](uploads/img/haggle-loop-5-reachability-timeout-short.png)
 **Mean dynamic reachability for the looped Haggle dataset, with various time-to-live values.**
@@ -696,7 +674,7 @@ In principle it's possible for paths to depend on longer chains of meetings, and
 We can see from the figures above that this is not just a theoretical concern: for all of the datasets but especially the village dataset, a significant fraction of paths don't reach their destinations within one week.
-### Path Latency
+## Path Latency
 The figures in the previous section can be read in two ways: for a given time-to-live, they can tell us the fraction of paths that reach their destinations; or for a given probability of reaching the destination, they can tell us the time-to-live required to achieve that probability.
 In other words, the figures show the distribution of path latencies.
@@ -706,7 +684,7 @@ The distribution changes throughout the week due to weekly patterns in the parti
 It's clear that for all of the datasets, these latencies are much longer than the latencies that users of Internet-based communication tools are used to.
-### Maximum Path Latency
+## Maximum Path Latency
 For each dataset, we can find the minimum time-to-live that achieves the same level of reachability as an unlimited time-to-live.
 This tells us the *maximum path latency* of the dataset.
@@ -729,7 +707,7 @@ Real-world performance may therefore *decrease* when the time-to-live is increas
 Fully understanding these performance tradeoffs requires a more realistic simulation of a specific delay-tolerant network design, including its rules for deciding which messages to store and forward when storage and bandwidth are limited.
-## Summary of Findings
+# Summary of Findings
 * All of the datasets show daily patterns in the frequency of contact events and the number of connected devices. The Haggle and office datasets also show weekly patterns.
 * The datasets differ greatly in the amount of time devices spend connected.
@@ -740,7 +718,7 @@ Fully understanding these performance tradeoffs requires a more realistic simula
 * The static contact graphs for the looped data are qualitatively similar to those for the original data, but with a higher incidence of isolated nodes and small components. Thus static reachability for the office and village datasets is lower for the looped data than for the original data.
 * Path latencies are high for all datasets. For the Haggle dataset, 44-79% of paths take more than a day to reach their destinations (depending on the time of week), while the slowest paths take 14 days to reach their destinations. The other datasets have even higher latencies.
-## Conclusions and Future Work
+# Conclusions and Future Work
 We extracted static and dynamic contact graphs from three datasets and analysed the graphs to obtain some bounds on the performance achievable by any delay-tolerant network, given the mobility patterns recorded in the datasets.