Changes

akwizgran · 39c3cfcf
--- a/Analysis-of-Human-Mobility-Datasets.md
+++ b/Analysis-of-Human-Mobility-Datasets.md
@@ -3,12 +3,12 @@
 ## Abstract

 This report describes the Briar team's ongoing research into human mobility patterns and their implications for the design of delay-tolerant networks.
-We analysed three published datasets that record encounters between people carrying short-range wireless devices in different enviroments: an urban university, an office building and a rural village.
+We analyse three published datasets that record encounters between people carrying short-range wireless devices in different enviroments: an urban university, an office building and a rural village.

-We constructed a *transitive contact graph* for each dataset and analysed these graphs to understand the emergent structures created by the individual encounters.
-By analysing the transitive contact graphs we were able to obtain some bounds on the performance achievable by any delay-tolerant network when operating under the conditions recorded in the datasets.
+We construct *static and dynamic contact graphs* for each dataset and analyse these graphs to understand the emergent structures created by the individual encounters.
+By analysing the contact graphs we can obtain some bounds on the performance achievable by any delay-tolerant network, given the mobility patterns recorded in the datasets.

-This analysis produced two interesting findings.
+This analysis produces two interesting findings.
 First, the upper bound for *delivery probability* is high: under the conditions recorded in the datasets, a delay-tolerant network can, in principle, deliver a message from any sender to any recipient with high probability.
 Second, the lower bound for *delivery latency* is also high: under the conditions recorded in the datasets, a delay-tolerant network cannot, even in principle, deliver messages quickly between all potential senders and recipients.
 For some sender-recipient pairs, delivery latencies were on the order of days or weeks, rather than the seconds or minutes that users of internet-based communication tools are accustomed to.
@@ -18,7 +18,7 @@ These findings will inform the Briar team's strategy for enabling people to comm

 ## Research Goals

-The goal of this research is to investigate whether offline communication using a delay-tolerant network made up of smartphones or similar mobile devices is feasible.
+The goal of this research is to investigate whether offline communication using delay-tolerant networks made up of smartphones or similar mobile devices is feasible.

 We analyse several human mobility datasets to find bounds on the performance that can be achieved by any network of this kind, given realistic mobility patterns.

@@ -172,42 +172,43 @@ Therefore we use the less restrictive definition: there's an undirected contact

 Before moving on to analysing the emergent properties of the datasets we'll discuss some characteristics that are apparent in the raw data.

-### Contact Graphs
+### Static Contact Graphs

-The following figures show the contact graphs for the three datasets.
-The nodes represent devices.
-An edge connects two nodes if the corresponding devices were in contact at any time during the experiment.
-The thickness of the edge represents the total duration the devices spent in contact during the experiment, in proportion to the maximum duration any two devices spent in contact.
+The following figures show the *static contact graphs* for the three datasets.
+
+Each node in a static contact graph represents a device.
+An edge connects two nodes if the corresponding devices were in contact at any time.
+The thickness of the edge represents the total duration the devices spent in contact, in proportion to the maximum duration any two devices spent in contact.
 Contact durations are calculated using the definition of an undirected contact given above.

 ---

 ![haggle-contact-graph.png](uploads/img/haggle-contact-graph.png)

-**Contact graph for the Haggle dataset.**
+**Static contact graph for the Haggle dataset.**

 ---

 ![office-contact-graph.png](uploads/img/office-contact-graph.png)

-**Contact graph for the SocioPatterns office dataset.**
+**Static contact graph for the SocioPatterns office dataset.**

 ---

 ![malawi-contact-graph.png](uploads/img/malawi-contact-graph.png)

-**Contact graph for the SocioPatterns village dataset.**
+**Static contact graph for the SocioPatterns village dataset.**

 ---

-The Haggle contact graph has a densely connected core and a sparsely connected periphery.
+The Haggle dataset's static contact graph has a densely connected core and a sparsely connected periphery.
 Within the core are two clusters of nodes connected by thick edges, representing devices that spent a lot of time in contact with each other.
 The periphery contains about a dozen devices that each made contact with one or two devices from the core.
-There are two isolated nodes representing the two devices that didn't make contact with any other devices.
+There are two isolated nodes, which represent the two devices that didn't make contact with any other participating devices during the experiment.

-The office contact graph is more uniform: there's no clear core or periphery, no obvious clusters, and most of the edges have similar thickness.
+The office datatet's static contact graph is more uniform: there's no clear core or periphery, no obvious clusters, and most of the edges have similar thickness.

-The village contact graph is sparser than the others: most nodes have fewer edges than in the other graphs.
+The village dataset's static contact graph is sparser than the others: most nodes have fewer edges than in the other graphs.
 Several cliques of three to five nodes are visible, perhaps representing households.
 A small component of two nodes is disconnected from the rest of the graph.

@@ -291,6 +292,9 @@ This may be due in part to the greater detection range of the devices used in th

 The three datasets differ considerably in the amount of time devices spend connected.

+The following figure shows the distribution for each dataset.
+The boxes show the interquartile range, while the whiskers show the maximum and minimum values.
+
 ---

 ![time-active.png](uploads/img/time-active.png)
@@ -301,27 +305,48 @@ The three datasets differ considerably in the amount of time devices spend conne

 The median device in the Haggle dataset spends 9.8% of its time connected, whereas the median is just 0.2% for the office dataset and 2.7% for the village dataset.

-The difference between the Haggle dataset and the others may be due in part to the greater detection range of the devices used in the Haggle experiment and the fact that the devices used in the SocioPatterns experiments detected only face-to-face contacts.
+The difference between the Haggle dataset and the others may be due in part to the greater detection range of the devices used in the Haggle experiment, and the fact that the devices used in the SocioPatterns experiments detected only face-to-face contacts.

-Contacts in the office dataset only occur during office hours, which are roughly a quarter of all hours in the week, but even after accounting for this there's still a large difference between the office and village datasets, indicating that face-to-face contacts in the office environment are sparser than in the village environment, even during office hours.
+Contacts in the office dataset only occur during office hours, which are roughly a quarter of all hours in the week.
+But even after accounting for this there's still a large difference between the office and village datasets, indicating that face-to-face contacts in the office environment are sparser than in the village environment.

 ### Isolated Nodes and Small Components

-Before continuing with our analysis we need to decide how to handle isolated nodes and small components in the contact graphs.
-
-The contact graph for the Haggle dataset has two isolated nodes representing devices that didn't make contact with any other participating devices during the experiment, while the contact graph for the village dataset has a small component representing two devices that made contact with each other but not with any other devices.
+The static contact graph for the Haggle dataset has two isolated nodes representing devices that didn't make contact with any other participating devices during the experiment, while the static contact graph for the village dataset has a small component representing two devices that made contact with each other but not with any other devices.

-Clearly devices such as these will cause difficulties for delay-tolerant communication, as they can't reach or be reached by any of the devices in the main component.
-If we excluded them from our analysis then we might reach unrealistically positive conclusions about the feasibility of delay-tolerant networks.
+Clearly, devices such as these will cause difficulties for delay-tolerant communication, as they can't reach or be reached by any of the devices in the graph's main component.

 In the real world there are many people who are physically isolated from others, or from those outside their households: for example, people who are unable to leave their homes due to illness or disability.
 These people are less likely to be well-served by delay-tolerant networks than people who are more mobile.
-By including isolated nodes and small components in our analysis, we aim to ensure that this aspect of reality is reflected in our results.
+By including isolated nodes and small components in our analysis, rather than analysing only the main component of the graph, we aim to ensure that this aspect of reality is reflected in our results.
+
+### Static Reachability
+
+| Dataset | Mean static reachability |
+|---------|--------------------------|
+| Haggle  |                    92.7% |
+| Office  |                   100.0% |
+| Village |                    95.4% |
+
+We can measure *static reachability* by counting how many devices in the static contact graph are reachable from a given source device.
+Averaging over all possible source devices gives the *mean static reachability*.
+This is the fraction of pairs of devices that belong to the same component as each other in the static contact graph.
+
+The mean static reachability of the office dataset is 100% because all devices belong to the same component.
+
+The mean static reachability of the Haggle dataset is below 100% because the devices represented by the two isolated nodes in the static contact graph can't reach (or be reached by) any other devices.
+
+Similarly, the mean static reachability of the village dataset is below 100% because the two devices that belong to the small component of the static contact graph can only reach (and be reached by) each other.
+
+The mean static reachability of a dataset gives us an upper bound on the delivery performance that we can expect from any delay-tolerant network, given the mobility patterns recorded in the dataset.
+For example, no delay-tolerant network could deliver messages between more than 92.7% of sender-recipient pairs, given the mobility patterns recorded in the Haggle dataset, because two of the potential sender/recipient devices never made contact with other devices.

-## Transitive Contact Graphs
+Next we'll see that we can tighten this upper bound by including information about the order in which contacts happened.

-In order to study the datasets at a higher level of abstraction, we can extract a *transitive contact graph* from each dataset.
-This graph represents the whole series of events recorded in the dataset and allows us to study the emergent properties of those events.
+## Dynamic Contact Graphs
+
+The static contact graph captures information about which devices made contact with each other during the experiment, but it doesn't include any information about the order in which contacts happened.
+To capture this information as well, we can construct a *dynamic contact graph* for each dataset.

 To do this, we create a layered graph in which each layer represents the contacts between devices at a given instant.
 We start by creating the bottom layer, which simply contains a node representing each device.
@@ -342,29 +367,30 @@ Thus, for example, if a contact occurs between devices `A` and `B`, and then aft
 `A` could transfer a message to `B` during their contact, and `B` could then transfer this message to `C` during their later contact, but the reverse would not be possible, as the contact between `A` and `B` ends before the contact between `B` and `C` begins.
 This asymmetry is an important property of delay-tolerant networks, and the directed edges between layers allow our model to capture it.

-We can use the transitive contact graph to study the emergent properties of the mobility patterns recorded in the datasets.
+We can use the dynamic contact graph to study the emergent properties of the mobility patterns recorded in the datasets.
 For example, if we start from a given device at a given time, which devices are transitively reachable, and how long does it take to reach them?
 How do these properties vary with the choice of starting device or starting time?

-A transitive contact graph can be built from directed or undirected contacts.
+A dynamic contact graph can be built from directed or undirected contacts.
 As discussed above, we're mainly interested in undirected contacts because the delay-tolerant networks we're interested in building will use two-way communication to exchange messages.
-But it's possible to imagine delay-tolerant networks that use one-way broadcasts to propagate messages, and such networks can be modeled using transitive contact graphs with directed contacts - that is, with directed edges within each layer.
+But it's possible to imagine delay-tolerant networks that use one-way broadcasts to propagate messages, and such networks can be modeled using dynamic contact graphs with directed contacts - that is, with directed edges within each layer.

 ### Events With the Same Timestamp

 The Haggle and SocioPatterns datasets use timestamps with one-second precision and contain many events with the same timestamp.
-When building a transitive contact graph, if two or more events have the same timestamp then we process them in the order in which they occurred in the original dataset.
+When building a dynamic contact graph, if two or more events have the same timestamp then we process them in the order in which they occurred in the original dataset.
 In particular, an event indicating the end of a contact is always processed after the event indicating the beginning of the same contact, even if they have the same timestamp.
 (This is common, because as discussed above, many contacts have recorded durations of zero, and so the events indicating the beginning and end of such a contact have the same timestamp.)

-## Initial Reachability Analysis
+## Dynamic Reachability Analysis

-By building a transitive contact graph for each dataset we can begin to study which devices are reachable from which others, which tells us which devices would potentially be able to send messages to each other in a delay-tolerant network.
+By building a dynamic contact graph for each dataset we can begin to study which devices are reachable from which others at which times, which tells us which devices would potentially be able to send messages to each other in a delay-tolerant network.

-It's important to bear in mind that a real delay-tolerant network might not be able to deliver all messages that are theoretically deliverable according to the transitive contact graph.
+It's important to bear in mind that a real delay-tolerant network might not be able to deliver all messages that are theoretically deliverable according to the dynamic contact graph.
 A real network will have to deal with issues such as limited storage space on devices and limited communication bandwidth between devices, which may prevent some messages from being delivered.
-On the other hand, if a message is *not* deliverable according to the transitive contact graph then no real network could deliver that message - there is simply no path by which the message could reach its destination.
+On the other hand, if a message is *not* deliverable according to the dynamic contact graph then no real network could deliver that message - there is simply no path by which the message could reach its destination.
 Thus the reachability analysis provides an upper bound for the delivery performance that any real network could achieve.
+This upper bound is tighter than the one provided by measuring static reachability, and it allows us to study variations in reachability over time.

 This information is useful alongside more realistic simulations of specific network designs, as it tells us how close a simulated network is coming to the upper bound.
 If a simulated network does not manage to deliver all messages then the upper bound provided by the reachability analysis can tell us to what extent this is due to the inherent properties of the mobility data, and to what extent it's due to the specific network design being simulated.
@@ -373,82 +399,97 @@ The upper bound is also useful as a way of assessing whether any delay-tolerant
 If the upper bound for delivery performance is too low to be useful then we don't need to investigate the performance of specific designs.
 A negative result at this stage could save us a lot of effort designing and deploying real networks, if we can see that no network could possibly achieve the level of performance we need.

-### Mean Reachability
+### Dynamic Reachability
+
+Just as we measured static reachability in a static contact graph by counting how many devices are reachable from a given source device, in a dynamic contact graph we can measure *dynamic reachability* by counting how many devices are reachable from a given source device, *starting at a given time*.
+Averaging over all possible source devices gives the *mean dynamic reachability* at the given time.
+
+This is the fraction of source-destination pairs such that a delay-tolerant path exists from the source to the destination starting at the given time, and thus a message sent by the source at the given time could, in principle, eventually reach the destination.

-A basic way to measure reachability is to count how many devices are reachable, starting from a given source device at a given time. By averaging over all possible source devices, we obtain the *mean reachability* at the given time. This is the fraction of source-destination pairs such that delay-tolerant path exists from the source to the destination starting at the given time, and thus a message sent by the source at the given time could theoretically reach the destination.
+The figures below show the mean dynamic reachability for each dataset, as a function of the start time.
+
+The dashed line in each figure shows the mean static reachability, for comparison.
+Static reachability is always at least as high as dynamic reachability, as it ignores the order in which contacts occur.

 ---

 ![haggle-reachability.png](uploads/img/haggle-reachability.png)

-**Mean reachability for the Haggle dataset.**
+**Mean dynamic reachability for the Haggle dataset.**

 ---

 ![office-reachability.png](uploads/img/office-reachability.png)

-**Mean reachability for the SocioPatterns office dataset.**
+**Mean dynamic reachability for the SocioPatterns office dataset.**

 ---

 ![malawi-reachability.png](uploads/img/malawi-reachability.png)

-**Mean reachability for the SocioPatterns village dataset.**
+**Mean dynamic reachability for the SocioPatterns village dataset.**

 ---

-It's immediately apparent that mean reachability declines with time for all the datasets.
-At the beginning of the experiments, mean reachability is above 90% for every dataset.
-By the end of the experiments, mean reachability is zero for every dataset.
+It's immediately apparent that dynamic reachability declines with time for all the datasets.
+(Static reachability doesn't vary with time, by definition.)
+At the beginning of the experiments, the mean dynamic reachability is above 90% for every dataset.
+By the end of the experiments, the mean dynamic reachability is zero for all datasets.

-Given that we've already seen above that contacts continue to happen for the duration of the experiments (subject to daily and weekly patterns), it seems unlikely that this decline in reachability is due to a change in the participants' mobility patterns.
+We've already seen above that contacts continue to happen at similar rates throughout the experiments (subject to daily and weekly patterns), so it seems unlikely that this decline in reachability is due to a systematic change in the participants' mobility patterns.

 Instead, we hypothesise that the decline is due to a *boundary effect* created by the end of the experiment.
-The mean reachability metric counts all paths through the transitive contact graph that start at a given time - that is, at a given layer of the graph - and reach their destinations by the end of the experiment.
-Paths that start earlier in the experiment can access more layers of the transitive contact graph than paths that start later.

-If, counterfactually, the experiment had ended earlier than it did, then the upper layers of the transitive contact graph would not have existed, and any paths passing through those layers would not have reached their destinations.
-Thus we would expect the mean reachability to have been lower than what was measured in reality, especially for times closer to the counterfactually earlier end of the experiment.
+The mean dynamic reachability metric counts all paths through the dynamic contact graph that start at a given time (ie, at a given layer of the graph) and reach their destinations by the end of the experiment (ie, by the top layer of the graph).
+Paths that start earlier in the experiment can make use of more layers of the dynamic contact graph than paths that start later.
+
+If, counterfactually, the experiment had ended earlier than it did, then the upper layers of the dynamic contact graph would not have existed, and any paths passing through those layers would not have reached their destinations.
+Thus we would expect the dynamic reachability to have been lower than what was measured in reality.
 The experiment's earlier end would have caused the boundary effect to occur earlier.

-If our hypothesis is wrong and the observed decline in reachability is *not* due to a boundary effect, then ending the experiment earlier would have had no effect on the mean reachability.
+We can test the hypothesis of a boundary effect by cutting off the end of the dataset, effectively ending the experiment earlier, and comparing the results with the full dataset.

-Thus we can test the hypothesis about a boundary effect by cutting off the end of the dataset, effectively ending the experiment earlier, and comparing the results with the full dataset.
+If the hypothesis of a boundary effect is right then cutting off the end of the dataset will reduce reachability at times before the cutoff time.
+By measuring how far back from the cutoff time the reduction in reachability extends, we'll be able to estimate how far back from the end of the full dataset the boundary effect extends.
+
+If the hypothesis is wrong and the observed decline in reachability is *not* due to a boundary effect, then cutting off the end of the dataset will have no effect on reachability at times before the cutoff time.

 ---

 ![haggle-reachability-cutoff-time.png](uploads/img/haggle-reachability-cutoff-time.png)

-**Mean reachability for the Haggle dataset with different cutoff times.**
+**Mean dynamic reachability for the Haggle dataset with different cutoff times.**

 ---

 The results support the hypothesis of a boundary effect.
-Cutting off the end of the dataset reduces the mean reachability compared with the full dataset, and later times (closer to the cutoff time) are affected more than earlier times (further from the cutoff time).
+Cutting off the end of the dataset reduces the dynamic reachability compared with the full dataset, and later times (closer to the cutoff time) are affected more than earlier times (further from the cutoff time).

-For some cutoff times, reachability drops to zero before the cutoff time, while for others it remains above zero, indicating that some devices are transitively connected to each other when the cutoff time is reached (ie in the highest remaining layer of the transitive contact graph).
+For some cutoff times, reachability drops to zero before the cutoff time, while for others it remains above zero, indicating that some devices are transitively connected to each other when the cutoff time is reached (ie, in the highest remaining layer of the dynamic contact graph).

-By looking at the left edge of the figure above, we can see that the mean reachability at the start of the experiment is affected by the cutoff time.
-Even a cutoff time of 11 days slightly affects the mean reachability at the beginning of the experiment, indicating that some of the paths that started at the beginning of the experiment took more than 11 days to reach their destinations, and were thus removed when the cutoff time was set to 11 days.
+By looking at the left edge of the figure above, we can see that the dynamic reachability at the start of the experiment is affected by changing the cutoff time.
+Even a cutoff time of 11 days slightly affects the dynamic reachability at the beginning of the experiment, indicating that some of the paths that started at the beginning of the experiment took more than 11 days to reach their destinations, and were thus removed when the cutoff time was set to 11 days.

 This leads to the troubling conclusion that the boundary effect extends at least 11 days back from the end of the 12-day dataset, and thus we have little or no data that is not affected by the boundary.

+Results for the other datasets are similar.
+
 It isn't meaningful to calculate average measurements across the whole dataset, or to compare one day to another, if we know that the results are distorted by this boundary effect.
 So we have a problem.
 How can we prevent the boundary effect from distorting our results?

 ## Looping the Data

-In an ideal world we'd like to repeat the experiments over much longer durations, capturing several weeks' worth of data.
-We would measure the extent of the boundary effect by truncating the data, as we did above, and then we would run the reachability analysis on the full dataset, discarding the results near the end of the experiment that fell within the boundary effect.
-We would need to ensure that each experiment lasted long enough to leave at least one week's worth of "clean" results to analyse after discarding the results affected by the boundary.
+In an ideal world we'd like to go back and repeat the experiments over much longer durations, capturing several weeks' worth of data.
+We would measure the extent of the boundary effect by truncating the data, as we did above, and then run the reachability analysis for all start times up to the beginning of the boundary effect.
+We would need to ensure that each experiment lasted long enough to leave at least one week's worth of "clean" results unaffected by the boundary.
 This would let us average out the daily and weekly patterns in the clean results, and compare one day to another, without any distortions created by the boundary effect.

 Of course we can't actually repeat the experiments, but we can try to approximate the results of a longer experiment by selecting a week's worth of data from each dataset and looping it - in other words, processing the detection events as though the experiment had lasted for multiple weeks and the same sequence of events had occurred in every week.

-To avoid any distortions caused by suddenly "jumping back in time", we can choose the start and end times of the loop so that they're exactly a week apart and no devices are in contact with each other at the start time or the end time.
+To avoid any distortions caused by suddenly "jumping back in time" at the end of the loop, we can choose the start and end times of the loop so that no devices are in contact with each other at the start time or the end time.

-All of the datasets have suitable start and end times where no devices are in contact.
+All of the datasets have suitable start and end times exactly a week apart where no devices are in contact.
 It's easy to find suitable start and end times for the office dataset, which has a long period every night with no contacts, but the Haggle dataset has fewer suitable times and the village dataset has fewer still.

 The following figures show the part of each dataset that we selected for looping.
@@ -473,40 +514,51 @@ The following figures show the part of each dataset that we selected for looping

 ---

-### Contact Graphs of the Looped Data
+### Static Contact Graphs for Looped Data
+
+The static contact graphs for the looped data are somewhat sparser than the graphs for the full datasets.

-The contact graphs for the looped data are somewhat sparser than the graphs for the full datasets.
+Recall that in a static contact graph, each node represents a device and an edge connects two nodes if the corresponding devices were in contact at any time.
+The thickness of the edge represents the total duration the devices spent in contact, in proportion to the maximum duration any two devices spent in contact.

 ---

 ![haggle-contact-graph-looped.png](uploads/img/haggle-contact-graph-looped.png)

-**Contact graph for the looped Haggle dataset.**
+**Static contact graph for the looped Haggle dataset.**

 ---

 ![office-contact-graph-looped.png](uploads/img/office-contact-graph-looped.png)

-**Contact graph for the looped SocioPatterns office dataset.**
+**Static contact graph for the looped SocioPatterns office dataset.**

 ---

 ![malawi-contact-graph-looped.png](uploads/img/malawi-contact-graph-looped.png)

-**Contact graph for the looped SocioPatterns village dataset.**
+**Static contact graph for the looped SocioPatterns village dataset.**

 ---

 This shows that our technique for simulating a longer dataset by looping a week's worth of data isn't perfect: some devices that make contact during the full dataset don't make contact during the week's worth of data that we selected for looping.

-The contact graph for the looped version of the village dataset has three small compoments that are disconnected from the rest of the graph, whereas the full version of the dataset has only one small compoment.
+The static contact graph for the looped version of the village dataset has three small compoments that are disconnected from the rest of the graph, whereas the full version of the dataset has only one small compoment.

 There are also two devices in the office dataset and one in the village dataset that don't have any contacts during the week's worth of data chosen for the loop.
-These are shown as isolated nodes in the contact graphs above.
+These are shown as isolated nodes in the graphs above.
 The full versions of these datasets have no isolated nodes.

-As discussed above, we prefer to include these isolated nodes and small components in our analysis, as they represent a realistic feature of human mobility patterns with implications for the feasibility of delay-tolerant networks.
+As discussed above, we prefer to include these isolated nodes and small components in our analysis, as they represent a realistic feature of human mobility patterns that has implications for the feasibility of delay-tolerant networks.
+
+Due to these isolated nodes and small components, the mean static reachability is lower for the looped versions of the office and village datasets than for the full versions of the same datasets.
+
+| Dataset | Mean static reachability (full) | Mean static reachability (looped) |
+|---------|---------------------------------|-----------------------------------|
+| Haggle  |                           92.7% |                             92.7% |
+| Office  |                          100.0% |                             95.7% |
+| Village |                           95.4% |                             82.2% |

-## Reachability Analysis with Looped Data
+## Dynamic Reachability Analysis for Looped Data

 TODO