Changes

akwizgran · 0d710bb1
--- a/Analysis-of-Human-Mobility-Datasets.md
+++ b/Analysis-of-Human-Mobility-Datasets.md
@@ -11,10 +11,10 @@ By analysing the contact graphs we can obtain some bounds on the performance ach
 This analysis produces two interesting findings.
 First, the upper bound for *delivery probability* is high: under the conditions recorded in the datasets, a delay-tolerant network can, in principle, deliver a message from any sender to any recipient with high probability.
 Second, the lower bound for *delivery latency* is also high: under the conditions recorded in the datasets, a delay-tolerant network cannot, even in principle, deliver messages quickly between all potential senders and recipients.
-For some sender-recipient pairs, delivery latencies were on the order of days or weeks, rather than the seconds or minutes that users of internet-based communication tools are accustomed to.
+For some sender-recipient pairs, delivery latencies are on the order of days or weeks, rather than the seconds or minutes that users of Internet-based communication tools are used to.
 The consistency of these results across three different environments (university, office building and village) suggests that the findings are robust.
-These findings will inform the Briar team's strategy for enabling people to communicate securely without access to the internet.
+These findings will inform the Briar team's strategy for enabling people to communicate securely without access to the Internet.
 ## Research Goals
@@ -72,7 +72,7 @@ Researchers at the University of Ioannina converted the Haggle Content dataset i
 All contacts with devices not involved in the experiment were ignored.
 This is the version of the dataset that we used for our analysis.
-Specifically, we used the file `haggle-one-cambridge-city-complete.tsv` from the archive `haggle-one-cambridge-city-complete.zip` published at https://ieee-dataport.org/open-access/crawdad-uoihaggle.
+Specifically, we used the file `haggle-one-cambridge-city-complete.tsv` from the archive `haggle-one-cambridge-city-complete.zip`, published at https://ieee-dataport.org/open-access/crawdad-uoihaggle.
 ### SocioPatterns Office Dataset
@@ -89,16 +89,16 @@ A research paper describing the experiment is available at https://doi.org/10.10
 This dataset records contacts between workers in an office building in France in 2013.
 The participants wore devices for 12 days that scanned for nearby devices every 20 seconds.
-The devices were designed so that they would only register face-to-face contacts between the participants at a distance of up to 1.5 metres.
+The devices were designed so that they would only register face-to-face contacts between the participants, at a distance of up to 1.5 metres.
-Unlike the Haggle dataset, the list of devices detected by each scan was published (in anonymised form), leaving us with some flexibility in translating the detection events into contacts.
+Unlike the Haggle dataset, the researchers published the list of devices detected by each scan (in anonymised form), leaving us with some flexibility in translating the detection events into contacts.
 For consistency with the Haggle dataset, we decided to define a contact as a series of consecutive detections of the same device, with the contact duration measured from the time of the first detection to the time of the last detection.
 This means that as with the Haggle dataset, the recorded contact duration should be understood as a lower bound on the actual duration.
 In particular, if a device was only detected in a single scan then the recorded contact duration would be zero.
-However, due to the much shorter scan interval in this experiment, the difference between the recorded contact duration and the true duration should be smaller than for the Haggle dataset.
+However, due to the much shorter scan interval in this experiment, the difference between the recorded contact duration and the true duration should typically be smaller than for the Haggle dataset.
-Our analysis is based on the file `tij_InVS.dat` from the archive `tij_InVS.zip` published by the SocioPatterns project at http://www.sociopatterns.org/datasets/contacts-in-a-workplace/.
+Our analysis is based on the file `tij_InVS.dat` from the archive `tij_InVS.zip`, published by the SocioPatterns project at http://www.sociopatterns.org/datasets/contacts-in-a-workplace/.
 ### SocioPatterns Village Dataset
@@ -110,13 +110,14 @@ http://www.sociopatterns.org/datasets/contact-patterns-in-a-village-in-rural-mal
 This dataset was collected by researchers at the SocioPatterns project.
 It records contacts between individuals living in a village in rural Malawi in 2019-2020.
-The recording method was similar to the SocioPatterns office dataset: the participants wore devices for 14 days that scanned for nearby devices every 20 seconds, registering face-to-face contacts between the participants.
+The recording method was similar to the SocioPatterns office dataset: the participants wore devices that scanned for nearby devices every 20 seconds, registering face-to-face contacts between the participants.
+Individual participants joined and left the study at different times, so the researchers published data for a 14-day period during which all participants were taking part.
 A research paper describing the experiment is available at https://doi.org/10.1140/epjds/s13688-021-00302-w.
 We translated the detection events into contacts in the same way as for the office dataset.
-Our analysis is based on the file `tnet_malawi_pilot.csv` published by the SocioPatterns project at http://www.sociopatterns.org/datasets/contact-patterns-in-a-village-in-rural-malawi/.
+Our analysis is based on the file `tnet_malawi_pilot.csv`, published by the SocioPatterns project at http://www.sociopatterns.org/datasets/contact-patterns-in-a-village-in-rural-malawi/.
 ### HYCCUPS 2012 Dataset
@@ -136,14 +137,14 @@ This makes it difficult to interpret the data or make comparisons with other dat
 One unusual feature of this dataset is that contacts are rarely symmetrical.
 That is, if device `A` records a contact with device `B` then it's uncommon for device `B` to record a contact with device `A` at the same time.
-For the other datasets this is very common, which makes intuitive sense: if two devices remain within detection range of each other for long enough then we'd expect them both to detect each other.
+For the other datasets this is very common, which makes intuitive sense: if two devices remain within detection range of each other for long enough then we'd expect them both to detect each other sooner or later.
 The HYCCUPS dataset considers two devices to be in contact if they're connected to the same Wi-Fi network, rather than if they're within detection range of each other, but this is still a symmetrical condition (if `A` is connected to the same network as `B` then `B` is connected to the same network as `A`), so we'd still expect the recorded contacts to be roughly symmetrical.
 The lack of symmetry may be due to some feature of the AllJoyn library - perhaps if `A` connects to `B` via AllJoyn then `B` stops trying to connect to `A`, for example.
 But without being able to confirm this guess it's difficult to draw meaningful conclusions from the data.
 Another open question about this dataset is how the data relates to the researchers' publications.
-The 2012 paper "Exploring Predictability in Mobile Interaction", which is based on this dataset, mentions that there were about 21,000 encounters via Wi-Fi and about 7,000 via Bluetooth during the experiment.
+The 2012 paper "Exploring Predictability in Mobile Interaction" (https://doi.org/10.1109/EIDWT.2012.29), which is based on this dataset, mentions that there were about 21,000 encounters via Wi-Fi and about 7,000 via Bluetooth during the experiment.
 The dataset includes about 8,400 Wi-Fi contacts.
 We don't know how to reconcile these numbers.
@@ -203,7 +204,7 @@ The office dataset has no contacts at all during the night.
 As well as daily patterns, the Haggle and office datasets show weekly patterns, with weekdays clearly distinguishable from weekends (the Haggle dataset starts at 9:55 local time on a Friday morning, while the office dataset starts at 8:00 local time on a Monday morning).
 The weekly pattern is especially pronounced for the office dataset, which has no contacts at all during the weekend.
-The village dataset, on the other hand, does not show any clear differences between weekdays and weekends (we don't know the day and time when this dataset starts).
+The village dataset, on the other hand, doesn't show any clear differences between weekdays and weekends (we don't know the day and time when this dataset starts).
 | Dataset | Mean contacts per hour |
 |---------|------------------------|
@@ -212,9 +213,9 @@ The village dataset, on the other hand, does not show any clear differences betw
 | Village |                  102.3 |
 Contacts are far more frequent in the village dataset than the other datasets.
-The office and village datasets were collected using the same experimental setup, so the difference is not attributable to the collection methodology.
+The office and village datasets were collected using the same experimental setup, so the difference is not due to the collection methodology.
-This may be due in part to the fact that contacts in the office dataset only occur during office hours (roughly a quarter of all hours in the week).
+The difference may be due in part to the fact that contacts in the office dataset only occur during office hours (roughly a quarter of all hours in the week).
 ### Number of Connected Devices
@@ -300,7 +301,7 @@ Contact durations are calculated using the definition of an undirected contact g
 ---
 The Haggle dataset's static contact graph has a densely connected core and a sparsely connected periphery.
-Within the core are two clusters of nodes connected by thick edges, representing devices that spent a lot of time in contact with each other.
+Within the core there are two clusters of nodes connected by thick edges, representing devices that spent a lot of time in contact with each other.
 The periphery contains about a dozen devices that each made contact with one or two devices from the core.
 There are two isolated nodes, which represent the two devices that didn't make contact with any other participating devices during the experiment.
@@ -320,7 +321,7 @@ In the real world there are many people who are physically isolated from others,
 These people are less likely to be well-served by delay-tolerant networks than people who are more mobile.
 By including isolated nodes and small components in our analysis, rather than analysing only the main component of the graph, we aim to ensure that this aspect of reality is reflected in our results.
-### Static Reachability
+### Measuring Static Reachability
 One way to measure the connectivity of a dataset is to count how many devices in the dataset's static contact graph are reachable from a given source device by following the edges of the contact graph.
 Averaging over all possible source devices gives a metric that we call the *mean static reachability* of the dataset.
@@ -339,7 +340,7 @@ The mean static reachability of the Haggle dataset is below 100% because the dev
 Similarly, the mean static reachability of the village dataset is below 100% because the two devices that belong to the small component of the static contact graph can only reach (and be reached by) each other.
 The mean static reachability of a dataset gives us an upper bound on the delivery performance that we can expect any delay-tolerant network to achieve, given the mobility patterns recorded in the dataset.
-For example, given the mobility patterns recorded in the Haggle dataset, no delay-tolerant network could deliver messages between more than 92.7% of sender-recipient pairs.
+For example, given the mobility patterns recorded in the Haggle dataset, no delay-tolerant network could deliver messages between more than 92.7% of all possible sender-recipient pairs.
 Next we'll see how we can tighten this upper bound by including information about the order in which contacts happened.
@@ -349,7 +350,7 @@ The static contact graph captures information about which devices made contact w
 To capture this information as well, we can construct a *dynamic contact graph* for each dataset.
 To do this, we create a layered graph in which each layer represents the contacts between devices at a given instant.
-Every time there's a connectivity change (ie the start or end of a contact), we add a new layer to the graph representing the new state.
+Every time there's a connectivity change (ie the beginning or end of a contact), we add a new layer to the graph representing the new state.
 We start by creating the bottom layer, which simply contains a node representing each device.
 Then we process the detection events in order of time.
@@ -362,15 +363,15 @@ Then we remove from the new layer the edge representing the contact that has jus
 In both cases, we add a directed edge from each node in the layer below to the node representing the same device in the new layer.
-The result is a graph in which it is possible to follow edges horizontally within each layer, and vertically from each layer to the layer above (later in time), but *not* to the layer below (earlier in time), because the edges between layers are directed.
+The result is a graph in which it is possible to follow edges horizontally within each layer, and vertically from each layer to the layer above (later in time), but *not* to the layer below (earlier in time).
 Horizontal edges represent contacts between devices at a given instant, while vertical edges connect nodes that represent the same device at different instants.
-Thus, for example, if a contact occurs between devices `A` and `B`, and then after this contact has ended there is a second contact between devices `B` and `C`, the graph allows us to see that `C` is transitively reachable from `A`, but not vice versa.
+Thus, for example, if a contact occurs between devices `A` and `B`, and then after this contact has ended there's a second contact between devices `B` and `C`, the graph allows us to see that `C` is transitively reachable from `A`, but not vice versa.
 `A` could transfer a message to `B` during their contact, and `B` could then transfer this message to `C` during their later contact, but the reverse would not be possible, as the contact between `A` and `B` ends before the contact between `B` and `C` begins.
 This asymmetry is an important property of delay-tolerant networks, and the directed edges between layers allow our model to capture it.
-We can use the dynamic contact graph to study the emergent properties of the mobility patterns recorded in the datasets.
+We can use dynamic contact graphs to study the emergent properties of the mobility patterns recorded in the datasets.
-For example, if we start from a given device at a given time, which devices are transitively reachable, and how long does it take to reach them?
+For example, if we start from a given device at a given time, which devices are reachable, and how long does it take to reach them?
 How do these properties vary with the choice of starting device or starting time?
 A dynamic contact graph can be built from directed or undirected contacts, resulting in a graph that has directed or undirected edges within each layer.
@@ -390,7 +391,7 @@ In particular, an event indicating the end of a contact is always processed afte
 The following figures show how to construct a dynamic contact graph for a simple dataset with three devices: `A`, `B` and `C`.
 The dataset starts at time `t1`.
-There's a contact between `A` and `B` beginning at `t2` and ending at `t3`, and there's a contact between `B` and `C` beginning at `t4` and ending at `t5`.
+There's a contact between `A` and `B` beginning at `t2` and ending at `t3`, and later there's a contact between `B` and `C` beginning at `t4` and ending at `t5`.
 ---
@@ -419,10 +420,12 @@ There's a contact between `A` and `B` beginning at `t2` and ending at `t3`, and
 ---
 In the finished graph, we can see that if we start from device `A` at time `t1` it's possible to reach `B` at `t2` and `C` at `t4`.
-On the other hand, if we start from `C` at `t1`, we reach `B` at `t4` and we can't reach `A` at all.
+On the other hand, if we start from `C` at `t1`, we can reach `B` at `t4` and we can't reach `A` at all.
 Thus, unlike a static contact graph, the dynamic contact graph captures the fact that there's a delay-tolerant path from `A` to `C` but not vice versa, because the contact between `A` and `B` happens before the contact between `B` and `C`.
+If we collapsed all the layers of the dynamic contact graph into a single layer then we'd recover the static contact graph.
 ## Dynamic Reachability Analysis
 By building a dynamic contact graph for each dataset we can begin to study which devices are reachable from which others at which times, which tells us which devices would potentially be able to send messages to each other in a delay-tolerant network.
@@ -434,13 +437,13 @@ Thus the reachability analysis provides an upper bound for the delivery performa
 This upper bound is tighter than the one provided by measuring static reachability, and it allows us to study variations in reachability over time.
 This information is useful alongside more realistic simulations of specific network designs, as it tells us how close a simulated network is coming to the upper bound.
-If a simulated network does not manage to deliver all messages then the upper bound provided by the reachability analysis can tell us to what extent this is due to the inherent properties of the mobility data, and to what extent it's due to the specific network design being simulated.
+If a simulated network doesn't manage to deliver all messages then the upper bound provided by the reachability analysis can tell us to what extent this is due to the inherent properties of the mobility data, and to what extent it's due to the specific network design being simulated.
-The upper bound is also useful as a way of assessing whether any delay-tolerant network could achieve useful levels of performance, given the mobility data.
+The upper bound is also useful as a way of assessing whether *any* delay-tolerant network could achieve useful levels of performance, given the mobility data.
 If the upper bound for delivery performance is too low to be useful then we don't need to investigate the performance of specific designs.
 A negative result at this stage could save us a lot of effort designing and deploying real networks, if we can see that no network could possibly achieve the level of performance we need.
-### Dynamic Reachability
+### Measuring Dynamic Reachability
 Just as we measured static reachability in a static contact graph by counting how many devices are reachable from a given source device, in a dynamic contact graph we can measure *dynamic reachability* by counting how many devices are reachable from a given source device, *starting at a given time*.
 Averaging over all possible source devices gives a metric that we call the *mean dynamic reachability* at the given time.
@@ -452,8 +455,6 @@ The figures below show the mean dynamic reachability for each dataset, as a func
 The dashed line in each figure shows the mean static reachability, for comparison.
 Static reachability is always at least as high as dynamic reachability, as static reachability ignores the order in which contacts occur.
-One way to think of it is that static reachability is what we get if we flatten all the layers of the dynamic contact graph into a single layer.
 ---
 ![haggle-reachability.png](uploads/img/haggle-reachability.png)
@@ -492,9 +493,9 @@ The experiment's earlier end would have caused the boundary effect to occur earl
 We can test this hypothesis by cutting off the end of the dataset, effectively ending the experiment earlier, and comparing the results with the full dataset.
-If the hypothesis of a boundary effect is wrong and the observed decline in reachability is *not* due to a boundary effect, then cutting off the end of the dataset will have no effect on the mean dynamic reachability at times before the cutoff time.
+If the hypothesis of a boundary effect is wrong and the observed decline in reachability is *not* due to a boundary effect, then cutting off the end of the dataset will have no effect on the dynamic reachability at times before the cutoff time.
-On the other hand, if the hypothesis of a boundary effect is right then cutting off the end of the dataset will reduce the mean dynamic reachability at times before the cutoff time.
+On the other hand, if the hypothesis of a boundary effect is right then cutting off the end of the dataset will reduce the dynamic reachability at times before the cutoff time.
 By measuring how far back from the cutoff time the change in reachability extends, we'll be able to estimate how far back from the end of the full dataset the boundary effect extends.
 ---
@@ -508,7 +509,7 @@ By measuring how far back from the cutoff time the change in reachability extend
 The results support the hypothesis of a boundary effect.
 Cutting off the end of the dataset reduces the mean dynamic reachability at times before the cutoff time, compared with the full dataset.
-For some cutoff times, reachability drops to zero before the cutoff time, while for others it remains above zero, indicating that some devices are transitively connected to each other when the cutoff time is reached (ie, in the highest remaining layer of the dynamic contact graph).
+For some cutoff times, reachability drops to zero before the cutoff time, while for others it remains above zero, indicating that some devices are connected to each other when the cutoff time is reached (ie, in the highest remaining layer of the dynamic contact graph).
 By looking at the left edge of the figure above, we can see that the dynamic reachability at the beginning of the experiment is affected by changing the cutoff time.
 Even a cutoff time of 11 days slightly affects the dynamic reachability at the beginning of the experiment, indicating that some of the paths that started at the beginning of the experiment took more than 11 days to reach their destinations, and were thus removed when the cutoff time was set to 11 days.
@@ -525,8 +526,8 @@ How can we prevent the boundary effect from distorting our results?
 In an ideal world we'd like to go back and repeat the experiments over much longer durations, capturing several weeks' worth of data.
 We would measure the extent of the boundary effect by truncating the data, as we did above, and then run the reachability analysis for all start times up to the beginning of the boundary effect.
-We would need to ensure that each experiment lasted long enough to leave at least one week's worth of "clean" results unaffected by the boundary.
+We'd need to ensure that each experiment lasted long enough to leave at least one week's worth of "clean" results unaffected by the boundary.
-This would let us average out the daily and weekly patterns in the clean results, and compare one day to another, without any distortions created by the boundary effect.
+This would let us average out the daily and weekly patterns in the clean results, without any distortions created by the boundary effect.
 Of course we can't actually repeat the experiments, but we can try to approximate the results of a longer experiment by selecting a week's worth of data from each dataset and looping it - in other words, processing the detection events as though the experiment had lasted for multiple weeks and the same sequence of events had occurred in every week.
@@ -586,7 +587,7 @@ The thickness of the edge represents the total duration the devices spent in con
 The differences between the static contact graphs for the full and looped versions of the datasets show that our technique for simulating a longer dataset by looping a week's worth of data isn't perfect: some devices that make contact during the full dataset don't make contact during the week's worth of data that we selected for looping.
-The static contact graph for the looped version of the village dataset has three small compoments that are disconnected from the rest of the graph, whereas the full version of the dataset has only one small compoment.
+The static contact graph for the looped version of the village dataset has three small components that are disconnected from the rest of the graph, whereas the full version of the dataset has only one small component.
 There are also two devices in the office dataset and one in the village dataset that don't have any contacts during the week's worth of data chosen for the loop.
 These are shown as isolated nodes in the graphs above.