Assessing visual analytics (VA) methodology for mobility data


Master thesis part 2 [3092-1819] Master of Transportation Sciences Hasselt University


Israel Ketema Elefenh [1643007]

January 2019

Supervisor: Prof. dr. ir. Ansar-Ul-Haque YASAR

Mentor: ir. Wim ECTORS




PREFACE

Movement is a crucial part of human life. In the words of Moshe Feldenkrais, “movement is life: without movement life is unthinkable.” In transportation especially, movement is fundamental: moving from one place to another is an important part of daily life. Yet movement is not easy to understand because of its complex and dynamic nature, and many planning and implementation decisions at different levels require a clear understanding of it. A deep and well-established knowledge of the movement patterns of people and objects is therefore essential.

Thanks to new technologies, advances in data storage devices and improved means of creating and collecting movement data, a huge amount of mobility data is now being produced. The ability to use this data for decision making, however, has not kept up with its growth. Addressing this information overload requires a method that combines data visualization with algorithmic data analysis, and this is precisely where the thesis comes in. It is concerned with methods of exploring and assessing movement data using state-of-the-art visual analytics, that is, the combination of automated analysis with interactive visualization in order to achieve effective understanding, reasoning and decision making on the basis of dynamic and complex movement datasets.

This thesis aims to better understand and investigate visual analytics methods and options for exploring mobility data, assessing its quality, identifying and matching patterns in it, and building models that help to understand the context of the data, and lastly to evaluate the application of these methods using a case study. Finally, I hope that this thesis will make worthwhile reading for anyone interested in assessing visual analytics methods for mobility data.



SUMMARY

Movement is a complex and dynamic phenomenon with three main components: objects, space and time (Andrienko et al., 2013). Movement data involves the same three components and describes the relationship between the moving object and the spatial and temporal components of its movement. This relationship carries meaningful information not only about the object but also about space and time. To uncover the information concealed in movement data, it is therefore vital to explore and assess the data from different perspectives, using analysis methods such as visual analytics. In this thesis, different visual analytics methods of data exploration and assessment are applied to movement data.

The study begins by exploring the movement data in terms of trajectories, because trajectories include all the main components of movement. A trajectory is the path a moving object makes through space as it changes position. Time is an essential feature of a trajectory, since a path cannot be made instantly but needs a certain amount of time (Giannotti & Pedreschi, 2008). In this thesis, both a single trajectory and multiple trajectories of a sample participant were explored and assessed. The single trajectory assessment was relatively easy, mainly because of the small size of the data. The analysis involved projecting the position points on a base map using the correct coordinate system, which gives general information about the spatial component of the data. A line representing the trajectory was then constructed from the projected points. At this stage, the temporal component is not yet considered.
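The construction of a trajectory line from position points can be illustrated with a minimal sketch. It is not the workflow used in the thesis, which relies on map-based tools; the `Fix` record layout and the function names are hypothetical, and distances use a spherical-Earth (haversine) approximation. Timestamps are used here only to put the vertices in order, so the result is still a purely spatial line.

```python
from dataclasses import dataclass
from math import asin, cos, radians, sin, sqrt
from typing import List

EARTH_RADIUS_M = 6_371_000.0  # mean Earth radius, spherical approximation


@dataclass(frozen=True)
class Fix:
    """One position record (hypothetical layout, not the thesis dataset schema)."""
    t: float    # seconds since the start of the recording
    lat: float  # latitude in decimal degrees
    lon: float  # longitude in decimal degrees


def haversine_m(a: Fix, b: Fix) -> float:
    """Great-circle distance between two fixes in metres."""
    dlat = radians(b.lat - a.lat)
    dlon = radians(b.lon - a.lon)
    h = sin(dlat / 2) ** 2 + cos(radians(a.lat)) * cos(radians(b.lat)) * sin(dlon / 2) ** 2
    return 2 * EARTH_RADIUS_M * asin(sqrt(h))


def build_trajectory(fixes: List[Fix]) -> List[Fix]:
    """Order the position points by timestamp so that they form a polyline."""
    return sorted(fixes, key=lambda f: f.t)


def path_length_m(trajectory: List[Fix]) -> float:
    """Length of the polyline: sum of distances between consecutive vertices."""
    return sum(haversine_m(a, b) for a, b in zip(trajectory, trajectory[1:]))
```

Connecting the points in recording order matters: the same points joined in a different order give a longer and misleading line.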
To understand the full story of the trajectory, the spatial and temporal components must be considered together. An animated map made it possible to address space-time questions such as the direction of movement, the stops made by the sample participant and their duration. The animated map also presented other components of the dataset, such as speed and identified stop activities.

To understand the purpose of the trip, the first step was to identify the stops. The stops had already been identified by a stop detection algorithm, but to show another way of identifying stops, interactive spatial point clustering was used. The next task was to identify which stops are significant. Two approaches were considered: spatial clustering of stops and filtering stops by the duration of activity. Spatial clustering does not distinguish between types of activity stops and ignores the duration of activity. Because of these shortcomings, identifying significant stops by duration of activity was preferred. A 3D visualization technique made it possible to visualize the activity durations. This analysis identified two significant stops: the home location and a social activity location. The activity types come from the prompt-recall method of data collection. Using duration to identify significant stops also has shortcomings. The assumption that longer activities are important and shorter ones are less important is not always true, and the analysis misses repeated short-duration stops, which spatial clustering can detect. In general, identifying significant places requires the results of both methods, and this is where human interpretation comes in: the challenges of exploring data are addressed by strengthening the synergy between human and computer. Looking at a single trajectory of a single sample participant was helpful in understanding and exploring the dataset. Yet a single trajectory does not tell the full story, and the information it provides is too limited to draw conclusions about the movement pattern of the sample participant.
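The idea of combining the two criteria, long duration and repeated visits, can be illustrated with a small sketch. It assumes that the stops have already been grouped into labelled places (for example by spatial clustering). The function name, the input format and both thresholds are hypothetical and are not taken from the thesis analysis.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def significant_places(
    stops: Iterable[Tuple[str, float]],
    min_total_minutes: float = 180.0,  # illustrative threshold, an assumption
    min_visits: int = 3,               # illustrative threshold, an assumption
) -> Dict[str, str]:
    """Flag places that are significant by total duration or by visit frequency.

    `stops` holds (place label, stop duration in minutes) pairs, one per stop.
    Returns a mapping from place label to the criterion that flagged it.
    """
    total_minutes = defaultdict(float)
    visits = defaultdict(int)
    for place, minutes in stops:
        total_minutes[place] += minutes
        visits[place] += 1

    result = {}
    for place in total_minutes:
        if total_minutes[place] >= min_total_minutes:
            result[place] = "duration"
        elif visits[place] >= min_visits:
            result[place] = "frequency"
    return result
```

A place visited often but briefly is kept by the frequency criterion, which a pure duration filter would miss. The analyst still has to judge whether the flagged places are meaningful.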
To understand the full story, multiple trajectories of the sample participant must be included. The exploration and assessment of multiple trajectories of a single object followed almost the same process as for a single trajectory; for instance, stops and significant places were identified in exactly the same way. Some analyses used for the single trajectory were not applicable, however, mainly because of the size of the data. The analysis began with preprocessing and data exploration, which involved projecting each day's position points on the map and constructing trajectories. This step was essential in assessing the quality of the data. The dataset had two main data quality issues: some day trajectories contained data gaps, and some stops had not been assigned an activity type during the prompt-recall survey. To address these issues, day trajectories with data gaps were excluded, and the missing activity types were inferred from land use data, using a digital land use inventory together with the unidentified stop points. Missing activity types can also be inferred from earlier or later identified activities in the same area: an unidentified point can be assumed to have the same activity as a neighbouring assigned activity. From the identification of significant stops and activities, it was concluded that the places significant for the sample participant were those of the activity types home, work, social, service and unidentified.

After the identification of significant stops, the next step was to identify the significant route, for which two methods were used. The first was the kernel density method. This method proved inconvenient because the density of position points does not always reflect how frequently a route is used. As the analysis showed, an object moving at a relatively low speed tends to produce more position records, which distorts the result of the kernel point density analysis. In relatively large datasets, however, it helps analysts to get a general picture of the significant routes and prompts them to examine the identified routes further. The second method used an ellipse-shaped activity space model.
This approach uses ellipses to analyse the spatial component by identifying the activity spaces of sample participants. The activity space model made it possible to identify the highly used route, which consists of the streets “Diestersteenweg” and “Kuringersteenweg”. Unlike the heat map or kernel point density analysis, the activity space model was effective in identifying the highly used route. The result allows further analysis in the case study, the iSCAPE project. The project strives to make a behavioral intervention by changing the activity patterns of urban inhabitants, proposing sustainable activity patterns that suit their travel behavior. Raster air pollution data retrieved from the Flemish environmental agency (Vlaamse Milieumaatschappij) was overlaid on the activity space and the highly used route. The analysis showed that the identified significant route of the sample participant lies in a highly polluted area. Besides overlaying the pollution data, computational operations can be performed to understand the relationships between variables such as exposure time and pollution concentration, mode of transport and pollution concentration, infrastructure time and pollution concentration, etc. As a recommendation, the project can benefit greatly from coupling visual analytics methods and tools to ease understanding and analysis of the dataset.



TABLE OF CONTENTS

ABSTRACT
1 INTRODUCTION
2 PROBLEM STATEMENT
3 RESEARCH QUESTION
4 OBJECTIVES
5 METHODOLOGY
6 LITERATURE REVIEW
    6.1 Introduction to Visual analytics
    6.2 Visual analytics data structures
    6.3 Classes of Spatiotemporal data
    6.4 Visual analytic methods of assessing data quality
    6.5 Visual analytics methods of data exploration
    6.6 Pattern finding and matching
        6.6.1 Data mining models
        6.6.2 The predictive data mining model
        6.6.3 The descriptive data mining model
    6.7 Data transformations techniques for managing large data volumes
        6.7.1 Aggregation and clustering
    6.8 Multi-perspective analysis of movement
        6.8.1 Trajectories
        6.8.2 Visualizing trajectories
        6.8.3 A single trajectory of a single object
        6.8.4 Multiple trajectories of a single object
        6.8.5 Identifying significant places from trajectories
        6.8.6 Identifying popular routes from trajectories
    6.9 Visual analytics toolboxes and software stacks
    6.10 Visual analytics models
        6.10.1 Human cognition Model
        6.10.2 The van Wijk Model
7 EXAMPLE DATASET
    7.1 Basic information about the dataset
8 VISUAL ANALYTICS METHODS OF EXPLORING AND ASSESSING THE DATASET
    8.1 A Single trajectory
        8.1.1 Extraction of significant places
    8.2 Multiple trajectories of a single object
        8.2.1 Data preprocessing and exploration
        8.2.2 Data quality assessment
        8.2.3 Extraction of stops
        8.2.4 Identification of significant stops
        8.2.5 Methods of identifying activities of unidentified stops
        8.2.6 Identification of significant route
9 CURRENT APPLICATIONS OF VISUAL ANALYTICS APPLIED TO CASE STUDIES
10 DISCUSSION
11 CONCLUSION
12 LIMITATIONS AND POTENTIAL APPLICATIONS
13 ACKNOWLEDGMENT
14 REFERENCES



TABLE OF FIGURES

Figure 1: Building blocks of visual analytics (Järvinen, Puolamäki, Siltanen, & Ylikerälä, 2009)
Figure 2: Visual analysis integrates human and machine (Keim et al., 2008)
Figure 3: Visual analytics processes (Keim et al., 2008)
Figure 4: The data mining process (Järvinen et al., 2009)
Figure 5: Example of an interactive map of trajectories (N. Andrienko & Andrienko, 2013)
Figure 6: Example of an interactive space-time cube with trajectories (N. Andrienko & Andrienko, 2013)
Figure 7: Example of a single trajectory (G. Andrienko et al., 2013)
Figure 8: Example of multiple trajectories (G. Andrienko, Andrienko, Bak, Keim, & Wrobel, 2013)
Figure 9: Example of a flow map representing links between places in a geographical context and time (G. Andrienko et al., 2013)
Figure 10: Example of clustering stops to identify significant places (G. Andrienko et al., 2013)
Figure 11: Example of Identifying most frequently used routes by clustering (N. Andrienko & Andrienko, 2013)
Figure 12: The HCM process (Green, Ribarsky, & Fisher, 2009)
Figure 13: The Wijk visualization model (Keim et al., 2008)
Figure 14: Home distribution
Figure 15: GPS-measured positions of sample participants
Figure 16: Projected position points
Figure 17: Constructed single trajectory
Figure 18: Series of animated map of a single trajectory
Figure 19: Speed-time graph of a single trajectory
Figure 20: Cluster of position records
Figure 21: Identified stops based on the duration of activity
Figure 22: Identified stops overlaid on the databases of Point of interest (OpenStreetMap, 2018)
Figure 23: A single-day total duration of activities
Figure 24: Projection of position records (left) & constructed trajectory from point records (right)
Figure 25: Identified data gap on 28/06/2017
Figure 26: Sample participant staying the whole day at home location
Figure 27: Spatial point clustering
Figure 28: Identified stop points
Figure 29: Spatial clustered stop points
Figure 30: Multiple day total duration of activities
Figure 31: Stops with a duration of 3 hours longer
Figure 32: Identified stops based on the duration of activity
Figure 33: Land use inventories to describe the activity type (Geopunt.be, 2018)
Figure 34: Example of unidentified activity points found near to other defined activity of stops
Figure 35: Kernel point density analysis
Figure 36: Activity space model (Li & Tong, 2016)
Figure 37: Application of ellipse-shaped space activity model
Figure 38: Identified highly used route and NO2 pollution map (Vlaamse Milieumaatschappij, 2019)
Figure 39: Identified highly used route and PM10 pollution map (Vlaamse Milieumaatschappij, 2019)

TABLE OF TABLES

Table 1: Visual analytics common classes of data structures
Table 2: Visual analytics toolboxes and software stacks
Table 3: Identified activity types



ABSTRACT

The accessibility and advancement of data collection systems and technology make collecting data easier than ever. However, information overload and the ability to communicate the resulting knowledge in an understandable manner have become major concerns. Visual analytics plays a major role in addressing this challenge: it turns information overload into an opportunity by refining the most relevant and valuable information, which eases the understanding of complex and dynamic data such as movement data. The process relies primarily on human understanding of the context and the data. Visual analytics (VA) combines automated analysis with interactive visualization in order to achieve effective understanding, reasoning and decision making on the basis of dynamic and complex datasets. In this thesis, visual analytics methods with a special focus on movement data are discussed in detail. The thesis mainly comprises a literature review of visual analytics methods for exploring data, assessing data quality, and finding and matching patterns, together with a review of available visual analytics software and tools and of visual analytics models. The data analysis combines interactive visual analysis methods, which support the analyst's perception, cognition and reasoning, with computational analysis methods, which are important for handling the complex and dynamic nature of movement data. Furthermore, the analysis explicitly addresses the variables of time and space and their interaction. Lastly, an example dataset is used to demonstrate the effectiveness and the synergistic use of the different visual analytics methods.

Keywords: Movement data, trajectory, visual analytics, exploring data, assessing data quality, pattern finding, software tools, models, visualization, reasoning, computational analysis.



1 INTRODUCTION

At present, a huge amount of mobility data is produced, and this rapidly growing amount of data needs to be handled in a way that provides relevant, understandable and usable information. According to the TechAmerica Foundation (2012), “Big data is a term that describes large volumes of high velocity, complex and variable data that require advanced techniques and technologies to enable the capture, storage, distribution, management, and analysis of the information.” New technologies, advances in data storage devices and improved means of creating and collecting data have contributed significantly to the way we manage information today (Keim et al., 2008). However, the ability to use this vast amount of data for decision making has not kept up with its growth. Data collection, storage and processing lack filtering and fine-tuning according to relevance and to the question at hand. Extracting the relevant information contained in raw data is what gives the collected and stored data meaning. It also helps to avoid the information overload problem, which is the current problem of complex data processing. Information overload, the inability to deal properly with the data volume, leads to a loss of time, money, resources and opportunities. Overcoming it involves questioning the relevance of the topic at hand, the appropriateness of the procedures and processes used for decision making, and the way the resulting information is presented. As a result, the data becomes usable and relevant. Visual analytics (VA) is a human-centric data analysis process (N. Andrienko & Andrienko, 2013) that helps in dealing with the problem of information overload.
The process relies primarily on human understanding of the context and the data. Visual analytics combines automated analysis with interactive visualization in order to achieve effective understanding, reasoning and decision making on the basis of dynamic and complex datasets (Keim et al., 2008). The primary advantage of visual analytics lies in its visualization techniques, which make data accessible in a way people can understand. Furthermore, visual analytics supports decision makers in making more efficient and measurable decisions. This thesis explores the state of the art of visual analytics for the analysis of mobility data. The major focus is on how the resulting information can be analyzed and presented visually so that understandable knowledge can be gained.

2 PROBLEM STATEMENT

The digital traces that people leave behind when interacting with digital systems such as communication networks or social media platforms mean that obtaining raw mobility data is no longer a concern (Sagl, Loidl, & Beinat, 2012). However, information overload and the ability to communicate the resulting knowledge in an understandable manner have become major concerns. Visual analytics plays a major role in addressing this challenge by turning information overload into an opportunity: it refines the most relevant and valuable information, which eases the understanding of complex data (Keim et al., 2008). Nevertheless, a number of application and technical challenges still need to be addressed, including data quality, visual representation and pattern matching.

3 RESEARCH QUESTION

More specifically, the following research questions are addressed in detail:

• What are the visual analytics methods for exploring data?
• What are the visual analytics methods for assessing data quality?
• What are the visual analytics methods for finding and matching patterns?
• What are the visual analytics methods for building models?
• What are the available visual analytics software stacks and toolboxes?
• How can current applications of visual analytics methods be applied in case studies?



4 OBJECTIVES

The objective of the thesis is to better understand and investigate visual analytics methods for exploring mobility data, assessing its quality, and identifying and matching patterns in it, and to explore visual analytics models that help to understand and assess the context of the data. In addition, the findings from investigating different visual analytics methods and techniques will contribute to the application of visual analytics methods and to method evaluation using state-of-the-art scenarios.

5 METHODOLOGY

To satisfy the objective of the thesis, the primary research methods are a literature review and different visual analytics methods. The study mainly consists of identifying a classification of visual analytics methods and inventorying available software stacks. It is oriented towards investigating the current visual analytics literature, with a special focus on data quality, data exploration, finding patterns in data and exploring some models. Identifying available software and toolboxes is also an important part of the study. In addition, movement data from an example dataset is used to examine the effectiveness of the application of visual analytics methods and for method evaluation using state-of-the-art scenarios. The variables of time and space and their interaction are explicitly addressed. The thesis was conducted between March 2018 and January 2019. The data analysis mainly makes use of the joint power of visual and computational analysis.



6 LITERATURE REVIEW

6.1 Introduction to Visual analytics

“Visual analytics is the science of analytical reasoning facilitated by interactive visual interfaces” (Cook & Thomas, 2005). Visual analytics tools and techniques help in producing information and developing insight from very large and complex data. Furthermore, visual analytics provides sensible, secure and clear assessments that can facilitate effective communication for action. Visual analytics is a multidisciplinary subject (see Figure 1) built on different research areas, including data mining, information visualization, human perception and reasoning, user interface techniques and various mathematical and statistical methods (Järvinen et al., 2009).

Figure 1: Building blocks of visual analytics (Järvinen, Puolamäki, Siltanen, & Ylikerälä, 2009)

Given the complexity of current challenges and the huge size of available data, coupling the power of human visual thinking with powerful computing machines plays an important role in addressing these challenges (N. Andrienko & Andrienko, 2013). Visual analytics joins computational and human-centric processes to benefit from the advantages of both and achieve the most effective results (N. Andrienko & Andrienko, 2013; Cook & Thomas, 2005). The advantages of the machine include the ability to store and process huge amounts of data, the very little time needed to search for information and process data, and high processing power for high-quality results. The benefits of the human include flexibility, the ability to deal with challenges that are difficult to formalize, and the ability to use previous knowledge and experience (Gennady Andrienko). Moreover, visual analytics advances the division of tasks between human and machine (Keim et al., 2008).

Figure 2: Visual analysis integrates human and machine (Keim et al., 2008)

Visual analytics tools and techniques are mainly used to extract useful information from massive, dynamic and complex data. The main objective of visual analytics is to develop visually interactive tools and techniques that help in reasoning, perception and decision making from huge datasets (Järvinen et al., 2009). The visual analytics process is well illustrated by Keim et al. (2008) (see Figure 3).

Figure 3: Visual analytics processes (Keim et al., 2008)



Visual analytics tools help in presenting large quantities of information in limited space, finding patterns in data, exploring data, extracting information, etc. (Järvinen et al., 2009). Visualization is a crucial and widely used tool for analyzing movement phenomena and processes that take place in geographical space (N. Andrienko & Andrienko, 2013). By definition, movement is the change of the physical position of an object with regard to some reference system, which is mostly the geographical space (Giannotti & Pedreschi, 2008). Likewise, movement data generally comprises the positions of moving entities in a given space; furthermore, it represents the paths of moving entities through space over time (G. Andrienko et al., 2013). In the following sections, different characteristics of movement and movement data in relation to visual analytics are discussed in detail.

6.2 Visual analytics data structures

Visual analytics includes a number of common classes of data structures: object-referenced data, time-referenced data, space-referenced data, spatial time series and object time series.

• Object-referenced data: attributes observed from objects. This class includes:
    ◦ Events: attributes of the time of presence or appearance of the object. The data may record the exact moment or the interval in which the event happens.
    ◦ Spatial objects: attributes of the spatial location of the object, which may be a point, an area or a volume.
    ◦ Spatial events: as the name suggests, attributes of the presence or appearance of the object together with its spatial location.
• Time-referenced data: attributes observed at different times, also known as time series.
• Space-referenced data: attributes observed at different locations, also known as spatial data.
• Spatial time series: attributes observed at different locations and times.
• Object time series: attributes of objects observed at different times. An excellent example of this class is the trajectories of moving objects.

The common classes of data structures in visual analytics are summarized in Table 1.

Table 1: Visual analytics common classes of data structures

Data types                Time    Space   Object
Object-referenced data    X       X       X
Time-referenced data      X
Space-referenced data             X
Spatial time series       X       X
Object time series        X               X
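As an illustration only, some of the classes in Table 1 can be sketched as simple record types. The following Python sketch is not taken from the cited sources: the names Event, SpatialEvent and ObjectTimeSeries, the planar x/y coordinates and the timestamps in seconds are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Event:
    """Object-referenced record: an object present at a moment or over an interval."""
    object_id: str
    start: float                 # timestamp in seconds (assumed unit)
    end: Optional[float] = None  # None means an instantaneous event


@dataclass
class SpatialEvent(Event):
    """An event that also carries a spatial location (here a 2D point)."""
    x: float = 0.0
    y: float = 0.0


@dataclass
class ObjectTimeSeries:
    """Attributes of one object observed at different times, e.g. a trajectory."""
    object_id: str
    samples: List[Tuple[float, float, float]]  # (t, x, y)

    def time_span(self) -> float:
        times = [t for t, _, _ in self.samples]
        return max(times) - min(times)


# Hypothetical records: a phone call as a spatial event, a bike track as an object time series.
call = SpatialEvent("phone-1", start=10.0, x=5.0, y=2.0)
track = ObjectTimeSeries("bike-1", [(0.0, 0.0, 0.0), (60.0, 1.0, 1.0)])
```

In this sketch a trajectory is simply an object time series whose observed attribute is the position.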

6.3 Classes of Spatiotemporal data

Different types of data are used in visual analytics; the most prominent is spatiotemporal data. Spatiotemporal data, as the name implies, are time-referenced position data that contain references to both space and time. These data include classes such as:

• Spatial events, e.g. public events, mobile phone calls, etc.
• Spatial time series, e.g. demographic data from different years, weather, etc.
• Trajectories, e.g. trajectories of cyclists, vehicles, people, etc.

These data mostly represent references to the real world that come from different geographic measurements, remote sensing, GPS positioning, etc. (Keim et al., 2008). The most commonly used data are GPS positioning data. Important analysis tasks on spatial data include finding spatial relationships and patterns, and visualization is one of the most used and important techniques for analyzing geospatial data. In temporal data, the elements are referenced by time. As mentioned in Keim et al. (2008), analysis tasks on temporal data include the identification of linear or periodical patterns, the development and relationships of patterns over time, etc. As with spatial data, visualization plays a major role in the analysis of temporal data. Interactive and dynamic visual representations contribute a lot to understanding spatiotemporal data and related phenomena (N. Andrienko & Andrienko, 2007).

6.4 Visual analytic methods of assessing data quality

Nowadays, different organizations collect and store large databases using advanced technologies to fulfill their analytical processes. However, the production of valuable outcomes depends strongly on the reliability or quality of the data. To acquire the desired quality of data, there are different methods, procedures, techniques, technological approaches and processes for improving and maintaining data. To identify which practice is more effective, it is important first to identify and assess the present quality of the data.

The results of the data quality assessment are key to improving the quality of data and are also a significant component in supporting analytical processes (Josko & Ferreira). Assessing data quality involves prioritizing data regions according to different data quality dimensions such as the completeness, certainty, consistency, accuracy, etc. of the data (Lee, Pipino, Funk, & Wang, 2009). In assessing data quality, understanding the data context and defining data quality in that context play a vital role (Lee et al., 2009). This is where the human role comes in: human supervision is essential for understanding the context of the data and defining data quality in context. Visual analytics belongs to the supervised approaches that combine computational capacities with human-centric processes (N. Andrienko & Andrienko, 2013; Josko & Ferreira). The success of the visual analytics process at large depends upon data quality dimensions such as the completeness and reliability of the raw data, the loss of information during filtering, sampling, and other transformations, and the accuracy and clarity of the visual presentation (Ward, Xie, Yang, & Rundensteiner, 2011). Visual analytics quality assessment methods help analysts in selecting, transforming and mapping data and in refining its quality. In addition, the tools supporting data analysis must themselves be of high quality in order to be effective. In particular, the visualization of information needs to be designed with different quality principles or requirements in mind.

The requirements comprise accuracy, completeness, sufficiency, intuitiveness, and responsiveness. It is important to understand that achieving all the quality principles is increasingly challenging. Furthermore, important quality features like certainty, correctness, confidence, and overload need to be well thought out in the design of visual analytics technologies. These challenges for designing visual analytics tools can be addressed in the context of accuracy, completeness, certainty, consistency, or any combination of these (Ward et al., 2011). All these concepts of data quality can be structured into four phases: data collection, data transformation, graphical mapping and display (Ward et al., 2011).

In the first phase, during the data collection process, a range of different errors can be detected. Errors may comprise missing data, measurement error caused by the quality of the device, human and device error in data entry (Pang, 2001), inconsistency of the dataset (Ward et al., 2011), etc. During the second phase, the transformation process, information can be lost due to smoothing, filtering, sampling, dimensionality reduction and clustering (Josko & Ferreira). The third phase, graphical mapping, includes the transformation of numerical or nominal values into graphical display entities. In this process, problems such as the limitations of human perception, overutilization of graphical attributes, misinterpretation of the visualization, etc. may arise. The last phase, display, comprises the graphical rendering of the graphical mapping result. Errors in this phase involve visual clutter, such as overlapping visual entities or a sheer number of entities, which may greatly affect the ability to perceive patterns in the visualization.

As discussed above, there are a large number of quality-related problems, and addressing each of them may be very difficult. To structure the process of tackling quality issues, it is important first to identify and prioritize the main quality issues and later to develop a general methodology. The general methodology makes the exploration process more effective and transparent by revealing and incorporating the quality features as part of the display. This process of exploration supports informed decisions by giving decision makers good control over the transformations and the type of data they work with. Furthermore, these quality measures support analysts in the process of selecting, transforming, and mapping (Ward et al., 2011). Ward et al. (2011) proposed a general methodology that enables the effective exploration of each type of quality. The general methodology consists of five steps that help design “quality-aware visualization”.
The first step is designing and implementing metrics that help evaluate the quality of a visualization. It poses questions such as: Which metrics can be used? Which metrics best convey information? Are the metrics validated? The quality of a visualization can be measured using different techniques. This brings us to the second step, which is developing customized display techniques that help convey the quality to the analyst. The third step comprises assessing the measures against the analyst’s judgments of perceived quality. The fourth step is allowing analysts to interactively modify aspects of the stage to enhance quality, trading off between possibly conflicting quality measures. Finally, the fifth step is developing automated methods that modify the stage to enhance the quality, whenever possible.
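A minimal sketch of what a metric in the first step could look like is given below. It computes completeness, one of the quality dimensions named above, as the share of records with a non-missing value per field. The function name, the field names and the sample GPS records are hypothetical, and Ward et al. (2011) do not prescribe this particular metric.

```python
from typing import Dict, List, Optional

Record = Dict[str, Optional[float]]


def completeness(records: List[Record], fields: List[str]) -> Dict[str, float]:
    """Share of records with a non-missing value, per field (1.0 = complete)."""
    if not records:
        return {f: 0.0 for f in fields}
    return {
        f: sum(1 for r in records if r.get(f) is not None) / len(records)
        for f in fields
    }


# Hypothetical GPS records; None marks a missing measurement.
gps = [
    {"t": 0.0, "lat": 50.93, "lon": 5.34},
    {"t": 5.0, "lat": None, "lon": 5.35},
    {"t": 10.0, "lat": 50.94, "lon": None},
    {"t": 15.0, "lat": 50.95, "lon": 5.36},
]
scores = completeness(gps, ["t", "lat", "lon"])
```

A score of this kind could then be mapped to a visual attribute in the second step, so that the analyst sees where the data are incomplete.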

6.5 Visual analytics methods of data exploration

Visualization of data plays an important role in data exploration. Yet, when huge collections of movement data need to be explored and analyzed, visualization alone becomes insufficient. This is because of technical limitations and the limits of natural human perception and reasoning (N. Andrienko & Andrienko, 2007). Thus, in dealing with huge movement data it is important to join visualization techniques with computational techniques, which is the main concept of visual analytics. Exploration and analysis of large movement datasets need to involve different database technologies, computational data processing and computational data analysis methods. An appropriate combination of these technologies and methods, coupled with visualization, helps increase the synergy between computer and human (N. Andrienko & Andrienko, 2007). In exploring huge movement data, the ultimate aim is to build a visual analytics platform where the challenges of exploring data are addressed by strengthening the synergy between the work of human and computer.

Visualization of information makes use of the predominant image processing competency of the human brain and plays a major role in enhancing human perception and reasoning resources. Likewise, it eases the search for information and enriches the recognition of patterns (Järvinen et al., 2009). Currently, it is a big challenge to extract and explore the desired and valuable information from huge movement datasets. Data exploration is mainly concerned with mining useful and valuable information from large datasets. In most literature, the process of data exploration is referred to as data mining. In data exploration, visualization of data is the most well-known mechanism that helps extract insights from datasets (Siddiqui, Kim, Lee, Karahalios, & Parameswaran, 2016). Visual analytics helps reveal the desired hidden patterns in large datasets (Järvinen et al., 2009; Siddiqui et al., 2016). Data exploration or mining remains a tedious process of trial and error (Siddiqui et al., 2016) in which the analyst or user has to go back and forth between the steps for possible modification. As shown in Figure 4, the process of data mining starts with selecting target data from the raw data, preprocessing the data and then transforming the data into an appropriate form. Afterward, the process of pattern creation follows.

Figure 4: The data mining process (Järvinen et al., 2009)

6.6 Pattern finding and matching

Patterns are created by running the targeted, preprocessed and transformed data through the data mining algorithm. The result of the data mining algorithm is evaluated and interpreted by the analyst or user, who re-examines the whole iterative process for possible adjustments to the raw data, the algorithm and the algorithm parameters (Järvinen et al., 2009). The process of finding patterns while mining a dataset is called analytical modeling. This activity of discovering patterns mainly comprises identifying possible and meaningful relationships between variables. According to N. Andrienko & Andrienko (2007), the main possible meaningful relationships include similarities such as:

• Similarity in characteristics: this might include the geometric shape of trajectories, travel distance, duration, etc.
• Co-location in space: this is the coexistence of objects, e.g. trajectories that are in the same location or share some part of it.
• Synchronization in time: this is a parallel change of the movement characteristics of objects at the same time or after some delay.
• Co-incidence in space and time: this is the event where objects reach the same location at the same time or after a delay.
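Two of the relationships listed above can be illustrated with a small Python sketch. It assumes trajectories given as lists of (t, x, y) records in planar coordinates, and the thresholds max_dist and max_delay are hypothetical parameters chosen by the analyst; the sketch is a brute-force illustration, not the method of the cited authors.

```python
from math import hypot
from typing import List, Tuple

Record = Tuple[float, float, float]  # (t, x, y)


def co_located(a: List[Record], b: List[Record], max_dist: float) -> bool:
    """True if some position of a lies within max_dist of some position of b (time ignored)."""
    return any(
        hypot(xa - xb, ya - yb) <= max_dist
        for _, xa, ya in a
        for _, xb, yb in b
    )


def coincident(a: List[Record], b: List[Record], max_dist: float, max_delay: float) -> bool:
    """True if both objects reach nearby positions at the same time or within max_delay."""
    return any(
        hypot(xa - xb, ya - yb) <= max_dist and abs(ta - tb) <= max_delay
        for ta, xa, ya in a
        for tb, xb, yb in b
    )


# Hypothetical tracks: both pass near x = 100, but about 490 seconds apart.
first = [(0.0, 0.0, 0.0), (10.0, 100.0, 0.0)]
second = [(500.0, 100.0, 5.0), (510.0, 200.0, 5.0)]
shares_place = co_located(first, second, max_dist=10.0)
meets_soon = coincident(first, second, max_dist=10.0, max_delay=60.0)
meets_later = coincident(first, second, max_dist=10.0, max_delay=600.0)
```

The pairwise comparison is quadratic in the number of records, so for large datasets a spatial index would be needed.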

6.6.1 Data mining models

After identifying the possible and meaningful relationships, these relationships are used to produce data mining models. These models are expressed as formulas or algorithms. The generated formula or algorithm is able to calculate the predicted value or probability for each record according to the data values of the record (Leventhal, 2010). There are mainly two types of data mining models in use: predictive and descriptive models (Järvinen et al., 2009; Leventhal, 2010). Each of them is explained below.

6.6.2 The predictive data mining model

This model makes use of variables to predict specific targets of interest (Järvinen et al., 2009; Leventhal, 2010). The best known and most significant predictive modeling techniques are regression and classification (Järvinen et al., 2009). Using regression techniques, the values of a continuous variable can be predicted based on other variables in the data, using linear or nonlinear models. Several techniques are in use, including multiple regression, which is used for predicting value data, and logistic regression, which helps in predicting response (Leventhal, 2010). Classification, on the other hand, is a commonly used method in which “a model for a class attribute as a function of the values of other attributes (Training set) is created. Then unseen records are assigned to the class. The accuracy of the models is evaluated with a test set” (Järvinen et al., 2009). Different techniques are in use, including decision tree-based methods, rule-based methods (Järvinen et al., 2009; Leventhal, 2010), neural networks, memory-based reasoning, naive Bayes and Bayesian belief networks, and support vector machines (Järvinen et al., 2009).
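The classification workflow quoted above (build a model from a training set, assign unseen records to a class, evaluate accuracy on a test set) can be sketched with a one-nearest-neighbour classifier, a simple form of memory-based reasoning. The features (mean speed, trip length), the travel-mode labels and all values are hypothetical and serve only to make the example runnable.

```python
from math import dist
from typing import List, Sequence, Tuple

Sample = Tuple[Sequence[float], str]  # (feature vector, class label)


def predict(train: List[Sample], features: Sequence[float]) -> str:
    """Assign the label of the closest training sample (1-nearest neighbour)."""
    nearest = min(train, key=lambda sample: dist(sample[0], features))
    return nearest[1]


def accuracy(train: List[Sample], test: List[Sample]) -> float:
    """Share of test samples whose predicted label equals the true label."""
    hits = sum(1 for features, label in test if predict(train, features) == label)
    return hits / len(test)


# Hypothetical training and test sets: (mean speed in km/h, trip length in km) -> mode.
train = [((4.5, 1.0), "walk"), ((5.0, 1.5), "walk"),
         ((16.0, 6.0), "bike"), ((18.0, 8.0), "bike")]
test = [((4.0, 0.8), "walk"), ((17.0, 7.0), "bike")]
score = accuracy(train, test)
```

Keeping the test set separate from the training set is what makes the accuracy an estimate for unseen records.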

6.6.3 The descriptive data mining model

This model is used for finding patterns that help understand the data in general, without any particular target variable (Järvinen et al., 2009; Leventhal, 2010). Descriptive modeling includes techniques like clustering analysis, which includes “grouping a customer database into a segment”, association rule discovery, which comprises identifying possible relationships between items (Järvinen et al., 2009; Leventhal, 2010), sequential pattern discovery and deviation detection (Järvinen et al., 2009).

6.7 Data transformations techniques for managing large data volumes

6.7.1 Aggregation and clustering

Besides detecting patterns using possible and meaningful relationships in movement data, there are also data transformation techniques that support pattern detection in movement data. Data transformation techniques play a major role in changing complex data into a form that is understandable and efficient, so that users can extract valuable information easily. The process mainly includes cleaning and manipulation of data. The most commonly used data manipulation methods are aggregation and clustering.

Aggregation refers to generalizing the data and emphasizing the features that are relevant to examine. The degree of aggregation mainly depends on the objective of the data analysis. In addition, aggregation is not simply about finding a compromise between simplicity and information loss. Rather, it is important to keep in mind the size of the resulting dataset, the amount of information ignored and the scale of the considered data in order to retrieve meaningful results (N. Andrienko & Andrienko, 2007). Tools for visual analysis offer the analyst different classes of aggregation. Aggregation mainly consists of grouping single data items, meaning allocating data items into subsets and deriving the characteristics of each subset from the individual characteristics of its members. The characteristics of the subsets are described using different statistical summaries (N. Andrienko & Andrienko, 2007).

The second most commonly used data manipulation method is clustering. Clustering also involves grouping and dividing data to ease pattern detection. Referential components, namely a set of objects and time, are the main components of movement data (N. Andrienko & Andrienko, 2007). In clustering movement data, one or both components can be divided depending on the data and the analysis objectives. Time can be divided or grouped on the basis of time intervals like time of the day, days, months, etc., and the set of entities can be divided or grouped in terms of the characteristics of the objects, like demographic data, characteristics of the objects' movement, etc. The division or grouping can be performed either with computational methods, whose results are often challenging to interpret, or with interactive methods like visualization, which are easier to interpret.
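As a minimal sketch of the temporal grouping described above, the following Python example allocates speed samples to subsets by hour of the day and describes each subset with statistical summaries. The function name, the choice of hourly intervals and the sample values are assumptions made for illustration.

```python
from collections import defaultdict
from statistics import mean
from typing import Dict, List, Tuple


def aggregate_by_hour(samples: List[Tuple[float, float]]) -> Dict[int, Dict[str, float]]:
    """Group (timestamp in seconds, speed) samples by hour of day and summarise each group."""
    groups = defaultdict(list)
    for t, speed in samples:
        hour = int(t // 3600) % 24
        groups[hour].append(speed)
    return {
        hour: {"count": len(values), "mean": mean(values), "max": max(values)}
        for hour, values in sorted(groups.items())
    }


# Hypothetical samples: two in the morning (08:00-09:00), one in the evening (17:00-18:00).
samples = [(8 * 3600 + 60, 14.0), (8 * 3600 + 120, 16.0), (17 * 3600 + 30, 20.0)]
summary = aggregate_by_hour(samples)
```

Choosing a coarser interval (for example day of the week) reduces the size of the result but discards more information, which is the trade-off the text refers to.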

6.8 Multi-perspective analysis of movement

6.8.1 Trajectories

The path made by a moving object through space as it changes position is referred to as a trajectory. Time is a very important feature of a trajectory, since a path cannot be made instantly; it needs a certain amount of time to be made (Giannotti & Pedreschi, 2008). At this point, it is easy to notice that a trajectory comprises three important components, namely space, time and path. Combining these features, the term space-time path is used as a synonym for trajectory. For each time moment between the start and the end of the path, there is a position occupied by the entity. A trajectory therefore consists of pairs of time and location. These pairs are infinite in number because time is continuous, but in practice trajectories are represented as discrete sequences of “time-referenced locations”. As mentioned in Giannotti & Pedreschi (2008), in the process of understanding movements and collecting movement data, such sequences are recorded in different manners, such as:

• Time-based recording: the position of the moving object is recorded at fixed time intervals.
• Change-based recording: a record is made once the position of the moving object differs from the preceding one.
• Location-based recording: records are taken when the moving object is close to specific locations, such as places where sensors are mounted.
• Event-based recording: position and time are recorded when certain activities are performed by the moving object, such as using a mobile phone to make a call.

Besides these basic approaches, combinations of these approaches can also be used in the process of understanding movement and collecting movement data (Giannotti & Pedreschi, 2008).
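The difference between the first two recording strategies can be sketched by applying them to the same raw sequence of positions. The function names, the parameters interval and min_move and the sample track are hypothetical; real tracking devices implement these rules in their own ways.

```python
from math import hypot
from typing import List, Tuple

Record = Tuple[float, float, float]  # (t, x, y)


def time_based(track: List[Record], interval: float) -> List[Record]:
    """Keep the first record, then the next record at least `interval` seconds after the last kept one."""
    kept, next_tick = [], None
    for t, x, y in track:
        if next_tick is None or t >= next_tick:
            kept.append((t, x, y))
            next_tick = t + interval
    return kept


def change_based(track: List[Record], min_move: float) -> List[Record]:
    """Keep a record only when the position moved at least min_move from the last kept record."""
    kept: List[Record] = []
    for t, x, y in track:
        if not kept or hypot(x - kept[-1][1], y - kept[-1][2]) >= min_move:
            kept.append((t, x, y))
    return kept


# Hypothetical raw track: the object rests for two seconds, then moves once.
raw = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (2.0, 0.0, 0.0), (3.0, 3.0, 4.0), (4.0, 3.0, 4.0)]
by_time = time_based(raw, interval=2.0)
by_change = change_based(raw, min_move=1.0)
```

Time-based recording keeps records during the rest period, whereas change-based recording keeps only the records at which the position actually changed.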

6.8.2 Visualizing trajectories

Interactive maps and space-time cubes are the most commonly used display types for visualizing movements of discrete objects. Maps display the trajectories of discrete objects in static or animated form, while interactive space-time cubes represent trajectories with linear symbols (N. Andrienko & Andrienko, 2013). As seen in Figure 6, at the bottom of the space-time cube the user can interact using the movable plane that corresponds to time. A space-time cube is a three-dimensional representation of information about the trajectory that comprises two horizontal dimensions representing the spatial components and a vertical dimension that stands for the temporal component. The vertical, temporal axis represents the time intervals. Depending on the nature of the movement and the objective of the analysis, trajectories can be divided in many different ways (G. Andrienko et al., 2013). The following two examples, Figure 5 and Figure 6, show the movements of ships in the North Sea. The data were collected by the Netherlands Coastguard, and the Maritime Research Institute Netherlands used the data for safety assessment studies with respect to shipping.

Figure 5: Example of an interactive map of trajectories (N. Andrienko & Andrienko, 2013)

The trajectories of ships are represented by lines, with hollow squares representing the start of a trajectory and filled squares representing its end. The lines are colored according to ship type; furthermore, the numbers in brackets show the count of trajectories for each type of ship.

Figure 6: Example of an interactive space-time cube with trajectories (N. Andrienko & Andrienko, 2013)

The interactive space-time cube shows all the ship trajectories. The colors represent the different types of ships. In addition, the movable plane at the bottom of the space-time cube corresponds to time.

6.8.3 A single trajectory of a single object

To illustrate the concept of a trajectory and its components, an example of a tracked moving object is presented. Suppose that a cyclist was tracked using a GPS tracking system for an observation period of only one day. The GPS device thus recorded a number of positions of the cyclist during this one day. From the recorded time-referenced position data alone, the extraction of meaningful information is quite limited, since the data only provide a number of coordinate points. However, the geographical positions can be projected on a base map, which offers a spatial context and helps interpret the data spatially. Afterward, a line is formed by joining consecutive recorded positions. The line overlaid on the map is the continuous representation of the discrete data and is used to represent the one-day trajectory of the cyclist. Joining the consecutive positions does not introduce significant error since the time gap between records is very short.
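Joining consecutive positions also makes it possible to estimate the length of the trajectory. The sketch below sums great-circle (haversine) distances between consecutive (latitude, longitude) records; the spherical Earth radius of 6,371 km and the function names are assumptions of the example, and the result underestimates the true path wherever the cyclist deviated from the straight segment between two records.

```python
from math import asin, cos, radians, sin, sqrt
from typing import List, Tuple

EARTH_RADIUS_M = 6371000.0  # mean Earth radius, spherical approximation


def haversine_m(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance in metres between two points given in degrees."""
    phi1, phi2 = radians(lat1), radians(lat2)
    dphi = phi2 - phi1
    dlambda = radians(lon2 - lon1)
    a = sin(dphi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(dlambda / 2) ** 2
    return 2 * EARTH_RADIUS_M * asin(sqrt(a))


def path_length_m(positions: List[Tuple[float, float]]) -> float:
    """Length of the polyline that joins consecutive (lat, lon) positions."""
    return sum(haversine_m(*a, *b) for a, b in zip(positions, positions[1:]))


# Hypothetical positions spaced one degree of latitude apart along a meridian.
length = path_length_m([(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)])
```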

Figure 7: Example of a single trajectory (G. Andrienko et al., 2013)

Furthermore, in this particular example, the generated line trajectory of the cyclist, bearing in mind GPS accuracy issues, may match the bike network on the base map. The geographical location of the trip and the network used to perform the trip then become easy to identify, which assists the identification of the trip. However, the trip identified on the map includes the spatial information but misses an important component, namely the temporal component of the data. Determining the time component is important to identify the direction of the trip and to detect whether the cyclist stopped or not. Therefore, space-time cube representations are used to represent the two components, space and time, together (G. Andrienko et al., 2013). With the help of the space-time cube representation, additional information like the direction of the trajectory (clockwise or anticlockwise) and the stops made by the cyclist can be easily retrieved. Furthermore, the duration of a stop can be read from the length of the straight vertical line segment: the longer the segment, the longer the cyclist stopped. A straight vertical line segment means that the spatial position of the cyclist stayed unchanged during the time interval.

The three-dimensional representation of the space-time cube can be communicative in terms of visualizing the trajectory; however, it is not convenient for displaying the space and time components together. Therefore, it is recommended to show the spatial and temporal components separately to ease understanding of the temporal and spatial information of the trajectory. Hence, maps can be used to express the spatial information, and other display methods like time graphs can be used to express temporal and time-dependent information. Using the time graph, other temporal information like the variation in the instantaneous speed of the cyclist through time can also be retrieved. Since neither representation, the map or the time graph, can be used alone to explore the spatial and temporal data together, it is very important to link these two representations (G. Andrienko et al., 2013). From the generated single one-day trajectory, it is difficult to conclude the purpose of the trip, because a single trajectory includes limited data. Thus, increasing the length of the observation period allows a better conclusion on the purpose of the trip.
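The speed values shown in such a time graph can be derived from the time-referenced positions themselves. The following sketch assumes planar coordinates in metres and timestamps in seconds (both assumptions of the example) and returns one speed value per pair of consecutive records; a value of zero corresponds to a vertical segment in the space-time cube.

```python
from math import hypot
from typing import List, Tuple

Record = Tuple[float, float, float]  # (t, x, y)


def speed_series(track: List[Record]) -> List[Tuple[float, float]]:
    """Return (time, speed) pairs, one per pair of consecutive records with increasing time."""
    series = []
    for (t0, x0, y0), (t1, x1, y1) in zip(track, track[1:]):
        dt = t1 - t0
        if dt > 0:  # skip duplicate or out-of-order timestamps
            series.append((t1, hypot(x1 - x0, y1 - y0) / dt))
    return series


# Hypothetical track: move 50 m, stand still for 10 s, move another 50 m.
track = [(0.0, 0.0, 0.0), (10.0, 30.0, 40.0), (20.0, 30.0, 40.0), (30.0, 60.0, 80.0)]
speeds = speed_series(track)
```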



6.8.4 Multiple trajectories of a single object

Figure 8: Example of multiple trajectories (G. Andrienko, Andrienko, Bak, Keim, & Wrobel, 2013)

We again take the example of a cyclist who has been tracked using a GPS tracking system. This time, consider that the position data were recorded not for one day but for several observation days. This dataset contains a single very long sequence of position records, which cannot be explored in detail in the way that is possible for a one-day trajectory. Representing this data in visual displays such as a map, space-time cube or time graph does not give valuable and clear information. For instance, if it is displayed on a map or in a space-time cube, the overlapping lines and the visual clutter make it very difficult to retrieve any valuable information. So in exploring such large datasets, interactive visualizations need to be coupled with computational processing. Computational aggregation is one of the commonly used approaches in dealing with such datasets. To come back to the example of the cyclist, the movement data can be spatially aggregated into flows that represent the intensity of the movement. By overlaying the aggregated intensity of the movement data onto a base map, a flow map can be generated. The intensity is calculated from the number of times the cyclist passed through the respective position.
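One simple way to obtain such intensities is to divide space into grid cells and count the directed moves between cells. The sketch below is a minimal illustration under that assumption; the square grid, the cell size and the function names are hypothetical, and the cited authors may use a different spatial partitioning.

```python
from collections import Counter
from typing import List, Tuple

Record = Tuple[float, float, float]  # (t, x, y)
Cell = Tuple[int, int]


def cell_of(x: float, y: float, size: float) -> Cell:
    """Index of the square grid cell of side `size` that contains (x, y)."""
    return (int(x // size), int(y // size))


def flow_counts(track: List[Record], size: float) -> Counter:
    """Count directed moves between different grid cells along the track."""
    cells = [cell_of(x, y, size) for _, x, y in track]
    flows: Counter = Counter()
    for origin, destination in zip(cells, cells[1:]):
        if origin != destination:
            flows[(origin, destination)] += 1
    return flows


# Hypothetical track: the cyclist goes from cell (0, 0) to cell (1, 0), back, and there again.
track = [(0.0, 10.0, 10.0), (1.0, 110.0, 10.0), (2.0, 10.0, 10.0), (3.0, 110.0, 10.0)]
flows = flow_counts(track, size=100.0)
```

Each count can then be drawn as a line between the two cells whose thickness is proportional to the count.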

In Figure 9, a flow map (a) representing the links between places in a spatial context is combined with time graphs showing time series of attribute values associated with the links: flow magnitudes (b) and average movement speeds (c).

Figure 9: Example of a flow map representing links between places in a geographical context and time (G. Andrienko et al., 2013)

The visual representation can use different line thicknesses and symbols to indicate directions. From the flow map, it is easy to identify where the cyclist moved often or only occasionally. Furthermore, the direction in which the cyclist moved can also be identified from the flow map.

6.8.5 Identifying significant places from trajectories

In identifying the purpose of the trip, the first step is identifying the places that are significant for the cyclist. These significant places can be identified by the number of stops made and by the duration of the stops. Therefore, the spatial positions of relatively long stops need to be identified in order to detect the significant places. A stop can be expressed in different ways: as a time gap between successive position records, as a sequence of records with very low speed values, or as a dense spatial cluster formed by successive spatial positions. Each identified stop is a spatial event, that is, a discrete physical or abstract entity that has a definite position in space and time (G. Andrienko et al., 2013). From this computational processing, it is possible to extract additional spatial events other than long stop events. A long stop event is represented as a point on the map, and when a number of stops happen at the same spot, the points tend to overlap. This makes it difficult to identify the number of points (stops) at a spot. To differentiate repeated stops from occasional stops, applying a clustering tool is necessary. The clustering tool clusters long stops according to the spatial distance between them. By identifying spatially dense clusters of stops, which are represented by points, the significant places can be identified. In addition, looking at the temporal distribution of the stops helps find out which place is what. Identifying the significant places assists in making a confident interpretation of the purpose of the trip.
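The two operations described in this section, extracting long stops and grouping them by spatial distance, can be sketched as follows. This is a deliberately simplified illustration: a stop is taken to be a run of records that stays within max_radius of the first record of the run for at least min_duration, and the greedy grouping stands in for the density-based clustering tools used in the literature. All names, thresholds and sample values are hypothetical.

```python
from math import hypot
from typing import List, Tuple

Record = Tuple[float, float, float]      # (t, x, y)
Stop = Tuple[float, float, float, float]  # (x, y, t_start, t_end)


def find_stops(track: List[Record], max_radius: float, min_duration: float) -> List[Stop]:
    """Find runs of records that stay within max_radius of the run's first record."""
    stops, i = [], 0
    while i < len(track):
        t0, x0, y0 = track[i]
        j = i
        while j + 1 < len(track) and hypot(track[j + 1][1] - x0, track[j + 1][2] - y0) <= max_radius:
            j += 1
        if track[j][0] - t0 >= min_duration:
            stops.append((x0, y0, t0, track[j][0]))
            i = j + 1
        else:
            i += 1
    return stops


def cluster_stops(stops: List[Stop], max_dist: float) -> List[List[Stop]]:
    """Greedy grouping: a stop joins the first cluster whose first member is within max_dist."""
    clusters: List[List[Stop]] = []
    for stop in stops:
        for cluster in clusters:
            if hypot(stop[0] - cluster[0][0], stop[1] - cluster[0][1]) <= max_dist:
                cluster.append(stop)
                break
        else:
            clusters.append([stop])
    return clusters


# Hypothetical track: a stop near (0, 0), a stop near (1000, 0), then a return to (0, 0).
track = [(0.0, 0.0, 0.0), (300.0, 1.0, 0.0), (600.0, 0.0, 1.0), (700.0, 500.0, 0.0),
         (800.0, 1000.0, 0.0), (1100.0, 1001.0, 0.0), (1500.0, 1000.0, 1.0),
         (1600.0, 500.0, 0.0), (1700.0, 2.0, 2.0), (2400.0, 2.0, 3.0)]
stops = find_stops(track, max_radius=10.0, min_duration=300.0)
places = cluster_stops(stops, max_dist=50.0)
```

In this example the place near (0, 0) collects two stops and the place near (1000, 0) one, so the first would be the candidate for a repeatedly visited, significant place.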



Figure 10: Example of clustering stops to identify significant places (G. Andrienko et al., 2013)

6.8.6 Identifying popular routes from trajectories

In the previous section, interpreting the purpose of the trip by discovering the significant places was discussed in detail. In this section, the process of identifying the routes of the movement is discussed. Trajectories that represent separate trips are essential in the process of identifying routes. However, in most datasets, including the recorded dataset of the cyclist in the example, the trajectory is a single movement track. To address this, tracks can be divided using the identified long stops. Defining the minimum duration of a long stop helps make the division of tracks uniform. Stops shorter than the defined minimum duration are not used to divide the trajectory into pieces. By splitting trajectories at the defined long stops, a number of trajectory pieces are obtained. After that, clustering analysis can be applied to identify the routes that were used frequently by the cyclist. Using clustering analysis, trajectories with matching routes are clustered together. Clusters are ranked by the number of trajectories with similar routes. After clustering, entities that are not allocated to any cluster are called noise (G. Andrienko et al., 2013). In this case, the noise consists of the pieces of trajectories that are not similar enough to other pieces. The clusters of trajectories by route similarity can then be visualized on a base map using trajectory lines. Looking at the identified clusters of trajectories together with the identified significant places, routes can be interpreted in terms of the purpose of the trip. To visualize the temporal features of the clusters, a space-time cube can be used. To retrieve effective information, it is important to visualize routes within a short time interval, such as one day. From this visualization, it is possible to identify the exact time of trips and the duration of trajectories. In obtaining this knowledge, computational processing is coupled with interactive visual interfaces. The obtained knowledge can be related to spatial and temporal contexts, which involves previous knowledge and reasoning (G. Andrienko et al., 2013).
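The division of a single track into trips at long stops can be sketched as follows. The sketch assumes that stops are already available as (start, end) time intervals and that min_duration is the analyst's threshold for a long stop; the function name and the sample values are hypothetical, and the subsequent route clustering is not shown.

```python
from typing import List, Tuple

Record = Tuple[float, float, float]  # (t, x, y)
Interval = Tuple[float, float]       # (t_start, t_end) of a stop


def split_into_trips(track: List[Record], stops: List[Interval], min_duration: float) -> List[List[Record]]:
    """Cut the track at stops lasting at least min_duration; shorter stops are ignored."""
    long_stops = [(start, end) for start, end in stops if end - start >= min_duration]
    trips: List[List[Record]] = []
    current: List[Record] = []
    for record in track:
        t = record[0]
        if any(start <= t <= end for start, end in long_stops):
            if len(current) > 1:  # a trip needs at least two records
                trips.append(current)
            current = []
        else:
            current.append(record)
    if len(current) > 1:
        trips.append(current)
    return trips


# Hypothetical track with two long stops and one short stop (740 to 760) that is ignored.
track = [(0.0, 0.0, 0.0), (600.0, 0.0, 0.0), (700.0, 100.0, 0.0), (750.0, 300.0, 0.0),
         (800.0, 500.0, 0.0), (1500.0, 500.0, 0.0), (1600.0, 300.0, 0.0), (1650.0, 100.0, 0.0)]
stops = [(0.0, 600.0), (740.0, 760.0), (800.0, 1500.0)]
trips = split_into_trips(track, stops, min_duration=300.0)
```

The resulting pieces are the units that a route-similarity clustering would then compare; pieces that match no cluster would be treated as noise.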



Figure 11: Example of Identifying most frequently used routes by clustering (N. Andrienko & Andrienko, 2013)

In the above figure, the trajectories of the ships have been clustered according to the destinations.

26


6.9 Visual analytics toolboxes and software stacks

Visualization of information is commonly realized by making use of specialized tools and software that are available either as open source or as proprietary products. According to Järvinen, Puolamäki, Siltanen, & Ylikerälä (2009), most of the available visual analytic tools and software packages can be grouped into eight categories: office tools, business intelligence tools, statistical and mathematical tools, visualization-related libraries and software packages, algorithmic tools, visual data mining tools, web tools and packages, and scientific visualization tools. Each category is discussed below.

The first category is office tools. These visualization tools are the most widely used and well known. The category primarily includes Excel, which can produce different representations such as bar charts and pie charts. The second category comprises business intelligence tools, which are mainly concerned with visualizing the current and future business status of companies; these tools usually work hand in hand with the company’s management system. The third category comprises the statistical and mathematical tools. As the name implies, it covers statistical computing and analysis. Examples include R and Matlab, which are able to produce time series, plots, bar charts, histograms, etc. A tool from this category, namely R, will mainly be used in the second part of the thesis.

The fourth category is visualization-related libraries and software packages. This category largely contains toolkits that produce rich interactive data visualizations, visualization programs that are capable of exploring high-dimensional data, and interactive multidimensional scaling software. Examples in this category include Prefuse, GGobi, XGVis, QGIS, etc. The fifth category includes algorithmic tools that produce graphs. These tools are usually developed by research communities on the basis of some algorithm. A good example is Graphviz. Visual data mining tools form the sixth category, which is concerned with disclosing hidden patterns in data sets by creating visuals.



The seventh category is web tools and packages. These are tools available on the web that are used to create visualizations from different kinds of data. An example is Many Eyes. Finally, the last category is scientific visualization tools, which are mainly concerned with modeling complex physical phenomena.

Table 2: Visual analytics toolboxes and software stacks

| Category | Examples |
| --- | --- |
| Office tools | The most familiar and used visualization tool is Excel with its bar charts and pie representations. |
| Business intelligence tools | Offering visualizations of the business status and the future for enterprise management, often connected to the company enterprise management system. |
| Statistical and mathematical tools | Statistical analysis has a long history of visualizing the results as time series, bar charts, plots, and histograms. Examples of tools providing statistical and mathematical visualization are R and Matlab, tools for statistical computing and graphics. |
| Visualization-related libraries and software packages | Prefuse, a visualization toolkit for creating rich interactive data visualizations; GGobi, an open source visualization program for exploring high-dimensional data; XGVis, an interactive multidimensional scaling (MDS) software; QGIS, a software package that comprises different visualization-related plugins. |
| Algorithmic tools | Developed by the research communities based on some algorithm. An example is Graphviz for drawing graphs. |
| Visual data mining tools | Visual data mining creates visualizations to reveal hidden patterns from data sets. The need of new methods in data analysis has launched the field. Several products are on the markets, often focused on “Business intelligence” analyses such as marketing, risk analysis, selling and customer management. The field is closely related to visual analytics. |
| Web tools and packages | An increasing number of tools is available on the web, either open source packages for download or on-site use. With the tools, users can create more or less fancy visualizations from data. See for example Many Eyes, an IBM application for social data analysis. |
| Scientific visualization tools | Tools for modeling some complicated physical phenomenon. |

6.10 Visual analytics models

At this point, it is clear that visual analytics addresses the challenge of assessing and processing complex data. Realizing the visual analytics approach requires visualization models that predict how computational processes facilitate human understanding and guide human reasoning (Green et al., 2009). In this section, the human cognition model and the van Wijk visual analytics model are discussed.



6.10.1 Human cognition Model

In the human cognition model, discovery is the main process, in which computers or machines present information in an ontology-like structure contained in a relevant and human-defined context. One of the methods for easing the human cognitive burden caused by overwhelming data that are difficult to understand is to present information within a relevant and human-defined context. This helps the human to interact directly with the presented or visualized information. Such intuitive multi-model visualizations facilitate the construction of knowledge through the formation of new knowledge. During the discovery process, the human may find new relationships between two ideas or concepts that previously had no relation. The newly discovered relationship between two unrelated ideas can spread and, once interpreted, extend the knowledge base of the visualization. Computer-aided discovery can help the machine extend the information discovery capacity (Green et al., 2009). The general process of the human cognition model is presented in the diagram below (see Figure 12).

Figure 12: The HCM process (Green, Ribarsky, & Fisher, 2009)



6.10.2 The van Wijk Model

The van Wijk model, or the sense-making loop, is the most commonly used visualization model (see Figure 13). Van Wijk describes the user in terms of perception and cognition, knowledge, and exploration, as shown in Figure 13 below. The whole knowledge discovery process that is supported by visual analytics is structured into the model. After the user perceives the “image”, the user applies different manipulation techniques (“specifications”) to interact with the visualization. The model explains “perception” as providing knowledge that initiates interactive exploration. Depending on the nature of the dataset, different analyses can be made by using different statistical and mathematical techniques. The process can then pass through a loop in which knowledge of the data is gained, and the outcome can furthermore be evaluated and measured in terms of gained knowledge.

Figure 13: The van Wijk visualization model (Keim et al., 2008)



In the following section, a multidisciplinary approach of visual analytics methods is applied to the analysis of movement data by coupling the benefits of computation, database, and visualization. This thesis is within the frame of the Track&Know project, a cooperative research and innovation project whose main target is to introduce original approaches and technologies in the big data-driven environment. Although the thesis falls within the frame of the Track&Know project, an example dataset from the iSCAPE project is used in order to understand and demonstrate the effectiveness of the different visual analytics methods discussed in the sections above.

7 EXAMPLE DATASET

7.1 Basic information about the dataset

The dataset used for this thesis was obtained from one of the iSCAPE research projects that took place in the city of Hasselt, Belgium. The research objective is to implement and examine smart alternatives for reducing the impact of air pollution in cities. As one of the alternative ways of addressing air pollution, the project aims to use behavioral intervention: changing the activity patterns of city dwellers by proposing sustainable activity patterns that suit their purpose of travel. The data collection was conducted among the citizens of Hasselt, and the data collection method, tools, and instruments were developed within the scope of the iSCAPE project. The data collection method includes a web-based questionnaire covering different socio-demographic characteristics and the general perspective on pro-environmental behavior of the sample participants. In addition, data concerning the individual activity-travel routine were collected by means of an Android-based smartphone application assisted by a prompt-recall tool. At the very beginning of the behavioral intervention study, 53 sample participants took part; however, only 33 of them were able to finish the research. Sample participants were adults who possess a valid driving license and are regular car users. In recruiting the sample participants, the city of Hasselt played a major role in advertising the project. To acquire the activity-travel diary of individuals, sample participants were tracked through the smartphone application called SPARROWS. SPARROWS was developed by the transportation research institute IMOB of Hasselt University. The application records the GPS location of the individual, and the collected GPS data is further processed in terms of stops. To develop a comprehensive activity-travel diary, sample participants are asked to annotate the stops using a prompt-recall method. The annotation process includes identifying the purpose of the activity, the travel mode used, etc.

The dataset mainly comprises position data of the 33 sample participants of the iSCAPE project, who were tracked from the 9th of June to the 29th of June 2017. The dataset contains 1,048,575 positions. Each record consists of sample participant and GPS point identifiers, dates and times, geographical coordinates, speed, activity performed, and a few additional fields. The temporal spacing of records is regular, with 3 seconds between consecutive records. The positions were also recorded when the sample participants did not actually move. The dataset does not contain explicit trips with specified origins and destinations. There are also no semantically defined places but only geographic coordinates. Therefore, trips and places have to be extracted from the data by means of analysis.



8 VISUAL ANALYTICS METHODS OF EXPLORING AND ASSESSING THE DATASET

Data exploration starts with selecting target data from the raw data, preprocessing the data and then transforming the data into an appropriate form.

Figure 14: Home distribution

By looking at the home stops, demographic data such as living location can be explored. Figure 14 shows that the sample participants are evenly distributed over the region. In addition, sample participants came not only from Hasselt but also from neighboring municipalities such as Genk, Diepenbeek, Alken, Zonhoven, Herk-de-Stad, etc.

8.1 A Single trajectory

In order to understand a single trajectory, I will consider the records observed during one day of the observation period. A sample participant and one day of that participant’s records were selected randomly: sample participant number 1 and the 9th of June 2017, a day that consists of 1510 recorded positions. A small number of them are shown in Figure 15.



Figure 15: GPS-measured positions of sample participants

Yet, these recorded numbers do not give sufficient information. In order to start making sense of the numbers, the geographical positions can be projected on a base map. By doing so, the information gains a spatial context, which helps in interpreting the data. It is important to project the geographic positions using the correct projected coordinate system. In this case, Figure 16 shows the position records projected as points using the “Belgian Lambert 72” coordinate system.

Figure 16: Projected position points



After projecting the points on the base map, the next step was to generate a trajectory of the sample participant. The trajectory is constructed by linking successive points with a line segment. Here it is important to notice that the constructed line is a continuous representation of discrete data. Constructing the line does not introduce major errors, since the time gap between successive position records is very small. This is evident on the map: the constructed trajectory fits well with the existing street network represented on the base map. In Figure 17, the trajectory of the selected sample participant on the 9th of June 2017 is represented on the base map as a line. The start and the end of the trajectory are marked by a hollow red rhombus on the map. Looking at the trajectory, it is easy to assume that it corresponds to a round trip. In addition, from this map it is also possible to identify the geographic location of the trip as well as the street network used during the trip.

Figure 17: Constructed single trajectory
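The construction described above can be illustrated with a minimal Python sketch. It assumes the positions are available as (latitude, longitude) pairs in degrees, ordered by time; the function names and the 100 m round-trip tolerance are illustrative assumptions, not part of the thesis workflow, which was carried out in GIS software.

```python
import math

def haversine_m(p, q):
    """Great-circle distance in meters between two (lat, lon) points given in degrees."""
    earth_radius_m = 6371000.0
    lat1, lon1, lat2, lon2 = map(math.radians, (p[0], p[1], q[0], q[1]))
    a = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * earth_radius_m * math.asin(math.sqrt(a))

def build_segments(points):
    """Link each position to its successor, giving the line segments of the trajectory."""
    return list(zip(points, points[1:]))

def trajectory_length_m(points):
    """Total length of the trajectory as the sum of its segment lengths."""
    return sum(haversine_m(a, b) for a, b in build_segments(points))

def is_round_trip(points, tolerance_m=100.0):
    """Treat the trajectory as a round trip if it ends close to where it started (assumed tolerance)."""
    return len(points) > 1 and haversine_m(points[0], points[-1]) <= tolerance_m
```

Because the records are only 3 seconds apart, the straight segments stay short, which is why the constructed line follows the street network closely.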



Nevertheless, the map alone cannot provide information other than the spatial components; the temporal components, for instance, are missing. Therefore, from the map it is impossible to answer questions such as: Did the sample participant move clockwise or counter-clockwise? Did the sample participant stop during the trip? What was the speed of the sample participant? It is therefore important to find an alternative method that represents both the space and time components in order to answer these and other space-time related questions. In this thesis, animated maps were used to address this issue. The animated map consists of a video that runs on the time stamps of the positions. As shown in the video frames (see Figure 18), the position points are displayed on the base map in their order of appearance. The sample participant started the one-day closed trajectory in the afternoon and ended it in the evening. The video frames are taken at intervals of 30 minutes, starting at 14.23 (GMT+1), when the sample participant was at the home location, and finishing at 21.23 (GMT+1), again at the home location. Using the animated map, it is possible to answer space-time related questions, for instance about the direction of movement. Looking at the movement of the points through time, it is easy to notice the direction of movement: the sample participant was moving counter-clockwise.





Figure 18: Series of animated map of a single trajectory
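The direction of movement read from the animation can also be checked numerically. The sketch below is an illustrative alternative, not the method used in the thesis: it applies the shoelace formula to time-ordered points in a projected coordinate system (x to the east, y to the north, as in Belgian Lambert 72) and reads the direction from the sign of the enclosed area. It assumes a simple closed loop; for a trajectory that crosses itself the sign only reflects the dominant orientation.

```python
def signed_area(points):
    """Shoelace formula over a closed loop of (x, y) points; the sign encodes the orientation."""
    total = 0.0
    n = len(points)
    for i in range(n):
        x1, y1 = points[i]
        x2, y2 = points[(i + 1) % n]  # wrap around to close the loop
        total += x1 * y2 - x2 * y1
    return total / 2.0

def loop_direction(points):
    """Classify a time-ordered closed trajectory by the sign of its enclosed area."""
    area = signed_area(points)
    if area > 0:
        return "counter-clockwise"
    if area < 0:
        return "clockwise"
    return "undetermined"
```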

Furthermore, it was also possible to see whether the sample participant stopped during the trip. The sample participant stopped in several places. This was noticed because the points are displayed at more or less the same spatial position during a time interval. Here it is important to note that, due to GPS accuracy, successive position points are unlikely to appear in exactly the same place. It was also possible to determine the duration of each stop and the place where it occurred by looking at the start time and end time of the stop. An animated map can also display other fields present in the dataset, such as speed. In Figure 18, the speed is shown with a magma gradient from 0 meters per second (black) to 24 meters per second (yellow), and the recorded speed is displayed as the points appear on the base map. The series of frames of the animated map, taken at 30-minute intervals, is presented in Figure 18. The trip started at 14h58m54s in the afternoon and ended at 21h53m36s. The assumption made earlier that the trajectory is a round trip is justified, since the sample participant started and ended the journey at the same spot, which at this stage can roughly be assumed to be his or her home location.
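The visual cue used above, points staying at nearly the same position for some time, can be turned into a simple rule. The following Python sketch is an assumption-laden illustration and not the stop detection algorithm shipped with the dataset: it expects time-ordered `(t_seconds, x, y)` tuples in projected meters, and the 50 m radius and 5-minute minimum duration are hypothetical parameters chosen to absorb GPS jitter.

```python
import math

def detect_stops(track, radius_m=50.0, min_duration_s=300.0):
    """Find runs of points that stay within radius_m of their first point for at least min_duration_s."""
    stops = []
    i, n = 0, len(track)
    while i < n:
        t0, x0, y0 = track[i]
        j = i + 1
        # Extend the run while the following points remain close to the anchor point.
        while j < n and math.hypot(track[j][1] - x0, track[j][2] - y0) <= radius_m:
            j += 1
        run = track[i:j]
        if run[-1][0] - t0 >= min_duration_s:
            stops.append({
                "start": t0,
                "end": run[-1][0],
                "x": sum(p[1] for p in run) / len(run),  # centroid of the run
                "y": sum(p[2] for p in run) / len(run),
            })
            i = j      # continue after the stop
        else:
            i += 1     # not a stop, try the next point as anchor
    return stops
```

The start and end times of each detected run give the stop duration directly, and the centroid gives a single representative location for the jittering points.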



Figure 19: Speed-time graph of a single trajectory (horizontal axis: time, from 09:40 to 23:43; vertical axis: speed in m/s, from 0 to 25)

8.1.1 Extraction of significant places

In the process of identifying significant places, the first step is to determine the stops. Besides the animated map, clustering of position records was also used to determine the stops. Using the interactive clustering method, it was possible to visualize the stops. This way of visualizing is a quick way of exploring the data. The method is interactive in the sense that the number of clustered points changes as the analyst zooms in and out. On top of each cluster, the number of points it contains is displayed. Among the clustered position points, two stops stand out with a high number of clustered points. At this point we can already guess that these two locations might be the home and work locations. However, this guess cannot be made with much confidence, since the movement needs to be analyzed over a longer period to reach a firmer conclusion.



Figure 20: Cluster of position records

However, the dataset already includes detected stops, which were determined using a stop detection algorithm. The stop detection algorithm basically looks at the last few position records and carries out spatial clustering. The recorded position points are treated in two runs. The first run finds stops, marks each position record as a member of a stop or not, and assigns the number of the stop it belongs to. The second run cleans the stops and substitutes each clustered set of points with a single representative point (Adnan Muhammad, Ahmed Shiraz, 2017).

For the purpose of extracting significant places, different methods can be used. The first one is an aggregation of identified stops. This method works better for multiple observation days than for a single day. The result from a single day is not reliable, since the number of stops identified in one day is small. The second method, and the one relevant here, is identifying stops based on the duration of the activity. Generally, it is believed that the time spent in a particular place implies the significance of that place (Andrienko, Andrienko, Bak, Keim, & Wrobel, 2013). This means that people spend much time at their home and work locations and considerable time on other activities such as shopping, recreation, using services, etc. However, this general assumption does not always hold. For instance, bringing children to school and picking them up may take a relatively short time, whereas standing in traffic may take a rather long time. Given these shortcomings of the method, it is important that the analyst examines the identified long-duration stops and determines whether they are significant or not. This can be done by overlaying the identified long-duration stops on a base map of the area, identifying where each stop is located, and identifying what other objects are in the neighborhood (Andrienko, Andrienko, & Wrobel, 2007).

Figure 21: Identified stops based on the duration of activity

To facilitate this analysis, the first step was to insert an additional field in the dataset that holds the duration of each stop. Each stop contains a start time and an end time. The duration of a stop is obtained by simply subtracting the start time of the stop from its end time. To visualize the result, the stop points were displayed on a 3D interactive plane of the base map, using the QGIS 3D viewer plugin QGIS2threejs. In order to show the duration of each stop, each stop point was extruded by its own duration value, and for visualization purposes the values were multiplied by 2. From this process, it is easy to notice two points (the extruded red and green points). The same result was obtained by filtering the stops and extracting those with a duration of more than one hour. Filtering by stop duration is the preferred method for larger datasets, such as when identifying stops over multiple days. Generating a 3D interactive representation for multiple trajectories or larger data becomes computationally more demanding, with a lot of stops to plot. In this process, two significant stops are clearly identified. After identifying these significant stops, the next step is to identify the purpose of the stop, that is, the activity. As discussed earlier, the data includes activities identified using the prompt-recall method. The activities are defined as follows:

Table 3: Identified activity types

Identified activity types: Home; Work; Business; Bring or get; Shopping; Leisure; Social; Transfer or Waiting; Parking; Service; Same as previous; No activity; Other; None or unidentified.

Examples of activity type: Watching TV, house chores, sleeping, etc.; Working at the office, etc.; Customer visit, etc.; Drop off children at school, etc.; Groceries, etc.; Playing sports, etc.; Visiting friend, etc.; Waiting for public transport, etc.; Visit a doctor, etc.; Standing for a red light, etc.; Refuel, etc.

Besides the activity filled in by the sample participants through the prompt-recall method, several other methods can be used to identify the activity. These methods include overlaying the identified stops on databases of points of interest and land use, and applying algorithms that describe the type of activity using the start and end time of the stop.



Figure 22: Identified stops overlaid on the databases of Point of interest (OpenStreetMap, 2018)

In some cases, the locations where people are likely to stop, also known as “points of interest” (POI), are well defined and available as a database. In order to interpret the significant stops, a POI database was overlaid on the identified significant stops. However, neither of the two stops could be identified using the POI database. This might be because the concept and perception of a point of interest for an individual may differ from the public points of interest. From this result, it was already clear that the identified significant stops are not public points of interest. To identify the significant places, the points were projected onto Google Earth Pro, which gave a clear and interactive image of what each identified significant place is. From the projected significant stops, it was possible to see that both stops took place at a residential place.
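The overlay of stops on a POI database can be approximated by a nearest-neighbor lookup. The sketch below is illustrative only: the POI tuples `(name, x, y)`, the projected coordinates in meters, and the 50 m matching radius are all assumptions, and a real POI layer (for example from OpenStreetMap) would need to be loaded separately.

```python
import math

def nearest_poi(stop_xy, pois, max_distance_m=50.0):
    """Return the name of the closest POI within max_distance_m of the stop, or None if there is none."""
    best_name, best_dist = None, max_distance_m
    for name, x, y in pois:
        d = math.hypot(x - stop_xy[0], y - stop_xy[1])
        if d <= best_dist:
            best_name, best_dist = name, d
    return best_name
```

A result of `None` corresponds to the situation described above, where a significant stop does not coincide with any public point of interest and has to be interpreted by other means.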



Figure 23: A single-day total duration of activities (time in hours per activity type: Home 16.52, Social 4.22, Shopping 0.37, Leisure 0.12, No activity 0.00)

In the prompt-recall data, the two stops are also labeled as home (red point) and social (green point) activities. Looking at the single trajectory of a single sample participant was helpful in understanding and exploring the dataset. Yet a single trajectory does not tell the full story, and the information it provides is too limited to draw conclusions about the trajectories of the sample participant. In order to understand the full story of the sample participant, it is important to include multiple trajectories of the sample participant, which takes us to the next section on multiple trajectories of a single object.

8.2 Multiple trajectories of a single object

In order to understand multiple trajectories of a single object, the previous sample participant was again chosen from the whole dataset, but this time the entire study period was taken into consideration. The sample participant was tracked and the data was collected from the 9th of June 2017 until the 29th of June 2017. Exploring the data in fine detail is less feasible for such relatively large data. Following the same data exploration process that was used for the single trajectory did not yield valuable information. For instance, displaying the geographic positions on a base map (see Figure 24: left) does not give useful information. Likewise, constructing lines from consecutive position points (see Figure 24: right) does not provide worthwhile information. The overlapping position points and lines do not allow convenient findings. They provide quite limited spatial information and no temporal information at all. In order to capture the space-time component of the dataset, we can again use an animated map, which provides the spatial and temporal information at the same time. Before starting the visual analytics, some data quality assessment, data preprocessing and data exploration were carried out.

Figure 24: Projection of position records (left) & constructed trajectory from point records (right)

8.2.1 Data preprocessing and exploration

To facilitate the analysis, some initial preprocessing of the database was carried out. This preprocessing includes extracting the sample participant’s data from the whole dataset to ease data manipulation and improve processing performance. In addition, the extracted data was split into one-day datasets, which again helps data manipulation and exploration. Furthermore, the data was enriched with an additional stop duration field, obtained by simply subtracting the start time of each stop from its end time. These operations may take some time, but they are done only once. For the data exploration, the first step was to project the position records on a base map with the right coordinate system and check whether the locations of the points make sense. After the data preprocessing, it was possible to look at the point records of each day. From the point record projection of each day, it was already possible to imagine the trajectories of the sample participant by connecting the consecutive position points.
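As a minimal sketch of these preprocessing steps, the following Python functions split records into one-day subsets and add a stop duration field. The dictionary keys `timestamp`, `start_time`, `end_time` and `duration_h` are hypothetical names chosen for illustration; the actual column names of the dataset are not given here.

```python
from collections import defaultdict
from datetime import datetime

def split_by_day(records):
    """Group records by calendar day using their 'timestamp' field (a datetime)."""
    by_day = defaultdict(list)
    for record in records:
        by_day[record["timestamp"].date()].append(record)
    return dict(by_day)

def add_stop_duration(stops):
    """Return copies of the stops with an added 'duration_h' field: end time minus start time, in hours."""
    enriched = []
    for stop in stops:
        duration = stop["end_time"] - stop["start_time"]
        enriched.append({**stop, "duration_h": duration.total_seconds() / 3600.0})
    return enriched
```

Returning new dictionaries instead of modifying the input keeps the original records untouched, so the enrichment can be repeated or changed without reloading the data.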



8.2.2 Data quality assessment

While constructing the trajectories for each day of the tracked position data, some data gaps became visible. In the records collected on five days, namely the 13th, 15th, 16th, 20th and 28th of June 2017, the recorded points are missing for some parts of the trajectories. One example of this type of data gap is presented below (see Figure 25). The reason behind such a data gap could be a power shortage of the data collection device, sample participants forgetting to turn on the location tracker, technical difficulties, etc. These trajectories are not considered in the trajectories presented in Figure 24, nor in the kernel density analysis (see Figure 35). In dealing with data gaps of this kind, observing other complete trajectories and the transportation network can contribute to the data correction.

Figure 25: Identified data gap on 28/06/2017
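Gaps of this kind can also be flagged automatically, under the assumption that a gap shows up as missing time stamps. Since records are expected every 3 seconds, the Python sketch below reports every pair of consecutive time stamps that are much further apart than that. The tolerance factor of 10 (that is, 30 seconds) is an illustrative assumption, not a value from the thesis.

```python
from datetime import datetime, timedelta

def find_gaps(timestamps, expected=timedelta(seconds=3), tolerance=10):
    """Return (before, after) pairs of consecutive time stamps more than tolerance * expected apart."""
    limit = expected * tolerance
    ordered = sorted(timestamps)
    return [(a, b) for a, b in zip(ordered, ordered[1:]) if b - a > limit]
```

Each reported pair marks the last record before the gap and the first record after it, which can then be inspected on the map against the street network.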



Furthermore, the exploration of the trajectories showed that on 2 days of the study period (the 23rd and 24th of June 2017) the sample participant stayed at the home location for the whole day. In general, position measurements are never perfectly precise: several measurements recorded at exactly the same point in space typically fail to coincide (Andrienko et al., 2013). From the collected points and the constructed trajectories, which are all at the sample participant’s home location (see Figure 26: right), it can be noticed that the GPS error is not significant. Most of the collected position measurement points stayed within the residential plot (see Figure 26: left).

Figure 26: Sample participant staying the whole day at home location

As discussed earlier, the dataset includes detected stop points, which are identified using the stop detection algorithm, and the activities performed at the detected stop points are identified using the prompt-recall data collection system. In the next sections of the thesis, other visual analytics methods for stop extraction, identification of the significance of the detected stops, identification of the activities performed at the detected stops, and identification of heavily used routes are discussed in detail. These sections also discuss the advantages and limitations of the different visual analytics methods.

8.2.3 Extraction of stops

One of the ways of detecting stops is to make a spatial clustering of the recorded position points. The occurrence of several position points at a particular place can generally be defined as a stop. To perform the spatial point clustering, QGIS software was used. The point clustering function helps aggregate points that are in close proximity to each other. The output of this clustering can be a map of clustered points obtained by specifying a fixed distance between points to be aggregated. Alternatively, the output can be an interactive map in which the number of points clustered varies with the zoom level. As the analyst zooms out, the number of points per cluster increases because the clustering distance between points becomes larger, and vice versa. The interactive feature of the map helps the analyst to detect and analyze stops at different scales and levels. Another benefit of this method is that the analysis is easy to perform and not time-consuming. In addition, the interactive map presents the number of clustered points on top of each cluster.

Figure 27: Spatial point clustering

The spatial point clustering method also has some shortcomings. For instance, defining the distance between points to be clustered is difficult. The same problem occurs with the interactive map: as the analyst zooms out, the distance between points to be clustered becomes too large. The map (see Figure 27) shows a spatial point cluster of all recorded points from the whole observation period of the selected sample participant. After the stops have been identified, the next step is to identify which of the detected stops are significant.
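The effect of the zoom level on the clusters can be illustrated with a simplified stand-in for the QGIS point cluster renderer. The sketch below bins projected points into a square grid instead of using distance-based clustering, and ties the cell size to a zoom level with a hypothetical rule (the cell size halves with every zoom step, starting from an assumed 12,800 m); neither choice reflects how QGIS works internally.

```python
from collections import Counter

def grid_clusters(points, cell_size_m):
    """Count points per square grid cell; returns a mapping {(column, row): number of points}."""
    return Counter((int(x // cell_size_m), int(y // cell_size_m)) for x, y in points)

def cell_size_for_zoom(zoom, base_cell_m=12800.0):
    """Hypothetical rule: the clustering cell size halves with every zoom level."""
    return base_cell_m / (2 ** zoom)
```

With a small cell size, nearby stops remain separate clusters; with a large one they merge, which reproduces the difficulty noted above of choosing a suitable clustering distance.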



8.2.4 Identification of significant stops

8.2.4.1 Interactive Spatial cluster and filtering by stop duration

Identification and interpretation of significant stops or places is one of the common tasks in analyzing movement data (Andrienko et al., 2007). As discussed earlier, analyzing movement data over a long period of time helps to draw a reliable conclusion about the significant places (Andrienko et al., 2013). One way of recognizing the significance of a place is by looking at a relatively high frequency of detected stops. The significance of stops can also be identified from the duration of the stop. Accordingly, there are different methods of identifying significant places: one is the spatial clustering of identified stops, and the other is identifying significant stops from the temporal gap between the start time and the end time of the stop. Both methods have advantages and disadvantages.

First, we look at the spatial clustering of stops. The method basically aggregates nearby and frequent stop points into dense spatial clusters in order to identify the significance of the detected stops. The main shortcoming of the spatial point clustering method is that it fails to consider the time component. On the other hand, it is well suited to revealing the frequency of stops independently of their duration. From the identified stop locations (see Figure 28), a spatial point cluster was made and significant stop points were identified. Moreover, the identified significant places can be further prioritized by looking at the number of points clustered: the higher the number of clustered detected stops, the higher the significance of the place, and vice versa.



Figure 28: Identified stop points

Figure 29: Spatial clustered stop points
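The prioritization of places by the number of clustered stops can be sketched as follows. This is a simple greedy clustering written for illustration, not the QGIS procedure used for Figure 29: each stop, given as projected (x, y) coordinates in meters, joins the first existing place whose running center lies within an assumed radius of 100 m, and the places are then ranked by their stop counts.

```python
import math

def rank_places(stops, radius_m=100.0):
    """Greedily merge nearby stop points into places and rank the places by number of stops."""
    places = []  # each place: {"x": center x, "y": center y, "count": number of stops}
    for x, y in stops:
        for place in places:
            if math.hypot(x - place["x"], y - place["y"]) <= radius_m:
                n = place["count"]
                # Update the running mean so the center follows the merged stops.
                place["x"] = (place["x"] * n + x) / (n + 1)
                place["y"] = (place["y"] * n + y) / (n + 1)
                place["count"] = n + 1
                break
        else:
            places.append({"x": float(x), "y": float(y), "count": 1})
    return sorted(places, key=lambda p: p["count"], reverse=True)
```

As noted above, such a ranking reflects only how often a place was visited and says nothing about how long the sample participant stayed there.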

The second method of identifying significant stops uses the duration of stops. This way of identifying significant places generally assumes that people tend to spend more time at the stops that are significant for them. The limitation of this assumption is that it overlooks significant stops that take little time, including those that take little time but happen frequently.



In general, home and work stops are the places with the highest stop frequency and the longest stop durations. So in identifying these places, retrieving stops with a long duration or temporal gap is important. This is where the stop duration field added during data preprocessing comes in handy. Using this field, the total amount of time spent per identified activity was calculated (see Figure 30). From the graph, it is easy to notice that the home, social and work stops or activities took the longest time. The “none” activity is the activity type that the sample participant did not identify during the prompt-recall. Based on the assumption discussed earlier, we can also say that these stops are significant. This graph gives a general image of which activities are significant, but it fails to communicate the spatial component.

Activity type        Time (hours)
None                 34.3
Home                 240.35
Work                 24.65
Business             3.016667
Bring/Get            0.583333
Shopping             3.883333
Leisure              4.833333
Social               34.05
Services             1.216667
Transfer/Waiting     3.65
No activity          0.55

Figure 30: Multiple day total duration of activities
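The aggregation behind such a chart can be sketched as follows. This is a minimal illustration, assuming each stop record carries an activity label and a duration in hours; the thesis did this step in Excel, and the function name `total_hours_by_activity` and the sample records are hypothetical.

```python
from collections import defaultdict


def total_hours_by_activity(stops):
    """Sum stop durations (hours) per activity type, longest total first."""
    totals = defaultdict(float)
    for activity, hours in stops:
        totals[activity] += hours
    return dict(sorted(totals.items(), key=lambda kv: kv[1], reverse=True))


# Hypothetical stop records: (activity type, duration in hours).
stops = [("Home", 10.5), ("Work", 8.0), ("Home", 12.0),
         ("Social", 3.5), ("None", 2.0)]
totals = total_hours_by_activity(stops)
print(totals)
```

As in the figure, the result ranks activities by total time but carries no location, which is why the spatial component still has to be added separately.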

To include the spatial information, we go back to the identified stops and filter those with longer durations. Most related studies (Andrienko et al., 2013; Andrienko et al., 2007) use 3 hours as the threshold when filtering long stops. The same 3-hour threshold was adopted for this filtering analysis. Defining the time threshold is admittedly difficult, and in some cases a longer or shorter threshold may produce unexpected outcomes. From the filtered map (see Figure 31) it is possible to notice that home (blue point), work (green point), social (purple points), service (orange points) and none or unidentified activity (red points) locations are present. Still, this map misses the temporal component: by looking at the map it is not possible to prioritize the identified significant stops according to their time weight. Furthermore, stops that are important but shorter than 3 hours are missing. To present the spatial and temporal components together, an interactive map was made.
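The duration filter can be sketched as follows, assuming each stop has a start and an end time stamp. The 3-hour threshold comes from the text; the time stamp format, the function name `long_stops` and the sample records are assumptions made for illustration.

```python
from datetime import datetime, timedelta

THRESHOLD = timedelta(hours=3)  # threshold adopted in the text


def long_stops(stops, threshold=THRESHOLD):
    """Return (label, duration) for stops lasting at least `threshold`.
    Duration is the end time minus the start time of the stop."""
    fmt = "%Y-%m-%d %H:%M"  # assumed time stamp format
    result = []
    for label, start, end in stops:
        duration = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
        if duration >= threshold:
            result.append((label, duration))
    return result


# Hypothetical stop records: (activity, start time, end time).
stops = [
    ("Home", "2018-03-05 18:30", "2018-03-06 07:45"),
    ("Shopping", "2018-03-06 17:10", "2018-03-06 17:40"),
    ("Work", "2018-03-06 08:30", "2018-03-06 16:45"),
]
selected = long_stops(stops)
for label, duration in selected:
    print(label, duration)
```

The short shopping stop is dropped, which illustrates the limitation noted above: stops that matter but last less than the threshold disappear from the filtered result.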

Figure 31: Stops with a duration of 3 hours or longer

8.2.4.2 Interactive map

The interactive map (see Figure 32) was again made in QGIS, using the 3D viewer plugin Qgis2threejs. The interactive map shows which stops have longer durations and which have shorter ones. Unlike the other two methods, interactive spatial clustering and filtering by stop duration, the interactive map shows the temporal and spatial components of each significant stop together. Each stop point was extruded from its exact location using the stop duration field created during the preprocessing of the data. For visualization purposes, the values were multiplied by 3. From the interactive map, it was possible to detect the significant stops or activities: home (blue cylinder), work (green cylinder), service (orange cylinder), various social stops (purple cylinders) and unidentified stops (red cylinders).

Figure 32: Identified stops based on the duration of activity

In conclusion, in identifying and understanding the significant places from multiple trajectories of a single object, it is important to see both the spatial and temporal components of the detected stops. From the three methods, we can conclude that the activity types home, work, social and service are the places that are significant for the sample participant. Furthermore, the “none” activity, or the unidentified stops, was also identified as significant in the analyses. Even though these stops have a significant share, it is difficult to tell which activities they refer to, so the following section discusses methods of identifying the activities of unidentified stops.



8.2.5 Methods of identifying activities of unidentified stops

In the prompt-recall method of identifying the activity of stops, some stops are not assigned to a specific activity type. These stops are labeled “none” in the dataset. Identifying these stops contributes to the quality of the analysis outcome. There are different methods that help identify or infer the activity type of the detected stops. They include algorithms that annotate the activity type using the start and end time of the stop (Ectors et al., 2017) and land use inventories that describe the activity type of the stop (Wolf, Guensler, & Bachman, 2001), among others.

The land use inventory method was used to identify the unidentified stops of the sample participant. The process uses a digital land use inventory and the unidentified stop points to derive and infer the type of activity. The geographically referenced land use dataset was obtained from the Geopunt open-source database. To describe the activity type of the unidentified stops, the unidentified stop points were overlaid on the digital land use polygon map of the region. The next step was a simple spatial join, which transfers the land use descriptions from the polygon-based land use inventory to the point-based detected significant stops. In Figure 33, the type of activity can be described by looking at the type of land use in which each stop point falls.
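A point-in-polygon spatial join of this kind can be sketched in plain Python. The thesis performed the join in QGIS; the ray-casting test below is a generic technique swapped in for illustration, and the polygon coordinates, land use names and the function names `point_in_polygon` and `join_land_use` are hypothetical.

```python
def point_in_polygon(point, polygon):
    """Ray-casting test: count how often a horizontal ray starting at
    `point` crosses the polygon boundary; an odd count means inside."""
    x, y = point
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        if (y1 > y) != (y2 > y):  # edge spans the ray's y level
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def join_land_use(stops, land_use):
    """Attach to each stop the name of the first land use polygon that
    contains it, or None if no polygon does."""
    return [next((name for name, poly in land_use
                  if point_in_polygon(stop, poly)), None)
            for stop in stops]


# Hypothetical land use polygons and unidentified stop points.
land_use = [
    ("residential", [(0, 0), (10, 0), (10, 10), (0, 10)]),
    ("agricultural", [(10, 0), (20, 0), (20, 10), (10, 10)]),
]
stops = [(3, 4), (15, 5), (30, 30)]
labels = join_land_use(stops, land_use)
print(labels)
```

A stop that falls outside every polygon stays unlabeled, and a land use label is only a coarse hint at the activity actually carried out there.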



Figure 33: Land use inventories to describe the activity type (Geopunt.be, 2018)

This gives a general image of the activity but is limited in its level of detail, as land use classes present the generalized activity of a larger area, whereas in reality activities are spread across different scales. For instance, areas defined as residential also contain different service, social, shopping and other activities. From the analysis, the unidentified stop points are generally found in residential and agricultural areas. The yellow, purple, green, orange and blue points fall within the agricultural area, ecologically preserved agricultural area, residential area, and preserved residential area respectively. Moreover, in some cases stop points with unidentified activities are found at the exact position of, or near, stops with a defined activity. For example (see Figure 34), in the area of the north station in Brussels the sample participant identified some stops, denoted by green points, as a work activity, and there are also unidentified activities in that area, denoted by red points. By looking at the activities identified earlier or later in a certain area, one can assume that the unidentified points have the same activity as the neighboring assigned activity. This type of data exploration helps in identifying data errors and in correcting data.



Figure 34: Example of unidentified activity points found near to other defined activity of stops

8.2.6 Identification of significant route

8.2.6.1 Kernel point density

After the identification of significant stops and activities, the next step was to identify the routes most heavily used by the sample participant during the study period. Two different methods were used to identify the significant routes. The first method is a kernel point density analysis. The analysis uses all position point records to calculate, with a kernel function, the density of points that fall in the same area. From the analysis, it was possible to develop a heat map that marks routes with a color gradient from white to red. The color gradient represents the point density: routes with more position points at the same or a nearby location are shown as high density (red) and routes with fewer position points as low density (white). The resulting heat map thus classifies the routes used during the sample participant’s trips as important or not important. However, this method of identifying heavily used routes was found to be unreliable, and its results could lead to wrong assumptions. For instance, the route from the sample participant’s home location to a transit location (Schulen station) is identified as important, whereas the route from home to the work location is not.

Figure 35: Kernel point density analysis (map labels: transit location, work location, home location)

The route identified as important in Figure 35, from home to the transit location, was not the actual significant route. By exploring the trajectories of each day, it was found that the sample participant used that route only once during the study period. To understand that particular trip and the error, an animated map of the single trajectory was made. From the animation, it was possible to notice that the trip from home to the transit location was made at a very low speed. Because the speed was low, the number of position points recorded along that route was high, and so was the kernel point density. We can assume that the trip was made using a slow mode of transportation, that the sample participant was standing in congestion, or similar. On the other hand, the trip from home to work was made several times, yet the heat map identifies it as not important. This too has to do with the speed of the sample participant, which in this case was high. Using this method to identify and analyze significant routes from movement data is therefore unreliable, but in relatively large datasets it helps analysts get a general image of the importance of routes and prompts them to question the identified routes further.
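The speed effect can be made concrete with a small calculation. The sketch below assumes the device logs one position per fixed time interval, which the thesis does not state explicitly; the route length, speeds, trip counts and the function name `recorded_points` are hypothetical.

```python
def recorded_points(distance_m, speed_kmh, interval_s=1.0):
    """Approximate number of position records logged along a route when
    one record is stored every `interval_s` seconds (assumed behavior)."""
    travel_time_s = distance_m * 3.6 / speed_kmh
    return round(travel_time_s / interval_s) + 1


# One slow trip versus five fast trips over a route of the same length.
slow_once = recorded_points(3000, speed_kmh=5)
fast_five_times = 5 * recorded_points(3000, speed_kmh=50)
print(slow_once, fast_five_times)
```

Under these assumptions a single slow trip leaves more points on the map than five fast trips over the same distance, so a point density heat map ranks the rarely used route above the frequently used one.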

8.2.6.2 The activity space model

The second method used to identify heavily used routes was an activity space model. This section deals with the use of ellipses to represent the activity spaces of individuals. The term activity space denotes the geographical representation of the space in which the sample participant carries out a group of activities. The approach uses ellipses to contribute to the analysis of the spatial component by identifying the activity spaces of sample participants. “By graphically representing observed minimal activity spaces, and by offering a variety of spatially related variables for analysis, it is possible to discern consistencies in the activity spaces of different types of travelers and to get a clearer picture of the spatial extent and impacts of their trip making behavior” (Newsome, Walcott, & Smith, 1998).

The ellipse has several properties. It is drawn using the home and work locations as its foci. As discussed in Newsome et al. (1998), an individual’s activity space is bounded by the locus of points whose summed distance to the two foci, the home and work locations, is constant. The size of the ellipse is a function of the distance between the two foci. The edge of the ellipse, that is the constant summed distance, is drawn through the farthest significant activity chained to the work trip. Taking the farthest significant activity ensures the inclusion of the other significant activities within the ellipse (Li & Tong, 2016).
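The geometric rule can be sketched as a membership test. This is a minimal illustration of an ellipse defined by two foci and a point on its edge, not the drawing procedure used in the thesis; the coordinates and the function name `make_activity_space` are hypothetical.

```python
from math import dist


def make_activity_space(home, work, farthest_activity):
    """Return a test for the ellipse with foci `home` and `work` whose edge
    passes through `farthest_activity`. A point is inside when its summed
    distance to both foci does not exceed that of the farthest activity."""
    limit = dist(home, farthest_activity) + dist(work, farthest_activity)

    def contains(point):
        return dist(home, point) + dist(work, point) <= limit

    return contains


# Hypothetical coordinates in kilometres.
home, work, social = (0.0, 0.0), (8.0, 0.0), (4.0, 3.0)
in_space = make_activity_space(home, work, social)
print(in_space((4.0, 0.0)), in_space((4.0, 4.0)))
```

Any activity closer to the home-work axis than the farthest chained activity falls inside the ellipse, which is why drawing the edge through that activity keeps the other significant activities within the activity space.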

Figure 36: Activity space model (Li & Tong, 2016)

Coming back to the dataset of the sample participant, the foci of the ellipse are the identified home (blue point) and work (green point) locations. The next step was to identify the farthest important activity that is chained to the work trip, which is the one marked “social activity”. Looking at the overall activity significance (see Figure 30), social activity is one of the significant activities for the sample participant. The edge of the ellipse is drawn through this farthest important activity. As a result, the space within the drawn ellipse can be considered the activity space of the sample participant. Since the main aim of this analysis is to identify the heavily used routes, we can now identify the routes connecting the home and work locations. By doing that and observing the daily home to work trips, it was noticed that the important route follows the streets “Diestersteenweg” and “Kuringersteenweg”. Unlike the heat map, the activity space model was effective in identifying the highly used route. This was also verified by manually looking through the daily trips of the sample participant.



Figure 37: Application of ellipse-shaped space activity model

In addition, the ellipse activity space model is considered one of the appealing ways of visualizing the activity space (Newsome et al., 1998). Despite its advantages, the ellipse-shaped activity space model also has some shortcomings. Mainly, it fails to fully capture people's activity spaces since it only considers home to work trips (Li & Tong, 2016). The result obtained from this analysis allows for further analysis, one example of which is discussed in the following section.

9 CURRENT APPLICATIONS OF VISUAL ANALYTICS APPLIED TO CASE STUDIES

In this section of the thesis, the current applications of visual analytics to the iSCAPE project are discussed. As discussed earlier, the iSCAPE project mainly strives to understand pupil's activity pattern. As one of the alternative ways of addressing the issue of air pollution, the project aims at behavioral intervention: changing the activity patterns of urban inhabitants by proposing sustainable activity patterns that suit their traveling behavior. In this line, visual analytics can contribute to addressing the issue of air pollution under the objective of the iSCAPE project. The project mainly involves identifying and understanding the important routes and activities of sample participants, after which the participants receive sustainable behavioral recommendations that help address the issue of air pollution. The recommendations strive to reduce both the exposure and the contribution to the issue.

As seen in the previous section, using different visual analytics methods we were able to identify the route most heavily used by the sample participant. This contributes to the main research objective of the project, which is to implement and examine smart alternatives for reducing the impact of air pollution in cities. As identified, the home to work trip is mainly made along the Diestersteenweg and Kuringersteenweg route. Having identified the activity space and the highly used route, we can overlay air pollution raster data on them in order to understand and analyze the issue of air pollution. The air pollution raster data was found on the website of the Flemish environment agency (Vlaamse Milieumaatschappij, 2019). As the data was not open source, a screenshot of the online map was used. The maps show an assessment of the air quality in Flanders; the annual average map shows the air quality down to street level in the year 2017. The maps below show the annual average air quality assessment for NO2 (see Figure 38) and PM10 (see Figure 39).

Figure 38: Identified highly used route and NO2 pollution map (Vlaamse Milieumaatschappij, 2019)



Figure 39: Identified highly used route and PM10 pollution map (Vlaamse Milieumaatschappij, 2019)

This visual analytics method helps in identifying whether a sample participant’s main activity pattern lies within a polluted area or not. From the analysis it was seen that the significant route identified for the sample participant lies within a highly polluted area. Besides displaying the overlaid pollution data, computational operations can be performed in order to understand the relationship between variables such as exposure time and pollution concentration, mode of transport and pollution concentration, infrastructure time and pollution concentration, etc. For instance, visual analytics might couple exposure time and pollution concentration: the absorbed dose can be visualized to answer questions such as whether the speed of different modes contributes to the exposure level. Biking or using other active modes through areas with relatively low pollution might be more damaging than driving or using other fast modes through areas with relatively high pollution. One case could be assessing the pollution exposure of single trajectories of a single object. Animated maps can use the pollution data as a base map, and by looking at the speed and mode of transportation of the object we can comment on the exposure level. To conclude, the project can benefit from coupling visual analytics methods and tools to ease the understanding and analysis of the dataset.
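The interplay of speed and concentration can be illustrated with a rough calculation. The sketch assumes that exposure is approximated as concentration multiplied by time spent on each route segment. A real absorbed-dose estimate would also need the inhalation rate, which differs by mode and is not given in the thesis. All numbers and the function name `cumulative_exposure` are hypothetical.

```python
def cumulative_exposure(segments, speed_kmh):
    """Sum concentration x travel time over route segments given as
    (length_km, concentration_ug_m3). Result is in (ug/m3) x hours.
    Assumption: exposure ~ concentration x time; inhalation rate ignored."""
    return sum(concentration * (length_km / speed_kmh)
               for length_km, concentration in segments)


# Hypothetical routes of equal length with different pollution levels.
cleaner_route = [(5.0, 20.0)]    # 5 km at 20 ug/m3
polluted_route = [(5.0, 40.0)]   # 5 km at 40 ug/m3

cycling = cumulative_exposure(cleaner_route, speed_kmh=15)
driving = cumulative_exposure(polluted_route, speed_kmh=50)
print(round(cycling, 2), round(driving, 2))
```

With these illustrative numbers the slower trip through the cleaner area accumulates more exposure than the faster trip through the polluted area, which is the kind of question the coupled analysis is meant to answer.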



10 DISCUSSION

The visual analytics methods for exploring data started by selecting target data from the raw data. This included extracting the data of a sample participant from the whole dataset, the single-day records of a particular sample participant, etc., which were relevant for the study and helped to ease data manipulation and speed up processing. The selection of target data was followed by preprocessing of the data. This process mainly consists of enriching the selected data with an additional variable: a stop duration field, obtained by simply subtracting the start time of each stop from its end time. This operation may take some time, but it is done only once. After the preprocessing, the next data exploration step was the data transformation. The transformation process mainly comprises projecting the position records on a base map using the “Belgian Lambert 72” coordinate system and checking whether the locations of the position points make sense (see Figure 16 & 24). In addition, to make the animated map of single trajectories (see Figure 18), the time stamp of each record needed to be transformed into a format accepted by the QGIS Time Manager plugin.

Moreover, the data exploration process played a crucial role in assessing the quality of the data (see Figure 25). In assessing data quality using visual analytics methods, understanding the context of the data was fundamental, and in doing so the supervision of the human analyst is crucial. As discussed (see section 6.4), the assessment of data quality can be structured in four phases: data collection, data transformation, graphical mapping, and display. In the iSCAPE project, data quality issues were observed only in the first phase, the data collection phase. A range of data quality issues was detected in this phase, mainly concerning missing data and measuring errors. The data collection included identifying the purpose of the stops made by the sample participant using a prompt-recall method. In that data collection, there were detected stops whose activity was not assigned to a particular activity type. Besides that, data gaps were detected during the exploration process (see Figure 25): some day trajectories miss position records or include gaps. This error can be the result of measuring error caused by the quality of the device, or of human and device error in data collection and entry. In addressing the data quality issues, the day trajectories with data gaps were ignored throughout the whole analysis, and the missing activity types were identified using land use data.

The process uses a digital land use inventory and the unidentified stop points to derive and infer the type of activity (see Figure 33). Furthermore, this kind of missing data error can also be corrected or reduced by looking at the activities identified earlier or later in a certain area: one can assume that the unidentified points have the same activity as the neighboring assigned activity (see Figure 34).

After selecting the target data, preprocessing and transforming the data, and assessing its quality, the next step was to find and match patterns using visual analytics methods. The process of discovering patterns principally comprises identifying possible meaningful relationships between features of the object(s). The visual analytics method of pattern finding includes the construction of trajectories (see Figure 17 & 24). The co-incidence in space and time of position points, that is, points appearing at nearby locations with a small time gap, was used to construct the trajectory line. Furthermore, spatial point clustering and point density were used to identify patterns of stops (see Figure 20 & 27), significant stops (see Figure 27) and significant routes (see Figure 35). Clustering and point density rely on identifying meaningful co-location in space, which is the coexistence of position points within a defined distance of each other. Besides co-incidence in space and time and co-location in space, similarity in characteristics was used to find patterns: significant stops were identified using a defined stop duration (see Figure 21, 31 & 32). Using point clustering and stop duration to identify the significant activity stops, it was possible to conclude that the activity types home, work, social, service, and unidentified activities were the significant places for the sample participant.

In identifying the significant route and activity space, a visual analytics model was used. The ellipse-shaped activity space model (see Figure 36) uses ellipses to contribute to the analysis of the spatial component by identifying the activity spaces of sample participants. By using the activity space model it was possible to identify the highly used route (see Figure 37). It was noticed that the important route follows the streets “Diestersteenweg” and “Kuringersteenweg”. Unlike the heat map or kernel point density analysis, the activity space model was effective in identifying the highly used route. This was also verified by manually looking through the daily trips of the sample participant.

The available visual analytics software stacks and toolboxes are discussed and summarized in section 6.9. In assessing and exploring the example dataset, Excel and QGIS were mainly used. Excel was used for some data preprocessing, data transformation and mathematical operations, and also to visualize some findings in bar charts (see Figure 23 & 30). Initially, the aim was to use the visual analytics tool used by Andrienko (2013), but accessing the tool was not possible. Thus, in order to perform the same kind of visual analytics tasks, two versions of QGIS (QGIS Desktop 2.18.17 & 3.4.1) and several QGIS plugins were used for the different analyses and visualizations. The earlier version of QGIS was needed because some important plugins were not functional in the newer version of the software. The plugins used comprise the “Point to Paths” plugin, which is used to construct the trajectory lines (see Figure 17 & 24: right); the “Time Manager” plugin, which is used to make the animated map (see Figure 18); the “Qgis2threejs” plugin, which is used to make the 3D interactive map to identify stops using the activity duration (see Figure 21); and finally the “Heat map” plugin, which is used in the kernel point density analysis (see Figure 35). Additionally, Adobe Illustrator and Adobe Photoshop were used in the construction of the ellipse-shaped activity space model (see Figure 37) and for better visualization of some images.

Lastly, the visual analytics methods were applied to an example dataset from the iSCAPE project as a case study. The result obtained from using the model allows for further analysis within the iSCAPE project, as the project strives to make a behavioral intervention by changing the activity patterns of urban inhabitants and proposing sustainable activity patterns that suit their traveling behavior. To this end, air pollution raster data retrieved from the Flemish environment agency was overlaid on the activity space and the highly used route. From this analysis, it was seen that the identified significant route of the sample participant lies within a highly polluted area. As a recommendation, the project can benefit greatly from coupling visual analytics methods and tools to ease the understanding and analysis of the dataset.

11 CONCLUSION

The challenges addressed by visual analytics are huge in nature. Using the iSCAPE project dataset, this thesis presented different visual analytics methods that enable a human analyst to explore and make sense of movement data that initially lacks any semantics. In the process of exploring and assessing the dataset, sense emerges as the human analyst perceives the information and associates it with prior knowledge and with findings from related references. Interactive visual displays were essential in supporting this sense-making process. However, interactive visual displays alone become insufficient when it comes to analyzing complex and dynamic data like movement data. Hence, computational techniques and tools are a necessary complement to the interactive displays. The computation enables basic data preprocessing, processing, extraction, and abstraction of important objects and features. This aggregation and summarization of objects and features enables and eases the visualization of valuable information from the complex dataset. The visualization assists human cognition and reasoning, which in turn contributes to further analysis using different visual analytics techniques and tools. So by combining the two sides of visual analytics, visualization and computation, decision making and reasoning from huge datasets can be supported.

Furthermore, visual analytics research involves contributions from different research domains such as information visualization, data exploration and management, statistics and mathematics, and human perception and reasoning, and each research area contributes a range of approaches that encourage multi-disciplinary solutions. The collaboration of visual analytics researchers with other disciplines helps in developing analysis methods for complex data and complex real-world problems. In addition, visual analytics facilitates approaches for mitigating the information overload problem. It displays the links between different results, which helps engage the human in the knowledge generation process.

12 LIMITATIONS AND POTENTIAL APPLICATIONS

In doing this analysis there were some limitations regarding computational power, the availability of supporting data, and the quality of the data. First, limited computational power prevented further analysis using the whole dataset. Because of that, it was not possible to analyze the simultaneous movement of multiple objects. The second limitation concerned the availability of supporting data. Most supporting data and documents used in the thesis were available as open source, but others, such as the environmental pollution data, could not be accessed. In addition, the data quality issues identified in the data collection, such as missing data and measuring errors, were ignored in the analysis. This might affect the interpretation of the findings of some analyses.

As discussed in the thesis, different visual analytics methods assist human analysts in exploring and assessing complex and dynamic data. The visual analytics methods presented in this thesis are applicable to diverse movement datasets. Accordingly, they can be used in understanding and assessing the movement behavior of other individuals or of multiple entities. Such studies can be helpful in areas that involve movement-related decision making, including urban planning, traffic and transportation planning, service planning, business planning and many others.



13 ACKNOWLEDGMENT

I would like to express my deep gratitude to my thesis supervisor, Prof. dr. ir. Ansar-Ul-Haque YASAR, for his committed approach and outstanding supervision. I would also like to express my special gratitude to my thesis mentor, ir. Wim ECTORS, who provided me with constructive advice and guidance throughout the research process, as well as important documents that contributed a lot to my work. Finally, I am grateful to Shiraz Ahmed for his cooperation and directions in obtaining the necessary documents.



14 REFERENCES 

Andrienko, G., Andrienko, N., Bak, P., Keim, D., & Wrobel, S. (2013). Visual analytics of movement: Springer Science & Business Media.

Andrienko, N., & Andrienko, G. (2007). Designing visual analytics methods for massive collections of movement data. Cartographica: The International Journal for Geographic Information and Geovisualization, 42(2), 117-138.

Andrienko, N., & Andrienko, G. (2013). Visual analytics of movement: An overview of methods, tools and procedures. Information Visualization, 12(1), 3-24.

Adnan, M., & Ahmed, S. (2017). Report on Environmental Effects of Behavioural Actions.

Andrienko, G., & Andrienko, N. (n.d.). A Primer on Visual Analytics. Special focus: movement data. Fraunhofer-Institut für Intelligente Analyse- und Informationssysteme IAIS.

Cich, G., Knapen, L., Bellemans, T., Janssens, D., & Wets, G. (2016). Threshold settings for TRIP/STOP detection in GPS traces. Journal of Ambient Intelligence and Humanized Computing, 7(3), 395-413.

Cook, K. A., & Thomas, J. J. (2005). Illuminating the path: The research and development agenda for visual analytics.

Data, D. B. (2012). A Practical Guide to Transforming the Business of Government. TechAmerica Foundation's Federal Big Data Commission.

Ectors, W., Reumers, S., Lee, W. D., Choi, K., Kochan, B., Janssens, D., . . . Wets, G. (2017). Developing an optimised activity type annotation method based on classification accuracy and entropy indices. Transportmetrica A: Transport Science, 13(8), 742-766.


Geopunt.be. (2018). Geopunt Vlaanderen. [online] Available at: https://www.geopunt.be/ [Accessed 9 Dec. 2018].

Giannotti, F., & Pedreschi, D. (2008). Mobility, data mining and privacy: geographic knowledge discovery. Berlin: Springer.

Green, T. M., Ribarsky, W., & Fisher, B. (2009). Building and applying a human cognition model for visual analytics. Information Visualization, 8(1), 1-13. doi:10.1057/ivs.2008.28

Järvinen, P., Puolamäki, K., Siltanen, P., & Ylikerälä, M. (2009). Visual analytics. Final report.

Josko, J. M. B., & Ferreira, J. E. Data quality assessment of very large database through visualization system. Paper presented at the 29th Brazilian Symposium on Database.

Keim, D., Andrienko, G., Fekete, J.-D., Görg, C., Kohlhammer, J., & Melançon, G. (2008). Visual analytics: Definition, process, and challenges. In Information Visualization (pp. 154-175): Springer.

Lee, Y. W., Pipino, L. L., Funk, J. D., & Wang, R. Y. (2009). Journey to data quality: The MIT Press.

Leventhal, B. (2010). An introduction to data mining and other techniques for advanced analytics. Journal of Direct, Data and Digital Marketing Practice, 12(2), 137-153. doi:10.1057/dddmp.2010.35

Li, R., & Tong, D. (2016). Constructing human activity spaces: A new approach incorporating complex urban activity-travel. Journal of Transport Geography, 56, 23-35.

Newsome, T. H., Walcott, W. A., & Smith, P. D. (1998). Urban activity spaces: Illustrations and application of a conceptual model for integrating the time and space dimensions. Transportation, 25(4), 357-377.

OpenStreetMap. (2018). OpenStreetMap. [online] Available at: https://www.openstreetmap.org/#map=19/50.93258/5.33452 [Accessed 20 Dec. 2018].

Pang, A. (2001). Visualizing uncertainty in geo-spatial data. Paper presented at the Proceedings of the Workshop on the Intersections between Geospatial Information and Information Technology.

Sagl, G., Loidl, M., & Beinat, E. (2012). A visual analytics approach for extracting spatio-temporal urban mobility information from mobile network traffic. ISPRS International Journal of Geo-Information, 1(3), 256-271.



Siddiqui, T., Kim, A., Lee, J., Karahalios, K., & Parameswaran, A. (2016). Effortless data exploration with zenvisage: an expressive and interactive visual analytics system. Proceedings of the VLDB Endowment, 10(4), 457-468. doi:10.14778/3025111.3025126

Vlaamse Milieumaatschappij. (2019). Luchtkwaliteit in je eigen omgeving. [online] Available at: http://www.vmm.be/data/luchtkwaliteit-in-je-eigenomgeving [Accessed 15 Dec. 2019].

Ward, M., Xie, Z., Yang, D., & Rundensteiner, E. (2011). Quality-aware visual data analysis. Computational Statistics, 26(4), 567-584. doi:10.1007/s00180-010-0226-0

Wolf, J., Guensler, R., & Bachman, W. (2001). Elimination of the travel diary: Experiment to derive trip purpose from global positioning system travel data. Transportation Research Record: Journal of the Transportation Research Board(1768), 125-134.


