Change detection in semi-structured documents
WAYNE BONNICI | SUPERVISOR: Dr Joseph G. Vella | COURSE: B.Sc. IT (Hons.) Software Development
In digital forensics investigations, it is sometimes necessary to sift through and analyse a large number of documents submitted as evidence. A computer system could assist an investigator in establishing the exact derivation of changes in a document by summarising the tracked changes between two documents. This project was concerned with target documents encoded in the docx format ‒ the default file format of Microsoft Word ‒ and assumed that the documents contained text and track-change annotations.
A tool, named DocxDiff, was developed to compare documents. The tool is easy to use and requires minimal user interaction. DocxDiff utilises a change-detection algorithm that is specifically designed to work on documents that are internally represented as hierarchical XML trees, as found in docx files. The algorithm is efficient and capable of producing accurate deltas. The change-detection algorithm ignores irrelevant deltas, such as nodes that are represented differently but are semantically equivalent. For example, XML offers a shorthand notation known as self-closing tags; if a self-closing tag were not considered equal to its longer, expanded form, it would have to be processed at a later stage, increasing the size of the output delta. DocxDiff uses the change-detection algorithm to inspect only the parts of the documents that are flagged as ‘modified’. This avoids the need to traverse the document in full, thus significantly improving performance when targeting large documents. DocxDiff uses the unique paths extracted from a delta to access changed nodes in the document, and to determine their behaviour between the earlier and the later document revision. DocxDiff classifies the track changes into four categories, namely: inserted text, deleted text, approved/rejected text marked for insertion, and approved/rejected text marked for deletion.
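A minimal sketch of two of the ideas above ‒ treating self-closing and expanded empty elements as equivalent, and locating tracked changes by their WordprocessingML markup ‒ might look as follows. The helper names are illustrative rather than DocxDiff's actual implementation; only the w:ins and w:del element names are part of the docx standard.

    import xml.etree.ElementTree as ET

    W = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"

    def nodes_equal(a, b):
        # <w:b/> and <w:b></w:b> parse to the same tree shape; treating
        # empty text and None as equal makes the two notations compare equal.
        return (a.tag == b.tag
                and a.attrib == b.attrib
                and (a.text or "") == (b.text or ""))

    def tracked_changes(document_xml):
        # w:ins wraps text marked for insertion; w:del wraps text marked
        # for deletion (the removed text is kept inside w:delText).
        root = ET.fromstring(document_xml)
        return list(root.iter(W + "ins")), list(root.iter(W + "del"))

Comparing which w:ins and w:del annotations survive between the two revisions would then indicate whether a change was newly made, or approved/rejected in the later revision.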
The investigator would select two documents suspected of being related to one another. DocxDiff would then perform a number of preprocessing steps to ensure that the selected documents have a reasonably high chance of being related. This step removes the need for the investigator to manually open and compare the content of the documents to identify similarities. Since the investigator would not need to know the sequence of events, the documents could be supplied in any order: DocxDiff would determine the sequence of events automatically and, if necessary, swap the supplied order. DocxDiff encapsulates the delta of the two selected documents and presents any altered track-change state to the investigator in a way that is easy to understand and manage. This is a key factor, as an investigator might need to obtain summaries of other, similar documents.
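One plausible way to recover the order of two docx files ‒ a hedged illustration, not necessarily how DocxDiff does it ‒ is to compare the dcterms:modified timestamps stored in each archive's docProps/core.xml part:

    import zipfile
    import xml.etree.ElementTree as ET

    DCTERMS = "{http://purl.org/dc/terms/}"

    def modified_time(docx_path):
        # Every docx file is a ZIP archive; core document metadata,
        # including the last-modified timestamp, lives in docProps/core.xml.
        with zipfile.ZipFile(docx_path) as archive:
            core = ET.fromstring(archive.read("docProps/core.xml"))
        return core.findtext(DCTERMS + "modified")  # ISO 8601 string

    def order_revisions(path_a, path_b):
        # Return (earlier, later), swapping the supplied order if needed.
        if modified_time(path_a) <= modified_time(path_b):
            return path_a, path_b
        return path_b, path_a

As the abstract notes, such metadata becomes unreliable once a document has been resaved, so this could only ever serve as a heuristic within a wider preprocessing pipeline.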
The investigator would be expected to use the presented summary to determine whether to shortlist the documents for further analysis. During testing, DocxDiff achieved a very good track-change detection accuracy, and detected track changes were always classified into the correct category. Due to limitations in the docx format itself, some documents that were resaved at a later stage tended to cause the preprocessing validation to fail, leading DocxDiff to mark these documents as non-related.
Figure 1. A sample document and associated before and after tracked changes
Real-time EEG emotion-recognition using prosumer-grade devices
FRANCESCO BORG BONELLO | SUPERVISOR: Dr Josef Bajada | COURSE: B.Sc. IT (Hons.) Artificial Intelligence
Electroencephalography (EEG) is a biomedical technology that measures brain activity. Apart from being used to detect brain abnormalities, it has the potential to serve other purposes, such as understanding the emotional state of individuals. This is typically done using expensive, medical-grade devices, making it virtually inaccessible to the general public. Furthermore, since most currently available studies only work retrospectively, they tend to be unable to detect a person’s emotion in real time. The aim of this research was to determine whether reliable EEG-based emotion recognition (EEG-ER) could be performed in real time using low-cost, prosumer-grade devices such as the EMOTIV Insight 5.
Figure 1. The EMOTIV Insight 5 Channel Mobile Brainwear device
This study uses a rolling time window of a few seconds of the EEG signal, rather than minutes-long recordings, enabling the approach to work in real time. The information loss incurred in moving from the channel count of a medical-grade device to that of a prosumer device is also analysed. Since research has shown that different people experience emotions differently, the study also analysed the difference between generic, subject-independent models and subject-dependent models fine-tuned to each specific subject.
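A rolling window over a continuous EEG stream can be sketched as follows; the window and hop lengths here are illustrative values, not necessarily those used in the study:

    import numpy as np

    def rolling_windows(eeg, fs, window_s=4.0, hop_s=1.0):
        # Yield overlapping windows from an EEG array of shape
        # (n_channels, n_samples) sampled at fs Hz.
        win = int(window_s * fs)
        hop = int(hop_s * fs)
        for start in range(0, eeg.shape[1] - win + 1, hop):
            yield eeg[:, start:start + win]

    # Example: 5 channels (as on the EMOTIV Insight) at 128 Hz.
    stream = np.random.randn(5, 128 * 60)  # one minute of synthetic signal
    for window in rolling_windows(stream, fs=128):
        pass  # classify each window to obtain a real-time emotion estimate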
Different machine learning techniques, such as support-vector machines (SVM) and 3D convolutional neural networks (3D-CNN), were used. These models were trained to classify four emotions ‒ happy, sad, angry, and relaxed ‒ using the Dataset for Emotion Analysis using Physiological Signals (DEAP). The best models achieved 67% accuracy in the subject-independent case, rising to 87% for subject-dependent models, in real time and using only 5 channels. These results compare well with state-of-the-art benchmarks that use the full medical-grade 32-channel data, demonstrating that real-time EEG-ER is feasible using lower-cost devices, making such applications more accessible.
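As an illustration of the SVM side of such a pipeline, the sketch below extracts band-power features with Welch’s method and cross-validates an RBF-kernel classifier. The feature choice, sampling rate, and synthetic data are assumptions for illustration, not the study’s exact setup; in practice the labels would be the four quadrant emotions derived from DEAP’s valence/arousal ratings.

    import numpy as np
    from scipy.signal import welch
    from sklearn.svm import SVC
    from sklearn.model_selection import cross_val_score

    BANDS = {"theta": (4, 8), "alpha": (8, 13), "beta": (13, 30)}

    def band_powers(window, fs):
        # Average spectral power per channel and frequency band.
        freqs, psd = welch(window, fs=fs, axis=-1)
        feats = []
        for lo, hi in BANDS.values():
            mask = (freqs >= lo) & (freqs < hi)
            feats.append(psd[:, mask].mean(axis=-1))
        return np.concatenate(feats)

    # Synthetic stand-ins: 40 windows of 5-channel EEG, 4 emotion labels
    # (0..3 for happy, sad, angry, relaxed).
    windows = [np.random.randn(5, 512) for _ in range(40)]
    y = np.random.randint(0, 4, size=40)

    X = np.stack([band_powers(w, fs=128) for w in windows])
    clf = SVC(kernel="rbf")
    print(cross_val_score(clf, X, y, cv=5).mean())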
Figure 2. Brain activity of individuals experiencing one or more emotions
Scaling protein-motif discovery using tries in Apache Spark
ETHAN JOSEPH BRIFFA | SUPERVISOR: Mr Joseph Bonello | COURSE: B.Sc. IT (Hons.) Software Development
The field of bioinformatics applies computational techniques to biology. This study focuses in particular on proteins, which are large molecules that have specific functions in organisms. Understanding proteins requires identifying fixed patterns called motifs in protein sequences, as motifs are indicative of a protein’s structure and function.
This research attempts to improve the speed of finding motifs by comparing unknown protein sequences with known protein domains, as classified in the CATH hierarchy. The approach adopted in this study uses the multiple sequence alignment (MSA) of proteins found in CATH functional families. Each MSA contains motifs with sequence regions that have been preserved through evolution, known as conserved regions. The representative sequences for the functional families are stored in a suffix trie, which is then used to find potential structures, as sketched below.
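A suffix trie indexes every suffix of a sequence so that any substring can be located in time proportional to its own length. A minimal dictionary-based sketch (illustrative, not the project’s implementation), using the “GVAV” motif from Figure 1:

    def build_suffix_trie(sequence):
        # Insert every suffix of the sequence; for "GVAV" these are
        # GVAV, VAV, AV, V.
        root = {}
        for i in range(len(sequence)):
            node = root
            for residue in sequence[i:]:
                node = node.setdefault(residue, {})
        return root

    def contains(trie, motif):
        # A motif occurs in the sequence iff it is a prefix of some suffix.
        node = trie
        for residue in motif:
            if residue not in node:
                return False
            node = node[residue]
        return True

    trie = build_suffix_trie("GVAV")
    print(contains(trie, "VA"))  # True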
To improve the efficiency of the search, the suffix trie is implemented using the Apache Spark framework, which is generally used to process large amounts of data efficiently. The Spark architecture offers processing scalability by distributing the process over a number of nodes, thereby speeding up the search.
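Distribution over Spark might look like the sketch below, which reuses build_suffix_trie and contains from the previous sketch: the read-only trie is broadcast once to every executor, and the unknown sequences are scanned in parallel. The sequences and the fixed 4-mer scan are hypothetical simplifications.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("motif-search").getOrCreate()
    sc = spark.sparkContext

    # Broadcast the trie once; workers then search their partition of
    # the unknown protein sequences without re-shipping the index.
    trie_bc = sc.broadcast(build_suffix_trie("GVAV"))
    unknown = sc.parallelize(["MKGVAVLT", "MTTQAPTF", "AGVAVGVA"])
    hits = unknown.filter(lambda seq: any(
        contains(trie_bc.value, seq[i:i + 4]) for i in range(len(seq) - 3)
    )).collect()
    print(hits)  # sequences containing a 4-mer found in the trie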
The method subsequently determines the best match through a scoring algorithm, which ranks the output based on the closest match to a known structural motif. A substitution matrix is also used to consider all possible variations of the conserved regions.
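Scoring a candidate region against a known motif typically sums per-residue substitution scores; in practice a standard matrix such as BLOSUM62 would be used, but a toy matrix suffices to illustrate the ranking (the values below are made up):

    # Toy substitution scores: identities rewarded, a conservative
    # substitution (the similar hydrophobic residues V/I) tolerated.
    SUBS = {("V", "V"): 4, ("V", "I"): 3, ("I", "V"): 3,
            ("G", "G"): 6, ("A", "A"): 4}

    def score(candidate, motif, mismatch=-2):
        # Sum substitution-matrix scores position by position.
        return sum(SUBS.get((a, b), mismatch) for a, b in zip(candidate, motif))

    candidates = ["GVAV", "GIAV", "GQTW"]
    ranked = sorted(candidates, key=lambda c: score(c, "GVAV"), reverse=True)
    print(ranked)  # exact match first, conservative variant second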
Finally, the system was benchmarked against a library of hidden Markov models in order to compare processing speed, whilst verifying that the correct results would still be produced.
Figure 1. Suffix trie for an example motif “GVAV”
Figure 2. Architecture of the technology
Discovery of anomalies and teleconnection patterns in meteorological climatological data
LUKAN CASSAR | SUPERVISOR: Dr Joel Azzopardi | COURSE: B.Sc. IT (Hons.) Artificial Intelligence
Climate change has become a growing problem globally, and the analysis of datasets to identify climate patterns and anomalous behaviour is more crucial than ever. However, analysing such datasets manually could prove overwhelming, as they tend to be far too large to inspect by hand. As a result, the need has arisen for techniques to efficiently scour and manipulate such extensive data; these are generally referred to as data-mining techniques.
The research for this project involved using different data-mining algorithms to extract anomalies and teleconnections from a dataset of monthly global air temperatures, covering a period of 72 years (1948-2020).
Anomaly detection is a significant step in data mining, and is primarily concerned with identifying data points that deviate from the remainder of the data, and are hence considered anomalies. The purpose of anomaly detection in climate data is to identify any spatial (across space), temporal (across time) or spatial-temporal (across both space and time) anomalies within the dataset. Such anomalies are crucial to understanding and forecasting the behaviour of the Earth’s ecosystem. The anomalies are detected using three algorithms, namely: k-nearest neighbours (k-NN), k-means clustering, and density-based spatial clustering of applications with noise (DBSCAN).
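As an illustration of one of the three detectors, the sketch below runs a density-based pass with scikit-learn’s DBSCAN, flagging points that fall in no dense cluster (label -1) as anomalies. The feature layout and parameter values are assumptions for illustration only.

    import numpy as np
    from sklearn.cluster import DBSCAN

    # One row per (latitude, longitude, month, temperature) observation;
    # synthetic values stand in for the 1948-2020 air-temperature grid.
    rng = np.random.default_rng(0)
    X = rng.normal(size=(1000, 4))
    X[:5, 3] += 8  # inject a few unusually warm readings

    labels = DBSCAN(eps=0.9, min_samples=10).fit_predict(X)
    anomalies = np.flatnonzero(labels == -1)  # points in no dense cluster
    print(len(anomalies))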
Teleconnections are recurring and persistent patterns in climate anomalies that connect two distant regions to each other. Their significance is due to the fact that they reflect large-scale changes in the atmosphere and influence temperature, rainfall, and storms over extensive areas. As a result, teleconnections are often the culprits when anomalous weather patterns occur concurrently across widespread distances. The teleconnections are detected using three association-mining techniques ‒ Apriori, FP-growth, and Generalized Sequential Pattern (GSP) ‒ applied over the spatial-temporal anomalies identified previously.
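To make the association-mining step concrete, here is a minimal Apriori pass in which each “transaction” is the set of regions that were anomalous in the same month; the region names, transactions, and support threshold are illustrative assumptions.

    from itertools import combinations

    # Each transaction: regions anomalous in the same month (illustrative).
    transactions = [
        {"north_atlantic", "west_pacific"},
        {"north_atlantic", "west_pacific", "indian_ocean"},
        {"north_atlantic", "west_pacific"},
        {"indian_ocean"},
    ]

    def apriori(transactions, min_support=0.5):
        # Frequent k-itemsets seed candidate (k+1)-itemsets; distant
        # regions that frequently co-occur hint at teleconnections.
        items = {i for t in transactions for i in t}
        frequent, k = {}, 1
        candidates = [frozenset([i]) for i in items]
        while candidates:
            counts = {c: sum(c <= t for t in transactions) for c in candidates}
            level = {c: n / len(transactions) for c, n in counts.items()
                     if n / len(transactions) >= min_support}
            frequent.update(level)
            seeds = sorted({i for c in level for i in c})
            k += 1
            candidates = [frozenset(c) for c in combinations(seeds, k)]
        return frequent

    for itemset, support in apriori(transactions).items():
        print(set(itemset), round(support, 2))

Here the pair {north_atlantic, west_pacific} surfaces with support 0.75, the kind of frequently co-occurring, spatially distant pattern that would be examined as a candidate teleconnection.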
The extracted anomalies and teleconnections, as obtained from the previously mentioned algorithms, have been represented in interactive graphs and heat maps.
Figure 1. Spatial anomalies plotted on an interactive heat map
Figure 2. Teleconnections plotted on an interactive map