Embark Volume 2 by College of Engineering

EMBARK NORTHEASTERN UNDERGRADUATE ENGINEERING AND APPLIED SCIENCES REVIEW

VOLUME II SEPTEMBER 2018

PAPERS

FLUX CALIBRATION OF TI-BALL TI SOURCE FOR PRECISION DEPOSITION OF BARIUM TITANATE
Dalton Cox, Student, Katherine Ziemer, Professor

PEDOT:PSS-DVS CROSSLINKING REACTION MONITORED VIA ATR-FTIR FOR AIR CATHODE APPLICATION IN MICROBIAL FUEL CELLS
Maria Jennings, Undergraduate Student, Ian Kendrick, Post-Doctoral Research Associate, Clive Green, Graduate Student, Steve Lustig, Associate Professor

DESIGN AND PROGRAMMING OF A REMOTE IOS CONTROLLER AND GATEWAY FOR UNDERWATER ACOUSTIC NETWORKS
Andrew Fish, Student, Yashar A. Aval, Researcher, Stefano Basagni, Professor

GAUSSIAN MIXTURE MODELS FOR DYNAMIC MALWARE CLUSTERING
Alexander M. Interrante-Grant, Student, David Kaeli, Professor

A FAST PARALLEL LEVEL SET SEGMENTATION ALGORITHM FOR 3-D IMAGES
Benjamin Trapani, Student, Julian Gutierrez, Student, David Kaeli, Professor

ON THE COVER Displayed on the cover is a confocal laser scanning microscopy image of fluorescently stained B16-OVA melanoma tumor cells seeded on a Hypoxia-Inducing Cryogel (HIC) for a cell viability assay. In the photo, the red is a Rhodamine staining of the polymer backbone, the green is a Phalloidin Alexa Fluor-488 staining of the actin fibers of the cells’ cytoskeletons, the blue is a DAPI staining of all cell nuclei, and the yellow is a live/dead fixable staining of all dead cells. The low number of dead cells present compared to the large number of total cells attached to the scaffold shows that the HIC exhibits excellent overall cell viability. The polymer scaffold is composed of methacrylated hyaluronic acid with different enzymes grafted to the polymer scaffold via polyethylene glycol (PEG), giving the scaffold incomparable biocompatibility, excellent robustness, and powerful functionalization. In this project, Hypoxia-Inducing Cryogels (HIC) are being designed in order to develop a more accurate in-vitro 3D tumor model to be used by cancer researchers and drug developers. This new model will directly address the two largest shortcomings of current in-vitro tumor models: (1) a failure to recapitulate the tumor extracellular matrix, and (2) a failure to induce a stable hypoxic oxygen gradient. These two factors are heavily linked to cancer proliferation, therapeutic resistance, metastasis, and immunosuppression. Therefore, the fact that current cancer models can’t recapitulate these two physiological factors has left cancer researchers and drug developers unable to generate reliable, accurate data. As a result, there exists a fatal inability to predict a drug’s efficacy, which results in more lives being lost to cancer and more money being wasted by companies on fruitless research efforts. Under this new platform, the influence of hypoxia on tumor cell biology, metabolism, and motility, and on the cytotoxic effects of chemotherapeutics, is under investigation.
This in-vitro tumor model has tremendous potential to advance personalized medicine, cancer research, and drug development. (James Sinoimeri, 2018)

IN THIS VOLUME Flux Calibration of Ti-Ball Ti Source for Precision Deposition of Barium Titanate Dalton Cox, Student, Katherine Ziemer, Professor A Ti-Ball titanium sublimation source was calibrated to atomic flux for use in molecular beam epitaxy (MBE) growth of thin films. Precise calculations of atomic flux are necessary for growth of crystalline barium titanate (BTO) films as well as measurements of the sticking coefficients (σ) to better understand the mechanism of crystal growth. Two sources used for flux calculations, film growth rate and manufacturer supplied total sublimation rate, disagreed by an order of magnitude, requiring additional inquiry into the source of error; however, both calculations agreed that σTi >> σBa in growth of BTO.


PEDOT:PSS-DVS Crosslinking Reaction Monitored via ATR-FTIR for Air Cathode Application in Microbial Fuel Cells Maria Jennings, Undergraduate Student, Ian Kendrick, Post-Doctoral Research Associate, Clive Green, Graduate Student, Steve Lustig, Associate Professor Microbial fuel cells (MFCs) intended for at-risk communities lacking sources of clean water and electricity could be more economically produced by the implementation of biomolecular air cathode technology: an encapsulated enzyme within an electrically conductive nonwoven spunbound polymer such as PEDOT:PSS. While PEDOT:PSS may be spunbound and boasts high conductivity, it is prone to delamination and redispersion due to its water solubility. We report the analysis of crosslinking reactions with divinyl sulfone (DVS) in order to improve the durability and insolubility of conductive PEDOT:PSS fibers. Analytical characterization of crosslinking PEDOT:PSS with DVS using time-lapsed ATR-FTIR spectroscopy in concentrated solutions permits a clear characterization of PEDOT:PSS-DVS crosslinking kinetics and structure.

Design and Programming of a Remote iOS Controller and Gateway for Underwater Acoustic Networks Andrew Fish, Student, Yashar A. Aval, Researcher, Stefano Basagni, Professor The purpose of this research is to explore mobilizing the control of SmartBuoyDuo devices that provide access to the nodes of an underwater acoustic network. The research entails the creation of a smartphone-based ultra-portable system to control basic functionalities of SmartBuoyDuos including their relays, sensor readings, and sleep cycles. The Teensy platform, paired with a Bluetooth LE module and an XBee S3B, is used to create a remote control gateway device capable of sending commands to, and receiving responses from, SmartBuoyDuos. This system is paired with an iOS application developed in the Swift 3 language using Apple's CoreBluetooth framework. Prototyped on a breadboard, then finalized on a soldered protoboard, the remote control gateway also integrates an OLED display and a LiPo battery with charge monitoring.

Gaussian Mixture Models for Dynamic Malware Clustering Alexander M. Interrante-Grant, Student, David Kaeli, Professor As the number of unique malware samples grows at a rapidly increasing rate, analysts are having trouble tracking the evolution of existing malware and identifying new malware in an ever-changing threat landscape. It can take hours for a malware analyst to evaluate a single sample, so analysts are increasingly turning to fast, automated malware analysis to identify trends across, and attributes of, newly observed malware. One goal is to identify the author of a new piece of malware. Various approaches have been proposed, applying machine learning to various concise program representations. Because ground truth labels for malware samples are notoriously difficult to find, most machine learning approaches rely on unsupervised learning (i.e. clustering) methods. In this paper, a number of recently proposed clustering approaches using co-occurrence matrices of system calls are evaluated. In addition to applying previously proposed clustering algorithms to this program representation, this work applies Gaussian mixture model (GMM) clustering, an approach that had not been evaluated by the research community for dynamic malware clustering. Our results show that GMM clustering outperforms other clustering approaches, achieving a Fowlkes-Mallows score two times better than the state of the art on a dataset of real-world malware curated from the VirusShare malware corpus.

A Fast Parallel Level Set Segmentation Algorithm for 3-D Images Benjamin Trapani, Student, Julian Gutierrez, Student, David Kaeli, Professor Image segmentation is one of the key analysis tools in biomedical imaging applications. Although level set segmentation algorithms have been explored thoroughly in the past, these approaches are non-scalable due to their inherent data dependencies. Algorithms with large corresponding data dependency graphs that contain many small cycles are difficult to parallelize, prohibiting these algorithms from effectively leveraging modern highly parallel compute devices. Given that the resolution of medical imaging hardware has continued to increase each year, and CPU performance has not kept pace, there is a need to explore parallel solutions for processing medical images. Prior work described an efficient level set segmentation algorithm designed for parallel architectures for segmenting 2-D images. The algorithm segments an input image into four components based on an initial curve. The prior 2-D level set segmentation algorithm is extended, providing a solution for 3-D images. The algorithm is adapted to examine adjacent voxels at each step rather than adjacent pixels. The initial curve is a user-provided sphere that is defined parametrically to reduce copy overhead to the compute device. The efficiency of the 2-D algorithm is preserved in the conversion, enabling the resulting algorithm to perform on the order of ten times faster than existing GPU-accelerated 3-D level set segmentation implementations. The implementation presented in this work supports real-time segmentation of 7T MRI images, leveraging the computational power of an NVIDIA Tesla K20 GPU to reduce execution time. This image segmentation algorithm supports identification of tumors, tissue volume measurements, and surgery planning at the rate required by radiologists today.

Flux Calibration of Ti-Ball Ti Source for Precision Deposition of Barium Titanate
Dalton Cox, Student, Katherine Ziemer, Professor
Department of Chemical Engineering, Northeastern University, Boston, MA

Abstract: A Ti-Ball titanium sublimation source was calibrated to atomic flux for use in molecular beam epitaxy (MBE) growth of thin films. Precise calculations of atomic flux are necessary for growth of crystalline barium titanate (BTO) films as well as measurements of the sticking coefficients (σ) to better understand the mechanism of crystal growth. Two sources used for flux calculations, film growth rate and manufacturer supplied total sublimation rate, disagreed by an order of magnitude, requiring additional inquiry into the source of error; however, both calculations agreed that σTi >> σBa in growth of BTO.

I. INTRODUCTION

Barium titanate (BTO) is a multifunctional oxide material with ferroelectric and thermoelectric properties, among others [1]. These properties make it an important layer for developing multilayer heterostructures in next-generation multifunctional sensors [2]. Crystalline thin film growth of BTO is important for retaining the desired ferroelectric or thermoelectric properties; however, without tight control over the atomic fluxes of both Ba and Ti to assure a 1:1 atomic ratio within the lattice, thin films grown by molecular beam epitaxy (MBE) will form amorphously [3]. Ba is deposited through the use of an effusion cell, whose surface flux is easily calculated by Equation 1, seen below.

$$\Phi = \frac{3.51 \times 10^{22}\, P A}{\pi L^{2} \sqrt{M T}}$$   Eq. (1)

Where Φ is atomic flux (atoms/cm²-s), P is the vapor pressure of Ba (Torr), A is the aperture area (cm²), L is the distance from the source to the substrate (cm), M is the molecular weight (g/mol), and T is the temperature (K) [2]. The Ti is deposited by sublimation from a hemispherical mini Ti-Ball source whose sublimation rate is only controllable by applied current. While the manufacturer of the Ti-Ball (Varian Vacuum Technologies) provides an approximate sublimation rate for a given current level, an atomistic flux function must be determined to fully control growth [4]. In addition to greater control over the crystallinity, an exact value for atomic flux will allow for calculations of sticking coefficients (σ) of Ti during BTO growth, which will help determine the mechanistic properties of growth [5]. These experiments aim to calculate the atomic flux by growth of Ti thin films on a cleaned germanium (Ge) sample.

II. METHODS

Ge was cleaned by degreasing in heated trichloroethylene, acetone, and methanol followed by drying with an argon gas source. After this cleaning procedure, the Ge substrate was adhered to a molybdenum puck using silver paste, then the puck was inserted into the vacuum chamber. Once a pressure of 10⁻⁹ Torr was achieved, X-ray photoelectron spectroscopy (XPS) was performed using a magnesium anode to determine the initial atomic surface characteristics. A full sweep from 1100-0 eV was performed first, followed by tight scans around the characteristic binding energy peaks for Ge 3d5 (39-25 eV), C 1s (293-274 eV), and Ti 2p3 (470-449 eV).

A. Ti Growth

Ti growth was performed in a growth chamber with the substrate at room temperature (to assure σTi~1), an initial pressure of 10⁻⁹ Torr, and the substrate placed 7 in. (17.78 cm) away from the Ti-Ball. Ti was grown at five current values spanning the nominal operating range of 35-45 A. Higher current values were operated for shorter time lengths to reduce any risk of the Ge signal disappearing entirely.

Table 1: Growth Time for Applied Currents

| Ti-Ball Applied Current (A) | Growth Time (min) |
|---|---|
| 35.0 | 20 |
| 37.5 | 20 |
| 40.0 | 15 |
| 42.5 | 10 |
| 45.0 | 5 |

B. Analysis

Upon completion of the growth, XPS was again performed on the substrate to determine any change in surface atomic composition. As the Ti film layer increases in thickness, the signal for the Ge base will become smaller. This decrease in signal, or attenuation, was calculated using Equation 2, shown below.

$$\mathrm{Attenuation}(\%) = \frac{A_{\text{post growth}}}{A_{\text{pre growth}}} \times 100\%$$   Eq. (2)

Where A is the area under the curve of the XPS scan. All areas were measured using the MultiPak software. Using the NIST EAL13 software, the attenuation of Ge through a layer of Ti was directly related to film thickness through the kinetic energy of the Ge photoelectron, the asymmetry factor of Ge, and the absorption characteristics of Ti [6]. Dividing the film thickness by the growth time then gave the growth rate at each current level.

To calculate the flux at the surface, several assumptions were made: first, due to the lower temperature, σTi = 1; second, flux was uniform across the area of the sample; third, all Ti growth was crystalline. To calculate flux, growth rate must be multiplied by the atomic density of the film; because crystalline Ti is hexagonal close packed in structure with 6 atoms per unit cell and a unit cell volume of 1.06×10⁻²² cm³, crystalline Ti has an overall atomic density of 5.65×10²² atoms/cm³ [7]. Multiplying each of the growth rates by this atomic density allowed for a curve-fitting function to be developed to map the atomic fluxes within the operating range of the Ti-Ball.
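The thickness-to-flux conversion described above can be sketched in a few lines. This is a minimal illustration, not the authors' analysis code: the function names are hypothetical, the unit cell constants are the ones quoted in the text, and the example thickness and growth time are assumed values chosen only to show the arithmetic.

```python
# Sketch: convert an XPS-derived film thickness into an atomic flux.
# Constants are quoted from the text; the example inputs are hypothetical.

ATOMS_PER_CELL = 6            # hcp Ti, atoms per unit cell (from the text)
CELL_VOLUME_CM3 = 1.06e-22    # unit cell volume in cm^3 (from the text)


def atomic_density(atoms_per_cell=ATOMS_PER_CELL,
                   cell_volume_cm3=CELL_VOLUME_CM3):
    """Atoms per cm^3 of crystalline film."""
    return atoms_per_cell / cell_volume_cm3


def flux_from_growth(thickness_nm, growth_time_min, sticking=1.0):
    """Atomic flux (atoms/cm^2-s) implied by a film thickness and growth time.

    Assumes uniform, crystalline growth. `sticking` is the assumed sticking
    coefficient (taken as 1 at room temperature, as in the text).
    """
    thickness_cm = thickness_nm * 1e-7              # 1 nm = 1e-7 cm
    growth_rate_cm_s = thickness_cm / (growth_time_min * 60.0)
    return growth_rate_cm_s * atomic_density() / sticking


if __name__ == "__main__":
    print(f"atomic density: {atomic_density():.3e} atoms/cm^3")
    # Hypothetical example: 0.25 nm of Ti grown in 20 min
    print(f"flux: {flux_from_growth(0.25, 20):.3e} atoms/cm^2-s")
```

Keeping the sticking coefficient as an explicit parameter makes it easy to see how the inferred flux would change if the room-temperature assumption σTi = 1 were relaxed.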

III. RESULTS

The atomic flux fit is shown in Figure 1.

Fig. 1: A power fit of the flux best predicts the data

The fit gives $\Phi_{Ti} = 5.67 \times 10^{-11}\, I^{13.889}$ within the Ti-Ball operating range, where I is the applied current in A.

A. Generalization of Flux

Assuming a steady state system, no secondary source of Ti generation, and a purely hemispherical surface with no significant defects allows one to equate the surface flux between two surfaces, as represented by Equation 3, shown below.

$$\Phi_A\, r_A^{2} = \Phi_B\, r_B^{2}$$   Eq. (3)

Combining Equation 3 with the value of ΦTi gives Equation 4.

$$\Phi(I, r) = \Phi_o(I)\,\frac{r_o^{2}}{r^{2}} = 5.67 \times 10^{-11}\, I^{13.89} \left(\frac{r_o^{2}}{r^{2}}\right) = 1.792 \times 10^{-8} \left(\frac{I^{13.89}}{r^{2}}\right)$$   Eq. (4)

Where r is the distance from the center of the Ti-Ball to the substrate (cm), r_o is the distance used in experimentation (17.78 cm), and Φ is the Ti flux (atoms/cm²-s).

B. Comparison to Provided Values

The sublimation data provided by Varian was translated into a flux by assuming a perfect hemisphere. At r = r_o the Ti would be evenly distributed along a hemisphere of radius r_o, thus

$$\Phi = \frac{R_{sub}\, N_A}{MW \cdot 3600 \cdot 2\pi r_o^{2}}$$   Eq. (5)

Where R_sub is the sublimation rate (in g/hr), N_A is Avogadro's number, and MW is the molecular weight of Ti. The expected flux was graphed against the calculated flux on a logarithmic scale.

Fig. 2: Calculated Ti-Ball flux severely below the estimated expected values from manual

The magnitude of the difference between the calculated flux and the estimate of the expected values shows that at least one assumption in one of the flux calculations was not valid.

C. Evaluation of σ Values During BTO Growth

Because of the order of magnitude difference between the calculated and expected results, σ evaluation was performed using the calculated results as a lower bound and the expected results as an upper bound. While the exact values of σ cannot be determined without precise knowledge of the number of atoms of each element in BTO, a relationship between σTi and σBa can be determined using the molar fraction of Ba to Ti in the film and the fluxes of each to find a ratio.

$$\frac{\sigma_{Ti}\,\Phi_{Ti}}{\sigma_{Ba}\,\Phi_{Ba}} = R_m$$   Eq. (6)

Where R_m is the molar ratio present in the BTO film. Using data collected by Northeastern doctoral student Sue Celestin during BTO growth, a substantial range of sticking coefficient ratios was observed at the same substrate temperature, as summarized in Table 2.

Table 2: Values to Calculate Sticking Coefficient Ratios [8]

| ΦBa (atom/cm²-s) | ΦTi,Calc (atom/cm²-s) | ΦTi,Expect (atom/cm²-s) | Ti/Ba | σTi/σBa Max | σTi/σBa Min |
|---|---|---|---|---|---|
| 1.30E+15 | 4.46E+12 | 3.96E+13 | 0.51 | 147 | |
| 9.40E+15 | 4.46E+12 | 3.96E+13 | 1.20 | 2539 | 286 |
| 8.00E+14 | 4.46E+12 | 3.96E+13 | 1.67 | 299 | |
| 1.30E+14 | 4.46E+12 | 3.96E+13 | 3.33 | | |

These show that even when considering ΦTi at the maximum flux value, Ti is orders of magnitude more likely to stick than Ba.
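The relations in Equations 4 through 6 lend themselves to a quick numerical check. The sketch below is illustrative only and is not the authors' code: the function names are hypothetical, and the molecular weight of Ti (47.867 g/mol) and Avogadro's number are standard values that the text does not state explicitly. Applied to the second row of Table 2, the rearranged Equation 6 returns ratios within about 1% of the tabulated 2539 and 286, a gap that is plausibly due to rounding of the tabulated inputs.

```python
import math

R_O_CM = 17.78            # source-to-substrate distance used in the experiment (cm)
PREFACTOR = 5.67e-11      # power-law fit prefactor, valid at r = R_O_CM
EXPONENT = 13.889         # power-law fit exponent
N_A = 6.02214076e23       # Avogadro's number (standard value, not from the text)
MW_TI = 47.867            # g/mol for Ti (standard value, not from the text)


def ti_flux(current_a, r_cm=R_O_CM):
    """Eq. (4): fitted Ti flux (atoms/cm^2-s) at current I (A) and distance r (cm)."""
    return PREFACTOR * current_a ** EXPONENT * (R_O_CM / r_cm) ** 2


def expected_flux(r_sub_g_per_hr, r_cm=R_O_CM):
    """Eq. (5): flux if the sublimed mass spreads evenly over a hemisphere."""
    atoms_per_s = r_sub_g_per_hr * N_A / (MW_TI * 3600.0)
    return atoms_per_s / (2.0 * math.pi * r_cm ** 2)


def sticking_ratio(molar_ratio_ti_ba, phi_ba, phi_ti):
    """Eq. (6) rearranged: sigma_Ti / sigma_Ba = R_m * Phi_Ba / Phi_Ti."""
    return molar_ratio_ti_ba * phi_ba / phi_ti


if __name__ == "__main__":
    # Second row of Table 2
    phi_ba, phi_calc, phi_expect, r_m = 9.40e15, 4.46e12, 3.96e13, 1.20
    print("max ratio:", sticking_ratio(r_m, phi_ba, phi_calc))
    print("min ratio:", sticking_ratio(r_m, phi_ba, phi_expect))
    print("flux at 40 A, 17.78 cm:", ti_flux(40.0))
```

Because the smaller (calculated) Ti flux sits in the denominator, it produces the larger sticking ratio, which is why the calculated flux maps to the "Max" column and the expected flux to the "Min" column.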

IV. CONCLUSIONS

Several assumptions must be audited in order to fully understand the root cause of the large difference between the expected and calculated flux levels. The first, and most likely, cause of the disparity is that the sublimation rate is not constant over both the area of the substrate and the time of the growth. A time-dependent flux, for instance during a warmup period, would reduce the average flux most strongly for the higher-current growths because of their shorter growth times. This error could be reduced by increasing the time of all growths to a standard time period. Another assumption in need of reevaluation is that Ti grows in its traditional crystal structure during the growth periods. The unit cell for Ti is 0.468 nm tall; however, some growths produced only 0.2-0.3 nm of film thickness. Non-crystalline growth would cause all flux calculations to be inaccurate. Using RHEED to monitor the crystallinity and growing for longer periods of time would lend more credibility to this assumption. The third assumption to further investigate is the uniform and perfectly hemispherical sublimation by the Ti-Ball. While this assumption was made knowing there would be some variance between the calculated and actual results, any serious deformations or sublimation points not located on the hemispherical tip of the Ti-Ball could heavily skew the conversion from mass sublimation rate to atomic flux. While the exact flux value will require further investigation, the sticking coefficient ratios confidently show that Ba flux must be much greater than Ti flux in order to achieve a 1:1 atomic ratio within crystalline BTO.
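The warmup argument above can be made concrete with a toy model. The sketch below assumes, purely for illustration, that the flux ramps linearly from zero to its steady-state value over a fixed warmup time; neither the ramp shape nor the 2-minute warmup is taken from the experiment, and the function name is hypothetical.

```python
def average_flux_fraction(growth_time_min, warmup_min):
    """Time-averaged flux as a fraction of the steady-state flux.

    Hypothetical model: flux rises linearly from 0 to steady state over
    `warmup_min`, then stays constant for the rest of the growth.
    """
    if growth_time_min <= 0 or warmup_min < 0:
        raise ValueError("growth time must be positive and warmup non-negative")
    if warmup_min >= growth_time_min:
        # Growth ends during the ramp: average of a partial linear ramp.
        return growth_time_min / (2.0 * warmup_min)
    # The ramp contributes half its duration at full flux; the remainder is full flux.
    return 1.0 - warmup_min / (2.0 * growth_time_min)


if __name__ == "__main__":
    for t in (5, 10, 15, 20):  # growth times used in Table 1
        frac = average_flux_fraction(t, warmup_min=2.0)  # assumed warmup length
        print(f"{t:>2} min growth: {frac:.3f} of steady-state flux")
```

Under these assumptions a 5-minute growth averages 80% of the steady-state flux while a 20-minute growth averages 95%, which illustrates why standardizing on longer growth times would reduce this source of error.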

ACKNOWLEDGMENT This research was done through Northeastern University's Interface Engineering Laboratory in the Chemical Engineering Department. Special thanks to Sue Celestin, Negar Golshan, and Prof. Katherine Ziemer.

BIOGRAPHY Dalton Cox is a fifth year Chemical Engineering and Physics student with particular interest in Materials Science and Energy Technology. After graduation, he will be pursuing his doctorate in Materials Science. Email: cox.da@husky.neu.edu

REFERENCES
[1] J. Remeika and W. Jackson, "A Method for Growing Barium Titanate Single Crystals", Journal of the American Chemical Society, vol. 76, no. 3, pp. 940-941, 1954.
[2] T. Goodrich, Atomistic Investigation into the Interface Engineering and Heteroepitaxy of Functional Oxides on Hexagonal Silicon Carbide through the Use of a Magnesium Oxide Template Layer for the Development of a Multifunctional Heterostructure, 1st ed. Boston: Northeastern University, 2008.
[3] T. Goodrich, "BTO on MgO/SiC by MBE", 2008.
[4] Mini Ti-Ball™ Titanium Sublimation Source, 1st ed. Italy: Agilent Technologies, 2011.
[5] A. McNaught and A. Wilkinson, IUPAC compendium of chemical terminology, 2nd ed. [Cambridge, England]: Royal Society of Chemistry, 2000.
[6] C. Powell and A. Jablonski, NIST Electron Effective-Attenuation-Length Database, 1st ed. Gaithersburg, MD: National Institute of Standards and Technology, 2011.
[7] R. Ahuja, J. Wills, B. Johansson and O. Eriksson, "Crystal structures of Ti, Zr, and Hf under compression: Theory", Physical Review B, vol. 48, no. 22, pp. 16269-16279, 1993.
[8] S. Celestin, "Experimental Report", Northeastern University, 2016.

PEDOT:PSS-DVS Crosslinking Reaction Monitored via ATR-FTIR for Air Cathode Application in Microbial Fuel Cells
Maria Jennings, Bachelor of Science, Ian Kendrick, Post-Doctoral Research Associate, Clive Green, Master of Science Candidate, Steve Lustig, Associate Professor
Department of Chemical Engineering, Northeastern University, Boston, MA

Abstract: Microbial fuel cells (MFCs) intended for at-risk communities lacking sources of clean water and electricity could be more economically produced by the implementation of biomolecular air cathode technology: an encapsulated enzyme within an electrically conductive nonwoven spunbound polymer such as PEDOT:PSS. While PEDOT:PSS may be spunbound and boasts high conductivity, it is prone to delamination and redispersion due to its water solubility. We report the analysis of crosslinking reactions with divinyl sulfone (DVS) in order to improve the durability and insolubility of conductive PEDOT:PSS fibers. Analytical characterization of crosslinking PEDOT:PSS with DVS using time-lapsed ATR-FTIR spectroscopy in concentrated solutions permits a clear characterization of PEDOT:PSS-DVS crosslinking kinetics and structure.

I. INTRODUCTION

In 2015, the Joint Monitoring Programme sponsored by the World Health Organization/UNICEF announced that one in ten persons is without access to clean water, accounting for 663 million people (319 million of whom were located in Sub-Saharan Africa) of an estimated 7.3 billion at the time of study [1]. Furthermore, UNICEF has reported that 6,000 children die daily of water-related diseases, accounting for an overwhelming fraction of 2.2 million diarrhea-caused fatalities annually [2]. In conjunction with a lack of safe drinking water, impoverished areas also tend to lack access to electricity. The International Energy Agency asserted that in 2014, 634 million persons in Africa lacked access to electricity, of whom 632 million resided in Sub-Saharan Africa. In total, 1.2 billion persons globally lacked electricity as of 2014 [3], a basic amenity that powers essential daily processes such as cooking.

The lack of clean drinking water presents a global public health crisis with a distinct call to action: first to limit water expenditure, and second to advance water remediation technology to provide an inexpensive, user-friendly, self-powered, and reliable method of providing clean water. One method of wastewater treatment, bioremediation, is the use of bacteria to digest otherwise harmful compounds in their immediate environment, in turn generating non-toxic wastewater [4]. The microbial fuel cell (MFC) can potentially make use of bioremediation to clean contaminated water while generating electricity, thus presenting a self-powered water purification system [5].

An MFC operates similarly to a battery. Microbial digestion of wastewater in an anode compartment produces carbon dioxide, protons, and electrons. The electrons then travel by direct contact through bacterial pili, hair-like structures located on the surface of many bacteria, which function as the fuel cell anode. From the anode, the electrons travel through an electrical load to the cathode. The cathode captures these electrons and protons from the anode as it reduces oxygen from the air to form water [6]. A general diagram of a typical microbial fuel cell is provided in Fig. 1.

Fig. 1. An MFC comprises anode and cathode compartments [7]

Many current cathode technologies utilize cost-prohibitive precious metals as cathode catalysts [8-10]. Such technologies would be beyond the economic reach of families in underdeveloped communities, such as those in Sub-Saharan Africa [11]. Hence, we are interested in investigating the viability of renewable laccase enzyme cathode catalysts, whose cost has recently fallen with economies of scale in manufacturing. Laccase catalyzes the reduction of a dioxygen molecule using four electrons and four protons to create two water molecules [12-15].

$$\mathrm{O_2 + 4\,H^+ + 4\,e^- \rightarrow 2\,H_2O}$$   (1)

The current state-of-the-art MFC technology demonstrates laccase catalysis in air cathodes, yet the technology is neither efficient nor economically viable [12-14]. An ideal air cathode features a redox-active enzyme, e.g. laccase, encapsulated in an electrically conductive polymer processed into a nonwoven fiber to permit efficient reduction of atmospheric oxygen. However, such a system has not yet been demonstrated. A promising candidate polymer is poly(3,4-ethylenedioxythiophene)-poly(styrenesulfonate) (PEDOT:PSS). Since PEDOT:PSS films are susceptible to delamination and dispersion, there is interest across various applications in crosslinking the polymer to produce a film that is not only insoluble but also more conductive than the non-crosslinked control [16-18]. Dr. Mantione and coworkers, of the University of the Basque Country, claim room temperature crosslinking of PEDOT:PSS films using divinyl sulfone (DVS) [16]. Reacted films provide visibly greater resilience to re-dispersion than unreacted films, but the authors do not identify the reaction mechanism. Changes in 1H NMR spectra are consistent with more than one reaction chemistry, possibly involving PSS and/or diethylene glycol. The latter would not result in chemical crosslinks with PEDOT:PSS. Additional characterizations using UV-Vis-NIR spectroscopy, Raman spectroscopy, and FTIR spectroscopy show little difference from non-crosslinked controls [16]. This work re-examines the reaction between PEDOT:PSS and DVS using time-resolved, 2-D correlation ATR-FTIR spectroscopy. Using both experimental evidence and quantum chemical calculations, we find that PSS and DVS can participate in a crosslinking reaction.

II. MATERIALS AND METHODS

A. Materials

Poly(3,4-ethylenedioxythiophene)-poly(styrenesulfonate) (PEDOT:PSS) and divinyl sulfone (DVS), see Fig. 2, were purchased from Sigma Aldrich. The 1.3 wt.% PEDOT:PSS dispersion in water contains 0.5 wt.% PEDOT and 0.8 wt.% PSS. DVS is specified at 97% purity, containing less than 650 ppm hydroquinone as an inhibitor. Reactions between PEDOT:PSS and DVS are initiated by combining 2.0% v/v DVS with 1.3 wt.% PEDOT:PSS.

Fig. 2. Chemical structures of PEDOT:PSS (left) and DVS (right).

B. Attenuated Total Reflection-Fourier Transform Infrared Spectroscopy (ATR-FTIR)

ATR-FTIR spectra were collected using a Bruker Vertex 70 FT-IR spectrometer equipped with a MIRacle ATR stage. Spectra were analyzed using OPUS 6.5 software. 2D correlation analysis was performed using a custom Java application written by and available from Prof. Lustig. Separate ATR-FTIR spectra of 2.0 wt.% DVS and 1.3 wt.% PEDOT:PSS aqueous compositions were collected relative to pure water reference spectra from 550 cm⁻¹ to 4000 cm⁻¹. For time-lapsed PEDOT:PSS-DVS reaction studies, a reference spectrum was collected while the ATR crystal was covered with water. These aqueous dispersions were sealed over the ATR crystal to prevent water evaporation. Each spectrum was collected by averaging 24 interferograms over ca. 20 seconds. Spectra were collected every ten minutes over 16.5 hours.

C. Quantum Thermochemical Calculations

All calculations were performed using the density functional theory electronic structure program DMol3 [19, 20] with graphical displays generated with Materials Studio [21]. The Perdew-Burke-Ernzerhof (PBE) non-local correlation [22] was used for the exchange and correlation potentials with restricted spin polarization and fine DNP (loosely defined as double-numeric + polarization) numerical basis sets [19] with a 4 Å cutoff. Atomic cores are described with the all-electron treatment. Self-consistent field iterations were considered converged with 10⁻⁶ Ha tolerance, and no thermal smearing was used in the orbital occupancy. Molecular geometries were refined via energy optimizations in which convergence tolerance criteria included energy changes less than 10⁻⁵ Ha, maximum forces less than 0.002 Ha/Å, and maximum displacements less than 0.005 Å. Thermodynamic properties such as entropy, enthalpy, and Gibbs energy are computed at finite temperatures after fine geometric optimization and vibrational analysis or Hessian evaluation. Vibrational, rotational, and translational contributions to the molecular partition function are computed according to standard statistical mechanics in the ideal gas approximation [23]. The geometry of molecular structures is optimized by minimizing the total energy.

III. RESULTS AND DISCUSSION

Evidence for reaction between PEDOT:PSS and DVS is shown through a series of time-lapsed ATR-FTIR spectra. Variances in the spectra caused by both disappearance of DVS and changes to polystyrene sulfonate can be easily elucidated using the 2D correlation method [24, 25]. Both the synchronous, Φ, and asynchronous, Ψ, correlation intensities of the dynamic spectra are shown in Fig. 3. The synchronous correlation intensity indicates the degree of coherence between two signals (wavenumbers and their associated intensities) that are measured simultaneously. Peak positions at 1249, 1301, and 1388 cm⁻¹ associated with aqueous DVS are synchronously correlated (share positive Φ) as they decrease concomitantly through the time series. Meanwhile, the appearance of a peak at 1263 cm⁻¹ is oppositely correlated with the disappearance of DVS (shares negative Φ cross correlations). The asynchronous correlation intensity between two peaks represents independent or mutually out-of-phase time-dependent intensity of the dipole-transition moments. Since asynchronous cross peaks do occur, the 1263 cm⁻¹ and DVS absorptions are decoupled.
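For readers unfamiliar with the technique, the synchronous and asynchronous intensities can be sketched as follows. This is not the custom Java application used in this work; it is a minimal pure-Python illustration of the standard generalized 2D correlation formulation (mean-centered dynamic spectra combined through the Hilbert-Noda matrix), and the three synthetic "bands" in the usage example are hypothetical.

```python
import math


def dynamic_spectra(spectra):
    """Subtract the time-averaged spectrum from each spectrum (rows = time points)."""
    m, n = len(spectra), len(spectra[0])
    mean = [sum(row[j] for row in spectra) / m for j in range(n)]
    return [[row[j] - mean[j] for j in range(n)] for row in spectra]


def hilbert_noda(m):
    """Hilbert-Noda matrix: N[j][k] = 0 if j == k, else 1 / (pi * (k - j))."""
    return [[0.0 if j == k else 1.0 / (math.pi * (k - j)) for k in range(m)]
            for j in range(m)]


def correlation_2d(spectra):
    """Return (synchronous, asynchronous) matrices indexed by spectral channel."""
    y = dynamic_spectra(spectra)
    m, n = len(y), len(y[0])
    noda = hilbert_noda(m)
    # z is the Hilbert-Noda transform of the dynamic spectra along the time axis.
    z = [[sum(noda[j][k] * y[k][b] for k in range(m)) for b in range(n)]
         for j in range(m)]
    sync = [[sum(y[t][a] * y[t][b] for t in range(m)) / (m - 1)
             for b in range(n)] for a in range(n)]
    asyn = [[sum(y[t][a] * z[t][b] for t in range(m)) / (m - 1)
             for b in range(n)] for a in range(n)]
    return sync, asyn


if __name__ == "__main__":
    # Hypothetical bands: two reactant-like bands decaying together and one
    # product-like band growing at a different rate.
    series = [[math.exp(-0.3 * t), 0.5 * math.exp(-0.3 * t), 1.0 - math.exp(-0.5 * t)]
              for t in range(10)]
    sync, asyn = correlation_2d(series)
    print("sync(reactant, reactant):", sync[0][1])   # positive: change together
    print("sync(reactant, product): ", sync[0][2])   # negative: opposite directions
    print("asyn(reactant, product): ", asyn[0][2])   # nonzero when rates differ
```

Two bands that change strictly in proportion give a zero asynchronous cross peak, so a nonzero asynchronous intensity between the 1263 cm⁻¹ band and the DVS bands is what signals that they evolve at different rates.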

Fig. 3. Synchronous (top) and asynchronous (bottom) correlation intensities as functions of wavenumber in cm⁻¹ (abscissa and ordinate). Positive intensity is shown in red and negative intensity in blue. Peak positions common to aqueous DVS control spectra are marked by black arrows. Axis spectra contain the first spectrum (black) and the last spectrum (blue) in the time series.

Since the synchronous and asynchronous cross peaks for the 1263 cm⁻¹ and DVS absorptions are opposite in sign, the dipole intensity increase at 1263 cm⁻¹ occurs faster than the DVS dipole intensity decrease. The shape symmetry of the synchronous and asynchronous peaks indicates that these are independent dipole transition moments rather than environment-based wavenumber shifting or broadening, supporting the claim that the reaction of interest is in fact occurring. The appearance of the new 1263 cm⁻¹ absorbance indicates that it is a signature of the observed reaction product. We expect the reaction to form a vinyl sulfone ethyl sulfonate structure, shown in Fig. 4, as the DVS vinyl group is reactive with the PSS sulfonic acid. Quantum chemical calculations of this structure's vibrational states and molecular partition function [19-23] provide consistent evidence for this hypothesis.

Fig. 4. Vinyl sulfone ethyl sulfonate fragment indicating the expected product from the reaction between DVS and PSS.

The structure is predicted to have significant vibrational absorptions at 1205 cm⁻¹ (CH2 wagging at C4), 1267 cm⁻¹ (CH2 twisting at C5), 1304 cm⁻¹ (in-phase CH2 twisting at C4 and C5), 1320 cm⁻¹ (in-phase CH2 twisting at C5 and OSO scissoring at S7), and 1384 cm⁻¹ (CH2 scissoring at C1). Each of these absorbances is detected in Fig. 3. Most significantly, the match between 1263 cm⁻¹ (observed) and 1267 cm⁻¹ (predicted) is within the accepted accuracy of the theoretical method. Statistical thermodynamic predictions are consistent with this reaction product. At room temperature the ideal gas phase reaction Gibbs energy is -108 kcal/mol, with a reaction entropy of -112 cal/mol-K and a reaction enthalpy of -142 kcal/mol. Thus the reaction is energetically favored at room temperature despite a small loss in entropy due to combining two molecules.

IV. CONCLUSION

This study provides insight into the reaction between PEDOT:PSS and DVS that can result in physical crosslinking. Changes in the infrared spectra occur exclusively in regions that correspond to either alkene or sulfonic groups. Since PEDOT:PSS has no reactive alkene groups and DVS has no sulfonic acid groups, it is reasonable to conclude that the changes in the spectra result from an interaction between the two molecules. The proposed product structure enables a crosslinked network between PSS macromolecules. The data collected provide a basis for future study of crosslinking kinetics, as well as an opportunity to produce a more durable, insoluble PEDOT:PSS fiber for incorporation into air cathodes, consistent with the aims described in the literature.

ACKNOWLEDGEMENT Financial support for this project was provided by Northeastern University’s College of Engineering as part of the start-up funds for Prof. Steve Lustig.

AUTHOR BIOGRAPHIES

Maria Jennings is an undergraduate chemical engineering student within Northeastern University's Honors College scheduled to graduate in May 2018. She attends on full scholarship, has received Dean's List recognition each semester, and is a member of the nationally recognized engineering honor societies Tau Beta Pi and Omega Chi Epsilon. She serves as a staff tutor for the First Year Engineering Tutoring Center and has completed a variety of independent research projects through engineering internships at PSEG, Ingredion Inc., and Pfizer Inc. During her spare time, she conducts independent research under the supervision of Professor Steven Lustig. Email: jennings.ma@husky.neu.edu

Ian Kendrick is a postdoctoral researcher at the Northeastern University Center for Renewable Energy Technology. He holds a PhD in Chemistry and Chemical Biology from Northeastern University, where he developed novel electrochemical reactors that allowed for the acquisition of infrared and Raman spectra under normal operating conditions. He is currently developing electrochemical catalysts for the oxidation and reduction of hydrogen as well as new catalysts for the reduction of carbon dioxide to commodity chemicals. Email: i.kendrick@northeastern.edu

Clive Green is a master's candidate in chemical engineering at Northeastern University scheduled to graduate in 2018. He received his Bachelor of Arts in biochemistry and molecular biology from Clark University in 2014. He is a member of the National Society of Black Engineers (NSBE), and his research interests include biotechnology and sustainable engineering methods. He is from Portmore, Jamaica, and hopes to eventually implement novel sustainable technologies in his home country. Email: green.cl@husky.neu.edu

Steve Lustig is an Associate Professor in the Department of Chemical Engineering at Northeastern University in Boston, MA. He holds a Ph.D. in Chemical Engineering from Purdue University in West Lafayette, IN. He was a principal investigator in the Central Research & Development division of DuPont and an adjunct professor at the University of Delaware in the Department of Chemical & Biomolecular Engineering and in the Department of Materials Science and Engineering. His research interests include molecular design by inverse statistical thermodynamics, anti-ballistic materials, and biomolecular cathodes. Email: s.lustig@northeastern.edu

V. REFERENCES

[1] World Health Organization. Progress on sanitation and drinking water: 2015 update and MDG assessment. World Health Organization, 2015.
[2] Unicef. "Child survival fact sheet: Water and sanitation." UNICEF Fact Sheet (2004).
[3] "Energy Access Database." WEO Energy Access Database, International Energy Agency, 2017.
[4] Rabaey, Korneel, and Willy Verstraete. "Microbial fuel cells: novel biotechnology for energy generation." TRENDS in Biotechnology 23.6 (2005): 291-298.
[5] Allen, Robin M., and H. Peter Bennetto. "Microbial fuel-cells." Applied Biochemistry and Biotechnology 39.1 (1993): 27-40.
[6] Du, Zhuwei, Haoran Li, and Tingyue Gu. "A state of the art review on microbial fuel cells: a promising technology for wastewater treatment and bioenergy." Biotechnology Advances 25.5 (2007): 464-482.
[7] Logan, Bruce E., et al. "Microbial fuel cells: methodology and technology." Environmental Science & Technology 40.17 (2006): 5181-5192.
[8] Li, W., et al. "Dynamic behaviour of interphases and its implication on high-energy-density cathode materials in lithium-ion batteries." Nature Communications 8 (2017): 14589.
[9] Ludwig, Jennifer, et al. "Morphology-controlled microwave-assisted solvothermal synthesis of high-performance LiCoPO4 as a high-voltage cathode material for Li-ion batteries." Journal of Power Sources 342 (2017): 214-223.
[10] Garsuch, Arnd, et al. "Active cathode material and its use in rechargeable electrochemical cells." U.S. Patent No. 9,698,421. 4 Jul. 2017.
[11] Lakner, Christoph, and Branko Milanovic. "Global income distribution from the fall of the Berlin Wall to the Great Recession." Revista de Economía Institucional 17.32 (2015): 71-128.
[12] Okuzaki, Hidenori, Yuko Harashina, and Hu Yan. "Highly conductive PEDOT/PSS microfibers fabricated by wet-spinning and dip-treatment in ethylene glycol." European Polymer Journal 45.1 (2009): 256-261.
[13] Barrière, Frédéric, et al. "Targetting redox polymers as mediators for laccase oxygen reduction in a membrane-less biofuel cell." Electrochemistry Communications 6.3 (2004): 237-241.
[14] Tarasevich, M. R., et al. "293-Electrocatalysis of a cathodic oxygen reduction by laccase." Bioelectrochemistry and Bioenergetics 6.3 (1979): 393-403.
[15] Schaetzle, Olivier, Frédéric Barrière, and Uwe Schröder. "An improved microbial fuel cell with laccase as the oxygen reduction catalyst." Energy & Environmental Science 2.1 (2009): 96-99.
[16] Mantione, Daniele, et al. "Low temperature cross-linking of PEDOT:PSS films using divinylsulfone." ACS Applied Materials & Interfaces 9.21 (2017): 18254-18262.
[17] Ghosh, Soumyadeb, Johan Rasmusson, and Olle Inganäs. "Supramolecular Self-Assembly for Enhanced Conductivity in Conjugated Polymer Blends: Ionic Crosslinking in Blends of Poly(3,4-ethylenedioxythiophene)-Poly(styrenesulfonate) and Poly(vinylpyrrolidone)." Advanced Materials 10.14 (1998): 1097-1099.
[18] Rahimnejad, Mostafa, et al. "Microbial fuel cell as new technology for bioelectricity generation: a review." Alexandria Engineering Journal 54.3 (2015): 745-756.
[19] Delley, Bernard. "An all-electron numerical method for solving the local density functional for polyatomic molecules." The Journal of Chemical Physics 92.1 (1990): 508-517.
[20] Delley, Bernard. "From molecules to solids with the DMol3 approach." The Journal of Chemical Physics 113.18 (2000): 7756-7764.
[21] "Biovia Materials Studio 2016." Dassault Systemes Biovia, 16.1.0.21, Dassault Systemes, 2016.
[22] Perdew, John P., Kieron Burke, and Matthias Ernzerhof. "Generalized gradient approximation made simple." Physical Review Letters 77.18 (1996): 3865.
[23] Sandler, Stanley I. An introduction to applied statistical thermodynamics. John Wiley & Sons, 2010.
[24] Noda, Isao. "Two-dimensional infrared (2D IR) spectroscopy: theory and applications." Applied Spectroscopy 44.4 (1990): 550-561.
[25] Park, Yeonju, Isao Noda, and Young Mee Jung. "Novel developments and applications of two-dimensional correlation spectroscopy." Journal of Molecular Structure 1124 (2016): 11-28.

Design and Programming of a Remote iOS Controller and Gateway for Underwater Acoustic Networks

Andrew Fish, Student, Yashar A. Aval, Researcher, Stefano Basagni, Professor
Department of Electrical and Computer Engineering, Northeastern University, Boston, MA

Abstract: The purpose of this research is to explore mobilizing the control of SmartBuoyDuo devices that provide access to the nodes of an underwater acoustic network. The research entails the creation of a smartphone-based ultraportable system to control basic functionalities of SmartBuoyDuos, including their relays, sensor readings, and sleep cycles. The Teensy platform, paired with a Bluetooth LE module and an XBee S3B, is used to create a remote control gateway device capable of sending commands to, and receiving responses from, SmartBuoyDuos. This system is paired with an iOS application developed in the Swift 3 language using Apple's CoreBluetooth framework. Prototyped on a breadboard, then finalized on a soldered protoboard, the remote control gateway also integrates an OLED display and a LiPo battery with charge monitoring.

I. INTRODUCTION

Underwater acoustic networking has received relatively little attention in the research field, despite its many applications. This could primarily be attributed to a strong market demand for more powerful and robust terrestrial networks that can be easily deployed in any environment or location [1]. Despite its less significant reputation in the research community, the applications for underwater networking are extensive, ranging from environmental monitoring to defense communication, energy production, and replacement of aging underwater cable infrastructure [2]. The importance of a totally connected planet has become more apparent in recent years, with headlines like Google's deployment of a $300 million underwater connection to Japan [3]. It has become increasingly apparent that in order to fully connect continents, researchers need to shift their focus from above sea level to beneath it, recognizing the energy efficiency, capacity for redundancy, and expandability of underwater acoustic networks [4].

The Northeastern University Marine Observatory Network (NU MONET) project received National Science Foundation funding in 2014 with the stated mission of developing a testbed for underwater acoustic networking to advance existing protocols and develop new ones. Built around the Teledyne Benthos SM 975 acoustic modem, the project's researchers have spent the last few years developing a control system that allows acoustic devices to be deployed and monitored. This research has culminated in a system called the SmartBuoyDuo, a buoy that floats on the surface of the water while the acoustic modem is tethered beneath (see Fig. 1). The SmartBuoyDuo contains a BeagleBone Black microprocessor, an array of batteries, an XBee 802.15 radio, a long-range 802.11 WiFi access point, and the ability to connect various sensors. This system is currently in its testing phase at Northeastern University's Marine Science Center in Nahant, MA.

Fig. 1. A complete deployment of the NU Marine Observatory Network consists of a network comprised of buoys, modems, and terrestrial communication links.

The crux of the SmartBuoyDuo is its XBee radio, which is responsible for switching relays that control the BeagleBone Black, among other components. However, control of the XBee module requires a computer with a serial monitor installed. The problem the NU MONET team must address is the lack of a mobile control solution for the SmartBuoyDuos through their terrestrial XBee radios. This control system requires a robust feature set, including control of onboard relays, access to sensors, and sleep mode management. In response, the NU MONET team has developed a two-part solution: a physical, mobile router that connects to XBee radios aboard SmartBuoyDuos, and a companion iOS app that provides the interface to send commands to the SmartBuoyDuos via the router. This system is designed to sit atop the existing architectural stack of the NU MONET (see Fig. 2). This paper reports on the development and design of the iOS app and physical router.

The SmartBuoyDuo technology and associated iOS control system aim to advance underwater acoustic networking technology by providing an expandable testbed for protocol development, something that has not been attempted in the past [5]. This novel technology will allow future researchers to design, test, and implement Media Access Control (MAC) protocols that are specifically targeted at underwater networks. Because platforms like this have not been developed in the past, the NU MONET project has taken accessibility and expandability into account in every facet of its development. The network can easily be expanded by deploying more SmartBuoyDuo devices and adjusting the protocol accordingly, while the iOS control system makes NU MONET mobile and accessible to researchers who may be testing the network in an environment where a desktop PC or laptop is impractical, such as in rain, wind, and waves. While the NU MONET project aims to provide a testbed for researchers, the software and protocol developments that it is designed to enable could affect many industries that rely on unconventional network structures, including the defense industry, environmental researchers, telecom companies, and the energy sector [2].

Fig. 2. The full stack of the NU MONET network relies on an iOS device and router to control remote smart buoys.

This paper discusses the design decisions and implementation of a novel iOS-based control system for the NU MONET. In Section II, the design and implementation of the remote control gateway hardware is discussed. Section III profiles the software design and implementation of the remote control gateway and iOS app, while Section IV recounts a field test of the controller's Bluetooth technology. In Section V, future plans and improvements are discussed.

II. REMOTE CONTROL GATEWAY HARDWARE

The remote control gateway is built around three main components: a Teensy microcontroller, a Bluetooth LE radio for connection to iOS devices, and an XBee radio for connection to SmartBuoyDuos. The first prototype was developed on a breadboard (see Fig. 3), then finalized on a protoboard.

Fig. 3. The breadboard contains all main components of the remote control gateway system, as well as a remote XBee with LEDs connected to the four digital pins coinciding with relays in the remote SmartBuoyDuos.

A. Bluetooth Low Energy

The remote control gateway design is based on a Bluetooth Low Energy (BLE) module for several reasons: low energy consumption, good signal strength, sufficient data transmission rate, and readily available libraries for Linux, iOS, and Arduino [6]. These attributes are ideal for the controller's embedded system because it relies on battery power, assumes a certain degree of user mobility, and requires fast data transmission. The decision to use Adafruit's Bluefruit Low Energy (LE) Universal Asynchronous Receiver-Transmitter (UART) Friend module (see Fig. 4) was based on its well-maintained libraries, compact size, full customization via AT commands, and use of the universally recognized UART protocol for communication [6]. When combined, these attributes allow the module to be tailored to send specific types of packets, while parameters like signal strength and device identifier can be fine-tuned.

Fig. 4. The Adafruit Bluefruit LE UART Module is a fully customizable device used to connect the routing system to a compatible iOS device [6].

B. Microcontroller

For the microcontroller at the heart of the remote control gateway, the NU MONET team originally intended to use a BeagleBone Black board because of its use in other devices in the NU MONET project and its support for full Debian Linux. However, the team ultimately chose a more basic microcontroller, since only a fraction of the BeagleBone's processing power would be used, its form factor was large, and its power consumption was high. In addition, the team found that the BLE module had limited compatibility with the BeagleBone Black, as the BLE module's Linux libraries did not have the same feature set as the Arduino library. Taking advantage of the strong Arduino background of several members of the NU MONET research team, a Teensy 3.2 module (see Fig. 5, below) was chosen. The board offers four serial ports, two I2C ports, several digital I/O pins, a high clock speed, and low power consumption. These connectivity options enable the BLE module, the XBee module, an Organic Light Emitting Diode (OLED) display, and the Lithium Polymer (LiPo) battery board to all be connected and used concurrently.

Fig. 5. The Teensy 3.2 microcontroller hosts four UART ports, two I2C ports, SPI support, as well as support for many more digital and analog pins, and is used as the routing system’s processor [7].

C. XBee Radio

Consistent with the rest of the NU MONET project, an XBee S3B module is used to transmit and receive information to and from SmartBuoyDuos. This module is seated on a "Sparkfun Regulated XBee Explorer," which provides access to all the XBee's pins, including UART I/O, thus allowing the XBee to be connected to the Teensy (see Fig. 6, below). The XBee module is connected to a 900 MHz dipole antenna, which can be mounted directly on the radio or positioned on a case with an SMA connector extension. The Teensy is responsible for all control and configuration of the XBee module.

Fig. 6. An XBee S3B and Sparkfun Regulated XBee Explorer board are responsible for wireless communication with SmartBuoyDuos, and can be completely configured and controlled from the Teensy [8].

D. I/O and Other Electronics

To fulfill the portability requirement of the remote control gateway, the team integrated a rechargeable LiPo battery that can power the 3.3 V Teensy system for a theoretical 10 hours at an approximately constant current of 100 mA [9]. In addition, the team implemented a LiPo management system (Sparkfun's "Battery Babysitter"), which is responsible for regulating power, reporting the remaining charge of the battery, and charging the battery. This system ensures uninterrupted power: the board's integrated circuit can switch dynamically between the battery and charger, preventing signal loss if a charger is connected while the BLE or XBee module is transmitting. It also provides monitoring of several battery characteristics, including battery voltage, current draw, and battery health, accessible through an I2C interface. A 1 Ah 3.3 V LiPo cell is used for the battery (see Fig. 7).
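The 10-hour figure follows from dividing the battery capacity by the average current draw. The Python sketch below reproduces that arithmetic as an idealized estimate. The function name and the `usable_fraction` parameter are hypothetical additions for illustration, and the calculation ignores regulator losses, battery aging, and variation in current draw.

```python
def runtime_hours(capacity_ah, current_ma, usable_fraction=1.0):
    """Idealized runtime: usable capacity (mAh) divided by average current (mA)."""
    return capacity_ah * 1000.0 * usable_fraction / current_ma


# 1 Ah cell at a constant 100 mA draw, as stated in the text.
hours = runtime_hours(1.0, 100.0)
print(hours)
```

Setting `usable_fraction` below 1.0 gives a more conservative estimate when the full rated capacity is not expected to be available.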

Fig. 7. The "Sparkfun Battery Babysitter" board [10] and 1 Ah LiPo battery [9] can power the router system uninterrupted for 10 hours at a 100 mA current. The Battery Babysitter also regulates power, charges the battery, reports battery status to the Teensy, and can provide uninterrupted power to the system if a USB cable is connected.

Finally, the team decided that a feedback mechanism on the device itself would be necessary, ultimately settling on a micro OLED display, a product distributed with a breakout board from Sparkfun Electronics (see Fig. 8, below). Although it is small (0.66 inches diagonally), this display has a high pixel density and can show several lines of text while drawing minimal power. It is ideal for displaying battery capacity, the mode of the remote control gateway, and the connection status of the Bluetooth radio.

Fig. 8. A Sparkfun Micro OLED display, used to show current mode, battery status, and signal strength [11].

Two latching pushbuttons were also integrated: one to switch between USB and Bluetooth mode, and the other to switch power on the Battery Babysitter board. In addition, two debug LEDs were added to the final circuit board for future testing and expansion.

E. Final Circuit Board

The circuitry was finalized and laid out on a protoboard (see Fig. 9, below), and the assembly was then soldered. The final circuit board includes the Teensy, XBee Explorer, Bluetooth LE module, Battery Babysitter, JST connectors for the mounted power and mode pushbuttons, and a row of header pins to connect a ribbon cable to the OLED display. The micro-USB connector on the Teensy is used for programming, while the identical port on the Battery Babysitter is used for charging. In a future iteration of the board, these two ports could be combined into one by re-routing the power and data lines.

Fig. 9. The final circuit board of the NU MONET router includes all components soldered into place to ensure stability and reliability.

III. REMOTE CONTROL GATEWAY SOFTWARE

The software that drives the remote control gateway system relies on two primary components:

• An iOS application written in Apple's Swift 3 language with the CoreBluetooth framework, which is responsible for sending commands to and receiving data from the Teensy, as well as displaying this information to the user.
• An Arduino application running on the Teensy, which is responsible for interpreting commands from the iOS device, sending status messages back, and handling errors. This relies on the following libraries: XBee-Arduino, SPI, Adafruit Bluefruit, Sparkfun OLED, Sparkfun BQ27441, and Vector.

A. Teensy Program

The Teensy application is written in the Arduino language and uses the XBee-Arduino library and Adafruit's Bluefruit library for Bluetooth LE. The application begins with an initialization cycle of the XBee, BLE UART, OLED display, I/O pins, the Serial monitor (for debugging), and the LiPo board. The Teensy then waits until an iOS device is connected. Once a device is connected, the Teensy checks for incoming data from the Bluetooth module. If the Teensy detects incoming data, the program fills a buffer with the bytes until it reads a semicolon. The Teensy then converts this data into a String and checks it against a list of known commands. These commands include toggling DIO pins connected to relays on the SmartBuoyDuos, getting a list of all available SmartBuoyDuos, and polling remote SmartBuoyDuos for analog sensor data.

The Teensy program uses several objects defined in the XBee-Arduino library: Local AT Command, Remote AT Command, Local AT Response, Remote AT Response, and ZigBee AIO Data Sample Response. If the Teensy receives a known command, it creates a local or remote AT command using the XBee-Arduino library and sends it to the locally connected XBee. Based on the command it receives, the local XBee either acts on the AT command itself or sends the command to a remote XBee. The Teensy then waits for a response indicating success or failure. If the Teensy receives an error message from the XBee, it can send the specific error code to the iOS device.

For some commands, such as control of DIO pins, the process of receiving a command response is relatively simple. Other operations are more involved. Finding neighboring XBees (the discovery process for remote SmartBuoyDuos) employs Local AT Commands, and the remote control gateway must wait a set time-out period to ensure that there are no other neighboring XBees left to be discovered. In addition, the data must be converted to a string, then parsed to find the friendly name and address of the remote XBees. Another example is the polling of analog pins on the XBee in a SmartBuoyDuo. To get a reading from these pins, polling must be enabled, a reading must be taken, and then polling must be disabled. This operation requires several responses in a row to confirm that it succeeded.

The remote control gateway is also responsible for creating brief, parseable messages that the iOS app can interpret. These messages use special delimiter characters such as colons, semicolons, and underscores to separate the pieces of data that the iOS app knows how to parse. When sending the current relay statuses (i.e., reading four digital pins) to the iOS device, for example, colons act as the delimiter between a DIO pin number and the corresponding value, while underscores indicate the start of the next pin being read.
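The relay-status message described above can be sketched as a small encode/decode pair. The example is written in Python for illustration only (the firmware itself is an Arduino program), and the exact wire format is an assumption: the text specifies the colon and underscore roles, while the pin numbering 0 to 3, the integer values, and the trailing semicolon terminator are inferred. The function names are hypothetical.

```python
def encode_relay_status(pin_values):
    """Build a message such as '0:1_1:0;' from a {pin: value} mapping.

    Colons separate a pin number from its value, underscores separate pins,
    and a semicolon (assumed) terminates the message.
    """
    body = "_".join(f"{pin}:{value}" for pin, value in sorted(pin_values.items()))
    return body + ";"


def decode_relay_status(message):
    """Parse a message produced by encode_relay_status back into a dict."""
    if not message.endswith(";"):
        raise ValueError("message is missing the ';' terminator")
    result = {}
    for field in message[:-1].split("_"):
        pin, value = field.split(":")
        result[int(pin)] = int(value)
    return result


# Four relay pins; the on/off values are invented for the example.
status = {0: 1, 1: 0, 2: 0, 3: 1}
message = encode_relay_status(status)
print(message)
```

Keeping the format to single-character delimiters means the receiving side can parse a message with two string splits and no length fields.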

B. iOS Program

An iOS application written in the Swift 3 language uses Apple's CoreBluetooth framework to get data from the remote control gateway and communicate it to the user. This framework uses the well-known Bluetooth LE paradigm of services and characteristics. In the case of the Bluefruit Bluetooth LE module, UART acts as a service, and it has both a TX and an RX characteristic. Another example of a service is the information service on the Bluetooth module, which exposes properties such as module name and firmware revision, among many others. The iOS app works by waiting for the RX characteristic to be updated with new data, and it can publish updates to the TX characteristic, which are then received on the Teensy. The app is also responsible for buffering data, as the UART messages can arrive in randomly sized pieces. The app concatenates these chunks of data by looking for a semicolon end delimiter. The app also handles errors on the XBee side. It defines a custom error type based on Swift's Error protocol, which allows a dictionary of XBee error codes and their corresponding meanings to be integrated deeply into the app. Finally, the app can dynamically update the interface for different Bluetooth and XBee states. The interface is clear and straightforward, displaying all commands front and center for the user (see Fig. 10, below). Custom commands can also be sent from a text field.
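The buffering step described above, in which randomly sized chunks are joined until a semicolon arrives, can be sketched as follows. The example is in Python for illustration only and is not the app's Swift code. The class name and the sample chunks are hypothetical.

```python
class MessageAssembler:
    """Collect arbitrarily sized text chunks and emit complete messages."""

    def __init__(self, terminator=";"):
        self._terminator = terminator
        self._buffer = ""

    def feed(self, chunk):
        """Add a chunk and return the list of messages completed by it."""
        self._buffer += chunk
        # Everything before the last terminator is complete; the remainder
        # (possibly empty) stays in the buffer for the next chunk.
        *complete, self._buffer = self._buffer.split(self._terminator)
        return complete


assembler = MessageAssembler()
first = assembler.feed("0:1_1")       # no terminator yet
second = assembler.feed(":0;2:")      # completes one message, starts another
third = assembler.feed("1;3:0;")      # completes two messages
```

Because the remainder is carried between calls, a message split across any number of updates is delivered exactly once, after its terminator arrives.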

IV. FIELD TESTING

One of the primary differences between the new iOS gateway communication system and the previous method of controlling the NU MONET network (via PC and XBee) is the introduction of Bluetooth LE. This second layer of connectivity on top of the existing XBee link adds another degree of complexity, communication delay, and potential for error. It also offers increased flexibility, untethering the operator from a computer. To test both the reliability of the Bluetooth connection and the maximum range between operator and gateway, the gateway was positioned at one end of a wide open field approximately 300 feet long. At 2-foot increments, the signal strength between the iOS device and the gateway (in dB) was recorded from the iOS device (see Fig. 11, below). It was unnecessary to record XBee signal strength, as this data already exists and is not a new development in the project. The experiment was conducted using an iPhone 6 Plus with Bluetooth 4.0 LE, running the NU MONET Controller app.

Fig. 11. A graph shows the Bluetooth signal strength against the distance walked from the gateway. The trend is logarithmic, as expected when looking at decibel values.

It was concluded from this experiment that at any distance over 50 feet, the Bluetooth signal would begin to disconnect and reconnect, and thus no accurate dB values could be gathered. It is safe to say that the maximum range for the gateway is about 50 feet. While the theoretical Bluetooth LE range is 100 meters, it is not surprising that the gateway system has a significantly lower range, considering the antenna is effectively a trace on the BLE Friend board [6]. Furthermore, the BLE Friend can be configured at a slightly higher transmit power via AT commands, but power consumption increases if this option is used. In the future, the design of the router could adopt a Bluetooth board that allows for an external antenna if signal strength is a priority.
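The logarithmic trend noted for Fig. 11 is what the standard log-distance path loss model predicts: received signal strength falls by a fixed number of decibels each time the distance is multiplied by a constant factor. The Python sketch below illustrates that model only. The reference level, reference distance, and path loss exponent are placeholder assumptions, not values fitted to the measurements in this paper, and the function name is hypothetical.

```python
import math


def expected_rssi_db(distance_ft, rssi_at_ref_db=-50.0, ref_distance_ft=2.0,
                     path_loss_exponent=2.0):
    """Log-distance model: RSSI(d) = RSSI(d0) - 10 * n * log10(d / d0).

    All default parameter values are placeholders chosen for illustration.
    """
    return rssi_at_ref_db - 10.0 * path_loss_exponent * math.log10(
        distance_ft / ref_distance_ft)


# Predicted level at a few distances from the gateway (placeholder parameters).
for d in (2.0, 4.0, 16.0, 50.0):
    print(d, round(expected_rssi_db(d), 1))
```

With an exponent of 2 (free-space propagation), each doubling of distance costs about 6 dB. Fitting the exponent and reference level to recorded data would be the natural next step.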

Fig. 10. A screenshot of the NU MONET iOS app shows the ability to select different remote SmartBuoyDuos, send custom commands, control relays, poll analog pins, and receive responses. In this specific case, one of the analog pins is reading the response from a voltage divider circuit using the SmartBuoyDuo's battery.

V. CONCLUSIONS AND FUTURE PLANS

This paper presented the design of a system for the mobile control of an underwater acoustic network. This development entailed the design of hardware, embedded controller firmware, and a mobile app. While developing this system, a number of issues arose with the interfacing of multiple protocols, specifically in the design of the Teensy firmware's handling of XBee and Bluetooth protocols. If the project were started again, it would be advisable to look at options other than procedural C++. This work demonstrates that, while it is possible to design firmware that manages two different kinds of radios without multi-threaded code, it is not necessarily ideal. While this iteration of the project is usable in the field, there are still a number of areas where the reliability, efficiency, and user experience of the iOS controller and modem can be improved. Plans for future development include the following:

• Enhance error handling in both the iOS and Teensy software components
• Increase the number of commands that can be sent, and present them in an intuitive way
• Streamline the iOS user interface with emphasis on modularity and expandability
• Modify the Swift code to handle XBee commands and responses as objects
• Improve user friendliness and optimize the app for different sized devices
• Create a custom fabricated PCB to reduce size and increase reliability

This research contributes to the greater NU MONET project by providing a new method for communicating with the underwater network. This technology can be used by other researchers to refine existing protocols or design entirely new ones, tailored to everything from defense, to high-speed telecom-grade communication, to plate tectonics research. By combining the new iOS controller and modem with existing NU MONET technology, the project has become more accessible and easier for researchers to use.

ACKNOWLEDGMENT AND AUTHOR BIOGRAPHIES

This project was supported in part by grant NSF CNS 1428567. Andrew Fish was supported by an NSF REU supplement and by a Northeastern University Office of Undergraduate Research and Creative Endeavor grant. Finally, Andrew Fish wishes to thank Dr. Yashar Aval and Dr. Stefano Basagni for their constant support, encouragement, and mentorship.

Andrew Fish is an undergraduate studying Computer Engineering at Northeastern University with intent to graduate in 2021. He serves as workshop coordinator for Northeastern's IEEE chapter, and is completing a cyber engineering co-op at Raytheon Cyber in Austin, TX. In his free time, Andrew enjoys beekeeping, cinema, and live music. He can be reached at fish.a@husky.neu.edu.

Dr. Yashar Aval is a post-doctoral researcher for the NU MONET project at Northeastern University with a concentration in signal processing. In his free time, Yashar enjoys riding his motorcycle and outdoor adventures in New England. He can be reached at y.aval@northeastern.edu.

Dr. Stefano Basagni is an associate professor in the Electrical and Computer Engineering Department at Northeastern University. Dr. Basagni is esteemed in the wireless networking community, has been published numerous times, and has received awards from Northeastern University. In his free time, Dr. Basagni enjoys art and travel. He can be reached at basagni@ece.neu.edu.


Gaussian Mixture Models for Dynamic Malware Clustering
Alexander M. Interrante-Grant, Student, David Kaeli, Professor
Department of Electrical and Computer Engineering, Northeastern University, Boston, MA

Abstract: As the number of unique malware samples grows at a rapidly increasing rate, analysts are having trouble tracking the evolution of existing malware and identifying new malware in an ever-changing threat landscape. It can take hours for a malware analyst to evaluate a single sample, so analysts are increasingly turning to fast, automated malware analysis to identify trends across, and attributes of, newly observed malware. One goal is to identify the author of a new piece of malware. Various approaches have been proposed that apply machine learning to concise program representations. Because ground truth labels for malware samples are notoriously difficult to find, most machine learning approaches rely on unsupervised learning (i.e., clustering) methods. In this paper, a number of recently proposed clustering approaches using co-occurrence matrices of system calls are evaluated. In addition to applying previously proposed clustering algorithms to this program representation, this work applies Gaussian mixture model (GMM) clustering, an approach that had not been evaluated by the research community for dynamic malware clustering. Our results show that GMM clustering outperforms other clustering approaches, achieving a Fowlkes-Mallows score two times better than the state of the art on a dataset of real-world malware curated from the VirusShare malware corpus.

I. INTRODUCTION
Large-scale malware analysis continues to grow more challenging, given that the number of unique malware signatures grows exponentially. As a result of this growth, manual malware analysis is becoming infeasible, requiring an unbounded amount of an analyst's time. Instead, researchers have turned to automated analysis methods to characterize patterns in large corpora of malware and reduce analysis time. Previous work has shown these automated methods to be useful in identifying similarities and trends among malware samples that cannot feasibly be identified by manual analysis. These insights can help to more quickly identify and mitigate newly emerging threats and potentially lead to attribution of new malware to the particular actors involved in creating it.

This project evaluates a number of recently published clustering approaches on a dynamic program representation derived from co-occurrence matrices of system calls, which is discussed in more detail in later sections. Clustering approaches include k-means and agglomerative clustering using Jaccard, Euclidean, and Hamming distance metrics, as well as Gaussian Mixture Model (GMM) clustering. These approaches are evaluated on a corpus of malware from VirusShare, labeled by majority voting among various commercial antivirus solutions. This provides a side-by-side comparison of automated malware analysis methods, assessing their relative merits and detriments in assisting malware analysts on a single corpus of real-world malware.

II. DYNAMIC PROGRAM REPRESENTATION
A. Background and Previous Work
Behavior-based, dynamic program analysis has become a growing need in the malware research and program analysis communities. Dynamic program analysis is a comparatively rich data source when contrasted with static program analysis. Dynamic program analysis necessarily reveals program artifacts (files modified, system configuration changes, etc.) that may be obfuscated (as is often the case with malware) or difficult to discern with conventional static analysis methods. However, because of the scale and quantity of dynamic program artifacts, concise dynamic program representations and automated analysis techniques are required to efficiently compare and contrast different programs.

A number of dynamic program representations have been described by prior work. Some of these approaches have shown promise in condensing dynamic program representations to a more concise format while losing only minimal semantic meaning. Jang et al. proposed a method of hashing program features into a binary vector on which traditional mathematical similarity measures, like Jaccard distance, can be applied [1]. Shu et al. proposed a matrix representation of call (function call, syscall, or otherwise) ordering and, again, used simple mathematical matrix comparison operations to discern similarity [2]. Fredrikson et al. proposed an efficient graph-based representation of dynamic malware behavior [3]. Rieck et al. used a representation which they name the Malware Instruction Set (MIST), based on a computer instruction set [4].

B. Implementation
This project focuses on a representation of dynamic program behavior based on co-occurrence matrices of system calls introduced by Shu et al. [2]. This representation was used to determine a measure of similarity between the behavior of two malware samples, expanding on Shu et al.'s original work, which used this representation purely to distinguish malware from benign applications. Further, this program representation can be trivially converted to a set of features for a machine learning algorithm, and was used as input for GMM clustering.

Shu et al.'s program representation consists of two matrix primitives which can be computed from a trace of "calls". A "call" here could be at any layer of abstraction and could represent anything from specific functions called by the program, to specific operating system API functions, to system calls. Shu et al. define a co-occurrence matrix as an $m \times m$ binary matrix $O$, where

$$o_{i,j} = \begin{cases} \text{True} & \text{if call } i \text{ occurred before call } j \\ \text{False} & \text{otherwise} \end{cases}$$

Further, they define an occurrence frequency matrix as an $m \times m$ matrix $F$, where $f_{i,j} = \text{count}(\text{occurrences of call } i \text{ before call } j)$ [2].

As an example, assume a program makes the following series of Windows system calls (representing a common file read operation):

NtOpenFile
NtSetInformationFile
NtReadFile
NtReadFile
NtReadFile
NtWriteFile
NtCloseFile

This program can be represented as the co-occurrence matrix depicted in Figure 1 or the occurrence frequency matrix shown in Figure 2.

Using these representations, a trace of system calls can be converted into a matrix. Then, simple mathematical operations can be performed directly on the matrix for clustering. This paper used system calls as the atomic unit of the co-occurrence and occurrence frequency matrices. Since system calls are the fundamental method by which unprivileged user-mode applications access critical machine resources through the kernel, a trace of system calls will necessarily contain all of the potentially security-related behavior that a program exhibits. Additionally, system calls can be easily traced with minimal overhead for a running process.
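As a rough illustration of the two matrix primitives, the sketch below builds both from the example trace. It assumes that "before" means "immediately before", which is how the captions of Figures 1 and 2 read; the function name `build_matrices` and the alphabetical row/column ordering are illustrative choices, not part of Shu et al.'s definition.

```python
def build_matrices(trace):
    """Return (calls, O, F) for a list of call names.

    Assumption: call i "occurs before" call j when i is immediately
    followed by j in the trace (adjacent pairs only).
    """
    calls = sorted(set(trace))                     # fixed row/column order
    index = {name: k for k, name in enumerate(calls)}
    m = len(calls)

    # Occurrence frequency matrix F: count each adjacent pair (i, j).
    F = [[0] * m for _ in range(m)]
    for first, second in zip(trace, trace[1:]):
        F[index[first]][index[second]] += 1

    # Co-occurrence matrix O: True wherever the pair was seen at least once.
    O = [[count > 0 for count in row] for row in F]
    return calls, O, F


trace = [
    "NtOpenFile", "NtSetInformationFile",
    "NtReadFile", "NtReadFile", "NtReadFile",
    "NtWriteFile", "NtCloseFile",
]
calls, O, F = build_matrices(trace)
idx = {name: k for k, name in enumerate(calls)}

print(F[idx["NtReadFile"]][idx["NtReadFile"]])            # prints 2
print(O[idx["NtOpenFile"]][idx["NtSetInformationFile"]])  # prints True
print(O[idx["NtOpenFile"]][idx["NtOpenFile"]])            # prints False
```

The three printed cells correspond to the cells called out in the figure captions.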

Figure 2. Occurrence frequency matrix for a given trace of system calls. Note that the cell for NtReadFile:NtReadFile is a two in this case since a call to NtReadFile is immediately followed by another call to NtReadFile in the call trace.

III. CLUSTERING METHODS
A. Background and Previous Work
Machine learning concepts have been applied with some success to the problem of dynamic program similarity. Rieck et al. clustered similar malware and classified unknown malware by embedding n-grams of MIST instructions in a high-dimensional vector space [4]. In addition to simple mathematical analyses, Shu et al. applied a one-class support vector machine to their call frequency matrix program model to detect anomalous malware behaviors [2]. Other projects, although not strictly machine learning approaches, have used traditional distance metrics to determine malware similarity. Jaccard distance, which is derived from Jaccard similarity (the size of the intersection of two sets divided by the size of their union), is very widely used in the malware analysis research community to determine similarity between malware samples [1, 5].

Figure 1. Co-occurrence matrix for the given trace of system calls. Note, as an example, that the cell for NtOpenFile:NtSetInformationFile is a one because a call to NtOpenFile occurs before a call to NtSetInformationFile. Similarly, the cell for NtOpenFile:NtOpenFile is a zero because there does not exist a call to NtOpenFile immediately followed by another call to NtOpenFile in the call trace.

Shu et al. attempted to solve a classification problem in their work by using a Support Vector Machine (SVM) classifier. Instead, this work attempts to cluster malware by behavioral similarity. While malware classification can be useful for identifying malware and differentiating malware execution from expected system behavior, clustering can be useful for identifying commonalities between particular samples of malware and can be used for identification and attribution of new malware. To test the effectiveness and generalizability of this program representation, this project examined three clustering algorithms along with three distance metrics to see which performed the best. Selected algorithms and metrics are discussed in the next section.

B. Implementation
The three clustering methods evaluated using the previously mentioned program representation as input data were:

- K-Means Clustering clusters samples by alternating between assigning each sample to the cluster whose centroid (or mean) is nearest and recalculating cluster centroids based on the new assignment, until convergence [6].
- Agglomerative Clustering is a form of hierarchical clustering where clusters are recursively merged based on minimizing some distance metric until the desired number of clusters is reached [7].
- Gaussian Mixture Model (GMM) Clustering clusters samples by estimating the parameters of the distribution of each class, assuming those distributions are normal [8].

Table 1. Selected clustering algorithms and corresponding distance metrics.

| Algorithm | Distance Metric |
|---|---|
| K-Means | Jaccard |
| K-Means | Hamming |
| K-Means | Euclidean |
| Agglomerative | Jaccard |
| Agglomerative | Hamming |
| Agglomerative | Euclidean |
| GMM | N/A |
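To make the three distance metrics in Table 1 concrete, the following sketch computes each of them between two small binary co-occurrence matrices that have been flattened into vectors. The matrices are made-up toy values, and the helper names are hypothetical. Hamming distance is returned here as a raw count of differing positions; some libraries instead report it as a fraction of the vector length.

```python
import math


def flatten(matrix):
    """Collapse a square matrix into a flat list of 0/1 integers."""
    return [int(cell) for row in matrix for cell in row]


def jaccard_distance(a, b):
    """One minus |intersection| / |union| of the positions set to 1."""
    both = sum(1 for x, y in zip(a, b) if x and y)
    either = sum(1 for x, y in zip(a, b) if x or y)
    return 0.0 if either == 0 else 1.0 - both / either


def hamming_distance(a, b):
    """Number of positions at which the two vectors differ."""
    return sum(1 for x, y in zip(a, b) if x != y)


def euclidean_distance(a, b):
    """Square root of the sum of squared coordinate differences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


# Two toy 3 x 3 co-occurrence matrices (illustrative values only).
sample_a = flatten([[0, 1, 0], [0, 0, 1], [0, 0, 0]])
sample_b = flatten([[0, 1, 0], [0, 1, 1], [0, 0, 0]])

print(jaccard_distance(sample_a, sample_b))
print(hamming_distance(sample_a, sample_b))    # prints 1
print(euclidean_distance(sample_a, sample_b))  # prints 1.0
```

The two toy samples share two set cells out of three set in either one, so the Jaccard distance is one third, while the single differing cell gives a Hamming distance of 1.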

To evaluate each of the distance-based clustering algorithms (k-means and agglomerative clustering), distances were computed between co-occurrence matrices, which avoids the problem of normalizing occurrence frequency matrices across samples. GMM, however, can be performed directly on the occurrence frequency matrix of a sample by treating each pair of system calls as a feature, effectively collapsing the occurrence frequency matrix into a vector of features. Feature selection is necessary here to reduce the requisite computational resources, so a simple minimum variance threshold was used.
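The flattening and minimum variance selection described above might look like the sketch below. The toy matrices, the threshold of 0.5, and the helper names are assumptions for illustration; the paper does not state the threshold it used. The reduced vectors would then be handed to a GMM implementation, for example scikit-learn's `GaussianMixture`.

```python
from statistics import pvariance


def flatten(matrix):
    """Treat every (call i, call j) cell as one feature."""
    return [float(value) for row in matrix for value in row]


def select_by_variance(samples, threshold):
    """Keep only feature columns whose population variance exceeds threshold.

    Returns the indices of the kept columns and the reduced samples.
    """
    columns = list(zip(*samples))
    keep = [j for j, column in enumerate(columns) if pvariance(column) > threshold]
    reduced = [[row[j] for j in keep] for row in samples]
    return keep, reduced


# Three toy 2 x 2 occurrence frequency matrices (illustrative values only).
freq_matrices = [
    [[0, 2], [1, 0]],
    [[0, 5], [1, 0]],
    [[0, 3], [1, 1]],
]
samples = [flatten(matrix) for matrix in freq_matrices]
keep, reduced = select_by_variance(samples, threshold=0.5)

print(keep)     # prints [1]
print(reduced)  # prints [[2.0], [5.0], [3.0]]
```

Only the second feature varies enough across the three toy samples to survive; constant or nearly constant call pairs carry little information for separating families and are dropped.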

For all of the above clustering methods, the number of clusters was given as the number of unique malware families in the test corpus. While both k-means and agglomerative clustering have been widely used in the malware analysis literature, GMM clustering has not been applied to this type of data and, to the best of our knowledge, is a new contribution from this paper. With k-means and agglomerative clustering, the following distance metrics were evaluated:

- Jaccard Distance measures the distance between any two sets and is the complement of Jaccard similarity (one minus the similarity), where Jaccard similarity is defined as the cardinality of the sets' intersection divided by the cardinality of their union [9].
- Hamming Distance measures the distance between any two ordered sets as the number of positions in which they differ [10].
- Euclidean Distance measures the distance between two n-dimensional vectors as the square root of the sum of squared coordinate differences [11].

All selected algorithms and corresponding distance metrics are summarized in Table 1. These algorithms were evaluated using scikit-learn's implementations [12].

IV. EVALUATION AND RESULTS
A. Dataset

To evaluate the effectiveness of each algorithm and distance metric, a large corpus of malware was required. Most of the papers referred to previously either did not specify the dataset used for testing or referenced a private corpus of malware maintained by the authors. Given that no publicly available, well-labeled corpus of malware is commonly used for research projects like this one, a corpus was curated as part of this project from some of the most recently uploaded Windows malware samples on VirusShare [13]. After removing samples that would not run due to compatibility issues with our target analysis environment, the remaining 159 samples were executed in a sandbox environment. System call traces were gathered using DrStrace [14] and converted into co-occurrence and occurrence frequency matrix representations. Ground truth labels were generated by a majority voting scheme over popular commercial antivirus products using VirusTotal [15], resulting in a total of ten unique malware families. The antivirus labels of this dataset and the number of samples carrying each label are listed in Table 2.

Table 2. Malware test corpus composition by antivirus label.

| Antivirus Label | Count |
|---|---|
| Backdoor.Turkojan.DQ | 47 |
| Gen:Variant.FakeAV.21 | 22 |
| Trojan.Jevafus.A | 21 |
| Trojan.Crypt.Delf.G | 14 |
| Trojan.Spy.BZub.NHN | 14 |
| Trojan.SMSHoax.X | 13 |
| Backdoor.PcClient.TEV | 9 |
| Backdoor.Optix.Pro.1.3.Dam.2 | 7 |
| Gen:Variant.Graftor.18403 | 6 |
| Gen:Variant.Graftor.Elzob.14614 | 6 |
| Total | 159 |

B. Results
Clustering was performed on the malware dataset described in Table 2 using the algorithms and distance metrics listed in Table 1. Relative performance was calculated in terms of the following three metrics:

Figure 3. ARI performance of clustering algorithms. GMM clustering significantly outperforms all other techniques when considering the number of samples assigned to the same or different clusters between the produced clustering and the true labels, adjusted for chance.

- Adjusted Rand Index (ARI) measures similarity between two clusterings by counting the pairs of samples that are assigned to the same or different clusters, adjusted for chance [16]. It ranges from -1 to 1; higher values indicate greater similarity.
- Adjusted Mutual Information Index (AMI) measures the similarity between two clusterings as the relative entropy of each clustering, adjusted for chance [17]. It ranges from -1 to 1; higher values indicate greater similarity.
- Fowlkes-Mallows Index (FMI) measures the similarity between two clusterings as the geometric mean of precision and recall [18]. It ranges from 0 to 1; higher values indicate greater similarity.

Table 3. Clustering performance results.

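| Algorithm | Distance Metric | ARI | AMI | FMI |
|---|---|---|---|---|
| K-Means | Jaccard | -0.0037 | -0.0085 | 0.1236 |
| K-Means | Hamming | 0.0107 | -0.0251 | 0.1724 |
| K-Means | Euclidean | -0.0031 | -0.0229 | 0.1527 |
| Agglomerative | Jaccard | 0.0007 | -0.0240 | 0.1453 |
| Agglomerative | Hamming | 0.0107 | -0.0251 | 0.1724 |
| Agglomerative | Euclidean | -0.0033 | -0.0222 | 0.1479 |
| GMM | N/A | 0.1953 | 0.4018 | 0.4054 |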

The results are depicted in Table 3 along with Figure 3, Figure 4, and Figure 5.
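For readers who want to see how two of these scores are derived from pair counts, the following dependency-free sketch computes ARI and FMI from a pair of label lists. It is illustrative only: the paper's numbers came from scikit-learn's implementations [12], which expose these scores as `adjusted_rand_score` and `fowlkes_mallows_score`. AMI is omitted because it additionally requires the expected mutual information under a random model. The label lists are toy values.

```python
from collections import Counter
from math import comb, sqrt


def pair_counts(true_labels, pred_labels):
    """Count sample pairs placed together in both, in truth, and in prediction."""
    contingency = Counter(zip(true_labels, pred_labels))
    same_both = sum(comb(c, 2) for c in contingency.values())
    same_true = sum(comb(c, 2) for c in Counter(true_labels).values())
    same_pred = sum(comb(c, 2) for c in Counter(pred_labels).values())
    return same_both, same_true, same_pred, comb(len(true_labels), 2)


def fowlkes_mallows(true_labels, pred_labels):
    """Geometric mean of pairwise precision and pairwise recall."""
    same_both, same_true, same_pred, _ = pair_counts(true_labels, pred_labels)
    if same_true == 0 or same_pred == 0:
        return 0.0
    return same_both / sqrt(same_true * same_pred)


def adjusted_rand(true_labels, pred_labels):
    """Pair agreement corrected by its expected value under chance."""
    same_both, same_true, same_pred, total = pair_counts(true_labels, pred_labels)
    if total == 0:
        return 1.0
    expected = same_true * same_pred / total
    maximum = (same_true + same_pred) / 2
    if maximum == expected:
        return 1.0
    return (same_both - expected) / (maximum - expected)


truth = [0, 0, 0, 1, 1, 1]
found = [0, 0, 1, 1, 2, 2]
print(round(adjusted_rand(truth, found), 4))    # prints 0.2424
print(round(fowlkes_mallows(truth, found), 4))  # prints 0.4714
```

Both scores depend only on which samples share a cluster, so relabeling the clusters (for example swapping cluster 0 and cluster 1) leaves them unchanged.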

C. Analysis
These results indicate that the clustering algorithms and distance metrics widely used in prior work for program similarity in malware analysis do not perform very well when tested on this VirusShare dataset. In contrast, a clustering technique which has not yet been widely used in this context, GMM clustering, performs significantly better in terms of all three metrics.

Figure 4. AMI performance of clustering algorithms. GMM clustering again significantly outperforms all other techniques when considering the relative entropy between the produced clustering and the true labels, adjusted for chance.

Figure 5. FMI performance of clustering algorithms. GMM clustering provides a more than two times performance increase when considering the precision and recall of the produced clustering as compared to the true labels.

It is worth noting that these results are constrained to this particular program representation (co-occurrence and occurrence frequency matrices of system calls) and, when the mathematics behind each clustering approach are taken into consideration, this result is exactly as expected. Because of the nature of how system calls are used during the execution of a particular program, there are a number of system calls whose occurrence is likely very highly correlated with others. For example, if a program calls the NtOpenFile system call, it is very likely that it will then call NtReadFile or NtWriteFile immediately afterward, to operate on the file that it just opened. GMM clustering is well equipped to handle this covariance since it attempts to estimate the probability densities of each parameter directly. K-means and agglomerative clustering, on the other hand, weigh each feature equally, which is likely detrimental to clustering performance.

In terms of required computational resources, both k-means and agglomerative clustering approaches were significantly less expensive than GMM clustering given the incredibly high dimensionality of the data. Across all executed malware samples, 408 unique system calls were observed, resulting in $408^2 = 166{,}464$ total features. To make GMM clustering feasible, a simple minimum variance feature selection was used to reduce the dimensionality of the data. This relatively simple approach could likely be improved by using a more sophisticated dimensionality reduction technique (e.g., principal component analysis). At a higher level of abstraction (e.g., Windows API calls, MIST instructions, or some other semantic interpretation of program behavior), distance-based clustering approaches may actually be more feasible as the dimensionality increases and the covariance between features decreases.

V. LIMITATIONS
A. Dataset Labels
Antivirus labels are notoriously lacking in substantive information about the behavior of individual malware samples and are often arbitrary, selected at the whim of the analyst dissecting the sample. In many cases the analyst assigns the malware to a class simply based on the suspected author of a piece of malware rather than its actual behavior. Previous research has evaluated the performance of antivirus labels in comparing malware families and found them to be significantly lacking [19, 20, 21]. Unfortunately, there is no better source of ground truth labels for real-world malware. While some well-labeled synthetic malware datasets do exist, many are not publicly available and the results using such a dataset are unrealistic. Gathering a corpus of malware with high-quality behavior-based labels remains a difficult problem. To some extent, this may account for the mediocre performance of most clustering approaches in this domain.

B. Cluster Poisoning
One major problem with all machine learning approaches to security problems is their susceptibility to an adversary with full knowledge of the algorithm and training state. By inserting specifically designed behavioral chaff in the code of a piece of malware, the attacker can alter the clustering and classification decision, potentially evading detection or attribution entirely and reducing the effectiveness of malware analysts' tools. This approach has been demonstrated by Biggio et al. [22].

VI. CONCLUSION
This work evaluated the effectiveness of Gaussian Mixture Models (GMMs) for behavioral malware clustering using occurrence frequency matrices of system calls. The results show that GMM clustering significantly outperforms widely used distance-based clustering algorithms such as k-means and agglomerative clustering on a dataset of recent, real-world malware using antivirus labels as ground truth. This assessment of existing malware clustering approaches, together with the proposal of GMM clustering as a novel contribution of this paper, shows that automated analysis methods can be very effective in identifying trends and correlations in and between various families of malware. These types of automated analysis methods help malware analysts stay ahead of the ever-growing body of malware active on the internet and ultimately lead to a more secure internet.

ACKNOWLEDGEMENTS
Special thanks to Ryan Whelan for contributing to the initial research question and providing technical insight and support throughout this work.

AUTHOR BIOGRAPHIES
Alexander M. Interrante-Grant is a recent Northeastern EECE alumnus (class of 2018) and is presently at MIT Lincoln Laboratory, researching automated static and dynamic malware analysis. During his time at Northeastern, he participated in various cybersecurity-related research projects at the university and served as the captain of the Northeastern Cyber Defense Team. This work began through a directed study opportunity and was completed as a final project for Machine Learning (EECE 5644) during his final undergraduate semester at Northeastern. The referenced work was performed in collaboration with MIT Lincoln Laboratory and the Northeastern TeSCASE group. He can be reached at alexander.interrante-grant@ll.mit.edu.

David Kaeli received a BS and PhD in Electrical Engineering from Rutgers University, and an MS in Computer Engineering from Syracuse University. He is the Associate Dean of Undergraduate Programs in the College of Engineering and a Full Professor on the ECE faculty at Northeastern University, Boston, MA. He is the director of the Northeastern University Computer Architecture Research Laboratory (NUCAR). Prior to joining Northeastern in 1993, Kaeli spent 12 years at IBM, the last 7 at T.J. Watson Research Center, Yorktown Heights, NY. Dr. Kaeli has published over 300 critically reviewed publications, 7 books, and 11 patents. His research spans a range of areas from microarchitecture to back-end compilers and database systems. His current research topics include hardware security, graphics processors, virtualization, heterogeneous computing, and multi-layer reliability. He serves as an Associate Editor of the IEEE Transactions on Parallel and Distributed Systems, ACM Transactions on Architecture and Code Optimization, and the Journal of Parallel and Distributed Computing. Dr. Kaeli is an IEEE Fellow and a Distinguished Scientist of the ACM. He can be reached at kaeli@ece.neu.edu.

REFERENCES
[1] Jiyong Jang, David Brumley, and Shobha Venkataraman. “BitShred: Feature Hashing Malware for Scalable Triage and Semantic Analysis.” Proceedings of the 19th ACM Conference on Computer and Communications Security.
[2] Xiaokui Shu, Danfeng Yao, and Naren Ramakrishnan. “Unearthing Stealthy Program Attacks Buried in Extremely Long Execution Paths.” Proceedings of the 22nd ACM SIGSAC Conference on Computer and Communications Security.
[3] Matt Fredrikson, et al. “Synthesizing Near-Optimal Malware Specification from Suspicious Behaviors.” Proceedings of the 2010 IEEE Symposium on Security and Privacy.
[4] Konrad Rieck, et al. “Automatic Analysis of Malware Behavior Using Machine Learning.” Journal of Computer Security, vol. 19, no. 4, Dec. 2011.
[5] Dhilung Kirat and Giovanni Vigna. “MalGene: Automatic Extraction of Malware Analysis Evasion Signature.” Proceedings of the 22nd ACM SIGSAC Conference on Computer and Communications Security.
[6] J. MacQueen. “Some methods for classification and analysis of multivariate observations.” Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, vol. 1, 1967.
[7] Lior Rokach and Oded Maimon. “Clustering methods.” Data Mining and Knowledge Discovery Handbook. Springer US, 2005.
[8] A. P. Dempster, N. M. Laird, and D. B. Rubin. “Maximum Likelihood from Incomplete Data via the EM Algorithm.” Journal of the Royal Statistical Society, vol. 39, no. 1, 1977.
[9] Elena Deza and Michel Marie Deza. “Encyclopedia of Distances.” Springer US, 2009. p. 299.
[10] R. W. Hamming. “Error detecting and error correcting codes.” The Bell System Technical Journal, vol. 29, no. 2, Apr. 1950.
[11] Elena Deza and Michel Marie Deza. “Encyclopedia of Distances.” Springer US, 2009. p. 94.
[12] scikit-learn, “scikit-learn: Machine Learning in Python.” Oct. 2017.
[13] virusshare.com, “Virusshare.com.” 2017. [Online]. Available: http://virusshare.com/. [Accessed: Dec. 2017].
[14] Dr. Memory, “Dr. Memory: System Call Tracer for Windows.” Aug. 2016.
[15] virustotal.com, “Virustotal.com.” [Online]. Available: https://www.virustotal.com/.
[16] Lawrence Hubert and Phipps Arabie. “Comparing Partitions.” Journal of Classification, vol. 2, no. 1, Dec. 1985.
[17] Nguyen Xuan Vinh, Julien Epps, and James Bailey. “Information Theoretic Measures for Clusterings Comparison: Variants, Properties, Normalization and Correction for Chance.” Journal of Machine Learning Research, vol. 11, Oct. 2010.
[18] E. B. Fowlkes and C. L. Mallows. “A Method for Comparing Two Hierarchical Clusterings.” Journal of the American Statistical Association, vol. 78, no. 383, Sep. 1983.
[19] Domhnall Carlin, et al. “The Effects of Traditional Anti-Virus Labels on Malware Detection using Dynamic Runtime Opcodes.” IEEE Access, vol. 5, Sep. 2017.
[20] Alex Kantchelian, et al. “Better Malware Ground Truth: Techniques for Weighting Anti-Virus Vendor Labels.” Proceedings of the 8th ACM Workshop on Artificial Intelligence and Security.
[21] Aziz Mohaisen and Omar Alrawi. “AV-Meter: An Evaluation of Antivirus Scans and Labels.” Proceedings of the 2014 International Conference on Detection of Intrusions and Malware and Vulnerability Assessment.
[22] Battista Biggio, et al. “Poisoning Behavioral Malware Clustering.” Proceedings of the 2014 Workshop on Artificial Intelligence and Security.

A Fast Parallel Level Set Segmentation Algorithm for 3-D Images
Benjamin Trapani, Student, Julian Gutierrez, Student, David Kaeli, Professor
Dept. of Electrical and Computer Engineering, Northeastern University, Boston, MA

Abstract: Image segmentation is one of the key analysis tools in biomedical imaging applications. Although level set segmentation algorithms have been explored thoroughly in the past, these approaches are non-scalable due to their inherent data dependencies. Algorithms with large corresponding data dependency graphs that contain many small cycles are difficult to parallelize, which prevents them from effectively leveraging modern highly parallel compute devices. Given that the resolution of medical imaging hardware has continued to increase each year, and CPU performance has not kept pace, there is a need to explore parallel solutions for processing medical images. Prior work described an efficient level set segmentation algorithm designed for parallel architectures for segmenting 2-D images. The algorithm segments an input image into four components based on an initial curve. In this work, the prior 2-D level set segmentation algorithm is extended to provide a solution for 3-D images. The algorithm is adapted to examine adjacent voxels at each step rather than adjacent pixels. The initial curve is a user-provided sphere that is defined parametrically to reduce copy overhead to the compute device. The efficiency of the 2-D algorithm is preserved in the conversion, enabling the resulting algorithm to perform on the order of ten times faster than existing GPU-accelerated 3-D level set segmentation implementations. The implementation presented in this work supports real-time segmentation of 7T MRI images, leveraging the computational power of an NVIDIA Tesla K20 GPU to reduce execution time. This image segmentation algorithm supports identification of tumors, tissue volume measurements, and surgery planning at the rate required by radiologists today.

I. INTRODUCTION Image segmentation is the process of separating an input image into a set of disjoint components, thereby aiding analysis and simplifying further processing. Efficient image segmentation is critical to the delivery of real-time computer vision systems which require low and deterministic latency, and high throughput. One field that demands accurate and efficient image segmentation is medicine. Biomedical imaging applications use image segmentation algorithms to enable radiologists and doctors to isolate tumors, tissue classes and organs for further analysis. In a typical 8-hour workday, radiologists are required to analyze approximately one image

every four seconds [1]. Performing an accurate analysis under this time constraint is difficult, especially when existing image segmentation systems are not capable of processing the image under analysis in four seconds or less. The typical workflow for the user consists of uploading raw images into the system, viewing a raw image, selecting regions of interest in the image and running the image segmentation algorithm to gain detailed regions that highlight the desired biological features. The result of the segmentation process is a set of refined spatial regions in the image that are likely part of the regions of interest specified initially. The initial regions of interest are specified using two pieces of data: a shape drawn in the input image, and an intensity range. Both properties of the input range serve as hints to the image segmentation system about the location of the desired biological features. The image segmentation system should extract detailed connected regions such that the extracted regionsâ&#x20AC;&#x2122; intensity ranges mostly match the corresponding input range and the shape mostly matches the corresponding input shape. Parts of the input image included in the segmented region that do not fall within the input shapes or do not fall within the input intensity ranges should only be selected because they are connected to a shape that is definitely part of the desired region. Existing implementations that segment 3-D images do so by numerically solving partial differential equations [2]. These algorithms fail to achieve realtime performance on larger inputs due to data dependencies and high computational requirements. Previous research by Gutierrez et al. [3] details an algorithm for image segmentation that is optimized for data-parallel architectures. A 2-D implementation of the algorithm was presented, and its performance evaluated using the Compute Unified Device Architecture (CUDA) run on a NVIDIA graphics processing unit (GPU). 
Gutierrez et al. improved the performance by at least 5.5 times over each previous implementation benchmarked. Their algorithm evolves an input set of curves given an intensity range by examining adjacent pixels and modifying the state of each central pixel based on the state of adjacent pixels. Extending the algorithm to three dimensions is a logical next step from the prior work, but poses implementation challenges, particularly in preserving the performance of the 2-D implementation. The performance goal guiding the development and optimization of the 3-D version is meeting or exceeding the per-pixel throughput of the 2-D algorithm, which is 2.6 x 10^8 pixels/s [3].

Although modern data-parallel computer architectures, such as GPUs, can provide high throughput, maximizing the performance of a GPU implementation requires tuning of the associated memory access patterns. The prior implementation [2] coalesces all global memory accesses by reordering input data on the CPU using Morton codes for each pixel, resulting in adjacent pixels in 2-D space being collocated in a one-dimensional data array on the GPU. Optimizing memory accesses in 3-D space when using a one-dimensional data array can be accomplished by reordering the input data per the 3-D Morton code at each index [4]. However, performing this reordering introduces an additional performance cost that is linear in the number of input voxels. A voxel is a 3-D cube analogous to a pixel in 2-D space. To offset the performance cost of the initial reordering, Equation 1, seen below, must be satisfied. M_b and M_a are the relative memory bandwidths before and after the optimizations, respectively. V_r and V_i are the number of voxels read from memory and the number of voxels in the input image, respectively.

$$\frac{V_r}{M_b} \geq V_i + \frac{V_r}{M_a} \tag{1}$$

Since typical medical images can be segmented with approximately twenty full reads from global memory (see results), other optimizations are pursued due to the relative difficulty of improving global memory efficiency by more than 5%. These optimizations are further necessitated by and adapted for the order of magnitude growth in data size due to the additional image dimension and rapidly increasing medical image sizes. This paper presents a careful performance analysis of the adaptation of the algorithm to 3-D space.
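The break-even point implied by this reasoning can be checked numerically. The sketch below assumes that Equation 1 reads V_r / M_b >= V_i + V_r / M_a, and that a segmentation needing n full reads has V_r = n * V_i; the function names are hypothetical and are not part of the implementation described in this paper.

```python
def reordering_pays_off(v_read, v_input, m_before, m_after):
    """Evaluate Equation 1, read as V_r / M_b >= V_i + V_r / M_a (assumed form)."""
    return v_read / m_before >= v_input + v_read / m_after


def break_even_gain(full_reads):
    """Smallest ratio M_a / M_b at which the reordering cost is recovered.

    With V_r = full_reads * V_i and M_b = 1, Equation 1 becomes
    full_reads >= 1 + full_reads / g, so g >= full_reads / (full_reads - 1).
    """
    if full_reads <= 1:
        raise ValueError("a single read can never amortize the reordering")
    return full_reads / (full_reads - 1)


if __name__ == "__main__":
    # Twenty full reads need a bandwidth gain of a little over 5% to break even.
    print(f"break-even gain for 20 reads: {break_even_gain(20):.4f}")
```

Under these assumptions, twenty full reads require a relative bandwidth gain of 20/19, or about 5.3%, which is consistent with the 5% figure quoted above.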

II. ALGORITHM

A. Definitions

The level set segmentation algorithm uses a 3-D array of the same shape as the input image to store the current state of each voxel in the original image. The coded state of each voxel is used by the algorithm to iteratively compute the next state until convergence. The possible voxel state codes and associated meanings are summarized in Table 1, seen below.

Table 1. The location, relative to the true image, denoted by each voxel state.

Voxel state    Location
-2             Not in image
-1             Outside border of image (might be part of image)
 1             Inside image
 2             Inside border of image (might not be part of image)

The following variables define the sets and accessors for these sets that are used in the level set segmentation algorithm pseudocode:

• N(x): the set of voxels adjacent to x, excluding those at positions diagonal to x
• L: label shapes
• V: all voxels in the input image
• I: intensities of the input image
• I(x): intensity of voxel x
• R: valid intensities per label

The level set segmentation algorithm implemented is divided into three stages. The first initializes the voxel states based on the intensity of the initial image and the position of the label curve. The second stage updates the state of each voxel based on adjacent voxel states. This stage runs iteratively until convergence. The final stage assigns final voxel codes that are consistent with the converged voxel states and the meaning of each code.

B. Stages

Stage 1.
    K = ∅
    For λ ∈ L:
        For v ∈ V:
            If I(v) ∈ R(λ):
                If v ∈ λ: I(v) = 1
                Else: I(v) = -1, K = K ∪ {v}
            Else:
                If v ∈ λ: I(v) = 2, K = K ∪ {v}
                Else: I(v) = -2

Stage 2.
    P = ∅
    While K ≠ ∅ ∧ P ≠ I:
        P = I
        For v ∈ K:
            prevVI = ∅
            While I(v) ≠ prevVI:
                prevVI = I(v)
                If I(v) == -1 ∧ ∃ z ∈ N(v) : I(z) == 1: I(v) = 1
                Else if I(v) == 2 ∧ ∃ z ∈ N(v) : I(z) == -2: I(v) = -2
            If I(v) ∈ {-2, 1}: K = K \ {v}

Stage 3.
    For v ∈ V:
        If I(v) ∈ {1, 2}:
            I(v) = 2 if ∃ z ∈ N(v) : I(z) ∈ {-1, -2}, else I(v) = 1
        Else:
            I(v) = -2 if ∃ z ∈ N(v) : I(z) ∈ {-1, -2}, else I(v) = -1
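Stages 1 and 2 can be illustrated with a small sequential reference, given here as a minimal sketch rather than the GPU implementation evaluated in this paper. It assumes a single label, an inclusive intensity range [lo, hi], intensities stored in a dictionary keyed by (x, y, z), and in-place updates in place of the parallel sweeps; the names segment, neighbors, in_label, lo and hi are hypothetical. Stage 3 is omitted.

```python
from itertools import product


def neighbors(v, shape):
    """N(v): the 6-connected neighbors of v (no diagonals), clipped to the image."""
    for axis in range(3):
        for step in (-1, 1):
            z = list(v)
            z[axis] += step
            if 0 <= z[axis] < shape[axis]:
                yield tuple(z)


def segment(intensity, shape, in_label, lo, hi):
    """Run stages 1 and 2 for one label and return the state code of every voxel."""
    # Stage 1: initialize states from the intensity range and the label shape.
    state, active = {}, set()
    for v in product(*(range(n) for n in shape)):
        if lo <= intensity[v] <= hi:
            state[v] = 1 if in_label(v) else -1
        else:
            state[v] = 2 if in_label(v) else -2
        if state[v] in (-1, 2):
            active.add(v)  # undecided voxels form the working set K

    # Stage 2: sweep the working set until it is empty or nothing changes.
    changed = True
    while active and changed:
        changed = False
        for v in list(active):
            if state[v] == -1 and any(state[z] == 1 for z in neighbors(v, shape)):
                state[v] = 1  # grow outward into in-range voxels
                changed = True
            elif state[v] == 2 and any(state[z] == -2 for z in neighbors(v, shape)):
                state[v] = -2  # shrink inward past out-of-range voxels
                changed = True
            if state[v] in (-2, 1):
                active.discard(v)
    return state
```

On a 5 x 5 x 5 image containing a 3 x 3 x 3 cube of in-range voxels, this sketch reaches the same 27 inside voxels whether the initial label is the single center voxel (outward evolution) or a 4 x 4 x 4 cube enclosing the target (inward evolution).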

III. IMPLEMENTATION

Four different implementations of the algorithm are built: a baseline, whose kernel and total execution times serve as the reference, and three optimizations whose performance is compared against it. The target device for all versions is an NVIDIA Tesla K20 GPU. The Tesla K20 is a mid-range workstation GPU [5]. It is representative of the devices that medical image segmentation systems would typically run on. All implementations share a common pipeline structure with stages in the pipeline matching the stages in the algorithm pseudocode detailed in Section II. All implementations also use shared memory in an identical fashion. The label image representation, number of voxels processed per thread and alignment of global memory are adjusted for each implementation. The baseline implementation launches one thread per voxel, uses a dense representation of label curves and does not align global memory. The three optimizations process multiple voxels per thread, use parametric label definitions and align global memory.

A. Kernel Pipeline

A parent kernel is launched from the host CPU, sending a single one-dimensional block of data that is equal in size to the number of label curves supplied by the user. The parent kernel is responsible for launching three other kernels, which perform the three stages in the algorithm. By leveraging NVIDIA's dynamic parallelism technology to instantiate child kernels from the parent kernel, unnecessary memory transfers between the CPU and GPU can be eliminated during successive runs of stage 2, which has been shown to increase the performance of the algorithm [3]. Since each thread in the parent kernel operates on independent sections of the working and output sets, it is safe to evolve all label curves in parallel. All child kernels are instantiated with 3-D grid and block dimensions. The block size is user-specified and constant across each dimension. Assuming a block dimension size of K and a number of voxels processed per thread V, the number of blocks per grid dimension for a grid dimension with x voxels is computed according to Equation 2, seen below:

$$f(x) = 1 + \frac{x - 1}{K \cdot V} \tag{2}$$
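As a worked instance of Equation 2, the following sketch computes the grid size along one dimension. It assumes that the division is integer (floor) division, which the equation does not state explicitly; the function names are hypothetical.

```python
def blocks_per_grid_dim(x, k=8, v=1):
    """Blocks needed along one grid dimension that spans x voxels (Equation 2)."""
    if x < 1 or k < 1 or v < 1:
        raise ValueError("x, k and v must be positive")
    # Assumption: the division in Equation 2 is integer (floor) division.
    return 1 + (x - 1) // (k * v)


def covers(x, k=8, v=1):
    """True when the launched blocks cover every voxel along the dimension."""
    return blocks_per_grid_dim(x, k, v) * k * v >= x


if __name__ == "__main__":
    for v in (1, 2):
        print(f"x=1024, K=8, V={v}: {blocks_per_grid_dim(1024, 8, v)} blocks")
```

With K = 8, a 1024-voxel dimension maps to 128 blocks for V = 1 and 64 blocks for V = 2. Under the floor-division reading, the formula is equivalent to rounding x / (K * V) up to the next integer, so the last block may be only partially filled.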

K = 8 is used for these experiments. A block size of 8 x 8 x 8 is used because it is the configuration with power-of-two side lengths whose number of threads per block is closest to the optimal number of threads per block found for the 2-D implementation, which is 1024 [3]. V = 1 and V = 2 are evaluated in these experiments.

The second iteration condition in stage 2 requires that at least one state of the evolving curve was modified in the last iteration. To evaluate it, parallel writes are issued to global memory addresses shared among all threads for the given label. An array of integers of size equal to the number of label curves is allocated in global memory and initialized to 0 prior to starting each pass through the outer loop in stage 2. A flag located in shared memory is used to determine if any thread in a block updated a voxel as part of curve evolution during the current pass through the outer loop. If at least one thread in the current block updated the state of a voxel in the evolving curve, the thread in the lower front-left corner of the block will write a 1 to the global memory address for the current label, indicating that the current label curve needs to continue to evolve.

Filtering the threads that perform the evolution logic, shown in stage 2, is performed at a block granularity rather than a thread granularity. An array of integer flags is used to indicate whether blocks have converged. Block granularity is used because filtering at a thread granularity would have required much more memory for the flags and would have resulted in warp divergence, negating the potential performance improvement gained by reducing the number of scheduled threads that did not modify the evolving curves. A block has converged when none of its voxels is in state -1 or 2. The array is allocated in global memory, initialized in stage 1, and updated after each execution of stage 2. A variable in shared memory, indicating whether at least one voxel in the current block is in state -1 or 2, is used to update the global block indicator. The thread operating in the lower front-left corner of each block updates the global block indicator with the shared value computed for the current block.
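The block-granularity filter can be pictured with a short host-side sketch. It is an illustration only, assuming that a block still needs work exactly when it holds at least one voxel in state -1 or 2, and that a block spans k voxels along each axis; active_blocks and flag_count are hypothetical names, and the shared-memory reduction itself is not modeled.

```python
def active_blocks(state, k):
    """Coordinates of blocks that still hold a voxel in state -1 or 2.

    state maps (x, y, z) voxel coordinates to state codes; k is the number of
    voxels a block spans along each axis (an assumption of this sketch).
    """
    return {tuple(c // k for c in v) for v, code in state.items() if code in (-1, 2)}


def flag_count(image_dim, k):
    """Number of per-block flags needed for a cubic image of side image_dim."""
    per_axis = 1 + (image_dim - 1) // k
    return per_axis ** 3


if __name__ == "__main__":
    # One flag per 8 x 8 x 8 block instead of one flag per voxel.
    print(flag_count(1024, 8), "block flags versus", 1024 ** 3, "voxels")
```

For a 1024^3 image and blocks spanning 8 voxels per axis, this gives 128^3 block flags, compared with 1024^3 flags for a per-voxel filter.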

B. Shared Memory and Block Borders

Shared memory is used in all child kernels to minimize the number of global memory reads. Since the second and third stages in the algorithm require that each thread examines adjacent voxels, threads within a block will collectively read the same location in global memory at least nine times. Shared memory is located on the streaming multiprocessor where a block is scheduled to run, whereas global memory is located off the chip. Higher memory bandwidth can be achieved by using shared memory to avoid redundant reads to global memory [6]. A device function, called by the kernels for the second and third stages, loads data equal in size to the input image into shared memory prior to executing the algorithm defined in each sub-kernel. Each thread fetches the data corresponding to the thread's location in 3-D space into shared memory. If the thread is on the border of the block, the thread also fetches the voxel one sequential location across the boundary nearest it. If the number of voxels processed per thread is one, the thread simply loads the voxel one unit across the border. If there are two voxels processed in each dimension, the thread loads the voxel one unit across the border if it is on the lower boundary and loads the voxel two units from its position (one unit across the upper border) otherwise. In the version which processes eight voxels per thread, each thread loads a 2 x 2 x 2 cube of voxels into shared memory. The coordinates of the thread in 3-D space, multiplied by 2, define the lower front-left corner of the cube.
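The border rule described above can be sketched along a single axis. The function below is a hypothetical illustration, not the device function used in this work: it assumes a block of k threads, v voxels per thread along the axis, and a block whose first voxel sits at global coordinate origin, and it ignores clipping at the image boundary.

```python
def coords_loaded(t, k, v, origin):
    """Global coordinates along one axis that thread t of a k-thread block loads."""
    first = origin + t * v                  # thread index times v gives its first voxel
    coords = [first + i for i in range(v)]  # the thread's own voxels
    if t == 0:
        coords.insert(0, origin - 1)        # one unit across the lower border
    if t == k - 1:
        coords.append(origin + k * v)       # one unit across the upper border
    return coords


def tile_coords(k, v, origin):
    """All coordinates that the block loads along the axis, in ascending order."""
    return sorted(c for t in range(k) for c in coords_loaded(t, k, v, origin))


if __name__ == "__main__":
    print(tile_coords(8, 1, 8))   # 10 coordinates: 8 own voxels plus 2 border voxels
    print(tile_coords(8, 2, 16))  # 18 coordinates: 16 own voxels plus 2 border voxels
```

Under these assumptions, each coordinate of the tile is loaded exactly once, and the tile spans k * v + 2 voxels per axis. For v = 2, the upper border voxel lies two units above the first voxel of the last thread, matching the description above.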

C. Parametric Labels

To reduce the overhead incurred by copying data to the GPU, parametrically defined label curves are evaluated. The initial 3-D implementation, based on the 2-D implementation by Gutierrez et al. [3], operates on label curves defined in the same format and resolution as the input intensity image. Although this implementation gives users the ability to define more complex labels, which have the potential to converge in fewer iterations, the enhanced flexibility comes at a significant performance cost when accounting for time spent copying data to the GPU. An alternative parametric definition of labels is evaluated to reduce the amount of data copied to the GPU. The second iteration of the image segmentation system supports parametric label spheres. Any point inside or on the border of a given label sphere is defined to be within the label in stage 1 of the algorithm. The amount of data copied to the GPU per label is 128 bits, consisting of four 32-bit floats which define the origin and the radius. For an intensity image that is at least 3 x 3 x 3 voxels in size, the parametrically defined label sphere requires less data than the original label format. When using parametrically defined label spheres, additional overhead is incurred in performing the square root operation to compute the distance between each voxel and the center of each label sphere. Extending the implementation to support other parametrically defined shapes as required by different use cases is a straightforward extension of the existing implementation, although performance characteristics may not be the same for other shapes.
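A parametric sphere label and its membership test can be sketched as follows. The packing layout and the names pack_sphere and in_label_sphere are assumptions made for illustration. The membership test compares squared distances, which is a substitute for the square-root distance computation described above and is not the method evaluated in this paper.

```python
import struct


def pack_sphere(cx, cy, cz, radius):
    """Pack a label sphere as four 32-bit floats (origin and radius): 128 bits."""
    return struct.pack("<4f", cx, cy, cz, radius)


def in_label_sphere(v, center, radius):
    """True when voxel v lies inside or on the border of the label sphere.

    Squared distances are compared so that no square root is needed; this is a
    swapped-in variant, not the distance computation described in the text.
    """
    dist_sq = sum((a - b) ** 2 for a, b in zip(v, center))
    return dist_sq <= radius * radius


if __name__ == "__main__":
    print(len(pack_sphere(512.0, 512.0, 512.0, 128.0)) * 8, "bits per label")
    print(in_label_sphere((3, 4, 0), (0, 0, 0), 5))  # a point on the border
```

Because both sides of the comparison are non-negative, the squared test selects the same voxels as comparing the Euclidean distance against the radius.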

D. Eight Voxels Per Thread

Performance of the 2-D implementation improves when 2 x 2 pixel blocks are processed by each thread [3]. If the ratio between threads and streaming multiprocessors (SMs) exceeds 1, unnecessary overhead is incurred when mapping/scheduling the resulting lightweight threads to SMs. Although optimizing the ratio between threads and SMs is system and data set specific, the method for implementing this optimization on the 3-D extension is detailed so that future users can tweak this implementation to best suit their needs. In stage 1, 2 x 2 x 2 chunks of data, which are contiguous in 3-D space, are processed on each thread. The 2 x 2 x 2 sets of elements, separated from each other by one block in each dimension, are processed in kernel stages 2 and 3. Blocking is used to avoid bank conflicts when operating on data in shared memory. When writing data back to global memory in stages 2 and 3, each thread writes a 2 x 2 x 2 chunk that is contiguous in the 3-D space.

E. Aligned Global Memory

It is possible to improve global memory performance by coalescing the reads issued by threads in each warp into one memory transaction. Sequential accesses to global memory will be coalesced via 32, 64, or 128-byte transactions, each aligned to its own size [7]. Threads are assigned to warps sequentially along the x-dimension. To help ensure that reads issued by threads in a warp are likely to be coalesced, allocations are padded along the x-dimension such that the size of the x-dimension and the size of the xy-plane are multiples of one of the memory transaction sizes listed above. CUDA provides a function that queries the supported memory transaction sizes for the current device and allocates a padded chunk of memory given the dimensions of the 3-D image, as provided by the user [8]. The input, working and output images are padded along the x-axis, using the best value, prior to being copied to global memory. The parent kernel continues to instantiate sub-kernels across each dimension of the input image. Sub-kernels are modified to rely on an additional set of width parameters for each image, instead of using the dimensions of the grid. No additional computational overhead is introduced by aligning global memory. Three additional kernel parameters are required, resulting in 192 additional bits of data to be copied to the GPU per execution. The maximum number of wasted bytes w due to padded allocations in an input image can be computed using Equation 3, seen below:

$$w = 127 \cdot h \cdot d \tag{3}$$

As shown in Equation 3, the wasted bytes, w, is a function of the product of image height, h, and depth, d. 127 bytes are wasted per row of voxels if the GPU has the maximum global memory alignment of 128 bytes and the image width modulo 128 is one.
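The padding arithmetic behind Equation 3 can be reproduced with a few lines. This is a sketch under the assumption that each row is rounded up to the next multiple of the alignment and that the row width is given in bytes; padded_width and wasted_bytes are hypothetical helpers, and the CUDA allocation call itself is not modeled.

```python
def padded_width(width_bytes, alignment=128):
    """Row width in bytes after rounding up to the next multiple of alignment."""
    # -(-a // b) is integer ceiling division.
    return -(-width_bytes // alignment) * alignment


def wasted_bytes(width_bytes, height, depth, alignment=128):
    """Bytes of padding in an image with the given row width, height and depth."""
    return (padded_width(width_bytes, alignment) - width_bytes) * height * depth


if __name__ == "__main__":
    # Worst case of Equation 3: the row width modulo 128 is one.
    print(wasted_bytes(129, height=4, depth=3))  # 127 * 4 * 3
```

A row width of 129 bytes is padded to 256 bytes, wasting 127 bytes per row, which is the worst case that Equation 3 describes. A row width that is already a multiple of 128 wastes nothing.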

IV. RESULTS

A. Benchmarks

All benchmarks are run on synthetic spheres that are of uniform intensity. Four implementations are evaluated, one for each optimization. Each implementation is tested on images of identical size in each dimension. The radius of the synthetic sphere is equal to one fourth of the size of a single dimension of the input image. Images of size 128^3, 256^3, 512^3 and 1024^3 voxels are evaluated. For each synthetic input sphere, label spheres of both half the radius and twice the radius of the input sphere are used as the initial curve, enabling the measurement of the difference in performance between inward and outward evolutions of the initial curve. For each unique tuple of version, image dimension and label radius, twenty benchmarks are generated and processed. The execution performance results for the largest image are summarized below in Table 2.

Table 2. Kernel execution time and total time to segment the input image, as a function of the input image size.

Version        Image dim    Label radius    Avg. kernel exec    Avg. total exec
               (voxels)     (voxels)        time (ms)           time (ms)
8vx            1024         128             508.38              1440.839
8vx-aligned    1024         128             476.178             1354.368
basic          1024         128             5102.493            6298.697
param-labels   1024         128             5313.346            6293.904
8vx            1024         512             929.29              1852.648
8vx-aligned    1024         512             882.841             1742.711
basic          1024         512             5913.594            7147.313
param-labels   1024         512             6087.566            7049.704

The first four rows in Table 2 indicate both the kernel and total performance of the proposed system when the initial curve specified by a user is half the size of the target object. The second four entries represent the corresponding performance data when the initial curve fully encloses the target object and is twice as large. The kernel execution time is the time elapsed between the invocation of the parent kernel from the CPU and the completion of this kernel. The total execution time includes the time required to create GPU resources, transfer data to and from the GPU and free the allocated resources. Since radiologists and other users of the image segmentation system could specify the initial curve such that it is either larger or smaller than the desired region, the proposed system should efficiently identify the desired region in both cases. The kernel performance for the outward evolution of the initial label sphere when segmenting the largest synthetic sphere is summarized in Figure 1, seen below.

[Plot: Log Kernel Exec Time (ms) versus Log Voxels; series: 8vx, 8vx-aligned, basic, param-labels.]

Figure 1. Average kernel execution time (log10) as a function of voxel count (log10) in input image for each implementation when the label sphere has a radius equal to half that of the input sphere.

The corresponding kernel performance for inward evolution when segmenting the largest synthetic sphere is summarized in Figure 2, seen below.

[Plot: Log Kernel Exec Time (ms) versus Log Voxels; series: 8vx, 8vx-aligned, basic, param-labels.]

Figure 2. Average log kernel execution time as a function of log voxel count in input image for each implementation when the label sphere radius is twice that of the input sphere.

Providing the algorithm with an initial curve that is smaller than the target and entirely contained by the target sphere results in faster segmentation, versus using an initial curve that is larger than the target sphere. Segmentation using the 8vx-aligned implementation and a label sphere that is smaller than the target sphere takes 22% less time than segmentation when the initial label sphere is larger. The reason for this behavior is that the number of non-converged blocks per pass through stage 2 of the algorithm is much larger when evolving inward, as compared to evolving outward, due to the larger number of voxels on the boundary of the evolving curve. The voxels on the boundary change state frequently, due primarily to being outside of the intensity range provided, requiring each block with at least one voxel on the boundary of the evolving curve to be processed in full. The logic that examines the state of the adjacent voxels per thread in stage 2 is skipped for converged blocks, resulting in the observed performance improvement when the initial curve is smaller than the target and evolves outward. The ratio between inward and outward evolution performance is larger for the 8vx and 8vx-aligned versions because they process eight times as many voxels per block compared to the other implementations. Processing more voxels per block results in an increase in the number of voxels that are processed unnecessarily on the border of the evolving shape due to filtering converged regions at block granularity. To compare the overall performance of each version, the total time required to allocate GPU memory, copy data to the GPU, perform the algorithm, copy the results back to the CPU and free up used GPU memory is summarized in Figure 3, seen below.

[Plot: Log Average Total Exec Time (ms) versus Log Voxels; series: 8vx, 8vx-aligned, basic, param-labels.]

Figure 3. Total amount of time to segment synthetic spheres of a given size averaged across inward and outward total evolution times.

The basic and parametric label implementations perform comparably, with the parametric label implementation taking 0.99 times that of the basic implementation for the largest image. Although the amount of time taken to copy the data to the GPU is greatly reduced in the parametric version, the cost of computing the distance between the sphere center and each voxel causes the kernel performance to deteriorate sufficiently to virtually offset the gains achieved in the reduced copy overhead. The 8vx and 8vx-aligned versions perform comparably as well. The version using aligned memory took 0.94 times the amount of time taken by the 8vx version. The kernel execution time was improved by using aligned memory. However, the performance of the memory allocation suffers because of the additional overhead incurred when padding the data to ensure alignment in global memory. Further overhead is incurred when transferring the padded data between the CPU and GPU due to the increase in space required to store the padded data. For all images benchmarked, the additional overhead resulting from padding memory is smaller than the associated kernel performance improvement.

Figure 4. Evolution of spherical label curve on benchmark brain MRI image.
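The ratios quoted in this section can be re-derived from the Table 2 measurements. The sketch below copies the eight rows of Table 2 and assumes that each "times that of" figure is the ratio of total execution times averaged over the two label radii; the helper names are hypothetical.

```python
# (version, label radius) -> (avg kernel ms, avg total ms), copied from Table 2.
TABLE_2 = {
    ("8vx", 128): (508.38, 1440.839),
    ("8vx-aligned", 128): (476.178, 1354.368),
    ("basic", 128): (5102.493, 6298.697),
    ("param-labels", 128): (5313.346, 6293.904),
    ("8vx", 512): (929.29, 1852.648),
    ("8vx-aligned", 512): (882.841, 1742.711),
    ("basic", 512): (5913.594, 7147.313),
    ("param-labels", 512): (6087.566, 7049.704),
}


def mean_total(version):
    """Total execution time of a version, averaged over both label radii."""
    totals = [total for (ver, _), (_, total) in TABLE_2.items() if ver == version]
    return sum(totals) / len(totals)


def total_ratio(version, baseline):
    """Mean total time of version relative to baseline."""
    return mean_total(version) / mean_total(baseline)


def outward_saving(version):
    """Fraction of total time saved by the smaller (radius 128) label."""
    return 1 - TABLE_2[(version, 128)][1] / TABLE_2[(version, 512)][1]


if __name__ == "__main__":
    print(f"param-labels / basic: {total_ratio('param-labels', 'basic'):.2f}")
    print(f"8vx-aligned / 8vx:    {total_ratio('8vx-aligned', '8vx'):.2f}")
    print(f"8vx-aligned outward saving: {outward_saving('8vx-aligned'):.0%}")
```

Under this reading, the computed values round to the 0.99, 0.94 and 22% figures reported above. The kernel-only times in Table 2 additionally show the 8vx-aligned kernel running roughly an order of magnitude faster than the basic kernel for the smaller label.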

B. Accuracy

To verify the accuracy of all implementations, the segmented image regions are checked by comparing them to the data in the original image. If the code is -2 or 1, the corresponding voxel in the original image must not be part of the sphere or must be inside the sphere, respectively. If the code is 2, the current voxel must be adjacent to at least one voxel that is inside of the sphere in the original image. It must also be located on the outermost shell of voxels in the original sphere. If the code is -1, the current voxel must be adjacent to at least one voxel that is not inside of the sphere in the original image. It must also be located on the shell of voxels one unit larger than the input sphere. The number of places that a code's meaning is inconsistent with the input image data is counted. The number of incorrectly segmented voxels, divided by the total number of voxels in the input image, is computed. The latter number is computed from the input and resulting image for each benchmark. The percentage of incorrectly segmented voxels is zero in all benchmarks, implying that the segmented results are completely correct for the synthetic inputs.
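The error metric can be sketched in simplified form. The function below is an assumption-laden illustration: it only checks whether each code places the voxel on the correct side of the ground-truth region (codes 1 and 2 inside, codes -1 and -2 outside) and does not verify the adjacency and shell conditions described above; error_rate and inside_truth are hypothetical names.

```python
def error_rate(state, inside_truth):
    """Fraction of voxels whose state code disagrees with ground-truth membership.

    state maps voxel coordinates to codes; inside_truth(v) returns True when
    voxel v belongs to the target region. Simplification: border codes are only
    checked for being on the correct side, not for adjacency.
    """
    wrong = sum(1 for v, code in state.items() if (code in (1, 2)) != inside_truth(v))
    return wrong / len(state)


if __name__ == "__main__":
    codes = {(0, 0, 0): 1, (1, 0, 0): 2, (2, 0, 0): -1, (3, 0, 0): -2}
    print(error_rate(codes, lambda v: v[0] <= 1))  # no disagreement
```

A fully correct segmentation yields an error rate of zero under this check, which is the necessary side of the stricter test applied in the benchmarks.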

C. Performance Comparison

The most efficient existing GPU-accelerated 3-D image segmentation system was developed by Roberts et al. [2]. They benchmark their application on an MRI image of a brain of size 256^3 voxels. Their implementation takes 7 seconds to converge when run on an NVIDIA GTX 280 GPU. The implementation in this paper is tested on an MRI image of a brain with size 274 x 384 x 384 voxels using a label curve with a radius of fifty voxels located in the center of the image. The average time required to segment the MRI image is 99.56 ms. The results of the segmentation are found in Figure 4, seen above. The benchmark is obtained from an average of twenty runs using the 8vx-aligned version of the code. The operating system targeted in the implementation by Roberts et al. is Windows, while the implementation in this paper targets Linux. Benchmarking the previous 3-D implementation [2] on the hardware and software stack used to obtain the benchmarks in this paper would require porting it to Linux, which is left as future work.

A service that aggregates GPU benchmarks called User Benchmark suggests that an NVIDIA GTX 1080 is about 12.95 times more efficient at typical 3-D rendering tasks than a GTX 280. This data is derived from 93,288 and 511 samples for the GPUs, respectively [9]. Benchmarks for the GTX 1080 are leveraged for comparison in place of those for the Tesla K20 because benchmarks for the Tesla K20 GPU are not available on this service. The GTX 1080 provides a more recent GPU architecture, and includes a higher clock speed, higher memory bandwidth, and faster memory versus the Tesla K20 GPU [5, 10]. To compare the previous implementation [2] to the implementation presented in this paper, the benchmarks of the GTX 1080 and those of the GTX 280 are used to generate a GPU factor by which to convert the performance of the previous implementation to the equivalent performance of the same implementation when run on a Tesla K20. The number of voxels processed is equivalent to the total number of voxels in the input image. The total time is the time required to segment the image, including overhead incurred when setting up and freeing GPU resources. The adjusted result is computed according to Equation 4, seen below.

$$\text{AdjustedResult} = \frac{\text{VoxelsProcessed} \cdot \text{GPUFactor}}{\text{TotalTime}} \tag{4}$$

These results are detailed in Table 3, seen below.

Table 3. Performance comparison between existing algorithms.

Version     Voxels processed    Total time (s)    GPU factor    Adjusted result (voxels/s)
Roberts     1.677 x 10^7        7.0               12.95         3.104 x 10^7
Proposed    4.04 x 10^7         0.1               1             4.04 x 10^8
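The adjusted results in Table 3 follow from Equation 4, read as voxels processed times GPU factor, divided by total time. The sketch below reproduces them from the image sizes, times and GPU factor reported in this section; adjusted_throughput is a hypothetical helper name.

```python
def adjusted_throughput(voxels, total_time_s, gpu_factor=1.0):
    """Equation 4: voxels per second, scaled by the relative GPU factor."""
    return voxels * gpu_factor / total_time_s


# Roberts et al.: 256^3 voxels in 7.0 s on a GTX 280, scaled by 12.95.
roberts = adjusted_throughput(256 ** 3, 7.0, gpu_factor=12.95)
# Proposed: 274 x 384 x 384 voxels in 0.1 s (rounded from 99.56 ms), factor 1.
proposed = adjusted_throughput(274 * 384 * 384, 0.1)

if __name__ == "__main__":
    print(f"Roberts:  {roberts:.3e} voxels/s")
    print(f"Proposed: {proposed:.3e} voxels/s")
    print(f"Ratio:    {proposed / roberts:.1f}")
```

The ratio of the two adjusted results is approximately 13, which is the improvement factor discussed below.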

Despite the generous handicap of assuming that a Tesla K20 is as fast as a GTX 1080, the proposed implementation is able to process voxels at 13 times the rate of the previous 3-D implementation [2]. Obtaining benchmarks of the previous implementation on Linux using a Tesla K20 would reveal an even larger performance improvement, the size of which is to be determined in future experiments. Because the 3-D algorithm requires the examination of more adjacent locations per step and operates on larger data, it is expected that the 3-D algorithm would have a lower per-unit throughput than that of the 2-D implementation. However, the 3-D implementation processes 4.04 x 10^8 voxels/s, while the previous 2-D implementation achieved 2.6 x 10^8 pixels/s [3]. The previous 2-D implementation was benchmarked on the same hardware and software stack leveraged in this implementation.

V. CONCLUSION

Previous 3-D image segmentation implementations are relatively efficient. However, the 7 seconds required to segment a typical MRI image of a brain using existing implementations is longer than the 4 seconds radiologists have to analyze a single image [1]. The implementation presented in this paper is over 13 times more efficient than the best previous implementation. The efficiency gained as a result of the system detailed in this paper will enable radiologists to segment incoming images well within the allotted time per analysis, since the proposed system only requires 0.1 s to segment a typical MRI image of a brain. The performance of the proposed system also provides near-instantaneous feedback, which will enable radiologists to quickly refine the input parameters to the level set segmentation algorithm. As a result, they will obtain higher quality refined images to guide their analyses. Higher quality image segmentations in less time will increase the productivity and accuracy of radiologists. Providing radiologists with improved tools to allow efficient analysis of medical images without sacrificing their accuracy will benefit patients around the world who depend on medical imaging for diagnosis and monitoring, including patients with cancer, heart disease and osteoporosis among many others.

The goal of surpassing or maintaining the performance of the 2-D implementation is achieved. The throughput of the proposed system is over 1.5 times larger than the throughput achieved by the previous implementation on typical images [3]. The remaining work required to deliver this implementation as a pluggable stage in existing medical image processing pipelines consists of defining an API that client applications can use to specify labels, set initial data and read results. Future work will focus on making the system production-ready, in addition to porting the previous implementation [2] to Linux and benchmarking it.

ACKNOWLEDGMENT

The author would like to thank Julian Gutierrez for introducing him to general purpose GPU computing, explaining in depth his 2-D implementation of the algorithm, suggesting the best ways to optimize the 3-D implementation and for providing feedback on this paper throughout all revisions. The author would also like to thank Dr. Kaeli for editing this paper and for providing the resources required to conduct this research.

Ben Trapani received a B.S. degree in Computer Science from Northeastern University in 2018. He received an award for Outstanding Student Research in Computer Science at Northeastern University's RISE 2017 event. After graduating in the spring of 2018, he will be joining Microsoft as a software engineer on the AI and Ink team. He can be contacted at ben.trapani1995@gmail.com.

Julian Gutierrez received a B.S. degree in Electrical Engineering from the University of Costa Rica in 2012. He was a Structural Design Engineer at Intel Corp. in Heredia, Costa Rica and a lecturer at the University of Costa Rica between 2011 and 2015. He received an MSc. degree in Electrical and Computer Engineering from Northeastern University in 2017. He joined the Northeastern University Computer Architecture Research Laboratory under Dr. David Kaeli in 2015 and has been conducting research in High Performance Computing with a focus on GPU applications and heterogeneous systems. He is currently a PhD Candidate in Computer Engineering at Northeastern University. He can be contacted at gutierrez.jul@husky.neu.edu.

David Kaeli received a BS and PhD in Electrical Engineering from Rutgers University, and an MS in Computer Engineering from Syracuse University. He is the Associate Dean of Undergraduate Programs in the College of Engineering and a Full Professor on the ECE faculty at Northeastern University, Boston, MA. He is the director of the Northeastern University Computer Architecture Research Laboratory (NUCAR). Prior to joining Northeastern in 1993, Kaeli spent 12 years at IBM, the last 7 at T.J. Watson Research Center, Yorktown Heights, NY. Dr. Kaeli has published over 300 critically reviewed publications, 7 books, and 11 patents. His research spans a range of areas from microarchitecture to back-end compilers and database systems. His current research topics include hardware security, graphics processors, virtualization, heterogeneous computing and multi-layer reliability. He serves as an Associate Editor of the IEEE Transactions on Parallel and Distributed Systems, ACM Transactions on Architecture and Code Optimization, and the Journal of Parallel and Distributed Computing. Dr. Kaeli is an IEEE Fellow and a Distinguished Scientist of the ACM. He can be contacted at kaeli@ece.neu.edu.

REFERENCES

[1] McDonald RJ, Schwartz KM, Eckel LJ, et al, "The effects of Changes in Utilization and Technological Advancements of Cross-Sectional Imaging on Radiologist Workload", Acad Radiol. Published online July 22, 2015.
[2] M. Roberts, J. Packer, M. Costa Sousa and J. Ross Mitchell, "A Work-Efficient GPU Algorithm for Level Set Segmentation", Eurographics Association, Saarbrucken, Germany, 2010.
[3] J. Gutierrez, F. Nina-Paravecino and D. Kaeli, "A Fast Level-Set Segmentation Algorithm for Image Processing Designed For Parallel Architectures", 2016 6th Workshop on Irregular Applications: Architecture and Algorithms, 2016.
[4] G. M. Morton, "A computer oriented geodetic data base and a new technique in file sequencing", International Business Machines Company New York, 1966.
[5] "Tesla K20 GPU Accelerator", nvidia.com, 2017. [Online]. Available: https://www.nvidia.com/content/PDF/kepler/Tesla-K20-Passive-BD-06455-001-v05.pdf. [Accessed: 31-Aug-2017].
[6] "Kepler Tuning Guide :: CUDA Toolkit Documentation", Docs.nvidia.com, 2017. [Online]. Available: http://docs.nvidia.com/cuda/kepler-tuning-guide/index.html#axzz4jigtZfbq. [Accessed: 11-Jun-2017].
[7] M. Harris, "How to Access Global Memory Efficiently in CUDA C/C++ Kernels", Parallel Forall, 2017. [Online]. Available: https://devblogs.nvidia.com/parallelforall/how-access-global-memory-efficiently-cuda-c-kernels/. [Accessed: 02-Jul-2017].
[8] "CUDA Runtime API :: CUDA Toolkit Documentation", Docs.nvidia.com, 2017. [Online]. Available: http://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__MEMORY.html#group__CUDART__MEMORY_1g188300e599ded65c925e79eab2a57347. [Accessed: 11-Jun-2017].
[9] "UserBenchmark: Nvidia GeForce GTX 280 vs 1080", Gpu.userbenchmark.com, 2017. [Online]. Available: http://gpu.userbenchmark.com/Compare/Nvidia-GTX-1080-vs-Nvidia-GeForce-GTX-280/3603vsm8413. [Accessed: 31-Aug-2017].
[10] "GeForce GTX 1080 Graphics Cards from NVIDIA GeForce", NVIDIA, 2017. [Online]. Available: https://www.nvidia.com/en-us/geforce/products/10series/geforce-gtx-1080/. [Accessed: 31-Aug-2017].

CLOSING REMARKS

Dear reader,

Thank you for your time. All of us at Embark hope you enjoyed the contents of Volume II. Many people put considerable effort into this volume, and we would like to recognize them here:

PRESIDENT: Lydia Zhou
VICE PRESIDENT OF ENGAGEMENT: Rebecca Leeper
EDITORS: Ani Semerdijian, Afam Nwokolo, Ashley Domogala, Amanda Barbour, Meenu Swaminathan, Kritika Singh, Geanna Flavetta, Rebecca Reals

Additionally, the following individuals have dedicated their time, making crucial contributions to the development and distribution of Embark as both a publication and an organization. With their guidance, Embark has grown into an established organization that encourages undergraduate research and the exploration of the applied sciences and engineering. Once again, the members of Embark would like to thank you for your time and commitment:

NOTABLE ALUMNI: Emma Kaeli and Mariana Mora
FACULTY ADVISOR: Dr. Kathryn Schulte-Grahame
ADMINISTRATIVE SUPPORT: Dean Richard Harris and Dean Rachelle Reisberg

Together with each of these individuals, we applaud the research efforts undertaken by Embark's contributors and by Northeastern University's undergraduate research community as a whole. We also encourage undergraduates in the applied sciences and engineering to get involved in and explore Northeastern's vast research community.

With gratitude,
Clark Luckhardt
VICE PRESIDENT OF PUBLICATIONS