Sonic Augmented Reality: Urban Sound Classification, Machine Learning, and Neural Networks as an Assistant for the Frequent Headphone User
Florian Wachter, Interaction Design and Technology, Chalmers University of Technology, Email: info@florianwachter.com

with cars, public transport, or other moving objects in an urban environment. The University of Maryland (USA) investigated the risk of injury to pedestrians wearing headphones or earphones. Between 2004 and 2011, there were 116 reports of death or injury of pedestrians wearing headphones or earphones. 68 percent of the victims were male, and 67 percent were under 30 years old. 89 percent of the reported cases took place in urban areas, and 55 percent involved trains. "Many cases (29 percent) mentioned that a warning was sounded before the crash." [16] Surrounding sounds are very important for humans to navigate and interact with their environment. On the other hand, people derive pleasure from listening through headphones, which makes this behavior hard to change. To address this, I focused on a solution that helps create awareness of the surroundings.
Abstract—This project considers the rising number of people who wear headphones or ear pods and put themselves at risk by isolating themselves from the sonic urban environment. It seeks to find out whether "Machine Learning" (ML) with "Urban Sound Classification" could be an effective tool for a solution that minimizes the risk of accidents involving urban moving objects such as cars, trams, and bicycles, without taking away the pleasure of listening to music. Through this study, I propose a new form of sound augmentation, which I call "Sonic Augmented Reality." Sonic Augmented Reality is the technology of combining "sonic interactions" (which convey information and meaning through interactive context) from our technology-driven environment with computer-generated sound information, as a second layer on top of our consumption of music.
Keywords—Frequent Headphone User, Urban Sound Classification, Machine Learning (ML), Neural Network (NN), Sonic Augmented Reality, Noise Pollution
I. INTRODUCTION
Access to music today (2018/2019) through streaming services such as Spotify and iTunes has changed how people consume music: anytime and anywhere. According to newly published data from Nielsen [39], the average music consumer uses 3.4 devices and listens to music for around 5 hours per day (35.7 hours per week). In parallel, the development of headphones and earbuds has grown, and the range of headphones on the market is overwhelming.
II. PROBLEM
With the advent of music streaming services and the oversupply of headphones, people seem to see an opportunity to use headphones more frequently in the urban environment. However, I discovered through interviews that people listen to music so frequently in the urban environment because of the underlying problem of "noise pollution." The frequent consumption of music in an urban environment is a subconscious reaction to noise pollution. Noise pollution affects both health and behavior, because it can cause hypertension, high stress levels, tinnitus, and more. This explains why people like to isolate themselves with headphones in public transport and elsewhere. The headphone industry has recognized this effect for quite a while and has added new features such as over-ear headphones, in-ear headphones (ear pods), and noise cancelling to their devices. However, this causes a new problem: accidents in the urban environment caused by the loss of awareness (see Fig. 1) of surrounding sonic interactions, as well as hearing damage. This can result in accidents such as those reported above.
Fig. 1. Frequent headphone use in the urban environment.
III. PROCESS
For the process, I used the structure of "Design Thinking" [3] (see Fig. 2) and focused on "Activity-Centered Design" [14] and "User Experience Design"; these are described in more detail in Section X, Theory.
A. Stakeholders
The stakeholders are people who use headphones in an urban environment, whether frequently or not. Software companies active in the field of portable music players (Spotify, iTunes, etc.) could be interested in a solution as a feature in their products. Headphone producers could also be interested in offering a solution as a feature in their devices.
Fig. 2. Design Thinking: A 5 stage process [3]
• Sound Classification with TensorFlow – IoT For All – Medium [32]
• TensorFlow Audio Recognition in 10 Minutes – DataFlair [35]
• Audio classification using Image classification techniques – Codementor [5]
• TensorFlow Sound Classification Tutorial – IoT For All [36]
• How to teach Neural Networks to detect everyday sounds – Skcript [13]
2) Sound Classification Examples
• GoogleCloudPlatform/cloudml-samples: Samples for Google Cloud Machine Learning Engine [10]
• Projects – Machine Learning, Neural Networks, & AI – opensource.google.com [23]
• Teachable Machine [34]
• Visualizing High-Dimensional Space by Daniel Smilkov, Fernanda Viégas, Martin Wattenberg & the Big Picture team at Google – Experiments with Google [41]
B. Anticipated Challenges
To make a computer-powered device (smartphone, smart headphones, smartwatch, etc.) aware of urban sounds, urban sound classification is needed. This could be achieved with a neural network focused on sound classification. Neural networks are created with machine learning and a large number of sound samples (training data). Creating neural networks requires a large amount of computing power (see Section IX).
Hardware I used in this project:
1) MacBook Pro Retina, 13-inch, Early 2015 [18]
• Memory: "16 GB 1867 MHz DDR3"
• Graphics card: "Intel Iris Graphics 6100 1536 MB"
2) QuietComfort 25 Acoustic Noise Cancelling headphones [25]
• Noise cancelling
• Inline mono microphone
• QC 25 inline remote and microphone cable
Later, I recognized that this equipment lacked the computing power needed to create models for the neural network (training times were too long). In addition, collecting sounds with a built-in microphone was inconvenient, and the recordings contained a lot of noise, which may have impaired the machine learning process. Regarding resources, time, and accuracy, a guide by DeviceHive on Medium [32] reports 1-2 hours of training time on an NVIDIA GTX 970 4 GB GPU. Because of its lower memory and processing speed, I estimated that my equipment would need about four times longer than in this example. At the same time, I also needed a large amount of training data, which took hours to collect. I decided to continue without creating a real NN sound model because, from all the tutorials I read and example scripts I tested, I assume that urban sound classification with machine learning is possible: perhaps not perfect yet, but possible in the near future.
C. Result
The result of the project is this research paper, which considers the limitations and complexity of machine learning and neural networks and describes the process of creating a possible solution that protects frequent headphone users from accidents in an urban environment. It also considers the new field of Sonic Augmented Reality.
IV. THE HUMAN HEARING
Human hearing encompasses how we perceive sound (called psychoacoustics: "Psycho-Acoustics is the science of sound perception, i.e., investigating the statistical relationships between acoustic stimuli and hearing sensations. Hearing is the sense by which sound is perceived." [45]) and how humans process sound signals in the brain.
A. How Humans Perceive Sound
The human ear is made of three major sections: the outer ear, the middle ear, and the inner ear (see Fig. 3).
This is also called the "peripheral auditory system" [45], where human hearing is primarily performed. The two important organs for hearing are the eardrum and the cochlea. The eardrum is located between the outer ear and the middle ear (see the middle part of Fig. 3: malleus, incus, stapes). It is a thin membrane that transmits sound from the air to the malleus, a small hammer-shaped bone (also called the hammer). The cochlea is located in the inner ear (see the right part of Fig. 3: cochlea). The cochlea converts sound vibrations into nerve impulses, which are then transmitted to the brain. In addition, the inner ear helps humans maintain their balance. Human sound perception depends on intensity, frequency, and overtones. The intensity of a sound is perceived as loudness. Humans can hear frequencies roughly in the range of 20 Hz to 20 kHz, and the frequency of a sound is perceived as pitch [33].
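The relation between frequency and pitch can be made concrete with a short sketch. It assumes equal-tempered tuning with A4 = 440 Hz and the MIDI note numbering convention (both are assumptions added for illustration, not part of the cited sources), and it uses the approximate 20 Hz to 20 kHz audible range stated above. Because pitch follows the logarithm of frequency, doubling a frequency raises the pitch by one octave (12 semitones).

```python
import math

AUDIBLE_MIN_HZ = 20.0      # approximate lower limit of human hearing
AUDIBLE_MAX_HZ = 20_000.0  # approximate upper limit of human hearing

NOTE_NAMES = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"]


def is_audible(freq_hz: float) -> bool:
    """Return True if the frequency lies in the approximate audible range."""
    return AUDIBLE_MIN_HZ <= freq_hz <= AUDIBLE_MAX_HZ


def nearest_note(freq_hz: float) -> str:
    """Map a frequency to the nearest equal-tempered note (assumes A4 = 440 Hz)."""
    if freq_hz <= 0:
        raise ValueError("frequency must be positive")
    # MIDI convention: note 69 is A4; each semitone is a factor of 2 ** (1/12).
    midi = round(69 + 12 * math.log2(freq_hz / 440.0))
    octave = midi // 12 - 1
    return f"{NOTE_NAMES[midi % 12]}{octave}"


if __name__ == "__main__":
    for f in (15.0, 261.63, 440.0, 880.0, 25_000.0):
        label = nearest_note(f) if is_audible(f) else "inaudible"
        print(f"{f:>9.2f} Hz -> {label}")
```

Here 440 Hz and 880 Hz map to A4 and A5 (one octave apart), while 15 Hz and 25 kHz fall outside the assumed audible range.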
According to "Audio Watermark: A Comprehensive Foundation Using MATLAB" [45], the hearing process works as follows:
"First, the sound wave travels through the auditory canal and causes the eardrum to vibrate. This vibration is transmitted via the ossicles (Malleus, Incus, Stapes) of the middle ear to the oval window at the cochlea inlet. The movement of the oval window forces the fluid in the cochlea to flow, which results in the vibration of the basilar membrane that lies along the spiral cochlea. This motion causes the hair cells on the basilar membrane to be stimulated and to generate neural responses carrying the acoustic information. Then, the neural impulses are sent to the central auditory system through the auditory nerves to be interpreted by the brain." [45] When the brain receives the neural impulses from the auditory system, it decodes them as soft or loud, high or low, and determines their location, resulting in a sensation or conscious perception [7]. "In exchange, the brain can alter how the cochlear functions. For example, in the general noise of a cocktail party, we are able to focus on a friendly conversation, even though our ears are getting stimulated by many different sources, which are often louder. Our brain has 'asked,' to prioritize the information coming from an interesting conversation!" [7]
Fig. 3. Structure of the peripheral auditory system [45]
B. Directional and Spatial Hearing
Hearing not only allows humans to detect sounds; it also allows them "to orient in space and to segregate multiple sound sources in complex acoustic environments" [15]. This helps humans communicate by improving the understanding of speech in competing noise. "In order to meet these demanding requirements, the human auditory system has developed highly sophisticated spatial-information-processing strategies." [15]
V. RELATED WORK
A. Audio Guides
An audio guide is a handheld device that provides information in text or recorded form to visitors in an exhibition context (museum, gallery, etc.). Audio guides make it possible to guide a visitor through an exhibition without a human guide, which makes guiding independent of the time and resources of an exhibition.
B. Acoustic Ecology (Ecoacoustics) by the World Soundscape Project
Acoustic ecology, ecoacoustics, or soundscape studies is the study of the relationship between humans and their environment through sound. The term was shaped by the "World Soundscape Project" (WSP), which was started in the late 1960s by R. Murray Schafer and his team (Barry Truax, Hildegard Westerkamp, Bruce Davies, Peter Huse) [9]. "R. Murray Schafer is a musician, composer and former Professor of Communication Studies at Simon Fraser University (SFU) in Burnaby, BC, Canada" [44]. Beyond this, according to "An Introduction to Acoustic Ecology" [44], acoustic ecology points out that our acoustic environment is "noise polluted" and that it could be seen as a musical composition created by us.
C. Soundscaping by the World Soundscape Project
As a solution to noise pollution, the WSP developed a set of "ear cleaning" exercises, including "soundwalks." A soundwalk was conducted as a meditative walk, "where the object is to maintain a high level of sonic awareness" [44]. Soundwalks involve soundscape recordings and the description of sonic features, such as background sounds as "keynotes" ("In analogy to music where a keynote identifies the fundamental tonality of a composition around which the music modulates" [44]) and foreground sounds as sound signals intended to attract attention [44]. These recordings and descriptions of sonic features are marked on a map in analogy to landmarks and are called "soundmarks." Examples of soundmarks are waterfalls, wind traps, and the sounds of traditional activities of people.
"The WSP's conception of the acoustic environment was not necessarily the natural soundscape of habitats and ecosystems that soundscape ecologists study, rather, it referred to the soundscape that we humans encounter in our everyday life, and how that soundscape affects our ability to connect to our community. The WSP's goal revolved around finding solutions for an 'ecologically balanced' soundscape, where the relationship between the human community and its sonic environment is 'in harmony.' Through active listening and 'ear cleaning' exercises, the WSP emphasized the responsibility that the listener has towards his or her soundscape." [28]
D. Sonic City by M. Johansson and S. Lerén
The "Sonic City" project (see Fig. 4) was a collaboration in 2002-2005 between the Interactive Institute and the Viktoria Institute, conducted as a master's thesis project by Magnus Johansson and Sara Lerén at the IT University Göteborg. Sonic City was an interactive wearable technology that created music in real time based on inputs from sensing bodily and local factors. "Sonic City generates a personal soundscape produced by a combination of physical movement, local activity, and urban ambiance. Encounters, events, architectures, (mis)behaviors – all become means of interacting with, or 'playing' the city. Our intention was to break out of traditional forms of music creation and listening, enhancing personal expression and encouraging new uses of the urban landscape." [31]
different parts of the city. To hear the sounds, you just have to click on them on the map. In addition, the sounds are grouped into categories (economic, political, social, or religious sounds). Beyond this, "A section of 'Historical' sounds consists of excerpts of books and diaries from the 11th century on, again divided into useful categories (sounds of street and town, communal living, river traffic, plague/war/disaster)." [4]
A set of tutorials and examples I tested:
1) Machine learning tutorials
Fig. 4. The Sonic City Project [31]
E. Microsoft Soundscape, Soundscaping App
"Microsoft Soundscape is a research project that explores the use of innovative audio-based technology to enable people, particularly those with blindness or low vision (see Fig. 5), to build a richer awareness of their surroundings, thus becoming more confident and empowered to get around." [19] According to Microsoft Soundscape – Microsoft Research, the app uses 3D audio cues to enrich ambient awareness and provide a new way to relate to the environment. The app is intended to help users build a mental map, which helps them navigate unfamiliar spaces.
Fig. 6. London general sound map recordings [4]
Fig. 5. A blind person using "Microsoft Soundscape" to orient in the city [19]
F. The London Sound Survey
The London Sound Survey includes a grid map of London (see Fig. 6). On the map, different soundscapes are marked in different parts of the city.
G. Listening to the Deep Ocean Environment
"The Laboratory of Applied Bioacoustics (LAB) of the Technical University of Catalonia (BarcelonaTech, UPC) is leading an international programme entitled 'Listen to the Deep Ocean Environment (LIDO)'" [17]. LIDO responds to the increasing level of offshore industrial development, which will lead to an increase in noise pollution in the oceans. This will cause "physical, physiological and behavioral effects on marine fauna in the area of activity: mammals, reptiles, fish and invertebrates can be affected at various levels depending on the distance to the sound source" [17]. The goal is to detect, classify, and localize acoustic sources in the ocean environment and to publish them online (see Fig. 7).
Fig. 7. Listen to the Deep (listentothedeep.net) [17]
H. Radio Aporee
Radio Aporee (see Fig. 8) is a platform: "it is a global soundmap dedicated to field recording, phonography and the art of listening. It connects sound recordings to its places of origin, in order to create a sonic cartography, publicly accessible as a collaborative project" [26]. According to radio aporee ::: maps - sounds of the world, the project is a valuable resource for art, education, research, and personal pleasure, because of its recordings from numerous urban, rural, and natural environments, which disclose their complex shape and sonic conditions, as well as the different perceptions, practices, and artistic perspectives of its many contributors.
Fig. 8. Radio Aporee (aporee.org) [26]
I. Radio Sounds from Saturn
"Radio Sounds from Saturn" is a 73-second listening experience compressed from 27 minutes of radio emissions from Saturn (a sonogram of the radio emission is shown in Fig. 9) [37].
Fig. 9. Saturn radio emissions (http://cassini.physics.uiowa.edu/spaceaudio/cassini/SKR1/) [37]
J. Sonic Augmented Reality
The word "sonic" describes things related to sound; an example of something sonic is the boom of an airplane breaking the speed of sound. Sonic is thus positioned at the intersection of the auditory display between humans and artifacts, services, environments, and applications, ranging from the critical functionality of an alarm to the artistic significance of musical creation. Designing such auditory displays is called Sonic Interaction Design [29]. In turn, keeping track of these auditory displays with devices that are disconnected from the sources that create them, and that interact event by event with geographical information services, will lead to "Sonic Augmented Reality." Sonic Augmented Reality adds a new layer of audio to any other audio experience that people consume through headphones, such as music or podcasts. This helps users build a richer awareness of their surroundings (see V-E), with the help of 3D audio cues that enrich ambient awareness (see V-E). Users of Sonic Augmented Reality gain a new way to relate to their environment (see V-E). It not only creates awareness but also maintains a high level of sonic awareness (see V-C). Beyond what human hearing can perceive (see IV-A), Sonic Augmented Reality can make perceivable the sounds that human hearing cannot sense (for example, the listening experience from Saturn, see V-I). Sonic Augmented Reality can also be adapted to how the soundscape affects our ability to connect to our community (see V-C). In addition, a sonic cartography can be created and made publicly accessible as a collaborative project (see V-H). To avoid stimuli from noise pollution, Sonic Augmented Reality can help create the desired result of a sonic environment in harmony (see V-C).
VI. URBAN SOUND CLASSIFICATION
Urban sound classification deals with sounds in an urban environment (traffic, humans, etc.). The goal is to train an artificial intelligence to understand scenes in an urban environment. The following subsections show some example work in the field of sound classification that includes urban sounds and that can be used as training data or already includes trained models.
A. Urban Sound Dataset
The Urban Sound Dataset website includes information and download links for the datasets and taxonomy presented by J. Salamon, C. Jacoby and J. P. Bello, "A Dataset and Taxonomy for Urban Sound Research", 22nd ACM International Conference on Multimedia, Orlando, USA, Nov. 2014. The resources on this site include the UrbanSound dataset, the UrbanSound8K dataset, the Urban Sound Taxonomy, and a list of models [40].
B. Google AudioSet
Google AudioSet includes an "ontology of 632 audio event classes and a collection of 2,084,320 human-labeled 10-second sound clips drawn from YouTube videos" [6]. The sounds are categorized into human sounds, animal sounds, musical instruments, genres, and common everyday environmental sounds. With this project, Google aims to provide a comprehensive vocabulary of sound events.
C. YouTube-8M Modules
The YouTube-8M Modules contain programming code for training and evaluating machine learning models over the YouTube-8M dataset. "The code gives an end-to-end working example for reading the dataset, training a TensorFlow model, and evaluating the performance of the model" [11]. The code examples on the website can be extended to train your own custom-defined models.
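To make the dataset subsection more concrete, the following sketch shows how labeled clips from the UrbanSound8K dataset mentioned above could be organized for training and evaluation. It assumes the dataset's metadata CSV layout (columns including slice_file_name, fold, classID, and class) and its predefined folds, which are intended for cross-validation instead of a random split. The sample rows are made up for illustration; in practice they would be read from the metadata file shipped with the dataset.

```python
import csv
import io
from collections import Counter

# Made-up rows in the assumed UrbanSound8K metadata format (not real data).
SAMPLE_CSV = """slice_file_name,fsID,start,end,salience,fold,classID,class
a.wav,1,0.0,4.0,1,1,1,car_horn
b.wav,2,0.0,4.0,2,1,8,siren
c.wav,3,0.0,3.5,1,2,1,car_horn
d.wav,4,0.0,4.0,1,2,5,engine_idling
e.wav,5,1.0,4.0,2,3,8,siren
"""


def load_metadata(text: str) -> list[dict]:
    """Parse metadata CSV text into a list of row dictionaries."""
    return list(csv.DictReader(io.StringIO(text)))


def split_by_fold(rows: list[dict], test_fold: int) -> tuple[list[dict], list[dict]]:
    """Use one predefined fold for testing and all other folds for training."""
    train = [r for r in rows if int(r["fold"]) != test_fold]
    test = [r for r in rows if int(r["fold"]) == test_fold]
    return train, test


if __name__ == "__main__":
    rows = load_metadata(SAMPLE_CSV)
    train, test = split_by_fold(rows, test_fold=1)
    print("train classes:", Counter(r["class"] for r in train))
    print("test files:", [r["slice_file_name"] for r in test])
```

Repeating the split once per fold and averaging the results would give a cross-validated estimate. Keeping the predefined folds, rather than reshuffling clips, is generally recommended so that excerpts cut from the same recording do not end up in both training and test data.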
VII. MACHINE LEARNING
Machine learning (ML) is a step toward making artificial intelligence work. ML is a programmatic and algorithmic approach to progressively improving performance on a specific task. ML algorithms are based on mathematical models that represent data as vectors. In ML, a machine is taught to solve problems by being shown large numbers of examples (training data) and inferring its own patterns from them. The goal is to enable computers to learn on their own.
Machine learning follows different concepts of learning:
• Supervised machine learning uses an algorithm to learn a mapping from input data to known outputs (labels).
• Unsupervised machine learning is used when you only have input data and no corresponding output variables. The goal is to model the structure of the data in order to understand it.
• Semi-supervised machine learning is used when you have labeled and unlabeled data together.
In the world of machine learning, there have been several breakthroughs, for example, speech recognition services such as Pocketsphinx, Google's Speech API, Alexa, and Siri. In contrast, in the field of urban sound classification, there are no comparably notable examples.
IX. LIMITATIONS OF URBAN SOUND CLASSIFICATION WITH MACHINE LEARNING
During this project, I faced a series of limitations in building a prototype, and in the end I did not get a prototype working. When I set up my first machine learning environment to train a model on the sound of a tram bell, I realized how much time it takes to train a model, which can be traced back to the low processing power of my computer. In addition, the resulting model was not accurate enough to recognize the same sound through my headphone microphone, which can also be traced back to the poor quality of that microphone. While investigating this, I found that my microphone produced too much background noise. The background noise from the microphone combined with the noise of an urban environment (white noise, people, traffic, other surroundings) leads to a "demixing problem": "Demixing is the problem of identifying multiple structured signals from a superimposed, undersampled, and noisy observation." [1] To solve this problem, I would need preprocessing filter algorithms to obtain a clean sound. Capturing sounds in an urban environment is also time-consuming: to record the sound of a tram bell, I had to wait for it. As a workaround, I used YouTube and the YouTube-8M Modules (see Section VI-C) to obtain training sounds instead of standing outside for hours waiting for a tram to ring its bell.
VIII. NEURAL NETWORKS (NN)
Neural networks (NN) mimic the operation of the human brain, with its nerves and neurons. "A key feature of neural networks is that they are programmed to 'learn' by sifting data repeatedly, looking for relationships to build mathematical models, and automatically correcting these models to refine them continuously. Also called neural net." [43] The following subsections present some examples of NNs that were important for this project.
X. THEORY
In this section, I give a brief overview of the theory used: Design Thinking, Activity-Centered Design, User Experience Design, Speculative Design, and Design Fiction.
A. Design Thinking
"Design Thinking" [3] follows a structure of different design phases (empathize, define, ideate, prototype, test, implement). Every stage has a wide set of design methods that help to get the answers needed to make progress in designing a product or service for a user group. Design Thinking was used in this project to structure the progress and to set milestones between the literature research and the more practical part of prototyping and testing.
A. Convolutional Neural Networks (CNN)
This type of NN is designed to work with images; it reduces an image to a vector. To create vectors from an image, it applies filters to the image. CNNs are common practice for camera input and, when audio is converted into an image such as a spectrogram, also for sound.
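The idea of reducing an image to a vector with filters can be sketched in a few lines. This is a toy, untrained illustration, not a real CNN library: the "image" is a small made-up spectrogram-like grid (rows as frequency bands, columns as time frames), the 2x2 filter weights are hypothetical values chosen by hand, and a trained network would learn many such filters from data. The steps are convolution, a ReLU non-linearity, and max pooling, after which the result is flattened into a feature vector.

```python
Image = list[list[float]]


def conv2d(image: Image, kernel: Image) -> Image:
    """Valid 2D cross-correlation (what CNN layers compute), no padding, stride 1."""
    kh, kw = len(kernel), len(kernel[0])
    out_h, out_w = len(image) - kh + 1, len(image[0]) - kw + 1
    return [
        [
            sum(image[i + a][j + b] * kernel[a][b] for a in range(kh) for b in range(kw))
            for j in range(out_w)
        ]
        for i in range(out_h)
    ]


def relu(feature_map: Image) -> Image:
    """Replace negative activations with zero."""
    return [[max(0.0, v) for v in row] for row in feature_map]


def max_pool(feature_map: Image, size: int = 2) -> Image:
    """Non-overlapping max pooling; trailing rows/columns that do not fit are dropped."""
    return [
        [
            max(feature_map[i + a][j + b] for a in range(size) for b in range(size))
            for j in range(0, len(feature_map[0]) - size + 1, size)
        ]
        for i in range(0, len(feature_map) - size + 1, size)
    ]


def flatten(feature_map: Image) -> list[float]:
    """Turn the pooled feature map into a single feature vector."""
    return [v for row in feature_map for v in row]


if __name__ == "__main__":
    # Made-up 5x5 "spectrogram": energy peaks in the middle time frames.
    spectrogram = [
        [0, 0, 1, 0, 0],
        [0, 1, 3, 1, 0],
        [0, 2, 5, 2, 0],
        [0, 1, 3, 1, 0],
        [0, 0, 1, 0, 0],
    ]
    onset_kernel = [[-1, 1], [-1, 1]]  # hypothetical filter reacting to rising energy over time
    vector = flatten(max_pool(relu(conv2d(spectrogram, onset_kernel))))
    print(vector)  # strong responses where energy rises, zero where it falls
```

The resulting four-element vector is what later layers of a network would classify; a real model stacks many learned filters and layers.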
B. Activity-Centered Design
"Activity-Centered Design" (ACD) is a part of "Human-Centered Design"; instead of focusing on the user's needs, it focuses on the user's activities. I chose ACD because the underlying problem of frequent headphone users is grounded in their activities in the urban environment. Instead of designing for user needs, the goal is to focus on the wide range of activities of users wearing headphones, a kind of hearing augmentation. From the users' perspective, the problem is not a need they want fixed; rather, it concerns an activity that needs adjustment without constraining the users' pleasure of listening to music anytime and anywhere.
B. Recurrent Neural Networks (RNN)
This type of NN is best practice for sequential data, for example sentences (sequences of words, as in speech) or real-world measurements (sequences of numbers). The input data is converted into vectors of numbers. RNNs are used for conversation, translation, and text generation.
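What makes an RNN suitable for sequences is that it carries a hidden state from one step to the next. The sketch below shows a single Elman-style recurrent cell running over a short sequence, for example one loudness value per audio frame. The weights are hypothetical, hand-picked numbers rather than trained ones, and the hidden size is kept at two so the arithmetic stays readable.

```python
import math

# Hypothetical, untrained weights for a recurrent cell with 1 input and 2 hidden units.
W_IN = [0.8, -0.5]                 # input -> hidden
W_REC = [[0.5, 0.0], [0.2, 0.3]]   # previous hidden -> hidden
BIAS = [0.0, 0.1]


def rnn_step(x: float, h: list[float]) -> list[float]:
    """One step: h_t = tanh(W_in * x_t + W_rec @ h_{t-1} + b)."""
    return [
        math.tanh(W_IN[i] * x + sum(W_REC[i][j] * h[j] for j in range(len(h))) + BIAS[i])
        for i in range(len(W_IN))
    ]


def encode_sequence(xs: list[float]) -> list[float]:
    """Run the cell over the whole sequence and return the final hidden state."""
    h = [0.0, 0.0]
    for x in xs:
        h = rnn_step(x, h)
    return h


if __name__ == "__main__":
    # Made-up loudness values per frame of a short recording.
    print(encode_sequence([0.1, 0.4, 0.9, 0.3]))
    # The same values in a different order give a different summary vector.
    print(encode_sequence([0.9, 0.4, 0.1, 0.3]))
```

Because the hidden state depends on everything seen so far, the order of the inputs changes the final vector, which is the property that makes RNNs useful for speech and other sequences.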
C. Artificial Neural Networks (ANN)
This type of NN is used, for example, in gaming, where it works for in-game tasks because it maps input to output.
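Mapping input to output, together with the learning loop quoted in Section VIII (sifting data repeatedly and correcting the model), can be illustrated with the smallest possible network: one neuron with a sigmoid output trained by gradient descent. The two input features (a normalized loudness and a "bell-likeness" score) and the four labeled examples are invented for illustration; a real urban sound classifier would use far more features, layers, and training data.

```python
import math
import random

# Invented training data: (loudness, bell_likeness) -> 1 = warn, 0 = ignore.
DATA = [
    ((0.9, 0.8), 1),
    ((0.7, 0.9), 1),
    ((0.2, 0.1), 0),
    ((0.4, 0.2), 0),
]


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def predict(weights: list[float], bias: float, x: tuple[float, float]) -> float:
    """Map an input vector to an output probability."""
    return sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)


def train(epochs: int = 2000, lr: float = 0.5, seed: int = 0) -> tuple[list[float], float]:
    """Repeatedly sift through the data and correct the weights (gradient descent on log loss)."""
    rng = random.Random(seed)
    weights = [rng.uniform(-0.1, 0.1) for _ in range(2)]
    bias = 0.0
    for _ in range(epochs):
        for x, y in DATA:
            error = predict(weights, bias, x) - y  # gradient of log loss w.r.t. the pre-activation
            weights = [w - lr * error * xi for w, xi in zip(weights, x)]
            bias -= lr * error
    return weights, bias


if __name__ == "__main__":
    w, b = train()
    for x, y in DATA:
        print(x, "label", y, "->", round(predict(w, b, x), 3))
```

After training, outputs above 0.5 would be read as "warn"; stacking many such neurons in layers gives the larger networks described in this section.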
C. User Experience Design
Users have experiences with products; when these experiences are continuously positive, users stick with the product, and the product can reach a certain state of sustainability. The Nielsen Norman Group defines user experience "as an encompassing of all aspects of the end user's interaction with the company, its services, and its products" [21]. User Experience Design was used because every interaction between humans and devices needs a highly valid experience in use; otherwise, users will not have a satisfying encounter with the product, and it risks failing.
D. Speculative Design and Design Fiction
According to "Introduction to Speculative Design Practice," Speculative Design and Design Fiction historically refer back to the "critical design" practice, a term coined by Anthony Dunne: "it pointed to the radical architecture of the 1960s, and partially to the critical practice of avant-garde and neo-avant-garde art. They are particularly inspired by the narrative quality and imaginary worlds of literature and film." [20] "Later on, Anthony Dunne and Fiona Raby expanded the focus of critical Design activities to the cultural, social and ethical implications of new technologies, and, most recently, on speculations about broader social, economic and political issues. [...] The speculative design approach takes the critical practice one step further, towards imagination and visions of possible scenarios." [20] Speculative Design has to be conducted in a balance between a comprehensible future and the present (see Fig. 10); the key factor is careful management of the speculation.
Fig. 10. Alternative Presents and Speculative Futures [8]
This project is related to Speculative Design because it deals with behavior in a sociological context and with technology that is still under development and does not yet work as expected; however, the evident economic efforts suggest that the technology will reach the expected accuracy.
XI. METHOD
A. Literature Research
Literature research was necessary to get an idea about sound, soundwalks, sound classification, machine learning, and neural networks. The literature research was not only about reading: I gained great knowledge by testing tools I found during reading. For example, at the very beginning of the research I experimented with "OpenFrameworks" [22] and "Wekinator" [42] from wekinator.org and gained great knowledge about machine learning. This helped me understand the subsequent readings on machine learning. "The Wekinator is free, open source software originally created in 2009 by Rebecca Fiebrink. It allows anyone to use machine learning to build new musical instruments, gestural game controllers, computer vision or computer listening systems, and more." [42] The Wekinator provides an easy-to-use interface for quickly creating machine learning results.
B. Observation
Observation is a method that affords meticulous looking and precise recording of phenomena (see Fig. 11), which include people, artifacts, environments, events, behavior, and interactions [12].
C. Body-storming
Body-storming is akin to brainstorming, but it simulates the physical experience through role-playing in order to inspire new ideas. Designers use body-storming to immerse themselves in their users' situation, so that they get a closer idea of how their users experience the world [12].
D. Semi-Structured Interview
A semi-structured interview combines structured and unstructured interviews and uses closed and open questions. It follows a predefined structure, so that the same structure is covered with each interviewee [30].
E. Prototyping
A prototype is a physical or digital (click dummy, video, picture, etc.) representation of a product or a part of it, used to test the emotional response of potential users and the functionality. Prototyping works best for testing ideas within design teams and with clients and users [12].
F. Wizard of Oz
Wizard of Oz is a design method that works with faked functionality. Someone directs the system behind the scenes to make the user believe that the prototype works [24].
G. Thinking Aloud
Thinking aloud is conducted during user testing. The user says aloud what he or she is doing and thinking in the current situation. For analysis after the test, the user is recorded [38].
To gain more knowledge in the field of urban sound classification and frequent headphone users, I carried out four experiments. The experiments study the reaction of frequent headphone users to a machine-learning-based system that can react to urban sound classification. For the experiments, I built different prototypes, which were tested by at least 3 people per experiment. Every participant was interviewed afterward, and the results were used for further development in the next experiment.
H. Experiments ONE ”Experiment ONE” was carried out with headphones (QuiR headphone[25]) etComfort 25 Acoustic Noise Cancelling and built-in mono microphone. The experiment was an Observation (see XI-B) about, ”how does a microphone experience the world?”. It was an auditive perspective shift towards the technology which is attached to us. In order to detain the experience for analysis, I recorded the sound of the microphone while I am wearing my headphones and walking to university. The walk was conducted in Gothenburg, Sweden over a distance of 2.7 km on a rainy and windy day. During the walk, I used public transport (Trams/Buses), shopped a public transport ticket in a local store, talked with people in a shop and finally run to the tram in order to simulate a critical situation (see XI-C). After, I listened to the record and classified (see 11) what the microphone perceived in specific situations. What I discovered: • The microphone records sound more silent as it is in reality. • The microphone records easily Interference’s (noise resulting in microphone strips on clothes) • The Microphone records a lot of white noise if it is facing the direction of the wind. • Microphone records human noises like coughing, sniff, etc.
As a result of Experiment ONE, I created a library of visual sound classifications (see Fig. 11).

Fig. 11. Visual Sound Classification List

I. Experiments TWO
For the second experiment, I used a smartphone (see Fig. 12) and a web browser to create a prototype with HTML, CSS, and JavaScript (see XI-E). This prototype was capable of faking sound recognition in an urban environment.

Fig. 12. Experiment 02 - Experience Prototype of Urban Sound Classification

The prototype was able to imitate two different scenarios. The first scenario was about creating awareness and drawing attention to specific surroundings. The second scenario was meant to use the surroundings to enhance the pleasure of the music listening experience.

The prototype provokes these scenarios with different controls:
• control 01: A warning sound plays, and the music is paused/muted while the warning sound is playing.
• control 02: White noise is added to the background of the music (imitating wind/water).
• control 03: Echo (a reverb effect) is added to the music.
• control 04: Sound coming from the microphone is added to the music.

While testing the prototype, the user listens to three different songs, which vary in speed, volume, and genre:
• Song01: slow, not too loud, relaxing guitar music.
• Song02: medium fast, loud music with a lot of bass, electronic tunes, and human voices; relaxing mood.
• Song03: fast, medium loud, with a lot of bass and a mix of electronic music and guitar; euphoric and productive mood.

The user testing was organized in a Wizard of Oz approach (see XI-F): I directed the urban sound classification from behind the scenes. After the testing, I interviewed (see XI-D) the test persons and got the following results:
• The warning sound (control 01) is too sudden and can frighten.
• Adding white noise (control 02) feels relaxing and makes me calm.
• Adding echo (reverb effect) to the music (control 03) adds a bigger spatial feeling to the environment.
• Adding sound coming from the microphone to the music (control 04) shifts the focus from listening to music to the surroundings.
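The original prototype only faked these effects in a browser. As a hypothetical illustration of what each of the four controls does to the audio signal, the following Python sketch applies them to lists of float samples. The function names, gains, the 4410-sample delay (0.1 s at an assumed 44.1 kHz sample rate), and the choice of muting rather than pausing for control 01 are assumptions; control 03 uses a single echo tap, which is a crude stand-in for a real reverb.

```python
import random
from typing import List, Sequence


def control_01_warning(music: Sequence[float], warning: Sequence[float]) -> List[float]:
    """Mute the music while the warning sound plays, then let the music continue."""
    return [warning[i] if i < len(warning) else m for i, m in enumerate(music)]


def control_02_white_noise(music: Sequence[float], level: float = 0.05, seed: int = 0) -> List[float]:
    """Add uniformly distributed white noise (imitating wind/water) under the music."""
    rng = random.Random(seed)
    return [m + rng.uniform(-level, level) for m in music]


def control_03_echo(music: Sequence[float], delay: int = 4410, decay: float = 0.4) -> List[float]:
    """Add one delayed, attenuated copy of the music (a single echo, not a full reverb)."""
    out = list(music)
    for i in range(delay, len(music)):
        out[i] += decay * music[i - delay]
    return out


def control_04_microphone(music: Sequence[float], mic: Sequence[float], mic_gain: float = 0.5) -> List[float]:
    """Mix the microphone signal into the music so the surroundings become audible."""
    return [m + mic_gain * (mic[i] if i < len(mic) else 0.0) for i, m in enumerate(music)]


# Usage on tiny signals:
print(control_01_warning([0.5] * 5, [1.0, 1.0]))                 # [1.0, 1.0, 0.5, 0.5, 0.5]
print(control_03_echo([1.0, 0.0, 0.0, 0.0], delay=2, decay=0.5))  # [1.0, 0.0, 0.5, 0.0]
```

A real mixer would also need a limiter, since adding noise, echo, or microphone sound can push samples beyond the [-1.0, 1.0] range.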
As a result of Experiment TWO, I could say: a new layer on top of the music can shift the user experience, and warning sounds need time to be understood by the user.

J. Experiments THREE
Experiment THREE was a result of the difficulties in getting participants for the testing of Experiment TWO. In this experiment, I used a video prototype (see XI-E), and more specifically a Virtual Reality (VR) video (see Fig. 13). For the production of the video, I used a city walk channel on YouTube called "Wind Walk Travel Videos" [2]. With video editing software such as Adobe Premiere, I manipulated the video to make it look like a walk with urban sound classification. The testing with the VR prototype was conducted in a Thinking Aloud approach (see XI-G). After the testing, I interviewed the test persons in a semi-structured interview (see XI-D). The results were:
• The warning sound is too loud.
• The transition between the music and the warning has to be sharper; otherwise it does not feel serious enough.
• The warning sound is too quick.
• The warning sounds become annoying after a while.
• The warning sounds need some pre-configuration; the user does not want to be warned about everything all the time.
• The sounds are not clear enough; more than just a sound is needed to make the user aware of what the system recognized.
• The classification could be enhanced with other data (GPS, Google Maps, other geographic information services) in order to provide richer information.

Fig. 13. Virtual Reality Video Prototype of Experiment 03

K. Experiments FOUR
Experiment FOUR is a further development of Experiment THREE. It was conducted in the same manner, i.e., with a video prototype (see XI-E) and the same methodology: Thinking Aloud (see XI-G) and a semi-structured interview (see XI-D).

The further development focused on:
• Making the warning sound more recognizable and distinguishable from the music the user is listening to. This was implemented with variations of different voice feedback and notification sounds.
• Limiting the notification sounds to critical situations and pleasure situations (Immerse Mode). The user should feel that the system cares about him.
• In addition to urban sound classification, using different sources of "Geographic Information Services" to derive more context from the classified sounds. For example, if the user is in an area with a high rate of pickpocketing and the system classifies a large crowd of people around him, it notifies him of a high risk of pickpocketing.

The results of the experiment were:
• The test persons perceived the voice feedback as feedback on what they saw in their surroundings.
• The test persons perceived the information as useful in most cases.
• Location-based information about restaurants, bars, and waterfalls was perceived as annoying.
• The test persons perceived the system as capable of seeing, hearing, and recognizing the surroundings.
• The test persons liked the idea of being nudged about things they should be aware of in some situations (pickpockets in a crowd of people, heavy traffic, cyclists).
• The warning in critical situations is too slow (cyclist, tram bell).
• The volume control by the system was perceived as very helpful.

The results of the experiments are discussed further in XII.

XII. FINDINGS

A. Literature and Technologies Review
The results:
• Artificial intelligence (AI), machine learning, and neural networks are still young technologies.
• True AI does not exist yet. When people talk about AI, they mean narrow artificial intelligence (NAI) or weak AI: something that comes close to AI but is specialized in only one task. A true AI would be a machine that simulates the human intellect and is capable of creating, learning, manipulating, making jokes, and more.
• Urban sound classification is not a well-researched field of machine learning, in contrast to human speech classification.
• The training data needed to create a good model for urban sound classification is huge. Moreover, a sound in an urban environment can sound completely different in another country but mean the same thing, so building a model that works across countries needs even more training data.
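The pickpocket scenario of Experiment FOUR combines a classified sound with location data, and the test persons of Experiment THREE asked for pre-configuration of which warnings they receive. Both ideas can be illustrated with a simple rule. The following Python sketch is a hypothetical illustration, not the implementation used in this project (which relied on video prototypes): the area names, the risk values, the class label "crowd", the multiplicative combination, and the user threshold are all assumptions.

```python
from dataclasses import dataclass
from typing import Dict, Optional

# Hypothetical per-area risk scores in [0, 1]; in a real system these would come
# from a geographic information service (e.g. reported pickpocketing per district).
AREA_RISK: Dict[str, Dict[str, float]] = {
    "central_station": {"pickpocket": 0.8},
    "residential_park": {"pickpocket": 0.1},
}

# Hypothetical mapping from a classified urban sound to the risk it may indicate.
SOUND_TO_RISK: Dict[str, str] = {"crowd": "pickpocket"}


@dataclass
class SoundEvent:
    label: str         # class predicted by the urban sound classifier
    confidence: float  # classifier confidence in [0, 1]


def risk_score(event: SoundEvent, area: str) -> float:
    """Combine sound evidence and location evidence into one score in [0, 1]."""
    risk = SOUND_TO_RISK.get(event.label)
    if risk is None:
        return 0.0
    # Multiplying means both sources must agree before the score becomes high.
    return event.confidence * AREA_RISK.get(area, {}).get(risk, 0.0)


def notification(event: SoundEvent, area: str, user_threshold: float) -> Optional[str]:
    """Return a notification, or None if the score is below the user's chosen threshold."""
    score = risk_score(event, area)
    if score > 0.0 and score >= user_threshold:
        return f"High risk of {SOUND_TO_RISK[event.label]} around you"
    return None


crowd = SoundEvent(label="crowd", confidence=0.9)
print(notification(crowd, "central_station", user_threshold=0.5))   # High risk of pickpocket around you
print(notification(crowd, "residential_park", user_threshold=0.5))  # None
```

Raising `user_threshold` is one way to let each user decide how urgent a situation must be before being notified; with a threshold of 0.8, the same crowd at the station (score 0.72) would no longer trigger a warning.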
B. Interviews and User-testing
• Urban sound classification needs to be connected to other data, such as geographic information services, in order to create highly valid feedback (example: the pickpocket scenario, see XI-K).
• Using urban sound classification to control the volume of the music enhances the experience of being more aware in critical situations and more immersed in uncritical situations.
• According to the research, some people get annoyed quickly and some do not. The point at which something becomes annoying is very personal. That is why users should be able to regulate the urgency of notifications themselves.
• The bigger the sound model used for urban sound classification, the longer it takes to recognize a sound.

XIII. REFLECTION
In retrospect, I have to acknowledge that limited time, resources, and testing capacity limited the results of this project.
In this project, I faced the complexity of machine learning and its development. My programming skills limited me in building a working prototype. However, with the approach of building an experience prototype (see XI-I) and video prototypes (see XI-J), I got as close to real results as I could.
As a reflection on machine learning, I have to mention the reaction time, which depends on the computing power and the size of the trained neural network model. The reaction time of a human is up to "0.7 seconds" [27] in an unprepared state when triggered by visual or auditory stimuli. If the reaction time of the neural network model is even longer, the machine reacts more slowly than a human, and the system becomes useless to the user.
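This comparison can be made concrete as a timing budget. The following Python sketch is a hypothetical illustration: only the 0.7 s human reaction time comes from the text [27]; the split into recording window, inference, and output delay, and all the example numbers, are assumptions and were not measured in this project. It also assumes, as a worst case, that a sound must fill a whole analysis window before it can be classified.

```python
from dataclasses import dataclass

# Upper bound of the reaction time of an unprepared human, as cited in the text [27].
HUMAN_REACTION_S = 0.7


@dataclass
class PipelineTiming:
    """Hypothetical timing budget of a warning pipeline, in seconds."""
    window_s: float     # audio that must be recorded before the model can decide
    inference_s: float  # model run time per window; tends to grow with model size
    output_s: float     # delay until the warning sound actually starts playing

    def worst_case_latency(self) -> float:
        # Assumes the sound has to fill a whole analysis window before it is classified.
        return self.window_s + self.inference_s + self.output_s

    def margin(self, human_s: float = HUMAN_REACTION_S) -> float:
        """Positive if the system warns before an unprepared human would react."""
        return human_s - self.worst_case_latency()


small_model = PipelineTiming(window_s=0.5, inference_s=0.08, output_s=0.05)
large_model = PipelineTiming(window_s=0.5, inference_s=0.25, output_s=0.05)
print(f"{small_model.margin():+.2f} s")  # +0.07 s
print(f"{large_model.margin():+.2f} s")  # -0.10 s
```

With these assumed numbers, a larger model that needs 0.25 s instead of 0.08 s per window turns a small lead over the human into a lag, which is exactly the case in which the system becomes useless.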
The approach of sonic augmented reality shows great potential for creating awareness in frequent headphone users, although it is still unclear where it is best suited. Further research and user testing are needed to find out how sonic augmented reality (the interaction between sound and geographic information services) can support headphone users in daily use.

XIV. CONCLUSION
Regarding the prevention of accidents involving frequent headphone users in an urban environment, I found that machine learning on its own is not the final solution, but it shows great potential when used together with geographic information services (sonic augmented reality). I would assume that further research in this field could lead to the desired result. It is also important to point out the current limitations of machine learning and of artificial intelligence in general. The current state of this technology makes it difficult to build a system that works error-free. In terms of effort and results, it may not be the best technology for saving people in critical urban situations.
ACKNOWLEDGMENT
I want to thank my supervisor Palle Dahlstedt for providing me with information to continue this project and for rich discussions that enhanced my ideation and testing stages with extra information.
REFERENCES
[1] [1309.7478] The achievable performance of convex demixing. https://arxiv.org/abs/1309.7478. (Accessed on 01/10/2019).
[2] Wind Walk Travel Videos - YouTube. https://www.youtube.com/channel/UCPur06mx78RtwgHJzxpu2ew. (Accessed on 01/10/2019).
[3] 5 Stages in the Design Thinking Process - Interaction Design Foundation. https://www.interaction-design.org/literature/article/5-stages-in-the-design-thinking-process. (Accessed on 12/23/2018).
[4] ACOUSTIC ECOLOGY - Soundscape Links. https://www.acousticecology.org/soundscapelinks.html. (Accessed on 01/08/2019).
[5] Audio classification using Image classification techniques - Codementor. https://www.codementor.io/vishnu_ks/audio-classification-using-image-classification-techniques-hx63anbx1. (Accessed on 01/05/2019).
[6] AudioSet. https://research.google.com/audioset/. (Accessed on 01/09/2019).
[7] Auditory Brain, auditory perception - Cochlea. http://www.cochlea.org/en/hearing/auditory-brain. (Accessed on 01/07/2019).
[8] James Auger. "Speculative Design: Crafting the Speculation". In: Digital Creativity 24 (Mar. 2013). DOI: 10.1080/14626268.2013.767276.
[9] CEC - eContact! 5.3 - An Introduction To Acoustic Ecology by Kendall Wrightson. https://econtact.ca/5_3/wrightson_acousticecology.html. (Accessed on 01/18/2019).
[10] GoogleCloudPlatform/cloudml-samples: Samples for Google Cloud Machine Learning Engine. https://github.com/GoogleCloudPlatform/cloudml-samples. (Accessed on 01/05/2019).
[11] google/youtube-8m: Starter code for working with the YouTube-8M dataset. https://github.com/google/youtube-8m. (Accessed on 01/09/2019).
[12] Bruce Hanington and Bella Martin. Universal Methods of Design: 100 Ways to Research Complex Problems, Develop Innovative Ideas, and Design Effective Solutions. Rockport Publishers, 2012, p. 208. ISBN: 1592537561.
[13] How to teach Neural Networks to detect everyday sounds - Skcript. https://www.skcript.com/svr/building-audio-classifier-nueral-network/. (Accessed on 01/05/2019).
[14] Human-Centered Design Considered Harmful. https://jnd.org/human-centered_design_considered_harmful/. (Accessed on 12/23/2018).
[15] Bernhard Laback. "The Psychophysical Bases of Spatial Hearing in Acoustic and Electric Stimulation". (February 2013), p. 21. URL: https://www.kfs.oeaw.ac.at/staff_and_associates/laback/Habilitation_Laback.pdf.
[16] Richard Lichenstein et al. "Headphone use and pedestrian injury and death in the United States: 2004–2011". In: Injury Prevention 18.5 (2012), pp. 287–290. ISSN: 1353-8047. DOI: 10.1136/injuryprev-2011-040161. eprint: https://injuryprevention.bmj.com/content/18/5/287.full.pdf. URL: https://injuryprevention.bmj.com/content/18/5/287.
[17] Listening to the Deep Ocean Environment. http://listentothedeep.net/acoustics/index.html. (Accessed on 01/12/2019).
[18] MacBook Pro (Retina, 13-inch, Early 2015) - Technical Specifications. https://support.apple.com/kb/SP715?locale=de_DE. (Accessed on 01/05/2019).
[19] Microsoft Soundscape - Microsoft Research. https://www.microsoft.com/en-us/research/product/soundscape/. (Accessed on 01/08/2019).
[20] Ivica Mitrović. "Introduction to Speculative Design Practice".
[21] Don Norman and Jakob Nielsen. "The definition of user experience". In: NN/g Nielsen Norman Group. https://www.nngroup.com/articles/definition-user-experience/ (accessed 2018-05-20) (2016).
[22] openFrameworks. https://openframeworks.cc/. (Accessed on 01/13/2019).
[23] Projects - Machine Learning, Neural Networks, & AI - opensource.google.com. https://opensource.google.com/projects/explore/machine-learning. (Accessed on 01/05/2019).
[24] Prototyping: Learn Eight Common Methods and Best Practices - Interaction Design Foundation. https://www.interaction-design.org/literature/article/prototyping-learn-eight-common-methods-and-best-practices. (Accessed on 01/10/2019).
[25] QuietComfort® 25 Acoustic Noise Cancelling® headphones - Apple devices. https://www.bose.com/en_us/products/headphones/over_ear_headphones/quietcomfort-25-acoustic-noise-cancelling-headphones-apple-devices.html#v=qc25_black. (Accessed on 01/05/2019).
[26] radio aporee ::: maps - sounds of the world. https://aporee.org/maps/. (Accessed on 01/12/2019).
[27] Reaction time in accident reconstruction. http://www.technology-assoc.com/articles/reaction-time.html. (Accessed on 01/14/2019).
[28] Megan A. Reich. "Soundscape Composition as Environmental Activism and Awareness: An Ecomusicological Approach". (2016).
[29] Davide Rocchesso et al. "Sonic Interaction Design: Sound, Information and Experience". In: CHI '08 Extended Abstracts on Human Factors in Computing Systems. CHI EA '08. Florence, Italy: ACM, 2008, pp. 3969–3972. ISBN: 978-1-60558-012-8. DOI: 10.1145/1358628.1358969. URL: http://doi.acm.org/10.1145/1358628.1358969.
[30] Yvonne Rogers, Helen Sharp, and Jenny Preece. Interaction design: beyond human-computer interaction. John Wiley & Sons, 2011.
[31] Sonic City - RISE Interactive. https://www.tii.se/projects/sonic-city. (Accessed on 01/11/2019).
[32] Sound Classification with TensorFlow - IoT For All - Medium. https://medium.com/iotforall/sound-classification-with-tensorflow-8209bdb03dfb. (Accessed on 01/05/2019).
[33] Sound Intensity and Level - Boundless Physics. https://courses.lumenlearning.com/boundless-physics/chapter/sound-intensity-and-level/. (Accessed on 01/06/2019).
[34] Teachable Machine. https://teachablemachine.withgoogle.com/. (Accessed on 01/05/2019).
[35] TensorFlow Audio Recognition in 10 Minutes - DataFlair. https://data-flair.training/blogs/tensorflow-audio-recognition/. (Accessed on 01/05/2019).
[36] TensorFlow Sound Classification Tutorial - IoT For All. https://www.iotforall.com/tensorflow-sound-classification-machine-learning-applications/. (Accessed on 01/05/2019).
[37] The Eerie Sounds of Saturn's Radio Emissions. http://cassini.physics.uiowa.edu/space-audio/cassini/SKR1/. (Accessed on 01/12/2019).
[38] Thinking Aloud: The #1 Usability Tool. https://www.nngroup.com/articles/thinking-aloud-the-1-usability-tool/. (Accessed on 01/10/2019).
[39] Time with Tunes: How Technology is Driving Music Consumption. https://www.nielsen.com/us/en/insights/news/2017/time-with-tunes-how-technology-is-driving-music-consumption.html. (Accessed on 01/11/2019).
[40] Urban Sound Datasets - Urban Sound Datasets. https://urbansounddataset.weebly.com/. (Accessed on 01/09/2019).
[41] Visualizing High-Dimensional Space by Daniel Smilkov, Fernanda Viégas, Martin Wattenberg & the Big Picture team at Google - Experiments with Google. https://experiments.withgoogle.com/visualizing-high-dimensional-space. (Accessed on 01/05/2019).
[42] Wekinator - Software for real-time, interactive machine learning. http://www.wekinator.org/. (Accessed on 01/10/2019).
[43] What is neural network? definition and meaning - BusinessDictionary.com. http://www.businessdictionary.com/definition/neural-network.html. (Accessed on 01/18/2019).
[44] Kendall Wrightson. "An Introduction to Acoustic Ecology". (2016).
[45] Yiqing Lin and Waleed H. Abdulla. "Audio Watermark: A Comprehensive Foundation Using MATLAB". (2015), pp. 15–16. DOI: 10.1007/978-3-319-07974-5.