To develop voice assistants like Siri and Alexa, companies spend years investigating what sounds like a human voice and what doesn't. But what we've ended up with is just one possibility among the kinds of voices we could be interacting with. In this episode, we talk to sound engineer Frederik Juutilainen and Stina Hasse Jørgensen, assistant professor at the University of Copenhagen, about their participation in [multi'vocal], an experimental research project that created an alternative voice assistant by asking people at a rock festival in Denmark to speak into a portable recording box. We talk about voice assistants' inability to stutter, lisp and code-switch, and whether a voice can express multiple personalities, genders and ages.
Stina Hasse Jørgensen (she/her) is an assistant professor in the Digital Design Department at the IT University of Copenhagen. Stina started her path as a practitioner and theorist of sound art and interactive sound at Tonespace / Electronic Music and Sound Art at the Danish National Academy of Music (SDMK). She then took courses on advanced music technology and creative sound design at the School of Music and Media Arts at Royal Holloway, University of London (RHUL), and studied sound art in Art History and Auditive Culture at the University of Copenhagen (UCPH). Later she was taught by Douglas Repetto, Brad Garton and George Lewis at the Columbia Computer Music Center (CMC).
Frederik Juutilainen is a software developer and DJ from Copenhagen. He has a degree in philosophy and computer science from Roskilde University and a Master's in IT & Cognition. He works in the development and programming of digital and physical installations, and is part of Group Therapy, a Copenhagen-based club night focusing on diversity and representation in underground dance music.
READING LIST:
Multi'Vocal: https://multivocal.org/
2021: "The generation of a [multi’vocal] voice," in Seismograf Peer, special issue Sounds of Science: Composition, recording and listening as laboratory practice. Eds. Sanne Krogh Groth and Henrik Frisk. (2021). Written with Alice Emily Baird, Frederik Tollund Juutilainen, Mads Pelt & Nina Cecilie Højholdt.
Agnew, William, Julia Barnett, Annie Chu, Rachel Hong, Michael Feffer, Robin Netzorg, Harry H Jiang, Ezra Awumey, and Sauvik Das. “Sound Check: Auditing Audio Datasets,” 2024. doi:10.48550/arxiv.2410.13114.
Jai Vipra and Sarah Myers West, 'Computational Power and AI': https://ainowinstitute.org/publication/policy/compute-and-ai.
Michael Kwet, Digital Degrowth: https://www.plutobooks.com/9780745349862/digital-degrowth/.
TRANSCRIPT:
Kerry: Hi, I'm Dr. Kerry McInerney. Dr. Eleanor Drage and I are the hosts of the Good Robot podcast. Join us as we ask the experts: what is good technology? Is it even possible? And how can feminism help us work towards it? If you want to learn more about today's topic, head over to our website, www.thegoodrobot.co.uk, where we've got a full transcript of the episode and a reading list by every guest. We love hearing from listeners, so feel free to tweet or email us. And we'd so appreciate you leaving us a review on the podcast app. But until then, sit back, relax and enjoy the episode.
Eleanor: To develop voice assistants like [00:01:00] Siri and Alexa, companies spend years investigating what sounds like a human voice and what doesn't. But what we've ended up with is just one possibility among the kinds of voices we could be interacting with. In this episode, we talked to sound engineer Frederik Juutilainen and Stina Hasse Jørgensen, assistant professor of digital sound at the University of Copenhagen, about their participation in [multi'vocal], an experimental research project that created an alternative voice assistant by asking people at a rock festival in Denmark to speak into a portable recording box. We talk about voice assistants' inability to stutter, lisp and code-switch, and whether a voice can express multiple personalities, genders and ages. We hope you enjoy the show.
Kerry: Thank you so much to both of you for joining [00:02:00] us, and also to our special guest, Stina's baby, who may make a vocal appearance at some point. It would not be the first baby that we've had on the podcast, but it's partly why this episode is going to be audio only. So for all of our lovely YouTube watchers, you can still listen to this episode on YouTube and, of course, on our normal platforms: Apple, Spotify, you know the drill. Thank you so much to all three of you for joining us here. So just to kick us off, could you tell us a little bit about who you are, what you do, and what's brought you to thinking about the voice, gender and technology?
Frederik, could we start with you?
Frederik: Yes. I think these are all topics that I've been separately interested in and working on for a long time, and then more and more they ended up getting tangled together somehow: things that relate to audio generation and mangling and production, and especially using machine learning techniques for that.
And then I have a background in computer science and philosophy, with a heavy focus on [00:03:00] phenomenology and gender studies, so that kind of entered the mix. And I also work a lot in nightlife and work with feminism through activism and allyship. And Stina and I, we know each other from university.
And then later I got involved in the multivocal project, which had already been running for some time, with Alice.
Kerry: Fantastic. And now Stina, we would love to hear from you. How wonderful that you went to university together; I didn't know that.
Stina: Yeah, the University of Copenhagen, where I will soon be going back to be an assistant professor in sound studies with a focus on digital sound. Yeah, and I've long been interested in doing practice-based research on the politics and aesthetics of voice and voice design. That has been my main focus, especially researching this from a feminist point of view.
Eleanor: So our podcast is called The Good Robot. So what is good technology? And maybe we can think about this [00:04:00] specifically in relation to sound technology. What is a good sound technology? Is it possible? And how can we work towards it using feminist ideas?
Frederik: Yeah, I was thinking about this.
This is a very hard question. I remember someone at one point saying something in relation to what a good technology is: that it's like magic, and magic is like porn, you know it when you see it. And I think similarly, we can think about good technologies in the sense that it can be very hard to define, but you have a sensitivity and a feeling, and sometimes an intuition about it, when you see it or when you experience it.
And I think one of the things that we should shy away from is trying to think of technologies as something that exists in a vacuum of some sort; they are always being defined or co-constructed in the networks, or in the realities, in which the technology is embedded.
So this co-creation between technologies and politics is very [00:05:00] important: you can't remove the technologies from the politics, and you can't remove the politics from the technologies. And I think one of the things that we talked a lot about when we started this project is how technologies have to be created with and by the people that are affected by them.
And of course, 'created with' can be a fluffy term, especially when you work with something like machine learning. Does it mean that we include people's data? Does it mean the people that have been working with the machine learning techniques, or the decision makers?
And I think one of the things that we have been very driven by with this project from the start has been to try and work with ideas of democratization: who can experiment with voice technologies and machine learning, and how.
Stina: And I might add that we also believe that transparency [00:06:00] is hugely important when thinking about what good technology is.
And not transparency in the sense of, oh, but you can also just get the Python code, but an actual responsibility to communicate what this technology is about, so that citizens in a democracy can understand it and be part of decision making. So I think that is another part that is really important.
And then the last one that we also talked about briefly is climate-aware technology, in the sense that there is a responsibility when working with technology to also think about the footprint it makes in relation to climate. Is it worth it? What do we need? What is worth it?
And of course, who can answer that? I don't know. But I think it's very important to have climate-aware technology as well.
Frederik: This is also an odd time to work with this, because after the waves that have been going towards open source technologies, where suddenly everybody has a laptop and is able to [00:07:00] program and so on, we are now relying so heavily on very large amounts of compute, which is not the reality for most people. And similarly, when we talk about transparency: what kind of techniques do we have for opening up and understanding decision making in deep neural networks, when that is very much an unsolved problem at the same time?
Kerry: Thank you. And definitely, for our listeners, we always have a reading list to go with each episode, so we'll attach some resources on the concentration of compute power in the hands of a few big companies, and on the way that this DIY approach to making these technologies has unfortunately started to become quite constrained by that exact concentration of compute.
But we actually came across your work because we thought you were both working on a really interesting project called Multivocal that was trying to make a little aperture in this idea of what a good technology could be in the voice space. And we actually want you all to hear what this voice tech sounded like.
Eleanor: I've [00:08:00] always been interested in how the human is indexed by its exclusions, and there is no human without race and gender and these ways that we divide ourselves. And this amazing, sort of alien sound: to me it sounds like some sort of UFO moving away in space, like the flutter of wings. It's at once natural and not, and I love that reverberation sound.
I found it totally captivating, because technologies often try and hide the multiplicity within them: all the data, all the different voices that go into making a product, even if at the end it's just one person speaking into a microphone. And I love that you can hear all the people who recorded at the festivals that you used to make that voice.
And it's many different humans, and it becomes not human, because it doesn't call to you with gender in the same way. I just found it totally captivating.
Kerry: This is a really different interpretation of a voice [00:09:00] technology to what many of us are probably used to: your Siri or your Alexa, which is often a unified, singular, semi-robotic-sounding voice. So could you tell us a little bit more about Multivocal? Why did you make this project? How did you do it? And what exactly were you trying to challenge about that conventional voice that we're used to with things like Alexa or Siri?
Frederik: I think one of the things that was very much a driver for the project is curiosity. Curiosity about speech-driven or speech-centered technologies and the fact that they have a voice.
And of course there are ones where you can decide which of the voices you want, but these are still singular, and it would seem intuitive to think that, okay, this singularity of the voice is because of how it's being generated and how it's being trained: it's a voice actor that is recording.
And then we have a machine learning approach that tries to [00:10:00] emulate the same speaking qualities as the speakers present in the dataset. And I think we have a little bit of a punk attitude, and it's, 'okay, what happens if we try and throw a lot of different kinds of speakers through these kinds of networks?' Because it's a very novel thing that you can attempt to decouple voices from a body. If you don't count recordings or radio, which of course have been around for a very long time, there would be a kind of connection, or an imagined connection, between the speaking body and the audio that is being perceived. And this then puts a limit on it: okay, this body has a gender, this body has a geographic origin, this body has an age. What will happen if we then pass multiple people through these neural networks? Will it be like a multiplicity, or [00:11:00] will it be a completely distorted other? So very much the project that we were working on with multivocal is trying to take some of the assumptions about what a voice-based technology is, and then, a little bit naively at times, take a step back and see: what if it wasn't so?
Kerry: Fantastic. And just for the sake of our listeners, could you very quickly describe for us what multivocal, quote unquote, 'is'? Because it's such an unusual and fascinating project. And then also how you made this: how did you go through the entire process of creating this final product?
Stina: I will try to give it a go. Let's see how it goes with some baby sounds. Just to maybe clarify: multivocal was a collective and a project from 2015 to 2022.
So we're also talking about what are now older technologies; a lot of things have happened since. But some of the core ideas are still relevant, and some of the [00:12:00] explorations are still relevant. And it's not only me and Frederik, but also Alice Emily Baird, who is an audio researcher and artist from the UK, and Mads Pelt, who is a programmer and software architect at a Danish startup.
And also Nina Cecilie Højholdt, who is an interaction designer and artist, from Copenhagen as well. And we've been collaborating on this, as Frederik said, starting from the curiosity of why Siri and Alexa sound as they do. Could it be different? And then we've been out making a lot of different installations, collecting voices at different festivals, both music festivals and tech festivals, and at universities and different places, to create this voice. And then we've been doing a lot of different tests with different kinds of machine learning methodologies, trying to generate these synthetic voices with different [00:13:00] approaches. And I think that we find it very important to think about how these synthetic voices are generated as vocal performances, shaping how we come to understand ourselves and others through these vocal material relations, but also what kind of cultures they create. That's been our main interest as well. And Frederik, maybe you can talk a little bit more about the ways in which we've been producing these voices?
Frederik: Yeah, the way that you would generally work with this is to try and have a very pristine audio recording situation, and most likely also a trained voice actor who is speaking their mother tongue, and so on.
And the way that we have been doing this has been very much to do these kinds of interventions or installations where people could record sentences and then have them be incorporated into this synthetic voice. [00:14:00] And I think that happened also because we did this at festivals; we had to build a box that we could leave somewhere.
Within the limitations of the project, we ended up creating an extremely messy dataset. I think this was not the intention from the get-go, but it very quickly became something that we enjoyed a lot about the project, that this is supposed to be... like, even the conversation that we are having now is also messy.
We have to mute when we're not talking, and there are babies in the background, and then we had voice recordings where there are multiple people screaming and you can hear a death metal concert in the background, and so on. And this messiness of the data combined with the fact that speaking always happens in a performative manner, something we were also trying to explore: the way that people chose to donate not only the [00:15:00] characteristics of their voice, but their voice itself, ended up having a very big impact on the data that we've been working with. And then the way that we worked with this is that we have experiments in doing transfer learning: taking an existing synthetic voice and infusing the style of our recordings onto it.
And then we also tried experiments with training from the ground up, doing a pre-training only with the data that we have. And I think we quickly found out that the more we relied fully on having data that was, quote unquote, multivocal, the more multivocal it became. And a part of the work that we've done has also been to document, because you do many training iterations with this, to document these different [00:16:00] iterations and explore: when does this start feeling like a voice?
And what the voice-like aspect ends up being, what's very unifying somehow, is breathing. Not breathing in a necessarily elegant way, but more in the sense that it sounds like something that's been made with air. And I think it's very interesting that we had so many different people speaking, and the great unifier is that there's oxygen involved somehow.
Stina: And maybe I can just add that one other finding from these recordings that we did at festivals, with, as you say, death metal noise in the background, is really the fact that it's such a weird and awkward paradigm for synthetic voices and voice design currently that everything is recorded in a studio, without the surroundings being at all present, without the baby breathing in the background.
You don't hear other people, you don't hear the room tone. All these kinds of things are super interesting. And of course there are [00:17:00] some technological reasons behind it, but still: what kind of idea of a voice do we get from these synthetic voices, and could it be different?
And there is, of course, some question of when something is understandable as a voice when you have these kinds of connections to the surrounding space. But there's also a really interesting part of this where it's like: okay, so normally this is all studio-based recordings, and we took the studio out into the open field of people interacting.
Kerry: Oh my goodness, I love this idea of taking the studio out, because Eleanor and I both do a lot of radio and, of course, podcasting, and so much of the emphasis is on the idea of the clean sound. I'm sitting in this supposedly soundproof, definitely not actually soundproof, but theoretically soundproof part of the office.
And it's all about trying to ensure that there's no outside interference. So I think what you're doing is cutting against the standards, or the normal procedures, [00:18:00] of sound and recording. So I want to hear a bit more about how you did this at festivals, right?
This has just really intrigued me. So if I'm the average festival goer and I'm going to go listen to some death metal music or something - Eleanor knows this would never happen, because I hate camping and I hate going to bed after 10pm - did you just have a tent there and some recording equipment?
How did you get people to buy in to having their voices recorded? Because I feel like people can be quite suspicious of these sorts of things.
Frederik: Yeah, I think one of the big ironies of this project was also: how many machine learning developers does it take to build a box that you can record things in?
We were trying to find a mixture of something that was somehow usable and approachable, but at the same time, we put our own money into this box, and we also talked about the fact that if somebody ran off with it, it shouldn't be the end of the world somehow.
So what we did, more specifically, is that we built a wooden box [00:19:00] with some very ad hoc way of trying to dampen the reverberations inside it, and then you would put your head up into the box. And then we had a small microphone and a Raspberry Pi, which showed a sentence that you could read aloud, and you could hold a button down while you read it aloud. And that was it. We didn't ask for any information. We didn't give any indication that you should only do this many recordings, or make sure that you speak directly into the microphone, and so on.
In hindsight, maybe we would have done more of this, but the things that we created also very much became a product of it. And then we had the same box that we moved around: at a university, for example, it became much more of an academic project somehow, I think.
And then when we had it at Roskilde Festival, it was like a thing to do at night; it had the feeling of going and shouting into the box. [00:20:00] And I think one of the things to add on to what Stina is saying is that if we think about synthetic speech as a computer that has to read something aloud for us to understand or comprehend, that is an issue that was solved many years ago. So what TTS is doing now is not necessarily only in order to make people comprehend text better; it becomes a brand for corporations, or it's something that's supposed to evoke a feeling somehow. So in this project, where we didn't work with deciding what kind of feelings we want or what kind of voice this is, we also very much ended up, in a way, making comprehension a hard problem again.
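The box itself is simple enough that its behaviour can be sketched in a few lines of Python on a Raspberry Pi. The following is our own hypothetical reconstruction, not the project's code: it assumes a gpiozero push button wired to GPIO pin 17, a USB microphone readable through sounddevice, and a sentences.txt file of prompts.

```python
# pip install gpiozero sounddevice soundfile numpy
import random
import numpy as np
import sounddevice as sd
import soundfile as sf
from gpiozero import Button

SAMPLE_RATE = 44100
button = Button(17)  # assumed wiring: a push button on GPIO pin 17
sentences = open("sentences.txt").read().splitlines()

take = 0
while True:
    print(random.choice(sentences))  # show a sentence to read aloud
    button.wait_for_press()          # wait until someone holds the button
    frames = []
    with sd.InputStream(samplerate=SAMPLE_RATE, channels=1) as stream:
        while button.is_pressed:     # record only while the button is held
            chunk, _overflowed = stream.read(1024)
            frames.append(chunk)
    if frames:                       # save the take; no metadata is asked for
        sf.write(f"take_{take:05d}.wav", np.concatenate(frames), SAMPLE_RATE)
        take += 1
```

Everything else, as Frederik describes, is left to the crowd: no speaker labels, no instructions, no quality control.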
Stina: But maybe I just want to add that this box, apart from having the aim [00:21:00] of recording different people, all sorts of genders and ages and demographic situations, was also meant to create debates around synthetic voices. And that was a huge part of this, because it was also like: what kind of vocal future do we want? How do we want technology to play a role in this? So just as much as this was an item for collecting voices, it was there to spark debate, to have people ask questions around this. We also had sessions where we presented the project and had people start questioning some of these synthetic voices in the different places where we had the box up, because the box was also meant as a kind of conversation starter around these things. So it wasn't just a lone voice-collecting box.
Frederik: I was just going to add, I think that this kind of debate was also sparked by the fact [00:22:00] that the majority of us don't sound like a mainstream TTS voice. For most people, there is no speaking without listening, and there's this reflection that you have when you start: where does the awkwardness come from?
Where does the tension come from when you record and donate your voice, and you think about, okay, what kind of voice do I actually have? And what kind of representation is stored in the voices? That was very much also a thing that happened with the people who chose to actively participate and do the recordings, and in the reflections that we had with these people.
Eleanor: That personal aspect you don't get with normal voice technologies; you get it in that friendly atmosphere of the festival, where everyone's making best friends and not having a shower. That's a totally different way of exploring what it means for people to come together and sing in unison in front of a stage, in [00:23:00] front of their favorite people. There's this kind of voice collective thing that goes on at a festival that is very unusual anyway in day-to-day life. I wanted to ask you about what you call the paralinguistic features, and you can explain to listeners what a paralinguistic feature is, but I was really taken by how you described the voice that you created not as a voice but as a breath. It feels like a big inhalation, or perhaps like an exhalation: a very captivating, life-filled sound. And you said you wanted a voice that conjures a different feeling to a traditional voice technology. My boyfriend asks Alexa every morning what the news is, and there's a feeling of familiarity; it's supposed to give a feeling of friendliness, companionship. Those are the kinds of feelings or affects that we are supposed to experience when we relate to these technologies. What kind of feeling, then, was this voice supposed [00:24:00] to induce, or did it end up inducing?
And can you tell me what this has to do with the paralinguistic features: sound, rhythm, vocal expressions, the other elements of how we speak and not just the words that we use?
So can you tell me about all those kinds of other things around the content of the voice?
Stina: Maybe just briefly: I think that one of the things we wanted to create was actually this investigation, using these technologies, into when something becomes a voice. When is something actually at the stage where we think, oh, this sounds like a voice?
And when is it not? So we're really interested in this whole gray zone of things sounding and not sounding like a voice, oscillating between those, and having this ambiguity really in it, having also the technology [00:25:00] be part of it, the sounding technology as well. And I think that has definitely been interesting to us: this becoming a voice, a technological voice, that kind of process.
And maybe then the listeners can also be taken into this process: instead of just hearing the end result of a training, actually having the training be the result, right? That is something that we never hear. We don't hear how Siri or Alexa is being trained; we only hear the end result.
So really taking it into the machinery of things.
Frederik: Yeah. And I think, with that, it was also very much, hating to use that word but still going to use it, data-driven, in the sense that I think we honestly had no idea what it would sound like. And then going through this process of trying to work with it, and listening back, I think we were also discussing: okay, how do we approach this [00:26:00] now? Because we do know that this sounds like people have been shouting at a festival. Should we limit that? How much should it be its own thing, and how much should we try and go in with an imagined multivocal voice and try and generate that?
And I think we very quickly ended up talking about how, if you wanted to generate a voice where you know what it's going to sound like, then this is the absolute worst way of doing it. We really had to go with the recordings and the data that we had, and have it be led by that.
There could be some ideas about ambiguity where you could think, okay, maybe it has the appearance of multiple genders and multiple ages, so while listening through, suddenly this one is going to peek forward, and now a different one is going to be there.
And I can sense these different personalities peeking through. So there could be that. But I think what we had much more was this extreme ambiguity, where sometimes it's an ambiguity between different kinds of imagined bodies, but sometimes it's also [00:27:00] an ambiguity of thinking: is this even a body that is generating this?
And relating it to the question of paralinguistics: when you generate synthetic speech, you have your trained neural network, and then you input text and say, this is the text that I want read aloud. And we also worked a lot with trying to read aloud nonsense.
So this would not be focusing on the semantic value of what the TTS was generating, but trying to focus more on the linguistic variety of what we could produce. Then, when you talk about paralinguistics, there's also very often talk about pitch and intonation, [00:28:00] and all these things that are not directly the phonetic components of what you're saying.
I think what we then also found is that a big part of this is the glitches that we have in the data: glitches that can stem from people starting to speak before or after they hit the button for recording, or from the microphone bouncing against something, and so on.
So we have this perfect storm of technical accidents and weirdness. Yeah, I think glitching is a good word to describe what also ended up being a big connector between the different voices that we hear in our datasets.
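For listeners who want to hear the nonsense-text idea for themselves, any off-the-shelf text-to-speech system can be fed pronounceable non-words so that only rhythm, pitch and breath survive. Here is a minimal sketch in Python, assuming the open-source Coqui TTS package and one of its stock English models as a stand-in (this is not the project's own trained network):

```python
# pip install TTS  (the open-source Coqui TTS package)
import random
from TTS.api import TTS

# Any stock single-speaker model works as a stand-in.
tts = TTS(model_name="tts_models/en/ljspeech/tacotron2-DDC")

# Build pronounceable nonsense so the output carries rhythm, pitch and
# breath, but no semantic content.
onsets = "b d f g k l m n p r s t".split()
vowels = "a e i o u".split()

def nonsense_word(syllables=3):
    return "".join(random.choice(onsets) + random.choice(vowels)
                   for _ in range(syllables))

line = " ".join(nonsense_word(random.randint(2, 4)) for _ in range(6))
print(line)
tts.tts_to_file(text=line, file_path="nonsense.wav")
```

With the words emptied of meaning, what is left in the output is exactly the paralinguistic layer Frederik describes: pacing, stress and breath.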
Kerry: That's really fascinating. I love this idea of the glitch, or the introduction of the unexpected and the different, because I do think something that your project critiques, both implicitly and explicitly, is the push towards homogenization in voice technology. And I guess this also comes back to what Eleanor was saying about indexing the human, because I think our voices are carriers [00:29:00] of so many different kinds of markers of difference that often relate to social inequality.
So when I came here, when I was 18, I had only a very vague understanding, almost no understanding, of the way that British accents index class and regionality and carry huge social and political meaning. I was much more familiar with accent discrimination in relation to race, racialized communities and xenophobia in New Zealand, where I'm from.
And I say this as someone who I think has a very white voice, and so I've moved through a lot of spaces really benefiting from that, in ways that I saw peers and family members not benefiting from. And I think, when we see this push towards creating a single electronic voice,
we lose so many different kinds of difference, to overuse that word: everything from vocal affectations and things like stuttering or lisping through to different kinds of pronunciation and code switching. And I was wondering, to what extent do you think that voice technologies as they stand can grapple with or handle these kinds of difference?
Or to what extent do you [00:30:00] think that, as they currently stand, this push towards homogenization is inevitable, and that we need a fundamentally different way of doing things?
Frederik: Yeah, I think a big part of what makes this a very hard question to answer is where things currently stand and how extremely rapidly they are developing right now. So I think I could embarrass myself and really jumble around if I try to think about the limitations of what we can currently do.
But I think you strike on something when you talk about what kind of intention is behind creating this technology. Maybe more of the limitations that we see are limitations of the capitalistic intent behind creating these voices. So I think we're not even close to exploring the limits of what can be done with voice technologies.
But because of the enormous [00:31:00] amounts of compute that are needed to generate these, like we talked about earlier, they are very much being made to fit some very rigid ideas and less exploratory approaches. Because why would a big tech company be interested in generating speech that is not welcoming, speech that is unpleasant to listen to, or speech that is hard to understand?
We now have speech models that are emotional, and we have speech models where you can adjust different performative aspects of them. So why shouldn't they be able to emulate a plethora of different bodies? Why should they not be able to code-switch or speak with accents, and so on?
The issues that come from this are of course also related to data; it is a data and compute issue: what kind of training data [00:32:00] is this based on, what data do we have to collect, and what would the intention be of collecting this data? And then I think we really reach some of the limitations, limitations that are within the confined, rigid structures of big tech.
Eleanor: Wonderful. This has been an incredible conversation, and as Kerry knows, this is very much within my wheelhouse of core interests. It's been a real dream to talk to you, and having spent ages on the website looking at the project, it's just, yeah, a real dream. So thank you so much for speaking with us.
And I really hope that at some point we can come and meet you both in Copenhagen, if that's where you still are, or please let us know if you're in Cambridge and we'd love to do dinner or whatever. But yeah, thank you so much for coming on the show.
Frederik: Absolutely. Thank you for having us.
Eleanor: This episode was made possible thanks to the generosity of Christina Gaw and the Mercator Foundation. [00:33:00] It was produced by Eleanor Drage and Kerry McInerney and edited by Eleanor Drage.