Saint Petersburg Electrotechnical University
"LETI" named after V. I. Ulyanov (Lenin)
(SPbETU "LETI")

Field of study: 27.04.04 – Control in Technical Systems

Master's program: Automation and Mechatronics

Faculty: FEA

Department: SAU

Admitted to defense

Head of Department: Sheludko V. N.

MASTER'S FINAL QUALIFICATION WORK

Topic: "Voice Mimic Using Deep Learning"


Student: Alhasan Ali (signature)

Supervisor: Filatova E. S., Cand. Sc. (Eng.), Associate Professor (academic degree, academic title; signature)

Consultants:

Medynskaya I. V., Dr. Sc. (Econ.), Professor (academic degree, academic title; signature)

Stotskaya A. D., Cand. Sc. (Eng.), Associate Professor (academic degree, academic title; signature)

Devyatkin A. V. (academic degree, academic title; signature)

ASSIGNMENT

for the final qualification work

Approved by

Head of the SAU Department

____________ Sheludko V. N.

"___" ________ 2023

Student: Alhasan A.

Group: 7490

Thesis topic: Voice Mimic Using Deep Learning

Place of work: SPbETU "LETI"

Initial data: a study of sound reproduction systems, processing of the data used for training, and generation of reproduced audio for a non-native speaker.

The potential of deep learning methods for voice imitation is examined in this master's thesis. The focus is on training models that can accurately imitate target speech using convolutional neural networks (CNNs), recurrent neural networks (RNNs), and generative adversarial networks (GANs). The results are compared with the state of the art in audio modeling in order to evaluate the performance of the proposed models on both objective and subjective scales.

Additional section: preparation of a business plan for marketing the results of the master's research.

Date of assignment issue: "___" ______________ 20___

Date of thesis submission for defense: "___" ______________ 20___

Student: Alhasan A. (signature)

Supervisor: Filatova E. S., Cand. Sc. (Eng.) (academic degree, academic title; signature)

CALENDAR PLAN

for completing the final qualification work

Approved by

Head of the SAU Department

____________ Sheludko V. N.

"___" ________ 2023

Student: Alhasan A.

Group: 7490

Thesis topic: Voice Mimic Using Deep Learning

No.  Task                                                            Period
1    Approaches to building a simulation algorithm                   01.01 – 10.02
2    Processing the audio signal                                     11.02 – 25.02
3    Building the autoencoder using the Python programming language  26.02 – 08.03
4    Voice synthesis using a manually prepared dataset               09.03 – 13.03
5    Training the model and understanding the results                14.03 – 23.03
6    Cloning and results analysis                                    24.03 – 23.04
7    Drafting a business plan                                        24.04 – 30.04
8    Summarizing the work and preparing the thesis                   01.05 – 10.05
9    Presentation preparation                                        11.05 – 15.05

Student: Alhasan A. (signature)

Supervisor: Filatova E. S., Cand. Sc. (Eng.) (academic degree, academic title; signature)

ABSTRACT
The explanatory note contains 105 pages, 8 chapters, 41 figures, 4 tables, and 12 references.
INTRODUCTION TO DEEP LEARNING, CONVOLUTIONAL NEURAL NETWORK, IMPROVING DEEP LEARNING NETWORKS, VOICE PRE-PROCESSING.

Topic: Voice Mimic Using Deep Learning

The subject of the thesis is the study and implementation of deep learning to achieve accurate voice cloning. It also discusses problems that may arise along the way, such as determining the accuracy of the cloning process and issues related to the input data.

The aim of the work is to implement a simple voice cloning method based on deep learning models and the concept of end-to-end hybrid models.


SUMMARY
This thesis seeks to make a contribution to the fields of voice imitation and deep learning by addressing these research problems and objectives. It also seeks to offer guidance for the creation of more efficient and realistic-sounding voice mimicry systems.

The following are the goals of this thesis:

  • Conduct a thorough analysis of the literature on voice imitation and speech processing methods based on deep learning.

  • Present a vocal mimicking method based on deep learning that makes use of CNNs, RNNs, and GANs.

  • Utilize both quantitative and subjective indicators to assess the effectiveness of the suggested technique.

  • Compare the accuracy and authenticity of the generated speech using the suggested approach to those achieved using the voice-mimicking techniques already in use.

  • Describe the drawbacks and difficulties of the suggested approach and make recommendations for future research.




Contents




INTRODUCTION
1. INTRODUCTION TO DEEP LEARNING
1.1 Explaining neural network:
1.2 Explaining the functionality of Neural Networks:
1.3 Types of neural networks:
1.3.1 Convolutional Neural networks (CNN):
1.3.2 Convolution Layer:
1.3.3 Pooling Layer:
1.3.4 Non-Linearity Layers:
1.3.5 Filters:
2. DESIGNING A CONVOLUTIONAL NEURAL NETWORK
2.1 Recurrent neural networks (RNNs):
2.2 Understanding Long Short-Term Memory (LSTM):
2.3 Describing LSTM mathematically:
2.4 Recurrent neural networks functionality in voice cloning applications:
2.5 Gated Activation Units:
2.6 Conditional WaveNet:
3. IMPROVING DEEP LEARNING NETWORKS
3.1 Structure of Deep Learning:
3.2 Natural Language Processing using Voice Data:
4. VOICE PRE-PROCESSING USING DEEP LEARNING
4.1 Voice pre-processing:
4.2 Voice preprocessing implementation and logic:
4.3 Fast Fourier Transform (FFT):
4.4 Short-Time Fourier Transform (STFT):
4.5 Mel frequency spectrum (MFCC):
4.6 Using a different library:
4.7 Building the autoencoder:
4.7.1 Building the encoder:
4.7.2 Building the decoder:
4.7.3 Building the autoencoder:
4.7.4 The training process:
4.7.5 Results of the training process:
5. IMPLEMENTATION
5.1 Explaining text to speech models (TTS):
5.2 DALL-E and its functionality:
5.3 Denoising Diffusion Probabilistic Models (DDPMs) and their functionality:
5.4 Tacotron and Tortoise:
5.5 Tortoise-TTS:
5.6 Implementing Tortoise-TTS:
5.6.1 Theoretical Background:
5.6.2 Fine-Tuning Tortoise-TTS Model:
5.6.3 Creating Audio Samples:
5.6.4 Installing Tortoise-TTS:
6. RESULTS
6.1 Comparison and validation of voice cloning output data:
7. DRAFTING A BUSINESS PLAN
7.1 Executive Summary:
7.2 Object and Company Description:
7.3 Market Analysis (Russia, Saint Petersburg):
7.4 Market Size and Growth Potential:
7.5 Competitive Landscape:
7.6 Target Market Segments:
7.7 Marketing and Distribution Channels:
7.8 Regulatory and Cultural Considerations:
7.9 Market Entry Challenges:
7.10 Economic Result Estimation:
7.11 Financial and Funding Plan:
7.12 Analysis of Pricing:
7.13 Possible Project Risks:
7.14 Explaining the Final Product:
8. CONCLUSION
REFERENCES

INTRODUCTION


Voice mimicry, or the capacity to impersonate another person's voice, has a wide range of uses in a variety of industries, including entertainment, voiceover, and language acquisition. Traditional methods of voice impersonation relied on manual techniques like equalization, time stretching, and pitch shifting, which were frequently ineffective and demanded a high level of skill to generate convincing results. However, with recent developments in deep learning, voice imitation can be accomplished using artificial neural networks that can learn the features of a target voice and produce speech that sounds similar to that voice.

This master's thesis examines the potential of deep learning methods for voice imitation. It focuses on training models that can accurately imitate a target voice using convolutional neural networks (CNNs), recurrent neural networks (RNNs), and generative adversarial networks (GANs). The outcomes are compared with the state of the art in voice imitation, and the performance of the suggested models is assessed using both objective and subjective metrics.

This thesis addresses the research questions: Can deep learning models successfully pick up on a target voice's qualities for voice mimicry? Which deep learning techniques are most effective for voice imitation? How accurate and realistic is the generated speech using the suggested method compared to other voice-mimicking techniques?

  1. INTRODUCTION TO DEEP LEARNING




Deep learning is a subset of machine learning that has revolutionized the field of artificial intelligence. It is based on artificial neural networks, which are inspired by the structure and function of the human brain. Neural networks consist of layers of interconnected nodes or neurons, which work together to extract and learn features from the data.
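
To make the idea of layers of interconnected nodes more concrete, the following is a minimal sketch of a forward pass through a tiny two-layer network, written in plain Python. The layer sizes (3 inputs, 4 hidden nodes, 1 output), the random weight initialization, the choice of ReLU and sigmoid activations, and all function names are assumptions chosen for illustration only; they are not the architecture used later in this thesis, and no training is performed here.

```python
import math
import random


def relu(value):
    """Rectified linear unit: passes positive values, zeroes out negatives."""
    return max(0.0, value)


def sigmoid(value):
    """Squashes any real number into the open interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-value))


def dense(inputs, weights, biases, activation):
    """One fully connected layer: each node computes activation(w . x + b)."""
    outputs = []
    for node_weights, bias in zip(weights, biases):
        total = sum(w * x for w, x in zip(node_weights, inputs)) + bias
        outputs.append(activation(total))
    return outputs


def init_layer(n_inputs, n_nodes, rng):
    """Random weights in [-1, 1] and zero biases (illustrative choice only)."""
    weights = [[rng.uniform(-1.0, 1.0) for _ in range(n_inputs)]
               for _ in range(n_nodes)]
    biases = [0.0] * n_nodes
    return weights, biases


def forward(sample, layers):
    """Feed a sample through the layers in order; each output feeds the next."""
    activations = sample
    for weights, biases, activation in layers:
        activations = dense(activations, weights, biases, activation)
    return activations


rng = random.Random(0)  # fixed seed so repeated runs give the same weights
hidden_w, hidden_b = init_layer(3, 4, rng)  # hypothetical sizes: 3 -> 4
output_w, output_b = init_layer(4, 1, rng)  # hypothetical sizes: 4 -> 1
network = [(hidden_w, hidden_b, relu), (output_w, output_b, sigmoid)]

prediction = forward([0.5, -0.2, 0.1], network)
print(prediction)
```

In this sketch the hidden layer turns the raw inputs into intermediate features, and the output layer combines those features into a single value between 0 and 1. Learning would consist of adjusting the weights and biases so that this output matches the desired target, which the sketch deliberately leaves out.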