Amirreza Ahmadnejad

BSc

Smartphones That Read Minds

Abstract

Speech communication in acoustic environments with more than one speaker can be extremely challenging for hearing-impaired listeners. Assistive hearing devices have seen substantial progress in suppressing background noises that are acoustically different from speech, but they cannot enhance a target speaker without knowing which speaker the listener is conversing with. Recent discoveries about the representation of speech in the human auditory cortex have shown an enhanced representation of the attended speaker relative to unattended sources. These findings have motivated the prospect of a brain-controlled assistive hearing device that constantly monitors the brainwaves of a listener and compares them with the sound sources in the environment to determine the most likely talker that the subject is attending to. The device can then amplify the attended speaker relative to the others to facilitate hearing that speaker in a crowd. This process is termed auditory attention decoding (AAD).
Multiple challenging problems, including nonintrusive methods
for neural data acquisition and optimal decoding methods for
accurate and rapid detection of attentional focus, must be
resolved to realize a brain-controlled assistive hearing device.
In addition, in realistic situations we have only a mixture of sound sources, which can be recorded with one or more microphones. Because the attentional focus of the subject is determined by comparing the brainwaves of the listener with each sound source, a practical AAD system needs to automatically separate the sound sources in the environment to detect the attended source and subsequently amplify it. One solution that has been proposed to address this problem is beamforming; in this process, neural signals are used to steer a beamformer to amplify the sounds arriving from the location of the target speaker. However, this approach requires multiple microphones and can be beneficial only when ample spatial separation exists between the target and interfering speakers. An alternative and possibly complementary method is to leverage the recent success of automatic speech separation algorithms that use deep neural network models. In one such approach, neural networks were trained to separate a pretrained, closed set of speakers from mixed audio. Next, the separated speakers were compared with the neural responses to determine the attended speaker, who was then amplified and added to the mixture. Although this method can help a subject interact with known speakers, such as family members, it is limited in its generalization to new, unseen speakers, making it ineffective if the subject converses with a new person, in addition to the difficulty of scaling up to a large number of speakers.
Objectives
To alleviate this limitation, we propose a causal, speaker-independent automatic speech separation algorithm that can generalize to unseen speakers, meaning that the separation of speakers can be performed without any prior training on the target speakers. Speaker-independent speech separation has been one of the most difficult speech processing problems to solve, and several solutions have been proposed in recent years.
One such approach is the deep attractor network (DAN). DAN performs source separation by projecting the time-frequency (T-F) (spectrogram) representation of a mixed audio signal into a high-dimensional space in which the representations of the speakers become more separable. Compared with alternative speaker-independent approaches, DAN is advantageous in that it performs end-to-end separation, meaning that the entire process of speaker separation is learned together. However, DAN was proposed for noncausal speech separation, meaning that the algorithm requires an entire utterance to perform the separation. In real-time applications, such as in a hearing device, a causal, low-latency algorithm is required to prevent perceivable distortion of the signal. In this study, we address the problem of speaker-independent AAD by proposing a causal, online implementation of DAN [online DAN (ODAN)]. Because this system can generalize to new speakers, it overcomes a major limitation of the previous AAD approach, which required training on the target speakers. The proposed AAD framework enhances the subjective and objective quality of perceiving the attended speaker in a multi-talker (M-T) mixture. By combining recent advances in automatic speech processing and brain-computer interfaces, we can help people with hearing impairment communicate more easily.
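To make the idea of projecting T-F bins into a separable embedding space more concrete, the following is a minimal PyTorch sketch of a DAN-style separation step. The LSTM encoder, the layer sizes, and the use of ideal speaker assignments during training are illustrative assumptions, not the exact ODAN implementation.

import torch
import torch.nn as nn

class DANSketch(nn.Module):
    """Minimal deep-attractor-style separator (illustrative only)."""
    def __init__(self, n_freq=129, emb_dim=20, hidden=300, n_speakers=2):
        super().__init__()
        self.emb_dim, self.n_speakers = emb_dim, n_speakers
        self.rnn = nn.LSTM(n_freq, hidden, num_layers=2, batch_first=True)
        self.proj = nn.Linear(hidden, n_freq * emb_dim)

    def forward(self, mix_mag, assignments):
        # mix_mag:     (batch, time, freq) magnitude spectrogram of the mixture
        # assignments: (batch, time*freq, n_speakers) ideal speaker assignments (training only)
        B, T, F = mix_mag.shape
        h, _ = self.rnn(mix_mag)                             # (B, T, hidden)
        emb = self.proj(h).view(B, T * F, self.emb_dim)      # embed every T-F bin

        # Attractors: centroids of the embeddings that belong to each speaker.
        attractors = torch.bmm(assignments.transpose(1, 2), emb)
        attractors = attractors / (assignments.sum(dim=1, keepdim=True).transpose(1, 2) + 1e-8)

        # Masks from the similarity between each T-F embedding and each attractor.
        sims = torch.bmm(emb, attractors.transpose(1, 2))    # (B, T*F, n_speakers)
        masks = torch.softmax(sims, dim=-1)
        return masks.view(B, T, F, self.n_speakers) * mix_mag.unsqueeze(-1)

At inference time the ideal assignments are not available, so the attractors must be estimated from the embeddings themselves; doing this causally, frame by frame, is exactly the step an online variant such as ODAN has to handle.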
Model
DNN architecture
We used a common deep neural network architecture that consists of two stages: feature extraction and feature summation. In this framework, a high-dimensional representation of the input is first calculated (feature extraction), which is then used to regress the output of the model (feature summation). The feature extraction and feature summation networks are optimized jointly during the training phase. In all models examined, the feature summation step consisted of a two-layer fully connected network with L2 regularization, dropout, batch normalization, and a nonlinearity in each layer.
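As a concrete illustration of this two-stage design, the sketch below pairs a placeholder feature-extraction module with a two-layer fully connected summation stage using batch normalization, a nonlinearity, and dropout. The layer widths, the placement of the output layer, and the weight-decay value are assumptions.

import torch
import torch.nn as nn

class FeatureSummation(nn.Module):
    """Two-layer fully connected summation stage with batch norm, nonlinearity, and dropout."""
    def __init__(self, in_dim, hidden=512, out_dim=128, p_drop=0.3):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.BatchNorm1d(hidden), nn.ReLU(), nn.Dropout(p_drop),
            nn.Linear(hidden, out_dim),   # second layer regresses the output features
        )

    def forward(self, features):
        return self.net(features)

class ReconstructionModel(nn.Module):
    """Feature extraction followed by feature summation; both stages are trained jointly."""
    def __init__(self, extractor, feature_dim, out_dim):
        super().__init__()
        self.extractor = extractor                          # e.g., an FCN, LCN, or CNN module
        self.summation = FeatureSummation(feature_dim, out_dim=out_dim)

    def forward(self, neural_window):
        features = self.extractor(neural_window)            # high-dimensional representation
        return self.summation(features)                     # regressed spectrogram / AEC features

# L2 regularization can be applied through the optimizer's weight decay (value assumed):
# optimizer = torch.optim.Adam(model.parameters(), lr=1e-4, weight_decay=1e-5)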
We study five different architectures for the feature extraction part of the network: the fully connected network (FCN), the locally connected network (LCN), the convolutional neural network (CNN), FCN + CNN, and FCN + LCN. In the combined networks, we concatenated the outputs of two parallel paths, which were then fed into the summation network. For the FCN, the windowed neural responses were flattened and fed to a multilayer FCN. In the LCN and CNN, however, all the extracted features were of the same size as the input, meaning that we did not use flattening, strided convolution, or downsampling prior to the input layer or between two consecutive layers. Instead, the final output of the multilayer LCN or CNN was flattened before being fed into the feature summation network.
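The following sketch illustrates, under assumed layer sizes, how a CNN feature extractor of this kind can keep every intermediate feature map the same size as its input (no striding or downsampling) and flatten only the final output before the summation network.

import torch
import torch.nn as nn

class CNNExtractor(nn.Module):
    """Multilayer CNN feature extractor whose feature maps keep the input size;
    only the final output is flattened before the summation network."""
    def __init__(self, in_channels=1, channels=16, n_layers=3):
        super().__init__()
        layers = []
        for i in range(n_layers):
            layers += [
                nn.Conv2d(in_channels if i == 0 else channels, channels,
                          kernel_size=3, stride=1, padding=1),   # same-size maps, no downsampling
                nn.BatchNorm2d(channels),
                nn.ReLU(),
            ]
        self.conv = nn.Sequential(*layers)

    def forward(self, x):
        # x: (batch, 1, electrodes, time lags) windowed neural responses
        return self.conv(x).flatten(start_dim=1)                 # flatten only at the very end

# An FCN path would instead flatten the window first and apply fully connected layers;
# a combined FCN + CNN model concatenates both outputs before feature summation:
# combined = torch.cat([fcn_features, cnn_features], dim=1)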
The optimal network structure was found separately for the auditory spectrogram and the vocoder parameters using an ablation study. For auditory spectrogram reconstruction, we directly regressed the 128 frequency bands using a multilayer FCN model for feature extraction. This architecture, however, was not feasible for reconstructing the vocoder parameters because of their high dimensionality and statistical variability. To remedy this, we used a deep autoencoder (AEC) network to find a compact representation of the 516-dimensional vocoder parameters (consisting of 513 spectral envelope values, pitch, voiced-unvoiced, and band periodicity). We confirmed that decoding the AEC features performed significantly better than decoding the vocoder parameters directly. The AEC itself was a multilayer FCN in which the number of nodes changed in descending order (encoder) and then in ascending order (decoder). The bottleneck layer of such a network (the output of the encoder part of the pretrained AEC) can be used as a low-dimensional reconstruction target for the neural network model, from which the vocoder parameters can be estimated using the decoder part of the AEC. We chose 256 nodes for the bottleneck layer because this maximized both the objective reconstruction accuracy and the subjective assessment of the reconstructed sound.
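The sketch below illustrates the vocoder autoencoder described above: a symmetric encoder and decoder around a 256-dimensional bottleneck that serves as the low-dimensional reconstruction target. The intermediate layer width and the activation choices are assumptions.

import torch
import torch.nn as nn

class VocoderAEC(nn.Module):
    """Autoencoder for the 516-dimensional vocoder parameters with a 256-dimensional bottleneck."""
    def __init__(self, in_dim=516, hidden=400, bottleneck=256):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, bottleneck),        # bottleneck = reconstruction target for decoding
        )
        self.decoder = nn.Sequential(
            nn.Linear(bottleneck, hidden), nn.ReLU(),
            nn.Linear(hidden, in_dim),            # back to the full vocoder parameters
        )

    def forward(self, vocoder_params):
        z = self.encoder(vocoder_params)          # 256-dimensional compact representation
        return self.decoder(z), z

# Usage: the neural-network model regresses z from brain responses, and the decoder of the
# pretrained AEC maps the estimate back to vocoder parameters: pretrained_aec.decoder(z_hat)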
Conclusion
We hope to use non-invasive technologies attached to the skin, together with a new kind of microphone that is remarkably good at separating different voices. We are creating a smarter microphone that can separate the sound sources in its environment; our device does this better than anything that exists today and can do it in real time during a conversation. It uses a form of artificial intelligence called neural network models, loosely based on the neurons in the brain. The next generation of hearing aids that we are trying to create will monitor the brain of a person to decide who that person is talking to and amplify the sound source that is most similar to the person's brainwaves. Our hope is that within the next five years this will become a real product. The more we learn about the brain, the better our machines become, whether it is helping your phone understand you or helping those with hearing impairments understand their friends at a party.
About Us
We are two old friends who went to elementary and middle school together and now study in two different fields: one of us studies electrical engineering (telecommunications) and the other studies computer engineering (software). We were born in Shahrekord, Iran. At university, we found that both of us had always been thinking about the future: ways to create new things that can improve people's lives, becoming successful, and, more generally, finding our own paths in life.
About a year ago, we started discussing the idea of a new type of headphone for our smartphones. Eventually, we came to the idea described in this work.



