Here are our top picks for Mandarin Chinese Language datasets:

1. AISHELL-1 Dataset

AISHELL-1 is a corpus for speech recognition research and for building speech recognition systems for Mandarin.

2. AISHELL-3

AISHELL-3 is a large-scale, high-fidelity multi-speaker Mandarin speech corpus used to train multi-speaker Text-to-Speech (TTS) systems. The corpus contains roughly 85 hours of emotion-neutral recordings spoken by 218 native Mandarin Chinese speakers, for a total of 88,035 utterances. Speaker attributes such as gender, age group, and native accent are explicitly marked and provided in the corpus. Transcripts at both the Chinese character level and the pinyin level are provided along with the recordings. The word and tone transcription accuracy is above 98%, achieved through professional speech annotation and strict quality inspection for tone and prosody.

3. WenetSpeech

WenetSpeech is a multi-domain Mandarin corpus consisting of 10,000+ hours of high-quality labeled speech, 2,400+ hours of weakly labeled speech, and about 10,000 hours of unlabeled speech, with 22,400+ hours in total. The authors collected the data from YouTube and podcasts, covering a variety of speaking styles, scenarios, domains, topics, and noisy conditions. An optical character recognition (OCR) based method is used to generate audio/text segmentation candidates for the YouTube data from the corresponding video captions.

4. MAGICDATA Mandarin Chinese Read Speech Corpus

The corpus, by Magic Data Technology Co., Ltd., contains 755 hours of scripted read speech from 1080 native speakers of Mandarin Chinese as spoken in mainland China. The sentence transcription accuracy is higher than 98%. Recordings were made in a quiet indoor environment. The recording texts cover diverse domains, including interactive Q&A, music search, SNS messages, home command and control, etc.

5. LACTIC

LACTIC is a fully open-source, annotated non-native speech database for Chinese. Its application areas include automatic speech scoring, evaluation, and derivation, as well as L2 teaching and the teaching of Chinese as a foreign language. The dataset aims to provide a relatively small-scale and highly efficient deviation dataset for training. To this end, four non-native Chinese speakers were chosen to participate in the project; their mother tongues (L1s) are Russian, Korean, French, and Arabic. Each speaker contributed a 1-hour test audio file of valid recordings, giving 4 hours of material in total.

6. CMLR

The CMLR dataset was collected by the Visual Intelligence and Pattern Analysis (VIPA) group of Zhejiang University. It was designed to facilitate research on visual speech recognition, sometimes also referred to as automatic lip-reading. The dataset consists of 102,072 spoken sentences from 11 speakers, recorded between June 2009 and June 2018 from the national news program “News Broadcast”. Each sentence is up to 29 Chinese characters long and contains no English letters, Arabic numerals, or rare punctuation.

To conclude, here are our top picks for the best Mandarin Chinese speech datasets for your projects: AISHELL-1, AISHELL-3, WenetSpeech, the MAGICDATA Mandarin Chinese Read Speech Corpus, LACTIC, and CMLR.

We hope that this list has either helped you find a dataset for your project or helped you realize the myriad of options available to you.