import React from "react";
import { makeStyles } from "@material-ui/core/styles";
import { Paper, Typography } from "@material-ui/core";
import NavBar from "../components/NavBar";
import NavigationLinks from "../components/NavigationLinks";
import NaturalLanguagesTable from "./NaturalLanguagesTable";
import CodingTaggingTable from "./CodingTaggingTable";
import Table1 from "./Table1";
import Table2 from "./Table2";
import Table3 from "./Table3";

function IntroPage() {
  const classes = useStyles();
  return (
    <Paper className={classes.root} elevation={0}>
      <NavBar />
      <div className={classes.navigation}>
        <NavigationLinks stage='introduction' />
      </div>
      <div className={classes.introContainer}>
        <Typography variant="h5">Multi-ethnic Hong Kong Cantonese Corpus (MeHKCC)</Typography>
        <Typography variant="body1">The Multi-ethnic Hong Kong Cantonese Corpus (MeHKCC) is a corpus of annotated child-directed speech (CDS) and adult-directed speech (ADS) in Hong Kong Cantonese (HKC) spoken by mothers in Hong Kong with different language backgrounds: native speakers of (1) HKC, (2) Putonghua, and (3) South Asian languages. The goal of developing this corpus is to provide a resource for accurate descriptions of the speech sounds, vocabulary, and syntax in the language input received by the young children of these mothers living in Hong Kong.</Typography>

        <Typography variant="h5">Participants</Typography>
        <Typography variant="h6">Mother speakers</Typography>
        <Typography variant="body1">We recruited three groups of mothers who represented different language backgrounds, being native speakers of (1) HKC, (2) Putonghua, and (3) South Asian languages, and who lived in Hong Kong at the time of the study. Participants were recruited through flyers, advertisements on websites, and social media targeted at mothers living in Hong Kong with the specified language backgrounds. For the second and third groups, participants were also recruited via non-governmental organizations (NGOs) in Hong Kong. Written informed consent was obtained before the study. All mothers and/or fathers consented to the release of the audio recordings and accompanying transcripts for the construction of a publicly accessible corpus. Each family was paid HK$800 (i.e., ~US$100) for participating in the study.</Typography>

        <Typography variant="body1">
          All the mothers were interviewed verbally by Cantonese-English-Putonghua trilingual research assistants, using a written questionnaire in either Chinese or English, about their language backgrounds and language use. All three groups of mothers reported themselves to be multilingual, with different levels of proficiency.
        </Typography>


        <NaturalLanguagesTable />

        <Typography variant="body1">
          The first group consisted of 32 local mothers in Hong Kong who had spoken HKC since birth. All of them rated their Cantonese proficiency as native level and reported that Cantonese was the major language in their daily life (e.g., watching news on TV, talking to friends).
        </Typography>

        <Typography variant="body1">The second group was composed of 27 mothers who reported that Putonghua was their strongest language. All except three of them acquired Putonghua before the age of 5. The remaining three reported that they acquired Putonghua after the preschool years and that their first languages were the languages of their home towns (湖南話, 惠東方言, 潮汕). All these mothers spoke Putonghua to their friends, colleagues or family members at home. All except seven had been living in Hong Kong for 5 to 20 years. The remaining seven had been living in Hong Kong for two to four years. Their self-rated Cantonese proficiency ranged from very good to fair.
        </Typography>
        <Typography variant="body1">The third group consisted of 12 South Asian mothers who spoke a South Asian language (Urdu 8/12, Punjabi 2/12, Tamil 2/12) as their first language. Nine of them had been living in Hong Kong for more than 20 years, while the other three had been in Hong Kong for less than 20 years. Self-rated Cantonese proficiency within the group varied substantially, from highly proficient to poor.
        </Typography>
        <Typography variant="h6">Child speakers</Typography>

        <Typography variant="body1">Child participants were typically developing children aged 6 to 59 months at the time of the study. There were a total of 31 boys and 37 girls. Their mothers were their main caregivers. All except three were born full-term. None of the children had any diagnosed developmental disorders.</Typography>

        <Typography variant="h5">Data Collection</Typography>

        <Typography variant="body1">Speech samples were collected between 2019 and 2021 in Hong Kong. Each mother participated in two parts of recording, one for CDS and one for ADS. The CDS of each dyad was collected individually at The University of Hong Kong. Each mother was asked to interact with her child in a mock living room at The Faculty of Education, as she would normally do at home, for 45 minutes to 1 hour. In a few cases in the third group, fathers also came along. They were allowed to stay inside the mock living room but were reminded to wait quietly. Age-appropriate toys and books were provided for interaction. Speech samples produced by the mothers, along with the child’s productions, were collected by a Zoom H5 digital recorder (held in a shoulder bag carried by the mother) via a Sennheiser MKE 2 clip-on microphone clipped at the mother’s collar level.</Typography>
        <Typography variant="body1">ADS was collected on the same day or in a second visit to the University within two weeks. The samples comprised both dialogues and monologues in HKC. Dialogues were face-to-face interviews with the experimenter, consisting of questions and answers about the child’s developmental and social history and daily routine, and the mother’s background, job, or daily routine. Three mothers in the third group could not carry out a dialogue solely in Cantonese; in these cases, the conversation was supplemented with English. Monologues were elicited via four tasks: a film description task, a map description task, a story retelling task and a single-word picture naming task. The monologue tasks provided a common basis for analysis, while the dialogues allowed more room for improvisation.</Typography>
        <Typography variant="h5">Transcription and forced alignment</Typography>
        <Typography variant="body1">First, within each recording, the temporal boundaries of each utterance were demarcated using the software Phon (Hedlund & Rose, 2020) by a team of trained, native Cantonese-speaking student research assistants who were proficient at transcribing Cantonese in Chinese and romanised script. Each utterance was transcribed orthographically using written conventions for Cantonese (in traditional Chinese characters), with word items that lack a common standardised form represented in Jyutping (粵拼) phonetic romanisation. A team of research assistants with extensive experience in transcription conducted a first-pass verification to ensure the accuracy of all the transcriptions. Where needed, novel lexical items were appended to the word- and/or syllable-based reference dictionaries used by SPPAS in Cantonese forced alignment.</Typography>
        <Typography variant="body1">The orthographic annotations were then submitted to SPPAS, an open-source Python-based software package (Bigi, 2015, 2018), which parsed utterances into both words and syllables. For each identified word and syllable, candidate phonetic forms were supplied from the reference dictionaries at this stage. Segmental units and their boundaries were then fitted to each word-level annotation according to the Cantonese acoustic model in Lee et al. (2002), as implemented by SPPAS. Given the set of known candidate phonetic forms for a given transcription, SPPAS identifies the best segmental representation with respect to the phonetic properties of the phones and phonemes contained within the Cantonese acoustic model. While the Cantonese acoustic model performs relatively well in identifying actual phonetic segments such as consonants and vowels produced during audio recordings, it cannot identify Cantonese tonal units in its SPPAS implementation. Using a separate Cantonese reference dictionary, canonical orthographic (Chinese script) and phonetic (Jyutping and IPA) representations of each word and syllable were then assigned to each annotation so that actual phonetic transcriptions from the forced alignment procedure could be compared against their corresponding dictionary citation forms. Finally, the annotations at the utterance, lexical, syllabic, and segmental levels were combined as separate annotation tiers using Praat (Boersma & Weenink, 2020).
          After forced alignment, acoustic data such as temporal onset and offset times and acoustic/spectral formant data were extracted using custom-made scripts in Praat. Finally, data on all phonetic segments, along with their acoustic properties, were combined using scripts in the R statistical computing software (R Core Team, 2020) for later analysis.
        </Typography>
        <Typography variant="h5">Coding and Tagging</Typography>
        <Typography variant="body1">Each transcript normally consisted of 19 tiers in the Praat file:</Typography>

        <CodingTaggingTable />

        <Typography variant="body1">Note that in tiers containing citation phonetic forms in Jyutping or IPA, paired curly braces ({"{ }"}) enclosing two or more items separated by a vertical bar (|) indicate the set of all possible citation forms that correspond to a particular item (word or syllable), e.g.:</Typography>
        <pre style={{ fontFamily: 'sans-serif' }}>

          <Typography variant="body1">{`(Ortho) – (Jyutping) – (CitIPA)`}</Typography>
          <Typography variant="body1">{`蛋 – {daan2|daan6} – taːn`}</Typography>
          <Typography variant="body1">{`牛奶 – {au4.laai5|ngau4.naai5} – {ɐu.laːi|ŋɐu.naːi}`}</Typography>
          <Typography variant="body1">{`呢 – {li1|nei1|nei4|ne1|ni1} – {li|nei|nɛ|ni}`}</Typography>
          <Typography variant="body1">{`其實 – kei4.sat9 – kʰei.sɐt    (word-based analysis)`}</Typography>
          <Typography variant="body1">{`其 – {gei1|kei4} – {kei|kʰei} (syllable-based analysis)`}</Typography>

        </pre>
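        {/*
Note for maintainers (not rendered on the page). The citation-form notation shown above can be read mechanically. The sketch below is only an illustration: parseCitationForms and splitSyllables are hypothetical helper names, not part of the MeHKCC tooling, and the reading of "." as a syllable separator is an assumption drawn from the examples above.

```javascript
// Hypothetical helper: expand one citation-form field such as
// "{daan2|daan6}" or "kei4.sat9" into an array of candidate forms.
function parseCitationForms(field) {
  const text = field.trim();
  // A braced field lists alternative forms separated by "|".
  const braced = text.match(/^\{(.+)\}$/);
  if (!braced) {
    // No braces: the item has a single citation form.
    return [text];
  }
  return braced[1].split("|").map((form) => form.trim());
}

// Hypothetical helper: split one candidate form into syllables,
// assuming "." separates syllables as in "ngau4.naai5".
function splitSyllables(form) {
  return form.split(".");
}

console.log(parseCitationForms("{au4.laai5|ngau4.naai5}"));
console.log(splitSyllables("ngau4.naai5"));
```

A field without braces is returned as a one-element array, so callers can treat single and multiple citation forms uniformly.
        */}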
        <Typography variant="body1">The annotated files were then formatted according to the Codes for the Human Analysis of Transcripts (CHAT; MacWhinney, 2000) for morphological tagging. We followed the major parts of speech and conventions used in the Cantonese Aphasia Bank (Kong & Law, 2015), a corpus of conversational speech produced by Cantonese speakers with aphasia. To ensure anonymity, we replaced child participants’ names with CHD and silenced each whole utterance containing CHD, using the SILENCE command of CLAN. After the morphological tagging, the files were further processed to automatically disambiguate morphemes that have more than one part of speech or meaning.</Typography>
        <Typography variant="h5">Basic Corpus Statistics </Typography>
        <Typography variant="body1">By the end of August 2021, speech samples had been analysed from 29 local mothers, 18 Putonghua-speaking mothers and 12 South Asian mothers. In total, the corpus contained 86 hours and 14 minutes of interaction. The mothers’ production in CDS and ADS consisted of 211,567 word tokens in 44,354 utterances and 166,024 word tokens in 24,084 utterances, respectively. Table 1 summarizes the details of the corpus.</Typography>
        <Table1 />
        <Typography variant="caption">Table 1: Details of the Corpus</Typography>
        <Typography variant="h5">Lexical Characteristics</Typography>
        <Typography variant="body1">We calculated the number of word tokens, types, and type-token ratio of the CDS in our data by using the freq command in CLAN. </Typography>
        <Table2 />
        <Typography variant="caption">Table 2: Lexical characteristics of the Corpus</Typography>
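        {/*
Note for maintainers (not rendered on the page). The figures in Table 2 come from the freq command in CLAN. As a rough illustration of what token count, type count, and type-token ratio mean, the sketch below computes them for a list of already-segmented words. lexicalStats is a hypothetical name, and this is not CLAN's implementation.

```javascript
// Hypothetical helper: count tokens, types, and the type-token ratio (TTR)
// for an array of already-segmented word tokens.
function lexicalStats(tokens) {
  const types = new Set(tokens); // distinct word forms
  const tokenCount = tokens.length;
  const typeCount = types.size;
  // TTR is undefined for empty input, so return null instead of dividing by 0.
  const ttr = tokenCount === 0 ? null : typeCount / tokenCount;
  return { tokenCount, typeCount, ttr };
}

console.log(lexicalStats(["a", "b", "a", "c"]));
```

The sketch treats every distinct string as a separate type; any normalisation of word forms would have to happen before the call.
        */}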
        <Typography variant="h5">Utterance-Level Characteristics</Typography>
        <Typography variant="body1">Mean length of utterance (MLU) is an index of syntactic complexity.</Typography>
        <Table3 />
        <Typography variant="caption">Table 3: MLU of the Corpus</Typography>
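        {/*
Note for maintainers (not rendered on the page). As a minimal illustration of MLU, the sketch below averages the number of units per utterance. meanLengthOfUtterance is a hypothetical name, and the choice of unit (words here) is an assumption; the page does not state which unit Table 3 uses.

```javascript
// Hypothetical helper: mean length of utterance, where each utterance is an
// array of units (assumed here to be words).
function meanLengthOfUtterance(utterances) {
  if (utterances.length === 0) {
    return null; // no utterances, so the mean is undefined
  }
  const totalUnits = utterances.reduce((sum, units) => sum + units.length, 0);
  return totalUnits / utterances.length;
}

console.log(meanLengthOfUtterance([["a", "b"], ["c"], ["d", "e", "f"]]));
```
        */}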
        <Typography variant="h5">Access to the MeHKCC Corpus</Typography>
        <Typography variant="body1">For more information on the annotation and analysis, or access to the wav-files, please contact Carol To at <a href="mailto:tokitsum@hku.hk">tokitsum@hku.hk</a>. We will respond to your enquiry or request as soon as possible.</Typography>
        <Typography variant="h5">Terms of use</Typography>
        <Typography variant="body1">The MeHKCC is to be used strictly for research and/or educational purposes. Users of the corpus should acknowledge that the database is proprietary to its developers. Contents of the corpus are copyrighted by its developers and all rights are reserved. Users shall not duplicate, distribute, sell, commercially exploit, create derivative works from, or otherwise make available the MeHKCC or information contained therein, in any form or medium, to any third party.</Typography>
        <Typography variant="h5">Honors theses</Typography>
        <Typography variant="body1">
          <ul>
            <li>CHAU, H. Y. H. (2020). Comparison of sentence final particles between Cantonese child-directed and adult-directed speech. Honors thesis. The University of Hong Kong, Academic Unit of Human Communication, Development, and Information Sciences.</li>
            <li>CHIM, R. H. Y. (2021). Acoustic properties of Cantonese functional aspect markers and lexical items in child directed speech: A Corpus Study. Honors thesis. The University of Hong Kong, Academic Unit of Human Communication, Development, and Information Sciences.</li>
            <li>CHOW, Vincci W. S. (2021). Comparison of sentence final particles between Cantonese child-directed and adult-directed speech. Honors thesis. The University of Hong Kong, Academic Unit of Human Communication, Development, and Information Sciences.</li>
            <li>LAU, C. C. Y. (2020). Comparison of syllable fusion in Cantonese child-directed speech and adult-directed speech: A corpus study. Honors thesis. The University of Hong Kong, Academic Unit of Human Communication, Development, and Information Sciences.</li>
            <li>LEUNG, Y. S. (2021). Patterns of phonological transfer: Evidence from L1-Putonghua/L2-Cantonese late-bilinguals. Honors thesis. The University of Hong Kong, Academic Unit of Human Communication, Development, and Information Sciences.</li>
          </ul>
        </Typography>
        <Typography variant="h5">Funding & Acknowledgements</Typography>
        <Typography variant="body1">The project was partly funded by the Research and Development Projects 2018-2019 of the Standing Committee on Language Education and Research (SCOLAR), Education Bureau, HKSAR Government (Project No.: EDB(LE)/P&R/EL/175/17) and the Yuen Research Fund from the University of Chicago.</Typography>
        <Typography variant="body1">We thank our post-doctoral fellow Dr. Jonathan Yip for carrying out all the programming for auto-segmentation, forced alignment, and acoustic analyses. We are grateful to the mothers and their children for their participation in our study. Last but not least, we thank Charlene Zhao, Bijou Tsang, Christy Lai, Erika Wong, Niko Wong, Ruby Tin, and Henry Chau for their help with transcription, preparation and analyses of the data, and Kapo Chow for developing the search engine and website.</Typography>

        <Typography variant="h5">References cited</Typography>

        <Typography variant="body1">
          <ul>
            <li>Bigi, Brigitte. (2015). SPPAS - Multi-lingual approaches to the automatic annotation of speech. In "The Phonetician" - International Society of Phonetic Sciences, ISSN 0741-6164, No. 111-112 / 2015-I-II: 54-69.</li>

            <li>Bigi, Brigitte. (2018). SPPAS - The automatic annotation and analysis of speech [Computer Software], ver. 1.9.7. Laboratoire Parole et Langage, Université Aix-Marseille, Aix-en-Provence, France. Retrieved from http://www.sppas.org/.</li>

            <li>Boersma, Paul & Weenink, David. (2020). Praat: doing phonetics by computer [Computer program]. Version 6.1.11. Retrieved from http://www.praat.org/.</li>

            <li>Hedlund, Gregory & Rose, Yvan. (2020). Phon 3.1 [Computer Software]. Retrieved from https://phon.ca/.</li>

            <li>Lee, Tan, Lo, W.K., Ching, P.C., & Meng, Helen. (2002). Spoken language resources for Cantonese speech processing. Speech Communication, vol. 36 (3-4): 327-342.</li>
            <li>MacWhinney, B. (2000). The CHILDES Project: Tools for Analyzing Talk. 3rd Edition. Mahwah, NJ: Lawrence Erlbaum Associates.</li>
            <li>R Core Team. (2020). R: A language and environment for statistical computing [Computer Software], ver. 3.6.3. R Foundation for Statistical Computing, Vienna, Austria. Retrieved from https://www.R-project.org/.</li>
          </ul>
        </Typography>
      </div>
    </Paper>
  );
}

export default IntroPage;

const useStyles = makeStyles((theme) => ({
  root: {
    border: 0,
    margin: 0,
    height: "100vh",
    padding: 0,
    display: "flex",
    flexDirection: "column",
  },
  navigation: {
    padding: theme.spacing(1)
  },
  landingContainer: {
    display: 'flex',
    flexDirection: 'column',
    alignItems: 'center',
    justifyContent: 'center',
    flex: 1
  },
  introContainer: {
    // flex: 2,
    justifyContent: 'flex-start',
    padding: theme.spacing(3),
    "& > p": {
      marginTop: theme.spacing(2)
    },
    "& > h5": {
      marginTop: theme.spacing(2),
      color: theme.palette.primary.main,
      fontWeight: 500
    },
    "& > h6": {
      marginTop: theme.spacing(2),
      color: theme.palette.primary.light
    },
    "& > table": {
      marginTop: theme.spacing(2),
    }
  },
  nextContainer: {
    flex: 1,
    display: 'flex',
    flexDirection: 'row',
    alignItems: 'flex-start',
  },
  icon: {
    width: 20,
    height: 20,
    marginRight: theme.spacing(0.5)
  },
  link: {
    display: 'flex',
    textDecoration: 'none',
    padding: theme.spacing(3)
  },
}));
