
The Study on Web-based English-Chinese Term Translation Pattern

Date: 2019-01-30  Source: 东星资源网

  Abstract. This paper focuses on a Web-based English-Chinese Out-of-Vocabulary (OOV) term translation pattern, with particular emphasis on a selection strategy that uses a multi-feature representation to evaluate translation candidates. Three kinds of features (local, global and Boolean) are extracted from translation candidates and combined through a multi-feature fusion strategy. Experiments on the CoNLL 2003 corpus for the English Named Entity Recognition (NER) task, a standard data source, show promising results. With the established multi-feature representation mechanism, the English-Chinese OOV term translation model can better “filter” for the most probable translation candidate.
   Key words: term translation; multi-feature representation; local feature
  
  1.Introduction
  In real applications of Cross-Language Information Retrieval (CLIR), most user queries are composed of short terms, many of which are Out-of-Vocabulary (OOV) terms such as Named Entities (NEs), new terms and terminology. The translation quality of OOV terms directly influences the precision of retrieving relevant multilingual information, so OOV term translation has become an important and challenging issue in CLIR. Recently, most research on OOV term translation has concentrated on mining useful information from the Web [1]. However, during translation acquisition, the main problem remains how to select the correct translation from a large amount of returned Web information and how to locate the required translation resources rapidly. The quality of translation candidate selection therefore directly determines how conveniently translation resources can be acquired, and how applicable and satisfactory the resulting OOV term translation is. Finding an effective feature representation for translation candidates is the core of Web-based OOV term translation.
  
  2.Related work and our approach
  In recent years, many researchers have used Web search engines to find translation candidates for OOV terms on webpages [2, 3]. Al-Onaizan and Knight used Web statistics to validate translation candidates generated by a language model, and obtained an accuracy of 72.6% in Arabic-English OOV term translation [4]. Lu et al. used statistics about anchor texts in Web search results to recognize translation candidates, and obtained an accuracy of 63.6% in English-Chinese title query term translation [5]. Zhang and Vines extracted translation candidates for OOV query terms in CLIR by using co-occurrence statistics and distance information of bilingual word pairs from the Web, and improved the performance of English-Chinese/Chinese-English CLIR to some extent [6]. Zhang et al. searched for translation candidates by using cross-language query expansion and the Web, used vocabulary information, word frequency and distance to select correct translations for OOV terms, and obtained a Top-1 accuracy of 81.0% in Chinese-English OOV term translation [7]. Chen and Chen combined Web statistics with vocabulary information, took inverse search as a validation process, and achieved a Top-1 accuracy of 87.6% in Chinese-English OOV term translation [8]. Jiang et al. combined Web mining, transliteration and ranking based on Maximum Entropy (ME) to improve NE translation performance, but focused only on English-Chinese person name translation and obtained a Top-1 accuracy of 47.5% [9].
  Although the above methods can improve the performance of OOV term translation to some extent, OOV term translation based on Web search results still has a common problem: the feature information used to evaluate translation candidates is neither sufficient nor comprehensive. Most OOV term translation methods evaluate translation candidates by mining simple local numerical features and Boolean features, that is, features inherent in the translation candidates and features of their surrounding context. However, in real Web documents, an OOV term usually occurs several times in different documents. If only a single Web document in which the OOV term appears is explored, the global information contained in the Web document set is ignored, and the inconsistency and polysemy of translation candidates cannot be considered. As a result, the selected translation candidates are very limited, and some genuine and feasible translations of the OOV term may be omitted. Using only simple local numerical features and Boolean features may thus cause a serious data sparseness problem. This problem can also be mitigated by using only global numerical features and Boolean features from all the returned Web documents, but then some important information about the composition of the OOV term itself is lost. Therefore, the simple numerical and Boolean features should be expanded, and local, global and Boolean features should be fused so that each compensates for the weaknesses of the others.
  Based on this investigation of the existing methods and their problems, we establish an English-Chinese OOV term translation pattern based on Web mining. Translation candidates are chosen by a multi-feature fusion strategy. The representation forms of the local, global and Boolean features are constructed in consideration of the complex characteristics of English/Chinese OOV terms and the particular way information is expressed in Web documents. In this way, all kinds of effective features are integrated as far as possible, each feature can play its own role, and the relevance between the two terms of a translation pair can be evaluated more accurately.
  
  3.Multi-feature representation
  In the English-Chinese OOV term translation model, translation candidates are selected and confirmed through a multi-feature fusion strategy that uses various contextual features from Web documents. Three kinds of features are used to evaluate the reliability of a translation candidate: Local Features (LFs), Global Features (GFs) and Boolean Features (BFs). LFs are constructed from the token itself and its neighboring tokens; GFs are extracted from other occurrences of the same or a similar token in the Web document set; BFs are heuristic rules designed for particular relationships between the original OOV term and the candidate terms in Web documents.
  3.1.Local feature representation
  Two types of contextual information are considered when extracting local features, namely internal lexical information and external contextual information. Candidate local features are extracted from the training corpus with feature templates. A feature template is a pattern for extracting features.
  (1) Term Length (Len): Considers the length of the translation candidate.
  (2) Phonetic Value (PV): Investigates the phonetic similarity between the OOV term and its translation candidate. To associate the character string of an English term with the Pinyin expression of a Chinese term, a mapping mechanism needs to be constructed for term syllabification. A syllable of a Chinese term corresponds to the Pinyin string of one character in the term, while a syllable of an English term is a pronunciation unit formed by several letters of the term and often has a corresponding Chinese Pinyin string. Because associated syllabified representations can often be found between Chinese and English syllables with little ambiguity, syllabification is an effective channel for expressing English/Chinese phonetic features. The PV feature measures the edit-distance similarity between the syllabification sequences of the OOV term and its translation candidate, with appropriate processing applied according to specific linguistic rules.
  (3) Length Ratio of OOV term and its translation candidate (LR): Explores how plausible it is, in terms of composition, that the extracted key term is the translation of the OOV term. An OOV term and its translation usually have similar lengths. A Chinese term is therefore first segmented into meaningful pieces, and its length is the number of pieces. For an English term, the length is the number of words. If the term is a single word composed of capital letters, it is treated as an abbreviation and its length is defined as the number of capital letters. In this way the OOV term and its translation candidate are segmented into meaningful fragments, and every piece can be mapped to a word in the other language. For a good candidate, the value of LR should be as close to 1 as possible.
  (4) Phonetic and Semantic Integration Feature (P&S_IF): Considers the phonetic information and the word sense of the OOV term and its translation candidate jointly. This feature is set up for multi-word OOV terms, in which each constituent part can be translated by using either phonetic information or word sense.
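  The PV and LR features described above can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions, not the implementation used in the paper: the syllabification step is assumed to have been done already (the syllable lists are supplied by the caller), the function names `phonetic_value`, `english_length` and `length_ratio` are hypothetical, PV is normalized as one minus the edit distance divided by the longer sequence length, and LR is taken as the shorter length over the longer one so that a value of 1 means equal lengths.

```python
from typing import List, Sequence


def edit_distance(a: Sequence[str], b: Sequence[str]) -> int:
    """Levenshtein distance between two syllable sequences."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            cost = 0 if x == y else 1
            curr.append(min(prev[j] + 1,          # delete x
                            curr[j - 1] + 1,      # insert y
                            prev[j - 1] + cost))  # substitute x -> y
        prev = curr
    return prev[-1]


def phonetic_value(src_syllables: Sequence[str], cand_pinyin: Sequence[str]) -> float:
    """Edit-distance similarity in [0, 1]; the normalization is an assumption."""
    longest = max(len(src_syllables), len(cand_pinyin))
    if longest == 0:
        return 1.0
    return 1.0 - edit_distance(src_syllables, cand_pinyin) / longest


def english_length(term: str) -> int:
    """Number of words, or number of capital letters for a one-word abbreviation."""
    words = term.split()
    if len(words) == 1 and words[0].isalpha() and words[0].isupper():
        return len(words[0])
    return len(words)


def length_ratio(english_term: str, chinese_pieces: List[str]) -> float:
    """LR as shorter length / longer length (assumed form); 1.0 means equal lengths."""
    en_len = english_length(english_term)
    zh_len = len(chinese_pieces)
    if en_len == 0 or zh_len == 0:
        return 0.0
    return min(en_len, zh_len) / max(en_len, zh_len)
```

  For example, `length_ratio("WTO", ["世界", "贸易", "组织"])` treats "WTO" as a three-letter abbreviation and compares it with three Chinese pieces, while `phonetic_value(["ke", "lin", "ton"], ["ke", "lin", "dun"])` penalizes the single mismatched syllable. The Chinese segmentation itself is left to an external segmenter.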
  3.2.Global feature representation
  A common case in Web-based OOV term translation is that a translation candidate mentioned early in a Web document occurs again later in the same or a similar form. When determining the final translation, contextual information from the same Web document and from other documents may play a very important role. To make full use of global information and to solve the problem of evaluating candidates with the same or similar forms, global features are constructed from several characteristics of Web documents.
  (1) Global Term Frequency (G_Freq): Uses the frequency with which the OOV term and its translation candidates appear in the Web document set. It includes four parameters: FreqSOOV, TFTOOV, DFTOOV and CO_Freq. FreqSOOV denotes the frequency of the OOV term SOOV in all the webpage snippets of the search results. TFTOOV is the number of occurrences of the translation candidate TOOV in all the snippets. DFTOOV is the number of snippets that contain TOOV. CO_Freq is the number of snippets that contain both SOOV and TOOV, namely the co-occurrence frequency.
  (2) Chi-Square (χ²) Feature Value (CV): Evaluates the semantic similarity between the OOV term and its translation candidates by using their occurrence information in the returned Web documents.
  (3) Co-occurrence Distance (CO_Dist): Investigates the distance between the OOV term and its candidate across the whole set of Web documents. It is the average distance between the OOV term and its candidate. For a translation pair this distance is often small, because some translation pairs appear as annotations, accompanied by brackets and other annotation words.
  (4) Rank Value (RV): Considers the rank information of translation candidates in the Web document set. This feature includes five parameters: Top_Rank, Average_Rank, Simple_Rank, R_Rank and DF_Rank. Top_Rank (T_Rank) is the rank of the first snippet that contains the candidate TOOV, where the rank is the one given by the search engine. Average_Rank (A_Rank) is the average position of TOOV in the snippets of the search results. Simple_Rank (S_Rank) is computed as S_Rank(TOOV)=TFTOOV(TOOV)*Len(TOOV), which investigates the impact of the frequency and length of TOOV on ranking. R_Rank is used as a basis for comparison. DF_Rank (D_Rank) is similar to S_Rank and is computed as D_Rank(TOOV)=DFTOOV(TOOV)*Len(TOOV).
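  The global counts and the two rank formulas above can be illustrated with a short Python sketch. Several points are assumptions rather than details given in the paper: snippets are plain strings in search-engine order, matching is exact and case-sensitive, Len(TOOV) is taken as the number of characters of the candidate, and the CV feature is computed with the standard chi-square statistic for a 2x2 contingency table over snippets, which may differ from the exact formula the authors used. All function names are hypothetical.

```python
from typing import Dict, List, Optional


def global_frequencies(snippets: List[str], s_oov: str, t_oov: str) -> Dict[str, int]:
    """FreqSOOV, TFTOOV, DFTOOV and CO_Freq over a list of snippets."""
    return {
        "FreqSOOV": sum(s.count(s_oov) for s in snippets),
        "TFTOOV": sum(s.count(t_oov) for s in snippets),
        "DFTOOV": sum(1 for s in snippets if t_oov in s),
        "CO_Freq": sum(1 for s in snippets if s_oov in s and t_oov in s),
    }


def chi_square(snippets: List[str], s_oov: str, t_oov: str) -> float:
    """Standard 2x2 chi-square statistic (assumed form of the CV feature)."""
    a = sum(1 for s in snippets if s_oov in s and t_oov in s)          # both terms
    b = sum(1 for s in snippets if s_oov in s and t_oov not in s)      # only SOOV
    c = sum(1 for s in snippets if s_oov not in s and t_oov in s)      # only TOOV
    d = sum(1 for s in snippets if s_oov not in s and t_oov not in s)  # neither
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        return 0.0
    return n * (a * d - b * c) ** 2 / denom


def top_rank(snippets: List[str], t_oov: str) -> Optional[int]:
    """1-based rank of the first snippet containing TOOV, or None if absent."""
    for rank, snippet in enumerate(snippets, start=1):
        if t_oov in snippet:
            return rank
    return None


def simple_rank(tf_toov: int, t_oov: str) -> int:
    """S_Rank(TOOV) = TFTOOV(TOOV) * Len(TOOV), with Len as character count."""
    return tf_toov * len(t_oov)


def df_rank(df_toov: int, t_oov: str) -> int:
    """D_Rank(TOOV) = DFTOOV(TOOV) * Len(TOOV), with Len as character count."""
    return df_toov * len(t_oov)


if __name__ == "__main__":
    demo = ["克林顿 (Clinton) 是美国前总统", "Clinton 访问北京",
            "克林顿 发表讲话，克林顿 表示", "天气预报"]
    freqs = global_frequencies(demo, "Clinton", "克林顿")
    print(freqs, simple_rank(freqs["TFTOOV"], "克林顿"))
```

  Counting over snippets rather than full pages follows the description of FreqSOOV, TFTOOV, DFTOOV and CO_Freq; a real system would also need to normalize case and full-width characters before matching.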
  3.3.Boolean feature representation
  Boolean features are mainly used to capture the occurrence forms in which translation candidates are most likely to appear in Web documents. Each is a binary feature in the form of a heuristic rule. Such a feature can be created by considering the neighboring terms of the candidates within a certain distance of the OOV term, or by combining the translation candidates and their neighboring terms into co-occurrence features.
  (1) Position Distance with OOV Term (PD_SOOV): If TOOV occurs close to SOOV (e.g., within 10 characters), this feature is labeled “1”, otherwise “-1”.
  (2) Neighbor Relation with OOV Term (NR_SOOV): If TOOV occurs immediately before or after SOOV, this feature is labeled “1”, otherwise “-1”.
  (3) Bracket Neighbor Relation with OOV Term (BNR_SOOV): If TOOV is located immediately before or after SOOV and occurs in the form “TOOV (SOOV)” or “SOOV (TOOV)”, this feature is labeled “1”, otherwise “-1”.
  (4) Special Mark Term (SMW): This is an intuitive feature. If, within a certain co-occurrence distance (usually less than 10 characters) between the OOV term and its candidate, there is a Chinese cue term meaning “Full Name”, “be named as … in Chinese”, “or be called as …”, “be also called as …”, etc., or if, within 5 characters, there is punctuation such as “( )”, “[ ]” and “（ ）”, then this feature is labeled “1”, otherwise “-1”.
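  The first three Boolean rules above can be written down directly as small predicates. The following Python sketch is an illustration under assumptions that the paper does not spell out: distance is measured as the number of characters between the two terms, “immediately before or after” is read as having only whitespace or brackets between the terms, and both ASCII and full-width round brackets are accepted for the bracket form. The function names are hypothetical, and the SMW rule is omitted because its cue terms are not available in the text.

```python
import re
from typing import Iterator, List, Tuple

# Characters ignored when deciding whether two terms are neighbors (assumption).
_IGNORABLE = " \t\n()（）[]"


def _spans(text: str, term: str) -> List[Tuple[int, int]]:
    """Start/end offsets of every occurrence of term in text."""
    return [(m.start(), m.end()) for m in re.finditer(re.escape(term), text)]


def _between(text: str, s_oov: str, t_oov: str) -> Iterator[str]:
    """Text separating each non-overlapping (SOOV, TOOV) occurrence pair."""
    for s_start, s_end in _spans(text, s_oov):
        for t_start, t_end in _spans(text, t_oov):
            if t_start >= s_end:
                yield text[s_end:t_start]
            elif s_start >= t_end:
                yield text[t_end:s_start]


def pd_soov(text: str, s_oov: str, t_oov: str, window: int = 10) -> int:
    """1 if TOOV occurs within `window` characters of SOOV, else -1."""
    return 1 if any(len(gap) <= window for gap in _between(text, s_oov, t_oov)) else -1


def nr_soov(text: str, s_oov: str, t_oov: str) -> int:
    """1 if TOOV is directly before or after SOOV, else -1."""
    gaps = _between(text, s_oov, t_oov)
    return 1 if any(gap.strip(_IGNORABLE) == "" for gap in gaps) else -1


def bnr_soov(text: str, s_oov: str, t_oov: str) -> int:
    """1 if the pair appears as 'TOOV (SOOV)' or 'SOOV (TOOV)', else -1."""
    s, t = re.escape(s_oov), re.escape(t_oov)
    pattern = (rf"{t}\s*[(（]\s*{s}\s*[)）]"
               rf"|{s}\s*[(（]\s*{t}\s*[)）]")
    return 1 if re.search(pattern, text) else -1


if __name__ == "__main__":
    snippet = "克林顿 (Clinton) 访问北京"
    print(pd_soov(snippet, "Clinton", "克林顿"),
          nr_soov(snippet, "Clinton", "克林顿"),
          bnr_soov(snippet, "Clinton", "克林顿"))
```

  Returning 1 and -1 rather than True and False mirrors the labels used in the feature definitions, so the values can be fed into a feature vector without further conversion.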
  
  4.Experiment and analysis
  To evaluate the performance of English-Chinese OOV term translation, the English corpus provided by the NER task in CoNLL 2003 is used. As representatives of OOV terms, 2,741 NEs are selected from this corpus, including 1,006 Person Names (PRNs), 1,035 Location Names (LCNs) and 700 Organization Names (OGNs). The translation performance of the proposed model can thus be examined for NEs of different categories.
  To investigate the effect of the various features on translation acquisition, the overall translation performance is assessed with each kind of feature representation. To express the contribution of each feature and to make comparison easier, accuracy and recall are used as the two main measures in the feature contribution experiment. Accuracy is defined as the percentage of obtained correct translations among all the extracted translations that meet the threshold for each single feature. Recall is defined as the percentage of obtained correct translations among all the extracted translations that meet the threshold for each single feature and can be taken as correct translations. In addition, because NEs make up a high proportion of OOV terms, the experiment is built specifically on the three important NE categories, that is, PRN, LCN and OGN.
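  As a rough illustration of the two measures, the sketch below computes accuracy and recall for one feature from a set of extracted translation pairs. It rests on an interpretation, not on details given in the paper: accuracy is read as the share of extracted pairs that are correct, and recall is read in the conventional way as the share of reference translations that were recovered. The function name and the reference set are hypothetical.

```python
from typing import Set, Tuple

Pair = Tuple[str, str]  # (OOV term, translation)


def accuracy_and_recall(extracted: Set[Pair], reference: Set[Pair]) -> Tuple[float, float]:
    """Accuracy and recall of the pairs that passed a feature threshold.

    extracted: pairs whose feature value met the threshold.
    reference: pairs accepted as correct translations (assumed to be given).
    """
    correct = len(extracted & reference)
    accuracy = correct / len(extracted) if extracted else 0.0
    recall = correct / len(reference) if reference else 0.0
    return accuracy, recall


if __name__ == "__main__":
    extracted = {("Clinton", "克林顿"), ("Paris", "巴黎"), ("Paris", "帕里斯")}
    reference = {("Clinton", "克林顿"), ("Paris", "巴黎"),
                 ("Reuters", "路透社"), ("London", "伦敦")}
    print(accuracy_and_recall(extracted, reference))
```

  A rule that fires rarely but precisely, such as a bracket pattern, would score high on the first value and low on the second under this reading.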
  For PRN translation, PV contributes the most among the single features, followed by DFTOOV, CO_Freq and TFTOOV, whereas the contribution of LR or CV alone is very weak. It is also found that some noisy information is introduced and weakens the effect of PV; this is caused by removing the vowels from short PRNs or their phonetic forms. For LCN translation, the difference among P&S_IF, PV and CO_Freq is not as obvious as in PRN translation. The proportion of transliterated constituents in LCNs is lower than in PRNs, so a translation result based only on phonetic similarity is not satisfactory. On the other hand, the proportion of sense translation increases in LCNs, but much noisy information from meaningful fragments of sense translation is added to the translation candidates as well.
  The results also show that the accuracy obtained by BNR is clearly the best, but the corresponding recall is not high. The main reason is that many NEs do not appear in the form “TOOV (SOOV)” or “SOOV (TOOV)”. In addition, the recall of SMW is lower, but its accuracy is not much improved. This is also due to the introduction of much noisy information, which happens frequently in PRN translation.
  
  5.Conclusion
  Traditional OOV term translation methods mainly address two aspects, transliteration and word sense translation. However, a growing number of English OOV terms are not covered by English-Chinese bilingual dictionaries, and more and more NEs and OOV terms of other categories cannot be handled by phonetic or semantic information alone. This shows that a new way of translation is needed to obtain the required translation information. The model proposed in this paper improves the ability to acquire English-Chinese OOV term translations through Web mining, and addresses the selection and evaluation of translation pairs in a novel way by fusing multiple features.
  
  6.References
  [1] Richard Sproat, Tao Tao, and Chengxiang Zhai, “Named Entity Transliteration with Comparable Corpora”, Proceedings of COLING-ACL 2006, pp. 73-80, 2006.
  [2] Gaolin Fang, Hao Yu, and Fumihito Nishino, “Chinese-English Term Translation Mining based on Semantic Prediction”, Proceedings of COLING-ACL 2006, pp. 199-206, 2006.
  [3] Jian-Cheng Wu, and Jason S. Chang, “Learning to Find English to Chinese Transliterations on the Web”, Proceedings of EMNLP-CoNLL 2007, pp. 996-1004, 2007.
  [4] Yaser Al-Onaizan, and Kevin Knight, “Translating Named Entities using Monolingual and Bilingual Resources”, Proceedings of ACL 2002, pp. 400-408, 2002.
  [5] Wen-Hsiang Lu, and Lee-Feng Chien, “Anchor Text Mining for Translation of Web Queries: A Transitive Translation Approach”, ACM Transactions on Information Systems, 22(2):242-269, 2004.
  [6] Ying Zhang, and Phil Vines, “Using the Web for Automated Translation Extraction in Cross-Language Information Retrieval”, Proceedings of SIGIR 2004, pp. 162-169, 2004.
  [7] Ying Zhang, Fei Huang, and Stephan Vogel, “Mining Translations of OOV Terms from the Web through Cross-Lingual Query Expansion”, Proceedings of SIGIR 2005, pp. 669-670, 2005.
  [8] Conrad Chen, and Hsin-Hsi Chen, “A High-Accurate Chinese-English NE Backward Translation System Combining Both Lexical Information and Web Statistics”, Proceedings of COLING-ACL 2006, pp. 81-88, 2006.
  [9] Long Jiang, Ming Zhou, Lee-Feng Chien, and Cheng Niu, “Named Entity Translation with Web Mining and Transliteration”, Proceedings of IJCAI 2007, pp. 1629-1634, 2007.
