The Haks Language Modeling System: Examples from Buddhist Texts in Sanskrit, Tibetan and Chinese

Christopher Handy
Michael Litchard
handyca@mcmaster.ca
http://handyc.sdf.org

Inaugural NARNiHS Conference
23 July 2017
This study demonstrates a practical method for extracting recurrent strings from digitized texts in cases where grammar, vocabulary and other information about the texts are partly or entirely unknown.
Our method involves building concordances of words and phrases from digitized input sets of texts, using a simple but effective pattern-recognition algorithm. The algorithm can be generalized to work with information in any language, but we restrict this study to just three major languages of the Buddhist tradition: Sanskrit, classical Tibetan and classical Chinese.
We utilize free text files available in online databases so that our examples can be verified easily. We also provide source code examples of our algorithm in C and Haskell, available on GitHub: https://github.com/handyc
Three languages of the Buddhist literary tradition, Sanskrit, Tibetan and Chinese, lack word boundary delimiters in their traditional manuscripts. A human being familiar with these languages can identify individual words in such manuscripts, but because the texts have no spaces between words, we cannot locate words without prior knowledge of the language.
As a result, digital concordances for Sanskrit, Tibetan and Chinese often require some amount of manual part-of-speech tagging before the datasets can be processed by computer.
Our software runs through every possible string of n syllables in every text of a corpus, where n is any positive integer chosen by the user. We refer to strings of connected syllables as n-grams, named by their specific n size as 1-gram, 2-gram, 3-gram, etc.
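The sliding-window step can be sketched as follows. This is a minimal Python illustration, not the authors' C or Haskell code: the function name `ngrams` is hypothetical, and the input is assumed to be already split into syllables (how that splitting is done for each script is outside this sketch).

```python
def ngrams(syllables, n):
    """Return every run of n consecutive syllables, in order of appearance.

    `syllables` is a list of strings; syllable splitting is assumed to
    have been done beforehand (an assumption of this sketch).
    """
    if n < 1:
        raise ValueError("n must be at least 1")
    # One window starts at each position that leaves room for n syllables.
    return [tuple(syllables[i:i + n]) for i in range(len(syllables) - n + 1)]


# Hypothetical five-syllable input, for illustration only.
sample = ["bo", "dhi", "sa", "ttva", "sya"]
for gram in ngrams(sample, 4):
    print("-".join(gram))
```

A text shorter than n syllables simply yields no n-grams.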
The program counts each n-gram in a given text, and creates a concordance file for that text. These concordance files are then combined to form a concordance file for the entire corpus.
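Counting per text and then merging might look like the sketch below. It keeps the concordances in memory as Python `Counter` objects rather than files, which is a simplification; the names `concordance` and `merge_concordances` and the two toy texts are hypothetical.

```python
from collections import Counter


def ngrams(syllables, n):
    """Every run of n consecutive syllables, as tuples."""
    return [tuple(syllables[i:i + n]) for i in range(len(syllables) - n + 1)]


def concordance(syllables, n):
    """Count how often each n-gram occurs in one text."""
    return Counter(ngrams(syllables, n))


def merge_concordances(concordances):
    """Add per-text counts together into one corpus-wide concordance."""
    total = Counter()
    for c in concordances:
        total.update(c)  # Counter.update adds counts, it does not overwrite
    return total


# Toy texts, already split into syllables (invented for illustration).
texts = {
    "text_a": ["dge", "slong", "dag", "dge", "slong"],
    "text_b": ["dge", "slong", "ma"],
}
per_text = {name: concordance(s, 2) for name, s in texts.items()}
corpus = merge_concordances(per_text.values())
print(corpus.most_common(1))
```

Keeping the per-text counts alongside the merged one makes it possible to report the most frequent n-gram for each file as well as for the corpus.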
For example, searching on n = 4 among a sample set of major Sanskrit Mahāyāna texts, we found the sequence bo-dhi-sa-ttva appearing frequently in many different texts. If this sequence appears more frequently in Mahāyāna texts than in mainstream texts, it could be used as an identifier for the Mahāyāna genre.
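One way to make "more frequently" comparable across corpora of different sizes is to divide each count by the total number of n-grams counted. The sketch below assumes that normalization, which is not stated in the slides; `relative_frequency` is a hypothetical helper, and the counts are invented toy values, not results from the study.

```python
from collections import Counter


def relative_frequency(concordance, gram):
    """Share of all n-gram occurrences in the concordance that are `gram`."""
    total = sum(concordance.values())
    return concordance[gram] / total if total else 0.0


gram = ("bo", "dhi", "sa", "ttva")
other = ("x", "x", "x", "x")  # placeholder standing in for all other 4-grams

# Invented toy counts, NOT data from the study.
corpus_a = Counter({gram: 30, other: 970})
corpus_b = Counter({gram: 2, other: 998})

rate_a = relative_frequency(corpus_a, gram)
rate_b = relative_frequency(corpus_b, gram)
print(rate_a > rate_b)
```

A sequence whose rate is consistently higher in one group of texts than in another would be a candidate genre marker in the sense described above.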
Sanskrit Mahāyāna and Mūlasarvāstivāda Vinaya (MSV) frequent n-grams

| Mahāyāna 4-gram | Mahāyāna 8-gram | MSV 4-gram | MSV 8-gram |
|---|---|---|---|
| bodhisattva (3343) | nuttarāyāṁsamyaksaṁbo (451) | kathayati (1367) | bhagavataārocaya (166) |
| tathāgata (2096) | ttarāyāṁsamyaksaṁbodhau (436) | bhagavanā (550) | vaśeṣāmāpattimāpa (166) |
| kulaputra (1346) | nuttarāṁsamyaksaṁbodhi (320) | gavānāha (520) | ghāvaśeṣāmāpattimā (166) |
| śatasaha (1298) | samyaksaṁbodhimabhisaṁ (303) | sakathaya (499) | gavataārocayanti (166) |
| bodhisattvā (1289) | ttarāṁsamyaksaṁbodhima (302) | samayena (458) | nakālenatenasama (164) |
| samyaksaṁbo (901) | rāṁsamyaksaṁbodhimabhi (297) | kathayanti (405) | kālenatenasamaye (164) |
| bodhisattvo (874) | thāgatorhansamyaksaṁbu (266) | saṃlakṣaya (372) | lenatenasamayena (163) |
| athakhalu (860) | kṣetraparamāṇurajaḥ (262) | damavoca (372) | vobhagavataāroca (159) |
| lokadhātu (812) | ddhakṣetraparamāṇura (260) | lakṣayati (356) | kṣavobhagavataāro (159) |
| sahasrāṇi (746) | traparamāṇurajaḥsa (259) | bhagavatā (346) | tatprakaraṇaṃbhikṣavo (153) |
| mahāmate (740) | buddhakṣetraparamāṇu (250) | midamavo (332) | tprakaraṇaṃbhikṣavobha (152) |
| buddhakṣetra (647) | gavantametadavoca (241) | ghāvaśeṣā (311) | raṇaṃbhikṣavobhagava (152) |
| tasahasrā (637) | bhagavantametadavo (236) | gṛhapati (303) | karaṇaṃbhikṣavobhaga (152) |
| dhisattvasya (616) | ṭīnayutaśatasaha (224) | praticchannā (298) | bhikṣavobhagavataā (152) |
| tadavoca (610) | koṭīnayutaśatasa (224) | tisakatha (292) | ṇaṃbhikṣavobhagavata (151) |
| sarvasattva (599) | paramāṇurajaḥsamā (205) | tadabhava (288) | gavantamidamavoca (150) |
| sattvomahā (578) | myaksaṁbodhimabhisaṁbu (205) | rvavadyāva (280) | ārocayantibhagavā (135) |
| tathāgatā (571) | ṭīniyutaśatasaha (196) | saṃghāvaśe (278)* | tenakālenatenasa (135) |
Tibetan 'dul ba (= vinaya, monastic law) texts from the Derge Kanjur (Buddhist canon)

| 'dul ba text | 2-gram | 4-gram |
|---|---|---|
| kl00001e1.txt | DGE_SLONG (2671) | SO_SOR_THAR_PA'I (672) |
| kl00001e2.txt | PA_DANG (1827) | BCOM_LDAN_'DAS_KYIS (578) |
| kl0001e3inc.txt | DGE_SLONG (2031) | BCOM_LDAN_'DAS_KYIS (523) |
| kl0001e4inc.txt | PA_DANG (1713) | BCOM_LDAN_'DAS_KYIS (407) |
| kl00002e1.txt | DGE_SLONG (346) | TSE_DANG_LDAN_PA (101) |
| kl00003e1.txt | DGE_SLONG (2089) | TSE_DANG_LDAN_PA (647) |
| kl00003e2inc.txt | DGE_SLONG (2636) | TSE_DANG_LDAN_PA (749) |
| kl00003e3.txt | DGE_SLONG (2550) | BCOM_LDAN_'DAS_KYIS (879) |
| kl00003e4.txt | PA_DANG (1679) | BCOM_LDAN_'DAS_KYIS (591) |
| kl00004e.txt | DGE_SLONG (465) | DGE_SLONG_MA_GANG (211) |
Tibetan 'dul ba (= vinaya, monastic law) texts from the Derge Kanjur (Buddhist canon)

| 'dul ba text | 2-gram | 4-gram |
|---|---|---|
| KL00001001(eTB).txt | དགེ་སློང (2622) | སོ་སོར་ཐར་པའི (667) |
| KL00001002(eTB).txt | པ་དང (1796) | བཅོམ་ལྡན་འདས་ཀྱིས (568) |
| KL00001003(eTB).txt | དགེ་སློང (1993) | བཅོམ་ལྡན་འདས་ཀྱིས (515) |
| KL00001004(eTB).txt | པ་དང (1686) | བཅོམ་ལྡན་འདས་ཀྱིས (404) |
| KL00002001(eTB).txt | དགེ་སློང (342) | ཚེ་དང་ལྡན་པ (100) |
| KL00003001(eTB).txt | དགེ་སློང (2042) | ཚེ་དང་ལྡན་པ (622) |
| KL00003002(eTB).txt | དགེ་སློང (2595) | ཚེ་དང་ལྡན་པ (733) |
| KL00003003(eTB).txt | དགེ་སློང (2500) | བཅོམ་ལྡན་འདས་ཀྱིས (858) |
| KL00003004(eTB).txt | པ་དང (1656) | བཅོམ་ལྡན་འདས་ཀྱིས (581) |
| KL00004(eTB).txt | དགེ་སློང (457) | དགེ་སློང་མ་གང (211) |
Tibetan mdo mang (= sūtra) texts from the Derge Kanjur (Buddhist canon)

| mdo mang text | 4-gram | 8-gram |
|---|---|---|
| kl00094e.txt | BYANG_CHUB_SEMS_DPA' (111) | BYANG_CHUB_SEMS_DPA'_MCHOG_TU_DGA'_BA'I (25) |
| kl00095e.txt | BYANG_CHUB_SEMS_DPA' (608) | SLONG_DAG_DE_LTAR_BYANG_CHUB_SEMS_DPA' (36) |
| kl00096e.txt | BYANG_CHUB_SEMS_DPA' (55) | BYANG_CHUB_SEMS_DPA'I_RAB_TU_'BYUNG_BA (25) |
| kl00097e.txt | BYANG_CHUB_SEMS_DPA' (63) | PA_JI_LTAR_NA_BYANG_CHUB_SEMS_DPA' (30) |
| kl00098e.txt | BYANG_CHUB_SEMS_DPA' (29) | BYANG_CHUB_SEMS_DPA'_SEMS_DPA'_CHEN_PO (11) |
| kl00108e.txt | BYANG_CHUB_SEMS_DPA' (187) | BYANG_CHUB_SEMS_DPA'_SEMS_DPA'_CHEN_PO (88) |
| kl00353.txt | BYANG_CHUB_SEMS_DPA' (353) | BYANG_CHUB_SEMS_DPA'_SEMS_DPA'_CHEN_PO (22) |
| kl00357.txt | BYANG_CHUB_SEMS_DPA' (27) | BYANG_CHUB_SEMS_DPA'_SEMS_DPA'_CHEN_PO (8) |
Tibetan mdo mang (= sūtra) texts from the Derge Kanjur (Buddhist canon)

| mdo mang text | 4-gram | 8-gram |
|---|---|---|
| KL00094(eTB).txt | བ ང ཆ བ ཏ ན (977)* | བ ང ཆ བ ཏ ན ས མས བས ད ད བད (679)* |
| KL00095(eTB).txt | བྱང་ཆུབ་སེམས་དཔའ (591) | སློང་དག་དེ་ལྟར་བྱང་ཆུབ་སེམས་དཔའ (36) |
| KL00096(eTB).txt | བྱང་ཆུབ་སེམས་དཔའ (55) | བྱང་ཆུབ་སེམས་དཔའི་རབ་ཏུ་འབྱུང་བ (24) |
| KL00097(eTB).txt | བྱང་ཆུབ་སེམས་དཔའ (60) | པ་ཇི་ལྟར་ན་བྱང་ཆུབ་སེམས་དཔའ (29) |
| KL00098(eTB).txt | བྱང་ཆུབ་སེམས་དཔའ (29) | བྱང་ཆུབ་སེམས་དཔའ་སེམས་དཔའ་ཆེན་པོ (11) |
| KL00108(eTB).txt | བྱང་ཆུབ་སེམས་དཔའ (108) | བྱང་ཆུབ་སེམས་དཔའ་སེམས་དཔའ་ཆེན་པོ (87) |
| KL00353(eTB).txt | བྱང་ཆུབ་སེམས་དཔའ (232) | བ ང ཆ བ ཏ ས མས བས ད པར ག ར ཏ (9)* |
| KL00357(eTB).txt | བྱང་ཆུབ་སེམས་དཔའ (26) | བྱང་ཆུབ་སེམས་དཔའ་སེམས་དཔའ་ཆེན་པོ (8) |
Chinese texts from the Taishō canon (Chinese Buddhist canon)

| Text | n | Most frequent n-grams | Counts |
|---|---|---|---|
| T01 | 1 | 有 者 | (21400) (21078) (20109) (16553) |
| T01 | 2 | 比丘 世尊 沙門 | (7863) (7848) (4876) (3014) |
| T01 | 4 | 所以者何 沙門瞿曇 尊者阿難 爾時世尊 | (879) (799) (557) (544) |
| T01 | 8 | 如來無所著等正覺 聞佛所說歡喜奉行 我聞一時佛遊 誦我聞一時佛 | (246) (242) (216) (199) |
| T02 | 1 | 時 如 | (21383) (18528) (16702) (16470) |
| T02 | 2 | 比丘 世尊 爾時 | (11395) (7862) (7260) (6083) |
| T02 | 4 | 爾時世尊 所說歡喜 聞佛所說 佛所說歡 | (2592) (1803) (1756) (1699) |
| T02 | 8 | 衛國祇樹給孤獨園 舍衛國祇樹給孤獨 聞佛所說歡喜奉行 我聞一時佛住 | (1505) (1505) (1380) (1275) |
| T03 | 1 | 如 無 | (16212) (15204) (14458) (14028) |
| T03 | 2 | 爾時 一切 | (5812) (5750) (4446) (3834) |
| T03 | 4 | 藐三菩提 三藐三菩 阿耨多羅 耨多羅三 | (1094) (1094) (1093) (1093) |
| T03 | 8 | 阿耨多羅三藐三菩 耨多羅三藐三菩提 成阿耨多羅三藐三 多羅三藐三菩提心 | (1092) (1092) (349) (241) |
| T04 | 1 | 者 人 | (16280) (11786) (11086) (10931) |
| T04 | 2 | 比丘 故 世尊 爾時 | (3117) (2268) (2248) (2141) |
| T04 | 4 | 故說曰 爾時世尊 即說偈言 亦復 | (1052) (520) (347) (300) |
| T04 | 8 | 舍衛國祇樹給孤獨 衛國祇樹給孤獨園 在舍衛國祇樹給孤 佛在舍衛國祇樹給 | (138) (137) (137) (134) |
Chinese texts from the Taishō canon (Chinese Buddhist canon)

| Text | n | Most frequent n-grams | Counts |
|---|---|---|---|
| T08 | 1 | 菩 無 | (30632) (29114) (28812) (27650) |
| T08 | 2 | 波羅 羅蜜 般若 | (16532) (16492) (16387) (12207) |
| T08 | 4 | 般若波羅 若波羅蜜 摩訶 薩摩訶薩 | (11852) (11851) (6444) (6439) |
| T08 | 8 | 阿耨多羅三藐三菩 耨多羅三藐三菩提 須菩提摩訶薩 阿耨多羅三耶三菩 | (2176) (2170) (878) (878) |
| T09 | 1 | 一 無 佛 法 | (19709) (18855) (16633) (15684) |
| T09 | 2 | 一切 眾生 如來 | (14366) (9990) (8800) (4534) |
| T09 | 4 | 一切眾生 摩訶 薩摩訶薩 令一切眾 | (2769) (2089) (2088) (1187) |
| T09 | 8 | 阿耨多羅三藐三菩 耨多羅三藐三菩提 子為摩訶薩 佛子為摩訶 | (544) (544) (267) (267) |
| T10 | 1 | 一 無 切 諸 | (30535) (29172) (21625) (20996) |
| T10 | 2 | 一切 眾生 | (21590) (17716) (11979) (6543) |
| T10 | 4 | 一切眾生 摩訶 薩摩訶薩 一切諸佛 | (3496) (2369) (2368) (1211) |
| T10 | 8 | 阿耨多羅三藐三菩 耨多羅三藐三菩提 謂生心我 薩生心我已得 | (495) (495) (316) (316) |
| T11 | 1 | 無 如 | (25990) (21719) (21493) (19870) |
| T11 | 2 | 一切 如來 | (10722) (7850) (7281) (6397) |
| T11 | 4 | 薩摩訶薩 摩訶 文殊師利 一切眾生 | (2303) (2303) (1033) (802) |
| T11 | 8 | 阿耨多羅三藐三菩 耨多羅三藐三菩提 舍利子摩訶薩 得阿耨多羅三藐三 | (770) (770) (331) (154) |
Chinese texts from the Taishō canon (Chinese Buddhist canon)

| Text | n | Most frequent n-grams | Counts |
|---|---|---|---|
| T12 | 1 | 如 無 | (31334) (28306) (27667) (27300) |
| T12 | 2 | 如來 眾生 | (10028) (9206) (7684) (7624) |
| T12 | 4 | 摩訶 薩摩訶薩 亦復 一切眾生 | (1900) (1896) (1640) (1385) |
| T12 | 8 | 阿耨多羅三藐三菩 耨多羅三藐三菩提 得阿耨多羅三藐三 善男子摩訶薩 | (1052) (1050) (369) (334) |
| T13 | 1 | 無 如 | (23427) (22261) (21834) (19803) |
| T13 | 2 | 一切 眾生 | (9957) (9957) (7919) (7743) |
| T13 | 4 | 摩訶 薩摩訶薩 一切眾生 阿耨多羅 | (1686) (1680) (1109) (905) |
| T13 | 8 | 阿耨多羅三藐三菩 耨多羅三藐三菩提 多羅三藐三菩提心 發阿耨多羅三藐三 | (898) (898) (263) (248) |
| T14 | 1 | 無 佛 南 如 | (58875) (44990) (37115) (19559) |
| T14 | 2 | 南無 佛南 如來 | (36603) (28455) (9873) (7861) |
| T14 | 4 | 如來南無 王佛南無 佛南無無 南無 | (4964) (2403) (2148) (1460) |
| T14 | 8 | 阿耨多羅三藐三菩 耨多羅三藐三菩提 中華電子佛典協會 曰智慧為六何謂 | (607) (607) (332) (242) |
| T15 | 1 | 無 如 | (21311) (21014) (19035) (15385) |
| T15 | 2 | 一切 眾生 | (7368) (5846) (4631) (3574) |
| T15 | 4 | 文殊師利 可思議 摩訶 薩摩訶薩 | (1282) (1125) (645) (644) |
| T15 | 8 | 阿耨多羅三藐三菩 耨多羅三藐三菩提 童子摩訶薩復 子摩訶薩復有 | (359) (359) (171) (171) |
Chinese texts from the Taishō canon (Chinese Buddhist canon)

| Text | n | Most frequent n-grams | Counts |
|---|---|---|---|
| T21 | 1 | 一 二 如 | (15785) (15538) (11616) (11430) |
| T21 | 2 | 二合 一切 | (7689) (5164) (4365) (3513) |
| T21 | 4 | 娑嚩二合 此陀羅尼 嚩二合引 電子佛典 | (1151) (784) (767) (684) |
| T21 | 8 | 中華電子佛典協會 項本資料庫可自由 電子佛典普及版完 電子佛典協會版權 | (456) (228) (228) (228) |
| T22 | 1 | 比 丘 | (38638) (37624) (33402) (25496) |
| T22 | 2 | 比丘 諸比 丘尼 | (37618) (9580) (8891) (8760) |
| T22 | 4 | 若比丘尼 六群比丘 白佛佛言 告諸比丘 | (1663) (1570) (1517) (1456) |
| T22 | 8 | 默然故事持 從今戒應說 今戒應說若 戒應說若比 | (509) (482) (449) (442) |
| T23 | 1 | 比 丘 | (30543) (28299) (23647) (22827) |
| T23 | 2 | 比丘 苾芻 丘尼 | (22825) (8356) (6004) (4509) |
| T23 | 4 | 種種因緣 僧伽婆尸 伽婆尸沙 白佛佛言 | (1156) (1054) (1054) (1011) |
| T23 | 8 | 應說若復苾芻 學處應說若復 處應說若復苾 制其學處應說 | (409) (372) (368) (363) |
| T24 | 1 | 者 有 | (27289) (23089) (18720) (16271) |
| T24 | 2 | 苾芻 比丘 世尊 | (8486) (6489) (5131) (4908) |
| T24 | 4 | 爾時世尊 白佛佛言 時諸苾芻 波逸底迦 | (695) (611) (601) (551) |
| T24 | 8 | 波逸底迦若復苾芻 苾芻以緣白佛佛言 者波逸底迦若復苾 逸底迦若復苾芻尼 | (279) (206) (191) (182) |