CaCl2 (CaCl2: Chinese Lexicon V2, Simplified Chinese: CA中文语言词库) originated from a Chinese natural language processing (NLP) research project sponsored by a Chinese company. The CaCl2 project is an important part of the CaOCl (CaOCl: Open Chinese Lexical Analyzer) project.
CaCl2 analyzes large volumes of existing textual data obtained from the Internet, reformats the data into a massive set of entries, and catalogues and classifies the entries according to the financial industry classification standard [see reference.1].
In natural language processing (NLP) tasks, the CaCl2 lexicon helps break language down into shorter, elemental pieces (a.k.a. tokenization). The CaCl2 lexicon can also be used for higher-level NLP tasks such as word segmentation, document summarization, contextual extraction, and content categorization.
The CaCl2 project aims to build a consistent, complete, and accurate collection of industrial lexicons (dictionaries) for the Internet. We make our best effort to achieve high data integrity and to provide a firm foundation for Chinese NLP work, so that users can devote more attention to their own business and research.
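To illustrate how a lexicon supports tokenization, here is a minimal sketch of forward maximum matching, a classic dictionary-based segmentation technique. It is not necessarily what CaCl2, CaOCl, or any particular tool uses; the function name `forward_max_match` and the sample entries are hypothetical and are not taken from a CaCl2 dictionary file.

```python
def forward_max_match(text, lexicon, max_len=None):
    """Segment text greedily, preferring the longest lexicon entry at each position."""
    if max_len is None:
        max_len = max((len(w) for w in lexicon), default=1)
    max_len = max(1, max_len)  # guard so the scan always advances
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, then shrink; fall back to one character.
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in lexicon:
                tokens.append(piece)
                i += size
                break
    return tokens


# Hypothetical sample entries for illustration only.
sample_lexicon = {"中国", "银行", "中国银行", "理财", "产品", "理财产品"}
tokens = forward_max_match("中国银行发布理财产品", sample_lexicon)
print(tokens)  # ['中国银行', '发', '布', '理财产品']
```

Characters that are not covered by any entry (here "发" and "布") fall out as single-character tokens, which is why a larger, domain-specific lexicon tends to give cleaner segments.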
Date | All | Candidate | Released | Preview |
---|---|---|---|---|
2021-02-01 | 21,000,000 | 3,000,000 | 2,553,806 | 280,000 |
Date | Class | Industries | Released | Preview | Closing |
---|---|---|---|---|---|
2021-02-01 | Class-1 | 28 | 2 | 26 | 0 |
2021-02-01 | Class-2 | 104 | 5 | 99 | 0 |
**For detailed statistics, please refer to Statistics.
Clone cacl2
git clone https://github.com/limccn/cacl2.git
Or download a single dictionary
wget https://github.com/limccn/cacl2/raw/master/archive/v0.2/\[put dictionary code here].zip
CaCl2 dictionaries are well formatted and can be used with many lexical tools.
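import os
import jieba
BASE_PATH_TO_DICT = '/path/to/dicts'  # placeholder: directory holding the unzipped dictionaries
dict_name = '480000.txt'
jieba.load_userdict(os.path.join(BASE_PATH_TO_DICT, dict_name))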
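When several dictionaries are needed at once, one option is to merge them into a single file first. The sketch below assumes the dictionary files are UTF-8 text with one entry per line; that layout, the helper name `merge_dictionaries`, and the demo file names are assumptions for illustration, not something this project documents.

```python
import os
import tempfile


def merge_dictionaries(paths, out_path):
    """Concatenate dictionary files into one file, dropping blank and duplicate lines."""
    seen = set()
    with open(out_path, "w", encoding="utf-8") as out:
        for path in paths:
            with open(path, encoding="utf-8") as src:
                for line in src:
                    entry = line.strip()
                    if entry and entry not in seen:
                        seen.add(entry)
                        out.write(entry + "\n")
    return len(seen)


# Demo with two tiny stand-in files; real dictionary files would be used instead.
demo_dir = tempfile.mkdtemp()
first = os.path.join(demo_dir, "a.txt")
second = os.path.join(demo_dir, "b.txt")
with open(first, "w", encoding="utf-8") as f:
    f.write("银行\n理财产品\n")
with open(second, "w", encoding="utf-8") as f:
    f.write("理财产品\n\n证券\n")
merged = os.path.join(demo_dir, "merged.txt")
merged_count = merge_dictionaries([first, second], merged)
print(merged_count)
```

The merged file can then be passed to `jieba.load_userdict`, which expects one entry per line (a word, optionally followed by a frequency and a part-of-speech tag). Note that deduplication here compares whole lines, so the same word listed with two different frequencies would be kept twice.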
<properties>
<entry key="ext_dict">480000.txt;480100.txt;</entry>
</properties>
Code | Name | Entries | Date | Version | Format | Download |
---|---|---|---|---|---|---|
480000 | Banking-Common | 40612 | 2021-02 | v0.2 | txt | 480000.zip |
480100 | Banking-Bank | 224433 | 2021-02 | v0.2 | txt | 480100.zip |
490000 | Financials-Common | 341235 | 2021-02 | v0.2 | txt | 490000.zip |
490100 | Financials-Securities | 311121 | 2021-02 | v0.2 | txt | 490100.zip |
490200 | Financials-Insurance | 31020 | 2021-02 | v0.2 | txt | 490200.zip |
Code | Name | Entries | Schedule | Version | Format | Download |
---|---|---|---|---|---|---|
490300 | Financials-Others | 10,000 | 2Q 2021 | v0.2 | txt | 490300.zip |
Before the final release of a dictionary, we publish a technical preview dictionary containing 10,000 entries for every class-1 industry. For further information about all entries, please refer to Statistics.
**For the original raw data, please refer to /dicts.
**For the detailed class-1 and class-2 industry dictionaries, please refer to Statistics.
Word segmentation tests using textual data from different industries
Word segmentation tests using a standard Chinese test dataset
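The scoring method for these tests is not described here. As a hedged illustration, word segmentation output is commonly compared against a reference segmentation using precision, recall, and F1 over exactly matching word spans. The function names and the sample sentence below are hypothetical and do not represent results from this project.

```python
def to_spans(tokens):
    """Convert a token list into a set of (start, end) character spans."""
    spans, start = set(), 0
    for tok in tokens:
        spans.add((start, start + len(tok)))
        start += len(tok)
    return spans


def segmentation_scores(gold_tokens, predicted_tokens):
    """Return (precision, recall, f1) over exactly matching word spans."""
    gold, pred = to_spans(gold_tokens), to_spans(predicted_tokens)
    correct = len(gold & pred)
    precision = correct / len(pred) if pred else 0.0
    recall = correct / len(gold) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if correct else 0.0
    return precision, recall, f1


# Hypothetical reference and predicted segmentations of the same sentence.
gold = ["中国银行", "发布", "理财产品"]
pred = ["中国银行", "发", "布", "理财产品"]
precision, recall, f1 = segmentation_scores(gold, pred)
print(precision, recall, f1)
```

A predicted word counts as correct only when both of its boundaries match the reference, so splitting "发布" into two characters costs both precision and recall.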
Version | Date | Changelogs |
---|---|---|
0.2 | 2021 | latest |
0.1.1 | 2020 | Catalogued and classified all entries into 28 class-1 industries and 240 class-2 industries |
0.1 | 2019 | First released version; contains over 20 million entries, with data mainly obtained from Baidu Baike and Wikipedia |
Version | Cycle | Date | Changelogs |
---|---|---|---|
v0.2.21.01 | monthly | 2021-02-01 | Release: banking and financials dictionary |
v0.2.20.12 | monthly | 2021-01-01 | v0.2 Initial version |
CaCl2 and its data come from information published on the Internet. CaCl2 does not guarantee the integrity or correctness of the data, and nothing in CaCl2 constitutes investment advice. As contributors, we hold no positions in any stocks mentioned, and we have no business relationship with any company whose stock is mentioned in this article.