Chinese characters identifier (Java)

Quick:

This Java program checks whether an input word or character is Chinese. See ChineseCharIdentifier on GitHub.

Usage

Use public static boolean isChineseChar(int codePoint) to check a single character by its code point, and public boolean isChineseWord(String s) to check a whole word.
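To give a feel for what such a check can look like, here is a minimal sketch. It is not the code from the repository: the class name ChineseCheckSketch is made up, both methods are static for brevity, and "Chinese" is approximated as "belongs to the Unicode Han script" using the JDK's Character.UnicodeScript.

```java
final class ChineseCheckSketch {

    // Assumption: "Chinese" means the code point belongs to the Han script.
    // Han characters are shared with Japanese and Korean, so this cannot
    // tell the languages apart.
    static boolean isChineseChar(int codePoint) {
        return Character.UnicodeScript.of(codePoint) == Character.UnicodeScript.HAN;
    }

    // A word counts as Chinese only if it is non-empty and every
    // code point in it is a Han character.
    static boolean isChineseWord(String s) {
        if (s == null || s.isEmpty()) {
            return false;
        }
        return s.codePoints().allMatch(ChineseCheckSketch::isChineseChar);
    }

    public static void main(String[] args) {
        // "\u4e2d\u6587" is the two-character word for "Chinese language".
        System.out.println(isChineseChar("\u4e2d".codePointAt(0))); // true
        System.out.println(isChineseWord("\u4e2d\u6587"));          // true
        System.out.println(isChineseWord("abc"));                   // false
    }
}
```

Iterating with codePoints() rather than over char values matters for rarer characters outside the Basic Multilingual Plane, which take two char values each.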

It is also possible to exempt particular regex patterns from filtering (e.g. to accept some alphanumeric tokens, such as _NOUN_ in the Google data set).
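One way such an exemption could work is sketched below. Everything here is an assumption for illustration: the class name FilteredChineseCheckSketch, the constructor taking a list of accepted patterns, and the pattern _[A-Z]+_ for tags like _NOUN_. The actual program may expose this differently.

```java
import java.util.Collections;
import java.util.List;
import java.util.regex.Pattern;

final class FilteredChineseCheckSketch {

    // Hypothetical allow-list: words matching any of these patterns
    // are accepted even though they are not Chinese.
    private final List<Pattern> accepted;

    FilteredChineseCheckSketch(List<Pattern> accepted) {
        this.accepted = accepted;
    }

    boolean isChineseWord(String s) {
        if (s == null || s.isEmpty()) {
            return false;
        }
        // Exempted patterns must match the whole word.
        for (Pattern p : accepted) {
            if (p.matcher(s).matches()) {
                return true;
            }
        }
        // Otherwise every code point must belong to the Han script.
        return s.codePoints()
                .allMatch(cp -> Character.UnicodeScript.of(cp) == Character.UnicodeScript.HAN);
    }

    public static void main(String[] args) {
        FilteredChineseCheckSketch checker = new FilteredChineseCheckSketch(
                Collections.singletonList(Pattern.compile("_[A-Z]+_")));
        System.out.println(checker.isChineseWord("_NOUN_"));       // true (exempted)
        System.out.println(checker.isChineseWord("\u4e2d\u6587")); // true (Han only)
        System.out.println(checker.isChineseWord("hello"));        // false
    }
}
```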

Intro:

While working on a Chinese n-gram analysis with the Google Books dataset, I spent some time figuring out how to tell whether a character is Chinese, something that is supposed to be quite easy.

Nothing here is very rigorous or standard; I just hope it helps others understand some basics and get a quick solution.

Some Concepts:

To check whether a character is Chinese, we can use its code point, which can be obtained with Java’s Character.codePointAt() (or String.codePointAt()).
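A small sketch of reading code points in Java follows. The class name CodePointDemo and the helper toCodePoints are made up for the example. It also shows why a code point is not the same thing as a Java char: U+20000, a character from CJK Extension B, needs two char values (a surrogate pair).

```java
final class CodePointDemo {

    // Hypothetical helper: all code points of a string, in order.
    static int[] toCodePoints(String s) {
        return s.codePoints().toArray();
    }

    public static void main(String[] args) {
        String bmp = "\u4e2d";                  // U+4E2D, fits in one char
        String supplementary = "\uD840\uDC00";  // U+20000, stored as a surrogate pair

        System.out.println(Integer.toHexString(bmp.codePointAt(0)));           // 4e2d
        System.out.println(supplementary.length());                            // 2
        System.out.println(Integer.toHexString(supplementary.codePointAt(0))); // 20000
        System.out.println(toCodePoints(supplementary).length);                // 1
    }
}
```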

Code Point

From Wikipedia

A code point or code position is any of the numerical values that make up the code space. For example, ASCII comprises 128 code points in the range 0x0 to 0x7F, Extended ASCII comprises 256 code points in the range 0x0 to 0xFF, and Unicode comprises 1,114,112 code points in the range 0x0 to 0x10FFFF.

What is the range of code points for Chinese?

This question is harder.

For English it is easy: the letters run from a to z and A to Z.

In Unicode, many of these characters are not exclusively Chinese; they are shared with Japanese and (traditional) Korean as well. Chinese characters are therefore part of the set called CJK, which is spread over several separate blocks, i.e. the range is not contiguous.

CJK

From the official FAQ:

A: It is a commonly used acronym for “Chinese, Japanese, and Korean”. The term “CJK character” generally refers to “Chinese characters”, or more specifically, the Chinese (= Han) ideographs used in the writing systems of the Chinese and Japanese languages, occasionally for Korean, and historically in Vietnam

Chinese Only!

Closer to what you want may be “Blocks Containing Han Ideographs”. This is hard to reach if you have been looking for “Chinese”

Check out this SO thread and the standard specification to get the correct ranges.
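To make the non-contiguous ranges concrete, here is a rough sketch that tests a code point against a hand-written table of Han ideograph blocks. The class name HanBlockRangesSketch and the method inHanBlock are made up, and the table is deliberately incomplete: it lists only the main block, Extensions A to D and the two compatibility blocks, while newer Unicode versions add further extensions. Treat the standard as the authority for the full list.

```java
final class HanBlockRangesSketch {

    // Inclusive {start, end} pairs. Assumption: this subset is enough for
    // illustration; it omits the extension blocks added after Extension D.
    private static final int[][] HAN_BLOCKS = {
        {0x3400, 0x4DBF},   // CJK Unified Ideographs Extension A
        {0x4E00, 0x9FFF},   // CJK Unified Ideographs
        {0xF900, 0xFAFF},   // CJK Compatibility Ideographs
        {0x20000, 0x2A6DF}, // CJK Unified Ideographs Extension B
        {0x2A700, 0x2B73F}, // CJK Unified Ideographs Extension C
        {0x2B740, 0x2B81F}, // CJK Unified Ideographs Extension D
        {0x2F800, 0x2FA1F}, // CJK Compatibility Ideographs Supplement
    };

    // True if the code point falls inside one of the listed blocks.
    // A block may contain unassigned code points, so this is a coarse test.
    static boolean inHanBlock(int codePoint) {
        for (int[] range : HAN_BLOCKS) {
            if (codePoint >= range[0] && codePoint <= range[1]) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        System.out.println(inHanBlock(0x4E2D));  // true  (main block)
        System.out.println(inHanBlock(0x20000)); // true  (Extension B)
        System.out.println(inHanBlock(0x3042));  // false (Hiragana)
        System.out.println(inHanBlock('A'));     // false (Latin)
    }
}
```

The JDK also ships Character.UnicodeBlock and Character.UnicodeScript, which avoid maintaining such a table by hand, at the cost of depending on the Unicode version your Java release supports.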

Full Unicode Chart

This is also blogged here.
