How to Segment Chinese Texts: Putting in Spaces with Jieba

I’m dipping my toe into the Digital Humanities (DH) realm and playing around with DH tools like Topic Modeling. And I’m doing so with Mandarin texts. Or at least attempting to.

What I’ve found is that many DH tools rely on segmentation (essentially, spaces between words) in order to process text properly. With English texts, this is no problem; English is a segmented language, and we normally put spaces between our words. Chinese texts, however, don’t usually have spaces between each character-word pairing (the group of characters that makes up one word).

For example, let’s take this first sentence from the abstract of one of the research articles I’m reading:

当前,世界面临食品、气候变化以及金融等多重危机,而正是这些危机更加突出了农业在发展中国家至关重要的地位。

Lots of characters, no spaces.

My first instinct was to say, “Oh, easily fixed. I’ll just use Word and replace all the no-spaces with spaces. Fixed!” Then the coffee kicked in and I remembered that in order to preserve the content of a sentence or text, I would need to preserve character-word pairings. E.g., 世界面临 becomes 世界 面临 and not 世 界 面 临.
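
To make the difference concrete, here is a tiny Python sketch of the two outcomes. The variable names are only for illustration, and the word-level version is typed in by hand to show the target rather than produced by any segmenter.

```python
phrase = "世界面临"

# Naive approach: a space after every character,
# which is what a blanket find-and-replace would give you
per_character = " ".join(phrase)

# What we actually want: spaces between words
# (each word happens to be two characters here)
per_word = " ".join(["世界", "面临"])

print(per_character)  # 世 界 面 临
print(per_word)       # 世界 面临
```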

If I were just working with a small paragraph of text, I could segment everything by hand. But part of the DH appeal is that you have the tools to analyze a large amount of text at once. I had dozens of pages of text that I needed to add spaces to, efficiently and accurately.

Enter the Python module “Jieba”.

Jieba is the best segmenter I’ve run across because it allows you to add your own words to its already extensive dictionary (great for capturing jargon or slang).

There are a couple prerequisites for using Jieba:

  1. Have Python and know how to install modules
  2. Download and install the Jieba module
  3. Have the article or text you want segmented in plain text format (UTF-8 encoding). All the encoding nuances go a bit over my head, but the easy way to ensure this is to transfer (either by exporting or by copy-paste) your document to Notepad (or another plain text editor) and save it as a .txt file, choosing UTF-8 as the encoding when you save.
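
If you would rather not go through Notepad, Python can re-save a file as UTF-8 itself. The following is a minimal sketch, not a tested recipe for every file: the function name and the default guess that the source file is GB18030-encoded (a common encoding for Simplified Chinese text on Windows) are assumptions you would adjust for your own documents.

```python
def convert_to_utf8(src_path, dst_path, source_encoding="gb18030"):
    """Re-save a text file as UTF-8 (hypothetical helper).

    source_encoding is a guess; change it to match your file.
    """
    # Read the file using the encoding it was originally saved in
    with open(src_path, "r", encoding=source_encoding) as src:
        text = src.read()
    # Write the same text back out as UTF-8
    with open(dst_path, "w", encoding="utf-8") as dst:
        dst.write(text)
    return text
```

If the encoding guess is wrong, you will typically see either a UnicodeDecodeError or visibly garbled characters in the output, which is your cue to try a different source_encoding.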


Once all of the above is set, you’re ready to open up your Python editor and segment some text. The tutorial on the Jieba site offers several examples of how to use the module, but I thought I’d share my specific code as well. I’ll post the full code first and then explain it below.

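# encoding=utf-8
import jieba

# add in own dictionary of jargon and specific exceptions
jieba.load_userdict('your-dictionary.txt')  # replace with the path to your own dictionary file
jieba.suggest_freq(('非', '农业'), True)
jieba.suggest_freq(('援', '非'), True)

# read in your text file
with open('Mandarin Plaintext/Full Texts/12_IAE2009_FULLTEXT.txt', 'r', encoding='utf-8') as myfile:
    data = myfile.read().replace('\n', '')

# segment and save
seg_list = jieba.cut(data, cut_all=False)
new_text = ' '.join(seg_list)
with open('testtext22.txt', 'w', encoding='utf-8') as f:
    f.write(new_text)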

Let’s walk through it line by line:

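# encoding=utf-8
import jieba

Normally a # sign signals a comment in Python. Here, the first line is actually reminding Python that, for the purpose of this module and these runs, everything is encoded using UTF-8. (Python 3 already assumes UTF-8 source files, so the line only matters on Python 2, but it does no harm.) The second line tells Python to import the Jieba module itself.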


jieba.load_userdict('your-dictionary.txt')

I’m working with texts that focus specifically on China-in-African-agriculture. As such, I found there were a few words common to the texts that were not included in Jieba’s standard Chinese dictionary. (E.g., 技术示范中心 is all one word, but Jieba would split it into several distinct words.) The line of code above tells Jieba to load and incorporate my own dictionary (a UTF-8 .txt file with each new word on its own line; swap in the name of your own file).
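
For reference, Jieba’s user dictionary format is one entry per line: the word, optionally followed by a frequency and a part-of-speech tag, separated by spaces. Below is a minimal sketch of building such a file from a Python list. The helper name and default file name are hypothetical, and the word list you pass in is up to you.

```python
def write_user_dict(words, path="my_dictionary.txt"):
    """Write a Jieba-style user dictionary: one word per line, UTF-8.

    Hypothetical helper; frequencies and part-of-speech tags are left out
    because Jieba treats them as optional.
    """
    with open(path, "w", encoding="utf-8") as f:
        for word in words:
            word = word.strip()
            if word:  # skip empty entries
                f.write(word + "\n")
    return path
```

Calling write_user_dict(['技术示范中心', '中非合作论坛']) and then jieba.load_userdict('my_dictionary.txt') would register both terms as single words (the second term is just an illustrative extra, not one from my actual list).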

jieba.suggest_freq(('非', '农业'), True)

There were also a couple of words that Jieba wanted to consider one word that I needed to be two words. The most common case was anything where 非 was used as shorthand for Africa. Unaltered, Jieba would segment 非农业 as one word (non-agricultural), when really it was being used as 非(洲)农业, African agriculture. This command tells Jieba to put a space between 非 and 农业. Alternatively, if Jieba defaulted to separating the characters and I wanted them to be considered one unit, I’d alter the code inside the parentheses to read '非农业'.

with open('Folder/Subfolder/YourTextFile.txt', 'r', encoding='utf-8') as myfile:
    data = myfile.read().replace('\n', '')

These lines tell Python to open the text file you’d like to segment as the object “myfile” and read its contents into “data”. The replace call at the end removes any hard returns: each '\n' (a line break) is replaced with '' (an empty string, i.e. nothing), since line breaks are neither necessary nor useful for segmentation.

seg_list = jieba.cut(data, cut_all=False)
new_text = ' '.join(seg_list)
with open('your-file-name-here.txt', 'w', encoding='utf-8') as f:
    f.write(new_text)

Finally, we segment the text into an object called “seg_list” and then save it as a text object called “new_text”. The first line triggers the actual segmentation process. Jieba has a couple of segmentation modes; by setting “cut_all=False” I indicate that I want Jieba to run its slower, more accurate mode. The next few lines turn Jieba’s output into a single string of text with spaces between the words and save the newly segmented text as a .txt file. Unless you give a full path, the new file is saved in Python’s current working directory, which is usually the folder you ran your Python code file from.
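
Since the appeal here is getting through dozens of pages at once, the single-file script above can be wrapped in a loop over a folder. This is a minimal sketch under a few assumptions: all your UTF-8 .txt files sit in one folder, and the function name, folder arguments, and the optional cut parameter (which defaults to jieba.cut but lets you swap in another segmenter) are illustrative choices rather than anything Jieba provides.

```python
from pathlib import Path


def segment_folder(src_dir, dst_dir, cut=None):
    """Segment every .txt file in src_dir and save the results to dst_dir.

    Hypothetical helper. `cut` is any function that takes a string and
    yields words; if omitted, jieba.cut is used.
    """
    if cut is None:
        import jieba  # only needed when no other segmenter is supplied
        cut = jieba.cut

    src_dir, dst_dir = Path(src_dir), Path(dst_dir)
    dst_dir.mkdir(parents=True, exist_ok=True)

    written = []
    for src in sorted(src_dir.glob("*.txt")):
        # Same steps as the single-file script: read, drop hard returns,
        # segment, and join the words with spaces
        text = src.read_text(encoding="utf-8").replace("\n", "")
        out = dst_dir / src.name
        out.write_text(" ".join(cut(text)), encoding="utf-8")
        written.append(out)
    return written
```

Any load_userdict or suggest_freq adjustments should be made before calling the function, for example segment_folder('Mandarin Plaintext/Full Texts', 'Segmented'), where the output folder name is a placeholder.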

That’s it!

And in the end, that example sentence from before quickly gains lots of spaces:

当前 , 世界 面临 食品 、 气候变化 以及 金融 等 多重 危机 , 而 正是 这些 危机 更加 突出 了 农业 在 发展中国家 至关重要 的 地位 。
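
As a small taste of why the spaces matter: once the words are space-delimited, ordinary Python string tools can work with them. The sketch below counts word frequencies in the segmented example sentence; the variable names and the small punctuation set are just for illustration, and real analysis would want a fuller punctuation and stop-word list.

```python
from collections import Counter

segmented = ("当前 , 世界 面临 食品 、 气候变化 以及 金融 等 多重 危机 , "
             "而 正是 这些 危机 更加 突出 了 农业 在 发展中国家 至关重要 的 地位 。")

# Punctuation marks that appear in this one sentence (illustrative, not complete)
PUNCTUATION = {",", "、", "。"}

# split() breaks on any run of whitespace, so extra spaces are harmless
tokens = [t for t in segmented.split() if t not in PUNCTUATION]
counts = Counter(tokens)

print(len(tokens))            # 23
print(counts.most_common(1))  # [('危机', 2)]
```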

Now I can use the segmented text in all sorts of interesting ways. But more on that later…


Stats on International Students Studying in China

While thinking about quantitative, non-economic variables that highlight China-Africa ties, I became side-tracked with the ambiguity over just how many African students were studying in China. There were official Ministry of Education (MOE) statistics, it seemed, just none consistently referenced.

In his 2011 HKU seminar presentation, Dr. Adams Bodomo referenced MOE reports on international students in China from 2006 and 2009. Using those pages as my starting point, I used Google (rather than the MOE’s internal search engine) to dig up reports from other years – 2003 up to 2016*, to be precise. I’ve summarized the major stats here, and included links for all the reports found at the bottom of this page.


For more discussion on scholarships specifically given to African students, see here. See also our article in The Conversation for a discussion on the dramatic growth of African students in China.

South Korea has consistently been the #1 country of origin for foreign students studying in China. The US, Japan, Thailand, Vietnam, Russia, and Indonesia are all up there as well, with India and Pakistan climbing quickly up the ranks. Click here for a full breakdown by country.

Chinese Ministry of Education International Student Reports 2003-2016* and English Translations **

*  The 2010 and 2013 reports were not found. Student numbers for these years were calculated using the percent-growth reported in the 2011 and 2014 reports. Updated 10/17 to include 2016 report.
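
The back-calculation for a missing year is simple arithmetic: if a report gives that year’s total along with the percent growth over the previous year, the previous year’s total is the reported total divided by (1 + growth). A small sketch of that step, using made-up round numbers rather than actual MOE figures (the function name is also just for illustration):

```python
def previous_year_total(current_total, percent_growth):
    """Estimate last year's count from this year's count and the
    reported year-over-year growth, given in percent."""
    return current_total / (1 + percent_growth / 100)


# Hypothetical numbers for illustration only (not from the MOE reports)
estimate = previous_year_total(current_total=110_000, percent_growth=10.0)
print(round(estimate))  # 100000
```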

** These are my personal translations for informational purposes only. In the case of a discrepancy, please refer to the original Mandarin.

*** If you want to cross-reference with the original reports but don’t read Chinese, just do a Ctrl+F search on the page for the term you’re looking for. For example, search for 非洲 and it will jump to each mention of Africa in the report. Use 亚洲 for Asia, 欧洲 for Europe, 美洲 for the Americas, and 大洋洲 for Oceania.