How to Segment Chinese Texts: Putting in Spaces with Jieba

I’m dipping my toe into the Digital Humanities (DH) realm and playing around with DH tools like Topic Modeling. And I’m doing so with Mandarin texts. Or at least attempting to.

What I’ve found is that many of the DH tools rely on segmentation (essentially spaces between each word) in order to properly process text. With English texts, this is no problem; English is a segmented language. We normally put spaces between our words. Chinese texts, however, don’t usually have spaces between each character-word pairing.

For example, let’s take this first sentence from the abstract of one of the research articles I’m reading:


当前,世界面临食品、气候变化以及金融等多重危机,而正是这些危机更加突出了农业在发展中国家至关重要的地位。

Lots of characters, no spaces.

My first instinct was to say, “oh, easily fixed. I’ll just use Word and replace all the no-spaces with spaces. Fixed!” Then the coffee kicked in and I remembered that in order to preserve the content of a sentence or text, I would need to preserve character-word pairings. E.g. 世界面临 becomes 世界  面临 and not 世  界  面  临.
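To make the difference concrete, here is a minimal sketch in Python. It compares the blanket find-and-replace approach with a toy dictionary lookup that always takes the longest word it recognizes. The function names and the two-word dictionary are made up for illustration, and real segmenters are far more sophisticated than this; the sketch only shows why a dictionary is what keeps 世界 and 面临 intact.

```python
def naive_split(text):
    """Put a space between every character (the find-and-replace approach)."""
    return " ".join(text)


def max_match(text, dictionary, max_len=4):
    """Toy greedy segmenter: at each position, take the longest dictionary word.

    Falls back to a single character when nothing in the dictionary matches.
    """
    words = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, then shrink down to one character.
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in dictionary:
                words.append(piece)
                i += size
                break
    return words


# Hand-made dictionary for this example only.
toy_dictionary = {"世界", "面临"}

print(naive_split("世界面临"))                          # 世 界 面 临
print(" ".join(max_match("世界面临", toy_dictionary)))  # 世界 面临
```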

If I was just working with a small paragraph of text, I could segment everything by hand. But, part of the DH appeal is that you have the tools to analyze a large amount of text at once. Thus, I had dozens of pages of text I needed to efficiently and accurately add spaces to.

Enter the Python module “Jieba”.

Jieba is the best segmenter I’ve run across because it allows you to add your own words to its already extensive dictionary (great for capturing jargon or slang).

There are a few prerequisites for using Jieba:

  1. Have Python and know how to install modules
  2. Download and install the Jieba module
  3. Have the article or text you want segmented in plain text format (UTF-8 encoding). All the encoding nuances go a bit over my head, but the easy way to ensure this is to transfer (either by exporting or by copy-paste) your document to Notepad (or another plain text editor) and save as a .txt file, as shown below.
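If you would rather handle the encoding step in Python than in Notepad, something like the sketch below may help. The function names are my own, and the default source encoding (gb18030) is only an assumption: it covers many simplified-Chinese files, but you need to know, or test, what your own file actually uses.

```python
from pathlib import Path


def save_as_utf8(src_path, dst_path, source_encoding="gb18030"):
    """Re-save a text file as UTF-8.

    source_encoding is an assumption; change it to match your file.
    """
    text = Path(src_path).read_text(encoding=source_encoding)
    Path(dst_path).write_text(text, encoding="utf-8")


def is_utf8(path):
    """Return True if the file's bytes decode cleanly as UTF-8."""
    try:
        Path(path).read_bytes().decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False
```

Running is_utf8 on your .txt file before segmenting is a quick way to catch a file that was saved in some other encoding.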


Once all of the above is set, you’re ready to open up your Python editor and segment some text. The tutorial on the Jieba site offers several examples of how to use the module, but I thought I’d share my specific code as well. I’ll post the full code first and then explain it below.

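# encoding=utf-8
import jieba

# add in own dictionary of jargon and specific exceptions
jieba.load_userdict('your-dictionary.txt')  # placeholder path: a .txt file, one word per line
jieba.suggest_freq(('非', '农业'), True)
jieba.suggest_freq(('援', '非'), True)

# read in your text file and strip out the hard returns
with open('Mandarin Plaintext/Full Texts/12_IAE2009_FULLTEXT.txt', 'r', encoding='utf-8') as myfile:
    data = myfile.read().replace('\n', '')

# segment and save, with a space between each word
seg_list = jieba.cut(data, cut_all=False)
new_text = ' '.join(seg_list)
with open('testtext22.txt', 'w', encoding='utf-8') as f:
    f.write(new_text)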

Let’s walk through it line by line:

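# encoding=utf-8
import jieba

Normally a # sign signals a comment in Python. Here, the first line is a special comment that reminds Python that, for the purpose of this module and these runs, everything is encoded using UTF-8. (Python 3 already assumes UTF-8 for source files, so the line matters mostly on Python 2.) The second line tells Python to import the Jieba module itself.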


jieba.load_userdict('your-dictionary.txt')  # placeholder path: a .txt file, one word per line

I’m working with texts that focus specifically on China-in-African-agriculture. As such, I found there were a few words common to the texts that were not included in Jieba’s standard Chinese dictionary. (E.g. 技术示范中心 is all one word, but Jieba would recognize it as several distinct words.) The line of code above tells Jieba to load and incorporate my own dictionary (a .txt file with each new word on a new line).

jieba.suggest_freq(('非', '农业'), True)

There were also a couple of words that Jieba wanted to treat as one word that I needed to be two. The most common case was anything where 非 was used as shorthand for Africa. Unaltered, Jieba would segment 非农业 as one word (non-agricultural), when really it was being used as 非(州)农业, African agriculture. This command tells Jieba to put a space between 非 and 农业. Alternatively, if Jieba defaulted to separating the characters and I wanted them to be considered one unit, I’d alter the code inside the parentheses to read '非农业'.

with open('Folder/Subfolder/YourTextFile.txt', 'r', encoding='utf-8') as myfile:
    data = myfile.read().replace('\n', '')

These lines tell Python to open the text file you’d like to segment as the object “myfile” and read its contents into “data”. The replace call at the end removes any hard returns (it swaps each '\n' for an empty string), as they are neither necessary nor useful for segmentation.

seg_list = jieba.cut(data, cut_all=False)
new_text = ' '.join(seg_list)
with open('your-file-name-here.txt', 'w', encoding='utf-8') as f:
    f.write(new_text)

Finally, we segment the text into an object called “seg_list” and then save it as a text object called “new_text”. The first line triggers the actual segmentation process. Jieba has a couple of segmentation modes; by setting “cut_all=False” I indicate that I want Jieba to run its slower, more accurate mode. The next few lines join the segmented words into a single string of text and save the newly segmented text as a .txt file. This new .txt file will save in the folder Python is run from, which is usually the same folder as your Python code file.
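Since the whole point is to handle dozens of pages at once, the same steps can be wrapped in a loop over a folder. The sketch below is one way to do that, not the code I actually ran: the function name and folder layout are hypothetical, and the segmenter is passed in as an argument (any function that takes a string and returns words), so the loop itself needs nothing beyond the standard library.

```python
from pathlib import Path


def segment_folder(src_dir, dst_dir, cut):
    """Segment every .txt file in src_dir and write the results to dst_dir.

    cut is any callable that takes a string and returns an iterable of words.
    Returns the list of files written.
    """
    src, dst = Path(src_dir), Path(dst_dir)
    dst.mkdir(parents=True, exist_ok=True)
    written = []
    for path in sorted(src.glob("*.txt")):
        # Same steps as the single-file version: read, drop hard returns,
        # segment, and join the words back together with spaces.
        text = path.read_text(encoding="utf-8").replace("\n", "")
        out_path = dst / path.name
        out_path.write_text(" ".join(cut(text)), encoding="utf-8")
        written.append(out_path)
    return written
```

With Jieba imported and any custom dictionary already loaded, passing jieba.cut as the third argument should segment each file the same way as the single-file code.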

That’s it!

And in the end, that example sentence from before quickly gains lots of spaces:

当前 ,  世界  面临  食品 、  气候变化  以及  金融  等  多重  危机 ,  而  正是  这些  危机  更加  突出  了  农业  在  发展中国家  至关重要  的  地位  。

Now I can use the segmented text in all sorts of interesting ways. But more on that later…


China’s Scholarships for African Students & FOCAC


From 2003 until 2008, the Chinese Ministry of Education (MOE) reports on international students in China included a by-region breakdown for Chinese government scholarship data.

Starting in 2006, the Chinese government included at each Forum on China-Africa Cooperation (FOCAC) summit scholarship targets for bringing African students to study in China. In order to evaluate how China has upheld these pledges, Dr. Moore and I used the 2003-2008 scholarship data to estimate the number of Chinese scholarships to African students from 2009 onwards. We created a range of possible values based on the assumption of limited growth (linear) and best-fitting growth (exponential). Based on these estimates, China is most likely upholding the FOCAC scholarship pledges. 
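For readers curious about the mechanics, the two growth assumptions can be sketched roughly as below. This is an illustration rather than our actual worksheet: the function names are mine, the exponential fit is done by fitting a straight line to the logarithm of the counts (one common approach, assumed here), and the example series is a synthetic doubling series, not the MOE data.

```python
import math


def fit_line(xs, ys):
    """Ordinary least-squares fit; returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def linear_forecast(years, counts, target_year):
    """Lower bound: assume the counts keep growing by a fixed amount per year."""
    slope, intercept = fit_line(years, counts)
    return slope * target_year + intercept


def exponential_forecast(years, counts, target_year):
    """Upper bound: assume a fixed percentage growth per year.

    Fits a line to log(counts), then converts back with exp().
    """
    slope, intercept = fit_line(years, [math.log(c) for c in counts])
    return math.exp(slope * target_year + intercept)


# Synthetic doubling series for illustration only (NOT the MOE figures).
years = [2003, 2004, 2005, 2006]
counts = [100, 200, 400, 800]
low = linear_forecast(years, counts, 2008)
high = exponential_forecast(years, counts, 2008)
print(f"estimated range for 2008: {low:.0f} to {high:.0f}")
```

The gap between the two numbers widens quickly as the target year moves further out, which is why the exponential figure serves only as an upper bound.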


* We’re aware this upper boundary figure for the 2018 estimate is not plausible. The farther out the prediction, the more likely it is that the exponential curve no longer fits reality. The exponential curve, though the best fit using the provided 2003-2008 scholarship data, is probably just capturing the early portion of a logistic (S-shaped) curve that we would expect for something that is tied to population growth. Hence the choice to include a range of scholarships given, using both linear and exponential future growth.

Still, as shown in the figure below, even using only the linear estimates, China’s provided scholarships keep pace with FOCAC pledges.


The only known comparison we have is that at FOCAC in 2006, China declared they would “increase the number of Chinese government scholarships to African students from the current 2,000 per year to 4,000 per year by 2009.” According to the MOE, 1,861 African students received Chinese government scholarships in 2006, so the 2,000 estimate quoted in FOCAC was rounded up slightly.


Since The Conversation article came out, Twitter feeds and friends have pointed me to a few more reports:

  1. The continued strength of China’s educational aid to Africa from the Institute of International and Comparative Education
  2. Guangzhou, that which African students love and hate from the Southern Metro Daily

Stats on International Students Studying in China

While thinking about quantitative, non-economic variables that highlight China-Africa ties, I became side-tracked with the ambiguity over just how many African students were studying in China. There were official Ministry of Education (MOE) statistics, it seemed, just none consistently referenced.

In his 2011 HKU seminar presentation, Dr. Adams Bodomo referenced MOE reports on international students in China from 2006 and 2009. Using those pages as my starting point, I used Google (rather than the MOE’s internal search engine) to dig up reports from other years – 2003 up to 2016*, to be precise. I’ve summarized the major stats here, and included links for all the reports found at the bottom of this page.


For more discussion on scholarships specifically given to African students, see here. See also our article in The Conversation for a discussion on the dramatic growth of African students in China.

South Korea has consistently been the #1 country of origin for foreign students studying in China. The US, Japan, Thailand, Vietnam, Russia, and Indonesia are all up there as well, with India and Pakistan climbing quickly up the ranks. Click here for a full breakdown by country.

Chinese Ministry of Education International Student Reports 2003-2016* and English Translations **

*  The 2010 and 2013 reports were not found. Student numbers for these years were calculated using the percent-growth reported in the 2011 and 2014 reports. Updated 10/17 to include 2016 report.

** These are my personal translations for informational purposes only. In the case of a discrepancy, please refer to the original Mandarin.

*** If you want to cross-reference with the original reports but don’t read Chinese, just do a ctrl-f search on the page for the term you’re looking for. For example, search for 非洲 and it will jump to each mention of Africa in the report. Use 亚洲 for Asia, 欧洲 for Europe, 美洲 for the Americas, and 大洋洲 for Oceania.
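The back-calculation mentioned in the first footnote (*) amounts to the arithmetic sketched below. It assumes each report states the percent growth over the previous year, so the missing year's total is the following year's total divided by one plus that growth rate. The function name and the numbers are placeholders, not figures from the reports.

```python
def previous_year_total(current_total, percent_growth):
    """Recover last year's total from this year's total and the reported growth.

    percent_growth is given in percent, e.g. 10 for a reported 10% increase.
    """
    return current_total / (1 + percent_growth / 100)


# Placeholder numbers: 110 students after a reported 10% increase
# implies roughly 100 students the year before.
estimate = previous_year_total(110, 10)
print(round(estimate))
```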

Article Reflections #9 – Agricultural Training

In a recent World Development article, Tugendhat and Alemu present primary research on China’s short-term technical and policy training courses on agriculture.


Who attends these courses? 

  • Technical civil-servants (~ 3 months, practical, hands-on training)
  • Senior officials (~ 2 to 4 weeks, more observation and policy but still some fieldwork)
  • Ministerial-level officials and secretaries (~ 14 days or less, networking, business, and policy)

How do the short-term courses work? The short-term courses are “funded by the Department of Foreign Aid in MOFCOM, and the short-term courses are managed by MOFCOM’s ‘Academy for International Business Officials’ (AIBO). A number of courses are also hosted at AIBO but more often funding is provided to other Chinese institutions such as universities, research centers, and relevant companies. Flights, accommodation and lodging are all paid for by MOFCOM, and the only costs borne by the participants or their ministries are the visa fees and stipends” (74-74). African ministries are allowed to pick which staff go (corruption/patronage or efficiency?). Almost all courses were taught in Chinese with an interpreter translating (generally into English or French). The authors observed a roughly even split between courses focused on technologies and technical methods and those focused on policy and management methods.

The authors consider three major questions:

  1. Do the training courses push a unified, central model of development? In other words, is there a Beijing consensus? While there were a few central messages pushed onto the courses (esp. China as a fellow developing country with technological experience, emphasized with banquets and field trips to the countryside), there was no unified model of development or best practices dictated to course lecturers. Instead, course content was largely left to the individual trainers. While course content does need approval from the central government, the lecturers interviewed said they rarely received comment on their submitted lesson plans.
  2. Do the training courses serve as vehicles for China’s commercial interests? The authors’ analysis shows course participants come from a diverse group of developing countries and not just the resource-rich countries. However, a third of participants were offered the opportunity to buy goods connected with their training course.
  3. How do the training courses articulate China’s soft power?  The majority of the participants interviewed retained a positive impression of China. Both participants and organizers highlighted newly established relationships as the lasting impact of the courses.


The authors found no direct impact of the training courses:

“The greatest impediment to implementing the lessons from the training contexts in home contexts was either that courses were not relevant to the unique climate or socio-economic contexts the participants were from, or the job that they actually carried out” (78).

In other words, the courses weren’t geared to meet specific needs. Further, there seem to be limited options for follow-up work or funding for projects inspired by the training courses. Participants and lecturers alike stressed to the authors that relationship-building was more important than the tangible impact of the training. Tugendhat and Alemu conclude we’ll see long-term impacts from these trainings, and they may be right. However, I can’t help but wonder whether we would see more short-term impacts as well if a more structured follow-up mechanism accompanied the training. It really depends on the goal: are the training courses about seeing China in a positive light, or are they about equipping participants with the knowledge needed to improve their local situation?


Tugendhat, Henry, and Dawit Alemu. “Chinese agricultural training courses for African officials: Between power and partnerships.” World Development 81 (2016): 71-81.


Article Reflection #8 – Agri-tech Demo Centers

Starting with the 2006 Forum on China-Africa Cooperation summit, China announced that it would build agricultural technology demonstration centers (ATDCs) in partner African countries. ATDCs have been constructed in over 20 countries so far, and Xu et al.’s 2016 piece in World Development offers us great ethnographic insight into the reality of China’s ATDCs.

Summary: The paper starts with a review of China’s science and technology regime, providing the background context in which to see how ATDCs are an extension of China’s own experience of modernizing agriculture and thus China’s attempt to share that experience. The authors observed daily life at four ATDCs, and it is through profiles of the managers and workers at the centers that we come to see the inherent political and social realities that get in the way of the ATDCs’ intended purpose of ag-tech transfer. As Xu et al. put it, “negotiations must take place about the meanings and implications of agriculture and technology, demonstration and extension, as well as aid and development” (89).

Reflections: ATDCs have a dual purpose: to share and demonstrate Chinese agri-tech to African users, and to promote Chinese agribusiness. The interviews and narratives from Chinese managers and their African counterparts in Xu et al.’s paper reflect the conflict between the two. If you’re looking for an example of why ‘the China model’ may not be as easy to export as is hoped/feared, this paper is a good place to start.


Xu, Xiuli, et al. “Science, technology, and the politics of knowledge: The case of China’s agricultural technology demonstration centers in Africa.” World Development 81 (2016): 82-91.