Blog

Stats on International Students Studying in China

While thinking about quantitative, non-economic variables that highlight China-Africa ties, I became sidetracked by the ambiguity over just how many African students were studying in China. There were official Ministry of Education (MOE) statistics, it seemed, just none consistently referenced.

In his 2011 HKU seminar presentation, Dr. Adams Bodomo referenced MOE reports on international students in China from 2006 and 2009. With those pages as a starting point, I used Google (rather than the MOE’s internal search engine) to dig up reports from other years – 2003 up to 2016*, to be precise. I’ve summarized the major stats here and included links to all the reports found at the bottom of this page.

[Figure: all international students in China by year]

[Figure: all Chinese government scholarships by year]

For more discussion on scholarships specifically given to African students, see here. See also our article in The Conversation for a discussion on the dramatic growth of African students in China.

South Korea has consistently been the #1 country of origin for foreign students studying in China. The US, Japan, Thailand, Vietnam, Russia, and Indonesia are all up there as well, with India and Pakistan climbing quickly up the ranks. Click here for a full breakdown by country.

Chinese Ministry of Education International Student Reports 2003-2016* and English Translations **

*  The 2010 and 2013 reports were not found. Student numbers for these years were calculated using the percent-growth reported in the 2011 and 2014 reports. Updated 10/17 to include 2016 report.

** These are my personal translations for informational purposes only. In the case of a discrepancy, please refer to the original Mandarin.

*** If you want to cross-reference with the original reports but don’t read Chinese, just do a ctrl-F search on the page for the term you’re looking for. For example, search for 非洲 to jump to each mention of Africa in the report. Use 亚洲 for Asia, 欧洲 for Europe, 美洲 for the Americas, and 大洋洲 for Oceania.


Visualizing Africa-China Ag. Trade

Though China’s largest overall agricultural trading partners are outside of Africa, a substantial volume of agricultural trade flows from the continent to China. The UN commodity trade database (COMTRADE) tracks these flows over time and, for the most part, down to the specific commodity. There’s plenty of data to crunch here; there’s also plenty of data to visualize.

But how?

As we’re dealing with trade flows, my first thought was to use a flow map to highlight which African countries were exporting more agricultural goods to China than others. I used ESRI ArcMap’s built-in flow map tool to do so.

[Figure: flow map of African agricultural exports to China]

While it gets the job done, the flow map is easily cluttered (even more so with labels, which I left off above). Someone talented with Illustrator or similar software could probably produce much cleaner flow maps. It’s a work in progress.

[Figure: agricultural export volume to China by African country]

I find this second version easier to read – one glance and I understand that Zimbabwe exports the most agricultural goods to China. However, this simplified version loses the easy connection with place that you get from using country borders as the background.

[Figure: major agricultural commodity type by African country]

This third version eschews trade ($) volume altogether and instead focuses on each country’s major commodity type. Straightforward.

I used a combo of the second and third images for a recent conference poster. They’re effective, but I still think the flow map could be refined into a more visually striking representation of Africa-China ag. trade. Though none of the versions shown above even starts to touch on changes in trade over time. That’s the next hurdle.

Too Many Topics

My colleagues, Dr. Achberger and Junle Ma from CAU, and I are working on a topic modeling project that compares major themes from collections of English and Mandarin academic texts on the same subject. While we’re consolidating the majority of the work and results into an article for submission, there is a portion of the project delegated to blog posts. Namely: the topic modeling runs that ‘didn’t work.’

We’re using the MALLET software to train our topic models, and MALLET, like most LDA topic modeling tools, makes the major assumption that the user knows beforehand the ‘correct’ number of topics present in the collection of texts.

Oftentimes this assumption does not hold. Thus begins the trial and error of figuring out what the ‘right’ number of topics is. What range to test (small, say zero to a couple dozen, or large, into the hundreds or thousands) depends on the size of your corpus and on the level of detail you want from the topics. For example, are you looking for every possible topic across the texts, or just the major topics that may each contain several subtopics?

Given our small corpus and interest in broad themes, we ran experimental topic models between 5 and 20 topics. It very quickly became clear that 15+ topics was several topics too many.

Here, I share the results of our topic model run with 15 and 20 topics and highlight what a cohesive topic might look like and how our results did not always achieve that cohesion.

Results

[Result files: EnglishFullText_15-and-20 and ChineseFullText_15-and-20]

Let’s talk about what does work first. Here are two examples, one from each language, from the 15 topic run. The top 20 key words for each topic are listed. We can see that the key words for Topic #3 from the Mandarin run mostly work together to contextualize the state of African (agricultural) economies. From the English example, we can see how words like “migrants”, “farms”, “embassy”, and “vegetables” together form the Chinese Farmers (in Africa) topic.

[Figures: examples of cohesive topics from the Chinese and English runs]

I picked what I thought were the most cohesive topics for the examples above. From there, the key words become more and more muddled until any interpretable topic definition seems lost. Now we get to what doesn’t work.

Topic #8 in Chinese and Topic #10 in English both feature a hodgepodge of words that, while two or three might relate to each other, do not work together as a whole to describe a topic. For example, from the Mandarin group, we can see how “field survey” might relate to “sustainability” but not as readily to “processing plant”. The English key words are even harder to interpret, with several stray verbs and adjectives (e.g. “explained”, “providing”, “greater”).

[Figures: examples of incoherent topics from the Chinese and English runs]

Now this is not to say the above results are ‘garbage’; they just don’t work for our research purposes. As we’re looking for concrete, major themes from our collections of text, it seems we’d be better served running topic models with a smaller number of topics. And that’s exactly what we did.

MALLET Code

For those interested, here are the lines of code we fed MALLET in order to run the topic models:

bin\mallet train-topics --input chinesefull.mallet --num-topics 15 --output-state Cfull15.gz --output-topic-keys Cfull15_keys.txt --output-doc-topics Cfull15_composition.txt

The command uses the base .mallet file containing all the articles’ full text as input, sets the number of topics at 15, and outputs both a key-word file and a document-composition file. The file names and the number of topics can be edited to test different configurations.
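Since finding the ‘right’ number of topics took trial and error, it can be handy to automate the sweep. Here is a minimal Python sketch of that idea (the loop is my own, not part of our actual workflow; it assumes you run it from the MALLET folder, with the input file named as in the command above):

import subprocess

# Train a model at each candidate topic count, mirroring the command above.
# Output file names follow the CfullN pattern.
for k in [5, 10, 15, 20]:
    cmd = (
        r"bin\mallet train-topics --input chinesefull.mallet "
        "--num-topics {0} --output-state Cfull{0}.gz "
        "--output-topic-keys Cfull{0}_keys.txt "
        "--output-doc-topics Cfull{0}_composition.txt"
    ).format(k)
    subprocess.call(cmd, shell=True)

Each run leaves behind its own key-word file, so you can eyeball topic cohesion across the whole range at once.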

References

For further reading on LDA topic models: Blei, David M. “Probabilistic topic models.” Communications of the ACM 55.4 (2012): 77-84.

MALLET: McCallum, Andrew Kachites. “MALLET: A Machine Learning for Language Toolkit.” http://mallet.cs.umass.edu. 2002.

How to Segment Chinese Texts: Putting in Spaces with Jieba

I’m dipping my toe into the Digital Humanities (DH) realm and playing around with DH tools like Topic Modeling. And I’m doing so with Mandarin texts. Or at least attempting to.

What I’ve found is that many DH tools rely on segmentation (essentially, spaces between each word) in order to properly process text. With English texts this is no problem; English is a segmented language, and we naturally put spaces between words. Chinese texts, however, don’t usually put spaces between words, which may be one or several characters long.

For example, let’s take this first sentence from the abstract of one of the research articles I’m reading:

当前,世界面临食品、气候变化以及金融等多重危机,而正是这些危机更加突出了农业在发展中国家至关重要的地位。

(Roughly: “Currently, the world faces multiple crises in food, climate change, and finance, and it is precisely these crises that have highlighted the crucial position of agriculture in developing countries.”)

Lots of characters, no spaces.

My first instinct was to say, “oh, easily fixed. I’ll just use Word’s find-and-replace to drop spaces in everywhere. Fixed!” Then the coffee kicked in and I remembered that in order to preserve the meaning of a sentence or text, I would need to preserve the character groupings that make up words. E.g. 世界面临 (“the world faces”) becomes 世界 面临 and not 世 界 面 临

If I was just working with a small paragraph of text, I could segment everything by hand. But, part of the DH appeal is that you have the tools to analyze a large amount of text at once. Thus, I had dozens of pages of text I needed to efficiently and accurately add spaces to.

Enter the Python module “Jieba”.

Jieba is the best segmenter I’ve run across because it allows you to add your own words to its already extensive dictionary (great for capturing jargon or slang).

There are a couple prerequisites for using Jieba:

  1. Have Python and know how to install modules
  2. Download and install the Jieba module
  3. Have the article or text you want segmented in plain text format (UTF-8 encoding). The encoding nuances go a bit over my head, but the easy way to ensure this is to transfer (either by exporting or by copy-pasting) your document to Notepad (or another plain text editor) and save it as a .txt file, as shown below; a scripted alternative follows the screenshot.

[Screenshot: saving the document as a UTF-8 .txt file in Notepad]
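If you’d rather script the conversion than round-trip through Notepad, here is a minimal Python sketch. The file names are placeholders, and the GBK source encoding is only an assumption (it’s common for Chinese documents; adjust if yours differs):

import codecs

# Read the file in its original encoding (adjust 'gbk' as needed)
# and re-save it as UTF-8 so segmentation tools can read it cleanly.
with codecs.open('original_article.txt', 'r', encoding='gbk') as f:
    text = f.read()
with codecs.open('article_utf8.txt', 'w', encoding='utf-8') as f:
    f.write(text)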

Once all of the above is set, you’re ready to open up your Python editor and segment some text. The tutorial on the Jieba site offers several examples of how to use the module, but I thought I’d share my specific code as well. I’ll post the full code first and then explain it below.

#encoding=utf-8
import jieba

# add in own dictionary of jargon and specific exceptions
jieba.load_userdict('chn-afr-dic.txt')
jieba.suggest_freq(('非', '农业'), True)
jieba.suggest_freq(('援', '非'), True)

# read in your text file
with open('Mandarin Plaintext/Full Texts/12_IAE2009_FULLTEXT.txt', 'r') as myfile:
    data = myfile.read().replace('\n', '')

# segment, join with spaces, and save as UTF-8
seg_list = jieba.cut(data, cut_all=False)
new_text = ' '.join(seg_list)
f = open('testtext22.txt', 'w')
f.write(new_text.encode('utf-8'))
f.close()

Let’s walk through it line by line:

#encoding=utf-8
import jieba

Normally a # sign signals a comment in Python. This one, though, is a special declaration telling Python that the source file (and the strings in it) is encoded in UTF-8. The second line tells Python to import the Jieba module itself.

jieba.load_userdict('chn-afr-dic.txt')

I’m working with texts that focus specifically on China in African agriculture. As such, I found there were a few words common to the texts that were not included in Jieba’s standard Chinese dictionary. (E.g. 技术示范中心, “technology demonstration center”, is all one word, but Jieba would recognize it as several distinct words.) The line of code above tells Jieba to load and incorporate my own dictionary (a .txt file with each new word on a new line).

jieba.suggest_freq(('非', '农业'), True)

There were also a couple of words that Jieba wanted to treat as one word that I needed to be two. Most common was anything where 非 was used as shorthand for Africa (非洲). Unaltered, Jieba would segment 非农业 as one word (“non-agriculture”), when really it was being used as 非(洲)农业, African agriculture. This command tells Jieba to put a space between 非 and 农业. Alternatively, if Jieba defaulted to separating the characters and I wanted them treated as one unit, I’d alter the code inside the parentheses to read '非农业'.
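For reference, here are the two directions side by side, using the two forms suggest_freq accepts (a tuple to split, a single string to merge):

# force a split: treat 非 and 农业 as two separate words
jieba.suggest_freq(('非', '农业'), True)

# force a merge: treat 非农业 as a single word
jieba.suggest_freq('非农业', True)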

with open('Folder/Subfolder/YourTextFile.txt', 'r') as myfile:
    data = myfile.read().replace('\n', '')

These lines tell Python to open the text file you’d like to segment and read its contents into the object “data”. The replace call at the end removes any hard returns (newlines), which are neither necessary nor useful for segmentation.

seg_list = jieba.cut(data, cut_all=False)
new_text = ' '.join(seg_list)
f = open('your-file-name-here.txt', 'w')
f.write(new_text.encode('utf-8'))
f.close()

Finally, we segment the text into an object called “seg_list” and then join it back together, with spaces between words, as a string called “new_text”. The first line triggers the actual segmentation. Jieba has a couple of segmentation modes; setting “cut_all=False” selects the slower, more accurate mode. The next few lines encode the output as UTF-8 and save the newly segmented text as a .txt file. This new file will be saved in the working directory, usually the same folder as your Python code file.

That’s it!

And in the end, that example sentence from before quickly gains lots of spaces:

当前 ,  世界  面临  食品 、  气候变化  以及  金融  等  多重  危机 ,  而  正是  这些  危机  更加  突出  了  农业  在  发展中国家  至关重要  的  地位  。

Now I can use the segmented text in all sorts of interesting ways. But more on that later…

China’s Scholarships for African Students & FOCAC

[Figure: Chinese government scholarships to African students by year]

From 2003 until 2008, the Chinese Ministry of Education (MOE) reports on international students in China included a by-region breakdown for Chinese government scholarship data.

Starting in 2006, the Chinese government included scholarship targets for bringing African students to study in China at each Forum on China-Africa Cooperation (FOCAC) summit. In order to evaluate how well China has upheld these pledges, Dr. Moore and I used the 2003-2008 scholarship data to estimate the number of Chinese scholarships given to African students from 2009 onwards. We created a range of possible values based on the assumptions of limited growth (linear) and best-fitting growth (exponential). Based on these estimates, China is most likely upholding the FOCAC scholarship pledges.
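For the curious, the estimation boils down to fitting both a line and an exponential to the 2003-2008 counts and extrapolating forward. A minimal sketch of the idea in Python (the numbers below are illustrative placeholders, not our actual MOE figures, except the real 2006 value of 1,861):

import numpy as np

# Scholarship counts for 2003-2008; only the 2006 figure (1,861) is real,
# the rest are placeholders for illustration.
years = np.array([2003, 2004, 2005, 2006, 2007, 2008])
counts = np.array([1200, 1400, 1600, 1861, 2150, 2550])

# Lower bound: limited-growth (linear) fit
lin = np.polyfit(years, counts, 1)

# Upper bound: best-fitting (exponential) fit, i.e. a linear fit in log space
exp_fit = np.polyfit(years, np.log(counts), 1)

for year in (2009, 2012, 2015, 2018):
    low = np.polyval(lin, year)
    high = np.exp(np.polyval(exp_fit, year))
    print(year, int(round(low)), int(round(high)))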

[Figure: estimated scholarships given versus FOCAC pledges]

* We’re aware this upper-boundary figure for the 2018 estimate is not plausible. The farther out the prediction, the more likely it is that the exponential curve no longer fits reality. The exponential curve, though the best fit for the 2003-2008 scholarship data, is probably just capturing the early portion of a logistic (S-shaped) curve, which is what we would expect for something tied to population growth. Hence the choice to present a range of scholarships given, using both linear and exponential future growth.

Still, as shown in the figure below, even using only the linear estimates keeps China’s provided scholarships on pace with FOCAC pledges.

[Figure: linear estimates of scholarships given versus FOCAC pledges]

The only known comparison we have is that at FOCAC in 2006, China declared they would “increase the number of Chinese government scholarships to African students from the current 2,000 per year to 4,000 per year by 2009.” According to the MOE, 1,861 African students received Chinese government scholarships in 2006, so the 2,000 estimate quoted in FOCAC was rounded up slightly.

Updates: 

After The Conversation article, several Twitter threads and friends have pointed us to a few more reports:

  1. The continued strength of China’s educational aid to Africa, from the Institute of International and Comparative Education
  2. Guangzhou, the city African students love and hate, from the Southern Metro Daily