Posts Tagged ‘weibo’

The paper “Censorship and deletion practices in Chinese social media” is now available online at First Monday. Beyond my earlier comments, which were based only on the paper abstract and the Carnegie Mellon news release of March 7, 2012, some details of the paper are worth noting:

  • The Sina Weibo data were collected through Sina’s open APIs over the period from 27 June to 30 September 2011. By querying the public timeline at fixed intervals to retrieve a sample of messages, the study collected a total of 56,951,585 messages (approximately 600,000 messages per day). Using the same API, one can check whether a message still exists and can be read at the moment, or whether it has been deleted at some point between its original date of publication and now. (If it has been deleted, Sina returns the message “target weibo does not exist”.) Here are my questions: What type of “sample” is this? Does it contain all messages posted in this period of time? How large is the average number of posts each day on Sina Weibo?
  • The baseline message deletion rate is about 16.25%, estimated from a random sample of 1,308,430 messages (published between 30 June and 25 July), of which 212,583 had been deleted.
  • Efforts are made to filter out spam messages in this study as well. Nice! Details are described as:

We filtered the entire dataset on three criteria: (1) duplicate messages that contained exactly the same Chinese content (i.e., excluding whitespace and alphanumerics) were removed, retaining only the original message; (2) all messages from individuals with fewer than five friends and followers were removed; and, (3) all messages with a hyperlink (http) or addressing a user (@) were removed if the author had fewer than 100 friends and followers.
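The three criteria quoted above can be sketched in a few lines of Python. This is only my reading of the quoted description, not the authors' code. The `Message` record and its field names are hypothetical, "fewer than N friends and followers" is assumed to mean the combined total, "Chinese content" is approximated by the basic CJK Unified Ideographs range, and messages are assumed to arrive in chronological order so that the first copy seen is the original.

```python
import re
from dataclasses import dataclass


@dataclass
class Message:
    # Hypothetical record layout; the paper does not specify one.
    msg_id: int
    text: str
    friends: int
    followers: int


# Assumption: "Chinese content" = characters in the basic CJK block.
_CHINESE = re.compile(r"[\u4e00-\u9fff]+")


def chinese_content(text: str) -> str:
    """Keep only Chinese characters, dropping whitespace and alphanumerics."""
    return "".join(_CHINESE.findall(text))


def filter_spam(messages):
    """Apply the three quoted criteria in the order they are listed."""
    # (1) Drop later messages with identical Chinese content.
    # Assumption: messages with no Chinese characters are not deduplicated.
    seen = set()
    deduped = []
    for m in messages:
        key = chinese_content(m.text)
        if key and key in seen:
            continue
        seen.add(key)
        deduped.append(m)

    kept = []
    for m in deduped:
        contacts = m.friends + m.followers  # assumed: combined total
        # (2) Drop authors with fewer than five friends and followers.
        if contacts < 5:
            continue
        # (3) Drop links and @-mentions from authors with fewer than 100.
        if contacts < 100 and ("http" in m.text or "@" in m.text):
            continue
        kept.append(m)
    return kept
```

Running the duplicate check first matters: if the original of a duplicated message later falls to criterion (2) or (3), its copies are still gone, which matches the order in the quoted text.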

  • To extract terms from the messages, a Chinese–English dictionary is constructed as the union of the open source CC–CEDICT dictionary and all entries in the Chinese-language Wikipedia, to overcome the challenge posed by the absence of whitespace separating words in Chinese.
  • Here is one for statisticians: To see if the high deletion rate of messages containing a certain term is “statistically significant”, the false discovery rate is used to adjust for the fact that tens of thousands of hypothesis tests are performed simultaneously (multiple testing).
  • Twitter data are used as a baseline for comparison. The baseline was set by the 10,000 most frequent users in the gardenhose sample over the period 1–24 June 2011 writing tweets in Chinese not containing http or www (to filter spammers). From the reported table, one interesting term shown at the top of the sensitive terms with statistically significantly higher rates of message deletion is “方滨兴 (Fang Binxing)”, the name of the architect of the Great Firewall.
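For readers who have not met the false discovery rate before, here is a rough sketch of how the baseline deletion rate and multiple-testing correction could fit together. I am assuming a one-sided exact binomial test per term and the Benjamini-Hochberg procedure; the paper may well use a different test or a different FDR method. The term names and their counts are invented purely for illustration; only the baseline figures come from the paper.

```python
from math import comb

# Baseline deletion rate reported in the paper: 212,583 of 1,308,430.
BASELINE = 212_583 / 1_308_430  # about 0.1625


def binom_upper_tail(k: int, n: int, p: float) -> float:
    """Exact P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


def benjamini_hochberg(pvalues, q=0.05):
    """Return the indices of hypotheses rejected at FDR level q."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    cutoff = 0
    # Find the largest rank whose p-value is under its step-up threshold.
    for rank, i in enumerate(order, start=1):
        if pvalues[i] <= rank * q / m:
            cutoff = rank
    return sorted(order[:cutoff])


# Hypothetical (deleted, total) counts per term, NOT data from the paper.
counts = {
    "term_a": (40, 100),
    "term_b": (18, 100),
    "term_c": (30, 100),
    "term_d": (15, 100),
}

terms = list(counts)
pvalues = [binom_upper_tail(d, n, BASELINE) for d, n in counts.values()]
significant = [terms[i] for i in benjamini_hochberg(pvalues, q=0.05)]
print(significant)
```

The point of the correction is that with tens of thousands of terms, a plain 5% threshold per term would flag a large number of terms by chance alone; the step-up thresholds keep the expected share of false alarms among the flagged terms at or below q.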

I will stop here. Interested? Read the whole article yourself.


“Carnegie Mellon Performs First Large-Scale Analysis of ‘Soft’ Censorship of Social Media in China”, released on March 7, 2012, reports:

The study by Noah Smith, associate professor in the Language Technologies Institute (LTI); David Bamman, a Ph.D. student in LTI; and Brendan O’Connor, a Ph.D. student in the Machine Learning Department, appears in the March issue of First Monday, a peer-reviewed, online journal. ……

“You even see some weibos where the writer asks, ‘Is this going to be deleted?'” O’Connor said. In late 2010, New York Times columnist Nicholas Kristof opened an account on a Chinese microblog site; within an hour of sending a message about Falun Gong, his account was shut down.

To study this “soft” censorship, the CMU team analyzed almost 57 million messages posted on Sina Weibo, a domestic Chinese microblog site similar to Twitter that has more than 200 million users. They collected samples of weibos from June 27 to Sept. 30, 2011, using an application programming interface (API) that Sina Weibo provides to developers so they can build related services.

Using the same API, they later checked a random subset of weibos to see if they still existed and another subset that included terms known to be politically sensitive. If a weibo was deleted, Sina would return what the researchers came to regard as an ominous message: “target weibo does not exist.”

It seems like an interesting piece of research, and there are also some curious observations:

Censored terms are not always political. Following the March 2011 Fukushima nuclear disaster in Japan, weibos containing such politically innocuous terms as iodized salt and radioactive iodine had high deletion rates. The researchers believe these deletions were the result of government efforts to quash false rumors about the nuclear accident causing salt contamination.

To my knowledge, it is the first large-scale study of this topic. I’m eager to read the article to find out the details, but it is not yet in the current issue of First Monday (at least not as of 3:00 pm on March 9, 2012). I cannot find the paper on the webpages of Noah Smith or the other co-authors either.

However, at this very moment (3:00 pm on March 9, 2012), the news release has already been republished and reblogged as “China’s social networks hit by censorship, says study” (BBC), “CMU study analyzes what China deletes” (post-gazette.com), “CMU study analyzes what China deletes” (Education News), and some others.

By no means am I questioning the findings or the methods of the research without having read it. I just wonder how much reporters (and bloggers) read before they spread the news.

I will update after reading the paper when it shows up on First Monday (which was supposed to happen on March 5 anyway). See update here.
