After a recent data analytics skill test for a job application that I’m sure I flunked, I realized I needed to brush up on my basics. So I decided, hey, why not look at Novel Updates data!

This is not the first time this has been done. Dreams of Jianghu has been doing yearly reviews of NU trends on their site since 2018! Do check them out if you haven’t!

https://dreamsofjianghu.ca/2018/08/06/asian-fan-translation-trends-2018/
https://dreamsofjianghu.ca/2019/11/09/asian-fan-translation-trends-2019/

Here’s a summary of what I have analyzed. In this thread, I will only be covering Section 1.
*Things listed here are subject to change as I write the posts.

0. How did I collect the data?

1. General analysis
    Exploratory time analysis of projects
    Exploring projects by their ratings
    Exploring projects by their number of readers
    Final analysis with minimum chapters constraints
    Conclusion

2. Genre analysis
    Overview
    Biannual growth trend
    Common genre combinations
    Statistics of genre combinations
        1. Ratings
        2. Number of readers
    Wordclouds of novel descriptions
        1. All novels
        2. Xianxia and Xuanhuan
        3. Yaoi and Shounen Ai
        4. Fantasy
        5. School Life
    Conclusion

3. Tag analysis
    Overview
        1. Most Common Tags
        2. Least Used Tags
        3. Novels with Most Number of Tags
    Correlation of Number of Tags to Number of Readers [STATISTICALLY HEAVY]
    Most Common Tag Combinations
        1. 3-Tag Combinations
        2. 4-Tag Combinations
    Measuring Common 4-Tag Combinations
        1. Ratings
        2. Number of Readers
    Finding Subsets Between 2 Tags [MATH]
    Conclusion

4. Novel synopsis analysis
    Overview
        1. Wordcloud of Word-Pairs
        2. Distribution of Synopsis Length
        3. Novels with the Longest Synopses
    Finding Similarities in Synopses
        1. Brief Introduction to Word Embeddings
        2. Visualizing in 2D
        3. Visualizing in 3D
    Novel Recommender System

Things to note:

I will only be looking at CN/JP/KR novels, as they make up the bulk of the projects on the site.
All time series analyses will start from the 2nd half of 2015, which is roughly when Novel Updates was first created.
All analyses are based on projects with chapters uploaded. This excludes projects with dead links/hidden chapters. As such, the values are slightly undercounted.
Conclusions are based solely on the data collected and the analyses conducted.

=======================================

0. How did I collect the data?

The entire data collection process was completed between 19th and 20th February this year, and the collection script was written in Python. The script iterates over all the project pages on Novel Updates and scrapes the data that I think will be useful for my analysis.

To visualize this, an example of some of the data I scrape off a page is shown below:

I used my translation project as an example, ‘cause I’m that narcissistic!

=======================================

1. General analysis

As of 19th February 2020, there are a total of 6,054 projects on Novel Updates:

Chinese: 3,128
Japanese: 2,463
Korean: 365
Others: 98

It was kind of surprising to see that there were only 365 Korean projects, seeing how they have been quite a hot topic in the past 2 years or so. Out of these 6,054 projects, only 5,683 projects contain active links.

Let’s look at the line chart below that shows the number of projects over time, with respect to their country of origin.

------------------------------

1.1. Exploratory time analysis of projects

From this chart, it can be seen that Japanese projects have been on a steady rise, but Chinese projects have been rising at an increasing rate! On 13th November 2018, the number of Chinese projects officially surpassed the number of Japanese projects!

Let’s dive a little deeper into the rates of increase by looking at how many projects are added bi-annually.

Fumu.
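For anyone who wants to reproduce the bi-annual counts, here is a minimal sketch of how the bucketing could be done. It is not my actual script: the sample rows, the `half_year` helper, and the "H1 = January to June" convention are all assumptions made for illustration.

```python
from collections import Counter
from datetime import date

# Hypothetical sample rows: (language, date of first chapter upload).
# These are made-up values, not the scraped dataset.
projects = [
    ("CN", date(2017, 3, 1)),
    ("CN", date(2017, 9, 14)),
    ("JP", date(2017, 2, 20)),
    ("KR", date(2018, 7, 2)),
    ("CN", date(2018, 11, 13)),
    ("JP", date(2018, 8, 30)),
]


def half_year(d):
    """Label a date with its half-year bucket, e.g. '2017-H1'.

    Assumption: H1 covers January to June, H2 covers July to December.
    """
    return f"{d.year}-H{1 if d.month <= 6 else 2}"


def biannual_counts(rows):
    """Count newly added projects per (half-year, language) pair."""
    return Counter((half_year(d), lang) for lang, d in rows)


counts = biannual_counts(projects)
for (period, lang), n in sorted(counts.items()):
    print(period, lang, n)
```

Plotting these counts per language, one bar or point per half-year, gives the kind of growth-rate view discussed here.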
It seems like roughly 200 JP projects are added every 6 months, while more and more CN projects are generally added, with a big spike between Dec 2018 and Jun 2019. The number of KR novels has also been showing a gradual increase since Jun 2018.

Looking at the current trend, I’m predicting that we will see another 700-800 active projects being added by the end of this June, with possibly more Korean projects given their popularity.

Anyway, while working on these charts, this got me thinking. What would the CN curve look like without the official CN publishers turned EN publishers (namely Webnovel and TapRead)?

Webnovel’s first project upload was on 1st March 2017, so we should see the CN curve branching out in the 1st half of 2017.
TapRead’s first project upload was on 25th February 2019. As TapRead has a small number of new projects, I don’t think it would affect the curve that much.

Interesting, it seems like the sharp increases between 2017 and 2019 weren’t because of these 2 rising publishers after all. Maybe it’s due to an increasing number of smaller translation groups? I did not scrape information about the translation groups, so I will have to leave this analysis for another time!

------------------------------

1.2. Exploring projects by their ratings

Now then, let’s look at how the projects are distributed by their ratings. Do note that the minimum rating that you can give is 1.0, and the maximum is 5.0.

The average rating for a project on NU is 3.73. By language, the average ratings are:

Chinese: 3.72
Japanese: 3.74
Korean: 3.83

Korean projects have a higher average rating than the other 2 languages, it seems. Let’s look at the actual distribution.

Hmm. The chart above doesn’t really give a fair comparison for KR projects, as they are significantly fewer in number. Let’s instead scale the numbers according to each language, and look at the density plots!
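To show what "scaling according to each language" means, here is a minimal sketch that turns raw counts into densities, so that each language's distribution has a total area of 1 regardless of how many projects it has. Note that this uses a normalized histogram rather than a smoothed density curve, and the sample ratings, the function name `density_histogram`, and the bin count are all made up for illustration.

```python
def density_histogram(values, lo=1.0, hi=5.0, bins=8):
    """Return per-bin densities whose total area (density * width) is 1.

    Assumption: every value lies within [lo, hi], as NU ratings do.
    """
    width = (hi - lo) / bins
    counts = [0] * bins
    for v in values:
        # Values equal to `hi` are clamped into the last bin.
        idx = min(int((v - lo) / width), bins - 1)
        counts[idx] += 1
    n = len(values)
    # Dividing by n * width makes groups of different sizes comparable.
    return [c / (n * width) for c in counts]


# Hypothetical ratings, not the scraped data.
ratings = {
    "CN": [2.9, 3.2, 3.6, 3.8, 3.9, 4.1, 4.6, 4.8],
    "KR": [3.7, 3.9, 4.0],
}

for lang, vals in ratings.items():
    dens = density_histogram(vals)
    print(lang, [round(d, 2) for d in dens])
```

With this scaling, a language with only a few hundred projects can be compared directly against one with thousands, because the y-axis shows the share of projects per rating range instead of the raw count.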
All the distributions are left-skewed, and Korean projects are generally rated near their mean score of 3.83. (This can be seen from the steeper slope near the mean.) This also means that readers do not give a Korean project a very high or very low score as easily as they do for the other 2 languages. Interesting!

The Chinese projects’ distribution has the shortest peak, which means that the ratings are spread more evenly than in the other 2 languages. In the 4.5-5.0 range of the chart, you can also see that the Chinese distribution line is actually higher than the other 2, which means you can generally find more Chinese projects within that rating range.

Japanese projects, on the other hand, seem to have more novels within the 3.0-3.5 range and fewer novels within the 4.1-4.5 range, relative to the other 2 languages.

So with these charts alone, which translator/translation group, based on the language they’re translating, will thrive? I honestly think this isn’t sufficient to conclude anything, so let’s move on to look at the distribution of readers.

------------------------------

1.3. Exploring projects by their number of readers

On average, a project on NU has 1,876 readers. In terms of language:

Chinese: 1,736
Japanese: 2,222
Korean: 2,462

Let’s go straight into the scaled distribution based on the number of readers!

Well, I can’t say that I’m not surprised to see such a biased right skew. The Chinese projects’ distribution has the tallest peak, while the Korean projects’ distribution has the shortest. This means Chinese projects vary less in terms of readers, and usually sit close to their mean.

There’s hardly any useful information to derive from here, so let’s move on.

------------------------------

1.4. Final analysis with minimum chapters constraints

Translation groups tend to pick up novels that are longer. After all, more chapters mean a longer-running series, more time to build a fanbase, more clicks, and thus more views.
In this final part, we will take a look at the distributions of projects with at least 100 chapters, and see if we can make any conclusions!

Now, let’s look at the distribution of ratings again, this time with the constraint in place.

Hardly any change can be seen in the Chinese and Japanese distributions, but the Korean distribution is skewed even more to the left! Generally, longer Korean projects have more ratings above 4.0 than longer Chinese and Japanese projects!

For the curious, the average ratings for projects with at least 100 chapters are:

Chinese: 3.74
Japanese: 3.83
Korean: 4.04

Now let’s look at the number of readers for projects with the constraint in place!

It’s definitely much more readable than the previous unconstrained chart! The right skew for Chinese projects is much more pronounced than for Japanese and Korean projects, it seems, with a large bulk of Chinese projects having around 0-5,000 readers. The Japanese and Korean distributions seem to be a lot smoother, with readership counts spread across a wider range.

When it comes to Chinese translation groups starting out, I would expect that unless it’s a big hit, it would be hard to gain readership for the first 100 chapters or so.

The average readership for projects with at least 100 chapters is:

Chinese: 3,643
Japanese: 7,327
Korean: 8,418

------------------------------

1.5. Section 1 Conclusion

There’s a general rising trend in the number of novels. I expect bigger growth in KR projects.
Though I believe there will still be a great number of new CN projects in the following months, new readership for CN projects might be hard to obtain. This is most likely due to reader burnout.
If committed to releasing long novels, a new Korean translation group would most likely outperform a new Chinese or Japanese translation group.

And that’s the end of the 1st part of my analysis. If there are any questions, things I missed out, or things I probably misinterpreted, do let me know!

Stay tuned for the next part!
I guarantee that it’s definitely more fun than just charts and numbers like this one! It will probably take me a few more hours to consolidate them all though... omg... Save me.

*UPDATE: Click here for Part 2 on genre analysis!
*UPDATE 2: Click here for Part 3 on tag analysis!
*UPDATE 3: Click here for the final part on novel synopsis analysis!