Analyzing and Visualizing User-Generated Community Content with Python
Community and user-generated content — that’s a match truly made in heaven!
The content generated in online communities is generally candid and quite valuable in terms of the insights it can yield. There is one caveat, however: this data is unstructured, which makes it harder to analyze and comprehend.
This flow of unstructured data on the web is growing rapidly. For perspective, IDC projects that there will be 163 zettabytes of data in the world by 2025, and estimates indicate that around 80% of it will be unstructured.
Since Tribe is a community platform designed to foster discussions and amplify content creation, we're part of this movement too. Although Tribe already offers pre-built reports and deep integrations with popular analytics tools such as Google Analytics and Amplitude, applying text mining techniques to community content can reveal additional valuable insights.
For example, here are some of the common questions you can answer:
- What are the most common terms used in the community content?
- What are the positive and negative words posted in the community?
- What is the overall community sentiment (useful for feedback and reviews posted in the community)?
In this post, I’ll cover how a community manager can answer these questions without the help of a data analyst or a programmer.
Read on for the details.
Setup and tools
We’re going to use the Python programming language for this study since it is the most popular language in the data analysis and data science community. It also comes with a robust ecosystem of libraries for scientific computing, and its very large user community means there is plenty of support whenever we run into a problem.
The simplest way to set everything up on your system is to install Anaconda. Simply put, it is a free and open-source Python distribution that comes pre-loaded with useful libraries for data science.
The installation process is straightforward — just download the Python 3+ version and run the installer. Once done, open “Anaconda Navigator”, which will now be available on your system.
When the Navigator opens, you will see the following:

Now, we’ll work with Jupyter Notebook, an open-source web application for creating and sharing documents that contain live code, equations, visualizations, and narrative text.
Once you click on “Launch”, the web application will open in your browser (localhost) and you will see the following:

As shown in the screenshot, click on “New” and then on “Python 3”. This creates a new Jupyter Notebook with a set of menu options such as “File”, “Edit”, “Cell”, “Kernel”, and “Help”. Check out the basics in the official knowledge base.

Here we need to understand two key elements:
- Cell
- Kernel
A Markdown cell is where you write formatted text using Markdown, while a Code cell lets you run Python code via the Kernel.
You can click on the “Run” button after entering the code or use the keyboard shortcut (shift+enter) to execute the code.
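For instance, typing the following into a Code cell and running it asks the Kernel to execute it and prints the result directly below the cell:
print("Hello, Tribe community!")   # the output appears right below the cell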
The dataset
Communities powered by Tribe allow you to export all the content generated inside the community. In this study, we will export the questions posted on Tribe Community.
You can export data from your community simply by clicking on “Reports” from the “Admin Panel”. Here is our community discussion on this topic.

Given below is the complete list of data fields available in the exported question dataset:
- Question ID
- Title
- Description
- Language
- Status
- Is Verified
- User Fullname
- User Username
- User Title
- User Reputation
- User External ID
- Group Id
- Group Name
- Group Slug
- Banner URL
- Followers Count
- Views Count
- Answers Count
- Hidden Answers Count
- Comments Count
- Edits Count
- Upvotes Count
- Downvotes Count
- Asks Count
- Creation Date
- Update Date
Analysis and visualization
The new Jupyter Notebook opens with an empty Code cell, and you can always insert additional Code cells.

Now we’re ready to get to the most interesting part of the study, i.e., writing actual Python code to start analyzing the data.
Importing libraries
The first step here is to import the libraries that we’ll use for computation and visualization.
import pandas as pd               # Data manipulation and analysis
import numpy as np                # Working with large, multi-dimensional arrays and matrices
import matplotlib.pyplot as plt   # Plotting library
import seaborn as sns             # Beautifying the visualizations
Reading the data
Now we import the question dataset exported from the community.
# Dataframe (df) to store the data
df = pd.read_csv('/Users/preetish/Downloads/questions-export_2020-02-17_22-21-46.csv')

# Confirm the data was stored correctly
df.head()
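To double-check which of the exported fields made it into the dataframe, you can list the columns and the overall shape:
print(df.columns.tolist())   # names of all imported columns
print(df.shape)              # (number of rows, number of columns)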

Distribution of question length
Now we’ll build a data frame containing the question text, plus the length of each question as an additional field for this exploratory analysis.
# Rename the exported "Title" column to lowercase for convenience
df = df.rename({'Title': 'title'}, axis='columns')

# Add the question length (in characters) as a new field
df['length'] = df['title'].str.len()

dfq = df[['title', 'length']]
dfq.head()

Now, let’s look at the length of the questions posted on Tribe Community. We’ll use a histogram to visualize the distribution of question length.
Run the following code:
plt.style.use('fivethirtyeight')
dfq['length'].plot.hist(bins=20, title='Distribution of question length', color="#03A57B")
plt.xlabel('Question length')
plt.ylabel('Frequency')
plt.savefig('word-distribution-question.png', dpi=600, bbox_inches="tight")

This creates the plot and saves it on your machine so you can reuse it in other documents.
As we can see, the majority of the questions are between 50 and 100 characters long, with a few outliers around 600 characters.
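If you want the exact numbers behind the histogram, pandas' describe() summarizes the length column:
# Summary statistics (count, mean, quartiles, min/max) for question length
print(dfq['length'].describe())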
Most frequently used terms in questions
So, what are people really talking about in our community? This visualization will help us find the most frequently used terms when members ask questions in the community.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction import text

# Exclude 'community' and 'tribe' from the analysis by adding them to the existing list of stop words
cv = CountVectorizer(stop_words=text.ENGLISH_STOP_WORDS.union(["community", "tribe"]))

dfq.title = dfq.title.astype(str)
words = cv.fit_transform(dfq.title)

# Count how often each term appears across all question titles
sum_words = words.sum(axis=0)
words_freq = [(word, sum_words[0, idx]) for word, idx in cv.vocabulary_.items()]
words_freq = sorted(words_freq, key=lambda x: x[1], reverse=True)
frequency = pd.DataFrame(words_freq, columns=['word', 'freq'])

# Plot the 20 most frequent terms
plt.style.use('fivethirtyeight')
color = plt.cm.gist_earth(np.linspace(0, 1, 25))
frequency.head(20).plot(x='word', y='freq', kind='bar', figsize=(15, 6), color=color)
plt.title("Most Frequent Words in Questions - Top 20")
plt.xlabel('Word')
plt.ylabel('Frequency')
plt.savefig('frequently_used_terms.png', dpi=600)

This shows that the most frequently used terms are “post”, “add”, “user”, “possible”, and “members”, so members are mostly asking about posting or creating content. The next step is to examine the questions containing the term “post” and see whether we can uncover latent themes.
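As a quick sketch of that step, you could filter the titles that contain “post” (a simple substring match, so it also catches words like “posting”) and skim them for recurring themes:
# Questions whose title mentions "post" (case-insensitive substring match)
post_questions = dfq[dfq['title'].str.contains('post', case=False, na=False)]
print(post_questions['title'].head(10))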
Word cloud
Word clouds are yet another way to visualize a cluster of words. In this analysis, we’re going to look at different terms used in the questions.
Before moving further, let’s install the “wordcloud” package for Python by executing the following in the Terminal or Command Prompt:
/opt/anaconda3/bin/python -m pip install wordcloud
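If that path doesn’t match your setup, the package can usually also be installed through conda:
conda install -c conda-forge wordcloud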
Now add the following to a Jupyter Code cell and hit Run.
from wordcloud import WordCloud

# Build the word cloud from the term frequencies computed earlier
wordcloud = WordCloud(background_color='lightcyan', width=900, height=900).generate_from_frequencies(dict(words_freq))

plt.style.use('fivethirtyeight')
plt.figure(figsize=(10, 10))
plt.axis('off')
plt.imshow(wordcloud)
plt.title("WordCloud of Terms Used in Questions", fontsize=20)
plt.show()
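Like the earlier charts, the word cloud can also be saved to disk; the file name below is just an illustration:
# Save the word cloud image alongside the other charts
wordcloud.to_file('wordcloud-questions.png')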
Bigrams in questions
A bigram is a sequence of two adjacent words. In this analysis, we will produce a visualization of the top 20 bigrams.
Given below is the Python code for the Jupyter Notebook:
def get_top_n_bigram(corpus, n=None):
    # Count all two-word sequences, excluding the custom stop words
    vec = CountVectorizer(ngram_range=(2, 2), stop_words=text.ENGLISH_STOP_WORDS.union(["community", "tribe"])).fit(corpus)
    bag_of_words = vec.transform(corpus)
    sum_words = bag_of_words.sum(axis=0)
    words_freq = [(word, sum_words[0, idx]) for word, idx in vec.vocabulary_.items()]
    words_freq = sorted(words_freq, key=lambda x: x[1], reverse=True)
    return words_freq[:n]

common_words = get_top_n_bigram(dfq['title'], 20)
df_b = pd.DataFrame(common_words, columns=['bigrams', 'count'])

# Plot the 20 most frequent bigrams
plt.style.use('fivethirtyeight')
color = plt.cm.gist_earth(np.linspace(0, 1, 25))
df_b.plot(x='bigrams', y='count', kind='bar', figsize=(15, 6), color=color)
plt.title("Bigrams in Questions - Top 20")
plt.xlabel('Bigrams')
plt.ylabel('Frequency')
plt.savefig('bigrams-question.png', dpi=600, bbox_inches="tight")
Once you execute this code, it will generate a chart like the one below:

We can see that the frequently occurring bigrams are about “content types”, “custom domain”, “home page”, “virtual currency”, and “Google Analytics”. We can now dig deeper into the questions containing these bigrams and identify possible recurring themes. This insight can be used to improve both the content available in the community and the product.
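As a sketch of that follow-up, you could pull up the questions that mention one of these bigrams, using “custom domain” purely as an example:
# Titles mentioning a given bigram, here "custom domain" as an example
domain_questions = dfq[dfq['title'].str.contains('custom domain', case=False, na=False)]
print(domain_questions['title'])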
Next up: sentiment analysis
This concludes the first part of this post, which is exploratory in nature. Here we discussed different ways to visualize the frequent terms used in questions. In the next part, we will delve into Natural Language Processing (NLP) to perform sentiment analysis on the comments posted in the community. As mentioned earlier, this is most useful for analyzing the reviews and feedback collected in a community.
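As a small preview (the next post may well use a different library or approach), a package such as TextBlob can assign each piece of text a polarity score between -1 (negative) and +1 (positive):
from textblob import TextBlob   # install first with: pip install textblob

# Score a few question titles; polarity ranges from -1 (negative) to +1 (positive)
for title in dfq['title'].head():
    print(round(TextBlob(title).sentiment.polarity, 2), title)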