
Analyzing and Visualizing User-Generated Community Content with Python

Community and user-generated content — that’s a match truly made in heaven!

The content generated in online communities is generally candid in nature and quite valuable in terms of the insights it yields. However, there is one caveat: this data is unstructured, which makes it harder to analyze and comprehend.

This flow of unstructured data on the web is increasing exponentially. For perspective, IDC projects that there will be 163 zettabytes of data in the world by 2025, and estimates indicate that 80% of this data will be unstructured.

Since Tribe is a community platform designed to foster discussions and amplify content creation, we’re also part of this movement. Although Tribe already offers pre-built reports and deep integrations with popular analytics tools such as Google Analytics and Amplitude, applying text mining techniques to community content can reveal additional valuable insights.

For example, here are some of the common questions you can answer:

  • What are the most common terms used in the community content?
  • What are the positive and negative words posted in the community?
  • What is the overall community sentiment (useful for feedback and reviews posted in the community)?

In this post, I’ll cover how a community manager can answer these questions without the help of a data analyst or a programmer.

Read on to learn the details.

Setup and tools

We’re going to use the Python programming language for this study since it is now the most popular language in the data analysis and data science community. Also, it comes with a robust ecosystem of libraries for scientific computing. A very large community of users ensures that there is enough support and help whenever we run into any problem.

Now, the simplest way to set up everything in your system is to just go ahead install Anaconda. Simply put, it is a free and open-source Python distribution that comes pre-loaded with useful libraries for data science.

The installation process is simple and straightforward: just download the Python 3+ version and run the installer. Once done, open up “Anaconda Navigator”, which you will find among your installed applications.

When the Navigator opens up, you would see the following:

Now, we’ll work with Jupyter Notebook, an open-source web application for creating and sharing documents that contain live code, equations, visualizations, and narrative text.

Once you click on “Launch”, the web application will open up in your browser (localhost) and you will be able to see the following:

As shown in the screenshot, click on “New” and then select “Python 3”. This creates a new Jupyter Notebook with a set of menu options such as “File”, “Edit”, “Cell”, “Kernel”, and “Help”. Check out the basics from the official knowledge base.

Here we need to understand two key elements:

  • Cell
  • Kernel

A Markdown Cell is where you can write content in Markdown format to format plain text. A Code Cell allows you to run Python code via the Kernel.

You can click on the “Run” button after entering the code or use the keyboard shortcut (shift+enter) to execute the code.
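For a quick first run, you can type a couple of lines into a Code Cell and execute it; a minimal example (the variable name is just an illustration):

```python
# A first Code Cell: run it with the "Run" button or Shift+Enter
message = "Hello, Jupyter!"
print(message)   # the output appears directly below the cell
print(2 + 3)     # prints 5
```

If the cell prints both lines, your Notebook and Kernel are working correctly.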

The dataset

Communities powered by Tribe allow you to export all the content generated inside the community. In this study, we will export the questions posted on Tribe Community.

You can export data from your community simply by clicking on “Reports” from the “Admin Panel”. Here is our community discussion on this topic.

Given below is the list of data fields available in the dataset of exported questions:

  • Question ID
  • Is Verified
  • User Fullname
  • User Username
  • User Title
  • User Reputation
  • User External ID
  • Group Id
  • Group Name
  • Group Slug
  • Banner URL
  • Followers Count
  • Views Count
  • Answers Count
  • Hidden Answers Count
  • Comments Count
  • Edits Count
  • Upvotes Count
  • Downvotes Count
  • Asks Count
  • Creation Date
  • Update Date

Analysis and visualization

The new Jupyter Notebook should open with an empty Code Cell; you can always insert a new one.

Now we’re ready to get to the most interesting part of the study, i.e., writing actual Python code to start analyzing the data.

Importing libraries

The first step here is to import the libraries that we’ll use for computation and visualization.

import pandas as pd              # Data manipulation and analysis
import numpy as np               # Working with large, multi-dimensional arrays and matrices
import matplotlib.pyplot as plt  # Plotting library
import seaborn as sns            # Beautifying the visualizations

Reading the data

Now we import the question dataset exported from the community.

# Dataframe (df) to store the data
df = pd.read_csv('/Users/preetish/Downloads/questions-export_2020-02-17_22-21-46.csv')

# Confirm the data was stored correctly by previewing the first rows
df.head()

Distribution of question length

Now we’ll work on a data frame with question text and length of the questions as an additional data field for this exploratory analysis.

# Rename the column first, then compute the length of each question
df = df.rename({'Title': 'title'}, axis='columns')
df['length'] = df['title'].str.len()
dfq = df[['title','length']]
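Before plotting, it can help to glance at summary statistics for the new length column; a small sketch using a hypothetical sample in place of the exported titles:

```python
import pandas as pd

# Hypothetical sample standing in for the exported question titles
dfq = pd.DataFrame({'title': [
    'How do I add members?',
    'Can I use a custom domain for my community site?',
]})
dfq['length'] = dfq['title'].str.len()

# count, mean, std, min, quartiles, and max of question length
summary = dfq['length'].describe()
print(summary)
```

The `min` and `max` values give you a quick sense of the range before choosing histogram bins.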


Now, let’s look at the length of the questions posted on Tribe Community. We’ll use a histogram to visualize the distribution of length.

Run the following code:


dfq['length'].plot.hist(bins=20, title='Distribution of question length', color="#03A57B")
plt.xlabel('Question length')
plt.savefig('word-distribution-question.png', dpi=600, bbox_inches="tight")

Question length distribution

This will create the plot and save it on your machine so you can use the same in other documents.

As we can see, the majority of the questions are between 50 and 100 characters long. There are also some outlier questions around 600 characters in length.
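To inspect those outliers directly, you can sort the frame by length; a sketch assuming the `dfq` frame built above (the sample titles here are hypothetical):

```python
import pandas as pd

# Hypothetical stand-in for the exported questions
dfq = pd.DataFrame({'title': [
    'Short one?',
    'A much longer question that keeps going ' * 6,
]})
dfq['length'] = dfq['title'].str.len()

# Sort longest-first to surface the outliers on the histogram's right tail
longest = dfq.sort_values('length', ascending=False).head(5)
print(longest)
```

Reading the actual outlier questions often explains them, e.g. members pasting error logs or long descriptions into the title field.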

Most frequently used terms in questions

So, what are people really talking about in our community? This visualization will help us find the most frequently used terms when members ask questions in the community.

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction import text

# Exclude 'community' and 'tribe' from the analysis by adding them to the existing list of stop words
cv = CountVectorizer(stop_words=text.ENGLISH_STOP_WORDS.union(["community", "tribe"]))

dfq.title = dfq.title.astype(str)
words = cv.fit_transform(dfq.title)
sum_words = words.sum(axis=0)
words_freq = [(word, sum_words[0, idx]) for word, idx in cv.vocabulary_.items()]
words_freq = sorted(words_freq, key=lambda x: x[1], reverse=True)
frequency = pd.DataFrame(words_freq, columns=['word', 'freq'])

color = plt.cm.gist_earth(np.linspace(0, 1, 25))
frequency.head(20).plot(x='word', y='freq', kind='bar', figsize=(15, 6), color=color)
plt.title("Most Frequently Used Words in Questions - Top 20")
This shows that the frequently used terms are “post”, “add”, “user”, “possible”, “members”. So, the members are mostly asking about posting or creating content. The next step is to examine the questions with the term “post” and find out if we can uncover latent structures. 
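As a sketch of that next step, you can filter the titles for the term “post”; the sample titles below are hypothetical stand-ins for the exported data:

```python
import pandas as pd

# Hypothetical sample titles
dfq = pd.DataFrame({'title': [
    'How do I post an article?',
    'Can members edit a post after publishing?',
    'How do I change my avatar?',
]})

# Case-insensitive match on the whole word "post"
post_questions = dfq[dfq['title'].str.contains(r'\bpost\b', case=False)]
print(post_questions)
```

Reading through the matched subset is usually enough to spot recurring sub-themes, such as creating versus editing content.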

Word cloud

Word clouds are yet another way to visualize a cluster of words. In this analysis, we’re going to look at different terms used in the questions.

Before moving further, let’s install the “wordcloud” package for Python by executing the following in the Terminal or Command Prompt:

/opt/anaconda3/bin/python -m pip install wordcloud

Now add the following in the Jupyter Code Cell and hit run.

from wordcloud import WordCloud

wordcloud = WordCloud(background_color='lightcyan', width=900, height=900).generate_from_frequencies(dict(words_freq))

plt.figure(figsize=(10, 10))
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis('off')
plt.title("WordCloud of Terms Used in Questions", fontsize=20)


Bigrams in questions

Bigrams help us identify a sequence of two adjacent words. In this analysis, we will produce a visualization of the top 20 bigrams.

Given below is the Python code for the Jupyter Notebook:

def get_top_n_bigram(corpus, n=None):
    vec = CountVectorizer(ngram_range=(2, 2), stop_words=text.ENGLISH_STOP_WORDS.union(["community", "tribe"])).fit(corpus)
    bag_of_words = vec.transform(corpus)
    sum_words = bag_of_words.sum(axis=0)
    words_freq = [(word, sum_words[0, idx]) for word, idx in vec.vocabulary_.items()]
    words_freq = sorted(words_freq, key=lambda x: x[1], reverse=True)
    return words_freq[:n]

common_words = get_top_n_bigram(dfq['title'], 20)
df_b = pd.DataFrame(common_words, columns=['bigrams', 'count'])

color = plt.cm.gist_earth(np.linspace(0, 1, 25))
df_b.plot(x='bigrams', y='count', kind='bar', figsize=(15, 6), color=color)
plt.title("Bigrams in Questions - Top 20")
plt.savefig('bigrams-question.png', dpi=600, bbox_inches="tight")

Once you execute this code, it will generate a chart like the one given below:

We can see that the frequently occurring bigrams are about “content types”, “custom domain”, “home page”, “virtual currency”, and “Google Analytics”. Now we can dig deeper into the questions containing these bigrams and identify possible recurring themes. This insight can be used to improve the content available in the community and the product.
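To dig into one of those themes, e.g. “custom domain”, you can filter the same title column; a sketch with hypothetical sample titles:

```python
import pandas as pd

# Hypothetical sample titles standing in for the exported questions
dfq = pd.DataFrame({'title': [
    'How do I set up a custom domain?',
    'Custom domain SSL certificate question',
    'How do I invite members?',
]})

# Case-insensitive match on the bigram of interest
theme = dfq[dfq['title'].str.contains('custom domain', case=False)]
print(theme)
```

Repeating this for each top bigram gives a per-theme list of questions that can feed directly into knowledge-base articles.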

Next up: sentiment analysis

This concludes the first part of this post, which is exploratory in nature. Here we discussed different ways to visualize the frequent terms used in the questions. In the next part of this post, we will delve into Natural Language Processing (NLP) to perform sentiment analysis on the comments posted in the community. As mentioned earlier, this will be most useful for analyzing the reviews and feedback collected in a community.
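As a preview of the idea behind lexicon-based sentiment scoring, here is a toy sketch with a hand-picked word list (not a production NLP library; the word lists are illustrative assumptions):

```python
# Toy lexicon-based sentiment: +1 for each positive word, -1 for each negative word
POSITIVE = {'great', 'love', 'helpful', 'awesome'}
NEGATIVE = {'broken', 'slow', 'confusing', 'bad'}

def sentiment_score(text: str) -> int:
    """Sum +1/-1 contributions of known words; 0 means neutral."""
    words = text.lower().split()
    return sum((w in POSITIVE) - (w in NEGATIVE) for w in words)

print(sentiment_score('great platform love the helpful docs'))  # positive score
print(sentiment_score('the editor is slow and confusing'))      # negative score
```

Real sentiment libraries use far larger lexicons plus handling for negation and intensifiers, but the scoring principle is the same.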


Marketing at Tribe. I raise ARR for a living! Love motorbikes and new cuisines.
