Forum Posts Word Count

Author: badger

Posts

Total: 17
badger
Debates: 0
Posts: 2,087
So I was thinking forum post count is a bit of an abstraction away from real life. Word count, however, is something we see in a whole lot of places.

I wrote a script to get word counts for people's forum posts. It's not exactly correct: I'm only splitting on spaces, so words joined by forward slashes or colons get counted as one. Also, I suck at web development and the only response I could get from this place was HTML, so that was a pain. (Was gonna do debates, debate comments and questions too, until I realised how much of a pain it was.)
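
A minimal sketch of the splitting issue described above, assuming that "/" and ":" should act as word separators. The function names here are illustrative and are not part of the original script.

```python
import re


def count_words_whitespace(text):
    # Same behaviour as the script: split on runs of whitespace only.
    return len(text.split())


def count_words_split_punct(text):
    # Assumption: "/" and ":" also separate words; empty pieces are dropped.
    return len([w for w in re.split(r"[\s/:]+", text) if w])


sample = "either/or is a yes:no choice"
print(count_words_whitespace(sample))
print(count_words_split_punct(sample))
```

One trade-off of the second variant: URLs such as https://www.debateart.com get broken into several "words", so neither count is exact.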

RationalMadman has written 983,946 words on the DebateArt forums. That's roughly equivalent to 14 theses or 11 novels, or the entire Harry Potter series.

That's fun, right?
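
For a sense of scale, here is a small arithmetic sketch. The per-thesis and per-novel lengths are only what the comparison above implies, not figures stated in the post, and the post count (19,930) is the one shown on RationalMadman's profile in this thread, which may not have been captured at the same moment as the word count.

```python
TOTAL_WORDS = 983_946   # word count reported in the post
POST_COUNT = 19_930     # post count shown on the profile in this thread


def per_item(total, items):
    # Average size of one item if `total` is spread evenly over `items`.
    return total / items


print(round(per_item(TOTAL_WORDS, 14)))             # implied words per thesis
print(round(per_item(TOTAL_WORDS, 11)))             # implied words per novel
print(round(per_item(TOTAL_WORDS, POST_COUNT), 1))  # average words per post
```

That works out to roughly 70,000 words per thesis, roughly 89,000 per novel, and about 49 words per post.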
RationalMadman
Debates: 566
Posts: 19,930
I never related to the way boys or men talk so little when they speak. It's like they say shit in the most basic and dull way possible. I'm rather verbose and I like being that way.

That doesn't mean I speak amazingly 'well' to women, but the reverse is very true: when I hear the average female preacher or lecturer on a topic, I am more likely to comprehend her way of teaching. This happened to me my entire school life through to university, and I could never pinpoint why until I realised it's because women word things more strung along, where men leave gaps, often expecting you to fill in the blanks (men literally talk in bullet points quite often).
RationalMadman
Not that I am 'embarrassed' to post that much; I know I do that most places I verbally/textually interact. But I must say that you surely include quotes, links etc. in that count. I am not downplaying the count being so high, but some of it is due to quoting.
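
A rough sketch of how quoted text could be left out of a count. It assumes quotes are wrapped in <blockquote> elements, which has not been checked against DebateArt's actual markup; the class and function names are illustrative.

```python
from html.parser import HTMLParser


class QuoteSkippingText(HTMLParser):
    """Collects text, ignoring anything inside <blockquote> (assumed quote markup)."""

    def __init__(self):
        super().__init__()
        self.depth = 0   # number of <blockquote> elements currently open
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag == "blockquote":
            self.depth += 1

    def handle_endtag(self, tag):
        if tag == "blockquote" and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        # Keep text only when we are not inside a quote.
        if self.depth == 0:
            self.parts.append(data)


def count_words_without_quotes(post_html):
    parser = QuoteSkippingText()
    parser.feed(post_html)
    parser.close()
    return len(" ".join(parser.parts).split())


sample = "<div><blockquote>someone else said this</blockquote><p>my own reply</p></div>"
print(count_words_without_quotes(sample))
```

Links would still be counted as words here; only the quoted block is skipped.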
PREZ-HILTON
Debates: 18
Posts: 2,806
@badger
"RationalMadman has written 983,946 words on the DebateArt forums"
I am working on fixing this by having Mike make the lifetime word count of an individual 500,000. 
Best.Korea
Debates: 271
Posts: 7,855
"RationalMadman has written 983,946 words on the DebateArt forums. That's roughly equivalent to 14 theses or 11 novels, or the entire Harry Potter series."
That's a lot.


Best.Korea
Shila would surpass everyone if she continued her 80 posts per day.
Intelligence_06
Debates: 167
Posts: 3,837
@PREZ-HILTON
"I am working on fixing this by having Mike make the lifetime word count of an individual 500,000."
So a word limit. Upper bound or lower bound?

If I misunderstood anything, feel free to correct me.

Intelligence_06
The main point of posting here is definitely not posting for the sake of posting. On the contrary, that is spam. We are never meant to just post; we see stuff and we present our opinion(s), and that is a post or more. That is how it works.

I suggest the default forums leaderboard ranking should be based on likes/posts ratio. At least clickbaity titles are better than spamming videos on youtube.com.


Intelligence_06
Actually, having the leaderboard based on the aggregate number of likes is probably better. 
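
To illustrate how the two suggested rankings can disagree, here is a small sketch. The member names and numbers are hypothetical, made up purely for illustration; they are not real DebateArt data.

```python
# Hypothetical members and numbers, for illustration only.
members = [
    {"name": "member_a", "likes": 120, "posts": 2000},
    {"name": "member_b", "likes": 40, "posts": 100},
    {"name": "member_c", "likes": 90, "posts": 300},
]


def rank_by_ratio(rows):
    # Likes per post; members with zero posts get a ratio of 0.
    def ratio(m):
        return m["likes"] / m["posts"] if m["posts"] else 0.0
    return [m["name"] for m in sorted(rows, key=ratio, reverse=True)]


def rank_by_total_likes(rows):
    # Aggregate likes, regardless of how many posts it took to earn them.
    return [m["name"] for m in sorted(rows, key=lambda m: m["likes"], reverse=True)]


print(rank_by_ratio(members))
print(rank_by_total_likes(members))
```

With these made-up numbers the ratio ranking favours the member with few, well-liked posts, while the aggregate ranking favours the most prolific one.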
BearMan
Debates: 16
Posts: 1,067
@badger
send script 
i can help figure out debate comments + questions if u want
badger
@BearMan
import urllib.request
import re
from bs4 import BeautifulSoup

word_count = 0


def count_words(text):
    # Splits on whitespace only.
    words = text.split()
    return len(words)


def get_post_text(html_string, thread_id, post_id):
    # Locate the post's permalink, then count the words in the post body after it.
    soup = BeautifulSoup(html_string, 'html5lib')
    post_link = soup.find('a', href=f'/forum/topics/{thread_id}/post-links/{post_id}', rel='nofollow')
    post_text_div = post_link.find_next('div', class_='forum-topic-show__post-text', itemprop="text")
    i = count_words(post_text_div.text)

    return i


# First page of posts on the profile:
# o = urllib.request.urlopen("https://www.debateart.com/participants/RationalMadman/forum_posts")
# b = o.read()
# s = b.decode("utf-8")
# matches = re.findall("a href=\"/forum/topics/(\\d+)/post-links/(\\d+)", s, re.DOTALL)
# for match in matches:
#     topic = match[0]
#     post = match[1]
#     url = "https://www.debateart.com" + "/forum/topics/" + str(topic) + "/post-links/" + str(post)
#     o = urllib.request.urlopen(url)
#     b = o.read()
#     s = b.decode("utf-8")
#     html_string = s
#     i = get_post_text(html_string, match[0], match[1])
#     word_count += i

curr = 859


# Remaining pages. The condition is always truthy, so this loop has to be stopped by hand.
while urllib.request.urlopen(f"https://www.debateart.com/participants/RationalMadman/forum_posts?page={curr}"):
    o = urllib.request.urlopen(f"https://www.debateart.com/participants/RationalMadman/forum_posts?page={curr}")
    b = o.read()
    s = b.decode("utf-8")
    matches = re.findall("a href=\"/forum/topics/(\\d+)/post-links/(\\d+)", s, re.DOTALL)
    for match in matches:
        topic = match[0]
        post = match[1]
        url = "https://www.debateart.com" + "/forum/topics/" + str(topic) + "/post-links/" + str(post)
        o = urllib.request.urlopen(url)
        b = o.read()
        s = b.decode("utf-8")
        html_string = s
        i = get_post_text(html_string, match[0], match[1])
        word_count += i
    print(curr)
    print(word_count)
    curr += 1


print(word_count)
Just takes too long. Site is all PHP and HTML. All you can get back is full-page HTML on every request, and then you need to search that. About 7k lines for every post.
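
One way the wall-clock time might be cut is to fetch several post pages at once. This is only a sketch: `count_all` and the `fetch_and_count` callable are hypothetical names, it assumes each request is independent, and whether the site tolerates concurrent requests is unknown, so the worker count is kept small.

```python
from concurrent.futures import ThreadPoolExecutor


def count_all(post_ids, fetch_and_count, max_workers=4):
    """Sum word counts for (topic, post) pairs using a small thread pool.

    fetch_and_count is a hypothetical callable that takes (topic, post),
    downloads the page, and returns that post's word count as an int.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return sum(pool.map(lambda pair: fetch_and_count(*pair), post_ids))


# Usage with a stand-in function instead of real network calls:
print(count_all([("1", "2"), ("3", "4")], lambda topic, post: int(topic) + int(post)))
```

This does not reduce the amount of HTML per request; it only overlaps the waiting.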
sadolite
Debates: 0
Posts: 2,928
If you wrote one word every second, it would take about 11 days to write 983,946 words. With that said, over a few years, meh.
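
The arithmetic behind that estimate, as a quick check (assuming non-stop writing at exactly one word per second):

```python
WORDS = 983_946
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400

days = WORDS / SECONDS_PER_DAY
print(round(days, 2))
```

That comes to a little over 11 days.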
BearMan
@badger
github?
indentation is being screwed up
badger
@BearMan
Simple loops, dude. Indent everything under the while loop once. Indent everything under the for loop once more, down to word_count += i. The first commented-out bit gets the first page of comments on the user profile. The while loop gets everything else from page=2. curr was set to 859 there because I did it in increments. Set it to 2 to run from the beginning.
badger
Everything under the for loop in the first commented-out part is indented once.
badger
import urllib.request
import re
from bs4 import BeautifulSoup

word_count = 0


def count_words(text):
    words = text.split()
    return len(words)


def get_post_text(html_string, thread_id, post_id):
    soup = BeautifulSoup(html_string, 'html5lib')
    post_link = soup.find('a', href=f'/forum/topics/{thread_id}/post-links/{post_id}', rel='nofollow')
    post_text_div = post_link.find_next('div', class_='forum-topic-show__post-text', itemprop="text")
    i = count_words(post_text_div.text)

    return i


# o = urllib.request.urlopen("https://www.debateart.com/participants/RationalMadman/forum_posts")
# b = o.read()
# s = b.decode("utf-8")
# matches = re.findall("a href=\"/forum/topics/(\\d+)/post-links/(\\d+)", s, re.DOTALL)
# for match in matches:
#     topic = match[0]
#     post = match[1]
#     url = "https://www.debateart.com" + "/forum/topics/" + str(topic) + "/post-links/" + str(post)
#     o = urllib.request.urlopen(url)
#     b = o.read()
#     s = b.decode("utf-8")
#     html_string = s
#     i = get_post_text(html_string, match[0], match[1])
#     word_count += i

curr = 859


while urllib.request.urlopen(f"https://www.debateart.com/participants/RationalMadman/forum_posts?page={curr}"):
    o = urllib.request.urlopen(f"https://www.debateart.com/participants/RationalMadman/forum_posts?page={curr}")
    b = o.read()
    s = b.decode("utf-8")
    matches = re.findall("a href=\"/forum/topics/(\\d+)/post-links/(\\d+)", s, re.DOTALL)
    for match in matches:
        topic = match[0]
        post = match[1]
        url = "https://www.debateart.com" + "/forum/topics/" + str(topic) + "/post-links/" + str(post)
        o = urllib.request.urlopen(url)
        b = o.read()
        s = b.decode("utf-8")
        html_string = s
        i = get_post_text(html_string, match[0], match[1])
        word_count += i
    print(curr)
    print(word_count)
    curr += 1


print(word_count)

badger
Also the "while urllib.request.urlopen(f"https://www.debateart.com/participants/RationalMadman/forum_posts?page={curr}"):" always evaluates as true no matter the value of curr, because the site just gives a pop-up. I just found the curr for RM's number of pages of posts on his profile and took the word_count printed under it. You might want to fix that too, or input it manually, or whatever.

Script is honestly not worth it, searching HTML is dumb.
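
A possible fix for the never-ending while loop, sketched under two assumptions that have not been verified against the site: a page past the end contains no post links (or only repeats links already seen), and `?page=1` returns the same listing as the bare profile URL. The names `collect_post_ids`, `fetch`, and `max_pages` are illustrative; the regex is the one from the script.

```python
import re
import urllib.request

POST_LINK = re.compile(r'a href="/forum/topics/(\d+)/post-links/(\d+)')


def fetch(url):
    # Plain urllib download, decoded as UTF-8 like the original script.
    with urllib.request.urlopen(url) as response:
        return response.read().decode("utf-8")


def collect_post_ids(user, start_page=1, fetch_page=fetch, max_pages=2000):
    """Return (topic, post) pairs, stopping at the first page with no new links."""
    seen = []
    seen_set = set()
    for page in range(start_page, start_page + max_pages):
        url = f"https://www.debateart.com/participants/{user}/forum_posts?page={page}"
        new = []
        for pair in POST_LINK.findall(fetch_page(url)):
            if pair not in seen_set:
                seen_set.add(pair)
                new.append(pair)
        if not new:
            # Assumed end of the listing: nothing on this page we have not seen.
            break
        seen.extend(new)
    return seen
```

Each returned pair can then be passed to get_post_text as before. The max_pages cap is just a safety net in case the stopping assumption turns out to be wrong.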