To replace symbols and http links in strings, I figured I really need to get to grips with regular expressions.

Reference sites

Trying it out

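When you think "I just want to pull the digits out of a string", it is easier to write the single token \d than to search for anything that matches one of 0, 1, 2, ..., 9 (and better still if each run of consecutive digits can be extracted as one unit). That is how I understand the idea. First, extracting http links.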
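Before moving on to links, the \d point above can be checked with a minimal sketch. The sample string is made up for illustration and is not from the competition data.

```python
import re

# Hypothetical sample string, not taken from the dataset.
sample = "order 66 shipped 2022-01-20, 3 items"

# \d matches a single digit; \d+ matches a maximal run of consecutive digits.
single_digits = re.findall(r"\d", sample)
digit_runs = re.findall(r"\d+", sample)

print(single_digits)
print(digit_runs)
```

With \d+ each number comes back as one unit, which is usually what you want when extracting values rather than individual characters.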

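In r'https?://\S+|www\.\S+', the vertical bar | is alternation, so the pattern matches either https?://\S+ or www\.\S+. The ? means the preceding character appears 0 or 1 times, so this covers both http and https; :// is taken literally; \S+ is one or more non-whitespace characters (not whitespace: that would be \s). In www\.\S+, \. is a literal dot (an unescaped . would match any single character), and \S+ is again one or more non-whitespace characters. As for the difference between the two: the second alternative picks up links written without a scheme, for example something like www.example.com.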

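import re

# df is assumed to be the competition DataFrame with a "text" column
template = re.compile(r'https?://\S+|www\.\S+')
tmp0 = df["text"].apply(lambda x: template.search(x))  # Match object, or None when no link is found

# Sanity check on a concrete string
text = tmp0.dropna().iloc[0].group() # http://www.laayouneinvest.ma/fr/index.asp)
template.sub(r"", text) # ''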
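To see what each alternative of the pattern picks up without needing the DataFrame, here is a small self-contained sketch. The sample sentence and the example.* hosts are invented for illustration.

```python
import re

url_pattern = re.compile(r"https?://\S+|www\.\S+")

# Hypothetical sample text, not taken from the dataset.
sample = "see (http://example.com/a.asp) or www.example.org and https://example.net/x now"

# \S+ is greedy and only stops at whitespace, so the closing ")" is captured too.
matches = url_pattern.findall(sample)
print(matches)

# Removing the links leaves the surrounding text (and extra spaces) behind.
cleaned = url_pattern.sub("", sample)
print(cleaned)

# \. is a literal dot, so "www" followed by another character does not match.
no_dot = url_pattern.search("wwwXexample.com")
print(no_dot)
```

Note that the match swallows trailing punctuation such as ")", which is also visible in the real example above. For plain removal that is harmless, but it matters if the links themselves are to be kept.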

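In [^a-zA-Z\d], a-z is the lowercase letters, A-Z is the uppercase letters, and \d is the digits 0-9. A ^ at the start of the brackets negates the class, so the pattern matches everything else, and replacing those matches drops all symbols. Personally I think "!" is a symbol that may be worth keeping: a comment with lots of "!!!" somehow looks aggressive. In fact, comments containing at least one "!" have a higher mean score y (more toxic) than comments without any. So I keep only "!". That said, ten of them in a row seems likely to get in the way, so each run is collapsed into a single "!".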

text = "dfd.adjaofspfdu304738"
text = re.sub(r"[^a-zA-Z\d]", " ", text) # 'dfd adjaofspfdu304738'

df["extram_cnt"] = df["text"].apply(lambda x: x.count("!"))
df["extram_cnt_01"] = df["extram_cnt"].apply(lambda x: "not_zero" if x != 0 else "zero")
agg_d = {
    "y": ["mean", "count"]
}

tmp = df.groupby("extram_cnt_01").agg(agg_d)
tmp.columns = ["_".join(i) for i in tmp.columns]
tmp

[Figure: output table of tmp, with columns y_mean and y_count for extram_cnt_01 = not_zero and zero]

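text = "dfd!!adjaofspfd@@@u304738"
# For comparison only (result not assigned): without "!" in the class, "!" is replaced as well
re.sub(r"[^a-zA-Z\d]", " ", text) # 'dfd  adjaofspfd   u304738'
# Keep "!". No "|" is needed inside [...]; there it would be a literal pipe character
text = re.sub(r"[^a-zA-Z\d!]", " ", text) # 'dfd!!adjaofspfd   u304738'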

text = "dfhafos!!!fhdof!!!"
text = re.sub(r"!+", r"!", text) # 'dfhafos!fhdof!'
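One detail about character classes that is easy to miss: inside [...], the | character is not alternation but a literal pipe. The sketch below, on a made-up string, compares a class that lists | with one that does not.

```python
import re

# Hypothetical sample string containing a pipe, "!" and "@".
sample = "a|b!c@d"

# Inside [...], "|" is a literal character, so this class also preserves "|".
with_pipe = re.sub(r"[^a-zA-Z\d|!]", " ", sample)

# Listing only "!" keeps letters, digits and "!" and nothing else.
without_pipe = re.sub(r"[^a-zA-Z\d!]", " ", sample)

print(with_pipe)
print(without_pipe)
```

So if the intent is "letters, digits, and ! only", the class should list ! on its own.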

Next, collapse runs of consecutive spaces into a single space.

text = "dfd!!adjaofs      pfd@@@u304738"
text = re.sub(' +', ' ', text) # 'dfd!!adjaofs pfd@@@u304738'

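Finally, remove emoji by Unicode range. This also removed Japanese and Hangul. I don't really understand the details of Unicode, so I am taking the ranges as given! When I checked, the last two ranges alone were enough for this data. (The last range, U+24C2 to U+1F251, is very wide and spans the CJK and Hangul blocks, which would explain why those scripts are caught as well.) Texts containing emoji: 1259 (out of 159571). The Japanese text did not contain anything particularly nasty either, so it should be fine to delete it!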

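def emoji_flg(text):
    # Returns the first run of characters in the ranges below, or None if there is none
    emoji_pattern = re.compile("["
                               u"\U0001F600-\U0001F64F"  # emoticons
                               u"\U0001F300-\U0001F5FF"  # symbols and pictographs
                               u"\U0001F680-\U0001F6FF"  # transport and map symbols
                               u"\U0001F1E0-\U0001F1FF"  # regional indicators (flags)
                               u"\U00002702-\U000027B0"  # dingbats
                               u"\U000024C2-\U0001F251"  # very wide range, includes CJK and Hangul
                               "]+", flags = re.UNICODE)
    # To remove instead of flag: text = emoji_pattern.sub(r'', text)
    match = emoji_pattern.search(text)
    return None if match is None else match.group()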
df["emoji_flg"] = df["text"].apply(emoji_flg)
len(df.emoji_flg.dropna()) # 1259
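To get a feel for what the ranges actually cover, the following sketch tests the last (widest) range on its own and then applies the full pattern with sub. The sample strings are made up, and wide_range is a name I introduce only for this check.

```python
import re

# Same six ranges as in emoji_flg above.
emoji_pattern = re.compile(
    "["
    "\U0001F600-\U0001F64F"
    "\U0001F300-\U0001F5FF"
    "\U0001F680-\U0001F6FF"
    "\U0001F1E0-\U0001F1FF"
    "\U00002702-\U000027B0"
    "\U000024C2-\U0001F251"
    "]+"
)

# The last range on its own, to see what it covers.
wide_range = re.compile("[\U000024C2-\U0001F251]+")

hiragana_hit = wide_range.search("こんにちは") is not None  # Hiragana block starts at U+3040
hangul_hit = wide_range.search("안녕") is not None          # Hangul Syllables start at U+AC00
ascii_hit = wide_range.search("hello") is not None
face_hit = wide_range.search("\U0001F600") is not None      # U+1F600 lies above U+1F251

removed = emoji_pattern.sub("", "good \U0001F600 こんにちは 안녕 ok")
print(hiragana_hit, hangul_hit, ascii_hit, face_hit)
print(removed.split())
```

The wide range matches Japanese and Hangul but not the grinning-face emoji at U+1F600, so "the last two ranges were enough" looks like a property of this particular dataset rather than of emoji in general; the first four ranges are still needed for the full pattern to strip that character.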

Next, strip tags, using BeautifulSoup's get_text. Things like <*> get removed.

from bs4 import BeautifulSoup

def f_test(text):
    # True when get_text() leaves the text unchanged, i.e. nothing was stripped
    soup = BeautifulSoup(text, 'lxml')
    only_text = soup.get_text()
    return text == only_text

df["bs_text"] = df["text"].apply(f_test)
bs_diff = df[~df.bs_text].copy()

i = 0
text = bs_diff.iloc[i,0]
soup = BeautifulSoup(text, 'lxml')
only_text = soup.get_text()
print(f"text:{text}")
print(f"only_text:{only_text}")

text:1)I am an admin, so I guess you were only re-stating the obvious, and 2) the article has been restored and reverted to the more complete version.   - '''' - <*>
only_text:1)I am an admin, so I guess you were only re-stating the obvious, and 2) the article has been restored and reverted to the more complete version.   - '''' -

ぶぶぶろぐ

Jigsaw Rate Severity of Toxic Comments (regular expressions for strings)
