import re

# http://stackoverflow.com/a/13752628/6762004
RE_EMOJI = re.compile('[\U00010000-\U0010ffff]', flags=re.UNICODE)

def strip_emoji(text):
    return RE_EMOJI.sub(r'', text)

print(strip_emoji('🙄🤔'))
Thanks a lot
It works for me
Thanks, works very well
In this question on Stack Overflow, a user said that this function doesn't cover all emojis, so it is better to use:
def strip_emoji(text):
    RE_EMOJI = re.compile(u'([\U00002600-\U000027BF])|([\U0001f300-\U0001f64F])|([\U0001f680-\U0001f6FF])')
    return RE_EMOJI.sub(r'', text)
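A quick comparison of the two patterns, as a rough sketch rather than a definitive test. The original matches every code point above U+FFFF, while the block-based one matches three specific ranges. The three sample characters below are picked only for illustration (they do not come from the thread), and the names `ASTRAL` and `BLOCKS` are made up for this example.

```python
import re

# Pattern from the original gist: every code point outside the
# Basic Multilingual Plane (U+10000 and above).
ASTRAL = re.compile('[\U00010000-\U0010ffff]')

# Pattern from the Stack Overflow suggestion quoted above.
BLOCKS = re.compile(
    '[\U00002600-\U000027BF]|[\U0001F300-\U0001F64F]|[\U0001F680-\U0001F6FF]'
)

# Illustrative samples (an assumption of this sketch, not from the thread).
SAMPLES = {
    'U+2600 sun': '\u2600',                 # emoji-like symbol inside the BMP
    'U+1F914 thinking face': '\U0001F914',  # emoji above U+1F6FF
    'U+1D4B3 math script X': '\U0001D4B3',  # astral character that is not an emoji
}

for label, char in SAMPLES.items():
    print(label,
          'astral:', bool(ASTRAL.search(char)),
          'blocks:', bool(BLOCKS.search(char)))
```

On these samples the original pattern misses the BMP symbol and removes the mathematical letter, while the block-based pattern misses the thinking face from the gist's own example. Neither is complete.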
For the record, this is the pattern we are using:
# https://en.wikipedia.org/wiki/Unicode_block
EMOJI_PATTERN = re.compile(
    "["
    "\U0001F1E0-\U0001F1FF"  # flags (iOS)
    "\U0001F300-\U0001F5FF"  # symbols & pictographs
    "\U0001F600-\U0001F64F"  # emoticons
    "\U0001F680-\U0001F6FF"  # transport & map symbols
    "\U0001F700-\U0001F77F"  # alchemical symbols
    "\U0001F780-\U0001F7FF"  # Geometric Shapes Extended
    "\U0001F800-\U0001F8FF"  # Supplemental Arrows-C
    "\U0001F900-\U0001F9FF"  # Supplemental Symbols and Pictographs
    "\U0001FA00-\U0001FA6F"  # Chess Symbols
    "\U0001FA70-\U0001FAFF"  # Symbols and Pictographs Extended-A
    "\U00002702-\U000027B0"  # Dingbats
    "\U000024C2-\U0001F251"
    "]+"
)
@mgaitan it works perfectly for me, thanks a lot 💖
def add_space_between_emojies(text):
    # Ref: https://gist.github.com/Alex-Just/e86110836f3f93fe7932290526529cd1#gistcomment-3208085
    # Ref: https://en.wikipedia.org/wiki/Unicode_block
    EMOJI_PATTERN = re.compile(
        "(["
        "\U0001F1E0-\U0001F1FF"  # flags (iOS)
        "\U0001F300-\U0001F5FF"  # symbols & pictographs
        "\U0001F600-\U0001F64F"  # emoticons
        "\U0001F680-\U0001F6FF"  # transport & map symbols
        "\U0001F700-\U0001F77F"  # alchemical symbols
        "\U0001F780-\U0001F7FF"  # Geometric Shapes Extended
        "\U0001F800-\U0001F8FF"  # Supplemental Arrows-C
        "\U0001F900-\U0001F9FF"  # Supplemental Symbols and Pictographs
        "\U0001FA00-\U0001FA6F"  # Chess Symbols
        "\U0001FA70-\U0001FAFF"  # Symbols and Pictographs Extended-A
        "\U00002702-\U000027B0"  # Dingbats
        "])"
    )
    text = re.sub(EMOJI_PATTERN, r' \1 ', text)
    return text
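One side effect of wrapping every match in spaces is that consecutive emojis end up separated by two spaces, and an emoji at the end of the text leaves a trailing space. A possible variant that tidies this up is sketched below. The function name `add_space_between_emojis_tidy` and the cleanup step are assumptions for illustration, not part of the comment above.

```python
import re

# Same ranges as the pattern above, captured one emoji at a time.
EMOJI_PATTERN = re.compile(
    "(["
    "\U0001F1E0-\U0001F1FF"
    "\U0001F300-\U0001F5FF"
    "\U0001F600-\U0001F64F"
    "\U0001F680-\U0001F6FF"
    "\U0001F700-\U0001F77F"
    "\U0001F780-\U0001F7FF"
    "\U0001F800-\U0001F8FF"
    "\U0001F900-\U0001F9FF"
    "\U0001FA00-\U0001FA6F"
    "\U0001FA70-\U0001FAFF"
    "\U00002702-\U000027B0"
    "])"
)


def add_space_between_emojis_tidy(text):
    """Hypothetical variant: pad each emoji, then collapse repeated spaces."""
    spaced = EMOJI_PATTERN.sub(r" \1 ", text)
    # Collapse runs of spaces only (newlines are left alone), then trim
    # leading and trailing whitespace from the whole string.
    return re.sub(r" {2,}", " ", spaced).strip()


print(add_space_between_emojis_tidy("Python is fun💚💚"))
```

Note that the final `strip()` also removes any leading or trailing whitespace the input already had, which may or may not be wanted.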
EDIT:
I deleted the last range, "\U000024C2-\U0001F251", because it also matches Persian characters, which caused a bug for me.
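The range "\U000024C2-\U0001F251" runs from U+24C2 all the way to U+1F251, so it covers far more than emoji. The check below is a small sketch of that. The sample characters are my own picks, and the idea that Arabic presentation forms are what triggered the bug above is only a guess, since the commenter does not say which characters were affected.

```python
import re
import unicodedata

# The single wide range that was removed from the pattern.
WIDE_RANGE = re.compile("[\U000024C2-\U0001F251]")

# Illustrative samples (assumptions of this sketch, not from the thread).
SAMPLES = [
    "\u4e2d",  # U+4E2D: a CJK ideograph
    "\uac00",  # U+AC00: a Hangul syllable
    "\ufb56",  # U+FB56: in the Arabic Presentation Forms-A block
    "\u0633",  # U+0633: an ordinary Arabic-script letter, also used in Persian
]


def in_wide_range(char):
    """Return True if the single character falls inside the wide range."""
    return WIDE_RANGE.fullmatch(char) is not None


for char in SAMPLES:
    name = unicodedata.name(char, "<unnamed>")
    print(f"U+{ord(char):04X} {name}: {in_wide_range(char)}")
```

The first three samples fall inside the range and would be stripped; the basic letter at U+0633 lies below U+24C2 and is not matched.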
Hello, I credited your work for a workaround in a youtube-dl issue:
ytdl-org/youtube-dl#5042 (comment)
It has helped a lot, thank you.
If you have from __future__ import unicode_literals at the top of your file, you need to escape "-" like this:
EMOJI_PATTERN = re.compile(
    "["
    "\U0001F1E0-\U0001F1FF"  # flags (iOS)
    "\U0001F300-\U0001F5FF"  # symbols & pictographs
    "\U0001F600-\U0001F64F"  # emoticons
    "\U0001F680-\U0001F6FF"  # transport & map symbols
    "\U0001F700-\U0001F77F"  # alchemical symbols
    "\U0001F780-\U0001F7FF"  # Geometric Shapes Extended
    "\U0001F800-\U0001F8FF"  # Supplemental Arrows-C
    "\U0001F900-\U0001F9FF"  # Supplemental Symbols and Pictographs
    "\U0001FA00-\U0001FA6F"  # Chess Symbols
    "\U0001FA70-\U0001FAFF"  # Symbols and Pictographs Extended-A
    "\U00002702-\U000027B0"  # Dingbats
    "\U000024C2-\U0001F251"
    "]+"
)
Otherwise you will get a "bad character range" error, as in this SO question.
Thanks for your help.
def add_space_between_emojies(text):
    '''
    >>> add_space_between_emojies('Python is fun 💚')
    'Python is fun '
    '''
    from advertools.emoji import EMOJI
    EMOJI_PATTERN = EMOJI
    text = re.sub(EMOJI_PATTERN, r'', text)
    return text
Sorry to say this, but I think @mgaitan's regex is not perfect.
Recent emoji include various combinations and sequences of code points, so a complete pattern would be a more complex expression.
A good example implementation in JavaScript is https://github.com/mathiasbynens/emoji-regex
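To make the point about combinations concrete, here is a small sketch. Many emoji are sequences that include the zero-width joiner (U+200D) or variation selector-16 (U+FE0F), and a pattern built only from block ranges leaves those invisible characters behind. `RANGES` repeats the ranges posted earlier in this thread, without the wide "\U000024C2-\U0001F251" range. `EMOJI_AND_JOINERS` is a hypothetical extension for illustration, not a complete solution.

```python
import re

# Block ranges from the pattern posted earlier in this thread.
RANGES = (
    "\U0001F1E0-\U0001F1FF"
    "\U0001F300-\U0001F5FF"
    "\U0001F600-\U0001F64F"
    "\U0001F680-\U0001F6FF"
    "\U0001F700-\U0001F77F"
    "\U0001F780-\U0001F7FF"
    "\U0001F800-\U0001F8FF"
    "\U0001F900-\U0001F9FF"
    "\U0001FA00-\U0001FA6F"
    "\U0001FA70-\U0001FAFF"
    "\U00002702-\U000027B0"
)
EMOJI_PATTERN = re.compile("[" + RANGES + "]+")

# Hypothetical extension: also strip the joiner and the variation selector.
EMOJI_AND_JOINERS = re.compile("[" + RANGES + "\u200d\ufe0f" + "]+")

FAMILY = "\U0001F468\u200d\U0001F469\u200d\U0001F467"  # man + ZWJ + woman + ZWJ + girl
RED_HEART = "\u2764\ufe0f"                             # heart + variation selector-16


def code_points(text):
    """List the code points of a string as hex, so invisible ones show up."""
    return [hex(ord(char)) for char in text]


print(code_points(EMOJI_PATTERN.sub("", FAMILY)))      # ['0x200d', '0x200d']
print(code_points(EMOJI_PATTERN.sub("", RED_HEART)))   # ['0xfe0f']
print(code_points(EMOJI_AND_JOINERS.sub("", FAMILY)))  # []
```

Stripping U+200D everywhere is a blunt fix, because some scripts use the joiner legitimately in ordinary text. That is one reason a sequence-aware pattern generated from the Unicode emoji data is the more robust route.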
@clichedmoog you are totally right, everything here is a simplification. For a complete and accurate emoji remover for Python, I recommend the library https://github.com/bsolomon1124/demoji, which downloads the latest emoji specification to build the pattern. It's not super fast, but it's exhaustive.
Does not work when the emoji is at the end of a sentence.