Decoding UTF-8 emoji codes and special characters from Facebook data

Hi, I recently downloaded my Messenger data from Facebook in ".json" format.
This format was new to me, and it was quite interesting to load the file, play around with it, and turn it into a conversation view.
The problem is with decoding the emojis. I have no idea about the format. It looks something like this:
"\u00f0\u009f\u0098\u0082 \u00f0\u009f\u0098\u0082", whereas the actual emoji I used is 😂 😂.
In MATLAB, as shown in the figure, it displays the garbage "ð ð".
After a lot of searching on the internet, I learned that this is UTF-8 encoding. So I tried to read the file as UTF-8, following some answers from MATLAB Central:
clear; clc
fname = 'message_keller.json';
fid = fopen(fname, 'rb');
raw = fread(fid, '*uint8')';
str = native2unicode(raw,'UTF-8');
fclose(fid);
val = jsondecode(str);
But it still showed "ð ð".
The above link was the method I found for decoding, but that was for PowerShell.
Can anyone help me decode these characters so that they can be viewed in MATLAB and other software? (I am currently planning to export the conversation to Excel.)

4 Comments

Can you attach an example of the JSON file, so we can see how the file is actually encoded?
If the json actually contains the string "\u00f0\u009f\u0098\u0082 \u00f0\u009f\u0098\u0082" then your approach to decoding the unicode is indeed completely wrong.
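To make the point above concrete: each \u00XX escape in that string appears to hold one byte of the emoji's UTF-8 encoding rather than a real code point, so reading the file as UTF-8 and calling jsondecode yields one character per byte. A minimal sketch of undoing that, assuming every character in the decoded text has a value of 255 or less (uint8 would saturate otherwise), and using a hand-written sample string instead of the real export:

```matlab
% Sample JSON fragment built from the string quoted in the question
% (an illustration, not the actual export file).
raw = '{"content":"\u00f0\u009f\u0098\u0082 \u00f0\u009f\u0098\u0082"}';
val = jsondecode(raw);

garbled = val.content;                   % 9 chars, one per UTF-8 byte
bytes   = uint8(garbled);                % recover the byte values
fixed   = native2unicode(bytes, 'UTF-8') % reinterpret the bytes as UTF-8
```

Here fixed should hold the two emoji separated by a space. Whether they render in the Command Window depends on the font in use.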
I have attached the MAT file because I had to remove some private information.
Just load the MAT file into a variable called "val",
and you can use the code below to create the conversation view:
% Build a two-column conversation: column 1 = Yang, column 2 = Addy.
i = 1;
Yang = {};
Addy = {};
Convo = {};
% Walk through the messages from last to first.
for k = length(val.messages):-1:1
    msg = val.messages{k,1};
    if isfield(msg, 'content')   % skip messages that have no text content
        if strcmp(msg.sender_name, 'Yang')
            Yang(i,1) = cellstr(msg.content);
        else
            Addy(i,1) = cellstr(msg.content);
        end
        i = i + 1;
    end
end
% Copy each sender's messages into their column of the conversation.
for k = 1:length(Addy)
    Convo(k,2) = Addy(k,1);
end
for i = 1:length(Yang)
    Convo(i,1) = Yang(i,1);
end
And yes, I know I am wrong. I have no idea. That's why I need help figuring it out :)
I wanted the raw json, not the stuff you've parsed when it is too late to get the right characters. You can just replace the confidential bits with xs or dots.
Or just provide the actual portion of the raw json that corresponds to an actual message, e.g. one of the
{"message":{"sender_name":"Don't care","timestamp_ms":whatever,"content":"this is what I need","type":"Generic"}}
section.
Oops, sorry about that. Now I have attached the raw JSON file. You can view it just by double-clicking it.
Also, as I mentioned before, "\u00f0\u009f\u0098\u0082" is the emoji code for 😂, the laughing emoji. I did not parse it; it is in unparsed form. In the conversation I used it twice, which is why it repeats and looks like this: "\u00f0\u009f\u0098\u0082 \u00f0\u009f\u0098\u0082"
I have even checked it in Notepad++, and the code is the same.
In the new JSON file you can find these codes:
"\u00f0\u009f\u0098\u009b" - 😛
"\u00f0\u009f\u0091\u008d" - 👍
"\u00e3\u0080\u0082" - 。
and again
"\u00f0\u009f\u0098\u0082" - 😂
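The pairings listed above can be checked by treating the hex digits of each \u00XX escape as one UTF-8 byte. A small sketch along those lines, with the byte values typed in by hand from the list (the variable names are only illustrative):

```matlab
% UTF-8 byte values taken from the \u00XX escapes listed above.
codes = {[240 159 152 155], [240 159 145 141], [227 128 130], [240 159 152 130]};

decoded = cell(size(codes));
for k = 1:numel(codes)
    % Reinterpret the byte sequence as UTF-8 text.
    decoded{k} = native2unicode(uint8(codes{k}), 'UTF-8');
    % Print the bytes in hex next to the decoded character.
    fprintf('%s-> %s\n', sprintf('%02X ', codes{k}), decoded{k});
end
```

MATLAB stores text as UTF-16, so each emoji comes back as two char elements (a surrogate pair), while 。 is a single element with value 12290 (U+3002).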
This is a screenshot of the conversation.
The screenshot was taken from Notepad++; you can find the same content in the raw JSON file I have attached.


Answers (0)


Asked:

on 12 Oct 2018

Commented:

on 12 Oct 2018
