Access data in a nested structure array and generate an excel file

12 views (last 30 days)
Hello everyone,
I have a bunch of .json files in a folder called "JSON files". Each file is an article. I would like to put the paper_id, title, abstract, and authors' affiliation in an excel files.
files = dir('JSON files/*.json');
for i=1:numel(files)
filename = fullfile(files(i).folder, files(i).name);
data = jsondecode(fileread(filename));
end
the "data" in the above code for each i looks like
data =
struct with fields:
paper_id: '0015023cc06b5362d332b3baf348d11567ca2fbb'
metadata: [1×1 struct]
abstract: [2×1 struct]
body_text: [20×1 struct]
bib_entries: [1×1 struct]
ref_entries: [1×1 struct]
back_matter: [1×1 struct]
and
data.metadata =
struct with fields:
title: 'The RNA pseudoknots in foot-and-mouth disease virus are dispensable for genome replication but essential for the production of infectious virus. 2 3'
authors: [17×1 struct]
and
data. metadata.authors
ans =
17×1 struct array with fields:
first
middle
last
suffix
affiliation
email
for this specific article the affiliation is empty. However most of the .json files have an authors' affiliation. Moreover, each author may have different affiliation from her co-authors. I would like to keep all afiliations.
Any idea how I can solve this problem? Any input would be greatly appreciated!

Accepted Answer

Sindar
Sindar on 6 May 2020
Edited: Sindar on 6 May 2020
[Edited with fixed answer, x2]
This throws no errors and appears to produce a correct excel file from your sample data:
files = dir('JSON files/*.json');
% start off a table to put info in, with the correct size and column names
mytable=table('Size',[numel(files) 4],'VariableTypes',{'string';'string';'string';'string'},'VariableNames',{'pid';'title';'abstract';'affiliations'});
for ind=1:numel(files)
% load data
filename = fullfile(files(ind).folder, files(ind).name);
data = jsondecode(fileread(filename));
% for the ind-th row, fill in the pid column
mytable{ind,'pid'} = {data.paper_id};
% title column
mytable{ind,'title'} = {data.metadata.title};
% abstract column
% if the abstract is empty, it will get left as <missing> and print an empty cell in excel
if ~isempty(data.abstract)
this_abstract = {data.abstract.text};
% concatenate abstract lines with spaces
mytable{ind,'abstract'} = {strjoin(this_abstract,' ')};
end
% affiliations column
% check which authors have affiliation info
tmp=false([0 0]);
for ind_A = 1:length(data.metadata.authors)
tmp(ind_A) = ~isempty(fieldnames(data.metadata.authors(ind_A).affiliation));
end
% put all affiliations in a single struct array
if ~isempty(tmp) && nnz(tmp)>0
this_affiliation = [data.metadata.authors(tmp).affiliation];
% put all institutions in a single cell array
this_affiliation = {this_affiliation.institution};
% remove duplicates, sort, and make sure the first element is an empty string
this_affiliation = unique([{''} this_affiliation]);
% combine all the affiliations, separating by '; ' (ignore empty first string)
mytable{ind,'affiliations'} = {strjoin(this_affiliation(2:end),'; ')};
end
% clear temporary variable (tmp especially)
clear tmp this_abstract this_affiliation
end
% write table to myData.xls
writetable(mytable,'myData.xls')
If you have a lot of files and don't want to store all the data in a table before writing, use the append option:
mytable=table('Size',[1 4],'VariableTypes',{'string';'string';'string';'string'},'VariableNames',{'pid';'title';'abstract';'affiliations'});
writetable(mytable,'myData.xls');
for ind
...
mytable{1,'pid'} = {data.paper_id};
...
writetable(mytable,'myData.xls','WriteMode','Append','WriteVariableNames',false)
end
  16 Comments
Susan
Susan on 6 May 2020
Glad to hear this data set may help with that! Thanks again for your help.

Sign in to comment.

More Answers (0)

Community Treasure Hunt

Find the treasures in MATLAB Central and discover how the community can help you!

Start Hunting!