XML parsing vs regexp

Sebastian Holmqvist on 30 Jul 2012
I'm trying to extract values from two elements in a pretty large XML document. I'm stuck between doing it "the right way" and doing it "the fast way", i.e., parsing vs. regexp.
elem_num = 1e4;
%% Create sample XML string
% The wrapper tag name 'root' is a placeholder; the parser needs a single root element.
xml_str = cell(1, elem_num+2);
xml_str(1) = {'<root>'};
for i=1:elem_num
    xml_str(i+1) = {'<elem><aa>abc</aa><ab>def</ab></elem>'};
end
xml_str(elem_num+2) = {'</root>'};
xml_str = cell2mat(xml_str);
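%% Convert string to stream and parse
% ByteArrayInputStream is used in place of the deprecated StringBufferInputStream.
stream = java.io.ByteArrayInputStream(unicode2native(xml_str, 'UTF-8'));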
factory = javaMethod('newInstance', ...
'javax.xml.parsers.DocumentBuilderFactory');
builder = factory.newDocumentBuilder;
document = builder.parse(stream);
%% Parse DOM properly
tic;
aa_list = document.getElementsByTagName('aa');
aa_num = aa_list.getLength;
aa = cell(1, aa_num);
for i=1:aa_num
    % Java lists are 0-based; store the text as a char array in the cell.
    aa{i} = char(aa_list.item(i-1).getTextContent);
end
ab_list = document.getElementsByTagName('ab');
ab_num = ab_list.getLength;
ab = cell(1, ab_num);
for i=1:ab_num
    ab{i} = char(ab_list.item(i-1).getTextContent);
end
toc;
%% Use regexp
tic;
aa_regexp = regexp(xml_str, '(abc)', 'tokens');
ab_regexp = regexp(xml_str, '(def)', 'tokens');
toc;
As the code shows, parsing might be the correct way of handling XML, but it takes ages to compute compared to regexp.
% XML Parsing: Elapsed time is 3.222058 seconds.
% Regexp: Elapsed time is 0.050301 seconds.
Any tips on how to speed this up? E.g., another parser or a better way of doing it?

Answers (1)

Walter Roberson on 30 Jul 2012
Often, when HTML or XML is analyzed with extended regular expressions, the implementation is vulnerable to alternative representations of the closing quote on strings: it fails to detect a closing quote that HTML or XML says is there. The earlier problem was with "double byte character sets", and people learned to deal with that. But then people were caught off guard by Unicode code point representations of the double quote, such as via a \u or \ux escape sequence.
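A related and easier-to-reproduce case of "alternative representations" is an entity reference inside element text. The sketch below is only an illustration of the general concern, not of the specific quote problem described above; the one-element document, the wrapper tag `root`, and the variable names are made up for the example.

```matlab
% Hypothetical document: the text of <aa> contains an escaped ampersand.
xml_str = '<root><aa>a &amp; b</aa></root>';

% Regexp works on the raw markup, so the entity reference is returned as-is.
tok = regexp(xml_str, '<aa>([^<]*)</aa>', 'tokens');
raw_text = tok{1}{1};

% A DOM parser resolves the entity reference before handing back the text.
source = org.xml.sax.InputSource(java.io.StringReader(xml_str));
factory = javaMethod('newInstance', ...
    'javax.xml.parsers.DocumentBuilderFactory');
builder = factory.newDocumentBuilder;
document = builder.parse(source);
aa_list = document.getElementsByTagName('aa');
parsed_text = char(aa_list.item(0).getTextContent);

% The two results differ: raw markup versus decoded text.
disp(raw_text);
disp(parsed_text);
```

Here `raw_text` still contains `&amp;`, while `parsed_text` holds the decoded `&`. If the input is known never to contain entities, CDATA sections, or attributes on the tags of interest, the regexp shortcut may be acceptable; otherwise the results can silently diverge.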
  2 Comments
Sebastian Holmqvist on 30 Jul 2012
Ah, very informative, thanks!
Any ideas on how to speed up the parsing? 3 seconds for 1e4 repeated elements is killing me at the moment.
Walter Roberson on 30 Jul 2012
Sorry, I have never used the parser.
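One possible direction for the speed question, offered as a sketch rather than a measured fix: if the time goes mostly into the tens of thousands of individual MATLAB-to-Java calls in the loops, it may help to let Java do the whole extraction in one call and split the result in MATLAB. The example below assumes an XSLT 1.0 stylesheet run through `javax.xml.transform`. It also assumes the element values contain no newlines and no significant leading or trailing whitespace. The stylesheet, the wrapper tag `root`, and the three-element sample are invented for the illustration, and whether this is actually faster would need to be timed on the real data.

```matlab
% Small sample shaped like the one in the question ('root' is a placeholder name).
xml_str = ['<root>' ...
    repmat('<elem><aa>abc</aa><ab>def</ab></elem>', 1, 3) ...
    '</root>'];

% Stylesheet that writes the text of every <aa> element, one per line.
xsl_str = ['<xsl:stylesheet version="1.0" ' ...
    'xmlns:xsl="http://www.w3.org/1999/XSL/Transform">' ...
    '<xsl:output method="text"/>' ...
    '<xsl:template match="/">' ...
    '<xsl:for-each select="//aa">' ...
    '<xsl:value-of select="."/><xsl:text>&#10;</xsl:text>' ...
    '</xsl:for-each>' ...
    '</xsl:template>' ...
    '</xsl:stylesheet>'];

% Wrap both strings as transform sources and collect the output in memory.
xsl_source = javax.xml.transform.stream.StreamSource(java.io.StringReader(xsl_str));
xml_source = javax.xml.transform.stream.StreamSource(java.io.StringReader(xml_str));
writer = java.io.StringWriter;
result = javax.xml.transform.stream.StreamResult(writer);

% A single transform call does the parsing and the extraction on the Java side.
factory = javaMethod('newInstance', 'javax.xml.transform.TransformerFactory');
transformer = factory.newTransformer(xsl_source);
transformer.transform(xml_source, result);

% Split the newline-delimited output into a cell array of strings.
out = strtrim(char(writer.toString));
aa = regexp(out, '\n', 'split');
```

The same pattern with `select="//ab"` would give the `ab` values. The input is still handled by a real XML parser, so entity references and similar cases are resolved, which the plain regexp approach does not do.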
