findElement

Find elements in HTML tree

Description

subtrees = findElement(tree,selector) returns the elements in tree matching the CSS selector.

Examples

Read HTML code from the URL https://www.mathworks.com/help/textanalytics using the webread function.

url = "https://www.mathworks.com/help/textanalytics";
code = webread(url);

Parse the HTML code using htmlTree.

tree = htmlTree(code);

Find all the hyperlinks in the HTML tree using findElement. The hyperlinks are nodes with element name "A".

selector = "A";
subtrees = findElement(tree,selector);

View the first few subtrees.

subtrees(1:10)
ans = 
  10×1 htmlTree:

    <A class="svg_link navbar-brand" href="https://www.mathworks.com?s_tid=gn_logo"><IMG alt="MathWorks" class="mw_logo" src="/images/responsive/global/pic-header-mathworks-logo.svg"/></A>
    <A class="mwa-nav_login" href="https://www.mathworks.com/login?uri=http://www.mathworks.com/help/textanalytics/index.html">Sign In</A>
    <A href="https://www.mathworks.com/products.html?s_tid=gn_ps">Products</A>
    <A href="https://www.mathworks.com/solutions.html?s_tid=gn_sol">Solutions</A>
    <A href="https://www.mathworks.com/academia.html?s_tid=gn_acad">Academia</A>
    <A href="https://www.mathworks.com/support.html?s_tid=gn_supp">Support</A>
    <A href="https://www.mathworks.com/matlabcentral/?s_tid=gn_mlc">Community</A>
    <A href="https://www.mathworks.com/company/events.html?s_tid=gn_ev">Events</A>
    <A href="https://www.mathworks.com/company/aboutus/contact_us.html?s_tid=gn_cntus">Contact Us</A>
    <A href="https://www.mathworks.com/store?s_cid=store_top_nav&amp;s_tid=gn_store">How to Buy</A>

Extract the text from the subtrees using extractHTMLText. The result contains the link text from each link on the page.

str = extractHTMLText(subtrees);
str(1:10)
ans = 10×1 string array
    ""
    "Sign In"
    "Products"
    "Solutions"
    "Academia"
    "Support"
    "Community"
    "Events"
    "Contact Us"
    "How to Buy"
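
The same workflow can be tried without a network connection. The following is a minimal sketch, assuming a small inline HTML string (made up for illustration) in place of the downloaded page. It also reads the href attribute of each link with getAttribute.

```matlab
% Hypothetical inline HTML, used only for illustration.
code = "<html><body>" + ...
    "<a href=""https://www.mathworks.com"">Home</a>" + ...
    "<a href=""guide.pdf"">Guide</a>" + ...
    "</body></html>";
tree = htmlTree(code);

% Find the hyperlink elements, then read their text and href values.
links = findElement(tree,"a");
linkText = extractHTMLText(links);
linkTargets = getAttribute(links,"href");

% Pair each link text with its target for inspection.
tbl = table(linkText,linkTargets)
```

Keeping the link text and the targets in one table makes it easier to check that each extracted string belongs to the expected hyperlink.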

Input Arguments

tree: HTML tree, specified as a scalar htmlTree object.

selector: CSS selector, specified as a string scalar or a character vector. For more information, see CSS Selectors.

Output Arguments

subtrees: Matching HTML subtrees, returned as an htmlTree array.

More About

HTML Elements

A typical HTML element contains the following components:

  • Element name – Name of the HTML tag. The element name corresponds to the Name property of the HTML tree.

  • Attributes – Additional information about the tag. HTML attributes have the form name="value", where name and value denote the attribute name and value respectively. The attributes appear inside the opening HTML tag. To get the attribute values from an HTML tree, use getAttribute.

  • Content – Element content. The content appears between opening and closing HTML tags. The content can be text data or nested HTML elements. To extract the text from an htmlTree object, use extractHTMLText. To get the nested HTML elements of an htmlTree object, use the Children property.

For example, the HTML element <a href="https://www.mathworks.com">Home</a> comprises the following components:

Component | Value | Description
Element name | a | Element is a hyperlink
Attribute name | href | Hyperlink reference
Attribute value | "https://www.mathworks.com" | Hyperlink reference value
Content | Home | Text to display
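
The components above can be inspected programmatically. This is a minimal sketch, assuming the example element is wrapped in a bare HTML document so that it parses as a complete page. The variable names are illustrative.

```matlab
% Wrap the example element in a minimal document and parse it.
code = "<html><body><a href=""https://www.mathworks.com"">Home</a></body></html>";
tree = htmlTree(code);

% Locate the hyperlink element.
link = findElement(tree,"a");

name = link.Name                     % element name
href = getAttribute(link,"href")     % attribute value
content = extractHTMLText(link)      % text content
children = link.Children             % nested nodes of the element
```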

CSS Selectors

CSS selectors specify patterns to match elements in a tree.

This table shows examples of CSS selectors that extract different HTML elements from an HTML tree:

Task | CSS Selector | Example
Find all paragraph (<p>) elements. | "p" | findElement(tree,"p")
Find all paragraph (<p>) and list item (<li>) elements. | "p,li" | findElement(tree,"p,li")
Find all paragraph (<p>) elements that are inside table (<table>) elements. | "table p" | findElement(tree,"table p")
Find all hyperlink (<a>) elements with hyperlink reference attribute (href) values ending with ".pdf". | "a[href$="".pdf""]" | findElement(tree,"a[href$="".pdf""]")
Find all paragraph (<p>) elements that are the first child of their parent. | "p:first-child" | findElement(tree,"p:first-child")
Find all paragraph (<p>) elements that are the first paragraph element of their parent. | "p:first-of-type" | findElement(tree,"p:first-of-type")
Find all emphasis (<em>) elements where the parent is a paragraph (<p>) element. | "p > em" | findElement(tree,"p > em")
Find all paragraph (<p>) elements appearing immediately after a heading 1 (<h1>) element. | "h1 + p" | findElement(tree,"h1 + p")
Find all empty elements. | ":empty" | findElement(tree,":empty")
Find all nonempty label (<label>) elements. | "label:not(:empty)" | findElement(tree,"label:not(:empty)")
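
To see how several of these selectors behave, they can be run against a small document. The following is a minimal sketch, assuming a made-up inline HTML string. The file names guide.pdf and index.html are placeholders.

```matlab
% Hypothetical inline HTML, used only for illustration.
code = "<html><body>" + ...
    "<h1>Title</h1>" + ...
    "<p>Intro with <em>emphasis</em>.</p>" + ...
    "<ul><li>Item</li></ul>" + ...
    "<table><tr><td><p>Cell</p></td></tr></table>" + ...
    "<a href=""guide.pdf"">Guide</a>" + ...
    "<a href=""index.html"">Index</a>" + ...
    "</body></html>";
tree = htmlTree(code);

% Element name selector: every paragraph in the document.
paragraphs = findElement(tree,"p");

% Selector list: paragraphs and list items together.
paragraphsAndItems = findElement(tree,"p,li");

% Descendant combinator: paragraphs nested anywhere inside a table.
tableParagraphs = findElement(tree,"table p");

% Attribute selector: links whose href value ends with ".pdf".
pdfLinks = findElement(tree,"a[href$="".pdf""]");

% Child and adjacent-sibling combinators.
emphasisInParagraph = findElement(tree,"p > em");
paragraphAfterHeading = findElement(tree,"h1 + p");

% Show the text of the paragraphs found inside the table.
extractHTMLText(tableParagraphs)
```

Comparing the sizes of the returned htmlTree arrays, for example with numel, is a quick way to confirm that a selector matches only the intended elements.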

The findElement function supports all of CSS level 3, except for the selectors ":lang", ":checked", ":link", ":active", ":hover", ":focus", ":target", ":enabled", and ":disabled".

For more information about CSS selectors, see [1].

References

Introduced in R2018b