2.10. Pandas Read XML

  • File paths works also with URLs

  • io.StringIO Converts str to File-like object

  • pd.read_xml()

2.10.1. SetUp

>>> import pandas as pd
>>>
>>> pd.set_option('display.width', 250)
>>> pd.set_option('display.max_columns', 20)
>>> pd.set_option('display.max_rows', 30)
>>>
>>>
>>> DATA = """<?xml version="1.0"?>
... <catalog>
...    <book id="bk101">
...       <author>Gambardella, Matthew</author>
...       <title>XML Developer's Guide</title>
...       <genre>Computer</genre>
...       <price>44.95</price>
...       <publish_date>2000-10-01</publish_date>
...       <description>An in-depth look at creating applications
...       with XML.</description>
...    </book>
...    <book id="bk102">
...       <author>Ralls, Kim</author>
...       <title>Midnight Rain</title>
...       <genre>Fantasy</genre>
...       <price>5.95</price>
...       <publish_date>2000-12-16</publish_date>
...       <description>A former architect battles corporate zombies,
...       an evil sorceress, and her own childhood to become queen
...       of the world.</description>
...    </book>
...    <book id="bk103">
...       <author>Corets, Eva</author>
...       <title>Maeve Ascendant</title>
...       <genre>Fantasy</genre>
...       <price>5.95</price>
...       <publish_date>2000-11-17</publish_date>
...       <description>After the collapse of a nanotechnology
...       society in England, the young survivors lay the
...       foundation for a new society.</description>
...    </book>
... </catalog>
... """
>>>

2.10.2. Read XML

>>> pd.read_xml(DATA)
      id                author                  title     genre  price publish_date                                        description
0  bk101  Gambardella, Matthew  XML Developer's Guide  Computer  44.95   2000-10-01  An in-depth look at creating applications\n   ...
1  bk102            Ralls, Kim          Midnight Rain   Fantasy   5.95   2000-12-16  A former architect battles corporate zombies,\...
2  bk103           Corets, Eva        Maeve Ascendant   Fantasy   5.95   2000-11-17  After the collapse of a nanotechnology\n      ...

2.10.3. XML and XSLT

>>> from io import StringIO
>>> from lxml.etree import XML, XSLT, parse
>>>
>>>
>>> TEMPLATE = """
...     <html xsl:version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
...         <table>
...             <thead>
...                 <tr>
...                     <th>Id</th>
...                     <th>Author</th>
...                     <th>Title</th>
...                     <th>Genre</th>
...                     <th>Price</th>
...                     <th>Publish Date</th>
...                     <th>Description</th>
...                 </tr>
...             </thead>
...             <tbody>
...
...                 <xsl:for-each select="catalog/book">
...                     <tr>
...                         <td><xsl:value-of select="@id"/></td>
...                         <td><xsl:value-of select="author"/></td>
...                         <td><xsl:value-of select="title"/></td>
...                         <td><xsl:value-of select="genre"/></td>
...                         <td><xsl:value-of select="price"/></td>
...                         <td><xsl:value-of select="publish_date"/></td>
...                         <td><xsl:value-of select="description"/></td>
...                     </tr>
...                 </xsl:for-each>
...
...             </tbody>
...         </table>
...     </html>
... """
>>>
>>> transform = XSLT(XML(TEMPLATE))
>>> data = parse(StringIO(DATA))
>>> html = str(transform(data))
>>> dfs = pd.read_html(html)
>>> result = dfs[0]
>>>
>>> result
      Id                Author                  Title     Genre  Price Publish Date                                        Description
0  bk101  Gambardella, Matthew  XML Developer's Guide  Computer  44.95   2000-10-01  An in-depth look at creating applications  wit...
1  bk102            Ralls, Kim          Midnight Rain   Fantasy   5.95   2000-12-16  A former architect battles corporate zombies, ...
2  bk103           Corets, Eva        Maeve Ascendant   Fantasy   5.95   2000-11-17  After the collapse of a nanotechnology  societ...
>>>
>>> type(result) is pd.DataFrame
True
>>>
>>> len(result) > 0
True
>>>
>>> result.columns
Index(['Id', 'Author', 'Title', 'Genre', 'Price', 'Publish Date', 'Description'], dtype='object')
>>>
>>> result['Title']
0    XML Developer's Guide
1            Midnight Rain
2          Maeve Ascendant
Name: Title, dtype: object

2.10.4. Assignments

Code 2.41. Solution
"""
* Assignment: Pandas Read XSLT Plants
* Complexity: medium
* Lines of code: 4 lines
* Time: 5 min

English:
    1. Read data from `DATA` as `result: pd.DataFrame`
    2. Use XSLT transformation
    3. Make sure that columns and indexes are named properly
    4. Calculate average cost of flower
    5. Run doctests - all must succeed

Polish:
    1. Wczytaj dane z `DATA` jako `result: pd.DataFrame`
    2. Użyj transformaty XSLT
    3. Upewnij się, że nazwy kolumn i indeks są dobrze ustawione
    4. Wylicz średni koszt kwiatów
    5. Uruchom doctesty - wszystkie muszą się powieść

Hints:
    * `pip install --upgrade lxml`

Tests:
    >>> import sys; sys.tracebacklimit = 0

    >>> assert result is not Ellipsis, \
    'Assign result to variable: `result`'
    >>> assert type(result) is pd.DataFrame, \
    'Variable `result` has invalid type, should be `pd.DataFrame`'

    >>> result
         English Name              Latin Name   Cost
    0       Bloodroot  Sanguinaria canadensis  $2.44
    1       Columbine    Aquilegia canadensis  $9.37
    2  Marsh Marigold        Caltha palustris  $6.81
    3         Cowslip        Caltha palustris  $9.90
"""

import pandas as pd
from io import StringIO
from lxml.etree import XML, XSLT, parse

DATA = """
    <CATALOG>
        <PLANT>
            <COMMON>Bloodroot</COMMON>
            <BOTANICAL>Sanguinaria canadensis</BOTANICAL>
            <ZONE>4</ZONE>
            <LIGHT>Mostly Shady</LIGHT>
            <PRICE>$2.44</PRICE>
            <AVAILABILITY>031599</AVAILABILITY>
        </PLANT>
        <PLANT>
            <COMMON>Columbine</COMMON>
            <BOTANICAL>Aquilegia canadensis</BOTANICAL>
            <ZONE>3</ZONE>
            <LIGHT>Mostly Shady</LIGHT>
            <PRICE>$9.37</PRICE>
            <AVAILABILITY>030699</AVAILABILITY>
        </PLANT>
        <PLANT>
            <COMMON>Marsh Marigold</COMMON>
            <BOTANICAL>Caltha palustris</BOTANICAL>
            <ZONE>4</ZONE>
            <LIGHT>Mostly Sunny</LIGHT>
            <PRICE>$6.81</PRICE>
            <AVAILABILITY>051799</AVAILABILITY>
        </PLANT>
        <PLANT>
            <COMMON>Cowslip</COMMON>
            <BOTANICAL>Caltha palustris</BOTANICAL>
            <ZONE>4</ZONE>
            <LIGHT>Mostly Shady</LIGHT>
            <PRICE>$9.90</PRICE>
            <AVAILABILITY>030699</AVAILABILITY>
        </PLANT>
    </CATALOG>
"""

TEMPLATE = """
    <html xsl:version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
        <table>
            <thead>
                <tr>
                    <th>English Name</th>
                    <th>Latin Name</th>
                    <th>Cost</th>
                </tr>
            </thead>

            <xsl:for-each select="CATALOG/PLANT">
                <tr>
                    <td><xsl:value-of select="COMMON"/></td>
                    <td><xsl:value-of select="BOTANICAL"/></td>
                    <td><xsl:value-of select="PRICE"/></td>
                </tr>
            </xsl:for-each>

        </table>
    </html>
"""


# XSLT transformed DATA
# type: pd.DataFrame
result = ...