Python: Parsing XML with lxml

Original: https://www.blog.pythonlibrary.org/2010/11/20/python-parsing-xml-with-lxml/

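Last time, we looked at one of Python's built-in XML parsers. In this article, we will look at a fun third-party package: lxml from codespeak. It uses the ElementTree API, among other things. The lxml package supports XPath and XSLT, includes an API for SAX, and offers a C-level API for compatibility with C/Pyrex modules. We will only do some simple things with it here.

In any case, for this article we will take the examples from the minidom parsing article and see how to parse them with lxml. Here is an XML example from a program that was written to keep track of appointments: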


    <appointment>
        <begin>1181251680</begin>
        <uid>040000008200E000</uid>
        <alarmtime>1181572063</alarmtime>
        <state></state>
        <location></location>
        <duration>1800</duration>
        <subject>Bring pizza home</subject>
    </appointment>
    <appointment>
        <begin>1234360800</begin>
        <duration>1800</duration>
        <subject>Check MS Office website for updates</subject>
        <location></location>
        <uid>604f4792-eb89-478b-a14f-dd34d3cc6c21-1234360800</uid>
        <state>dismissed</state>
    </appointment>

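The XML above shows two appointments. The begin time is given in seconds since the epoch; the uid is generated from a hash of the begin time and a key (I think); the alarm time is also in seconds since the epoch and should be less than the begin time; and the state records whether the appointment has been snoozed or dismissed. The rest is self-explanatory. Now let's see how to parse it.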


from lxml import etree
from StringIO import StringIO

#----------------------------------------------------------------------
def parseXML(xmlFile):
    """
    Parse the xml
    """
    f = open(xmlFile)
    xml = f.read()
    f.close()

    tree = etree.parse(StringIO(xml))
    context = etree.iterparse(StringIO(xml))
    for action, elem in context:
        if not elem.text:
            text = "None"
        else:
            text = elem.text
        print elem.tag + " => " + text   

if __name__ == "__main__":
    parseXML("example.xml")

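First, we import the modules we need: the etree module from the lxml package and the StringIO class from the built-in StringIO module. Our parseXML function accepts one argument, the path to the XML file in question. We open the file, read it, and close it. Now comes the fun part! We use etree's parse function to parse the XML text, which we wrap in a StringIO object. For reasons I don't completely understand, the parse function requires a file-like object.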
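The likely reason is that etree.parse treats a plain string argument as a file name or URL, so XML text that is already in memory has to be wrapped in a file-like object or handed to etree.fromstring instead. A minimal Python 3 sketch of both routes follows. The sample document and its appointments root element are invented for illustration, and the fallback to the standard library's xml.etree.ElementTree is only there so the snippet runs without lxml installed.

```python
from io import BytesIO

try:
    from lxml import etree
except ImportError:  # fallback so the sketch runs without lxml
    import xml.etree.ElementTree as etree

# Hypothetical sample; the root element name is an assumption.
SAMPLE = (
    b"<appointments><appointment>"
    b"<subject>Bring pizza home</subject>"
    b"</appointment></appointments>"
)

# Route 1: wrap the bytes in a file-like object and use parse().
tree = etree.parse(BytesIO(SAMPLE))
root_from_parse = tree.getroot()

# Route 2: hand the bytes straight to fromstring(), which returns the root.
root_from_string = etree.fromstring(SAMPLE)

print(root_from_parse.tag, root_from_string.tag)
print(root_from_string.findtext("appointment/subject"))
```

Both routes end up with the same root element; parse() additionally gives you a tree object wrapped around it.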

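Next we iterate over the context (the lxml.etree.iterparse object) and pull out each element's tag and text. The if statement replaces empty fields with the word "None" to make the output clearer. That's all there is to it.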
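For readers on Python 3, the same loop can be sketched as follows. This is an illustration rather than the article's code: the list_tags helper, the in-memory sample and its appointments root element are assumptions, and the import falls back to the standard library's ElementTree, whose iterparse behaves the same way for this purpose. By default iterparse reports an element when its closing tag is reached, so children appear before their parent.

```python
from io import BytesIO

try:
    from lxml import etree
except ImportError:  # fallback so the sketch runs without lxml
    import xml.etree.ElementTree as etree

# Hypothetical sample; the root element name is an assumption.
SAMPLE = (
    b"<appointments>"
    b"<appointment>"
    b"<begin>1181251680</begin>"
    b"<state></state>"
    b"<subject>Bring pizza home</subject>"
    b"</appointment>"
    b"</appointments>"
)

def list_tags(xml_bytes):
    """Return (tag, text) pairs, using "None" for elements with no text."""
    pairs = []
    for event, elem in etree.iterparse(BytesIO(xml_bytes)):
        text = elem.text if elem.text else "None"
        pairs.append((elem.tag, text))
    return pairs

for tag, text in list_tags(SAMPLE):
    print(tag + " => " + text)
```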

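Parsing the Book Example

The result of that example is a bit lame. Most of the time you want to save the data you extract and do something with it, not just print it to stdout. So for our next example, we will create a data structure to hold the results: a list of dictionaries. We will use the MSDN book example here: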


 <book id="bk101"><author>Gambardella, Matthew</author>
      <title>XML Developer's Guide</title>
      <genre>Computer</genre>
      <price>44.95</price>
      <publish_date>2000-10-01</publish_date>
      <description>An in-depth look at creating applications 
      with XML.</description></book> 
   <book id="bk102"><author>Ralls, Kim</author>
      <title>Midnight Rain</title>
      <genre>Fantasy</genre>
      <price>5.95</price>
      <publish_date>2000-12-16</publish_date>
      <description>A former architect battles corporate zombies, 
      an evil sorceress, and her own childhood to become queen 
      of the world.</description></book> 
   <book id="bk103"><author>Corets, Eva</author>
      <title>Maeve Ascendant</title>
      <genre>Fantasy</genre>
      <price>5.95</price>
      <publish_date>2000-11-17</publish_date>
      <description>After the collapse of a nanotechnology 
      society in England, the young survivors lay the 
      foundation for a new society.</description></book> 
   <book id="bk104"><author>Corets, Eva</author>
      <title>Oberon's Legacy</title>
      <genre>Fantasy</genre>
      <price>5.95</price>
      <publish_date>2001-03-10</publish_date>
      <description>In post-apocalypse England, the mysterious 
      agent known only as Oberon helps to create a new life 
      for the inhabitants of London. Sequel to Maeve 
      Ascendant.</description></book> 
   <book id="bk105"><author>Corets, Eva</author>
      <title>The Sundered Grail</title>
      <genre>Fantasy</genre>
      <price>5.95</price>
      <publish_date>2001-09-10</publish_date>
      <description>The two daughters of Maeve, half-sisters, 
      battle one another for control of England. Sequel to 
      Oberon's Legacy.</description></book> 
   <book id="bk106"><author>Randall, Cynthia</author>
      <title>Lover Birds</title>
      <genre>Romance</genre>
      <price>4.95</price>
      <publish_date>2000-09-02</publish_date>
      <description>When Carla meets Paul at an ornithology 
      conference, tempers fly as feathers get ruffled.</description></book> 
   <book id="bk107"><author>Thurman, Paula</author>
      <title>Splish Splash</title>
      <genre>Romance</genre>
      <price>4.95</price>
      <publish_date>2000-11-02</publish_date>
      <description>A deep sea diver finds true love twenty 
      thousand leagues beneath the sea.</description></book> 
   <book id="bk108"><author>Knorr, Stefan</author>
      <title>Creepy Crawlies</title>
      <genre>Horror</genre>
      <price>4.95</price>
      <publish_date>2000-12-06</publish_date>
      <description>An anthology of horror stories about roaches,
      centipedes, scorpions  and other insects.</description></book> 
   <book id="bk109"><author>Kress, Peter</author>
      <title>Paradox Lost</title>
      <genre>Science Fiction</genre>
      <price>6.95</price>
      <publish_date>2000-11-02</publish_date>
      <description>After an inadvertant trip through a Heisenberg
      Uncertainty Device, James Salway discovers the problems 
      of being quantum.</description></book> 
   <book id="bk110"><author>O'Brien, Tim</author>
      <title>Microsoft .NET: The Programming Bible</title>
      <genre>Computer</genre>
      <price>36.95</price>
      <publish_date>2000-12-09</publish_date>
      <description>Microsoft's .NET initiative is explored in 
      detail in this deep programmer's reference.</description></book> 
   <book id="bk111"><author>O'Brien, Tim</author>
      <title>MSXML3: A Comprehensive Guide</title>
      <genre>Computer</genre>
      <price>36.95</price>
      <publish_date>2000-12-01</publish_date>
      <description>The Microsoft MSXML3 parser is covered in 
      detail, with attention to XML DOM interfaces, XSLT processing, 
      SAX and more.</description></book> 
   <book id="bk112"><author>Galos, Mike</author>
      <title>Visual Studio 7: A Comprehensive Guide</title>
      <genre>Computer</genre>
      <price>49.95</price>
      <publish_date>2001-04-16</publish_date>
      <description>Microsoft Visual Studio 7 is explored in depth,
      looking at how Visual Basic, Visual C++, C#, and ASP+ are 
      integrated into a comprehensive development 
      environment.</description></book> 

Now let's parse it and put it into our data structure!


from lxml import etree
from StringIO import StringIO

#----------------------------------------------------------------------
def parseBookXML(xmlFile):

    f = open(xmlFile)
    xml = f.read()
    f.close()

    tree = etree.parse(StringIO(xml))
    print tree.docinfo.doctype
    context = etree.iterparse(StringIO(xml))
    book_dict = {}
    books = []
    for action, elem in context:
        if not elem.text:
            text = "None"
        else:
            text = elem.text
        print elem.tag + " => " + text
        book_dict[elem.tag] = text
        if elem.tag == "book":
            books.append(book_dict)
            book_dict = {}
    return books

if __name__ == "__main__":
    parseBookXML("example2.xml")

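This example is very similar to the previous one, so we will focus only on the differences. Just before we start iterating over the context, we create an empty dictionary and an empty list. Then, inside the loop, we fill in the dictionary like this: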


book_dict[elem.tag] = text

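Here text is either elem.text or "None". Finally, if the tag happens to be "book", we have reached the end of a book and need to append the dictionary to our list and reset it for the next book. As you can see, that is exactly what we do. A more realistic example would put the extracted data into a Book class. I have done that before with JSON feeds.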
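As a sketch of the Book class idea, the loop could fill a small class instead of a dictionary. Everything named here is hypothetical: the Book fields simply mirror the tags in the sample, parse_books is an invented helper, and the catalog root element in the inline sample is an assumption. It is written for Python 3 and falls back to the standard library's ElementTree if lxml is missing.

```python
from dataclasses import dataclass
from io import BytesIO

try:
    from lxml import etree
except ImportError:  # fallback so the sketch runs without lxml
    import xml.etree.ElementTree as etree

@dataclass
class Book:
    """Hypothetical container; fields mirror the tags in the sample XML."""
    id: str
    author: str = "None"
    title: str = "None"
    genre: str = "None"
    price: str = "None"
    publish_date: str = "None"
    description: str = "None"

# Shortened sample; the <catalog> root element is an assumption.
SAMPLE = b"""<catalog>
  <book id="bk101"><author>Gambardella, Matthew</author>
    <title>XML Developer's Guide</title><price>44.95</price></book>
  <book id="bk102"><author>Ralls, Kim</author>
    <title>Midnight Rain</title><price>5.95</price></book>
</catalog>"""

def parse_books(source):
    """Build one Book per <book> element from a path or file-like object."""
    books = []
    for event, elem in etree.iterparse(source):
        if elem.tag != "book":
            continue
        # The closing </book> tag has been reached, so all children are present.
        fields = {child.tag: (child.text or "None") for child in elem}
        books.append(Book(id=elem.get("id"), **fields))
    return books

for book in parse_books(BytesIO(SAMPLE)):
    print(book.id, book.title, book.price)
```

Reading the children only when the book element closes also keeps the book tag itself out of the result. A child tag that is not a Book field would raise a TypeError here, which may or may not be what you want.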

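Refactoring the Code

As my alert readers pointed out, I had written some pretty poor code. So I cleaned it up a bit, and hopefully it is a little better now: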


from lxml import etree

#----------------------------------------------------------------------
def parseBookXML(xmlFile):
    """
    Parse the book XML into a list of dicts, one per book
    """

    context = etree.iterparse(xmlFile)
    book_dict = {}
    books = []
    for action, elem in context:
        if not elem.text:
            text = "None"
        else:
            text = elem.text
        print elem.tag + " => " + text
        book_dict[elem.tag] = text
        if elem.tag == "book":
            books.append(book_dict)
            book_dict = {}
    return books

if __name__ == "__main__":
    parseBookXML("example.xml")

As you can see, we dropped the StringIO module entirely and left all the file I/O to the lxml call. The rest is the same. Pretty cool, huh? As usual, Python rocks!
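To show that iterparse can take a path on its own, here is a short Python 3 sketch that writes a tiny sample to a temporary file and parses it by name. The count_books helper, the file contents and the catalog root are assumptions made for the demonstration, and the import falls back to the standard library's ElementTree when lxml is not installed.

```python
import os
import tempfile

try:
    from lxml import etree
except ImportError:  # fallback so the sketch runs without lxml
    import xml.etree.ElementTree as etree

def count_books(xml_path):
    """Count <book> elements, letting iterparse open the file itself."""
    return sum(
        1 for event, elem in etree.iterparse(xml_path) if elem.tag == "book"
    )

# Write a small hypothetical document to disk so there is a real path to parse.
with tempfile.NamedTemporaryFile("wb", suffix=".xml", delete=False) as handle:
    handle.write(b"<catalog><book id='bk101'/><book id='bk102'/></catalog>")
    sample_path = handle.name

try:
    print(count_books(sample_path))
finally:
    os.remove(sample_path)  # clean up the temporary file
```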

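Wrapping Up

Did you learn anything from this article? I certainly hope so. Python has lots of cool parsing libraries, both inside and outside its standard library. Be sure to check them out and see which one best fits the way you program.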

Further Reading

  • The official lxml website
  • An IBM article on lxml
  • The StringIO documentation