Python Data Science Tutorial: Processing Unstructured Data with Built-in Functions from Python Libraries
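Data that is already in row-and-column format, or that can easily be converted to rows and columns so that it can later fit neatly into a database, is known as structured data. Examples are CSV, TXT, and XLS files. These files have a delimiter and either fixed or variable width, and missing values appear as blanks between the delimiters.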
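As a minimal sketch of what "rows and columns with delimiters" looks like in practice, the snippet below parses a small comma-delimited sample with the standard-library `csv` module. The sample data and the helper name `read_rows` are made up for illustration; note how the missing value surfaces as an empty string between two delimiters.

```python
import csv
import io

# Hypothetical sample data; the empty field in Bob's row is a missing value.
SAMPLE_CSV = "name,age,city\nAlice,30,Paris\nBob,,Berlin\n"


def read_rows(text):
    """Parse delimited text into a list of dicts, one dict per data row."""
    reader = csv.DictReader(io.StringIO(text))
    return list(reader)


rows = read_rows(SAMPLE_CSV)
for row in rows:
    # Each row maps column names to string values; missing values are ''.
    print(row)
```

Because every row has the same named columns, data in this shape can be loaded into a database table with little extra work.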
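But sometimes we get data whose lines are not of fixed width, or that is simply an HTML, image, or PDF file. Such data is known as unstructured data. An HTML file can be handled by processing its HTML tags, but a feed from Twitter or a plain-text document from a news feed has neither delimiters nor tags to process. In such cases we use built-in functions from various Python libraries to process the file.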
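Reading Data
In the example below we take a text file and read it, separating out each line in the file. The output can then be broken down further into lines and words. The original file is a text file containing several paragraphs that describe the Python language.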
filename = r'path\input.txt'
with open(filename) as fn:
    # Read the first line
    ln = fn.readline()
    # Keep count of lines
    lncnt = 1
    # readline() returns an empty string at end of file, which ends the loop
    while ln:
        print("Line {}: {}".format(lncnt, ln.strip()))
        ln = fn.readline()
        lncnt += 1
When we execute the above code, it produces the following result.
Line 1: Python is an interpreted high-level programming language for general-purpose programming. Created by Guido van Rossum and first released in 1991, Python has a design philosophy that emphasizes code readability, notably using significant whitespace. It provides constructs that enable clear programming on both small and large scales.
Line 2: Python features a dynamic type system and automatic memory management. It supports multiple programming paradigms, including object-oriented, imperative, functional and procedural, and has a large and comprehensive standard library.
Line 3: Python interpreters are available for many operating systems. CPython, the reference implementation of Python, is open source software and has a community-based development model, as do nearly all of its variant implementations. CPython is managed by the non-profit Python Software Foundation.
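To illustrate the "broken down further into words" step, here is a small sketch that splits each line on whitespace. The function name `split_into_words` and the sample lines are assumptions made for this example; in practice the lines would come from the file read above.

```python
def split_into_words(lines):
    """Return one list of words per non-empty line, splitting on whitespace."""
    return [line.split() for line in lines if line.strip()]


# Hypothetical sample standing in for lines read from the input file.
sample_lines = [
    "Python features a dynamic type system.\n",
    "\n",
    "Python interpreters are available for many operating systems.\n",
]

word_lists = split_into_words(sample_lines)
for number, words in enumerate(word_lists, start=1):
    print("Line {}: {} words".format(number, len(words)))
```

Splitting on whitespace leaves punctuation attached to words (for example `system.`), which matters once the words are counted.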
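Counting Word Frequency
We can count how often each word occurs in the file using the Counter class from the collections module, as shown below.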
from collections import Counter
with open(r'pathinput2.txt') as f:
    # Split the whole file on whitespace and count each token
    p = Counter(f.read().split())
    print(p)
When we run the above code, it produces the following result.
Counter({'and': 3, 'Python': 3, 'that': 2, 'a': 2, 'programming': 2, 'code': 1, '1991,': 1
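The result above contains tokens such as `'1991,'` because a plain `split()` keeps punctuation attached to words. One possible refinement, sketched here under the assumption that case and surrounding punctuation should be ignored, strips punctuation and lowercases each token before counting. The function name `count_words` and the sample text are illustrative only.

```python
import string
from collections import Counter


def count_words(text):
    """Count words case-insensitively, ignoring leading/trailing punctuation."""
    tokens = (token.strip(string.punctuation).lower() for token in text.split())
    # Drop tokens that were nothing but punctuation.
    return Counter(token for token in tokens if token)


# Hypothetical sample text standing in for the file contents.
sample_text = (
    "Created by Guido van Rossum and first released in 1991, "
    "Python is readable. Python is popular."
)
counts = count_words(sample_text)
# most_common(n) returns the n most frequent (word, count) pairs.
print(counts.most_common(3))
```

Whether to lowercase or strip punctuation depends on the analysis; for instance, case may carry meaning when proper nouns matter.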
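Copyright Notice
This article represents only the author's views and does not represent the position of Code前端网.
This article was published by Code前端网. If you wish to reproduce it, please cite the page address.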