Run MRJob from IPython notebook

Question

I'm trying to run the mrjob example from an IPython notebook:

from mrjob.job import MRJob


class MRWordFrequencyCount(MRJob):

    def mapper(self, _, line):
        yield "chars", len(line)
        yield "words", len(line.split())
        yield "lines", 1

    def reducer(self, key, values):
        yield key, sum(values)

and then run it with this code:

mr_job = MRWordFrequencyCount(args=["testfile.txt"])
with mr_job.make_runner() as runner:
    runner.run()
    for line in runner.stream_output():
        key, value = mr_job.parse_output_line(line)
        print("%s %s" % (key, value))

and get this error:

TypeError: <module '__main__' (built-in)> is a built-in class

Is there a way to run mrjob from an IPython notebook?

Recommended answer

I haven't found the "perfect way" yet, but one thing you can do is create a notebook cell that uses the %%file magic to write the cell contents to a file:

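%%file wordcount.py
from mrjob.job import MRJob

class MRWordFrequencyCount(MRJob):

    def mapper(self, _, line):
        # Emit one count per input line for characters, words and lines.
        yield "chars", len(line)
        yield "words", len(line.split())
        yield "lines", 1

    def reducer(self, key, values):
        # Sum the counts collected for each key.
        yield key, sum(values)

# Lets mrjob run this file as a script (e.g. from the shell).
if __name__ == '__main__':
    MRWordFrequencyCount.run()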

And then have mrjob run that file in a later cell:

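import wordcount
# Pick up edits to wordcount.py. reload() is a builtin on Python 2;
# on Python 3, import it first with: from importlib import reload
reload(wordcount)

mr_job = wordcount.MRWordFrequencyCount(args=['example.txt'])
with mr_job.make_runner() as runner:
    runner.run()
    # stream_output() and parse_output_line() are the calls from the original
    # answer; newer mrjob releases offer cat_output() and parse_output() instead.
    for line in runner.stream_output():
        key, value = mr_job.parse_output_line(line)
        print("%s %s" % (key, value))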

Notice that I called my file wordcount.py and that I import the class MRWordFrequencyCount from the wordcount module: the file name and the module name have to match. Python also caches imported modules, so when you change wordcount.py, IPython will not reload the module but will keep using the old, cached one. That's why I put the reload() call in there.

Reference: https://groups.google.com/d/msg/mrjob/CfdAgcEaC-I/8XfJPXCjTvQJ
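
The module caching mentioned in this answer can be reproduced without mrjob. Below is a minimal sketch using only the standard library. The module name demo_cached_mod and its VALUE attribute are made up for illustration, and the sketch uses importlib.reload, which is where reload() lives on Python 3.

```python
import importlib
import os
import sys
import tempfile

# Hypothetical module name, used only for this demonstration.
MODULE_NAME = "demo_cached_mod"


def write_module(directory, value):
    """Write a one-line module file that defines VALUE."""
    path = os.path.join(directory, MODULE_NAME + ".py")
    with open(path, "w") as f:
        f.write("VALUE = %r\n" % (value,))


def demo():
    """Return VALUE after the first import, after a re-import, and after reload()."""
    workdir = tempfile.mkdtemp()
    sys.path.insert(0, workdir)
    try:
        write_module(workdir, "first version")
        importlib.invalidate_caches()
        mod = importlib.import_module(MODULE_NAME)
        first = mod.VALUE

        # Change the file on disk, as re-running a %%file cell would.
        write_module(workdir, "second version")

        # Importing again returns the cached module object from sys.modules.
        cached = importlib.import_module(MODULE_NAME).VALUE

        # reload() re-executes the source file and updates the module in place.
        importlib.reload(mod)
        reloaded = mod.VALUE
        return first, cached, reloaded
    finally:
        # Clean up so the demonstration leaves the interpreter state unchanged.
        sys.path.remove(workdir)
        sys.modules.pop(MODULE_NAME, None)
```

In the notebook this corresponds to re-running the %%file wordcount.py cell and then calling reload(wordcount) before building the job.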

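Update (shorter)

For a shorter second notebook cell, you can run the job by invoking the shell from within the notebook. Do not name the script mrjob.py, because that name would shadow the mrjob package when the script imports it; the call below uses the wordcount.py file written above. When run this way, the file also needs to end with the usual if __name__ == '__main__': MRWordFrequencyCount.run() lines.

! python wordcount.py shakespeare.txt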

Reference: http://jupyter.cs.brynmawr.edu/hub/dblank/public/Jupyter%20Magics.ipynb
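
The ! prefix hands the rest of the line to the system shell and prints whatever the command writes. If the output is wanted as a Python string inside the notebook, roughly the same thing can be done with subprocess. This is a minimal sketch: the helper name run_script is hypothetical, and it assumes the script writes its results to standard output.

```python
import subprocess
import sys


def run_script(script_path, *args):
    """Run a Python script in a child process and return its stdout as text."""
    result = subprocess.run(
        [sys.executable, script_path] + list(args),
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        universal_newlines=True,  # decode output to str instead of bytes
        check=True,               # raise CalledProcessError on a non-zero exit
    )
    return result.stdout

# Usage (assumes wordcount.py and shakespeare.txt exist in the working directory):
# print(run_script("wordcount.py", "shakespeare.txt"))
```

Using sys.executable keeps the child process on the same interpreter as the notebook kernel, so it sees the same installed packages.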
