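Someone had already written a Python script that downloads subtitles from TED, but his website is blocked here and the script fails to parse some TED talk URLs,
so I downloaded his Python script and modified it slightly.
Thanks: http://tedtalksubtitledownload.appspot.com/
The source is as follows: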
#! /usr/bin/env python
import simplejson
import pdb
import urllib
import sys
import re

def getFormatedTime(intvalue):
    # Convert milliseconds to an SRT timestamp (HH:MM:SS,mmm).
    mils = intvalue % 1000
    segs = (intvalue // 1000) % 60
    mins = (intvalue // 60000) % 60
    hors = intvalue // 3600000
    return "%02d:%02d:%02d,%03d" % (hors, mins, segs, mils)

def availableSubs(subs):
    # Collect the language code that follows each "LanguageCode" key.
    # The value is URL-encoded, so the surrounding quotes appear as %22.
    a = subs.find("LanguageCode")
    if a == -1:
        return []
    subs = subs[a+len("LanguageCode"):]
    return [re.search("%22([^A-Z]+)%22", subs).group(1)] + availableSubs(subs)

def getVideoParameters(urldirection):
    # Fetch the talk page and parse its flashVars block into a dict.
    ht = urllib.urlopen(urldirection).read()
    var = re.search('flashVars = {\n([^}]+)}', ht)
    if var:
        var = var.group(1)
    else:
        return None
    var = [a.replace('\t', '') for a in var.split('\n')]
    # debug pdb.set_trace()
    # Drop the trailing comma of each "key:value," line.
    for a in range(len(var)):
        if var[a]:
            var[a] = var[a][:var[a].rfind(',')]
    resultado = []
    for a in var:
        l = a.find(':')
        if l != -1:
            resultado.append((a[:l], a[l+1:]))
    return dict(resultado)

def downloadSub(idtalk, lang, timeIntro):
    # Download one language and write it as subs_<id>_<lang>.srt.
    print("Downloading subtitles for language %s"%lang)
    c = simplejson.load(urllib.urlopen('http://www.ted.com/talks/subtitles/id/%d/lang/%s'%(idtalk, lang)))
    salida = open('subs_%s_%s.srt'%(idtalk,lang), 'w')
    conta = 1
    c = c['captions']
    for linea in c:
        salida.write("%d\n"%conta)
        conta += 1
        salida.write("%s --> %s\n"%(getFormatedTime(timeIntro+linea['startTime']), getFormatedTime(timeIntro+linea['startTime']+linea['duration'])))
        salida.write("%s\n\n"%(linea['content'].encode('utf-8')))
    salida.close()

def main(tedurl):
    print("Loading information about TED talk number %s..."%tedurl)
    vidpar = getVideoParameters(tedurl)
    if not vidpar:
        print("There was a problem fetching information about that TED Talk")
        sys.exit(1)
    print("Download all subtitles (write 'all' when prompted) or only one (specify wich)?")
    a = raw_input()
    availables = availableSubs(vidpar['languages'])
    # Extract the numeric talk id, whatever its length.
    idtalk = int(re.search(r'\d+', vidpar['ti']).group())
    if a == "all":
        for lang in availables:
            downloadSub(idtalk, lang, int(vidpar['introDuration']))
    else:
        while a not in availables:
            print("We're sorry, the only available languages are:")
            for a in availables:
                print("\t"+a)
            a = raw_input()
        downloadSub(idtalk, a, int(vidpar['introDuration']))

if __name__ == "__main__":
    if len(sys.argv) < 2:
        print("Usage: %s tedurl"%sys.argv[0])
    else:
        main(sys.argv[1])
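The timestamp arithmetic and the SRT layout that the script writes can be tried offline, without contacting TED. The following is only a minimal Python 3 sketch: the names format_srt_time and captions_to_srt and the sample captions are made up for illustration, and the caption fields (startTime, duration, content) are assumed to be in milliseconds, as the script treats them.

```python
def format_srt_time(ms):
    """Convert milliseconds to the SRT timestamp form HH:MM:SS,mmm."""
    hours, rest = divmod(ms, 3600000)
    minutes, rest = divmod(rest, 60000)
    seconds, millis = divmod(rest, 1000)
    return "%02d:%02d:%02d,%03d" % (hours, minutes, seconds, millis)


def captions_to_srt(captions, intro_ms=0):
    """Render caption dicts (startTime, duration, content) as SRT text."""
    blocks = []
    for index, cap in enumerate(captions, start=1):
        # Every caption is shifted by the length of the talk's intro.
        start = intro_ms + cap["startTime"]
        end = start + cap["duration"]
        blocks.append("%d\n%s --> %s\n%s\n" % (
            index, format_srt_time(start), format_srt_time(end), cap["content"]))
    # A blank line separates consecutive subtitle entries.
    return "\n".join(blocks)


# Made-up sample data, shaped like the 'captions' list the script reads.
sample = [
    {"startTime": 0, "duration": 2500, "content": "Hello."},
    {"startTime": 2500, "duration": 3000, "content": "Welcome."},
]
print(captions_to_srt(sample, intro_ms=15000))
```

With an intro of 15 seconds, the first entry starts at 00:00:15,000 rather than at zero, which is why the script adds introDuration to every start time.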
To use it, you first need to install the simplejson package, available at: http://pypi.python.org/pypi/simplejson/
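If installing simplejson is inconvenient, one possible workaround (an assumption on my part, not something the original script does) is to fall back to the json module that ships with Python 2.6 and later, which offers the same load/loads interface. The payload below is made up and only imitates the shape of the data the script parses.

```python
try:
    import simplejson as json  # third-party package, used if installed
except ImportError:
    import json  # standard library module (Python 2.6 and later)

# Made-up payload shaped like the response the script parses.
payload = '{"captions": [{"startTime": 0, "duration": 2500, "content": "Hello."}]}'
data = json.loads(payload)
print(data["captions"][0]["content"])
```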
It also works in environments that reach the internet through an HTTP proxy.
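The script contains no proxy code of its own. Python's urllib reads proxy settings from environment variables such as http_proxy, which is probably why it works behind a proxy. The sketch below, written for Python 3, shows how such a variable is picked up; the proxy address is a hypothetical placeholder.

```python
import os
import urllib.request

# Hypothetical proxy address; replace it with the real proxy of your network.
os.environ["http_proxy"] = "http://proxy.example.com:8080"

# getproxies_environment() reports the proxies urllib found in *_proxy variables.
proxies = urllib.request.getproxies_environment()
print(proxies.get("http"))
```

In practice the variable is usually set in the shell before starting the script rather than inside Python.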
A concrete usage example follows:
D:\Document and Setting\test\My Documents\Downloads\TEDTalkSubtitles>TEDTalkSubtitles.py http://www.ted.com/talks/barry_schwartz_on_the_paradox_of_choice.html
Loading information about TED talk number http://www.ted.com/talks/barry_schwartz_on_the_paradox_of_choice.html...
Download all subtitles (write 'all' when prompted) or only one (specify wich)?
chi_hans
Downloading subtitles for language chi_hans

D:\Document and Setting\test\My Documents\Downloads\TEDTalkSubtitles>dir
 ドライブ D のボリューム ラベルは programe です
 ボリューム シリアル番号は 447B-7E2B です

 D:\Document and Setting\test\My Documents\Downloads\TEDTalkSubtitles のディレクトリ

2011/04/15  14:16    <DIR>          .
2011/04/15  14:16    <DIR>          ..
2011/04/15  14:34            31,879 subs_93_chi_hans.srt
2011/04/15  14:16            31,928 subs_93_eng.srt
2011/04/15  14:26             2,639 TEDTalkSubtitles.py
               3 個のファイル              66,446 バイト
               2 個のディレクトリ  13,469,048,832 バイトの空き領域

D:\Document and Setting\test\My Documents\Downloads\TEDTalkSubtitles>
refs:
http://pythonconquerstheuniverse.wordpress.com/category/the-python-debugger/
http://meyerweb.com/eric/tools/dencoder/