Regular Expressions

2017-09-17 爬虫俱乐部 (Crawler Club)

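* cntrade: download stock trading data
* chinafin: download financial data of listed companies
* cnintraday: download intraday trading data of listed companies
* cnstock: download stock codes
* chinagcode and chinaaddress: convert between Chinese addresses and latitude/longitude via the Baidu Maps API
* subinfile: modify text files
* wordconvert: convert files among docx, doc, rtf, pdf, and similar formats
* psemail: send email
* eventstudy: event studies
* ttable2: t tests by group
* Output of results: reg2docx, sum2docx, corr2docx, t2docx
Coming soon: table2docx, fillup, wordcloud, mapscatter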

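Original purpose: batch-modify the file paths in do-files
Origin of the name: subinstr(), "substitute in file"
Functions:
1. Keep the lines that contain a specified string or match a regular expression
2. Replace a specified substring, or the substring matched by a regular expression
3. Delete empty lines
Unexpected discovery: it simplifies programs that process page source code and extract information
Similar to: the file command, rewrite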

Version 1.0: infix version
Version 1.1: fixed some bugs, such as temporary files being read-only during execution
Version 2.0: Mata version

subinfile filesource, [options]

options:
1. index(string) keeps the lines that contain the specified string. Lines without the key string specified in index() are dropped.
2. indexregex specifies that the contents of index() are to be interpreted as a regular expression.
3. from(string) specifies the string to be replaced, and to(string) specifies the new string that replaces it.
4. fromregex specifies that the contents of from() are to be interpreted as a regular expression.
5. dropempty drops empty lines. If you specify both from() and dropempty, Stata first replaces the string you specify and then drops the empty lines.
6. save(string) specifies the path and file name to save to. If you do not specify a file extension, the file is saved as .txt by default.
7. replace permits save() to overwrite an existing file that is not read-only. If you do not specify save(string), the original file is replaced.

If you specify index(string), from(string), and dropempty in one command, index(string) is executed first, then from(string), and dropempty last.
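A minimal usage sketch, assuming subinfile is installed (for example from SSC) and using only the options listed above. The file demo.do and both directory paths are hypothetical and exist only for illustration; the exact behaviour may differ between subinfile versions.

```stata
* Assumption: subinfile is installed, e.g. via -ssc install subinfile-.
* Build a small hypothetical do-file that contains an empty line.
tempname fh
file open `fh' using "demo.do", write text replace
file write `fh' `"use "D:/data/a.dta", clear"' _n
file write `fh' _n
file write `fh' `"save "D:/data/b.dta", replace"' _n
file close `fh'

* Replace the old directory with the new one, drop the empty line,
* and write the result to a new file instead of overwriting demo.do.
subinfile demo.do, from("D:/data") to("E:/project/data") dropempty save(demo_new.do) replace

* Inspect the result.
type demo_new.do
```

Writing to a separate file through save() keeps the original do-file untouched, which is convenient while checking that the replacement did what was intended.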

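1. Obtain the page links
2. Obtain the page source code: copy, curl
3. Read in the source code: infix, import delimited, fileread()
4. Process the source code: subinfile
   (1) Keep the lines that contain the information
   (2) Extract the required information or delete the superfluous content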

1. Sina Finance executive appointment data: http://vip.stock.finance.sina.com.cn/corp/go.php/vci_corpmanager/stockid/600900.phtml
2. Statalist: https://www.statalist.org/forums/forum/general-stata-discussion/general
3. Stata commands: https://ideas.repec.org/s/boc/bocode.html
4. NBER paper information: http://www.nber.org/papers/w20001

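Scraping Sina Finance executive appointment data (single page):
1. Obtain the page links: look for the pattern across several links. Some sites require two rounds of scraping, with the first round collecting the page links (for example, Shenzhen Stock Exchange annual reports and the China Land Market website).
Sina Finance executive appointment data:
China Yangtze Power: http://vip.stock.finance.sina.com.cn/corp/go.php/vci_corpmanager/stockid/600900.phtml
Vanke A: http://vip.stock.finance.sina.com.cn/corp/go.php/vci_corpmanager/stockid/000002.phtml
Eastcompeace: http://vip.stock.finance.sina.com.cn/corp/go.php/vci_corpmanager/stockid/002017.phtml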
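The three links differ only in the stock code, so the pattern can be sketched as a loop that builds each URL from its code. This is an illustrative sketch rather than the authors' own code; the download line is left commented out because it needs network access, and the output file name is an assumption.

```stata
* Build one URL per stock code; the codes are the three shown above.
foreach code in 600900 000002 002017 {
    local url "http://vip.stock.finance.sina.com.cn/corp/go.php/vci_corpmanager/stockid/`code'.phtml"
    display "`url'"
    * Uncomment to download each page (requires network access):
    * copy "`url'" "`code'.txt", replace
}
```

Listing the codes with foreach ... in treats them as text tokens, so leading zeros such as those in 000002 are preserved.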

2. Feasibility analysis: can the information to be extracted be found in the page source code?

3. Obtain the page source code: the copy command or curl

    copy "http://vip.stock.finance.sina.com.cn/corp/go.php/vci_corpmanager/stockid/600900.phtml" temp.txt, replace

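4. Read in the source code
(1) The page is encoded in GB2312; to read it into Stata 14, convert the encoding first:

    unicode encoding set gb18030
    unicode translate temp.txt, transutf8   // no path may precede the file name; the file must be in the working directory
    unicode erasebackups, badidea           // fixed usage; deletes the backup files

(2) Read in the source code with infix or import delimited:

    infix strL v 1-100000 using temp.txt, clear
    import delimited using temp.txt, clear delimiters("asgdhjbaiucbiuabconobwivquviqcboqn", asstring) encoding("utf-8")

The long delimiter is a string that is not expected to occur in the source code, so each line is read into a single variable.

Note: if the source code consists of a single line, read it directly with the fileread() function. If the last line of the source code contains required information, first append a newline with the file command:

    tempname temp
    file open `temp' using temp.txt, write text append
    file write `temp' _n
    file close `temp'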
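For the single-line case mentioned in the note, a small offline sketch of fileread() follows. The file oneline.txt and its contents are hypothetical stand-ins for a downloaded page.

```stata
* Write a hypothetical one-line "page" so the example runs offline.
tempname fh
file open `fh' using "oneline.txt", write text replace
file write `fh' "<p>alpha</p><p>beta</p>"
file close `fh'

* Read the whole file into a single strL observation.
clear
set obs 1
gen strL v = fileread("oneline.txt")

* fileread() reports failures inside the returned string, so check for that.
if substr(v[1], 1, 16) == "fileread() error" {
    display as error "could not read oneline.txt"
}
display strlen(v[1])
```

Because the result is stored as a strL, the approach is not limited by the fixed column range that the infix call has to specify.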

5. Keep the lines that contain the information to be extracted:

    keep if index(v, "</div></td>")
    drop if index(v, "</strong>") | v == "</div></td>"

6. Delete the tags (the characters inside angle brackets) and extract the required information:

    replace v = ustrregexra(v, "<.*?>", "")
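Steps 5 and 6 can be tried on a few made-up lines without downloading anything. The HTML fragments below are hypothetical and only mimic the shape of the real page source.

```stata
* Four hypothetical source lines.
clear
set obs 4
gen str80 v = ""
replace v = "<td><div>Zhang San</div></td>" in 1
replace v = "<td><div><strong>Name</strong></div></td>" in 2
replace v = "</div></td>" in 3
replace v = "<p>unrelated line</p>" in 4

* Step 5: keep lines carrying the marker, then drop header and bare-tag lines.
keep if index(v, "</div></td>")
drop if index(v, "</strong>") | v == "</div></td>"

* Step 6: the lazy pattern <.*?> removes each tag separately.
replace v = ustrregexra(v, "<.*?>", "")
list
```

The lazy quantifier matters here: a greedy <.*> would match from the first < to the last > on the line and delete the text between the tags as well.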

7. Arrange the extracted information:
(1) Use the post command
(2) Alternatively, use the following commands:

    forvalues j = 1/3 {
        gen v`j' = v[_n + `j']
    }
    keep if mod(_n, 4) == 1
    rename (v - v3) (姓名 职务 起始日期 终止日期)   // name, position, start date, end date
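Option (1) is only named on the slide, so here is one way the post approach could look. It is a sketch under assumptions, not the authors' code: the eight input values are made up, and the English variable names are chosen for illustration.

```stata
* Hypothetical cleaned lines: every four rows describe one executive.
clear
input str20 v
"Zhang San"
"Chairman"
"2010-01-01"
"2013-01-01"
"Li Si"
"Director"
"2011-05-01"
"--"
end

* Post each block of four rows as one observation of a new dataset.
tempname memhold
tempfile results
postfile `memhold' str20 name str20 position str20 start_date str20 end_date using `results'
forvalues i = 1(4)`=_N' {
    post `memhold' (v[`i']) (v[`i' + 1]) (v[`i' + 2]) (v[`i' + 3])
}
postclose `memhold'

use `results', clear
list
```

Compared with the forvalues approach, post writes the tidy records to a separate dataset, which can be convenient when results from many pages are collected in one loop.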


Scraping NBER paper information (single page):
1. Obtain the page links:
http://www.nber.org/papers/w20001
http://www.nber.org/papers/w20002
http://www.nber.org/papers/w20003
Only the trailing number differs between pages.

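Example 4: Scraping NBER paper information (single page):
2. Feasibility analysis: the title, authors, paper number, date, abstract, and link to be scraped all appear in the source code.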

3. Obtain the page source code:

    copy "http://www.nber.org/papers/w20001" temp.txt, replace

4. Read in the page source code:

    infix strL v 1-100000 using "temp.txt", clear

5. Keep the lines that contain the information to be extracted:

    keep if ustrregexm(v, `"(</h1>)|(</b><br>)|(</small></a></p>)"') | ///
        index(v[_n - 1], "</h1>") | ///
        index(v[_n - 1], `"<p style="margin-left: 40px; margin-right: 40px; text-align: justify">"')
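The condition combines two ideas: a line is kept if it matches one of the regex alternatives, or if the line before it carries a marker. A reduced offline sketch with hypothetical lines shows the effect; the paper number and text are made up.

```stata
* Five hypothetical source lines.
clear
set obs 5
gen str80 v = ""
replace v = "<h1>A Hypothetical Title</h1>" in 1
replace v = "Author One, Author Two" in 2
replace v = "<div>navigation</div>" in 3
replace v = "<b>NBER Working Paper No. 00000</b><br>" in 4
replace v = "<div>footer</div>" in 5

* Keep lines that match a marker themselves, or that follow an <h1> line.
keep if ustrregexm(v, "(</h1>)|(</b><br>)") | index(v[_n - 1], "</h1>")
list
```

The author line has no tag of its own, so it is only identifiable through v[_n - 1]; the whole condition is evaluated on the original row order before any row is dropped.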

6. Delete the tags (the characters inside angle brackets) and extract the required information:

    replace v = ustrregexra(v, `".+href="(.*?\.pdf)".+"', "$1")
    replace v = ustrregexra(v, "<.*?>", "")
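The first replacement relies on a capture group: the pattern spans the whole line, and "$1" puts back only the part captured inside the parentheses. A small sketch on two hypothetical lines:

```stata
* Two hypothetical lines: one with a PDF link, one with a title.
clear
set obs 2
gen str200 v = ""
replace v = `"<p><a href="/papers/w20001.pdf">PDF</a> available</p>"' in 1
replace v = "<h1>A Hypothetical Title</h1>" in 2

* Lines containing href="....pdf" are reduced to the captured link.
replace v = ustrregexra(v, `".+href="(.*?\.pdf)".+"', "$1")

* Remaining tags are stripped; the link line no longer contains any.
replace v = ustrregexra(v, "<.*?>", "")
list
```

The order of the two replacements matters: stripping the tags first would also remove the href attribute, and the link could no longer be recovered.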

7. Arrange the scraped information:

    sxpose, clear
    rename (_var1 - _var6) (Title Author NBER_No IssuedTime Abstract URL)

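1. The tool's strength shows mainly when each required item sits on its own line in the source code.
2. It cannot handle cases where the feature that identifies a line to keep is not on that line itself.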

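1. Replace the text given in from() with different content: multiple from() options paired with multiple to() options (new-version feature)
2. Extended regular expressions (PCRE)
3. Delete the matching lines (new-version feature)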