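xmllint + xpath = command-line web page information extraction
Yesterday I came across something called XPath, the thing @Anho and friends have been using to extract information from web pages. I think XPath is a great tool!
Then I remembered that libxml2 ships a command-line tool called xmllint. It turns out it has an --xpath option that evaluates XPath expressions and an --html option that parses HTML. After some experimenting, I arrived at the command below:
Look up 8.8.8.8 on 123cha and extract the result (ul#csstb):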
curl 'http://www.123cha.com/ip/?q=8.8.8.8' 2>/dev/null | xmllint --html --xpath "//ul[@id='csstb']" - 2>/dev/null | sed -e 's/<[^>]*>//g'
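Result:
[Your query]: 8.8.8.8
Primary site data:
United States
Secondary site data: Google Public DNS, provided by: hypo
United States, Google's free Google Public DNS, provided by: zwstar Reference data 1: United States
Reference data 2: United States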
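For comparison, the XPath selection and the tag-stripping sed step can be sketched in Python using only the standard library. This is a rough illustration rather than an equivalent of the pipeline: xml.etree.ElementTree only accepts well-formed markup (unlike xmllint --html), and the SAMPLE page and the extract_text helper are made up for the example.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical, well-formed stand-in for the downloaded page.
SAMPLE = """<html><body>
<ul id="csstb">
<li>[Query]: 8.8.8.8</li>
<li>Country: <b>US</b></li>
</ul>
<ul id="other"><li>ignored</li></ul>
</body></html>"""


def extract_text(page: str, element_id: str) -> str:
    """Select ul[@id=...] and strip all tags, like the xmllint | sed pipeline."""
    root = ET.fromstring(page)
    node = root.find(f".//ul[@id='{element_id}']")
    if node is None:
        raise LookupError(f"no <ul id={element_id!r}> in the page")
    markup = ET.tostring(node, encoding="unicode")
    # Same regular expression as the sed command: drop everything between < and >.
    return re.sub(r"<[^>]*>", "", markup).strip()


print(extract_text(SAMPLE, "csstb"))
```

As with the sed version, the regular expression is a blunt instrument: it is fine for simple listings like this one, but it does not decode entities or handle a literal > inside an attribute value.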
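What does this mean? It means that all kinds of simple web scraping tasks can be done in a shell script. If you used to reach for BeautifulSoup every time you needed something like this, you can switch.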
A command line tool to download and extract data from HTML/XML pages or JSON-APIs, using CSS, XPath 3.0, XQuery 3.0, JSONiq or pattern templates. It can also create new or transformed XML/HTML/JSON documents.
github.com/benibela/xidel
Print all urls found by a google search.
xidel http://www.google.de/search?q=test --extract "//a/extract(@href, 'url[?]q=([^&]+)&', 1)[. != '']"
Print the title of all pages found by a google search and download them:
xidel http://www.google.de/search?q=test --follow "//a/extract(@href, 'url[?]q=([^&]+)&', 1)[. != '']" --extract //title --download '{$host}/'
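The extract(@href, 'url[?]q=([^&]+)&', 1)[. != ''] part of both commands applies a regular expression to every link and keeps only the non-empty captures. A minimal Python sketch of that filtering step, assuming extract yields an empty string when nothing matches; the hrefs list is invented sample data, not real search output:

```python
import re


def extract(text: str, pattern: str, group: int = 0) -> str:
    """Return the requested capture group, or '' if the pattern does not match."""
    match = re.search(pattern, text)
    return match.group(group) if match else ""


# Hypothetical href values as they might appear on a result page.
hrefs = [
    "/url?q=http://example.org/&sa=U",
    "/preferences?hl=de",
    "/url?q=http://example.com/page&sa=U",
]

# Equivalent of: //a/extract(@href, 'url[?]q=([^&]+)&', 1)[. != '']
urls = [u for u in (extract(h, r"url[?]q=([^&]+)&", 1) for h in hrefs) if u != ""]
print(urls)
```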
Generally follow all links on a page and print the titles of the linked pages:
With XPath: xidel http://example.org -f //a -e //title
With CSS: xidel http://example.org -f "css('a')" --css title
With Templates: xidel http://example.org -f "<a>{.}</a>*" -e "<title>{.}</title>"
Another template example:
If you have an example.xml file like <x><foo>ood</foo><bar>IMPORTANT!</bar></x>
You can read the important part like: xidel example.xml -e "<x><foo>ood</foo><bar>{.}</bar></x>"
(and this will also check, if the element containing "ood" is there, and fail otherwise)
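The idea of a template that both validates the surrounding structure and captures one value can be sketched in Python as follows. This is only an approximation of the concept, not of Xidel's matching rules; the read_bar helper, the element names x, foo and bar, and the sample document are assumptions made for the illustration.

```python
import xml.etree.ElementTree as ET


def read_bar(xml_text: str) -> str:
    """Return the text of <bar>, but only if a <foo> containing 'ood' is present."""
    root = ET.fromstring(xml_text)
    foo = root.find("foo")
    bar = root.find("bar")
    # The "template" fails unless the expected elements are all there.
    if root.tag != "x" or foo is None or "ood" not in (foo.text or ""):
        raise ValueError("document does not match the pattern")
    if bar is None:
        raise ValueError("document has no <bar> element to read")
    return bar.text or ""


print(read_bar("<x><foo>ood</foo><bar>IMPORTANT!</bar></x>"))
```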
Calculate something with XPath using arbitrary precision arithmetic:
xidel -e "(1 + 2 + 3) * 1000000000000 + 4 + 5 + 6 + 7.000000008"
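To see why arbitrary precision matters for this expression, the same sum can be evaluated in Python once with binary floating point and once with the decimal module. This is a sketch for comparison only and says nothing about how Xidel implements its arithmetic.

```python
from decimal import Decimal

# Exact decimal arithmetic: the literal with a fraction is given as a string
# so that no binary rounding happens before the addition.
exact = (Decimal(1) + 2 + 3) * 1000000000000 + 4 + 5 + 6 + Decimal("7.000000008")

# The same expression with an ordinary float for the fractional literal.
approx = (1 + 2 + 3) * 1000000000000 + 4 + 5 + 6 + 7.000000008

print(exact)   # 6000000000022.000000008
print(approx)  # the fractional digits do not survive in a 64-bit float
```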
Print all newest Stackoverflow questions with title and url:
xidel http://stackoverflow.com -e "<A class='question-hyperlink'>{title:=text(), url:=@href}</A>*"
Print all reddit comments of a user, with HTML and URL:
xidel "http://www.reddit.com/user/username/" --extract "
Check if your reddit letter is red:
Webscraping, combining CSS, XPath, JSONiq and automatic form evaluation:
xidel http://reddit.com -f "form(css('form.login-form')[1], {'user': '$your_username', 'passwd': '$your_password'})" -e "css('#mail')/@title"
Using the Reddit API:
xidel -d "user=$your_username&passwd=$your_password&api_type=json" https://ssl.reddit.com/api/login --method GET 'http://www.reddit.com/api/me.json' -e '($json).data.has_mail'
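Use XQuery to create an HTML table of odd and even numbers:
Windows cmd: xidel --xquery "<table>{for $i in 1 to 1000 return <tr><td>{$i}</td><td>{if ($i mod 2 = 0) then 'even' else 'odd'}</td></tr>}</table>" --output-format xml
Linux/Powershell: xidel --xquery '<table>{for $i in 1 to 1000 return <tr><td>{$i}</td><td>{if ($i mod 2 = 0) then "even" else "odd"}</td></tr>}</table>' --output-format xml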
(Xidel itself supports ' and "-quotes on all platforms, but ' does not escape <> in Windows' cmd, and " does not escape $ in the Linux shells)
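For readers more used to a general-purpose language, the odd/even table query corresponds roughly to the following Python sketch: one row per number, with the parity decided by the remainder modulo 2. The function name parity_table and the small upper bound in the usage line are arbitrary choices for the illustration.

```python
def parity_table(last: int) -> str:
    """Build an HTML table with one row per number from 1 to last."""
    rows = "".join(
        f"<tr><td>{i}</td><td>{'even' if i % 2 == 0 else 'odd'}</td></tr>"
        for i in range(1, last + 1)
    )
    return f"<table>{rows}</table>"


print(parity_table(3))
```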
Export variables to bash
eval "$(xidel http://site -e 'title:=//title' -e 'links:=//a/@href' --output-format bash)"
This sets the bash variable $title to the title of the page and $links becomes an array of all links there.
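To make the idea concrete, here is a small Python sketch of what producing text that is safe to eval involves: single values become quoted assignments and sequences become bash arrays. The function name bash_assignment is hypothetical, and the exact text Xidel prints with --output-format bash may well differ.

```python
import shlex


def bash_assignment(name: str, value) -> str:
    """Render a variable assignment that bash can eval safely."""
    if isinstance(value, (list, tuple)):
        # Sequences become a bash array: name=(item1 item2 ...)
        items = " ".join(shlex.quote(str(item)) for item in value)
        return f"{name}=({items})"
    return f"{name}={shlex.quote(str(value))}"


print(bash_assignment("title", "Example Domain"))
print(bash_assignment("links", ["/a", "/b c"]))
```

Quoting every value is the important part: page titles and links are untrusted input, and an unquoted value passed to eval could run arbitrary commands.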
Reading JSON:
Read the 10th array element: xidel file.json -e '$json(10)'
Read all array elements: xidel file.json -e '$json()'
Read property "foo" and then "bar" with JSONiq notation: xidel file.json -e '$json("foo")("bar")'
Read property "foo" and then "bar" with dot notation: xidel file.json -e '($json).foo.bar'
Read property "foo" and then "bar" with XPath-like notation: xidel file.json -e '$json/foo/bar'
Mixed example: xidel file.json -e '$json("abc")()().xyz/(u,v)'
This would read all the numbers from e.g. {"abc": [[{"xyz": {"u": 1, "v": 2}}], [{"xyz": {"u": 3}}, {"xyz": {"u": 4}} ]]}.
All selectors are sequence-transparent, i.e. you can use the same selector to read something from one value as to read it from several values. Arrays are converted to sequences with ().
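The mixed selector and the notion of sequence transparency can be imitated in Python with two small helpers that always take and return a list. This is a sketch of the semantics as described above, not Xidel's implementation; prop, members, nth and read_numbers are hypothetical names.

```python
import json

SAMPLE = '{"abc": [[{"xyz": {"u": 1, "v": 2}}], [{"xyz": {"u": 3}}, {"xyz": {"u": 4}}]]}'


def prop(seq, name):
    """Look up a property on every object in the sequence, like .name or ("name")."""
    return [item[name] for item in seq if isinstance(item, dict) and name in item]


def members(seq):
    """Unbox every array in the sequence into its members, like ()."""
    return [member for item in seq if isinstance(item, list) for member in item]


def nth(array, n):
    """Read the n-th element counting from 1, as in $json(10)."""
    return array[n - 1]


def read_numbers(text):
    """Rough equivalent of $json("abc")()().xyz/(u,v)."""
    seq = [json.loads(text)]
    seq = prop(seq, "abc")
    seq = members(members(seq))
    seq = prop(seq, "xyz")
    return [obj[key] for obj in seq for key in ("u", "v") if key in obj]


print(read_numbers(SAMPLE))
```

Because every helper works on a list of values, the same chain of steps applies whether there is one object or several, which is the point of the sequence-transparent design.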
Convert table rows and columns to a CSV-like format:
xidel http://site -e '//tr / join(td, ",")'
join((...)) can generally be used to output some values in a single line. The function name is an abbreviation for the XPath function string-join. In the example tr / join calls join for every row.
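The per-row join can be sketched in Python like this, again with the standard library only. The SAMPLE table and the rows_as_csv name are made up for the example, and the input must be well-formed for ElementTree to parse it.

```python
import xml.etree.ElementTree as ET

# Hypothetical table standing in for the page at http://site.
SAMPLE = (
    "<table>"
    "<tr><td>1</td><td>2</td></tr>"
    "<tr><td>a</td><td><b>b</b></td></tr>"
    "</table>"
)


def rows_as_csv(markup: str):
    """Rough equivalent of //tr / join(td, ","): one comma-joined line per row."""
    root = ET.fromstring(markup)
    return [
        ",".join("".join(td.itertext()) for td in tr.findall("td"))
        for tr in root.iter("tr")
    ]


for line in rows_as_csv(SAMPLE):
    print(line)
```

Like the one-liner, this produces a CSV-like format only: cells that themselves contain commas or quotes are not escaped.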
Modify/Transform an HTML file, e.g. to mark all links as bold:
Windows cmd:
xidel --html your-file.html --xquery "transform(/, function($e) {
$e / if (name() = 'a') then
<a style='{join((@style, 'font-weight: bold'), '; ')}'>{@* except @style, node()}</a>
else .
})" > your-output-file.html
Linux/Powershell:
xidel --html your-file.html --xquery 'transform(/, function($e) {
$e / if (name() = "a") then
<a style="{join((@style, "font-weight: bold"), "; ")}">{@* except @style, node()}</a>
else .
})' > your-output-file.html
This example combines three important syntaxes:
transform(/, function($e) { .. }): This applies an anonymous function to every element in the HTML document, whereby that element is stored in variable $e and is replaced by the return value of the function.
{@* except @style, node()} : This creates a new a-element that has the same children, descendants and attributes as the current element, but removes the style-attribute.
style="{join((@style, "font-weight: bold"), "; ")}": This creates a new style-attribute by appending "font-weight: bold" to the old value of the attribute. A separating "; " is inserted, if (and only if) that attribute already existed.
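The style-merging logic is easy to get subtly wrong, so here is the same idea as a Python sketch. Unlike transform, which builds replacement elements, this version modifies the parsed tree in place; bold_links and the sample markup are assumptions made for the illustration.

```python
import xml.etree.ElementTree as ET


def bold_links(root):
    """Append 'font-weight: bold' to the style of every <a> element."""
    for link in root.iter("a"):
        # Joining only the non-empty parts inserts "; " if and only if
        # a style attribute already existed.
        parts = [part for part in (link.get("style"), "font-weight: bold") if part]
        link.set("style", "; ".join(parts))
    return root


doc = ET.fromstring(
    '<p><a href="x" style="color: red">one</a> <a href="y">two</a></p>'
)
bold_links(doc)
print([link.get("style") for link in doc.iter("a")])
```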
You may also want to read the readme file of Xidel, the complete list of available functions, the documentation of my template language and XPath/XQuery 3.0 library. Or look at its results on the XQuery Testsuite.
Downloads
The following Xidel downloads are available on the sourceforge download page:
Operating System          Filename                       Size
Windows: 32 Bit           xidel-0.9.6.win32.zip          801.8 kB
Universal Linux: 64 Bit   xidel-0.9.6.linux64.tar.gz     1.3 MB
Universal Linux: 32 Bit   xidel-0.9.6.linux32.tar.gz     852.8 kB
Source:                   xidel-0.9.6.src.tar.gz         1.9 MB
Debian: 64 Bit            xidel_0.9.6-1_amd64.deb        967.1 kB
Debian: 32 Bit            xidel_0.9.6-1_i386.deb         659.8 kB
Mac 10.8 externally prebuilt version and compile instructions.
Usually you can just extract the zip/deb and call Xidel, or copy it to some place in your PATH,
because it consists of a single binary without any external dependencies, except the standard system libraries (i.e. Windows API or respectively libc).
However, for https connections on Linux, openssl (including openssl-dev) and libcrypto are also required.
You can also test it online on a webpage or directly by sending a request to the cgi service like http://www.videlibri.de/cgi-bin/xidelcgi?data=
The source is stored in a mercurial repository together with the VideLibri source.
You can compile it with FreePascal and a regular expression library, preferably my copy of FLRE.
To compile it on the command line, call fpc xidel.pas and pass the paths to all directories using the -Fu option.
Alternatively, it can be compiled using Lazarus. For this install components/pascal/internettools.lpk and components/pascal/internettools_utf8.lpk in Lazarus, then open programs/internet/xidel/xidel.lpi and click on Run\Compile.
Pronunciation: To say the name "Xidel" in English, you say "excited" with a silent "C" and "D", followed by an "L". In German, you just say it as it is written.