Fetching an HTML Document from a Remote Server and Parsing Its Content (Preliminary)

    Technology  2022-05-19

    Core techniques:  Winsock communication over TCP; COM objects (specifically MSHTML)

     

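    Basic flow: communicate with the server over a TCP socket and use a "GET" command to retrieve the desired HTML document. Then use the interfaces exposed by the MSHTML object to analyze that document and extract the content of the relevant elements. PS: the socket is used in blocking mode.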
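    The network half of this flow can be sketched in portable C++. This is only an illustration under stated assumptions: the helper names build_get_request and read_until_closed are hypothetical, the request uses HTTP/1.0 so that the server closes the connection when the document has been sent, and recv_fn stands in for a call such as recv(sock, buf, len, 0) on a blocking socket. The loop matters because one blocking recv() call returns as soon as some data is available, not necessarily the whole document.

```cpp
#include <cstddef>
#include <string>

// Build a minimal HTTP/1.0 GET request (hypothetical helper).
// HTTP/1.0 is assumed so the server closes the connection after the response.
std::string build_get_request(const std::string& host, const std::string& path) {
    std::string req;
    req += "GET " + (path.empty() ? std::string("/") : path) + " HTTP/1.0\r\n";
    req += "Host: " + host + "\r\n";
    req += "\r\n";  // the empty line ends the header section
    return req;
}

// Call a recv-like function until it reports that the peer closed the
// connection (0) or that an error occurred (negative value).
// recv_fn(buf, len) is a stand-in for recv(sock, buf, len, 0).
template <typename RecvFn>
std::string read_until_closed(RecvFn recv_fn) {
    std::string data;
    char buf[4096];
    for (;;) {
        int n = recv_fn(buf, static_cast<int>(sizeof buf));
        if (n <= 0) break;  // 0: connection closed, <0: error
        data.append(buf, static_cast<std::size_t>(n));
    }
    return data;
}
```

    With Winsock, recv_fn could be a lambda such as [&](char* b, int n) { return recv(sock, b, n, 0); }, where sock is the connected SOCKET.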

     

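    Pseudocode:   WSAStartup()  (initialize Winsock)

                                                        ||

               gethostbyname()  (resolve the host name to the server's IP address; the port is normally the default, 80)

                                                        ||

                   socket()  (create the SOCKET object used for the connection)

                                                        ||

                    connect()     (connect to the server through the socket)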

                                                        ||

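               Build the "GET" request  (format: GET http://127.0.0.1/1.html followed by \r\n; an HTTP/1.0 request also carries the version and ends with an empty line, e.g. GET /1.html HTTP/1.0\r\n\r\n)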

                                                        ||

                                      send()    (send the request to the server)

                                                        ||

                              recv()   (receive the requested document into a data buffer; keep calling it until it returns 0)

                                                        ||

                        shutdown(); closesocket(); WSACleanup() (close the connection and release the socket resources)

                                                        ||

             CoInitialize()  (initialize the COM library in preparation for using the COM object MSHTML; this call can go anywhere before the COM object is first used)

                                                        ||

                          MSHTML::IHTMLDocument2Ptr pDoc;
                          MSHTML::IHTMLDocument3Ptr pDoc3;
                          MSHTML::IHTMLElementCollectionPtr pCollection;
                          MSHTML::IHTMLElementPtr pElement;

                        CoCreateInstance()           (create the MSHTML object and return its Document interface pointer)

                      SafeArrayCreateVector()     (copy the buffered document data into a SafeArray and write it into the document)

                                                         ||

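                       pDoc3 = pDoc;  pCollection = pDoc3->getElementsByTagName(L"P")  (get the collection of all "P" elements in the document; getElementsByTagName belongs to IHTMLDocument3, so the call goes through pDoc3)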

                                                         ||

                                  Then iterate over pCollection and pull out what is needed

                        (a few key interfaces:      pElement = pCollection->item(i, (long)0);

                                                             pElement->getAttribute("id", 2);

                                                             BSTR tmp;
                                                             pElement->get_innerHTML(&tmp);)

                    Also note: the strings used by COM are all Unicode-based, so convert them where necessary:

                                  _com_util::ConvertBSTRToString()   (converts a Unicode BSTR to an ANSI char* string)
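    The same wide-to-narrow conversion can be illustrated without COM. The sketch below is a portable analog, not the implementation of _com_util::ConvertBSTRToString: the function name narrow is hypothetical, it takes a std::wstring instead of a BSTR, and it relies on wcstombs() accepting a null destination to report the required size, which POSIX and the Microsoft C runtime both document. The result depends on the current C locale; in the default "C" locale only ASCII text is guaranteed to convert.

```cpp
#include <cstddef>
#include <cstdlib>
#include <string>

// Convert a wide string to a narrow (multi-byte) string using the current
// C locale. Returns an empty string if a character cannot be converted.
std::string narrow(const std::wstring& wide) {
    // First pass: ask how many bytes the converted text needs.
    std::size_t needed = std::wcstombs(nullptr, wide.c_str(), 0);
    if (needed == static_cast<std::size_t>(-1)) return std::string();

    // Second pass: convert into a buffer of exactly that size.
    std::string out(needed, '\0');
    std::wcstombs(&out[0], wide.c_str(), needed);
    return out;
}
```

    In the MSHTML code itself, the char* returned by _com_util::ConvertBSTRToString() is allocated for the caller, who is expected to free it with delete[].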

     
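    One detail the flow above leaves implicit: the buffer filled by recv() holds the HTTP status line and headers as well as the HTML, so the headers are usually cut off before the data is handed to MSHTML. A minimal sketch, assuming the whole response is already in memory; HttpResponse and split_http_response are hypothetical names, not part of Winsock or MSHTML.

```cpp
#include <cstddef>
#include <cstdlib>
#include <string>

struct HttpResponse {
    int status = 0;       // e.g. 200; stays 0 if no status line was found
    std::string headers;  // status line and header lines, without the blank line
    std::string body;     // the HTML document itself
};

// Split a raw HTTP response at the first empty line ("\r\n\r\n").
HttpResponse split_http_response(const std::string& raw) {
    HttpResponse r;
    const std::string sep = "\r\n\r\n";
    std::size_t cut = raw.find(sep);
    if (cut == std::string::npos) {  // no header section: treat it all as body
        r.body = raw;
        return r;
    }
    r.headers = raw.substr(0, cut);
    r.body = raw.substr(cut + sep.size());

    // The status line looks like "HTTP/1.0 200 OK": the code follows the
    // first space, and atoi stops at the first non-digit character.
    std::size_t sp = r.headers.find(' ');
    if (sp != std::string::npos) {
        r.status = std::atoi(r.headers.c_str() + sp + 1);
    }
    return r;
}
```

    Only body would then be copied into the SafeArray, and checking status first avoids parsing an error page as if it were the requested document.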

    This article comes from a blog; when reposting, please cite the source: http://blog.csdn.net/aganpro/archive/2004/07/19/45204.aspx

