Rencently, my two teammates and I is doing a project, a simplified Chinese search engine for children(in primary school). We call it "kidsearch".

Since our project will be based on Baidu search engine. I'd like to have a simple analysis of Baidu search engine.

First, Baidu is not for children to use totally. Baidu, as a commercial company, provides the public a free service of searching. It is natural that not all the contents shown on the search engine are what people need. Some of them are shown because of benefits and some other factors.Perhaps it doesn't have a great impact on adults who can distinguish the contents of good or bad. But the impact will be obvious when it comes to children. For example,we can search these keys on Baidu : "波"(notice its pictures),"交换群"(notice its results),"医院"(notice its advertisements). And these are some normal words. Don't mention the results of some even worse key words. These results of searching not just inappropriate, some of them even harmful. So, the situation has to be fixed, which is also the purpose of our project "kidsearch".

Actually, seaching on the Internet for children is easier to that for adults. So the problem is also simplified. We can just use Baidu as a tool(not exagerated), rearrange the result, fix the inproper or useless entries, and add some contents suitable for children. The search engine will be really better for children after we do some fix on it.

So, what are the contents appropriate to children?

Based on the thoughts above, I concluded the requirements of children, which are what children may need.(Perhaps it doesn't cover all at present and we will perfect it in the future)

1.Notion -- encyclopedia

2.Material -- picture, music, video

3.Entertainment -- game

4.Study -- homework, knowledge

Moreover, there are some kinds of content that children don't need:

1.advertisement

2.adult(mature) content

3.sexual or homosexual content

4.sidebar(ad. or adult content or useless for children mostly)

Now that we have known what children need, what we should do next is to tackle them one by one.

What the technology we will use?

After tried many approaches, such as PHP, Java, Python, etc. I decided to use Python to do this job because it's really convenient to do the crawl job. Although it is a bit more difficult to make webpages than PHP, it doesn't matter too much.

Besides, there are huge amount of extended library to use with Python, such as requests, flask, django, jieba, etc. I have tried all of them preliminarily.

More details will be illustrated later. And our aim is to create a search engine which children can use and like to use.

[0.0]Analysis of Baidu search engine的更多相关文章

  1. 开源搜索 Iveely Search Engine 0.6.0 发布 -- 黎明前的娇嫩

    快两年了,Iveely Search Engine已经走过了5个版本的岁月,虽出生“贫寒”,没有任何开源基金会的支持,没有优秀的“干爹.干妈”,它凭着它的爱好者的支持,0.6.0终于破壳而出,7年前, ...

  2. Iveely Search Engine 0.4.0 的发布

    千呼万唤始出来,Iveely Search Engine 0.4.0 的发布   经过无数个夜晚的奋战,以及无数个夜晚的失眠,Iveely Search Engine 0.4.0 终于熬出来了,这其中 ...

  3. dotnet cli 5.0 新特性——dotnet tool search

    dotnet cli 5.0 新特性--dotnet tool search Intro .NET 5.0 SDK 的发布,给 dotnet cli 引入了一个新的特性,dotnet tool sea ...

  4. 微软的一篇ctr预估的论文:Web-Scale Bayesian Click-Through Rate Prediction for Sponsored Search Advertising in Microsoft’s Bing Search Engine。

    周末看了一下这篇论文,觉得挺难的,后来想想是ICML的论文,也就明白为什么了. 先简单记录下来,以后会继续添加内容. 主要参考了论文Web-Scale Bayesian Click-Through R ...

  5. Search Engine Hacking – Manual and Automation

    Search Engine Hacking – Manual and Automation Ethical Hacking Boot Camp OUR MOST POPULAR COURSE! CLI ...

  6. Known BREAKING CHANGES from NH3.3.3.GA to 4.0.0

    Build 4.0.0.Alpha1 =============================   ** Known BREAKING CHANGES from NH3.3.3.GA to 4.0. ...

  7. Empirical Analysis of Beam Search Performance Degradation in Neural Sequence Models

    Empirical Analysis of Beam Search Performance Degradation in Neural Sequence Models  2019-06-13 10:2 ...

  8. 未能加载文件或程序集“Microsoft.SqlServer.Management.Sdk.Sfc, Version=11.0.0.0, Culture=neutral, PublicKeyToken...

    刚开始看老师 用VS新建一个“ADO.NET 实体数据模型” 但是一直报错:未能加载文件或程序集“Microsoft.SqlServer.Management.Sdk.Sfc, Version=11. ...

  9. 42 Bing Search Engine Hacks

    42 Bing Search Engine Hacks November 13, 2010 By Ivan Remember Bing, the search engine Microsoft lau ...

随机推荐

  1. [反汇编练习] 160个CrackMe之009

    [反汇编练习] 160个CrackMe之009. 本系列文章的目的是从一个没有任何经验的新手的角度(其实就是我自己),一步步尝试将160个CrackMe全部破解,如果可以,通过任何方式写出一个类似于注 ...

  2. HDU 1863 畅通工程(最小生成树,prim)

    题意: 给出图的边和点数,要求最小生成树的代价,注:有些点之间是不可达的,也就是可能有多个连通图.比如4个点,2条边:1-2,3-4. 思路: 如果不能连通所有的点,就输出‘?’.之前以为每个点只要有 ...

  3. datatable 的ajax修改参数,post可以传参处理

          datatables常用参数记录 {                "searchable": false,                "orderabl ...

  4. django - django 承接nginx请求

    # -*- coding: utf-8 -*- import os import sys import tornado.ioloop import tornado.web import tornado ...

  5. VirtualBox的工作原理&参考网上文章

    事先申明,我这里有好多东西都是看网上的,文末给出参考博客链接. 1.在设置里面为什么要选择桥接网络?baidu之后,了解到是虚拟机工作原理的不同,也就是说有好几种工作模式. bridged(桥接模式) ...

  6. cbitmap 获取RGB CBitMap的用法

    MFC提供了位图处理的基础类CBitmap,可以完成位图(bmp图像)的创建.图像数据的获取等功能.虽然功能比较少,但是在对位图进行一些简单的处理时,CBitmap类还是可以胜任的.很多人可能会采用一 ...

  7. Linux下GPIO驱动

    编写驱动程序,首先要了解是什么类型的设备.linux下的设备分为三类,分别为:字符设备,块设备和网络设备.字符设备类型是根据是否以字符流为数据的交换方式,大部分设备都是字符设备,如键盘,串口等,块设备 ...

  8. Linux iostat监测IO状态

    Linux iostat监测IO状态 http://www.orczhou.com/index.php/2010/03/iostat-detail/

  9. HDU 5679 Substring 后缀数组判重

    题意:求母串中有多少不同的包含x字符的子串 分析:(首先奉上FZU官方题解) 上面那个题就是SPOJ694 ,其实这两个题一样,原理每次从小到大扫后缀sa数组,加上新的当前后缀的若干前缀,再减去重复的 ...

  10. redo文件一

    redo log files and redo log buffer redo log files的作用的是确保数据库崩溃之后能正确的恢复数据库,恢复数据库到一,致性的状态 redo log file ...