November 01, 20
Slide overview
Trustworthiness analysis of web search results
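Meiji University, School of Interdisciplinary Mathematical Sciences, Department of Frontier Media Science, Satoshi Nakamura Laboratory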
ECDL 2007, Budapest, Hungary Trustworthiness Analysis of Web Search Results Satoshi Nakamura, Shinji Konishi, Adam Jatowt, Hiroaki Ohshima, Hiroyuki Kondo, Taro Tezuka, Satoshi Oyama, Katsumi Tanaka Kyoto University [email protected]
Background • People use Web search engines to obtain information and knowledge in daily life • Trustworthiness of Web search results has become crucial – There are many commercial pages, phishing sites, spam weblogs, pages containing viruses, … – Widespread use of SEO (Search Engine Optimization)
Background: SEO Problems • How to earn a lot of money with Google AdSense? – The page is ranked high in Web search through SEO – The page has little or no content • How can users detect such a site before visiting it?
Which search result is trustworthy? • Before visiting a page, users cannot judge whether each Web search result – is in the majority or the minority on the Web – contains the typical topics of the query on the Web – is supported uniformly throughout the world • The system should provide such additional information in the Web search results
Goal of our work • Find what kind of additional information helps users trust a Web search result (figure labels: Official site, Government, Wikipedia)
Objective of our work • Survey of Web search engine users – Which factors cause search engine users to trust Web search results • Enhance Web search with additional information based on the survey results – Early prototyping of our system
Survey of Web search engine users • Questionnaire – 26 questions about Web search • Situation, motivation, user's trust level, search ranking, additional information, future search, … – Date: 2006/12/25 - 2006/12/26 – Subjects: 1000 Internet users (Japanese), by age group and gender:
Age    Male  Female
20-29  125   125
30-39  125   125
40-49  125   125
50-59  125   125
Situations when users search the Web • Browsing the Web and doing research are the major situations • People often search the Web without a particular reason
Reasons for searching the Web • To obtain detailed information or an explanation of the query • Making comparisons is the third major reason
How many results do users check? • More than 50% of users check only the top five search results • Only about 20% of users actually go further than the top five search results • Top-5 business model, SEO problem – Low-ranked Web pages are ignored by more than 50% of users
Ranking algorithm? • 18% of users believe that money paid to search engines is the main factor influencing the ranking of Web search results
Users' trust level of search results • 56.7% of users trust Web search results • 10.4% of users don't trust Web search results
Characteristics of pages that users trust • Owner (author) information is important • Positive factors (for trusting): author or owner of the information, relevance to the search query, creation date • Negative factors (for not trusting): spelling errors, grammatical mistakes, biased information, uniqueness
Failure by believing search results • 12.3% of users had a bad experience after believing a search result – 3.5% of users accessed adult content, pages containing viruses, or phishing sites – 5.2% of users had trouble in the real world because they believed Web search results that included old information, mistakes, biased content, and so on • E.g., the restaurant was closed, the food was tasteless, …
About additional information • What additional information should be provided? – Content date – Related words – Information about page author or owner – Score reflecting trustworthiness of page – Page type – Thumbnail image of page – Third-party evaluations • People require various kinds of additional information
Future search that people want • Additional information (48.1%) • Context-aware search • Domain-focused search (45.7%) • Clustering • Automatic analysis of trust level
Prototype System • The purpose of our system is not to determine the trustworthiness of content by itself • Our system provides supplementary information that helps users judge trustworthiness
Additional Information • Topic majority (43.4% of respondents) – The number of pages similar to the search result that exist in the WWW or in the set of pages related to the query • Topic coverage (63.2% of respondents) – The number of topics in the search result page • Locality of link sources – Is the page supported from a wide area or a small area? • Other information – Topic details (72.6% of respondents) – Publisher information (85.1% of respondents) – Number of social bookmarks (38.3% of respondents) – Last-modified date (61.1% of respondents)
Topic majority in the Web When the user inputs Q as a query … • Wikipedia by Q: terms A, B (with Q) • Search results by Q: Page 1 (A, X, Q), Page 2 (A, Q, C), Page 3 (B, Z, Q) • DF(A&B&X&Q) = 100 hits < DF(A&B&C&Q) = 500 hits >> DF(A&B&Z&Q) = 20 hits
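The comparison on this slide can be sketched in a few lines. This is only an illustration under assumptions, not the prototype's actual code: the function name `topic_majority`, the `df` callback, and the lookup table `EXAMPLE_DF` are hypothetical, and the hit counts are simply the three example values from the slide. A real system would obtain DF from a search engine's hit count.

```python
from typing import Callable, Dict, FrozenSet, Iterable

def topic_majority(
    query: str,
    wiki_terms: Iterable[str],
    page_terms: Dict[str, Iterable[str]],
    df: Callable[[FrozenSet[str]], int],
) -> Dict[str, int]:
    """For each page, return DF(wiki terms & page-specific terms & query)."""
    wiki = frozenset(wiki_terms)
    scores = {}
    for page, terms in page_terms.items():
        # Terms the page contributes beyond the query and the Wikipedia terms.
        extra = frozenset(terms) - wiki - {query}
        scores[page] = df(wiki | extra | {query})
    return scores

# Hypothetical DF table holding only the slide's three example hit counts.
EXAMPLE_DF = {
    frozenset("ABXQ"): 100,
    frozenset("ABCQ"): 500,
    frozenset("ABZQ"): 20,
}

scores = topic_majority(
    query="Q",
    wiki_terms=["A", "B"],
    page_terms={
        "Page 1": ["A", "X", "Q"],
        "Page 2": ["A", "Q", "C"],
        "Page 3": ["B", "Z", "Q"],
    },
    df=lambda terms: EXAMPLE_DF.get(terms, 0),  # stand-in for a hit-count query
)
majority_page = max(scores, key=scores.get)
```

With the slide's numbers, Page 2 has the largest document frequency and so represents the majority view, while Page 3 is the minority.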
Topic coverage When the user inputs Q as a query … • Wikipedia by Q: terms A, B, C, D, E, F (with Q) • Search results by Q: Page 1 (A, B, X, Y): 50%; Page 2 (F, Q, Y, B, W, X): 17%; Page 3 (A, Z, Q, E, C, F, D, B): 100%
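One plausible reading of topic coverage is the fraction of the Wikipedia-derived topic terms that also appear in a result page. The sketch below assumes that reading; the function name `topic_coverage` and the sample pages are hypothetical and are not taken from the slide's figure.

```python
from typing import Iterable

def topic_coverage(page_terms: Iterable[str], topic_terms: Iterable[str]) -> float:
    """Fraction of topic terms (e.g. from Wikipedia) found in the page."""
    topics = set(topic_terms)
    if not topics:
        return 0.0  # no reference topics, so coverage is undefined; report 0
    return len(topics & set(page_terms)) / len(topics)

# Six hypothetical topic terms related to the query.
topics = ["A", "B", "C", "D", "E", "F"]

# Hypothetical pages; X, Y, Z are terms outside the topic set.
half = topic_coverage(["A", "B", "C", "X"], topics)            # 3 of 6 topics
one_sixth = topic_coverage(["F", "X", "Y"], topics)            # 1 of 6 topics
full = topic_coverage(["A", "B", "C", "D", "E", "F", "Z"], topics)
```

Terms outside the topic set do not lower the score under this definition; only the share of topic terms present matters.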
Locality of supporting pages
Locality of supporting pages • Locality of supporting pages (L) – p, p_i: Web pages – d(p, p_i): distance between p and p_i – n: number of linked pages • Process of obtaining geographical coordinates – System obtains linked URLs with the link: operator – System converts URLs to IP addresses via DNS – System obtains geographical coordinates from the IP address using GeoLite City by MaxMind
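The coordinate pipeline above can be sketched as follows. This is a rough illustration under assumptions, not the authors' implementation: it computes only the distances d(p, p_i) and does not attempt the formula for L. The `locate` callback is a hypothetical stand-in for a GeoLite City lookup, great-circle (haversine) distance is an assumed choice for d, and the linking URLs are passed in rather than fetched with the link: operator. The demo uses a fake resolver and fake coordinates so that it runs offline.

```python
import math
import socket
from typing import Callable, List, Tuple
from urllib.parse import urlparse

EARTH_RADIUS_KM = 6371.0  # mean Earth radius

def haversine_km(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance in km between two (lat, lon) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(min(1.0, math.sqrt(a)))

def resolve_ip(host: str) -> str:
    """URL host -> IP address via DNS (needs network access)."""
    return socket.gethostbyname(host)

def supporter_distances(
    page_url: str,
    linking_urls: List[str],
    locate: Callable[[str], Tuple[float, float]],  # hypothetical IP -> (lat, lon)
    resolve: Callable[[str], str] = resolve_ip,
) -> List[float]:
    """Return d(p, p_i) in km for every linking page p_i."""
    lat0, lon0 = locate(resolve(urlparse(page_url).hostname))
    distances = []
    for url in linking_urls:
        lat, lon = locate(resolve(urlparse(url).hostname))
        distances.append(haversine_km(lat0, lon0, lat, lon))
    return distances

# Offline demo with made-up hosts, documentation-range IPs, and coordinates.
FAKE_DNS = {"a.example": "192.0.2.1", "b.example": "192.0.2.2"}
FAKE_GEO = {"192.0.2.1": (0.0, 0.0), "192.0.2.2": (0.0, 180.0)}

demo_distances = supporter_distances(
    "http://a.example/",
    ["http://b.example/page", "http://a.example/other"],
    locate=FAKE_GEO.__getitem__,
    resolve=FAKE_DNS.__getitem__,
)
```

Injecting `resolve` and `locate` keeps the distance logic testable without DNS or a geolocation database; in a live setting they would be replaced by real lookups.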
Examples of locality of supporting pages • Google search engine: – L = 2.939 (http://www.google.com) • Government of South Africa: – L = 2.427 (http://www.gov.za) • Government of Australia: – L = 2.792 (http://www.australia.gov.au) • Alachua County Today (local news site in Florida): – L = 42.240 (http://www.alachuatoday.com)
Screenshot of prototype system Coverage Majority
Displaying locality of supporting pages Powered by GoogleMap
Performance of our system • The average processing time for the top 10 pages of each query is 7.2 seconds, and that for the top 50 pages is 28 seconds • Time analysis for locality support • We plan to implement an Ajax-based system that processes the additional information and shows it sequentially
Is Wikipedia a trustworthy site? • Problem of using Wikipedia – Students at Middlebury College used Wikipedia to write a history report • About the Shimabara Rebellion – Wikipedia's text sometimes includes mistakes
Conclusion • Surveyed Web search engine users – We understood the way they search the Web – How they determine the trustworthiness of search results – Additional information is required • Enhanced Web search by displaying additional information based on the survey results – Topic majority, topic coverage, locality of supporting pages, other information – The supporting information that our system provides must be computed in real time when users execute queries
Future work • Plan to run an experimental evaluation of the additional information in Web search results • Plan to survey Web 2.0 and Search 2.0 • Plan to improve the algorithms that calculate topic majority, topic coverage, and so on
Future work • Integrate this work with our lab's other work – SBRank (Social Bookmark Rank) • Use the number of social bookmarks to calculate the majority or minority [Yanbe et al, JCDL2007, ICWE2007] – Journey to the past • Time analysis using the Internet Archive [Jatowt et al, ACM HyperText 2006] – Honto? search • Obtain aggregate knowledge from the search results [Yamamoto et al, APWeb2007]
Köszönöm szépen!! Thank you!! • Please check our paper or contact us if you are interested in our work Satoshi Nakamura Kyoto University [email protected] http://calendar2.org/ http://webox.biz/